The goal was to achieve Speed-of-Light software efficiency by building ultra-low-latency primitives.
Six months of 14-hour workdays of AI-assisted R&D, followed by two months of an unsuccessful fundraising attempt to build a venture-scale company around these ideas. The work will be cleaned up and open-sourced:
- Rust shared-memory IPC (1M msgs/sec at p99.9 ≈ 7µs with 1 KB payloads) that speeds up prefill in my vllm.rs by 30% (vllm.rs, not upstream vLLM).
- Reaches 80% of Speed-of-Light (the electrical-signal limit of the CPU cache) on an AMD Ryzen 7 5800X. Works on all CPUs and UNIX systems. This is coordinated-omission-aware latency measurement, not raw throughput (a measurement sketch follows below). More details TBA.
- Implemented in Rust with Python bindings; among the world's fastest (if not the fastest) IPC libraries.
- The library supports ping-pong and single-producer, multi-consumer broadcast semantics (a sketch of the broadcast slot protocol also follows below). It beats C++ Boost, Open MPI, and ZeroMQ.
- Built in 2 months with assistance from claude-code & codex-cli.
- GH repo WIP/TBA
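
To make the coordinated-omission point concrete, here is a minimal sketch of CO-aware latency measurement: latency is charged against each request's *intended* slot on a fixed schedule, so a stall that delays later sends still shows up in the tail. `do_request` is a hypothetical stand-in for one operation under test; this is not the library's actual benchmark harness.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the operation under test
// (e.g. one IPC round trip over the shared-memory ring).
fn do_request() {
    std::hint::black_box(0u64);
}

fn main() {
    let target_rate = 1_000_000u64; // messages per second
    let interval = Duration::from_nanos(1_000_000_000 / target_rate);
    let start = Instant::now();
    let n = 1_000_000usize;
    let mut latencies = Vec::with_capacity(n);

    for i in 0..n {
        // The intended start time comes from the schedule, not from when
        // we actually got around to issuing the request; charging latency
        // against it is what makes the measurement CO-aware.
        let intended = start + interval * i as u32;
        while Instant::now() < intended {} // busy-wait for the slot
        do_request();
        latencies.push(intended.elapsed());
    }

    latencies.sort();
    println!("p99.9 = {:?}", latencies[(n as f64 * 0.999) as usize]);
}
```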
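And a minimal sketch of single-producer, multi-consumer broadcast semantics, assuming a seqlock-style slot (odd sequence = write in progress, readers retry on a torn snapshot). It is shown over in-process memory for brevity; in a shared-memory IPC the same layout would live in an shm_open/mmap region so separate processes can attach. This is an illustrative design, not the library's actual API.

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct Slot {
    seq: AtomicU64,          // even = stable, odd = write in progress
    payload: [AtomicU64; 4], // fixed-size message body
}

impl Slot {
    fn new() -> Self {
        Slot {
            seq: AtomicU64::new(0),
            payload: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }
}

// Producer: mark the slot dirty, write the payload, publish a new even
// version. The Acquire RMW keeps the payload stores from being
// reordered before the dirty mark (the crossbeam seqlock scheme).
fn publish(slot: &Slot, msg: [u64; 4]) {
    let s = slot.seq.fetch_add(1, Ordering::Acquire); // even -> odd
    for (lane, v) in slot.payload.iter().zip(msg) {
        lane.store(v, Ordering::Relaxed);
    }
    slot.seq.store(s + 2, Ordering::Release); // stable, next version
}

// Consumer: optimistic read, then re-check the sequence; a torn
// snapshot (the writer raced us) is simply retried.
fn snapshot(slot: &Slot) -> (u64, [u64; 4]) {
    loop {
        let s1 = slot.seq.load(Ordering::Acquire);
        if s1 & 1 == 1 {
            continue; // writer mid-update
        }
        let msg = std::array::from_fn(|i| slot.payload[i].load(Ordering::Relaxed));
        fence(Ordering::Acquire);
        if slot.seq.load(Ordering::Relaxed) == s1 {
            return (s1, msg); // consistent snapshot of one version
        }
    }
}

fn main() {
    let slot = Arc::new(Slot::new());
    let consumers: Vec<_> = (0..2)
        .map(|_| {
            let slot = Arc::clone(&slot);
            thread::spawn(move || loop {
                let (ver, msg) = snapshot(&slot);
                assert!(msg.iter().all(|&v| v == msg[0])); // never torn
                if ver >= 2_000 {
                    break; // saw the final broadcast
                }
            })
        })
        .collect();
    for i in 1..=1_000u64 {
        publish(&slot, [i; 4]); // each publish bumps seq by 2
    }
    for c in consumers {
        c.join().unwrap();
    }
}
```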
- p2p all_reduce_sum: 80-85% of Speed-of-Light in both low-latency and bandwidth-saturation modes, with over 50x more stable tail latency.
- When integrated into vllm.rs, it improves LLM inference by 20% in prefill and 10% in decode.
- Heuristic-based tuning across data sizes and types from 1 KB to 16 GB, with very high compute/memory overlap.
- GitHub repo & blog post with extensive profiling to come. 100% built with AI agents in 3 weeks. LLMs like to cheat with GPU kernels.
- I did this exercise to see how HFT and distributed-systems tricks can be applied to GPU kernels without knowing CUDA syntax. Understanding the hardware architecture, rigorous profiling, and tests were enough. Optimizing p2p all_reduce_sum, which accounts for over 50% of inter-GPU comms in tensor parallelism, felt like the right test bed (a CPU sketch of the ring algorithm follows below).
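
For reference, here is a CPU sketch of the classic bandwidth-optimal ring all_reduce_sum (reduce-scatter followed by all-gather), shown over threads and channels purely to illustrate the data movement. The actual project runs this peer-to-peer across GPUs; this textbook schedule is not its kernel.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    const P: usize = 4;     // "ranks" (threads standing in for GPUs)
    const CHUNK: usize = 3; // elements per chunk; buffer = P * CHUNK

    // Rank r sends to rank (r + 1) % P and receives on its own channel.
    let (txs, rxs): (Vec<_>, Vec<_>) =
        (0..P).map(|_| mpsc::channel::<Vec<f32>>()).unzip();
    let mut rxs: Vec<_> = rxs.into_iter().map(Some).collect();

    let mut handles = Vec::new();
    for r in 0..P {
        let tx = txs[(r + 1) % P].clone();
        let rx = rxs[r].take().unwrap();
        handles.push(thread::spawn(move || {
            // Each rank starts with its own contribution.
            let mut buf = vec![(r + 1) as f32; P * CHUNK];

            // Phase 1: reduce-scatter. After P-1 steps, chunk
            // (r + 1) % P on rank r holds the full sum.
            for step in 0..P - 1 {
                let send_idx = (r + P - step) % P;
                let recv_idx = (r + P - step - 1) % P;
                tx.send(buf[send_idx * CHUNK..(send_idx + 1) * CHUNK].to_vec())
                    .unwrap();
                let incoming = rx.recv().unwrap();
                for (dst, src) in buf[recv_idx * CHUNK..(recv_idx + 1) * CHUNK]
                    .iter_mut()
                    .zip(incoming)
                {
                    *dst += src; // accumulate the neighbor's partial sum
                }
            }
            // Phase 2: all-gather. Circulate the fully reduced chunks.
            for step in 0..P - 1 {
                let send_idx = (r + 1 + P - step) % P;
                let recv_idx = (r + P - step) % P;
                tx.send(buf[send_idx * CHUNK..(send_idx + 1) * CHUNK].to_vec())
                    .unwrap();
                let incoming = rx.recv().unwrap();
                buf[recv_idx * CHUNK..(recv_idx + 1) * CHUNK].copy_from_slice(&incoming);
            }
            // Every rank now holds the elementwise sum 1+2+..+P (10.0 here).
            assert!(buf.iter().all(|&v| v == (P * (P + 1) / 2) as f32));
        }));
    }
    drop(txs);
    for h in handles {
        h.join().unwrap();
    }
}
```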
- TCP, HTTP, io_uring, zero-copy, mmap, WAL, extents. 32 MB max payload limit in v0.0.1. Written in Rust with an optimized client (a minimal WAL append sketch follows after this list).
- Single-process setup with 1 MB payloads: 1200 reads/s (800 ns warm-cache, 1.5 µs cold-cache) and 250 writes/s.
- GH repo WIP/TBA
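
To illustrate the WAL piece, here is a minimal sketch of an append path with length-prefixed, checksummed records flushed before acknowledgement; the real server layers io_uring, zero-copy reads, mmap, and extent allocation on top. FNV-1a stands in for a real CRC just to keep the example std-only, and the record layout is an assumption, not the project's on-disk format.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};

// FNV-1a 64-bit hash: a stand-in checksum so the sketch needs no crates.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100_0000_01b3);
    }
    h
}

fn wal_append(wal: &mut std::fs::File, payload: &[u8]) -> io::Result<()> {
    // Record layout: [len: u32 LE][checksum: u64 LE][payload].
    // 32 MB payload cap, mirroring the v0.0.1 limit.
    assert!(payload.len() <= 32 * 1024 * 1024);
    wal.write_all(&(payload.len() as u32).to_le_bytes())?;
    wal.write_all(&fnv1a(payload).to_le_bytes())?;
    wal.write_all(payload)?;
    // Durability point: acknowledge only after the flush, so recovery
    // can replay every acknowledged record.
    wal.sync_data()
}

fn main() -> io::Result<()> {
    let mut wal = OpenOptions::new().create(true).append(true).open("store.wal")?;
    wal_append(&mut wal, b"key=alpha value=1")?;
    wal_append(&mut wal, b"key=alpha value=2")?; // later record wins on replay
    Ok(())
}
```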