@Venkat2811
Last active April 9, 2026 14:00
Stealth R&D for Distributed AI Systems.

Intro

The goal was to achieve Speed-of-Light software efficiency by building ultra-low-latency primitives.

6 months of 14-hour workdays of AI-assisted R&D, followed by 2 months of unsuccessful fundraising attempting to build a venture-scale company around these ideas. The work will be cleaned up and open-sourced:


Myelon – Zero-Copy SHM-Based IPC Library – Inference Engine <> GPU worker process

  • Rust shared-memory IPC (1M msgs/sec at p99.9 ≈ 7 µs with 1 KB messages) that speeds up prefill in my vllm.rs by 30% (not vLLM).
  • 80% of Speed-of-Light (the CPU-cache electrical-signal limit) on an AMD Ryzen 7 5800X. Works on all CPUs & UNIX systems. This is a coordinated-omission-aware latency measurement, not raw throughput. More Details
  • Implemented in Rust with Python bindings; among the world's fastest (if not the fastest) libraries of its kind.
  • Library with ping-pong and single-producer, multi-consumer broadcast semantics. Beats C++ Boost, Open MPI, and ZeroMQ.
  • Built with assistance from claude-code & codex-cli; it took 2 months.
  • GH repo WIP/TBA

YALI – P2P CUDA Library That Outperforms NVIDIA NCCL by 1.2x to 2.4x on 2x A100 & above.

  • p2p all_reduce_sum: 80–85% of Speed-of-Light in both low-latency and bandwidth-saturation modes, with over 50x more stable tail latency.
  • When integrated into vllm.rs, it improves LLM inference by 20% on prefill and 10% on decode.
  • Heuristic-based tuning for data sizes and types from 1 KB to 16 GB, with very high compute and memory overlap.
  • GitHub Repo & Blog Post & Extensive Profiling. 100% built with AI agents in 3 weeks. LLMs like to cheat with GPU kernels.
  • I did this exercise to see how HFT and distributed-systems tricks can be applied to GPU kernels without knowing CUDA syntax; understanding the hardware architecture, rigorous profiling, and tests were enough. Optimizing p2p all_reduce_sum, which accounts for over 50% of inter-GPU communication in tensor parallelism (TP), felt like the right test bed.

DRAKE – Distributed Reusable Activations & KV Cache Engine – Inspired by turbopuffer’s 10x wedge

  • TCP, HTTP, io_uring, zero-copy, mmap, WAL, extents. 32 MB max payload limit in v0.0.1. Written in Rust with an optimized client.
  • Single-process setup, 1 MB payload: 1200 read req/s (warm-cache reads at 800 ns, cold-cache reads at 1.5 µs), 250 write req/s.
  • GH repo WIP/TBA