@Venkat2811
Last active April 9, 2026 14:00
Stealth R&D for Distributed AI Systems.

Intro

The goal was to achieve Speed-of-Light software efficiency by building ultra-low-latency primitives.

6 months of 14-hour workdays of AI-assisted R&D, followed by 2 months of unsuccessful fundraising attempting to build a venture-scale company around these ideas. The work will be cleaned up and open-sourced:


Myelon – Zero-Copy SHM-Based IPC Library – Inference Engine <> GPU worker process

  • Rust shared-memory IPC (1M msgs/sec at p99.9 ≈ 7 µs with 1 KB messages) that speeds up prefill in my vllm.rs by 30% (not vLLM).
  • 80% of Speed-of-Light (the CPU-cache electrical-signal limit) on an AMD Ryzen 7 5800X. Works on all CPUs & UNIX systems. This is a coordinated-omission-aware latency measurement, not raw throughput. More Details
  • Implemented in Rust with Python bindings; among the world's fastest (if not the fastest) libraries of its kind.
  • Library with ping-pong and single-producer, multi-consumer broadcast semantics. Beats C++ Boost, Open MPI, and ZeroMQ.
  • Built with assistance from claude-code & codex-cli; it took 2 months.
  • GH repo WIP/TBA

YALI – P2P CUDA Library That Outperforms NVIDIA NCCL by 1.2x to 2.4x on 2x A100 & above.

  • p2p all_reduce_sum: 80–85% of Speed-of-Light in both low-latency and bandwidth-saturation modes, with over 50x more stable tail latency.
  • When integrated into vllm.rs, it improves LLM inference by 20% on prefill and 10% on decode.
  • Heuristic-based tuning for data sizes and types from 1 KB to 16 GB, with very high compute and memory overlap.
  • GitHub Repo & Blog Post & Extensive Profiling. 100% built with AI agents in 3 weeks. LLMs like to cheat with GPU kernels.
  • I did this exercise to see how HFT and distributed-systems tricks can be applied to GPU kernels without knowing CUDA syntax; understanding the hardware architecture, rigorous profiling, and tests were enough. Optimizing p2p all_reduce_sum, which accounts for over 50% of inter-GPU communication in tensor parallelism (TP), felt like the right test bed.

DRAKE – Distributed Reusable Activations & KV Cache Engine – Inspired by turbopuffer’s 10x wedge

  • TCP, HTTP, io_uring, zero-copy, mmap, WAL, extents. 32 MB max payload limit in v0.0.1. Written in Rust with an optimized client.
  • Single-process setup, 1 MB payload: 1200 read req/s (warm-cache reads at 800 ns, cold-cache reads at 1.5 µs), 250 write req/s.
  • GH repo WIP/TBA