TL;DR: We're building an LLM that can generate efficient CUDA kernels, and we're doing it in public. Today, models like ChatGPT are terrible at systems programming: they don't seem to understand how GPUs work and they frequently hallucinate. However, projects like llm.c, where a smart human is in the loop with an LLM, have shown that this should be possible. There's a lot we need to innovate on: how we create more kernel tokens, what the right abstractions for LLMs are, and how to scale test-time compute. And given how hard this is, we want to do everything in public on Discord. We will share infra, loss curves, and chat messages there, and include as many people as possible so we can actually crack this problem.
We're a distributed research effort, so we mostly chat async in the #popcorn channel on discord.gg/gpumode
If you prefer longer-form content, you can check out https://drive.google.com/drive/folders/1nt2KcRRKb8YdySxkRxUu5PR4c7UPM_rK
- We release the first downloadable checkpoints of a kernel-generation LLM that is SOTA on public benchmarks
- We publish a library and VS Code extension that make it easy for developers to try out the bot, reaching at least 10 weekly active users
- We publish the largest collection of Triton and CUDA kernels on the internet and, as a baseline, finetune a Llama 70B on this dataset to outcompete ChatGPT (the Triton sketch after this list shows the kind of kernel we mean)
- We produce an evaluation set of kernels
- We kickstart a public kernel-authoring competition and leaderboard in collaboration with Stanford's KernelBench, and get the competition accepted at NeurIPS 2025
- We raise funding for at least 100 GPUs for 6 months
- We create Discord-based leaderboard infra that is used by at least 10 active users per week
- We publish loss curves for our large-scale training run that update in real time
- We ship a baseline “really good prompt that explains how GPUs work”
- We measure the efficacy of scaling test-time compute via repeated sampling, performance checklists, and profiler feedback (a minimal sketch follows this list)
- We complete a training run on 100 GPUs, with publicly shared loss curves and “high” GPU utilization
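For a concrete sense of what the dataset would contain, here is a minimal Triton kernel of the kind we mean: a masked vector add. This is an illustrative sample we wrote for this post, not an excerpt from the actual dataset.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```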
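To make the test-time-compute milestone concrete, below is a minimal sketch of best-of-N sampling with profiler-style feedback: generate N candidate kernels, discard the incorrect ones, and keep the fastest under a benchmark. `triton.testing.do_bench` is Triton's real benchmarking helper; `generate_kernel`, `check_correctness`, and `make_callable` are hypothetical hooks standing in for whatever model API and harness we end up shipping.

```python
import triton.testing

def best_of_n(generate_kernel, check_correctness, make_callable, n_samples=16):
    """Best-of-N test-time scaling: sample N kernels, benchmark the correct ones.

    generate_kernel()      -> str          hypothetical: one sampled kernel source
    check_correctness(src) -> bool         hypothetical: compares against a reference
    make_callable(src)     -> zero-arg fn  hypothetical: compiles source, binds inputs
    """
    best_ms, best_src = float("inf"), None
    for _ in range(n_samples):
        src = generate_kernel()            # one sample from the LLM
        if not check_correctness(src):     # reject wrong kernels outright
            continue
        fn = make_callable(src)
        ms = triton.testing.do_bench(fn)   # wall-clock feedback signal
        if ms < best_ms:
            best_ms, best_src = ms, src
    return best_src, best_ms
```

The same loop extends naturally to richer feedback, e.g. feeding profiler output back into the prompt for a revision pass.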
Language workstream

- We establish a baseline finding that ThunderKittens produces better kernels than raw CUDA, i.e. that some abstractions are "good"
- If abstractions do turn out to be good, we 3x the number of contributions to ThunderKittens and integrate it into popular compilers like torch.compile