TL;DR: We're building an LLM that can generate efficient CUDA kernels, and we're doing it in public. Today, models like ChatGPT are terrible at systems programming: they don't seem to understand how GPUs work and they frequently hallucinate. However, projects like llm.c, where a smart human is in the loop with an LLM, have shown that this should be possible. There's a lot we need to innovate on: how we create more kernel tokens, what the right abstractions for LLMs are, and how to scale test-time compute. And given how hard this is, we want to do everything in public on Discord. We will share infra, loss curves, and chat messages there, and include as many people as possible so we can actually crack this problem.
We're a distributed research effort, so we mostly chat async in the #popcorn channel on discord.gg/gpumode
If you prefer longer-form content, you can check out https://drive.google.com/drive/folders/1nt2KcRRKb8YdySxkRxUu5PR4c7UPM_rK
- We release the first downloadable checkpoints of a kernel-generation LLM that is SOTA on public benchmarks
- We publish a library and VS Code extension that make it easy for developers to try out the bot, reaching at least 10 weekly active users
- We publish the largest collection of Triton and CUDA kernels on the internet and, as a baseline, finetune a Llama 70B on this dataset to outcompete ChatGPT (the Triton sketch after this list shows the kind of kernel we mean)
- We produce an evaluation set of kernels
- We kickstart a public kernel-authoring competition and leaderboard in collaboration with Stanford's KernelBench, and get the competition accepted at NeurIPS 2025
- We raise funding for at least 100 GPUs for 6 months
- We create Discord-based leaderboard infra that is used by at least 10 active users per week
- We publish loss curves for our large-scale training run that update in real time
- We ship a baseline “really good prompt that explains how GPUs work”
- We measure the efficacy of scaling test-time compute via repeated sampling, performance checklists, and profiler feedback (a minimal sketch follows this list)
- We complete a training run on 100 GPUs, with publicly shared loss curves and “high” GPU utilization
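For a concrete sense of what the dataset would contain, here is a minimal Triton kernel of the kind we mean: a masked vector add. This is an illustrative sample we wrote for this post, not an excerpt from the actual dataset.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```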
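To make the test-time-compute milestone concrete, below is a minimal sketch of best-of-N sampling with profiler-style feedback: generate N candidate kernels, discard the incorrect ones, and keep the fastest under a benchmark. `triton.testing.do_bench` is Triton's real benchmarking helper; `generate_kernel`, `check_correctness`, and `make_callable` are hypothetical hooks standing in for whatever model API and harness we end up shipping.

```python
import triton.testing

def best_of_n(generate_kernel, check_correctness, make_callable, n_samples=16):
    """Best-of-N test-time scaling: sample N kernels, benchmark the correct ones.

    generate_kernel()      -> str          hypothetical: one sampled kernel source
    check_correctness(src) -> bool         hypothetical: compares against a reference
    make_callable(src)     -> zero-arg fn  hypothetical: compiles source, binds inputs
    """
    best_ms, best_src = float("inf"), None
    for _ in range(n_samples):
        src = generate_kernel()            # one sample from the LLM
        if not check_correctness(src):     # reject wrong kernels outright
            continue
        fn = make_callable(src)
        ms = triton.testing.do_bench(fn)   # wall-clock feedback signal
        if ms < best_ms:
            best_ms, best_src = ms, src
    return best_src, best_ms
```

The same loop extends naturally to richer feedback, e.g. feeding profiler output back into the prompt for a revision pass.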
Language workstream

- We establish a baseline finding that ThunderKittens produces better kernels than raw CUDA, i.e. that some abstractions are "good"
- If abstractions do turn out to be good, we 3x the number of contributions to ThunderKittens and integrate it into popular compilers like torch.compile