I spent my recharge more like a sabbatical to get my AI skills up to date, so I wanted to share what I’ve been working on.
My plan was to spend the first two weeks building projects that explored how to work with AI, then spend the next two weeks building my own AI model. Things mostly went according to plan.
TL;DR this is my toy LLM: https://giftofgab.chat/
Carrot.Code is my vibe-coding environment. It started as a research project to understand how an agentic harness works in practice: tool calling, context compaction, and related features. The project was inspired by Pi and Lite.
All font rendering is done with the recently public-domained Slug algorithm. The editor has been tested with GPT-OSS-20B, Qwen3-Coder-Next 80B, and Qwen3 235B, all running through LM Studio.
I used this project to practice long-running, unsupervised agent work. Claude ran uninterrupted for about six hours to build the first working version, followed by roughly four hours of guided edits. The workflow used sub-agents and /goal to drive the work:
- Create a
plan.mdfile that outlines what is going to be built - Launch 3 research agents to figure out everything
plan.mdneeds (UI, font rendering, text editing) - Ask claude to turn
plan.mdand theResearchfolder intotodo.md, a checklist of tasks - Use
/goalfeature in claude code to drive work until everything intodo.mdis checked off. - Let it bake for 6 hours
I think the biggest contributor to the project’s success was asking the model to create Playwright unit tests for every major step, use Playwright to take screenshots, and visually inspect those screenshots. The visual test acted as a gate after each major TODO section. This was inspired by an OpenAI blog post about building web games with Codex, and it worked just as well with Claude.
This one was a swing and a miss. I wanted to create a simple 3D model editor. I followed the same flow as Carrot.Code, but the outcome was not as good. I think the added complexity of 3D editing threw the model off. After the agent ran for about five hours, the first iteration of the editor simply did not work.
- The UI didn't work.
- Some mesh operations did not work.
- The mesh operations that did work where not exposed to the broken UI.
- The model wrote fake unit tests (it just returned true with a comment that the unit test was a placeholder).
I tried to compensate by breaking the project into smaller, more achievable sub-projects, then hooking them together afterward. The idea was to build the UI and mesh editing components independently, verify that they worked, and then combine them.
For the UI, I had Claude port GWEN to JavaScript, along with its own Playwright unit tests. That worked really well.
Then I had Claude build the mesh editing core, which also worked. Once both pieces were done, I started a run to hook up the editor. That went pretty well, and you can find the first working iteration here:
With a few more days of guided feedback, this could become a real 3D editor. But I had run out of time and needed to move on to the next project. I might come back to finish it later.
The goal of this project was to learn how to build a minimal but functional causal transformer stack, entirely in JavaScript. The core of the project is only about 800 lines of code, but it is surprisingly capable. It uses a small autograd engine with only SGD training (no Adam optimizer).
I then designed a roughly 250K-parameter model (about 230K), that could be trained to do real tasks. I built a small 500 MiB dataset from a few thousand games of Clue and trained the model on it. It can guess the murderer with 100% accuracy. :D
There is a vibe-coded front end that let's you configure the prompt visually and have the AI solve a small murder mistery
- demo: https://gszauer.github.io/Clue
- source: https://github.com/gszauer/Clue
- trainind data: https://huggingface.co/datasets/gszauer/Clue250K
This is the project i'm most proud of, i designed and trained a 100M LLM that's able to kind of have a conversation, and think. It has the gift of gab if you will.
This project brings the work I did for Clue 250K full circle by extending it into a real LLM. The architecture moves from something resembling GPT-2 to something closer to Llama.
The main differences from Clue 250K:
- The vocabulary was expanded to 10K tokens, still using BPE.
- Replaced learned positional encoding with RoPE.
- Single-headed attention was extended to multi-headed attention.
- Switched the activation function from GeLU to ReLU.
- Trained on 2B tokens
The model can speak coherent English, but that is about it. The training data formats model responses in a ChatML style conversation format, including thinking blocks. You can ask it things like "What game should I play?" or "How do I make pasta?"
Almost all of the model’s shortcomings come from its narrow training data. For example, there are no conversations that start with a simple “Hello,” so just saying “Hi” confuses the model.
I trained this model on a Mac Pro without gradient checkpointing. It took about 150 GiB of ram, 96 hours to pre-train and 16 hours to fine-tune.
- Demo: https://giftofgab.chat/
- Model: https://huggingface.co/gszauer/Gab100M
- Pretrain Data: https://huggingface.co/datasets/gszauer/Gab100MPretrain
- Finetune Data: https://huggingface.co/datasets/gszauer/Gab100MFinetune
I learned a lot about the kind of data that goes into an LLM. Learning how to implement the architecture was great, but the model’s capabilities are ultimately tied to the data. I think the lack of summarization tasks in the training data hurt the models quality.