Oliver Mannion tekumara

I get asked pretty regularly what my opinion is on merge commits vs rebasing vs squashing. I've typed up this response so many times that I've decided to just put it in a gist so I can reference it whenever it comes up again.

I use merge, squash, and rebase situationally. I believe they all have their merits, but which one fits depends on the context. I think anyone who says any particular strategy is the right answer 100% of the time is wrong, but I think there is considerable acceptable leeway in when you use each. What follows is my personal and professional opinion:

Parallel Python within the same process or hacking around the cursed GIL with a hand-rolled library loader

From its obscure beginnings in Amsterdam, the Python programming language has become a fundamental building block of our digital society. It is used literally everywhere and by everyone for a mind-bogglingly wide variety of tasks.

Python is also the lingua franca of Data Science, tying together tools for data loading, wrangling, analysis and AI. There is a massive ecosystem of contributed Python packages, which - for example - allows reading every obscure data format under the sun. This makes Python and its ecosystem extremely valuable for analytical data management systems: Users are likely somewhat familiar with Python due to its immense popularity and the ecosystem provides solutions for most data problems. As a result, Python is being integrated into SQL systems, typically through so-called User-Defined Functions (UDFs). For example, [Apach
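To make the UDF idea concrete, here is a minimal sketch that registers a Python function with a SQL engine and calls it from a query. It uses the standard library's `sqlite3` module purely as a stand-in; the choice of SQLite, the `title_case` function, and the `people` table are assumptions for illustration and are not taken from the text above.

```python
import sqlite3


def title_case(value):
    """Hypothetical scalar UDF: title-case a string, passing SQL NULL through."""
    if value is None:
        return None
    return value.title()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Register the Python function under a SQL name; 1 is the number of arguments.
conn.create_function("title_case", 1, title_case)

conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("ada lovelace",), (None,), ("guido van rossum",)],
)

# The SQL engine calls back into Python once per row.
rows = conn.execute(
    "SELECT title_case(name) FROM people ORDER BY rowid"
).fetchall()
print(rows)
```

The point of the sketch is the per-row callback: the query engine drives execution and hands each value to the Python interpreter, which is where the cost of crossing between the two runtimes shows up.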

Anti-hype LLM reading list

Goals: Add links that are reasonable and good explanations of how stuff works. No hype and no vendor content if possible. Practical first-hand accounts of models in prod eagerly sought.

Foundational Concepts

Pre-Transformer Models

Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and followup large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology, "instruction fine tuning", learning to imitate human written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case of RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

Notes on Forma

Recently @dragostis at Google released an experimental vector graphics renderer called Forma.

The renderer has a pretty cool set of goals: portability, performance, simplicity, and size. Graphics and GPU computation models are a topic that I'm pretty interested in learning more about personally, and this project seems like an especially accessible / well-written codebase to learn from.

I'm very happy to see this work! The era of rendering vector graphics in GPU compute shaders is upon us, and I have no doubt we'll start seeing these in production soon, as there's just such a performance advantage over CPU rendering, and I believe trying to run vector 2D graphics through the GPU rasterization pipeline doesn't quite work.

_This code is simpler than Vello (the new name for piet-gpu), focused on vector path rendering. It's also a strong demo of the power of WebGPU, while also having a performant software-only pipe

Zig Type System

Zig aims to be a simple language. It is not easy to define exactly what simple means, but Zig is also a low-level programming language that aims for C compatibility. To reach this goal, it needs good semantics in its type system so that developers have a complete toolbox to manipulate data.

So types in Zig are composable, but this can rapidly become overwhelming. Consider these examples. Can you understand them at a glance?

*const ?u8
?*const u8
*const [2]u8

ChatGPT Resources

Context

ChatGPT appeared like an explosion on all my social media timelines in early December 2022. While I keep up with machine learning as an industry, I wasn't focused so much on this particular corner, and all the screenshots seemed like they came out of nowhere. What was this model? How did the chat prompting work? What was the context of OpenAI doing this work and collecting my prompts for training data?

I decided to do a quick investigation. Here's all the information I've found so far. I'm aggregating and synthesizing it as I go, so it's currently changing pretty frequently.

	from uuid import UUID

	import orjson
	import pendulum
	import rich.console
	from websockets.client import connect
	from websockets.exceptions import ConnectionClosedError

	from prefect.cli import root
	from prefect.cli._types import PrefectTyper


	/*
	Usage: you'll want to search for the strings <bucket> and <prefix>, and insert the S3 bucket where your access
	logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs.
	*/


	/*
	These commented out configuration settings you can either run yourself in the REPL and source this file using
	`.read parse_s3_access_logs.sql`, or you can uncomment them and supply values for yourself.

	mp=1; size=7B; # to run 7B
	mp=8; size=65B; # to run 65B
	for seed in $(randint 1000000)
	do
	export TARGET_FOLDER=~/ml/data/llama/LLaMA
	time python3 -m torch.distributed.run --nproc_per_node $mp example.py --ckpt_dir $TARGET_FOLDER/$size --tokenizer_path $TARGET_FOLDER/tokenizer.model --seed $seed --max_seq_len 2048 --max_gen_len 2048 --count 0 | tee -a ${size}_startrek.txt
	done