Goals: Add links that give reasonable, clear explanations of how stuff works. No hype, and no vendor content where possible. Practical first-hand accounts of models in prod are eagerly sought.
- The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning (YouTube)
- Transformers as Support Vector Machines
- Survey of LLMs
- Deep Learning Systems
- Fundamental ML Reading List
- What are embeddings
- Concepts from Operating Systems that Found their way into LLMs
- Talking about Large Language Models
- Language Modeling is Compression
- Vector Search - Long-Term Memory in AI
- Eight things to know about large language models
- The Bitter Lesson
- The Hardware Lottery
- The Scaling Hypothesis
- Tokenization
- LLM Course
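
To make the tokenization link above a bit more concrete, here is a toy sketch of the core byte-pair-encoding idea (repeatedly merge the most frequent adjacent pair of symbols). It is purely illustrative; real tokenizers add byte-level handling, pre-tokenization rules, and trained merge tables.

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE on a single 'corpus' word: repeatedly merge the most frequent adjacent pair."""
    symbols = list(word)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # rebuild the symbol sequence with the chosen pair merged
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

print(bpe_merges("banananana", 3))  # (['banan', 'anan', 'a'], ['an', 'anan', 'banan'])
```
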
- Seq2Seq
- Attention is all you Need
- BERT
- GPT-1
- Scaling Laws for Neural Language Models - a back-of-the-envelope compute sketch follows this group of links
- T5
- GPT-2: Language Models are Unsupervised Multi-Task Learners
- InstructGPT: Training Language Models to Follow Instructions
- GPT-3: Language Models are Few-Shot Learners
- Transformers from Scratch
- Transformer Math
- Five Years of GPT Progress
- Lost in the Middle: How Language Models Use Long Contexts
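
One back-of-the-envelope number that the scaling-laws and transformer-math links above rely on is the approximation that dense transformer training costs roughly 6 FLOPs per parameter per token (C ≈ 6ND). A rough sketch, with made-up hardware numbers purely for illustration:

```python
def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token
    (~2ND for the forward pass, ~4ND for the backward pass)."""
    return 6 * n_params * n_tokens

# Example: a 7e9-parameter model trained on 1e12 tokens.
flops = approx_training_flops(7e9, 1e12)
print(f"~{flops:.2e} FLOPs")  # ~4.2e22 FLOPs

# Very rough wall-clock estimate on hypothetical hardware sustaining 100 TFLOP/s per GPU;
# ignores efficiency losses, data loading, restarts, etc.
sustained_flops_per_gpu = 100e12
n_gpus = 256
seconds = flops / (sustained_flops_per_gpu * n_gpus)
print(f"~{seconds / 86400:.1f} days on {n_gpus} GPUs")
```
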
- Self-attention and transformer networks
- Attention
- Understanding and Coding the Attention Mechanism
- Attention Mechanisms
- Keys, Queries, and Values
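
To ground the attention links above, here is a minimal NumPy sketch of single-head scaled dot-product attention: project the input to queries, keys, and values, score each query against every key, softmax, and take a weighted sum of values. There is no masking, batching, or multi-head logic, so treat it as a toy illustration rather than any library's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a sequence x of shape (seq, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # similarity of every query with every key
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v                           # weighted sum of values

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```
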
- What is ChatGPT doing and why does it work
- My own notes from a few months back.
- Karpathy's The State of GPT (YouTube)
- OpenAI Cookbook
- Catching up on the weird world of LLMs
- How open are open architectures?
- Building an LLM from Scratch
- Large Language Models in 2023 and Slides
- Timeline of Transformer Models
- Large Language Model Evolutionary Tree
- Why host your own LLM?
- How to train your own LLMs
- Hugging Face Resources on Training Your Own
- Training Compute-Optimal Large Language Models
- Opt-175B Logbook
- RLHF
- Instruction-tuning for LLMs: Survey
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- RLHF and DPO Compared
- The Complete Guide to LLM Fine-tuning
- LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language - Really great overview of SOTA fine-tuning techniques
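
As a companion to the DPO links above: DPO drops the explicit reward model and instead scores each preference pair by log-probability ratios against a frozen reference policy. Below is a minimal NumPy sketch of the per-example loss, assuming you already have summed token log-probs for the chosen and rejected responses; it is a toy illustration, not a training recipe.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are total (summed over tokens) log-probs of each response under
    the policy being trained and under the frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

# Example: the policy already prefers the chosen response slightly more than the reference does.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```
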
- On the Structural Pruning of Large Language Models
- Quantization - a toy int8 quantization sketch follows this group of links
- PEFT
- How is LlamaCPP Possible?
- How to beat GPT-4 with a 13B Model
- Efficient LLM Inference on CPUs
- Tiny Language Models Come of Age
- Efficiency LLM Spectrum
- TinyML at MIT
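
A toy sketch to go with the quantization link above: symmetric per-tensor int8 weight quantization with a single scale, plus dequantization to check the error. Real schemes (group-wise scales, 4-bit formats, outlier handling, activation quantization) are considerably more involved.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats to int8 with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
print("memory: float32", w.nbytes, "bytes vs int8", q.nbytes, "bytes (plus one scale)")
```
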
- Building LLM Applications for Production
- Challenges and Applications of Large Language Models
- All the Hard Stuff Nobody talks about when building products with LLMs
- Scaling Kubernetes to run ChatGPT
- Numbers every LLM Developer should know
- Against LLM Maximalism
- A Guide to Inference and Performance
- (InThe)WildChat: 570K ChatGPT Interaction Logs In The Wild
- The State of Production LLMs in 2023
- Machine Learning Engineering for successful training of large language models and multi-modal models.
- Fine-tuning RedPajama on Slack Data
- LLM Inference Performance Engineering: Best Practices
- How to Make LLMs go Fast
- Transformer Inference Arithmetic
- Which serving technology to use for LLMs?
- Speeding up the K-V cache
- Large Transformer Model Inference Optimization
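
In the spirit of the inference-arithmetic and KV-cache links above, a small back-of-the-envelope sketch: the KV cache stores one key and one value vector per layer per token, so it grows linearly with batch size and context length. The model dimensions below are made up for illustration, not taken from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Memory for the key/value cache: 2 tensors (K and V) per layer per token,
    each of size n_kv_heads * head_dim, stored here in fp16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Example: a hypothetical 32-layer model with 32 KV heads of dim 128,
# serving a batch of 8 sequences at 4096 tokens of context.
gib = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=4096, batch_size=8) / 2**30
print(f"~{gib:.1f} GiB of KV cache")  # ~16 GiB
```
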
- On Prompt Engineering
- Prompt Engineering Versus Blind Prompting
- Building RAG-Based Applications for Production
- Full Fine-Tuning, PEFT, or RAG?
- Prompt Engineering Guide
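
To make the RAG links above concrete, a minimal retrieve-then-prompt sketch: embed the documents and the query, pick the closest document by cosine similarity, and paste it into the prompt. The bag-of-words `embed` function here is a deliberately crude stand-in for a real embedding model and vector index.

```python
import numpy as np

def embed(text, vocab):
    """Crude bag-of-words embedding, normalized to unit length; a stand-in for a real model."""
    words = text.lower().split()
    v = np.array([words.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

docs = [
    "The KV cache stores keys and values so decoding avoids recomputing attention",
    "LoRA adds low-rank adapter matrices so fine-tuning touches few parameters",
    "Quantization stores weights in fewer bits to shrink memory use",
]
query = "how does the kv cache speed up decoding"

vocab = sorted({w for d in docs + [query] for w in d.lower().split()})
doc_vecs = np.stack([embed(d, vocab) for d in docs])
scores = doc_vecs @ embed(query, vocab)          # cosine similarity (vectors are unit-norm)
best = docs[int(np.argmax(scores))]

prompt = f"Answer using only this context:\n{best}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to whatever LLM you are using
```
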
- The Best GPUs for Deep Learning 2023
- Making Deep Learning Go Brr from First Principles
- Everything about Distributed Training and Efficient Finetuning
- Training LLMs at Scale with AMD MI250 GPUs
- GPU Programming
- Evaluating ChatGPT
- ChatGPT: Jack of All Trades, Master of None
- What's Going on with the Open LLM Leaderboard
- Challenges in Evaluating AI Systems
- LLM Evaluation Papers
- Evaluating LLMs is a Minefield
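
In the spirit of the evaluation links above, a minimal sketch of an exact-match evaluation loop. The `model` callable is hypothetical, and real evals (as these posts stress) need careful prompt templating, answer normalization, and statistical care.

```python
def exact_match_accuracy(model, dataset):
    """dataset: list of (prompt, expected_answer) pairs; model: callable prompt -> text."""
    def normalize(s):
        return " ".join(s.lower().strip().split())
    hits = sum(normalize(model(prompt)) == normalize(answer) for prompt, answer in dataset)
    return hits / len(dataset)

# Toy usage with a stub "model" that always answers "4".
dataset = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
print(exact_match_accuracy(lambda prompt: "4", dataset))  # 0.5
```
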
- Generative Interfaces Beyond Chat (YouTube)
- Why Chatbots are not the Future
- The Future of Search is Boutique
- As a Large Language Model, I
- Natural Language is an Unnatural Interface
Thanks to everyone who added suggestions on Twitter, Mastodon, and Bluesky.
It seems that the gzip approach, although really cool, was 'optimistic' and thus overhyped; see https://kenschutte.com/gzip-knn-paper/ (basically, they conflated the k in k-NN with top-k accuracy, so they were effectively reporting top-2 accuracy). More recent studies found that it performs, as expected, at about the level of a bag-of-words baseline: Gzip versus bag-of-words for text classification.
I don't know if you intend to (or are even interested), but I am on the lookout for "use cases for normies".