- Implement a basic decoder-only transformer from scratch with attention mechanisms
- Add positional encoding to a transformer model (relative vs. rotary; see the RoPE sketch after this list)
- Implement causal/masked attention (i.e. `similarities.masked_fill(mask=mask, value=float("-inf"))`; see the attention sketch below)
- Build/use a BPE tokenizer and vocabulary system
- Implement beam search and sampling strategies for text generation (sampling sketch below)
- Explain/add multi-head attention (also covered in the attention sketch below)
- Different attention patterns (sliding window, linear, local, sparse)
- Explain model parallelism elements
- Add gradient checkpointing to reduce memory usage (sketch below)
- Add mixed precision training support (see the training-step sketch below)
- Explain/add model quantization (dynamic quantization sketch below)
- Explain/add model distillation
- Explain/add model pruning
- Implement efficient inference strategies (e.g. KV caching; sketch below)
- Distributed training (e.g. DeepSpeed)
- Add custom learning rate schedulers (see the training-step sketch below)
- Add gradient clipping and normalization (see the training-step sketch below)
- The many ways to optimize data loading pipelines (e.g. streaming vs. batching, lowering I/O, etc.)
- Checkpoint saving and loading with verification (sketch below)
- Logging and monitoring in production
- Model serving and versioning endpoints
- Distributed model serving and low-latency pipelines
- Model A/B testing
- Data drift detection
- SafeAI?
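
A minimal sketch of rotary position embeddings (RoPE), referenced from the positional-encoding item above. It applies the "rotate-half" formulation to a `(batch, heads, seq, head_dim)` query or key tensor; the function name and the `base` default are illustrative, not a reference implementation.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor (illustrative)."""
    _, _, seq_len, dim = x.shape
    assert dim % 2 == 0, "head_dim must be even for RoPE"
    half = dim // 2
    # One rotation frequency per pair of channels, one angle per (position, frequency).
    freqs = base ** (-torch.arange(half, device=x.device).float() / half)
    angles = torch.arange(seq_len, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()  # both (seq, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Apply it to the per-head queries and keys (not the values) right before the attention scores are computed.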
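
For the causal-masking and multi-head attention items, a minimal PyTorch sketch that uses the exact `masked_fill` pattern quoted in the list; the module name, the `max_len` buffer, and the lack of dropout are simplifications under illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Minimal multi-head self-attention with a causal mask (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 1024):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # True above the diagonal: each position may attend only to itself and the past.
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the model dimension into heads: (B, n_heads, T, d_head).
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        similarities = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        similarities = similarities.masked_fill(mask=self.mask[:T, :T], value=float("-inf"))
        attn = similarities.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```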
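
For the generation-strategies item, a sketch of temperature + top-k + top-p (nucleus) sampling over a `(batch, vocab)` logits tensor; beam search is left out to keep the example short, and all defaults are illustrative.

```python
import torch

@torch.no_grad()
def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    """Sample one token id per row from (batch, vocab) logits (illustrative)."""
    logits = logits / max(temperature, 1e-8)
    # Top-k: drop everything below the k-th largest logit.
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): keep the smallest prefix of sorted tokens whose mass exceeds p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[..., 1:] = remove[..., :-1].clone()  # shift right so the boundary token is kept
    remove[..., 0] = False                      # always keep the most likely token
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(logits.softmax(dim=-1), num_samples=1)
```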
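
For the gradient-checkpointing item, a sketch that wraps a stack of transformer blocks with `torch.utils.checkpoint`; the wrapper class itself is illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlocks(torch.nn.Module):
    """Run a stack of blocks without storing their intermediate activations (illustrative)."""

    def __init__(self, blocks: torch.nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside `block` are recomputed during backward,
            # trading extra compute for a much smaller memory footprint.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```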
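
For the mixed-precision, learning-rate-scheduler, and gradient-clipping items, one combined training-step sketch. It assumes a CUDA device and a model whose forward returns an object with a `.loss` attribute; both are illustrative assumptions, not requirements of the techniques.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_steps: int, total_steps: int) -> LambdaLR:
    """Linear warmup followed by cosine decay (illustrative custom schedule)."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

def train_step(model, batch, optimizer, scheduler, scaler, max_norm: float = 1.0) -> float:
    """One mixed-precision step with gradient clipping (illustrative)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss          # assumption: forward returns an object with .loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)              # unscale gradients before clipping their norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```

Typical wiring: `scaler = torch.cuda.amp.GradScaler()` and `scheduler = warmup_cosine(optimizer, warmup_steps=1_000, total_steps=100_000)`.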
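
For the quantization item, the lowest-effort entry point is post-training dynamic quantization of `nn.Linear` layers with the stock PyTorch API; this is a sketch, not a full quantization recipe (no calibration, no quantization-aware training).

```python
import torch

def quantize_linear_layers(model: torch.nn.Module) -> torch.nn.Module:
    """Convert nn.Linear weights to int8 with dynamically quantized activations (illustrative)."""
    # Dynamic quantization in stock PyTorch targets CPU inference; GPU int8/int4
    # typically goes through other toolchains.
    return torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```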
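
For the KV-caching item, the core idea is that autoregressive decoding only needs to compute keys/values for the newest token and append them to a per-layer cache; attention then runs the new query against the full cached keys/values, so per-step cost grows linearly with context instead of quadratically. The class below is an illustrative sketch of that bookkeeping only.

```python
import torch

class KVCache:
    """Append-only key/value cache for one attention layer (illustrative sketch)."""

    def __init__(self):
        self.k = None  # (batch, heads, seq_so_far, d_head)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Concatenate along the time axis so previously computed keys/values
        # never have to be recomputed on later decoding steps.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```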
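
For the checkpoint-saving item, a sketch that writes a SHA-256 digest next to each checkpoint and verifies it before loading; the sidecar file layout is an illustrative choice.

```python
import hashlib
import json
import torch

def _sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def save_checkpoint(model, optimizer, step: int, path: str) -> None:
    """Save model/optimizer state, then write a sidecar digest for later verification."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)
    with open(path + ".sha256", "w") as f:
        json.dump({"sha256": _sha256(path), "step": step}, f)

def load_checkpoint(model, optimizer, path: str) -> int:
    """Refuse to load a checkpoint whose bytes do not match the stored digest."""
    with open(path + ".sha256") as f:
        expected = json.load(f)["sha256"]
    if _sha256(path) != expected:
        raise ValueError(f"checkpoint {path} failed integrity verification")
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```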