An LLM fine-tuning course and online conference for everything LLMs.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
Slide deck | Video recording | Q&A (video recording + password)
<<< Fine-Tuning Workshop 3 | Conference Talk: Best Practices For Fine Tuning Mistral >>>
Guest speakers: Travis Addair, Charles Frye, Joe Hoover
We will discuss inference servers, backends, and platforms like Replicate on which you can host models.
Guest speaker: Joe Hoover, Machine Learning Engineer at Replicate
(WIP)
Deploying large language models is hard! (But maybe not in the way you think.)
- Performance is multidimensional and zero-sum
- Think very, very carefully about your success criteria.
- Technology never stops evolving
- Minimize the cost of experimentation.
(WIP)
(WIP)
Theme: Lessons Learned Building a Platform for Training and Serving Fine-Tuned LLMs
Guest speaker: Travis Addair, co-founder and CTO at Predibase
- Uber (2016 - 2021), Michelangelo ML platform
- Lead maintainer of Horovod (14k stars), distributed deep learning framework
- Co-maintainer of Ludwig (10k stars), low-code declarative deep learning framework (like Axolotl, but designed for image models and pre-LLM-era models)
- Lead maintainer of LoRAX (1.7k stars), fine-tuned LLM inference system
(WIP)
General intelligence is great. But I don't need my point-of-sale system to recite French poetry.
(WIP)
Great for rapid experimentation, but:
- Lack model ownership
- Need to give up access to data
- Too slow, expensive & overkill for most tasks
(WIP)
          +---> Mistral-7B
          |     (Prioritize Customer Support Tickets)
          |
GPT-4 ----+---> Llama-3-70B
          |     (Customer Service Chatbot)
          |
          +---> BERT
                (Determine Customer)
Benefits of smaller task-specific LLMs
- Own your models
- Don't share data
- Smaller and faster
- Control the output
(WIP)
Guest speaker: Charles Frye at Modal
Batch vs Real Time
Batch vs Interactive
Batch vs Streaming
Throughput vs Latency
Inference trilemma
-
Throughput, Latency, Cost
- Throughput
  - Requests completed per unit time
  - Dependent on up/downstream systems
- Latency
  - Single request completion time
  - Human perception: ~400ms to respond
- Cost
  - Resources required to achieve service level
  - Dependent on value delivered
-
Batch, Real Time, Cost
- Batch - Nightly recsys refresh, eval CI/CD
- Real Time - Chatbots, copilots, audio/video, guardrails
- Cost - Consumer-facing, large-scale
(WIP)
-
Deployment Considerations:
- Throughput vs. Latency:
  - Throughput: Requests completed per unit time.
  - Latency: Time to complete a single request.
- Cost:
  - Resources needed to meet service levels.
  - High for consumer-facing, large-scale deployments.
-
Processing Types:
- Batch Processing:
  - Used for nightly refreshes, CI/CD evaluations.
- Real-Time Processing:
  - Applied in chatbots, copilot applications, audio/video, and guardrails.
-
Throughput and Latency Relationship:
- Increasing batch size improves throughput but penalizes latency.
- Strategies to improve both:
  - Quantizing, distilling, or truncating models.
  - Using more expensive hardware.
  - Writing optimized software.
- Very short latency targets require advanced solutions, e.g. keeping weights in cache memory/SRAM.
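A toy model of the batch-size trade-off described above (all numbers are made up for illustration; real servers with continuous batching behave differently):

```python
# Toy model of static batching: a fixed per-batch overhead plus per-request compute.
# Larger batches amortize the overhead (throughput up), but every request waits
# for the whole batch (latency up). Numbers are illustrative only.

def simulate(batch_size: int,
             per_batch_overhead_s: float = 0.05,
             per_request_compute_s: float = 0.02) -> tuple[float, float]:
    """Return (throughput in req/s, latency in s) for a naive static-batching server."""
    batch_time = per_batch_overhead_s + batch_size * per_request_compute_s
    throughput = batch_size / batch_time     # requests completed per second
    latency = batch_time                     # each request waits for the full batch
    return throughput, latency

for bs in (1, 4, 16, 64):
    tput, lat = simulate(bs)
    print(f"batch={bs:3d}  throughput={tput:6.1f} req/s  latency={lat * 1000:6.0f} ms")
```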
-
GPU Utilization:
- GPUs are throughput-oriented.
- Balancing latency and cost is challenging for models >13B parameters.
- GPU costs are high but falling over time.
- Peak GPU utilization averages 60%, but providers charge for 100%.
-
Modal Platform:
- Supports storage (dictionaries, queues, volumes, mounts), compute (functions, crons, GPU acceleration), and I/O (web endpoints, servers).
- Operates as a serverless runtime environment.
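A minimal sketch of a GPU-backed serverless function on Modal (decorator and class names follow Modal's documented Python API around the time of the talk; the model inside is a tiny placeholder, so verify against current docs before copying):

```python
# Minimal sketch: define a function, request a GPU, run it serverlessly on Modal.
import modal

app = modal.App("llm-inference-demo")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image, timeout=600)
def generate(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="distilgpt2")  # small placeholder model
    return pipe(prompt, max_new_tokens=32)[0]["generated_text"]

@app.local_entrypoint()
def main():
    # `modal run this_file.py` spins up a container on a remote GPU, then scales to zero.
    print(generate.remote("Hello from a serverless GPU:"))
```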
-
Cost of Deployment:
- Example costs: $1.10/hr per A10G GPU, $7.65/hr per H100 GPU.
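Turning those hourly prices into a per-token figure is just division by sustained throughput. The GPU prices below are the ones quoted above; the tokens/sec figures are hypothetical:

```python
# Back-of-the-envelope cost per million output tokens.
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(1.10, 500))   # A10G at a hypothetical 500 tok/s -> ~$0.61 / M tokens
print(cost_per_million_tokens(7.65, 4000))  # H100 at a hypothetical 4000 tok/s -> ~$0.53 / M tokens
```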
-
Demo and Resources:
- Includes a demo and a link for further details.
(For the Q&A, these are small notes instead of detailed notes.)
(watch the video at 00:00:00)
-
Axolotl: Merging a LoRA back to the base model: https://openaccess-ai-collective.github.io/axolotl/#merge-lora-to-base
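The Axolotl CLI linked above performs the merge for you; doing it by hand with PEFT looks roughly like this (model id and paths are placeholders):

```python
# Manually merging a LoRA adapter into its base model with PEFT.
# Paths and model id below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"   # base model the LoRA was trained on
adapter_dir = "./lora-out"              # directory containing the trained adapter

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()  # fold LoRA weights into the base

merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged-model")
```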
-
Dan's conference demo Huggingface repo: https://huggingface.co/dansbecker/conference-demo
-
Hosted Services
- Amazon SageMaker: https://aws.amazon.com/sagemaker/
- Anyscale: https://www.anyscale.com/
- Fireworks AI: https://fireworks.ai/
-
FastAPI: https://fastapi.tiangolo.com/
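FastAPI is typically the thin HTTP layer in front of whichever inference backend you pick; a minimal sketch (the model call is a stand-in):

```python
# Minimal FastAPI wrapper around a model call; `run_model` is a stand-in
# for a call into vLLM, TGI, llama.cpp, etc.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

def run_model(prompt: str, max_new_tokens: int) -> str:
    return prompt + " ... (model output here)"   # replace with a real backend call

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    return {"completion": run_model(req.prompt, req.max_new_tokens)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```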
-
Nvidia Triton: https://developer.nvidia.com/triton-inference-server
-
The Many Ways to Deploy a Model: https://outerbounds.com/blog/the-many-ways-to-deploy-a-model/
-
Replicate: Run and fine-tune open-source models. Deploy custom models at scale. All with one line of code. https://replicate.com/
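Calling a Replicate-hosted model from Python looks roughly like this (the model slug is a placeholder; set REPLICATE_API_TOKEN first):

```python
# Calling a model hosted on Replicate with the official Python client.
# The model slug is a placeholder; export REPLICATE_API_TOKEN before running.
import replicate

output = replicate.run(
    "your-username/your-fine-tuned-model",   # placeholder slug
    input={"prompt": "Summarize this support ticket: ...", "max_new_tokens": 128},
)
# Many language models on Replicate stream tokens, so the output is an iterator of strings.
print("".join(output))
```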
-
Parlance Labs (A consultancy focused on LLMs):
- Replicate Examples: https://github.com/parlance-labs/ftcourse/tree/master/replicate-examples
- HoneyComb model: https://huggingface.co/parlance-labs/hc-mistral-alpaca-merged-awq
-
Cog: Containers for machine learning: https://cog.run/
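Cog pairs a cog.yaml (environment definition) with a predict.py like the sketch below; the model loaded in setup() is a small placeholder:

```python
# predict.py: the interface Cog expects, paired with a cog.yaml that lists
# Python/CUDA requirements. The model here is a tiny placeholder.
from cog import BasePredictor, Input

class Predictor(BasePredictor):
    def setup(self):
        # Load weights once per container start, not per request.
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="distilgpt2")

    def predict(self,
                prompt: str = Input(description="Prompt to complete"),
                max_new_tokens: int = Input(default=64)) -> str:
        return self.pipe(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]
```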
-
Huggingface: Download files from the Hub: https://huggingface.co/docs/huggingface_hub/guides/download
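Typical download calls from the guide above:

```python
# Downloading model files from the Hugging Face Hub.
from huggingface_hub import hf_hub_download, snapshot_download

# Single file:
config_path = hf_hub_download(repo_id="mistralai/Mistral-7B-v0.1", filename="config.json")

# Entire repo (weights, tokenizer, config) into the local cache:
local_dir = snapshot_download(repo_id="parlance-labs/hc-mistral-alpaca-merged-awq")
print(config_path, local_dir)
```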
-
Speculative Decoding: Fast inference from large language models via speculative decoding: https://github.com/feifeibear/LLMSpeculativeSampling
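A highly simplified greedy sketch of the idea: a cheap draft model proposes k tokens, the large target model verifies them, and the longest agreeing prefix is kept. The real algorithm in the linked repo/paper uses rejection sampling so the target model's output distribution is preserved exactly:

```python
# Simplified greedy speculative decoding step. `draft_next_token` and
# `target_next_token` are stand-in callables that return the next token id
# for a given context; in real systems the target scores all k proposals
# in a single batched forward pass rather than one call per token.
def speculative_step(draft_next_token, target_next_token, context, k=4):
    # Draft model proposes k tokens cheaply.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next_token(ctx)
        proposed.append(t)
        ctx.append(t)

    # Target model verifies; keep the longest agreeing prefix.
    accepted, ctx = [], list(context)
    for t in proposed:
        if target_next_token(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # The target model always contributes at least one token of its own.
    accepted.append(target_next_token(ctx))
    return list(context) + accepted
```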
-
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution: https://arxiv.org/abs/2405.19325
-
Recommended reading by presenter:
- Continuous Batching: How continuous batching enables 23x throughput in LLM inference while reducing p50 latency: https://www.anyscale.com/blog/continuous-batching-llm-inference
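The core idea from that post, as a toy scheduler loop (Request and decode_step are stand-ins, not any server's real API):

```python
# Sketch of continuous (in-flight) batching: after every decode step, finished
# sequences leave the batch and queued requests join immediately, instead of
# waiting for the whole batch to drain. Real servers add paged KV caches etc.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    tokens_left: int                 # how many tokens this request still needs

def decode_step(batch):
    for r in batch:                  # one decode step emits one token per sequence
        r.tokens_left -= 1

def serve(queue: deque, max_batch_size: int = 8):
    active = []
    while queue or active:
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())               # admit new requests mid-flight
        decode_step(active)
        active = [r for r in active if r.tokens_left > 0]  # finished requests exit immediately

serve(deque(Request(tokens_left=n) for n in (3, 10, 2, 7)))
```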
-
Inference servers:
- Llama.cpp: https://github.com/ggerganov/llama.cpp
- vLLM: https://docs.vllm.ai/en/stable/
- Exllama: https://github.com/turboderp/exllama
- Huggingface TGI: https://huggingface.co/docs/text-generation-inference/en/index
- DeepSpeed-FastGen: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM
- SGLang: https://github.com/sgl-project/sglang
- Ollama: https://www.ollama.com/
- MLC: Universal LLM Deployment Engine with ML Compilation: https://llm.mlc.ai/
- LoRAX: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs: https://github.com/predibase/lorax
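A sketch of what a request to a running LoRAX server looks like, with the adapter hot-swapped per request (endpoint shape follows LoRAX's TGI-style /generate API; the adapter id is a placeholder):

```python
# Querying a running LoRAX server and selecting a LoRA adapter per request.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Prioritize this support ticket: ...",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "your-org/your-lora-adapter",   # placeholder HF adapter repo
        },
    },
)
print(resp.json()["generated_text"])
```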
-
Cog-vLLM: Run LLM inference on Replicate with vLLM: https://github.com/replicate/cog-vllm
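For reference, plain vLLM offline inference (roughly what cog-vllm packages behind Cog/Replicate) looks like this:

```python
# Plain vLLM offline inference; any HF model id or local path works.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Prioritize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```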
-
Predibase: "The Fastest Way to Fine-tune and Serve LLMs": https://predibase.com/
-
Efficiently Serving LLMs course by Travis Addair: https://www.deeplearning.ai/short-courses/efficiently-serving-llms/
-
The Kraken-LoRA model and architecture, a joint effort between Cognitive Computations, VAGO Solutions, and Hyperspace.ai: https://huggingface.co/VAGOsolutions/Kraken-LoRA
-
Efficient finetuning of Llama 3 with FSDP QDoRA: https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads: https://arxiv.org/abs/2401.10774
-
The State of AI Infrastructure at Scale 2024 (report):
-
"Latency lags bandwidth" by David A. Patterson: https://dl.acm.org/doi/pdf/10.1145/1022594.1022596
-
Programming Massively Parallel Processors: REDACTED THE LINK (is this legal for free?)
-
Performance benchmarks from fine-tuning 700+ open-source LLMs: https://predibase.com/fine-tuning-index
-
Modal: Featured examples: https://modal.com/docs/examples
-
vLLM: To create a new 4-bit quantized model, you can leverage AutoAWQ. https://docs.vllm.ai/en/stable/quantization/auto_awq.html
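Roughly what that doc walks through (model id and output path are placeholders):

```python
# Quantize a model to 4-bit AWQ with AutoAWQ, then load it in vLLM.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
quant_dir = "mistral-7b-instruct-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_dir)
tokenizer.save_pretrained(quant_dir)

# Serve the quantized weights with vLLM:
from vllm import LLM
llm = LLM(model=quant_dir, quantization="awq")
```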
-
Travis Addair on LinkedIn: https://www.linkedin.com/in/travisaddair/
Source: Discord
Some highlights:
(WIP)


