An LLM fine-tuning course and online conference for everything LLMs.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
Slide deck | Video recording | Q&A (video recording + password)
<<< Fine-Tuning Workshop 3 | Conference Talk: Best Practices For Fine Tuning Mistral >>>
Guest speakers: Travis Addair, Charles Frye, Joe Hoover
We will discuss inference servers, backends, and platforms like Replicate on which you can host models.
Guest speaker: Joe Hoover, Machine Learning Engineer at Replicate
(WIP)
Deploying large language models is hard! (But maybe not in the way you think.)
- Performance is multidimensional and zero-sum
- Think very, very carefully about your success criteria.
- Technology never stops evolving
- Minimize the cost of experimentation.
(WIP)
(WIP)
Theme: Lessons Learned Building a Platform for Training and Serving Fine-Tuned LLMs
Guest speaker: Travis Addair, co-founder and CTO at Predibase
- Uber (2016 - 2021), Michelangelo ML platform
- Lead maintainer of Horovod (14k stars), distributed deep learning framework
- Co-maintainer of Ludwig (10k stars), low-code declarative deep learning framework (like Axolotl, but designed for image models and pre-LLM-era models)
- Lead maintainer of LoRAX (1.7k stars), fine-tuned LLM inference system
(WIP)
General intelligence is great. But I don't need my point-of-sale system to recite French poetry.
(WIP)
Great for rapid experimentation, but:
- Lack model ownership
- Need to give up access to data
- Too slow, expensive & overkill for most tasks
(WIP)
          +---> Mistral-7B
          |     (Prioritize Customer Support Tickets)
          |
GPT-4 ----+---> Llama-3-70B
          |     (Customer Service Chatbot)
          |
          +---> BERT
                (Determine Customer)
Benefits of smaller task-specific LLMs
- Own your models
- Don't share data
- Smaller and faster
- Control the output
(WIP)
Guest speaker: Charles Frye at Modal
Batch vs Real Time
Batch vs Interactive
Batch vs Streaming
Throughput vs Latency
Inference trilemma
-
Throughput, Latency, Cost
- Throughput
  - Requests completed per unit time
  - Dependent on up/downstream systems
- Latency
  - Single request completion time
  - Human perception: ~400ms to respond
- Cost
  - Resources required to achieve service level
  - Dependent on value delivered
-
Batch, Real Time, Cost
- Batch - Nightly recsys refresh, eval CI/CD
- Real Time - Chatbots, copilots, audio/video, guardrails
- Cost - Consumer-facing, large-scale
(WIP)
-
Deployment Considerations:
- Throughput vs. Latency:
  - Throughput: Requests completed per unit time.
  - Latency: Time to complete a single request.
- Cost:
  - Resources needed to meet service levels.
  - High for consumer-facing, large-scale deployments.
-
Processing Types:
- Batch Processing:
  - Used for nightly refreshes, CI/CD evaluations.
- Real-Time Processing:
  - Applied in chatbots, copilot applications, audio/video, and guardrails.
-
Throughput and Latency Relationship:
- Increasing batch size improves throughput but penalizes latency.
- Strategies to improve both:
  - Quantizing, distilling, or truncating models.
  - Using more expensive hardware.
  - Writing optimized software.
- Very short latency targets require advanced solutions, e.g. keeping weights in cache memory/SRAM.
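A toy model of the batch-size trade-off described above (all numbers are made up for illustration; real servers with continuous batching behave differently):

```python
# Toy model of static batching: a fixed per-batch overhead plus per-request compute.
# Larger batches amortize the overhead (throughput up), but every request waits
# for the whole batch (latency up). Numbers are illustrative only.

def simulate(batch_size: int,
             per_batch_overhead_s: float = 0.05,
             per_request_compute_s: float = 0.02) -> tuple[float, float]:
    """Return (throughput in req/s, latency in s) for a naive static-batching server."""
    batch_time = per_batch_overhead_s + batch_size * per_request_compute_s
    throughput = batch_size / batch_time     # requests completed per second
    latency = batch_time                     # each request waits for the full batch
    return throughput, latency

for bs in (1, 4, 16, 64):
    tput, lat = simulate(bs)
    print(f"batch={bs:3d}  throughput={tput:6.1f} req/s  latency={lat * 1000:6.0f} ms")
```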
-
GPU Utilization:
- GPUs are throughput-oriented.
- Balancing latency and cost is challenging for models >13B parameters.
- GPU costs are high but falling over time.
- Peak GPU utilization averages 60%, but providers charge for 100%.
-
Modal Platform:
- Supports storage (dictionaries, queues, volumes, mounts), compute (functions, crons, GPU acceleration), and I/O (web endpoints, servers).
- Operates as a serverless runtime environment.
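A minimal sketch of a GPU-backed serverless function on Modal (decorator and class names follow Modal's documented Python API around the time of the talk; the model inside is a tiny placeholder, so verify against current docs before copying):

```python
# Minimal sketch: define a function, request a GPU, run it serverlessly on Modal.
import modal

app = modal.App("llm-inference-demo")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image, timeout=600)
def generate(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="distilgpt2")  # small placeholder model
    return pipe(prompt, max_new_tokens=32)[0]["generated_text"]

@app.local_entrypoint()
def main():
    # `modal run this_file.py` spins up a container on a remote GPU, then scales to zero.
    print(generate.remote("Hello from a serverless GPU:"))
```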
-
Cost of Deployment:
- Example costs: $1.10/hr per A10G GPU, $7.65/hr per H100 GPU.
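Turning those hourly prices into a per-token figure is just division by sustained throughput. The GPU prices below are the ones quoted above; the tokens/sec figures are hypothetical:

```python
# Back-of-the-envelope cost per million output tokens.
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(1.10, 500))   # A10G at a hypothetical 500 tok/s -> ~$0.61 / M tokens
print(cost_per_million_tokens(7.65, 4000))  # H100 at a hypothetical 4000 tok/s -> ~$0.53 / M tokens
```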
-
Demo and Resources:
- Includes a demo and a link for further details.
(For the Q&A, these are small notes instead of detailed notes.)
(watch the video at 00:00:00)
-
Axolotl: Merging a LoRA back to the base model: https://openaccess-ai-collective.github.io/axolotl/#merge-lora-to-base
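The Axolotl CLI linked above performs the merge for you; doing it by hand with PEFT looks roughly like this (model id and paths are placeholders):

```python
# Manually merging a LoRA adapter into its base model with PEFT.
# Paths and model id below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"   # base model the LoRA was trained on
adapter_dir = "./lora-out"              # directory containing the trained adapter

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()  # fold LoRA weights into the base

merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged-model")
```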
-
Dan's conference demo Huggingface repo: https://huggingface.co/dansbecker/conference-demo
-
Hosted Services
- Amazon SageMaker: https://aws.amazon.com/sagemaker/
- Anyscale: https://www.anyscale.com/
- Fireworks AI: https://fireworks.ai/
-
FastAPI: https://fastapi.tiangolo.com/
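FastAPI is typically the thin HTTP layer in front of whichever inference backend you pick; a minimal sketch (the model call is a stand-in):

```python
# Minimal FastAPI wrapper around a model call; `run_model` is a stand-in
# for a call into vLLM, TGI, llama.cpp, etc.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

def run_model(prompt: str, max_new_tokens: int) -> str:
    return prompt + " ... (model output here)"   # replace with a real backend call

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    return {"completion": run_model(req.prompt, req.max_new_tokens)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```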
-
Nvidia Triton: https://developer.nvidia.com/triton-inference-server
-
The Many Ways to Deploy a Model: https://outerbounds.com/blog/the-many-ways-to-deploy-a-model/
-
Replicate: Run and fine-tune open-source models. Deploy custom models at scale. All with one line of code. https://replicate.com/
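Calling a Replicate-hosted model from Python looks roughly like this (the model slug is a placeholder; set REPLICATE_API_TOKEN first):

```python
# Calling a model hosted on Replicate with the official Python client.
# The model slug is a placeholder; export REPLICATE_API_TOKEN before running.
import replicate

output = replicate.run(
    "your-username/your-fine-tuned-model",   # placeholder slug
    input={"prompt": "Summarize this support ticket: ...", "max_new_tokens": 128},
)
# Many language models on Replicate stream tokens, so the output is an iterator of strings.
print("".join(output))
```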
-
Parlance Labs (A consultancy focused on LLMs):
- Replicate Examples: https://github.com/parlance-labs/ftcourse/tree/master/replicate-examples
- HoneyComb model: https://huggingface.co/parlance-labs/hc-mistral-alpaca-merged-awq
-
Cog: Containers for machine learning: https://cog.run/
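Cog pairs a cog.yaml (environment definition) with a predict.py like the sketch below; the model loaded in setup() is a small placeholder:

```python
# predict.py: the interface Cog expects, paired with a cog.yaml that lists
# Python/CUDA requirements. The model here is a tiny placeholder.
from cog import BasePredictor, Input

class Predictor(BasePredictor):
    def setup(self):
        # Load weights once per container start, not per request.
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="distilgpt2")

    def predict(self,
                prompt: str = Input(description="Prompt to complete"),
                max_new_tokens: int = Input(default=64)) -> str:
        return self.pipe(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]
```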
-
Huggingface: Download files from the Hub: https://huggingface.co/docs/huggingface_hub/guides/download
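Typical download calls from the guide above:

```python
# Downloading model files from the Hugging Face Hub.
from huggingface_hub import hf_hub_download, snapshot_download

# Single file:
config_path = hf_hub_download(repo_id="mistralai/Mistral-7B-v0.1", filename="config.json")

# Entire repo (weights, tokenizer, config) into the local cache:
local_dir = snapshot_download(repo_id="parlance-labs/hc-mistral-alpaca-merged-awq")
print(config_path, local_dir)
```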
-
Speculative Decoding: Fast inference from large language models via speculative decoding: https://github.com/feifeibear/LLMSpeculativeSampling
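A highly simplified greedy sketch of the idea: a cheap draft model proposes k tokens, the large target model verifies them, and the longest agreeing prefix is kept. The real algorithm in the linked repo/paper uses rejection sampling so the target model's output distribution is preserved exactly:

```python
# Simplified greedy speculative decoding step. `draft_next_token` and
# `target_next_token` are stand-in callables that return the next token id
# for a given context; in real systems the target scores all k proposals
# in a single batched forward pass rather than one call per token.
def speculative_step(draft_next_token, target_next_token, context, k=4):
    # Draft model proposes k tokens cheaply.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next_token(ctx)
        proposed.append(t)
        ctx.append(t)

    # Target model verifies; keep the longest agreeing prefix.
    accepted, ctx = [], list(context)
    for t in proposed:
        if target_next_token(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # The target model always contributes at least one token of its own.
    accepted.append(target_next_token(ctx))
    return list(context) + accepted
```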
-
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution: https://arxiv.org/abs/2405.19325
-
Recommended reading by presenter:
- Continuous Batching: How continuous batching enables 23x throughput in LLM inference while reducing p50 latency: https://www.anyscale.com/blog/continuous-batching-llm-inference
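The core idea from that post, as a toy scheduler loop (Request and decode_step are stand-ins, not any server's real API):

```python
# Sketch of continuous (in-flight) batching: after every decode step, finished
# sequences leave the batch and queued requests join immediately, instead of
# waiting for the whole batch to drain. Real servers add paged KV caches etc.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    tokens_left: int                 # how many tokens this request still needs

def decode_step(batch):
    for r in batch:                  # one decode step emits one token per sequence
        r.tokens_left -= 1

def serve(queue: deque, max_batch_size: int = 8):
    active = []
    while queue or active:
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())               # admit new requests mid-flight
        decode_step(active)
        active = [r for r in active if r.tokens_left > 0]  # finished requests exit immediately

serve(deque(Request(tokens_left=n) for n in (3, 10, 2, 7)))
```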
-
Inference servers:
- Llama.cpp: https://github.com/ggerganov/llama.cpp
- vLLM: https://docs.vllm.ai/en/stable/
- Exllama: https://github.com/turboderp/exllama
- Huggingface TGI: https://huggingface.co/docs/text-generation-inference/en/index
- DeepSpeed-FastGen: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM
- SGLang: https://github.com/sgl-project/sglang
- Ollama: https://www.ollama.com/
- MLC: Universal LLM Deployment Engine with ML Compilation: https://llm.mlc.ai/
- LoRAX: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs: https://github.com/predibase/lorax
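A sketch of what a request to a running LoRAX server looks like, with the adapter hot-swapped per request (endpoint shape follows LoRAX's TGI-style /generate API; the adapter id is a placeholder):

```python
# Querying a running LoRAX server and selecting a LoRA adapter per request.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Prioritize this support ticket: ...",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "your-org/your-lora-adapter",   # placeholder HF adapter repo
        },
    },
)
print(resp.json()["generated_text"])
```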
-
Cog-vLLM: Run LLM inference on Replicate with vLLM: https://github.com/replicate/cog-vllm
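For reference, plain vLLM offline inference (roughly what cog-vllm packages behind Cog/Replicate) looks like this:

```python
# Plain vLLM offline inference; any HF model id or local path works.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Prioritize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```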
-
Predibase: "The Fastest Way to Fine-tune and Serve LLMs": https://predibase.com/
-
Efficiently Serving LLMs course by Travis Addair: https://www.deeplearning.ai/short-courses/efficiently-serving-llms/
-
The Kraken-LoRA model and architecture, a joint effort between Cognitive Computations, VAGO Solutions, and Hyperspace.ai: https://huggingface.co/VAGOsolutions/Kraken-LoRA
-
Efficient finetuning of Llama 3 with FSDP QDoRA: https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads: https://arxiv.org/abs/2401.10774
-
The State of AI Infrastructure at Scale 2024 (report):
-
"Latency lags bandwidth" by David A. Patterson: https://dl.acm.org/doi/pdf/10.1145/1022594.1022596
-
Programming Massively Parallel Processors: REDACTED THE LINK (is this legal for free?)
-
Performance benchmarks from fine-tuning 700+ open-source LLMs: https://predibase.com/fine-tuning-index
-
Modal: Featured examples: https://modal.com/docs/examples
-
vLLM: To create a new 4-bit quantized model, you can leverage AutoAWQ. https://docs.vllm.ai/en/stable/quantization/auto_awq.html
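Roughly what that doc walks through (model id and output path are placeholders):

```python
# Quantize a model to 4-bit AWQ with AutoAWQ, then load it in vLLM.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
quant_dir = "mistral-7b-instruct-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_dir)
tokenizer.save_pretrained(quant_dir)

# Serve the quantized weights with vLLM:
from vllm import LLM
llm = LLM(model=quant_dir, quantization="awq")
```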
-
Travis Addair on LinkedIn: https://www.linkedin.com/in/travisaddair/
Source: Discord
Some highlights:
(WIP)


