Fine-Tuning Workshop 4: Deploying Fine-Tuned Models (WIP)

Mastering LLMs: A Conference For Developers & Data Scientists

An online LLM fine-tuning course and conference covering everything LLMs.

Build skills to be effective with LLMs

Course website: https://maven.com/parlance-labs/fine-tuning

Slide deck | Video recording | Q&A (video recording + password)

<<< Fine-Tuning Workshop 3 | Conference Talk: Best Practices For Fine Tuning Mistral >>>

Fine-Tuning Workshop 4: Deploying Fine-Tuned Models (WIP)

Guest speakers: Travis Addair, Charles Frye, Joe Hoover

We will discuss inference servers, backends, and platforms like Replicate on which you can host models.


Deploying Large Language Models

Guest speaker: Joe Hoover, Machine Learning Engineer at Replicate

(WIP)

Deploying LLMs

Deploying large language models is hard! (But maybe not in the way you think.)

Deploying language models is hard because...

  • Performance is multidimensional and zero-sum
    • Think very, very carefully about your success criteria.
  • Technology never stops evolving
    • Minimize the cost of experimentation.

(WIP)

What makes LLMs slow?

(WIP)

Optimizing Inference for Fine-Tuned LLMs

Theme: Lessons Learned Building a Platform for Training and Serving Fine-Tuned LLMs

Guest speaker: Travis Addair, co-founder and CTO at Predibase

About Me

  • Uber (2016 - 2021), Michelangelo ML platform
  • Lead maintainer of Horovod (14k stars), a distributed deep learning framework
  • Co-maintainer of Ludwig (10k stars), a low-code declarative deep learning framework (like Axolotl, but designed for image models and pre-LLM models)
  • Lead maintainer of LoRAX (1.7k stars), a fine-tuned LLM inference system (see the request sketch below)
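To make "fine-tuned LLM inference system" concrete, here is a minimal request sketch against a LoRAX server. It assumes a server is already running locally on port 8080 and that the adapter ID is a placeholder; the exact endpoint and parameter names may vary by LoRAX version, so treat this as illustrative rather than canonical.

```python
import requests

# Hypothetical example: ask a locally running LoRAX server to generate text
# with a specific fine-tuned LoRA adapter applied on top of the base model.
# URL, adapter_id, and parameters are assumptions; adjust to your deployment.
LORAX_URL = "http://127.0.0.1:8080/generate"

payload = {
    "inputs": "Classify the priority of this support ticket: 'My payments page is down.'",
    "parameters": {
        "adapter_id": "my-org/ticket-priority-adapter",  # placeholder adapter name
        "max_new_tokens": 32,
    },
}

response = requests.post(LORAX_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```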

(WIP)

Our Story

Bigger isn't always better

General intelligence is great. But I don't need my point-of-sale system to recite French poetry.

(WIP)

Graduating from OpenAI to open-source

Commercial LLMs are a good starting point

Great for rapid experimentation, but:

  • Lack model ownership
  • Need to give up access to data
  • Too slow, expensive & overkill for most tasks

(WIP)

But the future is fine-tuned open-source

         +---> Mistral-7B
         |     (Prioritize Customer Support Tickets)
         |
GPT-4 ---|---> Llama-3-70B
         |     (Customer Service Chatbot)
         |
         +---> BERT 
               (Determine Customer)

Benefits of smaller task-specific LLMs

  • Own your models
  • Don't share data
  • Smaller and faster
  • Control the output

(WIP)

Deploying LLM Services on Modal

Guest speaker: Charles Frye at Modal

Slide deck

(Slides 3-5: see the slide deck.)

Trade-offs covered in the slides:

  • Batch vs. Real Time
  • Batch vs. Interactive
  • Batch vs. Streaming
  • Throughput vs. Latency

Inference trilemma

  • Throughput, Latency, Cost (a back-of-the-envelope sketch follows this list)

    • Throughput
      • requests completed per unit time
      • Dependent on up/downstream systems
    • Latency
      • Single request completion time
      • Human perception: ~400ms to respond
    • Cost
      • Resources required to achieve service level
      • Dependent on value delivered
  • Batch, Real Time, Cost

    • Batch - Nightly recsys refresh, eval CI/CD
    • Real Time - Chatbots, copilots, audio/video, guardrails
    • Cost - Consumer-facing, large-scale
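To make the trilemma concrete, here is a rough back-of-the-envelope sketch. Every number except the $1.10/hr A10G price quoted later in these notes is a hypothetical assumption, chosen only to show how batching trades per-request latency for throughput, and how throughput drives cost.

```python
# Back-of-the-envelope trade-off between throughput, latency, and cost.
# All constants are illustrative assumptions, except GPU_COST_PER_HOUR,
# which is the A10G price quoted later in these notes.
SINGLE_STEP_S = 0.020     # assumed time to decode one token for one request
STEP_GROWTH = 0.15        # assumed extra step time added per extra batched request
TOKENS_PER_REQUEST = 200  # assumed output length
GPU_COST_PER_HOUR = 1.10  # USD/hr, A10G figure from the notes

for batch_size in (1, 4, 16, 64):
    # A bigger batch makes each decode step slower...
    step_s = SINGLE_STEP_S * (1 + (batch_size - 1) * STEP_GROWTH)
    # ...which hurts per-request latency...
    latency_s = TOKENS_PER_REQUEST * step_s
    # ...but produces more tokens per second overall (higher throughput),
    # which in turn lowers the cost of serving each request.
    tokens_per_sec = batch_size / step_s
    requests_per_hour = tokens_per_sec * 3600 / TOKENS_PER_REQUEST
    cost_per_1k = GPU_COST_PER_HOUR / requests_per_hour * 1000
    print(f"batch={batch_size:>3}  {tokens_per_sec:6.0f} tok/s  "
          f"latency~{latency_s:5.1f}s  ~${cost_per_1k:.2f} per 1k requests")
```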

(WIP)

Summary

  • Deployment Considerations:

    • Throughput vs. Latency:
      • Throughput: Requests completed per unit time.
      • Latency: Time to complete a single request.
    • Cost:
      • Resources needed to meet service levels.
      • High for consumer-facing, large-scale deployments.
  • Processing Types:

    • Batch Processing:
      • Used for nightly refreshes, CI/CD evaluations.
    • Real-Time Processing:
      • Applied in chatbots, copilot applications, audio/video, and guardrails.
  • Throughput and Latency Relationship:

    • Increasing batch size improves throughput but penalizes latency.
    • Strategies to improve both:
      • Quantizing, distilling, truncating models.
      • Using more expensive hardware.
      • Writing optimized software.
    • Very short latency requires advanced solutions like cache memory/SRAM.
  • GPU Utilization:

    • GPUs are throughput-oriented.
    • Balancing latency and cost is challenging for models >13B parameters.
    • High costs for GPUs but falling over time.
    • Peak GPU utilization averages 60%, but providers charge for 100%.
  • Modal Platform:

    • Supports storage (dictionaries, queues, volumes, mounts), compute (functions, crons, GPU acceleration), and I/O (web endpoints, servers).
    • Operates as a serverless runtime environment (see the minimal sketch after this summary).
  • Cost of Deployment:

    • Example costs: $1.10/hr per A10G GPU, $7.65/hr per H100 GPU.
  • Demo and Resources:

    • Includes a demo and a link for further details.
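Since the summary only lists Modal's capabilities, here is a minimal sketch of what a serverless GPU deployment on Modal roughly looks like. It is written against the modal Python package's App / function / web_endpoint APIs as I understand them; the app name, image contents, GPU choice, and model are assumptions, and decorator names can differ between Modal versions, so treat it as a sketch rather than the talk's implementation.

```python
# Minimal sketch of a serverless LLM endpoint on Modal (illustrative only).
# App name, image contents, GPU type, and model identifier are assumptions.
import modal

app = modal.App("finetuned-llm-demo")  # hypothetical app name

# Container image with the inference dependencies installed.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image)  # A10G: the $1.10/hr GPU from the notes
@modal.web_endpoint(method="POST")      # expose the function over HTTP
def generate(request: dict) -> dict:
    """Serve one generation request from a (hypothetical) fine-tuned model."""
    from transformers import pipeline

    # Loading the model per request keeps the sketch short; a real deployment
    # would load it once per container, e.g. in a class with @modal.enter().
    pipe = pipeline("text-generation", model="my-org/my-finetuned-model")  # placeholder
    output = pipe(request["prompt"], max_new_tokens=64)
    return {"completion": output[0]["generated_text"]}
```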

Q&A

(For the Q&A, these are brief notes rather than detailed notes.)

(watch the video at 00:00:00)

Lesson Resources

Source: Discord

Discord Messages

Some highlights:

(WIP)
