Fine-Tuning Workshop 3: Instrumenting & Evaluating LLMs (WIP)

Mastering LLMs: A Conference For Developers & Data Scientists

An LLM fine-tuning course and online conference for everything LLMs.

Build skills to be effective with LLMs

Course website: https://maven.com/parlance-labs/fine-tuning

Slide deck | Video recording

<<< Fine-Tuning Workshop 2 | Fine-Tuning Workshop 4 >>>

Fine-Tuning Workshop 3: Instrumenting & Evaluating LLMs (WIP)

Guest speakers: Harrison Chase, Bryan Bischof, Shreya Shankar, Eugene Yan

We will discuss how to instrument and evaluate LLMs with industry guest speakers.


Plan For Today

Slide 2

  • Eval types and tradeoffs
  • LangSmith Deep Dive - Harrison Chase
  • Hex Case Study - Bryan Bischof
  • Metrics - Eugene Yan
  • Evals UX & Workflows - Shreya Shankar

(WIP)

Key: Make the dev cycle fast

Slide 4

(WIP)

Types of Evaluations

Slide 5

(WIP)

Editing Out Stereotypes In Academic Writing

Slide 6

Norway's mining economy flourished during the period due to Norwegians' natural hardiness.

(WIP)

Unit Tests

Slide 7

(WIP)

Slide 8

(WIP)
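
A minimal sketch of what such unit tests can look like, assuming a hypothetical generate_response function that wraps your own LLM pipeline (the prompts and assertions are illustrative, not the course's actual tests):

```python
import json
import re


def generate_response(prompt: str) -> str:
    """Placeholder for the LLM call under test; swap in your real pipeline."""
    return '{"greeting": "Welcome aboard, Alice!"}'


def test_no_unrendered_placeholders():
    # Cheap assertion: template variables like "{name}" should never leak into output.
    output = generate_response("Draft a welcome message for Alice.")
    assert not re.search(r"\{[a-z_]+\}", output)


def test_structured_output_is_valid_json():
    # If the prompt asks for JSON, the response should always parse.
    output = generate_response("Return the welcome message as JSON.")
    json.loads(output)  # raises ValueError on malformed output
```

Because these run in seconds under pytest, they can gate every prompt or model change, which is what keeps the dev cycle fast.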

Generate Data For Each Scenario

Slide 9

(WIP)

Use LLMs to synthetically generate inputs to the system

Slide 10

(WIP)
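
A rough sketch of this pattern, which also covers the "Generate Data For Each Scenario" slide above: prompt a strong model once per scenario to invent varied user inputs, then freeze the set. The OpenAI client is used only as an example (any chat-completion API works the same way), and the scenario list is illustrative:

```python
from openai import OpenAI  # any chat-completion client works similarly

client = OpenAI()

SCENARIOS = [
    "user asks for a SQL query over a sales table",
    "user pastes a stack trace and asks for a fix",
    "user request is ambiguous and needs a clarifying question",
]


def synthesize_inputs(scenario: str, n: int = 5) -> list[str]:
    """Ask a strong model to invent realistic user inputs for one scenario."""
    prompt = (
        f"Write {n} realistic, varied user messages for this scenario:\n"
        f"{scenario}\nReturn one message per line, no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]


# Build the eval set once, then commit it so results stay comparable across runs.
eval_inputs = {scenario: synthesize_inputs(scenario) for scenario in SCENARIOS}
```

Committing the generated inputs to the repo, rather than regenerating them each run, keeps results comparable across prompt and model versions.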

Log Results To Database / Visualize

Slide 11

(WIP)
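
A minimal sketch of that logging step using SQLite from the standard library; the schema and fields are assumptions, not the course's actual setup:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("evals.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS eval_results (
           run_id TEXT, test_name TEXT, model TEXT,
           passed INTEGER, output TEXT, created_at TEXT)"""
)


def log_result(run_id: str, test_name: str, model: str, passed: bool, output: str) -> None:
    """Append one eval outcome so every run stays queryable for later comparison."""
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, test_name, model, int(passed), output,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


# Visualization-ready summary: pass rate per model across all runs.
for model, rate in conn.execute(
    "SELECT model, AVG(passed) FROM eval_results GROUP BY model"
):
    print(f"{model}: {rate:.0%} pass rate")
```

The same table can feed a spreadsheet, notebook, or dashboard for the "visualize" half of the slide.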

Spellgrounds for Prodigious Prestidigitation

Presenter: Dr. Bryan Bischof, Head of AI @ Hex [Twitter]

Slide deck

(watch the video at 00:00:00)

Summary

The presentation discusses the evaluation of LLMs in the context of AI and Data Science. It outlines several key points and best practices for effectively evaluating LLMs:

  1. Purpose of Evaluations: Evaluations help determine when a capability is ready for customers, ensure system reliability, and assist in debugging issues.

  2. Common Mistakes:

    • Misunderstanding the novelty of LLM evaluations and neglecting insights from data scientists.
    • Excluding use-case experts who can define what "good" looks like.
    • Delaying evaluations until late in the development process.
    • Confusing product metrics with evaluation metrics.
    • Over-relying on purchased evaluation frameworks without understanding user stories.
    • Prematurely using LLM-assisted evaluations without thorough validation.
  3. Evaluation Techniques:

    • For Retrieval-Augmented Generation (RAG), treat evaluations like retrieval systems: label data and measure hit-rate (see the sketch after this summary).
    • For planning and agent systems, use structured outputs and evaluate consistency at each decision step.
    • Implement repeated-measures designs for updates and changes to ensure significant improvements.
    • Bonus: Minimize drift by using production endpoints in the evaluation framework.
  4. Specific Examples:

    • Magic, an AI copilot for Data Science that lives in Hex, is used to illustrate practical applications and the need for specific evaluations such as SQL generation, polyglot code chains, and reaction to user edits.

The presentation emphasizes the importance of integrating evaluations into the development cycle, involving experts, and using practical, user-focused metrics to ensure LLMs perform reliably and effectively in real-world applications.
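
To make the RAG point above concrete: hit-rate is simply the fraction of labeled queries for which at least one known-relevant document appears in the top-k retrieved results. A minimal sketch, assuming you supply labeled (query, relevant doc ids) pairs and your own retrieve function:

```python
def hit_rate_at_k(labeled_queries, retrieve, k: int = 5) -> float:
    """Fraction of queries where at least one labeled-relevant doc appears in the top-k.

    labeled_queries: iterable of (query, set_of_relevant_doc_ids)
    retrieve:        retrieval function, query -> ranked list of doc ids
    """
    hits = 0
    total = 0
    for query, relevant_ids in labeled_queries:
        retrieved = retrieve(query)[:k]
        hits += any(doc_id in relevant_ids for doc_id in retrieved)
        total += 1
    return hits / total if total else 0.0


# Toy in-memory "retriever" so the snippet runs as-is.
labeled = [("monthly revenue by region", {"doc_12"}),
           ("churn rate definition", {"doc_7", "doc_9"})]
fake_index = {"monthly revenue by region": ["doc_3", "doc_12", "doc_1"],
              "churn rate definition": ["doc_2", "doc_4", "doc_8"]}
print(hit_rate_at_k(labeled, lambda q: fake_index.get(q, []), k=3))  # 0.5
```

In practice, retrieve would call your production retrieval endpoint, which also addresses the bonus point about minimizing drift between production and the evaluation framework.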

(WIP)

Q&A

(These are brief notes for the Q&A rather than detailed notes.)

(watch the video at 00:00:00)

Lesson Resources

Here are all the links I gathered today:

Source: Discord

Discord Messages

Some highlights:

(WIP)
