An LLM fine-tuning course and online conference for everything LLMs.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
Guest speakers: Harrison Chase, Bryan Bischof, Shreya Shankar, Eugene Yan
We will discuss how to instrument and evaluate LLMs with industry guest speakers.
Slide 2
- Eval types and tradeoffs
- Langsmith Deep Dive - Harrison Chase
- Hex Case Study - Bryan Bischof
- Metrics - Eugene Yan
- Evals UX & Workflows - Shreya Shankar
(WIP)
Slide 4
(WIP)
Slide 5
(WIP)
Slide 6
(WIP)
Slide 7
(WIP)
Slide 8
(WIP)
Slide 9
(WIP)
Slide 10
(WIP)
Slide 11
(WIP)
Presenter: Dr. Bryan Bischof, Head of AI @ Hex [Twitter]
(watch the video at 00:00:00)
The presentation discusses the evaluation of LLMs in the context of AI and Data Science. It outlines several key points and best practices for effectively evaluating LLMs:
- Purpose of Evaluations: Evaluations help determine when a capability is ready for customers, ensure system reliability, and assist in debugging issues.
- Common Mistakes:
  - Misunderstanding the novelty of LLM evaluations and neglecting insights from data scientists.
  - Excluding use-case experts who can define what "good" looks like.
  - Delaying evaluations until late in the development process.
  - Confusing product metrics with evaluation metrics.
  - Over-relying on purchased evaluation frameworks without understanding user stories.
  - Prematurely using LLM-assisted evaluations without thorough validation.
- Evaluation Techniques:
  - For Retrieval-Augmented Generation (RAG), treat evaluations like retrieval systems: label data and measure hit-rate (see the hit-rate sketch after this list).
  - For planning and agent systems, use structured outputs and evaluate consistency at each decision step.
  - Implement repeated-measures designs for updates and changes, so improvements are shown to be significant (see the paired-comparison sketch after the summary).
  - Bonus: Minimize drift by using production endpoints in the evaluation framework.
- Specific Examples:
  - Magic, an AI copilot for Data Science that lives in Hex, is used to illustrate practical applications and the need for specific evaluations such as SQL generation, polyglot code chains, and reaction to user edits.
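The hit-rate advice above can be made concrete in a few lines. A minimal sketch, assuming you have hand-labeled (query, relevant document ids) pairs and a `retrieve()` function that returns a ranked list of document ids; both names are placeholders for illustration, not Hex's or any specific library's API:

```python
# Hit-rate@k for a RAG retriever: the fraction of labeled queries where at
# least one known-relevant document appears in the top-k results.
# `retrieve` is a hypothetical function (query, k) -> ranked list of doc ids.

def hit_rate_at_k(labeled_queries, retrieve, k=5):
    hits = 0
    for query, relevant_ids in labeled_queries:
        retrieved = retrieve(query, k=k)
        if any(doc_id in relevant_ids for doc_id in retrieved):
            hits += 1
    return hits / len(labeled_queries)

# Toy example with a stubbed retriever (placeholder data, not real labels):
labeled = [
    ("monthly active users by region", {"tbl_users", "tbl_regions"}),
    ("churn rate last quarter", {"tbl_subscriptions"}),
]
stub_retrieve = lambda query, k: ["tbl_users", "tbl_events"][:k]
print(hit_rate_at_k(labeled, stub_retrieve))  # -> 0.5
```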
The presentation emphasizes the importance of integrating evaluations into the development cycle, involving experts, and using practical, user-focused metrics to ensure LLMs perform reliably and effectively in real-world applications.
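On the repeated-measures point: when comparing two prompt or model variants, score both on the same labeled examples and test the paired wins and losses rather than eyeballing aggregate pass rates. A minimal sketch using an exact sign test over discordant pairs; `evaluate_a` and `evaluate_b` are hypothetical per-example graders returning pass/fail, not part of any tool mentioned here:

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test on discordant pairs (ties are ignored)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # P(X >= k) with X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def compare_variants(examples, evaluate_a, evaluate_b):
    """Run both variants on the same examples and count per-example wins."""
    wins_a = wins_b = 0
    for example in examples:
        a_pass, b_pass = evaluate_a(example), evaluate_b(example)
        if a_pass and not b_pass:
            wins_a += 1
        elif b_pass and not a_pass:
            wins_b += 1
    return wins_a, wins_b, sign_test_p_value(wins_a, wins_b)
```

Because both variants see exactly the same examples, the comparison stays apples-to-apples; a small but consistent per-example win can reach significance even when aggregate pass rates look noisy.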
(WIP)
(For the Q&A, these are brief notes rather than detailed notes.)
(watch the video at 00:00:00)
Here are all the links I gathered today:
- A few tips for working on high-surface-area problems https://johnowhitaker.dev/dsc/2024-01-23-tips.html
- SQLModel, SQL databases in Python, designed for simplicity, compatibility, and robustness. https://sqlmodel.tiangolo.com/
- OpenLLMetry is an open source project that allows you to easily start monitoring and debugging the execution of your LLM app. Tracing is done in a non-intrusive way, built on top of OpenTelemetry. You can choose to export the traces to Traceloop, or to your existing observability stack. https://www.traceloop.com/docs/openllmetry/introduction
- Py.test plugin for managing VCR.py cassettes. https://pytest-vcr.readthedocs.io/en/latest/#quick-start
- The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" https://arxiv.org/abs/2309.12288
- LLM Evaluators Recognize and Favor Their Own Generations https://arxiv.org/abs/2404.13076
- Automated Evaluation w/ LLMs https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms
- Book: Noise: The new book from the authors of ‘Thinking, Fast and Slow’ and ‘Nudge’ https://www.amazon.co.uk/Noise-Daniel-Kahneman/dp/0008308993
- An all-in-one developer platform for every step of the LLM-powered application lifecycle, whether you’re building with LangChain or not. https://www.langchain.com/langsmith
- From the team behind Pydantic, Logfire is a new type of observability platform built on the same belief as our open source library — that the most powerful tools can be easy to use. https://pydantic.dev/logfire
- The all-in-one hiring solution, now with AI https://www.usebraintrust.com/
- Develop GenAI applications with confidence using W&B Weave https://wandb.ai/site/weave
- OSS -> Instruct (?) -> Couldn't find link (for rendering and logging traces)
- Open-source observability for your LLM application, based on OpenTelemetry https://github.com/traceloop/openllmetry
- OpenTelemetry on Hacker News: https://news.ycombinator.com/item?id=37843907
- Traces, evals, prompt management and metrics to debug and improve your LLM application. https://langfuse.com/
- Queries, notebooks, reports, data apps, and AI — all in the world's leading collaborative data workspace. https://hex.tech/
- Save hours every week by using Magic AI to write queries, build charts, and fix bugs. https://hex.tech/product/magic-ai/
- Task-Specific LLM Evals that Do & Don't Work https://eugeneyan.com/writing/evals/
- Eugene Yan's talk: https://github.com/eugeneyan/visualizing-finetunes
- What We Learned from a Year of Building with LLMs (Part I) https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/
- Sentence Transformers 3.0 Release: https://x.com/tomaarsen/status/1795425797408235708
- USB: A Unified Summarization Benchmark Across Tasks and Domains https://arxiv.org/abs/2305.14296
- Prompting Fundamentals and How to Apply them Effectively https://eugeneyan.com/writing/prompting/
- Scaling Up “Vibe Checks” for LLMs - Shreya Shankar | Stanford MLSys #97 https://www.youtube.com/watch?v=eGVDKegRdgM&t=139s
- ChainForge is an open-source visual programming environment for prompt engineering. https://chainforge.ai/
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences https://arxiv.org/abs/2404.12272
- Breaking Down EvalGen: Who Validates the Validators? https://arize.com/blog/breaking-down-evalgen-who-validates-the-validators/
- Shreya Shankar's personal GitHub: https://github.com/shreyashankar
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines https://arxiv.org/abs/2401.03038
- LlamaIndex Workshop: Evaluation-Driven Development (EDD) https://www.youtube.com/watch?v=ua93WTjIN7s
- Eugene Yan's personal Twitter: https://x.com/eugeneyan
- Fast.ai forums: https://forums.fast.ai/
- "My colleagues and I distilled practical advice re: LLMs into this three-part series. Lot's of bangers." https://x.com/HamelHusain/status/1795526367637049629
Source: Discord
Some highlights:
(WIP)






