An LLM fine-tuning course and online conference for everything LLMs.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
Guest speakers: Harrison Chase, Bryan Bischof, Shreya Shankar, Eugene Yan
We will discuss how to instrument and evaluate LLMs with industry guest speakers.
Slide 2
- Eval types and tradeoffs
- Langsmith Deep Dive - Harrison Chase
- Hex Case Study - Bryan Bischof
- Metrics - Eugene Yan
- Evals UX & Workflows - Shreya Shankar
(WIP)
Slide 4
(WIP)
Slide 5
(WIP)
Slide 6
(WIP)
Slide 7
(WIP)
Slide 8
(WIP)
Slide 9
(WIP)
Slide 10
(WIP)
Slide 11
(WIP)
Presenter: Dr. Bryan Bischof, Head of AI @ Hex [Twitter]
(watch the video at 00:00:00)
The presentation discusses the evaluation of LLMs in the context of AI and Data Science. It outlines several key points and best practices for effectively evaluating LLMs:
- Purpose of Evaluations: Evaluations help determine when a capability is ready for customers, ensure system reliability, and assist in debugging issues.
- Common Mistakes:
  - Misunderstanding the novelty of LLM evaluations and neglecting insights from data scientists.
  - Excluding use-case experts who can define what "good" looks like.
  - Delaying evaluations until late in the development process.
  - Confusing product metrics with evaluation metrics.
  - Over-relying on purchased evaluation frameworks without understanding user stories.
  - Prematurely using LLM-assisted evaluations without thorough validation.
- Evaluation Techniques:
  - For Retrieval-Augmented Generation (RAG), treat evaluations like retrieval systems: label data and measure hit-rate (see the hit-rate sketch after this list).
  - For planning and agent systems, use structured outputs and evaluate consistency at each decision step.
  - Implement repeated-measures designs for updates and changes, so improvements are shown to be significant (see the paired-comparison sketch after the summary).
  - Bonus: Minimize drift by using production endpoints in the evaluation framework.
- Specific Examples:
  - Magic, an AI copilot for Data Science that lives in Hex, is used to illustrate practical applications and the need for specific evaluations such as SQL generation, polyglot code chains, and reaction to user edits.
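The hit-rate advice above can be made concrete in a few lines. A minimal sketch, assuming you have hand-labeled (query, relevant document ids) pairs and a `retrieve()` function that returns a ranked list of document ids; both names are placeholders for illustration, not Hex's or any specific library's API:

```python
# Hit-rate@k for a RAG retriever: the fraction of labeled queries where at
# least one known-relevant document appears in the top-k results.
# `retrieve` is a hypothetical function (query, k) -> ranked list of doc ids.

def hit_rate_at_k(labeled_queries, retrieve, k=5):
    hits = 0
    for query, relevant_ids in labeled_queries:
        retrieved = retrieve(query, k=k)
        if any(doc_id in relevant_ids for doc_id in retrieved):
            hits += 1
    return hits / len(labeled_queries)

# Toy example with a stubbed retriever (placeholder data, not real labels):
labeled = [
    ("monthly active users by region", {"tbl_users", "tbl_regions"}),
    ("churn rate last quarter", {"tbl_subscriptions"}),
]
stub_retrieve = lambda query, k: ["tbl_users", "tbl_events"][:k]
print(hit_rate_at_k(labeled, stub_retrieve))  # -> 0.5
```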
The presentation emphasizes the importance of integrating evaluations into the development cycle, involving experts, and using practical, user-focused metrics to ensure LLMs perform reliably and effectively in real-world applications.
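On the repeated-measures point: when comparing two prompt or model variants, score both on the same labeled examples and test the paired wins and losses rather than eyeballing aggregate pass rates. A minimal sketch using an exact sign test over discordant pairs; `evaluate_a` and `evaluate_b` are hypothetical per-example graders returning pass/fail, not part of any tool mentioned here:

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test on discordant pairs (ties are ignored)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # P(X >= k) with X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def compare_variants(examples, evaluate_a, evaluate_b):
    """Run both variants on the same examples and count per-example wins."""
    wins_a = wins_b = 0
    for example in examples:
        a_pass, b_pass = evaluate_a(example), evaluate_b(example)
        if a_pass and not b_pass:
            wins_a += 1
        elif b_pass and not a_pass:
            wins_b += 1
    return wins_a, wins_b, sign_test_p_value(wins_a, wins_b)
```

Because both variants see exactly the same examples, the comparison stays apples-to-apples; a small but consistent per-example win can reach significance even when aggregate pass rates look noisy.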
(WIP)
(For the Q&A, these are brief notes rather than detailed notes.)
(watch the video at 00:00:00)
Here are all the links I gathered today:
- A few tips for working on high-surface-area problems https://johnowhitaker.dev/dsc/2024-01-23-tips.html
- SQLModel, SQL databases in Python, designed for simplicity, compatibility, and robustness. https://sqlmodel.tiangolo.com/
- OpenLLMetry is an open source project that allows you to easily start monitoring and debugging the execution of your LLM app. Tracing is done in a non-intrusive way, built on top of OpenTelemetry. You can choose to export the traces to Traceloop, or to your existing observability stack. https://www.traceloop.com/docs/openllmetry/introduction
- Py.test plugin for managing VCR.py cassettes. https://pytest-vcr.readthedocs.io/en/latest/#quick-start
- The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" https://arxiv.org/abs/2309.12288
- LLM Evaluators Recognize and Favor Their Own Generations https://arxiv.org/abs/2404.13076
- Automated Evaluation w/ LLMs https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms
- Book: Noise: The new book from the authors of ‘Thinking, Fast and Slow’ and ‘Nudge’ https://www.amazon.co.uk/Noise-Daniel-Kahneman/dp/0008308993
- An all-in-one developer platform for every step of the LLM-powered application lifecycle, whether you’re building with LangChain or not. https://www.langchain.com/langsmith
- From the team behind Pydantic, Logfire is a new type of observability platform built on the same belief as our open source library — that the most powerful tools can be easy to use. https://pydantic.dev/logfire
- The all-in-one hiring solution, now with AI https://www.usebraintrust.com/
- Develop GenAI applications with confidence using W&B Weave https://wandb.ai/site/weave
- OSS -> Instruct (?) -> Couldn't find link (for rendering and logging traces)
- Open-source observability for your LLM application, based on OpenTelemetry https://github.com/traceloop/openllmetry
- OpenTelemetry on Hacker News: https://news.ycombinator.com/item?id=37843907
- Traces, evals, prompt management and metrics to debug and improve your LLM application. https://langfuse.com/
- Queries, notebooks, reports, data apps, and AI — all in the world's leading collaborative data workspace. https://hex.tech/
- Save hours every week by using Magic AI to write queries, build charts, and fix bugs. https://hex.tech/product/magic-ai/
- Task-Specific LLM Evals that Do & Don't Work https://eugeneyan.com/writing/evals/
- Eugene Yan's talk: https://github.com/eugeneyan/visualizing-finetunes
- What We Learned from a Year of Building with LLMs (Part I) https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/
- Sentence Transformers 3.0 Release: https://x.com/tomaarsen/status/1795425797408235708
- USB: A Unified Summarization Benchmark Across Tasks and Domains https://arxiv.org/abs/2305.14296
- Prompting Fundamentals and How to Apply them Effectively https://eugeneyan.com/writing/prompting/
- Scaling Up “Vibe Checks” for LLMs - Shreya Shankar | Stanford MLSys #97 https://www.youtube.com/watch?v=eGVDKegRdgM&t=139s
- ChainForge is an open-source visual programming environment for prompt engineering. https://chainforge.ai/
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences https://arxiv.org/abs/2404.12272
- Breaking Down EvalGen: Who Validates the Validators? https://arize.com/blog/breaking-down-evalgen-who-validates-the-validators/
- Shreya Shankar's personal GitHub: https://github.com/shreyashankar
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines https://arxiv.org/abs/2401.03038
- LlamaIndex Workshop: Evaluation-Driven Development (EDD) https://www.youtube.com/watch?v=ua93WTjIN7s
- Eugene Yan's personal Twitter: https://x.com/eugeneyan
- Fast.ai forums: https://forums.fast.ai/
- "My colleagues and I distilled practical advice re: LLMs into this three-part series. Lot's of bangers." https://x.com/HamelHusain/status/1795526367637049629
Source: Discord
Some highlights:
(WIP)






