An online conference for everything LLMs, part of an LLM fine-tuning course.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
Slides | Code | Video recording
Guest speaker: JJ Allaire
Date: May 30, 2024
This talk will cover using and extending Inspect, a new OSS Python framework for LLM evals. Inspect's developer (J.J. Allaire) will walk through the core concepts and design of Inspect and demonstrate its use for a variety of evaluation tasks. Inspect makes it very straightforward to implement simple “benchmark” style evaluations, but also has the depth and flexibility to support highly complex evals. Inspect supports both exploratory (notebook/REPL) workflows as well as creating large-scale eval suites that run in batch. We'll cover all of this ground as well as creating custom extensions to Inspect.
Bio: long-time creator of developer tools.
JJ has a dedicated Wikipedia page - he created ColdFusion, RStudio, ...
Hamel's Tweet: https://x.com/HamelHusain/status/1798964392819892303
Up until recently, there wasn't an OSS vendorless LLM eval framework I liked. Then I found Inspect, and I love it because:
- VSCode Plugins w/Viz & UI
- Composability & Developer experience
- Made by JJ Allaire (cracked eng w/a track record)
Slide 1:
Open Source Framework for LLM Evals
Slide 2:
- A Python package (inspect_ai) used to create LLM evaluations: pip install inspect_ai
- Developed and maintained by the UK AI Safety Institute
- Similar in function to the eval frameworks embedded in benchmark suites (e.g. OpenAI Evals, EleutherAI LM Eval Harness, etc.) but designed from the ground up for development of more complex evals
- Focus on bridging research and production: provide a great development experience for researchers that results in evals that can be reproducibly run at scale
Slide 3:
{Inspect VSCode extension screenshot}
Slide 4:
- Dataset - List of samples with input and target
- Solvers - Functions that transform dataset inputs, call the model for generation, and act further on model output. Can be composed together as layers, or can be a single layer with higher internal complexity
- Scorer - Evaluates final output of solvers. May use text comparisons, model grading, or other custom schemes
(My thoughts: very well designed software architecture - the abstractions are simple, and the library is designed for composing simple pieces into more complex things.)
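To make the composition concrete, here is a minimal sketch of a task combining the three concepts. It is illustrative only: the task name and sample are made up, and it assumes the plan=/scorer= form shown later in these slides, which may differ in newer Inspect releases.
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message

@task
def hello_eval():
    return Task(
        # Dataset: samples with an input and a target
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        # Solvers: composed as layers - a system prompt, then model generation
        plan=[
            system_message("Answer concisely."),
            generate(),
        ],
        # Scorer: checks whether the target text appears in the model output
        scorer=includes(),
    )

# run it against a model, e.g.:
# eval(hello_eval(), model="openai/gpt-4")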
Slide 5:
(WIP)
Slide 16:
A Solver is a Python function that takes a TaskState and transforms it in some useful fashion.
TaskState (initialised from sample):
class TaskState:
    messages: list[ChatMessage]
    output: ModelOutput
    # ...
Solver function:
async def solve(state: TaskState, generate: Generate) -> TaskState:
    # do something useful with state (prompt engineering,
    # generating model output, critique and regenerate, etc.)
    return state
(WIP)
Slide 17:
prompt_template()
async def solve(state: TaskState, generate: Generate) -> TaskState:
    prompt = state.user_prompt
    prompt.text = prompt_template.format(prompt=prompt.text, **params)
    return state
Modifies the existing prompt by passing it through a template.
generate()
async def solve(state: TaskState, generate: Generate) -> TaskState:
    return await generate(state)
Calls the model, appends the assistant message, and updates the model output.
(WIP)
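Custom solvers follow the same shape. A hedged sketch using the @solver decorator (the prefix_prompt name and its behaviour are illustrative, not part of Inspect):
from inspect_ai.solver import Generate, TaskState, solver

@solver
def prefix_prompt(prefix: str):
    # Prepend a fixed prefix to the user prompt before generation
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        state.user_prompt.text = f"{prefix}\n\n{state.user_prompt.text}"
        return state
    return solve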
Slide 20:
Eval development frequently involves creating custom solvers and scorers. If made available in a Python package, these can be re-used across many evals.
Some jailbreaking solvers from an internal sheppard package:
- encode() - message obfuscation jailbreak
- pap_jailbreak() - Persuasion Adversarial Prompt (PAP)
- payload_splitting() - PARROT jailbreak
- cr_jailbreak() - content reinforcement
(WIP)
Slide 22:
TaskState also includes tools:
class TaskState:
    messages: list[ChatMessage]
    tools: list[ToolDef]
    tool_choice: ToolChoice
    output: ModelOutput
    # ...
use_tools() makes tools available to generate():
return Task(
    dataset=example_dataset("biology_qa"),
    plan=[
        use_tools(web_search()),
        generate()
    ],
    scorer=model_graded_qa(template=GRADER_TEMPLATE),
)
(WIP)
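Custom tools can also be defined. A rough sketch based on the tool decorator pattern in the Inspect docs (in current releases the decorator lives in inspect_ai.tool; the module path and the add example are assumptions to verify against your Inspect version):
from inspect_ai.tool import tool

@tool
def add():
    async def execute(x: int, y: int) -> int:
        """Add two numbers.

        Args:
            x: first number to add
            y: second number to add
        """
        return x + y
    return execute

# then in a task plan: use_tools(add()), generate()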
Slide 23:
- Many permutations of agents and tool use are possible
- Bespoke agent logic inside a solver (swapping various tools in and out)
- Bridges to various agent libraries implemented as solvers (e.g. langchain_agent(), langroid_agent(), etc.)
(WIP)
Slide 24:
Cybersecurity eval using hand-rolled agent loop (custom agents and agent frameworks can both be embedded in solvers)
(WIP)
Slide 25:
Convert any LangChain agent into a Solver
@solver
def wikipedia_search() -> Solver:
    # Tavily (web search) plus Wikipedia tools for the agent
    tavily_api = TavilySearchAPIWrapper()
    tools = ([TavilySearchResults(api_wrapper=tavily_api)] +
             load_tools(["wikipedia"]))

    async def agent(llm: BaseChatModel, input: dict[str, Any]):
        tools_agent = create_openai_tools_agent(llm, tools, prompt)
        agent_executor = AgentExecutor.from_agent_and_tools(
            agent=tools_agent,
            tools=tools
        )
        result = await agent_executor.ainvoke(input)
        return result["output"]

    # prompt and langchain_solver() are defined elsewhere in the example code
    return langchain_solver(agent)
(WIP)
Slide 30:
- Conventional pattern matching / templated answer based scoring built in.
- Model-graded scorer built in (often heavily customized).
- Scorers are also pluggable (i.e. provided from other packages). We expect lots of innovation in model-graded scoring!
- Offline / human scoring workflow is supported.
- Plan to build tools to help rigorously evaluate model-graded scorers against human baselines.
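As a hedged sketch of what a pluggable scorer looks like (the exact_match name is illustrative; the imports follow the @scorer pattern from the Inspect docs and may vary by version):
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def exact_match():
    async def score(state: TaskState, target: Target) -> Score:
        # compare the model's completion against the sample target
        answer = state.output.completion.strip()
        return Score(
            value=CORRECT if answer == target.text else INCORRECT,
            answer=answer,
        )
    return score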
Slide 32:
expression_equivalence() is implemented using regex.
(WIP)
Slide 33:
- Capture all context required to debug, analyse, and reproduce evaluations
- Python API for computing on log file contents
- Log viewer for interactive exploration of eval results
I want to talk a little bit about what might seem like a mundane concern, but logging ends up being massively important for doing good evals.
Obviously we built a log viewer on top of the log, but the log also has an API so that you can interrogate it.
Plot the results from the logs.
Hamel: This abstraction is very cool.
(WIP)
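A small sketch of computing on log contents with the Python API (the file name is made up, and field names vary slightly across Inspect versions):
from inspect_ai.log import read_eval_log

log = read_eval_log("security-log.json")
print(log.status)              # did the run complete?
print(log.results)             # scorer metrics for the run
print(log.samples[0].output)   # per-sample model output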
Slide 38:
Interface with any other model by creating a custom model provider…
{insert table}
It's very easy to make a model provider and publish it in a package or what have you. You should be able to get to the models you want to get to without trouble.
(WIP)
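A rough sketch of the custom provider pattern (treat the class, decorator, and generate() signature as assumptions to check against the Inspect docs for your version):
from inspect_ai.model import ChatMessage, GenerateConfig, ModelAPI, ModelOutput, modelapi

@modelapi(name="custom")
class CustomModelAPI(ModelAPI):
    async def generate(
        self,
        input: list[ChatMessage],
        tools,
        tool_choice,
        config: GenerateConfig,
    ) -> ModelOutput:
        # call your model backend and return its response as a ModelOutput
        ...

# once registered, the provider is addressable as e.g. model="custom/my-model"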
Slide 39:
- Lots of interactive exploration occurs during eval development, so critical to have good support for iterating in a Notebook / REPL
- Eventually though, evals need to end up in a form that enables reproducibly running them in an eval suite
- Need to support a continuum of workflows that transition well into each other
- Provide good tooling in Jupyter and VS Code for entire spectrum
(WIP)
Slide 41:
Ad-hoc exploration of an eval in a Notebook/REPL
(WIP)
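For example (illustrative - this reuses the hypothetical hello_eval task sketched earlier; limit is Inspect's option for running only a subset of samples while iterating):
from inspect_ai import eval

# run a small slice of the eval against one model during exploration
logs = eval(
    hello_eval(),
    model="openai/gpt-4",
    limit=10,
)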
Slide 45:
We want to allow for arbitrary source code organisation but still be able to discover and enumerate tasks for a suite
{screenshot}
Run them all
eval(list_tasks("security"), model="mistral/mistral-large-latest")
(WIP)
Slide 46:
The production version would look more like this:
(WIP)
Slide 47:
If you run an eval from a Git repository, you should be able to reproduce the eval with only its log file as context.
# read the log and extract the origin and commit
log = read_eval_log("security-log.json")
origin = log.spec.revision.origin
commit = log.spec.revision.commit

# clone the repo, checkout the commit, install deps, and run
run(["git", "clone", origin, "eval-dir"])
with chdir("eval-dir"):
    run(["git", "checkout", commit])
    run(["pip", "install", "-r", "requirements.txt"])
    eval(log)
One last piece on workflow: one principle is that if you run an eval from a Git repository and you only have the log file, you should be able to completely reproduce the eval.
Obviously it won't necessarily give you the same results, since the models are non-deterministic. But you can reproduce all the input parameters and everything.
So, for example, ...
(WIP)
The idea is that the log file, assuming the eval was run from a Git repo, is a unit of reproducibility.
(Discord reactions: wow! really cool.)
Slide 48:
- Docs: https://ukgovernmentbeis.github.io/inspect_ai/
- GitHub: https://github.com/UKGovernmentBEIS/inspect_ai
- Slides/Code: https://github.com/jjallaire/inspect-llm-workshop
Questions?
(Related work from the UK AI Safety Institute: https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update)
Questions?
(watch the video at 00:37:52)
(WIP)
List of links from the session:
- Inspect homepage: https://ukgovernmentbeis.github.io/inspect_ai/
- Inspect repo: https://github.com/UKGovernmentBEIS/inspect_ai
- AI Safety Institute: https://www.aisi.gov.uk/
- Slides: https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/main/slides/intro-to-inspect.pdf
- Source code: https://github.com/jjallaire/inspect-llm-workshop
Source: Discord (thanks to Shamik)
Some highlights:
- Shamik: Not sure if JJ is on the channel, but I'd like to highlight how simple and elegant the design of this library is. It's the perfect combination of abstraction and detail.
- plattenschieber: There are only 116 installs yet. Let's change this.
- BenG: the meta workflow transition is great, explore in sandbox/notebook, then accumulate a suite etc.
- Shamik: The git reproducibility is mindblowing!
- BenG: Think I spy some Quarto being used for the site. Love it.
- MatijaG: Nice collection of LLM evaluation tools for testing models and improving prompts. https://ianww.com/llm-tools
- Karthik | Langtrace AI: logs as a unit of reproducibility is genius. Sharing some more thoughts here:
- Solvers are amazing - basically you can chain together a bunch of solvers which gets translated to a prompt. This is really cool and handy.
- Integration with VSCode is great for devex
- Passing datasets is very flexible and mapping datasets to the correct format is also super nice without having to transform the dataset completely.
- I truly admire the code design of this library. Very flexible and composable.
- What it lacks though is the ability to compare runs on the reporting side which is mostly a UI thing. And also the ability to override and annotate the results for model graded scorers from the UI. Again a UI thing.
- Shamik: JJ mentioned that they are working on annotations, but it's not an immediate thing on their roadmap AFAIR.
- JJA: Yeah, comparing runs in the UI is a big weak spot! Right now people just break out matplotlib for this but there is so much we could do out of the box that would yield a lot of insight/visibility for free. We're on it! Some of the highest impact UI stuff would require more of a shared server/datastore, beyond the scope of the package but definitely would be a worthwhile undertaking.