Conference Talk: Inspect, An OSS framework for LLM evals [WIP]

Mastering LLMs: A Conference For Developers & Data Scientists

An online conference for everything LLMs that grew out of an LLM fine-tuning course.

Build skills to be effective with LLMs

Course website: https://maven.com/parlance-labs/fine-tuning

Slides | Code | Video recording

Conference Talk: Inspect, An OSS framework for LLM evals (WIP)

Guest speaker: JJ Allaire

Date: May 30, 2024

This talk will cover using and extending Inspect, a new OSS Python framework for LLM evals. Inspect's developer (J.J. Allaire) will walk through the core concepts and design of Inspect and demonstrate its use for a variety of evaluation tasks. Inspect makes it very straightforward to implement simple “benchmark” style evaluations, but also has the depth and flexibility to support highly complex evals. Inspect supports both exploratory (notebook/REPL) workflows as well as creating large-scale eval suites that run in batch. We'll cover all of this ground as well as creating custom extensions to Inspect.

Bio: long-time developer tools creator.

JJ has a Wikipedia page dedicated to him; he created ColdFusion, RStudio, and more.

Hamel's Tweet: https://x.com/HamelHusain/status/1798964392819892303

Up until recently, there wasn't an OSS vendorless LLM eval framework I liked. Then I found Inspect, and I love it because:

  • VSCode Plugins w/Viz & UI
  • Composability & Developer experience
  • Made by JJ Allaire (cracked eng w/a track record)

Intro to Inspect

Slide 1:

Open Source Framework for LLM Evals

Inspect

Slide 2:

  • A Python package (inspect_ai) used to create LLM evaluations
    • pip install inspect_ai
  • Developed and maintained by the UK AI Safety Institute
  • Similar in function to the eval frameworks embedded in benchmark suites (e.g. OpenAI Evals, EleutherAI LM Eval Harness, etc.) but designed from the ground up for the development of more complex evals
  • Focus on bridging research and production: provide a great development experience for researchers that results in evals that can be reproducibly run at scale

Slide 3:

{Inspect VSCode extension screenshot}

Core Design

Slide 4:

  • Dataset - List of samples with input and target
  • Solvers - Functions that transform dataset inputs, call the model for generation, and act further on model output. Can be composed together as layers, or can be a single layer with higher internal complexity
  • Scorer - Evaluates final output of solvers. May use text comparisons, model grading, or other custom schemes

(My thoughts: very well designed software architecture; the abstractions are simple, and the library is designed so that simple pieces compose easily into more complex things.)

Hello, World

Slide 5:

(WIP)
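(Slide content not captured. As a stand-in, here is a minimal sketch of a hello-world style Inspect task assembled from the Dataset / Solver / Scorer pieces above; the task name and sample text are illustrative, not from the talk.)

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def hello_world():
  return Task(
    # dataset: one sample with an input prompt and a target answer
    dataset=[Sample(input="Just reply with 'Hello, World'",
                    target="Hello, World")],
    # plan: the list of solvers; here we only call the model
    plan=[generate()],
    # scorer: check that the target string appears in the output
    scorer=includes(),
  )

Assuming the file is saved as hello_world.py, it can be run with inspect eval hello_world.py --model openai/gpt-4 from the shell, or with eval(hello_world(), model="openai/gpt-4") from Python.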

Solvers

Slide 16:

A Solver is a Python function that takes a TaskState and transforms it in some useful fashion

TaskState (initialised from sample)

class TaskState:
  messages: list[ChatMessage]
  output: ModelOutput
  # ...

Solver Function

async def solve(state: TaskState, generate: Generate) -> TaskState:
  # do something useful with state (prompt engineering,
  # generating model output, critique and regenerate, etc.)
  return state

(WIP)
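(Not from the slides: a sketch of how a custom solver is typically packaged behind the @solver decorator so it can be dropped into a plan. This one just prepends a system message, mirroring Inspect's built-in system_message() solver; the function name is illustrative.)

from inspect_ai.model import ChatMessageSystem
from inspect_ai.solver import Generate, Solver, TaskState, solver

@solver
def with_system_message(message: str) -> Solver:
  async def solve(state: TaskState, generate: Generate) -> TaskState:
    # prepend a system message, leaving the rest of the state untouched
    state.messages.insert(0, ChatMessageSystem(content=message))
    return state
  return solve

It then composes like any built-in, e.g. plan=[with_system_message("You are a security expert."), generate()].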

Baseline Solvers

Slide 17:

prompt_template()

async def solve(state: TaskState, generate: Generate) -> TaskState:
  # prompt_template and params are captured from the enclosing @solver factory
  prompt = state.user_prompt
  prompt.text = prompt_template.format(prompt=prompt.text, **params)
  return state

Modifies the existing prompt by passing it through a template

generate()

async def solve(state: TaskState, generate: Generate) -> TaskState:
  return await generate(state)

Calls the model, appends the assistant message, and updates the model output

(WIP)

Composition

Slide 20:

Eval development frequently involves creating custom solvers and scorers. If made available in a Python package, these can be re-used across many evals.

Some jailbreaking solvers from an internal sheppard package:

  • encode() - message obfuscation jailbreak
  • pap_jailbreak() - Persuasive Adversarial Prompt (PAP) jailbreak
  • payload_splitting() - PARROT jailbreak
  • cr_jailbreak() - Content reinforcement
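(A sketch, not from the slides, of what this composition looks like in practice, assuming sheppard exports its jailbreaks as ordinary @solver functions; the task and dataset names are illustrative.)

from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

# solvers shared via a package (here the internal sheppard package) import like anything else
from sheppard import pap_jailbreak

@task
def security_guide():
  return Task(
    dataset=example_dataset("security_guide"),
    plan=[
      system_message("You are a careful security assistant."),
      pap_jailbreak(),  # jailbreak solver from the shared package
      generate(),
    ],
    scorer=model_graded_fact(),
  )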

(WIP)

Tool Use

Slide 22:

TaskState also includes tools:

class TaskState:
  messages: list[ChatMessage]
  tools: list[ToolDef]
  tool_choice: ToolChoice
  output: ModelOutput
  # ...

use_tools() makes tools available to generate():

return Task(
  dataset=example_dataset("biology_qa"),
  plan=[
    use_tools(web_search()),
    generate()
  ],
  scorer=model_graded_qa(template=GRADER_TEMPLATE),
)

(WIP)
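(Not from the slides: a sketch of defining a custom tool with the @tool decorator, mirroring the addition example in Inspect's docs. The docstring provides the description and argument schema the model sees; note the tool import path has moved between inspect_ai.solver and inspect_ai.tool across versions.)

from inspect_ai.tool import tool

@tool
def add():
  async def execute(x: int, y: int):
    """Add two numbers.

    Args:
      x: First number to add.
      y: Second number to add.

    Returns:
      The sum of the two numbers.
    """
    return x + y

  return execute

It is then wired in exactly like web_search() above: plan=[use_tools(add()), generate()].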

Agents and Tools

Slide 23:

  • Many permutations of agents and tool use are possible
  • Bespoke agent logic inside a solver (swapping various tools in and out)
  • Bridges to various agent libraries, provided as solvers (e.g. langchain_agent(), langroid_agent(), etc.)

(WIP)

Agent: Capture the Flag

Slide 24:

Cybersecurity eval using hand-rolled agent loop (custom agents and agent frameworks can both be embedded in solvers)

(WIP)

Agent: LangChain

Slide 25:

Convert any LangChain agent into a Solver

@solver
def wikipedia_search() -> Solver:
  # standard prompt for an OpenAI tools agent (elided on the slide)
  prompt = hub.pull("hwchase17/openai-tools-agent")

  # Tavily search and Wikipedia tools
  tavily_api = TavilySearchAPIWrapper()
  tools = ([TavilySearchResults(api_wrapper=tavily_api)] +
           load_tools(["wikipedia"]))

  # agent function bridged into Inspect
  async def agent(llm: BaseChatModel, input: dict[str, Any]):
    tools_agent = create_openai_tools_agent(llm, tools, prompt)
    agent_executor = AgentExecutor.from_agent_and_tools(
      agent=tools_agent,
      tools=tools
    )
    result = await agent_executor.ainvoke(input)
    return result["output"]

  return langchain_solver(agent)

(WIP)

Scoring

Slide 30:

  • Conventional pattern matching / templated answer based scoring built in.
  • Model graded scorer built-in (often heavily customized)
  • Scorers also pluggable (i.e. provided from other packages). We expect lots of innovation in model graded scoring!
  • Offline / human scoring workflow is supported.
  • Plan to build tools to help rigorously evaluate model graded scorers against human baselines.

Scorer: expression_equivalence()

Slide 32:

expression_equivalence() is implemented using regex.

(WIP)
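(A sketch, not from the slides, of the general shape of a custom @scorer. The regex extracts a final answer line and the equivalence check here is just an exact match; the real expression_equivalence() scorer is more sophisticated. Names and the answer pattern are illustrative.)

import re

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def expression_equivalence():
  async def score(state: TaskState, target: Target) -> Score:
    # pull the final answer out of the completion with a regex
    match = re.search(r"ANSWER\s*:\s*(.+)$", state.output.completion, re.MULTILINE)
    answer = match.group(1).strip() if match else state.output.completion
    # simplified equivalence check: exact match against the target expression
    value = CORRECT if answer == target.text else INCORRECT
    return Score(value=value, answer=answer)

  return score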

Logging

Slide 33:

  • Capture all context required to debug, analyse, and reproduce evaluations
  • Python API for computing on log file contents
  • Log viewer for interactive exploration of eval results

I want to talk a little bit about what might seem like a mundane concern, but logging ends up being massively important for doing good evals.

Obviously we built a log viewer on top of the log, but the log also has an API so that you can interrogate it.

EvalLog

Plot the results from the logs.

Hamel: This abstraction is very cool.

(WIP)
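(A sketch of the log API mentioned above, assuming logs have been written to ./logs; the fields shown are a small subset of EvalLog.)

from inspect_ai.log import list_eval_logs, read_eval_log

# enumerate logs in a directory and read one back as an EvalLog
logs = list_eval_logs("./logs")
log = read_eval_log(logs[0])

print(log.status)        # did the eval run to completion?
print(log.eval.model)    # which model was evaluated
print(len(log.samples))  # per-sample inputs, messages, outputs, and scores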

Models

Slide 38:

Interface with any other model by creating a custom model provider…

{insert table}

It's very easy to make a model provider and publish it in a package or what have you. You should be able to get to the models you want to get to without trouble.

(WIP)
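(Not from the slides: a sketch of how models are addressed as "provider/model-name" strings; custom providers plug into the same naming scheme. Model names here are illustrative.)

from inspect_ai.model import get_model

# built-in providers include OpenAI, Anthropic, Google, Mistral, Hugging Face, ...
model = get_model("openai/gpt-4")

# the same strings work anywhere a model is expected, e.g.:
#   eval(my_task(), model="anthropic/claude-3-opus-20240229")
#   inspect eval my_task.py --model google/gemini-1.5-pro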

Workflow

Slide 39:

  • Lots of interactive exploration occurs during eval development, so critical to have good support for iterating in a Notebook / REPL
  • Eventually though, evals need to end up in a form that enables reproducibly running them in an eval suite
  • Need to support a continuum of workflows that transition well into each other
  • Provide good tooling in Jupyter and VS Code for entire spectrum

(WIP)

Interactive Exploration

Slide 41:

Ad-hoc exploration of an eval in a Notebook/REPL

(WIP)
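(A sketch of that workflow: eval() can be called straight from a notebook or REPL cell and returns the logs for further inspection. my_task() is a hypothetical task defined in the session; limit restricts the run to a handful of samples while iterating.)

from inspect_ai import eval

# run a small slice of a task directly from a notebook/REPL cell
logs = eval(my_task(), model="openai/gpt-4", limit=10)

# poke at the results interactively
print(logs[0].results)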

Eval Suites

Slide 45:

We want to allow for arbitrary source code organisation but still be able to discover and enumerate tasks for a suite

{screenshot}

Run them all

eval(list_tasks("security"), model="mistral/mistral-large-latest")

(WIP)

Resiliency

Slide 46:

The production version would look more like this:
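(The slide isn't captured here; a hedged sketch of the kind of pattern this refers to: send all of the suite's logs to a dedicated directory, then retry only the runs that errored. The directory name is illustrative, and the exact signatures of list_eval_logs()/eval_retry() may differ by version.)

import os

from inspect_ai import eval, eval_retry, list_tasks
from inspect_ai.log import list_eval_logs, read_eval_log

# write all logs for the suite to a dedicated directory
os.environ["INSPECT_LOG_DIR"] = "./security-suite-logs"

# first pass: run every task in the suite
eval(list_tasks("security"), model="mistral/mistral-large-latest")

# later (possibly in another session): retry only the runs that errored
error_logs = [
  info for info in list_eval_logs("./security-suite-logs")
  if read_eval_log(info, header_only=True).status == "error"
]
eval_retry(error_logs)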

(WIP)

Provenance

Slide 47:

If you run an eval from a Git repository, you should be able to reproduce the eval with only its log file as context.

# read the log and extract the origin and commit
log = read_eval_log("security-log.json")
origin = log.spec.revision.origin
commit = log.spec.revision.commit

# clone the repo, checkout the commit, install deps, and run
run(["git", "clone", origin, "eval-dir"])
with chdir("eval-dir"):
  run(["git", "checkout", commit])
  run(["pip", "install", "-r", "requirements.txt"])
  eval(log)

One last piece on workflow: one principle is that if you run an eval from a Git repository, then with only the log file you should be able to completely reproduce the eval.

It won't necessarily give you the same results, obviously, since the models are non-deterministic, but you can reproduce all the input parameters and everything else.

So, for example, ...

(WIP)

The idea is that the log file, assuming the eval was run from a Git repo, is a unit of reproducibility.

(Discord reactions: wow! really cool.)

Learning More

Slide 48:

Questions?

(Found this work done at AISI UK govt: https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update)

Q&A

Questions?

(watch the video at 00:37:52)

(WIP)

Learning Resources

List of links from the session:

Source: Discord (thanks to Shamik)

Discord Messages

Channel

Some highlights:

  • Shamik: Not sure if JJ is on the channel, but I'd like to highlight how simple and elegant the design of this library is. It's the perfect combination of abstraction and detail.
  • plattenschieber: There are only 116 installs yet. Let's change this.
  • BenG: the meta workflow transition is great, explore in sandbox/notebook, then accumlate a suite etc.
  • Shamik: The git reproducibility is mindblowing!
  • BenG: Think I spy some Quarto being used for the site. Love it.
  • MatijaG: Nice collection of LLM evaluation tools for testing models and improving prompts. https://ianww.com/llm-tools
  • Karthik | Langtrace AI: logs as a unit of reproducibility is genius. Sharing some more thoughts here:
    • Solvers is amazing - basically you can chain together a bunch of solvers which gets translated to a prompt. this is really cool and handy.
    • Integration with VSCode is great for devex
    • Passing datasets is very flexible and mapping datasets to the correct format is also super nice without having to transform the dataset completely.
    • I truly admire the code design of this library. Very flexible and composible.
    • What it lacks though is the ability to compare runs on the reporting side which is mostly a UI thing. And also the ability to override and annotate the results for model graded scorers from the UI. Again a UI thing.
    • Shamik: JJ mentioned that they are working on annotations, but it's not an immediate thing on their roadmap AFAIR.
    • JJA: Yeah, comparing runs in the UI is a big weak spot! Right now people just break out matplotlib for this but there is so much we could do out of the box that would yield a lot of insight/visiblity for free. We're on it! Some of the highest impact UI stuff would require more of a shared server/datastore, beyond the scope of the package but definitely would be a worthwhile undertaking.