Concatenated DSPy documentation (May 12, 2025)

dspy.Adapter

::: dspy.Adapter handler: python options: members: - call - acall - format - format_assistant_message_content - format_conversation_history - format_demos - format_field_description - format_field_structure - format_task_description - format_user_message_content - parse show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.ChatAdapter

::: dspy.ChatAdapter handler: python options: members: - call - acall - format - format_assistant_message_content - format_conversation_history - format_demos - format_field_description - format_field_structure - format_field_with_value - format_finetune_data - format_task_description - format_user_message_content - parse - user_message_output_requirements show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.JSONAdapter

::: dspy.JSONAdapter handler: python options: members: - call - acall - format - format_assistant_message_content - format_conversation_history - format_demos - format_field_description - format_field_structure - format_field_with_value - format_finetune_data - format_task_description - format_user_message_content - parse - user_message_output_requirements show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.TwoStepAdapter

::: dspy.TwoStepAdapter handler: python options: members: - call - acall - format - format_assistant_message_content - format_conversation_history - format_demos - format_field_description - format_field_structure - format_task_description - format_user_message_content - parse show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.evaluate.CompleteAndGrounded

::: dspy.evaluate.CompleteAndGrounded handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Evaluate

::: dspy.Evaluate handler: python options: members: - call show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.evaluate.SemanticF1

::: dspy.evaluate.SemanticF1 handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.evaluate.answer_exact_match

::: dspy.evaluate.answer_exact_match handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.evaluate.answer_passage_match

::: dspy.evaluate.answer_passage_match handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


API Reference

Welcome to the DSPy API reference documentation. This section provides detailed information about DSPy's classes, modules, and functions.


dspy.Embedder

::: dspy.Embedder handler: python options: members: - call - acall show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.LM

::: dspy.LM handler: python options: members: - call - acall - aforward - copy - dump_state - finetune - forward - infer_provider - inspect_history - kill - launch - update_global_history show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.BestOfN

::: dspy.BestOfN handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.ChainOfThought

::: dspy.ChainOfThought handler: python options: members: - call - acall - aforward - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.ChainOfThoughtWithHint

::: dspy.ChainOfThoughtWithHint handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Module

::: dspy.Module handler: python options: members: - call - acall - batch - deepcopy - dump_state - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.MultiChainComparison

::: dspy.MultiChainComparison handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Parallel

::: dspy.Parallel handler: python options: members: - call - forward show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Predict

::: dspy.Predict handler: python options: members: - call - acall - aforward - batch - deepcopy - dump_state - forward - get_config - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset - reset_copy - save - set_lm - update_config show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Program

::: dspy.Program handler: python options: members: - call - acall - batch - deepcopy - dump_state - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.ProgramOfThought

::: dspy.ProgramOfThought handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.ReAct

::: dspy.ReAct handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm - truncate_trajectory show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Refine

::: dspy.Refine handler: python options: members: - call - acall - batch - deepcopy - dump_state - forward - get_lm - load - load_state - map_named_predictors - named_parameters - named_predictors - named_sub_modules - parameters - predictors - reset_copy - save - set_lm show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.BetterTogether

::: dspy.BetterTogether handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.BootstrapFewShot

::: dspy.BootstrapFewShot handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.BootstrapFewShotWithRandomSearch

::: dspy.BootstrapFewShotWithRandomSearch handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.BootstrapFinetune

::: dspy.BootstrapFinetune handler: python options: members: - compile - convert_to_lm_dict - finetune_lms - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.BootstrapRS

::: dspy.BootstrapRS handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.COPRO

::: dspy.COPRO handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Ensemble

::: dspy.Ensemble handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.InferRules

::: dspy.InferRules handler: python options: members: - compile - evaluate_program - format_examples - get_params - get_predictor_demos - induce_natural_language_rules - update_program_instructions show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.KNN

::: dspy.KNN handler: python options: members: - call show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.KNNFewShot

::: dspy.KNNFewShot handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.LabeledFewShot

::: dspy.LabeledFewShot handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.MIPROv2

MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2) is a prompt optimizer capable of optimizing both instructions and few-shot examples jointly. It does this by bootstrapping few-shot example candidates, proposing instructions grounded in different dynamics of the task, and finding an optimized combination of these options using Bayesian optimization. It can be used to optimize few-shot examples and instructions jointly, or just instructions for 0-shot optimization.

::: dspy.MIPROv2 handler: python options: members: - compile - get_params show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true

Example Usage

The program below shows how to optimize a math program with MIPROv2:

import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize the LM
lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_OPENAI_API_KEY')
dspy.configure(lm=lm)

# Load the GSM8K dataset
gsm8k = GSM8K()

# Initialize optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="medium", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print(f"Optimizing program with MIPROv2...")
optimized_program = teleprompter.compile(
    dspy.ChainOfThought("question -> answer"),
    trainset=gsm8k.train,
    requires_permission_to_run=False,
)

# Save optimized program for future use
optimized_program.save("optimized.json")

How MIPROv2 works

At a high level, MIPROv2 works by creating both few-shot examples and new instructions for each predictor in your LM program, and then searching over these with Bayesian optimization to find the best combination of these variables for your program. If you want a visual explanation, check out this twitter thread.

These steps are broken down in more detail below:

  1. Bootstrap Few-Shot Examples: Randomly sample examples from your training set and run them through your LM program. If the program's output is correct for an example, it is kept as a valid few-shot candidate; otherwise, we try another example until we've curated the specified number of few-shot candidates. This step creates num_candidates sets of max_bootstrapped_demos bootstrapped examples and max_labeled_demos basic examples sampled from the training set.

  2. Propose Instruction Candidates: The instruction proposer is given (1) a generated summary of properties of the training dataset, (2) a generated summary of your LM program's code and the specific predictor that an instruction is being generated for, (3) the previously bootstrapped few-shot examples, which show reference inputs/outputs for a given predictor, and (4) a randomly sampled tip for generation (e.g. "be creative", "be concise") to help explore the space of potential instructions. This context is provided to a prompt_model, which writes high-quality instruction candidates.

  3. Find an Optimized Combination of Few-Shot Examples & Instructions: Finally, we use Bayesian optimization to choose which combinations of instructions and demonstrations work best for each predictor in the program. This runs a series of num_trials trials, where a new set of prompts is evaluated over the validation set at each trial. When minibatch=True, each trial evaluates the new set of prompts only on a minibatch of size minibatch_size; the set of prompts with the best average score is then evaluated on the full validation set every minibatch_full_eval_steps trials. At the end of the optimization process, the program with the set of prompts that performed best on the full validation set is returned. (The sketch after the next paragraph shows how these knobs map to arguments.)

For those interested in more details, more information on MIPROv2, along with a study comparing it with other DSPy optimizers, can be found in this paper.
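
If you prefer to set these knobs by hand rather than using auto, the quantities above map directly onto constructor and compile arguments. Below is a minimal sketch continuing the GSM8K example above; treat the exact argument names and defaults as assumptions to verify against your DSPy version.

import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
from dspy.teleprompt import MIPROv2

# Assumes an LM has already been configured via dspy.configure(lm=...), as in the example above.
gsm8k = GSM8K()

teleprompter = MIPROv2(
    metric=gsm8k_metric,
    num_candidates=7,      # candidate instruction / few-shot sets to propose (steps 1-2)
    init_temperature=0.5,
)

optimized_program = teleprompter.compile(
    dspy.ChainOfThought("question -> answer"),
    trainset=gsm8k.train,
    max_bootstrapped_demos=3,      # bootstrapped demos per candidate set (step 1)
    max_labeled_demos=4,           # labeled demos sampled directly from the trainset (step 1)
    num_trials=30,                 # Bayesian optimization trials (step 3)
    minibatch=True,
    minibatch_size=25,             # examples evaluated per trial
    minibatch_full_eval_steps=10,  # full validation-set evaluation every N trials
    requires_permission_to_run=False,
)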


dspy.Example

::: dspy.Example handler: python options: members: - copy - get - inputs - items - keys - labels - toDict - values - with_inputs - without show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.History

::: dspy.History handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Image

::: dspy.Image handler: python options: members: - from_PIL - from_file - from_url - serialize_model - validate_input show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Prediction

::: dspy.Prediction handler: python options: members: - copy - from_completions - get - get_lm_usage - inputs - items - keys - labels - set_lm_usage - toDict - values - with_inputs - without show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Tool

::: dspy.Tool handler: python options: members: - call - acall show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.InputField

::: dspy.InputField handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.OutputField

::: dspy.OutputField handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.Signature

::: dspy.Signature handler: python options: members: - append - delete - dump_state - equals - insert - load_state - prepend - with_instructions - with_updated_fields show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.ColBERTv2

::: dspy.ColBERTv2 handler: python options: members: - call show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.retrievers.Embeddings

::: dspy.retrievers.Embeddings handler: python options: members: - call - forward show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.PythonInterpreter

::: dspy.PythonInterpreter handler: python options: members: - call - execute - shutdown show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.streaming.StatusMessage

::: dspy.streaming.StatusMessage handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.streaming.StatusMessageProvider

::: dspy.streaming.StatusMessageProvider handler: python options: members: - lm_end_status_message - lm_start_status_message - module_end_status_message - module_start_status_message - tool_end_status_message - tool_start_status_message show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.streaming.StreamListener

::: dspy.streaming.StreamListener handler: python options: members: - flush - receive show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.asyncify

::: dspy.asyncify handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.disable_litellm_logging

::: dspy.disable_litellm_logging handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.disable_logging

::: dspy.disable_logging handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.enable_litellm_logging

::: dspy.enable_litellm_logging handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.enable_logging

::: dspy.enable_logging handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.inspect_history

::: dspy.inspect_history handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.load

::: dspy.load handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true


dspy.streamify

::: dspy.streamify handler: python options: show_source: true show_root_heading: true heading_level: 2 docstring_style: google show_root_full_path: true show_object_full_path: false separate_signature: false inherited_members: true



sidebar_position: 999

!!! warning "This page is outdated and may not be fully accurate in DSPy 2.5 and 2.6"

DSPy Cheatsheet

This page contains snippets for frequent usage patterns.

DSPy DataLoaders

Importing and initializing a DataLoader object:

import dspy
from dspy.datasets import DataLoader

dl = DataLoader()

Loading from HuggingFace Datasets

code_alpaca = dl.from_huggingface("HuggingFaceH4/CodeAlpaca_20K")

You can access each split's dataset by indexing with the split's name:

train_dataset = code_alpaca['train']
test_dataset = code_alpaca['test']

Loading specific splits from HuggingFace

You can also manually specify the splits you want to include as a parameter, and you'll get back a dictionary whose keys are the splits you specified:

code_alpaca = dl.from_huggingface(
    "HuggingFaceH4/CodeAlpaca_20K",
    split = ["train", "test"],
)

print(f"Splits in dataset: {code_alpaca.keys()}")

If you specify a single split, the DataLoader returns a list of dspy.Example instead of a dictionary:

code_alpaca = dl.from_huggingface(
    "HuggingFaceH4/CodeAlpaca_20K",
    split = "train",
)

print(f"Number of examples in split: {len(code_alpaca)}")

You can also slice a split, just as you would with a HuggingFace Dataset:

code_alpaca_80 = dl.from_huggingface(
    "HuggingFaceH4/CodeAlpaca_20K",
    split = "train[:80%]",
)

print(f"Number of examples in split: {len(code_alpaca_80)}")

code_alpaca_20_80 = dl.from_huggingface(
    "HuggingFaceH4/CodeAlpaca_20K",
    split = "train[20%:80%]",
)

print(f"Number of examples in split: {len(code_alpaca_20_80)}")

Loading specific subset from HuggingFace

If a dataset has a subset (configuration), you can pass it as an argument, just as you would with load_dataset in HuggingFace:

gsm8k = dl.from_huggingface(
    "gsm8k",
    "main",
    input_keys = ("question",),
)

print(f"Keys present in the returned dict: {list(gsm8k.keys())}")

print(f"Number of examples in train set: {len(gsm8k['train'])}")
print(f"Number of examples in test set: {len(gsm8k['test'])}")

Loading from CSV

dolly_100_dataset = dl.from_csv("dolly_subset_100_rows.csv",)

You can select only specific columns from the CSV by specifying them in the fields argument:

dolly_100_dataset = dl.from_csv(
    "dolly_subset_100_rows.csv",
    fields=("instruction", "context", "response"),
    input_keys=("instruction", "context")
)

Splitting a List of dspy.Example

splits = dl.train_test_split(dataset, train_size=0.8) # `dataset` is a List of dspy.Example
train_dataset = splits['train']
test_dataset = splits['test']

Sampling from List of dspy.Example

sampled_example = dl.sample(dataset, n=5) # `dataset` is a List of dspy.Example

DSPy Programs

dspy.Signature

class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

dspy.ChainOfThought

generate_answer = dspy.ChainOfThought(BasicQA)

# Call the predictor on a particular input.
question='What is the color of the sky?'
pred = generate_answer(question=question)

dspy.ChainOfThoughtWithHint

generate_answer = dspy.ChainOfThoughtWithHint(BasicQA)

# Call the predictor on a particular input alongside a hint.
question='What is the color of the sky?'
hint = "It's what you often see during a sunny day."
pred = generate_answer(question=question, hint=hint)

dspy.ProgramOfThought

pot = dspy.ProgramOfThought(BasicQA)

question = 'Sarah has 5 apples. She buys 7 more apples from the store. How many apples does Sarah have now?'
result = pot(question=question)

print(f"Question: {question}")
print(f"Final Predicted Answer (after ProgramOfThought process): {result.answer}")

dspy.ReAct

react_module = dspy.ReAct(BasicQA)

question = 'Sarah has 5 apples. She buys 7 more apples from the store. How many apples does Sarah have now?'
result = react_module(question=question)

print(f"Question: {question}")
print(f"Final Predicted Answer (after ReAct process): {result.answer}")

dspy.Retrieve

colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(rm=colbertv2_wiki17_abstracts)

#Define Retrieve Module
retriever = dspy.Retrieve(k=3)

query='When was the first FIFA World Cup held?'

# Call the retriever on a particular query.
topK_passages = retriever(query).passages

for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')

DSPy Metrics

Function as Metric

To create a custom metric you can create a function that returns either a number or a boolean value:

def parse_integer_answer(answer, only_first_line=True):
    try:
        if only_first_line:
            answer = answer.strip().split('\n')[0]

        # find the last token that has a number in it
        answer = [token for token in answer.split() if any(c.isdigit() for c in token)][-1]
        answer = answer.split('.')[0]
        answer = ''.join([c for c in answer if c.isdigit()])
        answer = int(answer)

    except (ValueError, IndexError):
        # print(answer)
        answer = 0

    return answer

# Metric Function
def gsm8k_metric(gold, pred, trace=None) -> bool:
    return int(parse_integer_answer(str(gold.answer))) == int(parse_integer_answer(str(pred.answer)))

LLM as Judge

class FactJudge(dspy.Signature):
    """Judge if the answer is factually correct based on the context."""

    context = dspy.InputField(desc="Context for the prediction")
    question = dspy.InputField(desc="Question to be answered")
    answer = dspy.InputField(desc="Answer for the question")
    factually_correct = dspy.OutputField(desc="Is the answer factually correct based on the context?", prefix="Factual[Yes/No]:")

judge = dspy.ChainOfThought(FactJudge)

def factuality_metric(example, pred, trace=None):
    factual = judge(context=example.context, question=example.question, answer=pred.answer)
    return int(factual.factually_correct == "Yes")

DSPy Evaluation

from dspy.evaluate import Evaluate

evaluate_program = Evaluate(devset=devset, metric=your_defined_metric, num_threads=NUM_THREADS, display_progress=True, display_table=num_rows_to_display)

evaluate_program(your_dspy_program)

DSPy Optimizers

LabeledFewShot

from dspy.teleprompt import LabeledFewShot

labeled_fewshot_optimizer = LabeledFewShot(k=8)
your_dspy_program_compiled = labeled_fewshot_optimizer.compile(student = your_dspy_program, trainset=trainset)

BootstrapFewShot

from dspy.teleprompt import BootstrapFewShot

fewshot_optimizer = BootstrapFewShot(metric=your_defined_metric, max_bootstrapped_demos=4, max_labeled_demos=16, max_rounds=1, max_errors=5)

your_dspy_program_compiled = fewshot_optimizer.compile(student = your_dspy_program, trainset=trainset)

Using another LM for compilation, specifying in teacher_settings

from dspy.teleprompt import BootstrapFewShot

fewshot_optimizer = BootstrapFewShot(metric=your_defined_metric, max_bootstrapped_demos=4, max_labeled_demos=16, max_rounds=1, max_errors=5, teacher_settings=dict(lm=gpt4))

your_dspy_program_compiled = fewshot_optimizer.compile(student = your_dspy_program, trainset=trainset)

Compiling a compiled program - bootstrapping a bootstrapped program

your_dspy_program_compiledx2 = teleprompter.compile(
    your_dspy_program,
    teacher=your_dspy_program_compiled,
    trainset=trainset,
)

Saving/loading a compiled program

save_path = './v1.json'
your_dspy_program_compiledx2.save(save_path)
loaded_program = YourProgramClass()
loaded_program.load(path=save_path)

BootstrapFewShotWithRandomSearch

Detailed documentation on BootstrapFewShotWithRandomSearch can be found here.

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

fewshot_optimizer = BootstrapFewShotWithRandomSearch(metric=your_defined_metric, max_bootstrapped_demos=2, num_candidate_programs=8, num_threads=NUM_THREADS)

your_dspy_program_compiled = fewshot_optimizer.compile(student = your_dspy_program, trainset=trainset, valset=devset)

Other custom configurations are similar to customizing the BootstrapFewShot optimizer.

Ensemble

from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from dspy.teleprompt.ensemble import Ensemble

fewshot_optimizer = BootstrapFewShotWithRandomSearch(metric=your_defined_metric, max_bootstrapped_demos=2, num_candidate_programs=8, num_threads=NUM_THREADS)
your_dspy_program_compiled = fewshot_optimizer.compile(student = your_dspy_program, trainset=trainset, valset=devset)

ensemble_optimizer = Ensemble(reduce_fn=dspy.majority)
programs = [x[-1] for x in your_dspy_program_compiled.candidate_programs]
your_dspy_program_compiled_ensemble = ensemble_optimizer.compile(programs[:3])

BootstrapFinetune

from dspy.teleprompt import BootstrapFewShotWithRandomSearch, BootstrapFinetune

#Compile program on current dspy.settings.lm
fewshot_optimizer = BootstrapFewShotWithRandomSearch(metric=your_defined_metric, max_bootstrapped_demos=2, num_threads=NUM_THREADS)
your_dspy_program_compiled = fewshot_optimizer.compile(your_dspy_program, trainset=trainset[:some_num], valset=trainset[some_num:])

#Configure model to finetune
config = dict(target=model_to_finetune, epochs=2, bf16=True, bsize=6, accumsteps=2, lr=5e-5)

#Compile program on BootstrapFinetune
finetune_optimizer = BootstrapFinetune(metric=your_defined_metric)
finetune_program = finetune_optimizer.compile(your_dspy_program, trainset=some_new_dataset_for_finetuning_model, **config)

finetune_program = your_dspy_program

#Load program and activate model's parameters in program before evaluation
ckpt_path = "saved_checkpoint_path_from_finetuning"
LM = dspy.HFModel(checkpoint=ckpt_path, model=model_to_finetune)

for p in finetune_program.predictors():
    p.lm = LM
    p.activated = False

COPRO

Detailed documentation on COPRO can be found here.

from dspy.teleprompt import COPRO

eval_kwargs = dict(num_threads=16, display_progress=True, display_table=0)

copro_teleprompter = COPRO(prompt_model=model_to_generate_prompts, metric=your_defined_metric, breadth=num_new_prompts_generated, depth=times_to_generate_prompts, init_temperature=prompt_generation_temperature, verbose=False)

compiled_program_optimized_signature = copro_teleprompter.compile(your_dspy_program, trainset=trainset, eval_kwargs=eval_kwargs)

MIPRO

from dspy.teleprompt import MIPRO

teleprompter = MIPRO(prompt_model=model_to_generate_prompts, task_model=model_that_solves_task, metric=your_defined_metric, num_candidates=num_new_prompts_generated, init_temperature=prompt_generation_temperature)

kwargs = dict(num_threads=NUM_THREADS, display_progress=True, display_table=0)

compiled_program_optimized_bayesian_signature = teleprompter.compile(your_dspy_program, trainset=trainset, num_trials=100, max_bootstrapped_demos=3, max_labeled_demos=5, eval_kwargs=kwargs)

MIPROv2

Note: detailed documentation can be found here. MIPROv2 is the latest extension of MIPRO which includes updates such as (1) improvements to instruction proposal and (2) more efficient search with minibatching.

Optimizing with MIPROv2

This shows how to perform an easy out-of-the box run with auto=light, which configures many hyperparameters for you and performs a light optimization run. You can alternatively set auto=medium or auto=heavy to perform longer optimization runs. The more detailed MIPROv2 documentation here also provides more information about how to set hyperparameters by hand.

# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="light", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print(f"Optimizing program with MIPRO...")
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

# Save optimized program for future use
optimized_program.save("mipro_optimized")

# Evaluate optimized program
print("Evaluating optimized program...")
evaluate(optimized_program, devset=devset[:])

Optimizing instructions only with MIPROv2 (0-Shot)

# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="light", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print(f"Optimizing program with MIPRO...")
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=0,
    max_labeled_demos=0,
    requires_permission_to_run=False,
)

# Save optimized program for future use
optimized_program.save("mipro_optimized")

# Evaluate optimized program
print("Evaluating optimized program...")
evaluate(optimized_program, devset=devset[:])

Signature Optimizer with Types

from dspy.teleprompt.signature_opt_typed import optimize_signature
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match
from dspy.functional import TypedChainOfThought

compiled_program = optimize_signature(
    student=TypedChainOfThought("question -> answer"),
    evaluator=Evaluate(devset=devset, metric=answer_exact_match, num_threads=10, display_progress=True),
    n_iterations=50,
).program

KNNFewShot

from sentence_transformers import SentenceTransformer
from dspy import Embedder
from dspy.teleprompt import KNNFewShot
from dspy import ChainOfThought

knn_optimizer = KNNFewShot(k=3, trainset=trainset, vectorizer=Embedder(SentenceTransformer("all-MiniLM-L6-v2").encode))

qa_compiled = knn_optimizer.compile(student=ChainOfThought("question -> answer"))

BootstrapFewShotWithOptuna

from dspy.teleprompt import BootstrapFewShotWithOptuna

fewshot_optuna_optimizer = BootstrapFewShotWithOptuna(metric=your_defined_metric, max_bootstrapped_demos=2, num_candidate_programs=8, num_threads=NUM_THREADS)

your_dspy_program_compiled = fewshot_optuna_optimizer.compile(student=your_dspy_program, trainset=trainset, valset=devset)

Other custom configurations are similar to customizing the dspy.BootstrapFewShot optimizer.

DSPy Refine and BestOfN

dspy.Suggest and dspy.Assert are replaced by dspy.Refine and dspy.BestOfN in DSPy 2.6.

BestOfN

Runs a module up to N times with different temperatures and returns the best prediction, as defined by the reward_fn, or the first prediction that passes the threshold.

import dspy

qa = dspy.ChainOfThought("question -> answer")
def one_word_answer(args, pred):
    return 1.0 if len(pred.answer.split()) == 1 else 0.0
best_of_3 = dspy.BestOfN(module=qa, N=3, reward_fn=one_word_answer, threshold=1.0)
best_of_3(question="What is the capital of Belgium?").answer
# Brussels

Refine

Refines a module by running it up to N times with different temperatures and returns the best prediction, as defined by the reward_fn, or the first prediction that passes the threshold. After each attempt (except the final one), Refine automatically generates detailed feedback about the module's performance and uses this feedback as hints for subsequent runs, creating an iterative refinement process.

import dspy

qa = dspy.ChainOfThought("question -> answer")
def one_word_answer(args, pred):
    return 1.0 if len(pred.answer.split()) == 1 else 0.0
refine = dspy.Refine(module=qa, N=3, reward_fn=one_word_answer, threshold=1.0)
refine(question="What is the capital of Belgium?").answer
# Brussels

Error Handling

By default, Refine will try to run the module up to N times until the threshold is met. If the module encounters an error, it will keep going up to N failed attempts. You can change this behavior by setting fail_count to a smaller number than N.

refine = dspy.Refine(module=qa, N=3, reward_fn=one_word_answer, threshold=1.0, fail_count=1)
...
refine(question="What is the capital of Belgium?")
# If we encounter just one failed attempt, the module will raise an error.

If you want to run the module up to N times without any error handling, you can set fail_count to N. This is the default behavior.

refine = dspy.Refine(module=qa, N=3, reward_fn=one_word_answer, threshold=1.0, fail_count=3)
...
refine(question="What is the capital of Belgium?")

Resources

This is the list of tutorials and blog posts on DSPy. If you would like to add your own tutorial, please make a PR.

A Few Blogs & Videos on using DSPy

Blogs

| Name | Link |
| --- | --- |
| Why I bet on DSPy | Blog |
| Not Your Average Prompt Engineering | Blog |
| Why I'm excited about DSPy | Blog |
| Achieving GPT-4 Performance at Lower Cost | Link |
| Prompt engineering is a task best left to AI models | Link |
| What makes DSPy a valuable framework for developing complex language model pipelines? | Link |
| DSPy: A new framework to program your foundation models just by prompting | Link |
| Intro to DSPy: Goodbye Prompting, Hello Programming | Link |
| DSPyGen: Revolutionizing AI | Link |
| Building an AI Assistant with DSPy | Link |
| Building Self-improving Agents in Production with DSPy | Link |

Videos

| Name | Link |
| --- | --- |
| DSPy Explained! (60K views) | Link |
| DSPy Intro from Sephora (25K views) | Link |
| Structured Outputs with DSPy | Link |
| DSPy and ColBERT - Weaviate Podcast | Link |
| SBTB23 DSPy | Link |
| Optimization with DSPy and LangChain | Link |
| Automated Prompt Engineering + Visualization | Link |
| Transforming LM Calls into Pipelines | Link |
| NeurIPS Hacker Cup: DSPy for Code Gen | Link |
| MIPRO and DSPy - Weaviate Podcast | Link |
| Getting Started with RAG in DSPy | Link |
| Adding Depth to DSPy Programs | Link |
| Programming Foundation Models with DSPy | Link |
| DSPy End-to-End: SF Meetup | Link |
| Monitoring & Tracing DSPy with Langtrace | Link |
| Teaching chat models to solve chess puzzles using DSPy + Finetuning | Link |
| Build Self-Improving AI Agents with DSPy (No Code) | Link |

Podcasts

Weaviate has a directory of 10 amazing notebooks and 6 podcasts! Huge shoutout to them for the massive support ❤️. See the Weaviate DSPy directory.

TODO: This list in particular is highly incomplete. There are dozens of other good ones. To allow space, divide into opinionated blogs / podcasts / interviews vs. tutorials & talks.

Credit: Some of these resources were originally compiled in the Awesome DSPy repo.


Contributing

DSPy is an actively growing project and community, and we welcome your contributions and involvement! Please read the contributing guide for how to contribute to DSPy.


Use Cases

We often get questions like "How are people using DSPy in practice?", both in production and for research. This list was created to collect a few pointers and to encourage others in the community to add their own work below.

This list is ever expanding and highly incomplete (WIP)! We'll be adding a bunch more. If you would like to add your product or research to this list, please make a PR.

A Few Company Use Cases

| Name | Use Cases |
| --- | --- |
| JetBlue | Multiple chatbot use cases. Blog |
| Replit | Synthesizes diffs using code LLMs via a DSPy pipeline. Blog |
| Databricks | Research, products, and customer solutions around LM Judges, RAG, classification, and other applications. Blog, Blog II |
| Sephora | Undisclosed agent use cases; perspectives shared in DAIS Session. |
| Zoro UK | E-commerce applications around structured shopping. Portkey Session |
| VMware | RAG and other prompt optimization applications. Interview in The Register. Business Insider. |
| Haize Labs | Automated red-teaming for LLMs. Blog |
| Plastic Labs | R&D pipelines for Honcho. Blog |
| PingCAP | Building a knowledge graph. Article |
| Salomatic | Enriching medical reports using DSPy. Blog |
| Truelaw | How Truelaw builds bespoke LLM pipelines for law firms using DSPy. Podcast |
| STChealth | Using DSPy for entity resolution, including human-readable rationale for decisions. |
| Moody's | Leveraging DSPy to optimize RAG systems, LLM-as-a-Judge, and agentic systems for financial workflows. |
| Normal Computing | Translating specs from chip companies from English to intermediate formal languages. |
| Procure.FYI | Processing messy, publicly available technology spending and pricing data via DSPy. |
| RadiantLogic | AI Data Assistant. DSPy is used for the agent that routes the query, the context extraction module, the text-to-SQL conversion engine, and the table summarization module. |
| Raia | Using DSPy for AI-powered personal healthcare agents. |
| Hyperlint | Uses DSPy to generate technical documentation. DSPy helps to fetch relevant information and synthesize it into tutorials. |
| Starops & Saya | Building research documents from a user's corpus. Generates prompts to create more articles from example articles. |
| Tessel AI | Enhancing human-machine interaction with data use cases. |
| Dicer.ai | Uses DSPy for marketing AI to get the most from their paid ads. |
| Howie | Using DSPy to automate meeting scheduling through email. |
| Isoform.ai | Building custom integrations using DSPy. |
| Trampoline AI | Uses DSPy to power their data-augmentation and LM pipelines. |
| Pretrain | Uses DSPy to automatically optimize AI performance toward user-defined tasks based on uploaded examples. |

WIP. This list mainly includes the small fraction of companies that have public posts or have OKed being publicly included for specific products so far, out of hundreds of other industry use cases.

A Few Papers Using DSPy

| Name | Description |
| --- | --- |
| STORM | Writing Wikipedia-like Articles From Scratch. |
| PATH | Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels |
| WangLab @ MEDIQA | UofT's winning system at MEDIQA; outperforms the next best system by 20 points |
| UMD's Suicide Detection System | Outperforms 20-hour expert human prompt engineering by 40% |
| IReRa | Infer-Retrieve-Rank: Extreme Classification with > 10,000 Labels |
| Unreasonably Effective Eccentric Prompts | General Prompt Optimization |
| Palimpzest | A Declarative System for Optimizing AI Workloads |
| AI Agents that Matter | Agent Efficiency Optimization |
| EDEN | Empathetic Dialogues for English Learning: Uses adaptive empathetic feedback to improve student grit |
| ECG-Chat | Uses DSPy with GraphRAG for medical report generation |
| DSPy Assertions | Various applications of imposing hard and soft constraints on LM outputs |
| DSPy Guardrails | Reduces the attack success rate of CodeAttack from 75% to 5% |
| Co-STORM | Collaborative STORM: Generates Wikipedia-like articles through collaborative discourse among users and multiple LM agents |

WIP, this list is highly outdated and incomplete.

A Few Repositories (or other OSS examples) using DSPy

| Name | Description/Link |
| --- | --- |
| Stanford CS 224U Homework | Github |
| STORM Report Generation (10,000 GitHub stars) | Github |
| DSPy Redteaming | Github |
| DSPy Theory of Mind | Github |
| Indic cross-lingual Natural Language Inference | Github |
| Optimizing LM for Text2SQL using DSPy | Github |
| DSPy PII Masking Demo by Eric Ness | Colab |
| DSPy on BIG-Bench Hard Example | Github |
| Building a chess playing agent using DSPy | Github |
| Ittia Research Fact Checking | Github |
| Strategic Debate via Tree-of-Thought | Github |
| Sanskrit to English Translation App | Github |
| DSPy for extracting features from PDFs on arXiv | Github |
| DSPygen: DSPy in Ruby on Rails | Github |
| DSPy Inspector | Github |
| DSPy with FastAPI | Github |
| DSPy for Indian Languages | Github |
| Hurricane: Blog Posts with Generative Feedback Loops! | Github |
| RAG example using DSPy, Gradio, FastAPI, and Ollama | Github |
| Synthetic Data Generation | Github |
| Self Discover | Github |

TODO: This list in particular is highly incomplete. There are a couple dozen other good ones.

A Few Providers, Integrations, and related Blog Releases

| Name | Link |
| --- | --- |
| Databricks | Link |
| Zenbase | Link |
| LangWatch | Link |
| Gradient | Link |
| Snowflake | Link |
| Langchain | Link |
| Weaviate | Link |
| Qdrant | Link |
| Weights & Biases Weave | Link |
| Milvus | Link |
| Neo4j | Link |
| Lightning AI | Link |
| Haystack | Link |
| Arize | Link |
| LlamaIndex | Link |
| Langtrace | Link |
| Langfuse | Link |
| OpenLIT | Link |
| Relevance AI | Link |

Credit: Some of these resources were originally compiled in the Awesome DSPy repo.



sidebar_position: 2

!!! warning "This page is outdated and may not be fully accurate in DSPy 2.5"

Utilizing Built-in Datasets

It's easy to use your own data in DSPy: a dataset is just a list of Example objects. Using DSPy well involves being able to find and re-purpose existing datasets for your own pipelines in new ways; DSPy makes this a particularly powerful strategy.

For convenience, DSPy currently also provides support for the following datasets out of the box (a quick GSM8K loading snippet follows the list):

  • HotPotQA (multi-hop question answering)
  • GSM8k (math questions)
  • Color (basic dataset of colors)
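
For example, GSM8K can be loaded in a couple of lines. This is a small sketch; it assumes the class exposes train and dev splits, as used elsewhere in these docs.

from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

gsm8k = GSM8K()
trainset, devset = gsm8k.train, gsm8k.dev
print(len(trainset), len(devset))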

Loading HotPotQA

HotPotQA is a collection of multi-hop question-answer pairs.

from dspy.datasets import HotPotQA

dataset = HotPotQA(train_seed=1, train_size=5, eval_seed=2023, dev_size=50, test_size=0)

print(dataset.train)

Output:

[Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys=None),
 Example({'question': 'which  American actor was Candace Kita  guest starred with ', 'answer': 'Bill Murray'}) (input_keys=None),
 Example({'question': 'Which of these publications was most recently published, Who Put the Bomp or Self?', 'answer': 'Self'}) (input_keys=None),
 Example({'question': 'The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?', 'answer': '1950'}) (input_keys=None),
 Example({'question': 'Which magazine has published articles by Scott Shaw, Tae Kwon Do Times or Southwest Art?', 'answer': 'Tae Kwon Do Times'}) (input_keys=None)]

We just loaded trainset (5 examples) and devset (50 examples). Each example in our training set contains just a question and its (human-annotated) answer. As you can see, the data is loaded as a list of Example objects. However, note that the loader doesn't set the input keys implicitly, so that is something we'll need to do ourselves:

trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

print(trainset)

Output:

[Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'}),
 Example({'question': 'which  American actor was Candace Kita  guest starred with ', 'answer': 'Bill Murray'}) (input_keys={'question'}),
 Example({'question': 'Which of these publications was most recently published, Who Put the Bomp or Self?', 'answer': 'Self'}) (input_keys={'question'}),
 Example({'question': 'The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?', 'answer': '1950'}) (input_keys={'question'}),
 Example({'question': 'Which magazine has published articles by Scott Shaw, Tae Kwon Do Times or Southwest Art?', 'answer': 'Tae Kwon Do Times'}) (input_keys={'question'})]

DSPy typically requires very minimal labeling. Whereas your pipeline may involve six or seven complex steps, you only need labels for the initial question and the final answer. DSPy will bootstrap any intermediate labels needed to support your pipeline. If you change your pipeline in any way, the data bootstrapped will change accordingly!
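
As a rough sketch of what this means in practice, here is a hypothetical two-step program (the module and metric names are made up for illustration, and an LM is assumed to be configured via dspy.configure). The trainset above only labels question -> answer, yet BootstrapFewShot can still collect demonstrations for the intermediate search-query step by tracing runs that pass the metric:

import dspy
from dspy.teleprompt import BootstrapFewShot

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # Two predictors, but our data only labels the final answer.
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.answer = dspy.ChainOfThought("question, search_query -> answer")

    def forward(self, question):
        search_query = self.generate_query(question=question).search_query
        return self.answer(question=question, search_query=search_query)

def answer_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# Intermediate search_query demos are bootstrapped automatically from successful traces.
compiled = BootstrapFewShot(metric=answer_match).compile(MultiHopQA(), trainset=trainset)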

Advanced: Inside DSPy's Dataset class (Optional)

We've seen how you can use the HotPotQA dataset class to load the HotPotQA dataset, but how does it actually work? The HotPotQA class inherits from the Dataset class, which takes care of converting the data loaded from a source into train, dev, and test splits, each of which is a list of examples. In the HotPotQA class, you only implement the __init__ method, where you populate the splits from HuggingFace into the variables _train, _test and _dev. The rest of the process is handled by methods in the Dataset class.
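
As a minimal sketch of that pattern (a toy dataset invented for illustration; each split is just a list of dicts, mirroring what HotPotQA populates from HuggingFace):

from dspy.datasets.dataset import Dataset

class TinyQADataset(Dataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Populate the raw splits; the base Dataset class converts them into Example objects.
        self._train = [{"question": "What is 2 + 2?", "answer": "4"}]
        self._dev = [{"question": "What is the capital of France?", "answer": "Paris"}]
        self._test = []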

Dataset Loading Process in HotPotQA Class

But how do the methods of the Dataset class convert the data from HuggingFace? Let's take a deep breath and think step by step... pun intended. In the example above, we can see the splits accessed via .train, .dev and .test, so let's take a look at the implementation of the train() property:

@property
def train(self):
    if not hasattr(self, '_train_'):
        self._train_ = self._shuffle_and_sample('train', self._train, self.train_size, self.train_seed)

    return self._train_

As you can see, train() is defined as a property, not a regular method. Within this property, it first checks if the _train_ attribute exists. If not, it calls the _shuffle_and_sample() method to process self._train, where the data loaded from HuggingFace is stored. Let's look at the _shuffle_and_sample() method:

def _shuffle_and_sample(self, split, data, size, seed=0):
    data = list(data)
    base_rng = random.Random(seed)

    if self.do_shuffle:
        base_rng.shuffle(data)

    data = data[:size]
    output = []

    for example in data:
        output.append(Example(**example, dspy_uuid=str(uuid.uuid4()), dspy_split=split))

    return output

The _shuffle_and_sample() method does the following:

  • It shuffles the data if self.do_shuffle is True.
  • It then takes a sample of size size from the shuffled data.
  • It then loops through the sampled data and converts each element into an Example object. Each Example, along with the example's data, also contains a unique ID and the split name.

Converting the raw examples into Example objects allows the Dataset class to process them in a standardized way later. For example, the collate method, which is used by the PyTorch DataLoader, expects each item to be an Example.

To summarize, the Dataset class handles all the necessary data processing and provides a simple API to access the different splits. This differs from dataset classes like HotPotQA, which only need to define how to load the raw data.



sidebar_position: 1

!!! warning "This page is outdated and may not be fully accurate in DSPy 2.5"

Examples in DSPy

Working in DSPy involves training sets, development sets, and test sets. This is like traditional ML, but you usually need far fewer labels (or zero labels) to use DSPy effectively.

The core data type for data in DSPy is Example. You will use Examples to represent items in your training set and test set.

DSPy Examples are similar to Python dicts but have a few useful utilities. Your DSPy modules will return values of the type Prediction, which is a special sub-class of Example.
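
For instance, assuming an LM has already been configured via dspy.configure, calling a module returns a Prediction whose fields you access just like an Example's (an illustrative sketch):

qa = dspy.Predict("question -> answer")
pred = qa(question="What is the capital of Belgium?")

print(type(pred))   # dspy.Prediction
print(pred.answer)  # fields are accessed with dot notation, like an Example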

Creating an Example

When you use DSPy, you will do a lot of evaluation and optimization runs. Your individual datapoints will be of type Example:

qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Output:

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.

Examples can have any field keys and any value types, though usually values are strings.

object = dspy.Example(field1=value1, field2=value2, field3=value3, ...)

Specifying Input Keys

In traditional ML, there are separate "inputs" and "labels".

In DSPy, the Example objects have a with_inputs() method, which can mark specific fields as inputs. (The rest are just metadata or labels.)

# Single Input.
print(qa_pair.with_inputs("question"))

# Multiple Inputs; be careful about marking your labels as inputs unless you mean it.
print(qa_pair.with_inputs("question", "answer"))

This flexibility allows for customized tailoring of the Example object for different DSPy scenarios.

When you call with_inputs(), you get a new copy of the example. The original object is kept unchanged.
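
For example, building on the qa_pair object from above:

marked = qa_pair.with_inputs("question")

print(marked.inputs())  # only the 'question' field is treated as an input
print(qa_pair)          # the original qa_pair is unchanged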

Element Access and Updates

Values can be accessed using the . (dot) operator. For example, you can access the value of the name key in the object Example(name="John Doe", job="sleep") through object.name.
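
For example:

person = dspy.Example(name="John Doe", job="sleep")
print(person.name)  # 'John Doe'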

To access or exclude certain keys, use inputs() and labels() methods to return new Example objects containing only input or non-input keys, respectively.

article_summary = dspy.Example(article= "This is an article.", summary= "This is a summary.").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example object with Input fields only:", input_key_only)
print("Example object with Non-Input fields only:", non_input_key_only)

Output

Example object with Input fields only: Example({'article': 'This is an article.'}) (input_keys=None)
Example object with Non-Input fields only: Example({'summary': 'This is a summary.'}) (input_keys=None)

To exclude keys, use without():

article_summary = dspy.Example(context="This is an article.", question="This is a question?", answer="This is an answer.", rationale= "This is a rationale.").with_inputs("context", "question")

print("Example object without answer & rationale keys:", article_summary.without("answer", "rationale"))

Output

Example object without answer & rationale keys: Example({'context': 'This is an article.', 'question': 'This is a question?'}) (input_keys=None)

Updating values is simply assigning a new value using the . operator.

article_summary.context = "new context"

Iterating over Example

Iterating over an Example works like iterating over a dictionary, and the class supports methods like keys(), values(), items(), etc.:

for k, v in article_summary.items():
    print(f"{k} = {v}")

Output

context = This is an article.
question = This is a question?
answer = This is an answer.
rationale = This is a rationale.


sidebar_position: 3

!!! warning "This page is outdated and may not be fully accurate in DSPy 2.5"

Creating a Custom Dataset

We've seen how to work with Example objects and use the HotPotQA class to load the HuggingFace HotPotQA dataset as a list of Example objects. But in production, such structured datasets are rare. Instead, you'll find yourself working on a custom dataset and might wonder: how do I create my own dataset, and what format should it be in?

In DSPy, your dataset is a list of Examples, which we can accomplish in two ways:

  • Recommended: The Pythonic Way: Using native Python utilities and logic.
  • Advanced: Using DSPy's Dataset class

Recommended: The Pythonic Way

To create a list of Example objects, we can simply load data from the source and formulate it into a Python list. Let's load an example CSV, sample.csv, that contains 3 fields (context, question, and answer) via Pandas. From there, we can construct our data list.

import pandas as pd

df = pd.read_csv("sample.csv")
print(df.shape)

Output:

(1000, 3)

dataset = []

for context, question, answer in df.values:
    dataset.append(dspy.Example(context=context, question=question, answer=answer).with_inputs("context", "question"))

print(dataset[:3])

Output:

[Example({'context': nan, 'question': 'Which is a species of fish? Tope or Rope', 'answer': 'Tope'}) (input_keys={'question', 'context'}),
 Example({'context': nan, 'question': 'Why can camels survive for long without water?', 'answer': 'Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.'}) (input_keys={'question', 'context'}),
 Example({'context': nan, 'question': "Alice's parents have three daughters: Amy, Jessy, and what’s the name of the third daughter?", 'answer': 'The name of the third daughter is Alice'}) (input_keys={'question', 'context'})]

While this is fairly simple, let's take a look at how loading datasets looks the DSPythonic way - using DSPy's Dataset class!

Advanced: Using DSPy's Dataset class (Optional)

Let's take advantage of the Dataset class we defined in the previous article to accomplish the preprocessing:

  • Load data from CSV to a dataframe.
  • Split the data to train, dev and test splits.
  • Populate the _train, _dev and _test class attributes. Note that each of these attributes should be a list of dictionaries, or an iterator over mappings (like a HuggingFace Dataset), for things to work.

This is all done through the __init__ method, which is the only method we have to implement.

import pandas as pd
from dspy.datasets.dataset import Dataset

class CSVDataset(Dataset):
    def __init__(self, file_path, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        
        df = pd.read_csv(file_path)
        self._train = df.iloc[0:700].to_dict(orient='records')

        self._dev = df.iloc[700:].to_dict(orient='records')

dataset = CSVDataset("sample.csv")
print(dataset.train[:3])

Output:

[Example({'context': nan, 'question': 'Which is a species of fish? Tope or Rope', 'answer': 'Tope'}) (input_keys={'question', 'context'}),
 Example({'context': nan, 'question': 'Why can camels survive for long without water?', 'answer': 'Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.'}) (input_keys={'question', 'context'}),
 Example({'context': nan, 'question': "Alice's parents have three daughters: Amy, Jessy, and what’s the name of the third daughter?", 'answer': 'The name of the third daughter is Alice'}) (input_keys={'question', 'context'})]

Let's understand the code step by step:

  • It inherits the base Dataset class from DSPy, which gives us all the useful data loading/processing functionality.
  • We load the CSV data into a DataFrame.
  • We take the train split, i.e. the first 700 rows of the DataFrame, convert it to a list of dicts using the to_dict(orient='records') method, and assign it to self._train.
  • We take the dev split, i.e. the remaining 300 rows, convert it to a list of dicts in the same way, and assign it to self._dev.

Using the Dataset base class now makes loading custom datasets incredibly easy and avoids having to write all that boilerplate code ourselves for every new dataset.

!!! caution

We did not populate the `_test` attribute in the above code, which is fine and won't cause any error by itself. However, you'll get an error if you try to access the test split.

```python
dataset.test[:5]
```
```text
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-59-5202f6de3c7b> in <cell line: 1>()
----> 1 dataset.test[:5]

/usr/local/lib/python3.10/dist-packages/dspy/datasets/dataset.py in test(self)
    51     def test(self):
    52         if not hasattr(self, '_test_'):
---> 53             self._test_ = self._shuffle_and_sample('test', self._test, self.test_size, self.test_seed)
    54 
    55         return self._test_

AttributeError: 'CSVDataset' object has no attribute '_test'
```

To prevent that, you just need to make sure `_test` is not `None` and is populated with the appropriate data, as in the sketch below.
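
For example, sticking with the imports from the earlier snippet and the same 1000-row `sample.csv`, one possible split looks like this (the exact boundaries are up to you):

```python
class CSVDataset(Dataset):
    def __init__(self, file_path, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

        df = pd.read_csv(file_path)
        self._train = df.iloc[0:700].to_dict(orient='records')
        self._dev = df.iloc[700:900].to_dict(orient='records')
        self._test = df.iloc[900:].to_dict(orient='records')  # now dataset.test is accessible
```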

You can override the methods in Dataset class to customize your class even more.

In summary, the Dataset base class provides a simple way to load and preprocess custom datasets with minimal code!



sidebar_position: 998

!!! warning "This page is outdated and may not be fully accurate in DSPy 2.5 and 2.6"

FAQs

Is DSPy right for me? DSPy vs. other frameworks

The DSPy philosophy and abstraction differ significantly from other libraries and frameworks, so it's usually straightforward to decide when DSPy is (or isn't) the right framework for your use case. If you're an NLP/AI researcher (or a practitioner exploring new pipelines or new tasks), the answer is almost invariably yes. If you're a practitioner doing other things, please read on.

DSPy vs. thin wrappers for prompts (OpenAI API, MiniChain, basic templating) In other words: Why can't I just write my prompts directly as string templates? Well, for extremely simple settings, this might work just fine. (If you're familiar with neural networks, this is like expressing a tiny two-layer NN as a Python for-loop. It kinda works.) However, when you need higher quality (or manageable cost), then you need to iteratively explore multi-stage decomposition, improved prompting, data bootstrapping, careful finetuning, retrieval augmentation, and/or using smaller (or cheaper, or local) models. The true expressive power of building with foundation models lies in the interactions between these pieces. But every time you change one piece, you likely break (or weaken) multiple other components. DSPy cleanly abstracts away (and powerfully optimizes) the parts of these interactions that are external to your actual system design. It lets you focus on designing the module-level interactions: the same program expressed in 10 or 20 lines of DSPy can easily be compiled into multi-stage instructions for GPT-4, detailed prompts for Llama2-13b, or finetunes for T5-base. Oh, and you wouldn't need to maintain long, brittle, model-specific strings at the core of your project anymore.

DSPy vs. application development libraries like LangChain, LlamaIndex LangChain and LlamaIndex target high-level application development; they offer batteries-included, pre-built application modules that plug in with your data or configuration. If you'd be happy to use a generic, off-the-shelf prompt for question answering over PDFs or standard text-to-SQL, you will find a rich ecosystem in these libraries. DSPy doesn't internally contain hand-crafted prompts that target specific applications. Instead, DSPy introduces a small set of much more powerful and general-purpose modules that can learn to prompt (or finetune) your LM within your pipeline on your data. When you change your data, make tweaks to your program's control flow, or change your target LM, the DSPy compiler can map your program into a new set of prompts (or finetunes) that are optimized specifically for this pipeline. Because of this, you may find that DSPy obtains the highest quality for your task, with the least effort, provided you're willing to implement (or extend) your own short program. In short, DSPy is for when you need a lightweight but automatically-optimizing programming model — not a library of predefined prompts and integrations. If you're familiar with neural networks: This is like the difference between PyTorch (i.e., representing DSPy) and HuggingFace Transformers (i.e., representing the higher-level libraries).

DSPy vs. generation control libraries like Guidance, LMQL, RELM, Outlines These are all exciting new libraries for controlling the individual completions of LMs, e.g., if you want to enforce JSON output schema or constrain sampling to a particular regular expression. This is very useful in many settings, but it's generally focused on low-level, structured control of a single LM call. It doesn't help ensure the JSON (or structured output) you get is going to be correct or useful for your task. In contrast, DSPy automatically optimizes the prompts in your programs to align them with various task needs, which may also include producing valid structured outputs. That said, we are considering allowing Signatures in DSPy to express regex-like constraints that are implemented by these libraries.

Basic Usage

How should I use DSPy for my task? We wrote an eight-step guide on this. In short, using DSPy is an iterative process. You first define your task and the metrics you want to maximize, and prepare a few example inputs, typically without labels (or only with labels for the final outputs, if your metric requires them). Then, you build your pipeline by selecting built-in layers (modules) to use, giving each layer a signature (input/output spec), and then calling your modules freely in your Python code. Lastly, you use a DSPy optimizer to compile your code into high-quality instructions, automatic few-shot examples, or updated LM weights for your LM.
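
As a rough, hedged sketch of that loop (exact_match and trainset here are illustrative placeholders, not fixed DSPy names):

import dspy
dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

qa = dspy.ChainOfThought("question -> answer")           # a module with a signature

def exact_match(example, pred, trace=None):              # a simple metric
    return example.answer.lower() == pred.answer.lower()

optimizer = dspy.BootstrapFewShot(metric=exact_match)    # an optimizer ("compiling")
compiled_qa = optimizer.compile(qa, trainset=trainset)   # trainset: a list of dspy.Example objects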

How do I convert my complex prompt into a DSPy pipeline? See the same answer above.

What do DSPy optimizers tune? Or, what does compiling actually do? Each optimizer is different, but they all seek to maximize a metric on your program by updating prompts or LM weights. Current DSPy optimizers can inspect your data, simulate traces through your program to generate good/bad examples of each step, propose or refine instructions for each step based on past results, finetune the weights of your LM on self-generated examples, or combine several of these to improve quality or cut cost. We'd love to merge new optimizers that explore a richer space: most manual steps you currently go through for prompt engineering, "synthetic data" generation, or self-improvement can probably be generalized into a DSPy optimizer that acts on arbitrary LM programs.

Other FAQs. We welcome PRs to add formal answers to each of these here. You will find the answer in existing issues, tutorials, or the papers for all or most of these.

  • How do I get multiple outputs?

You can specify multiple output fields. For the short-form signature, you can list multiple outputs as comma-separated values, following the "->" indicator (e.g. "inputs -> output1, output2"). For the long-form signature, you can include multiple dspy.OutputFields.
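
For instance (a minimal sketch; the field names are illustrative):

# Short-form signature with two outputs.
predictor = dspy.Predict("question -> answer, citation")

# Long-form equivalent.
class AnswerWithCitation(dspy.Signature):
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citation: str = dspy.OutputField()

predictor = dspy.Predict(AnswerWithCitation)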

  • How do I define my own metrics? Can metrics return a float?

You can define metrics as simple Python functions that process model generations and evaluate them based on user-defined requirements. Metrics can compare existing data (e.g. gold labels) to model predictions, or they can assess various components of an output using validation feedback from LMs (e.g. LLMs-as-judges). Metrics can return bool, int, or float scores. Check out the official Metrics docs to learn more about defining custom metrics and advanced evaluations using AI feedback and/or DSPy programs.

  • How expensive or slow is compiling?

As a reference point for compiling cost, consider compiling the SimplifiedBaleen program with the dspy.BootstrapFewShotWithRandomSearch optimizer on the gpt-3.5-turbo-1106 model, over 7 candidate programs and 10 threads. Compiling this program takes around 6 minutes, 3,200 API calls, 2.7 million input tokens, and 156,000 output tokens, for a total cost of about $3 USD (at that model's pricing at the time).

Running DSPy optimizers naturally incurs additional LM calls, but this overhead comes from a deliberately small number of executions aimed at maximizing performance. It also opens avenues for enhancing smaller models: you can compile a DSPy program with a larger model so that it learns improved behavior at compile time, and then propagate that behavior to a smaller model at inference time.

Deployment or Reproducibility Concerns

  • How do I save a checkpoint of my compiled program?

Here is an example of saving/loading a compiled module:

cot_compiled = teleprompter.compile(CoT(), trainset=trainset, valset=devset)

# Saving
cot_compiled.save('compiled_cot_gsm8k.json')

# Loading:
cot = CoT()
cot.load('compiled_cot_gsm8k.json')

  • How do I export for deployment?

Exporting DSPy programs is simply saving them as highlighted above!

  • How do I search my own data?

Open-source libraries such as RAGatouille enable you to search your own data through advanced retrieval models like ColBERT, with tools to embed and index documents. Feel free to integrate such libraries to create searchable datasets while developing your DSPy programs!

  • How do I turn off the cache? How do I export the cache?

From v2.5, you can turn off the cache by setting the cache parameter of dspy.LM to False:

dspy.LM('openai/gpt-4o-mini',  cache=False)

Your local cache will be saved to the directory given by the environment variable os.environ["DSP_CACHEDIR"] (or, for notebooks, os.environ["DSP_NOTEBOOK_CACHEDIR"]). You can usually set the cache dir to os.path.join(repo_path, 'cache') and export this cache from there:

os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(os.getcwd(), 'cache')

!!! warning "Important" DSP_CACHEDIR is responsible for old clients (including dspy.OpenAI, dspy.ColBERTv2, etc.) and DSPY_CACHEDIR is responsible for the new dspy.LM client.

In AWS Lambda deployments, you should disable both the DSP_* and DSPY_* caches.

Advanced Usage

  • How do I parallelize? You can parallelize DSPy programs during both compilation and evaluation by specifying the number of threads in the respective DSPy optimizers or in the dspy.Evaluate utility function, as shown below.
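
For instance, both the dspy.Evaluate utility and most optimizers accept a num_threads argument (a hedged sketch; my_program, my_metric, devset, and trainset are placeholders):

evaluator = dspy.Evaluate(devset=devset, num_threads=8, display_progress=True)
evaluator(my_program, metric=my_metric)

tp = dspy.MIPROv2(metric=my_metric, auto="light", num_threads=8)
optimized = tp.compile(my_program, trainset=trainset)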

  • How do I freeze a module?

Modules can be frozen by setting their ._compiled attribute to True, indicating that the module has gone through optimizer compilation and should not have its parameters adjusted. This is handled internally in optimizers such as dspy.BootstrapFewShot, which ensures the student program is frozen before the teacher propagates the gathered few-shot demonstrations in the bootstrapping process.
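
A minimal sketch (my_module is a placeholder for any DSPy module):

my_module._compiled = True  # optimizers will now treat this module as frozen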

  • How do I use DSPy assertions?

    a) How to Add Assertions to Your Program:

    • Define Constraints: Use dspy.Assert and/or dspy.Suggest to define constraints within your DSPy program. These are based on boolean validation checks for the outcomes you want to enforce, which can simply be Python functions to validate the model outputs.
    • Integrating Assertions: Place your assertion statements right after the model generations they validate (hint: right after a module call).

    b) How to Activate the Assertions:

    1. Using assert_transform_module:
      • Wrap your DSPy module with assertions using the assert_transform_module function, along with a backtrack_handler. This function transforms your program to include internal assertions backtracking and retry logic, which can be customized as well: program_with_assertions = assert_transform_module(ProgramWithAssertions(), backtrack_handler)
    2. Activate Assertions:
      • Directly call activate_assertions on your DSPy program with assertions: program_with_assertions = ProgramWithAssertions().activate_assertions()

    Note: To use Assertions properly, you must activate a DSPy program that includes dspy.Assert or dspy.Suggest statements, using either of the methods above.

Errors

  • How do I deal with "context too long" errors?

If you're dealing with "context too long" errors in DSPy, you're likely using DSPy optimizers to include demonstrations within your prompt, and this is exceeding your current context window. Try reducing these parameters (e.g. max_bootstrapped_demos and max_labeled_demos). Additionally, you can reduce the number of retrieved passages/docs/embeddings to ensure your prompt fits within your model's context length.
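
For instance, a hedged sketch of compiling with fewer demonstrations (my_program, my_metric, and trainset are placeholders):

optimizer = dspy.BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=2, max_labeled_demos=2)
compiled = optimizer.compile(my_program, trainset=trainset)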

A more general fix is simply increasing the number of max_tokens specified in the LM request (e.g. lm = dspy.OpenAI(model=..., max_tokens=...)).

Set Verbose Level

DSPy utilizes the logging library to print logs. If you want to debug your DSPy code, set the logging level to DEBUG with the example code below.

import logging
logging.getLogger("dspy").setLevel(logging.DEBUG)

Alternatively, if you want to reduce the amount of logs, set the logging level to WARNING or ERROR.

import logging
logging.getLogger("dspy").setLevel(logging.WARNING)


sidebar_position: 1 hide:

  • navigation
  • toc

DSPy

Programming—not prompting—LMs


DSPy is a declarative framework for building modular AI software. It allows you to iterate fast on structured code, rather than brittle strings, and offers algorithms that compile AI programs into effective prompts and weights for your language models, whether you're building simple classifiers, sophisticated RAG pipelines, or Agent loops.

Instead of wrangling prompts or training jobs, DSPy (Declarative Self-improving Python) enables you to build AI software from natural-language modules and to generically compose them with different models, inference strategies, or learning algorithms. This makes AI software more reliable, maintainable, and portable across models and strategies.

tl;dr Think of DSPy as a higher-level language for AI programming (lecture), like the shift from assembly to C or pointer arithmetic to SQL. Meet the community, seek help, or start contributing via GitHub and Discord.

!!! info "Getting Started I: Install DSPy and set up your LM"

```bash
> pip install -U dspy
```

=== "OpenAI"
    You can authenticate by setting the `OPENAI_API_KEY` env variable or passing `api_key` below.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_OPENAI_API_KEY')
    dspy.configure(lm=lm)
    ```

=== "Anthropic"
    You can authenticate by setting the ANTHROPIC_API_KEY env variable or passing `api_key` below.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('anthropic/claude-3-opus-20240229', api_key='YOUR_ANTHROPIC_API_KEY')
    dspy.configure(lm=lm)
    ```

=== "Databricks"
    If you're on the Databricks platform, authentication is automatic via their SDK. If not, you can set the env variables `DATABRICKS_API_KEY` and `DATABRICKS_API_BASE`, or pass `api_key` and `api_base` below.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('databricks/databricks-meta-llama-3-1-70b-instruct')
    dspy.configure(lm=lm)
    ```

=== "Local LMs on your laptop"
      First, install [Ollama](https://github.com/ollama/ollama) and launch its server with your LM.

      ```bash
      > curl -fsSL https://ollama.ai/install.sh | sh
      > ollama run llama3.2:1b
      ```

      Then, connect to it from your DSPy code.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434', api_key='')
    dspy.configure(lm=lm)
    ```

=== "Local LMs on a GPU server"
      First, install [SGLang](https://sgl-project.github.io/start/install.html) and launch its server with your LM.

      ```bash
      > pip install "sglang[all]"
      > pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ 

      > CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct
      ```
    
    If you don't have access from Meta to download `meta-llama/Llama-3.1-8B-Instruct`, use `Qwen/Qwen2.5-7B-Instruct` for example.

    Next, connect to your local LM from your DSPy code as an `OpenAI`-compatible endpoint.

      ```python linenums="1"
      lm = dspy.LM("openai/meta-llama/Llama-3.1-8B-Instruct",
                   api_base="http://localhost:7501/v1",  # ensure this points to your port
                   api_key="local", model_type='chat')
      dspy.configure(lm=lm)
      ```

=== "Other providers"
    In DSPy, you can use any of the dozens of [LLM providers supported by LiteLLM](https://docs.litellm.ai/docs/providers). Simply follow their instructions for which `{PROVIDER}_API_KEY` to set and how to pass the `{provider_name}/{model_name}` string to the constructor.

    Some examples:

    - `anyscale/mistralai/Mistral-7B-Instruct-v0.1`, with `ANYSCALE_API_KEY`
    - `together_ai/togethercomputer/llama-2-70b-chat`, with `TOGETHERAI_API_KEY`
    - `sagemaker/<your-endpoint-name>`, with `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION_NAME`
    - `azure/<your_deployment_name>`, with `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`, and the optional `AZURE_AD_TOKEN` and `AZURE_API_TYPE`

    
    If your provider offers an OpenAI-compatible endpoint, just add an `openai/` prefix to your full model name.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('openai/your-model-name', api_key='PROVIDER_API_KEY', api_base='YOUR_PROVIDER_URL')
    dspy.configure(lm=lm)
    ```

??? "Calling the LM directly."

    Idiomatic DSPy involves using _modules_, which we define in the rest of this page. However, it's still easy to call the `lm` you configured above directly. This gives you a unified API and lets you benefit from utilities like automatic caching.

    ```python linenums="1"
    lm("Say this is a test!", temperature=0.7)  # => ['This is a test!']
    lm(messages=[{"role": "user", "content": "Say this is a test!"}])  # => ['This is a test!']
    ```

1) Modules help you describe AI behavior as code, not strings.

To build reliable AI systems, you must iterate fast. But maintaining prompts makes that hard: it forces you to tinker with strings or data every time you change your LM, metrics, or pipeline. Having built over a dozen best-in-class compound LM systems since 2020, we learned this the hard way—and so built DSPy to decouple AI system design from messy incidental choices about specific LMs or prompting strategies.

DSPy shifts your focus from tinkering with prompt strings to programming with structured and declarative natural-language modules. For every AI component in your system, you specify input/output behavior as a signature and select a module to assign a strategy for invoking your LM. DSPy expands your signatures into prompts and parses your typed outputs, so you can compose different modules together into ergonomic, portable, and optimizable AI systems.

!!! info "Getting Started II: Build DSPy modules for various tasks" Try the examples below after configuring your lm above. Adjust the fields to explore what tasks your LM can do well out of the box. Each tab below sets up a DSPy module, like dspy.Predict, dspy.ChainOfThought, or dspy.ReAct, with a task-specific signature. For example, question -> answer: float tells the module to take a question and to produce a float answer.

=== "Math"

    ```python linenums="1"
    math = dspy.ChainOfThought("question -> answer: float")
    math(question="Two dice are tossed. What is the probability that the sum equals two?")
    ```
    
    **Possible Output:**
    ```text
    Prediction(
        reasoning='When two dice are tossed, each die has 6 faces, resulting in a total of 6 x 6 = 36 possible outcomes. The sum of the numbers on the two dice equals two only when both dice show a 1. This is just one specific outcome: (1, 1). Therefore, there is only 1 favorable outcome. The probability of the sum being two is the number of favorable outcomes divided by the total number of possible outcomes, which is 1/36.',
        answer=0.0277776
    )
    ```

=== "RAG"

    ```python linenums="1"       
    def search_wikipedia(query: str) -> list[str]:
        results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
        return [x['text'] for x in results]
    
    rag = dspy.ChainOfThought('context, question -> response')

    question = "What's the name of the castle that David Gregory inherited?"
    rag(context=search_wikipedia(question), question=question)
    ```
    
    **Possible Output:**
    ```text
    Prediction(
        reasoning='The context provides information about David Gregory, a Scottish physician and inventor. It specifically mentions that he inherited Kinnairdy Castle in 1664. This detail directly answers the question about the name of the castle that David Gregory inherited.',
        response='Kinnairdy Castle'
    )
    ```

=== "Classification"

    ```python linenums="1"
    from typing import Literal

    class Classify(dspy.Signature):
        """Classify sentiment of a given sentence."""
        
        sentence: str = dspy.InputField()
        sentiment: Literal['positive', 'negative', 'neutral'] = dspy.OutputField()
        confidence: float = dspy.OutputField()

    classify = dspy.Predict(Classify)
    classify(sentence="This book was super fun to read, though not the last chapter.")
    ```
    
    **Possible Output:**

    ```text
    Prediction(
        sentiment='positive',
        confidence=0.75
    )
    ```

=== "Information Extraction"

    ```python linenums="1"        
    class ExtractInfo(dspy.Signature):
        """Extract structured information from text."""
        
        text: str = dspy.InputField()
        title: str = dspy.OutputField()
        headings: list[str] = dspy.OutputField()
        entities: list[dict[str, str]] = dspy.OutputField(desc="a list of entities and their metadata")
    
    module = dspy.Predict(ExtractInfo)

    text = "Apple Inc. announced its latest iPhone 14 today." \
        "The CEO, Tim Cook, highlighted its new features in a press release."
    response = module(text=text)

    print(response.title)
    print(response.headings)
    print(response.entities)
    ```
    
    **Possible Output:**
    ```text
    Apple Inc. Announces iPhone 14
    ['Introduction', "CEO's Statement", 'New Features']
    [{'name': 'Apple Inc.', 'type': 'Organization'}, {'name': 'iPhone 14', 'type': 'Product'}, {'name': 'Tim Cook', 'type': 'Person'}]
    ```

=== "Agents"

    ```python linenums="1"       
    def evaluate_math(expression: str):
        return dspy.PythonInterpreter({}).execute(expression)

    def search_wikipedia(query: str):
        results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
        return [x['text'] for x in results]

    react = dspy.ReAct("question -> answer: float", tools=[evaluate_math, search_wikipedia])

    pred = react(question="What is 9362158 divided by the year of birth of David Gregory of Kinnairdy castle?")
    print(pred.answer)
    ```
    
    **Possible Output:**

    ```text
    5761.328
    ```

=== "Multi-Stage Pipelines"

    ```python linenums="1"       
    class Outline(dspy.Signature):
        """Outline a thorough overview of a topic."""
        
        topic: str = dspy.InputField()
        title: str = dspy.OutputField()
        sections: list[str] = dspy.OutputField()
        section_subheadings: dict[str, list[str]] = dspy.OutputField(desc="mapping from section headings to subheadings")

    class DraftSection(dspy.Signature):
        """Draft a top-level section of an article."""
        
        topic: str = dspy.InputField()
        section_heading: str = dspy.InputField()
        section_subheadings: list[str] = dspy.InputField()
        content: str = dspy.OutputField(desc="markdown-formatted section")

    class DraftArticle(dspy.Module):
        def __init__(self):
            self.build_outline = dspy.ChainOfThought(Outline)
            self.draft_section = dspy.ChainOfThought(DraftSection)

        def forward(self, topic):
            outline = self.build_outline(topic=topic)
            sections = []
            for heading, subheadings in outline.section_subheadings.items():
                section, subheadings = f"## {heading}", [f"### {subheading}" for subheading in subheadings]
                section = self.draft_section(topic=outline.title, section_heading=section, section_subheadings=subheadings)
                sections.append(section.content)
            return dspy.Prediction(title=outline.title, sections=sections)

    draft_article = DraftArticle()
    article = draft_article(topic="World Cup 2002")
    ```
    
    **Possible Output:**

    A 1500-word article on the topic, e.g.

    ```text
    ## Qualification Process

    The qualification process for the 2002 FIFA World Cup involved a series of..... [shortened here for presentation].

    ### UEFA Qualifiers

    The UEFA qualifiers involved 50 teams competing for 13..... [shortened here for presentation].

    .... [rest of the article]
    ```

    Note that DSPy makes it straightforward to optimize multi-stage modules like this. As long as you can evaluate the _final_ output of the system, every DSPy optimizer can tune all of the intermediate modules.

??? "Using DSPy in practice: from quick scripting to building sophisticated systems."

Standard prompts conflate interface (“what should the LM do?”) with implementation (“how do we tell it to do that?”). DSPy isolates the former as _signatures_ so we can infer the latter or learn it from data — in the context of a bigger program.

Even before you start using optimizers, DSPy's modules allow you to script effective LM systems as ergonomic, portable _code_. Across many tasks and LMs, we maintain _signature test suites_ that assess the reliability of the built-in DSPy adapters. Adapters are the components that map signatures to prompts prior to optimization. If you find a task where a simple prompt consistently outperforms idiomatic DSPy for your LM, consider that a bug and [file an issue](https://github.com/stanfordnlp/dspy/issues). We'll use this to improve the built-in adapters.

2) Optimizers tune the prompts and weights of your AI modules.

DSPy provides you with the tools to compile high-level code with natural language annotations into the low-level computations, prompts, or weight updates that align your LM with your program’s structure and metrics. If you change your code or your metrics, you can simply re-compile accordingly.

Given a few tens or hundreds of representative inputs of your task and a metric that can measure the quality of your system's outputs, you can use a DSPy optimizer. Different optimizers in DSPy work by synthesizing good few-shot examples for every module, like dspy.BootstrapRS,1 proposing and intelligently exploring better natural-language instructions for every prompt, like dspy.MIPROv2,2 and building datasets for your modules and using them to finetune the LM weights in your system, like dspy.BootstrapFinetune.3

!!! info "Getting Started III: Optimizing the LM prompts or weights in DSPy programs" A typical simple optimization run costs on the order of $2 USD and takes around 20 minutes, but be careful when running optimizers with very large LMs or very large datasets. Optimization can cost as little as a few cents or up to tens of dollars, depending on your LM, dataset, and configuration.

=== "Optimizing prompts for a ReAct agent"
    This is a minimal but fully runnable example of setting up a `dspy.ReAct` agent that answers questions via
    search from Wikipedia and then optimizing it using `dspy.MIPROv2` in the cheap `light` mode on 500
    question-answer pairs sampled from the `HotPotQA` dataset.

    ```python linenums="1"
    import dspy
    from dspy.datasets import HotPotQA

    dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

    def search_wikipedia(query: str) -> list[str]:
        results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
        return [x['text'] for x in results]

    trainset = [x.with_inputs('question') for x in HotPotQA(train_seed=2024, train_size=500).train]
    react = dspy.ReAct("question -> answer", tools=[search_wikipedia])

    tp = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light", num_threads=24)
    optimized_react = tp.compile(react, trainset=trainset)
    ```

    An informal run like this raises ReAct's score from 24% to 51%, by teaching `gpt-4o-mini` more about the specifics of the task.

=== "Optimizing prompts for RAG"
    Given a retrieval index to `search`, your favorite `dspy.LM`, and a small `trainset` of questions and ground-truth responses, the following code snippet can optimize your RAG system with long outputs against the built-in `SemanticF1` metric, which is implemented as a DSPy module.

    ```python linenums="1"
    class RAG(dspy.Module):
        def __init__(self, num_docs=5):
            self.num_docs = num_docs
            self.respond = dspy.ChainOfThought('context, question -> response')

        def forward(self, question):
            context = search(question, k=self.num_docs)   # defined in tutorial linked below
            return self.respond(context=context, question=question)

    tp = dspy.MIPROv2(metric=dspy.evaluate.SemanticF1(decompositional=True), auto="medium", num_threads=24)
    optimized_rag = tp.compile(RAG(), trainset=trainset, max_bootstrapped_demos=2, max_labeled_demos=2)
    ```

    For a complete RAG example that you can run, start this [tutorial](/tutorials/rag/). It improves the quality of a RAG system over a subset of StackExchange communities by 10% relative gain.

=== "Optimizing weights for Classification"
    This is a minimal but fully runnable example of setting up a `dspy.ChainOfThought` module that classifies
    short texts into one of 77 banking labels and then using `dspy.BootstrapFinetune` with 2000 text-label pairs
    from the `Banking77` dataset to finetune the weights of GPT-4o-mini for this task. We use the variant
    `dspy.ChainOfThoughtWithHint`, which takes an optional `hint` at bootstrapping time, to maximize the utility of
    the training data. Naturally, hints are not available at test time.

    <details><summary>Click to show dataset setup code.</summary>

    ```python linenums="1"
    import random
    from typing import Literal
    from dspy.datasets import DataLoader
    from datasets import load_dataset

    # Load the Banking77 dataset.
    CLASSES = load_dataset("PolyAI/banking77", split="train", trust_remote_code=True).features['label'].names
    kwargs = dict(fields=("text", "label"), input_keys=("text",), split="train", trust_remote_code=True)

    # Load the first 2000 examples from the dataset, and assign a hint to each *training* example.
    trainset = [
        dspy.Example(x, hint=CLASSES[x.label], label=CLASSES[x.label]).with_inputs("text", "hint")
        for x in DataLoader().from_huggingface(dataset_name="PolyAI/banking77", **kwargs)[:2000]
    ]
    random.Random(0).shuffle(trainset)
    ```
    </details>

    ```python linenums="1"
    import dspy
    dspy.configure(lm=dspy.LM('openai/gpt-4o-mini-2024-07-18'))
    
    # Define the DSPy module for classification. It will use the hint at training time, if available.
    signature = dspy.Signature("text -> label").with_updated_fields('label', type_=Literal[tuple(CLASSES)])
    classify = dspy.ChainOfThoughtWithHint(signature)

    # Optimize via BootstrapFinetune.
    optimizer = dspy.BootstrapFinetune(metric=(lambda x, y, trace=None: x.label == y.label), num_threads=24)
    optimized = optimizer.compile(classify, trainset=trainset)

    optimized(text="What does a pending cash withdrawal mean?")
    ```

    **Possible Output (from the last line):**
    ```text
    Prediction(
        reasoning='A pending cash withdrawal indicates that a request to withdraw cash has been initiated but has not yet been completed or processed. This status means that the transaction is still in progress and the funds have not yet been deducted from the account or made available to the user.',
        label='pending_cash_withdrawal'
    )
    ```

    An informal run similar to this on DSPy 2.5.29 raises GPT-4o-mini's score from 66% to 87%.

??? "What's an example of a DSPy optimizer? How do different optimizers work?"

Take the `dspy.MIPROv2` optimizer as an example. First, MIPRO starts with the **bootstrapping stage**. It takes your program, which may be unoptimized at this point, and runs it many times across different inputs to collect traces of input/output behavior for each one of your modules. It filters these traces to keep only those that appear in trajectories scored highly by your metric. Second, MIPRO enters its **grounded proposal stage**. It previews your DSPy program's code, your data, and traces from running your program, and uses them to draft many potential instructions for every prompt in your program. Third, MIPRO launches the **discrete search stage**. It samples mini-batches from your training set, proposes a combination of instructions and traces to use for constructing every prompt in the pipeline, and evaluates the candidate program on the mini-batch. Using the resulting score, MIPRO updates a surrogate model that helps the proposals get better over time.

One thing that makes DSPy optimizers so powerful is that they can be composed. You can run `dspy.MIPROv2` and use the produced program as an input to `dspy.MIPROv2` again or, say, to `dspy.BootstrapFinetune` to get better results. This is partly the essence of `dspy.BetterTogether`. Alternatively, you can run the optimizer and then extract the top-5 candidate programs and build a `dspy.Ensemble` of them. This allows you to scale _inference-time compute_ (e.g., ensembles) as well as DSPy's unique _pre-inference time compute_ (i.e., optimization budget) in highly systematic ways.

3) DSPy's Ecosystem advances open-source AI research.

Compared to monolithic LMs, DSPy's modular paradigm enables a large community to improve the compositional architectures, inference-time strategies, and optimizers for LM programs in an open, distributed way. This gives DSPy users more control, helps them iterate much faster, and allows their programs to get better over time by applying the latest optimizers or modules.

The DSPy research effort started at Stanford NLP in Feb 2022, building on what we had learned from developing early compound LM systems like ColBERT-QA, Baleen, and Hindsight. The first version was released as DSP in Dec 2022 and evolved by Oct 2023 into DSPy. Thanks to 250 contributors, DSPy has introduced tens of thousands of people to building and optimizing modular LM programs.

Since then, DSPy's community has produced a large body of work on optimizers, like MIPROv2, BetterTogether, and LeReT, on program architectures, like STORM, IReRa, and DSPy Assertions, and on successful applications to new problems, like PAPILLON, PATH, WangLab@MEDIQA, UMD's Prompting Case Study, and Haize's Red-Teaming Program, in addition to many open-source projects, production applications, and other use cases.



sidebar_position: 5

Data

DSPy is a machine learning framework, so working in it involves training sets, development sets, and test sets. For each example in your data, we distinguish typically between three types of values: the inputs, the intermediate labels, and the final label. You can use DSPy effectively without any intermediate or final labels, but you will need at least a few example inputs.

DSPy Example objects

The core data type for data in DSPy is Example. You will use Examples to represent items in your training set and test set.

DSPy Examples are similar to Python dicts but have a few useful utilities. Your DSPy modules will return values of the type Prediction, which is a special sub-class of Example.

When you use DSPy, you will do a lot of evaluation and optimization runs. Your individual datapoints will be of type Example:

qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Output:

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.

Examples can have any field keys and any value types, though usually values are strings.

object = Example(field1=value1, field2=value2, field3=value3, ...)

You can now express your training set for example as:

trainset = [dspy.Example(report="LONG REPORT 1", summary="short summary 1"), ...]

Specifying Input Keys

In traditional ML, there are separated "inputs" and "labels".

In DSPy, the Example objects have a with_inputs() method, which can mark specific fields as inputs. (The rest are just metadata or labels.)

# Single Input.
print(qa_pair.with_inputs("question"))

# Multiple Inputs; be careful about marking your labels as inputs unless you mean it.
print(qa_pair.with_inputs("question", "answer"))

Values can be accessed using the dot operator. For example, given an object defined as Example(name="John Doe", job="sleep"), you can read the value of the name key through object.name.

To access or exclude certain keys, use inputs() and labels() methods to return new Example objects containing only input or non-input keys, respectively.

article_summary = dspy.Example(article= "This is an article.", summary= "This is a summary.").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example object with Input fields only:", input_key_only)
print("Example object with Non-Input fields only:", non_input_key_only)

Output

Example object with Input fields only: Example({'article': 'This is an article.'}) (input_keys=None)
Example object with Non-Input fields only: Example({'summary': 'This is a summary.'}) (input_keys=None)


sidebar_position: 5

Metrics

DSPy is a machine learning framework, so you must think about your automatic metrics for evaluation (to track your progress) and optimization (so DSPy can make your programs more effective).

What is a metric and how do I define a metric for my task?

A metric is just a function that will take examples from your data and the output of your system and return a score that quantifies how good the output is. What makes outputs from your system good or bad?

For simple tasks, this could be just "accuracy" or "exact match" or "F1 score". This may be the case for simple classification or short-form QA tasks.

However, for most applications, your system will output long-form outputs. There, your metric should probably be a smaller DSPy program that checks multiple properties of the output (quite possibly using AI feedback from LMs).

Getting this right on the first try is unlikely, but you should start with something simple and iterate.

Simple metrics

A DSPy metric is just a function in Python that takes an example (e.g., from your training or dev set) and the output pred from your DSPy program, and returns a float (or int or bool) score.

Your metric should also accept an optional third argument called trace. You can ignore this for a moment, but it will enable some powerful tricks if you want to use your metric for optimization.

Here's a simple example of a metric that compares example.answer and pred.answer. This particular metric returns a bool.

def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

Some people find these utilities (built-in) convenient:

  • dspy.evaluate.metrics.answer_exact_match
  • dspy.evaluate.metrics.answer_passage_match

Your metrics could be more complex, e.g. they could check for multiple properties. The metric below will return a float if trace is None (i.e., if it's used for evaluation or optimization), and will return a bool otherwise (i.e., if it's used to bootstrap demonstrations).

def validate_context_and_answer(example, pred, trace=None):
    # check the gold label and the predicted answer are the same
    answer_match = example.answer.lower() == pred.answer.lower()

    # check the predicted answer comes from one of the retrieved contexts
    context_match = any((pred.answer.lower() in c) for c in pred.context)

    if trace is None: # if we're doing evaluation or optimization
        return (answer_match + context_match) / 2.0
    else: # if we're doing bootstrapping, i.e. self-generating good demonstrations of each step
        return answer_match and context_match

Defining a good metric is an iterative process, so doing some initial evaluations and looking at your data and outputs is key.

Evaluation

Once you have a metric, you can run evaluations in a simple Python loop.

scores = []
for x in devset:
    pred = program(**x.inputs())
    score = metric(x, pred)
    scores.append(score)

If you need some utilities, you can also use the built-in Evaluate utility. It can help with things like parallel evaluation (multiple threads) or showing you a sample of inputs/outputs and the metric scores.

from dspy.evaluate import Evaluate

# Set up the evaluator, which can be re-used in your code.
evaluator = Evaluate(devset=YOUR_DEVSET, num_threads=1, display_progress=True, display_table=5)

# Launch evaluation.
evaluator(YOUR_PROGRAM, metric=YOUR_METRIC)

Intermediate: Using AI feedback for your metric

For most applications, your system will output long-form outputs, so your metric should check multiple dimensions of the output using AI feedback from LMs.

This simple signature could come in handy.

# Define the signature for automatic assessments.
class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer: bool = dspy.OutputField()

For example, below is a simple metric that checks whether a generated tweet (1) answers a given question correctly and (2) is engaging. We also check (3) that len(tweet) <= 280 characters.

def metric(gold, pred, trace=None):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"
    
    correct =  dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
    engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

    correct, engaging = [m.assessment_answer for m in [correct, engaging]]
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    if trace is not None: return score >= 2
    return score / 2.0

When compiling, trace is not None, and we want to be strict about judging things, so we will only return True if score >= 2. Otherwise, we return a score out of 1.0 (i.e., score / 2.0).

Advanced: Using a DSPy program as your metric

If your metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) your metric itself. That's usually easy because the output of the metric is usually a simple value (e.g., a score out of 5), so the metric's metric is easy to define and optimize by collecting a few examples.
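
One hedged way this can look (the RateAnswer signature, the 1-to-5 scale, and human_rated_examples are illustrative, not fixed DSPy names):

class RateAnswer(dspy.Signature):
    """Rate how well the response answers the question, from 1 to 5."""

    question = dspy.InputField()
    response = dspy.InputField()
    rating: int = dspy.OutputField()

rate = dspy.ChainOfThought(RateAnswer)

# The metric's metric: agreement with a small set of human-provided ratings.
def rating_agreement(example, pred, trace=None):
    return int(pred.rating) == int(example.rating)

tuned_rate = dspy.BootstrapFewShot(metric=rating_agreement).compile(rate, trainset=human_rated_examples)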

Advanced: Accessing the trace

When your metric is used during evaluation runs, DSPy will not try to track the steps of your program.

But during compiling (optimization), DSPy will trace your LM calls. The trace will contain inputs/outputs to each DSPy predictor and you can leverage that to validate intermediate steps for optimization.

def validate_hops(example, pred, trace=None):
    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    if max([len(h) for h in hops]) > 100: return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True


sidebar_position: 1

Evaluation in DSPy

Once you have an initial system, it's time to collect an initial development set so you can refine it more systematically. Even 20 input examples of your task can be useful, though 200 goes a long way. Depending on your metric, you either just need inputs and no labels at all, or you need inputs and the final outputs of your system. (You almost never need labels for the intermediate steps in your program in DSPy.) You can probably find datasets that are adjacent to your task on, say, HuggingFace datasets or in a naturally occurring source like StackExchange. If there's data whose licenses are permissive enough, we suggest you use them. Otherwise, you can label a few examples by hand or start deploying a demo of your system and collect initial data that way.

Next, you should define your DSPy metric. What makes outputs from your system good or bad? Invest in defining metrics and improving them incrementally over time; it's hard to consistently improve what you aren't able to define. A metric is a function that takes examples from your data and takes the output of your system, and returns a score. For simple tasks, this could be just "accuracy", e.g. for simple classification or short-form QA tasks. For most applications, your system will produce long-form outputs, so your metric will be a smaller DSPy program that checks multiple properties of the output. Getting this right on the first try is unlikely: start with something simple and iterate.

Now that you have some data and a metric, run development evaluations on your pipeline designs to understand their tradeoffs. Look at the outputs and the metric scores. This will probably allow you to spot any major issues, and it will define a baseline for your next steps.

??? "If your metric is itself a DSPy program..." If your metric is itself a DSPy program, a powerful way to iterate is to optimize your metric itself. That's usually easy because the output of the metric is usually a simple value (e.g., a score out of 5), so the metric's metric is easy to define and optimize by collecting a few examples.



sidebar_position: 1

Learning DSPy: An Overview

DSPy exposes a very small API that you can learn quickly. However, building a new AI system is a more open-ended journey of iterative development, in which you compose the tools and design patterns of DSPy to optimize for your objectives. The three stages of building AI systems in DSPy are:

  1. DSPy Programming. This is about defining your task, its constraints, exploring a few examples, and using that to inform your initial pipeline design.

  2. DSPy Evaluation. Once your system starts working, this is the stage where you collect an initial development set, define your DSPy metric, and use these to iterate on your system more systematically.

  3. DSPy Optimization. Once you have a way to evaluate your system, you use DSPy optimizers to tune the prompts or weights in your program.

We suggest learning and applying DSPy in this order. For example, it's unproductive to launch optimization runs using a poorly-designed program or a bad metric.



sidebar_position: 1

DSPy Optimizers (formerly Teleprompters)

A DSPy optimizer is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics you specify, like accuracy.

A typical DSPy optimizer takes three things:

  • Your DSPy program. This may be a single module (e.g., dspy.Predict) or a complex multi-module program.

  • Your metric. This is a function that evaluates the output of your program, and assigns it a score (higher is better).

  • A few training inputs. This may be very small (i.e., only 5 or 10 examples) and incomplete (only inputs to your program, without any labels).

If you happen to have a lot of data, DSPy can leverage that. But you can start small and get strong results.

Note: Formerly called teleprompters. We are making an official name update, which will be reflected throughout the library and documentation.

What does a DSPy Optimizer tune? How does it tune them?

Different optimizers in DSPy will tune your program's quality by synthesizing good few-shot examples for every module, like dspy.BootstrapRS,1 proposing and intelligently exploring better natural-language instructions for every prompt, like dspy.MIPROv2,2 and building datasets for your modules and using them to finetune the LM weights in your system, like dspy.BootstrapFinetune.3

??? "What's an example of a DSPy optimizer? How do different optimizers work?"

Take the `dspy.MIPROv2` optimizer as an example. First, MIPRO starts with the **bootstrapping stage**. It takes your program, which may be unoptimized at this point, and runs it many times across different inputs to collect traces of input/output behavior for each one of your modules. It filters these traces to keep only those that appear in trajectories scored highly by your metric. Second, MIPRO enters its **grounded proposal stage**. It previews your DSPy program's code, your data, and traces from running your program, and uses them to draft many potential instructions for every prompt in your program. Third, MIPRO launches the **discrete search stage**. It samples mini-batches from your training set, proposes a combination of instructions and traces to use for constructing every prompt in the pipeline, and evaluates the candidate program on the mini-batch. Using the resulting score, MIPRO updates a surrogate model that helps the proposals get better over time.

One thing that makes DSPy optimizers so powerful is that they can be composed. You can run `dspy.MIPROv2` and use the produced program as an input to `dspy.MIPROv2` again or, say, to `dspy.BootstrapFinetune` to get better results. This is partly the essence of `dspy.BetterTogether`. Alternatively, you can run the optimizer and then extract the top-5 candidate programs and build a `dspy.Ensemble` of them. This allows you to scale _inference-time compute_ (e.g., ensembles) as well as DSPy's unique _pre-inference time compute_ (i.e., optimization budget) in highly systematic ways.

What DSPy Optimizers are currently available?

Optimizers can be accessed via from dspy.teleprompt import *.

Automatic Few-Shot Learning

These optimizers extend the signature by automatically generating and including optimized examples within the prompt sent to the model, implementing few-shot learning. A minimal usage sketch appears after the list below.

  1. LabeledFewShot: Simply constructs few-shot examples (demos) from provided labeled input and output data points. Requires k (number of examples for the prompt) and trainset to randomly select k examples from.

  2. BootstrapFewShot: Uses a teacher module (which defaults to your program) to generate complete demonstrations for every stage of your program, along with labeled examples in trainset. Parameters include max_labeled_demos (the number of demonstrations randomly selected from the trainset) and max_bootstrapped_demos (the number of additional examples generated by the teacher). The bootstrapping process employs the metric to validate demonstrations, including only those that pass the metric in the "compiled" prompt. Advanced: Supports using a teacher program that is a different DSPy program that has compatible structure, for harder tasks.

  3. BootstrapFewShotWithRandomSearch: Applies BootstrapFewShot several times with random search over the generated demonstrations, and selects the best program found over the course of optimization. Parameters mirror those of BootstrapFewShot, with the addition of num_candidate_programs, which specifies the number of random programs evaluated during optimization, including candidates of the uncompiled program, the LabeledFewShot-optimized program, the BootstrapFewShot-compiled program with unshuffled examples, and num_candidate_programs of BootstrapFewShot-compiled programs with randomized example sets.

  4. KNNFewShot: Uses the k-Nearest Neighbors algorithm to find the nearest training example demonstrations for a given input example. These nearest-neighbor demonstrations are then used as the trainset for the BootstrapFewShot optimization process.
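
As mentioned above, here is a minimal usage sketch for these few-shot optimizers (my_program, my_metric, and trainset are placeholders):

from dspy.teleprompt import LabeledFewShot, BootstrapFewShot

# Plain labeled demos: k examples sampled from the trainset.
compiled_labeled = LabeledFewShot(k=8).compile(my_program, trainset=trainset)

# Bootstrapped demos, validated by the metric.
optimizer = BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=4, max_labeled_demos=8)
compiled_bootstrapped = optimizer.compile(my_program, trainset=trainset)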

Automatic Instruction Optimization

These optimizers produce optimized instructions for the prompt and, in the case of MIPROv2, can also optimize the set of few-shot demonstrations. A minimal MIPROv2 sketch appears after the list below.

  1. COPRO: Generates and refines new instructions for each step, and optimizes them with coordinate ascent (hill-climbing using the metric function and the trainset). Parameters include depth, which is the number of iterations of prompt improvement the optimizer runs.

  2. MIPROv2: Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generated instructions/demonstrations across your modules.

Automatic Finetuning

This optimizer is used to fine-tune the underlying LLM(s).

  1. BootstrapFinetune: Distills a prompt-based DSPy program into weight updates. The output is a DSPy program that has the same steps, but where each step is conducted by a finetuned model instead of a prompted LM.

Program Transformations

  1. Ensemble: Ensembles a set of DSPy programs and either uses the full set or randomly samples a subset into a single program.

Which optimizer should I use?

Ultimately, finding the ‘right’ optimizer to use & the best configuration for your task will require experimentation. Success in DSPy is still an iterative process - getting the best performance on your task will require you to explore and iterate.

That being said, here's the general guidance on getting started:

  • If you have very few examples (around 10), start with BootstrapFewShot.
  • If you have more data (50 examples or more), try BootstrapFewShotWithRandomSearch.
  • If you prefer to do instruction optimization only (i.e., you want to keep your prompt 0-shot), use MIPROv2 configured for 0-shot optimization (see the sketch after this list).
  • If you’re willing to use more inference calls to perform longer optimization runs (e.g. 40 trials or more), and have enough data (e.g. 200 examples or more to prevent overfitting) then try MIPROv2.
  • If you have been able to use one of these with a large LM (e.g., 7B parameters or above) and need a very efficient program, finetune a small LM for your task with BootstrapFinetune.
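
As a concrete example of the 0-shot setting above, the following sketch keeps the prompt demonstration-free and only optimizes instructions. It assumes `program`, `metric`, and `trainset` are defined as in the examples below.

```python
import dspy

tp = dspy.MIPROv2(metric=metric, auto="light")
zero_shot_program = tp.compile(
    program,
    trainset=trainset,
    max_bootstrapped_demos=0,  # no self-generated demos
    max_labeled_demos=0,       # no labeled demos; instructions only
)
```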

How do I use an optimizer?

They all share this general interface, with some differences in the keyword arguments (hyperparameters). Detailed documentation for key optimizers can be found here, and a full list can be found here.

Let's see this with the most common one, BootstrapFewShotWithRandomSearch.

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Set up the optimizer: we want 8-shot prompts for your program's steps,
# i.e. 4 self-generated ("bootstrapped") demos plus 4 labeled demos per prompt (see the config below).
# The optimizer will repeat this 10 times (plus some initial attempts) before selecting its best attempt on the devset.
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4, num_candidate_programs=10, num_threads=4)

teleprompter = BootstrapFewShotWithRandomSearch(metric=YOUR_METRIC_HERE, **config)
optimized_program = teleprompter.compile(YOUR_PROGRAM_HERE, trainset=YOUR_TRAINSET_HERE)

!!! info "Getting Started III: Optimizing the LM prompts or weights in DSPy programs" A typical simple optimization run costs on the order of $2 USD and takes around ten minutes, but be careful when running optimizers with very large LMs or very large datasets. Optimizer runs can cost as little as a few cents or up to tens of dollars, depending on your LM, dataset, and configuration.

=== "Optimizing prompts for a ReAct agent"
    This is a minimal but fully runnable example of setting up a `dspy.ReAct` agent that answers questions via
    search from Wikipedia and then optimizing it using `dspy.MIPROv2` in the cheap `light` mode on 500
    question-answer pairs sampled from the `HotPotQA` dataset.

    ```python linenums="1"
    import dspy
    from dspy.datasets import HotPotQA

    dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

    def search(query: str) -> list[str]:
        """Retrieves abstracts from Wikipedia."""
        results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
        return [x['text'] for x in results]

    trainset = [x.with_inputs('question') for x in HotPotQA(train_seed=2024, train_size=500).train]
    react = dspy.ReAct("question -> answer", tools=[search])

    tp = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light", num_threads=24)
    optimized_react = tp.compile(react, trainset=trainset)
    ```

    An informal run similar to this on DSPy 2.5.29 raises ReAct's score from 24% to 51%.

=== "Optimizing prompts for RAG"
    Given a retrieval index to `search`, your favorite `dspy.LM`, and a small `trainset` of questions and ground-truth responses, the following code snippet can optimize your RAG system with long outputs against the built-in `dspy.SemanticF1` metric, which is implemented as a DSPy module.

    ```python linenums="1"
    class RAG(dspy.Module):
        def __init__(self, num_docs=5):
            self.num_docs = num_docs
            self.respond = dspy.ChainOfThought('context, question -> response')

        def forward(self, question):
            context = search(question, k=self.num_docs)   # not defined in this snippet, see link above
            return self.respond(context=context, question=question)

    tp = dspy.MIPROv2(metric=dspy.SemanticF1(), auto="medium", num_threads=24)
    optimized_rag = tp.compile(RAG(), trainset=trainset, max_bootstrapped_demos=2, max_labeled_demos=2)
    ```

    For a complete RAG example that you can run, start this [tutorial](/tutorials/rag/). It improves the quality of a RAG system over a subset of StackExchange communities from 53% to 61%.

=== "Optimizing weights for Classification"
    This is a minimal but fully runnable example of setting up a `dspy.ChainOfThought` module that classifies
    short texts into one of 77 banking labels and then using `dspy.BootstrapFinetune` with 2000 text-label pairs
    from the `Banking77` dataset to finetune the weights of GPT-4o-mini for this task. We use the variant
    `dspy.ChainOfThoughtWithHint`, which takes an optional `hint` at bootstrapping time, to maximize the utility of
    the training data. Naturally, hints are not available at test time. More can be found in this [tutorial](/tutorials/classification_finetuning/).

    <details><summary>Click to show dataset setup code.</summary>

    ```python linenums="1"
    import random
    from typing import Literal
    from dspy.datasets import DataLoader
    from datasets import load_dataset

    # Load the Banking77 dataset.
    CLASSES = load_dataset("PolyAI/banking77", split="train", trust_remote_code=True).features['label'].names
    kwargs = dict(fields=("text", "label"), input_keys=("text",), split="train", trust_remote_code=True)

    # Load the first 2000 examples from the dataset, and assign a hint to each *training* example.
    trainset = [
        dspy.Example(x, hint=CLASSES[x.label], label=CLASSES[x.label]).with_inputs("text", "hint")
        for x in DataLoader().from_huggingface(dataset_name="PolyAI/banking77", **kwargs)[:2000]
    ]
    random.Random(0).shuffle(trainset)
    ```
    </details>

    ```python linenums="1"
    import dspy
    dspy.configure(lm=dspy.LM('gpt-4o-mini-2024-07-18'))
    
    # Define the DSPy module for classification. It will use the hint at training time, if available.
    signature = dspy.Signature("text -> label").with_updated_fields('label', type_=Literal[tuple(CLASSES)])
    classify = dspy.ChainOfThoughtWithHint(signature)

    # Optimize via BootstrapFinetune.
    optimizer = dspy.BootstrapFinetune(metric=(lambda x, y, trace=None: x.label == y.label), num_threads=24)
    optimized = optimizer.compile(classify, trainset=trainset)

    optimized(text="What does a pending cash withdrawal mean?")
    ```

    **Possible Output (from the last line):**
    ```text
    Prediction(
        reasoning='A pending cash withdrawal indicates that a request to withdraw cash has been initiated but has not yet been completed or processed. This status means that the transaction is still in progress and the funds have not yet been deducted from the account or made available to the user.',
        label='pending_cash_withdrawal'
    )
    ```

    An informal run similar to this on DSPy 2.5.29 raises GPT-4o-mini's score from 66% to 87%.

Saving and loading optimizer output

After running a program through an optimizer, it's useful to also save it. At a later point, a program can be loaded from a file and used for inference. For this, the load and save methods can be used.

optimized_program.save(YOUR_SAVE_PATH)

The resulting file is in plain-text JSON format. It contains all the parameters and steps in the source program. You can always read it and see what the optimizer generated. You can pass save_field_meta to additionally save the list of fields with the keys `name`, `field_type`, `description`, and `prefix`: `optimized_program.save(YOUR_SAVE_PATH, save_field_meta=True)`.

To load a program from a file, you can instantiate an object from that class and then call the load method on it.

loaded_program = YOUR_PROGRAM_CLASS()
loaded_program.load(path=YOUR_SAVE_PATH)


sidebar_position: 1

Optimization in DSPy

Once you have a system and a way to evaluate it, you can use DSPy optimizers to tune the prompts or weights in your program. Now it's useful to expand your data collection effort into building a training set and a held-out test set, in addition to the development set you've been using for exploration. For the training set (and its subset, validation set), you can often get substantial value out of 30 examples, but aim for at least 300 examples. Some optimizers accept a trainset only. Others ask for a trainset and a valset. For prompt optimizers, we suggest starting with a 20% split for training and 80% for validation, which is often the opposite of what one does for DNNs.
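
For example, a minimal sketch of that 20/80 split, assuming `data` is a list of `dspy.Example` objects you collected:

```python
import random

random.Random(0).shuffle(data)
split = int(0.2 * len(data))
trainset, valset = data[:split], data[split:]

# Optimizers that accept a valset can then be called like:
# optimizer.compile(program, trainset=trainset, valset=valset)
```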

After your first few optimization runs, you are either very happy with everything or you've made a lot of progress but you don't like something about the final program or the metric. At this point, go back to step 1 (Programming in DSPy) and revisit the major questions. Did you define your task well? Do you need to collect (or find online) more data for your problem? Do you want to update your metric? And do you want to use a more sophisticated optimizer? Do you need to consider advanced features like DSPy Assertions? Or, perhaps most importantly, do you want to add some more complexity or steps in your DSPy program itself? Do you want to use multiple optimizers in a sequence?

Iterative development is key. DSPy gives you the pieces to do that incrementally: iterating on your data, your program structure, your assertions, your metric, and your optimization steps. Optimizing complex LM programs is an entirely new paradigm that only exists in DSPy at the time of writing (update: there are now numerous DSPy extension frameworks, so this part is no longer true :-), so naturally the norms around what to do are still emerging. If you need help, we recently created a Discord server for the community.


DSPy Assertions

!!! warning "This page is outdated and may not be fully accurate in DSPy 2.5"

Introduction

Language models (LMs) have transformed how we interact with machine learning, offering vast capabilities in natural language understanding and generation. However, ensuring these models adhere to domain-specific constraints remains a challenge. Despite the growth of techniques like fine-tuning or “prompt engineering”, these approaches are extremely tedious and rely on heavy, manual hand-holding to guide the LMs in adhering to specific constraints. Even DSPy's modular approach to programming prompting pipelines lacks mechanisms to effectively and automatically enforce these constraints.

To address this, we introduce DSPy Assertions, a feature within the DSPy framework designed to automate the enforcement of computational constraints on LMs. DSPy Assertions empower developers to guide LMs towards desired outcomes with minimal manual intervention, enhancing the reliability, predictability, and correctness of LM outputs.

dspy.Assert and dspy.Suggest API

We introduce two primary constructs within DSPy Assertions:

  • dspy.Assert:

    • Parameters:
      • constraint (bool): Outcome of Python-defined boolean validation check.
      • msg (Optional[str]): User-defined error message providing feedback or correction guidance.
      • backtrack (Optional[module]): Specifies target module for retry attempts upon constraint failure. The default backtracking module is the last module before the assertion.
    • Behavior: Initiates retry upon failure, dynamically adjusting the pipeline's execution. If failures persist, it halts execution and raises a dspy.AssertionError.
  • dspy.Suggest:

    • Parameters: Similar to dspy.Assert.
    • Behavior: Encourages self-refinement through retries without enforcing hard stops. Logs failures after maximum backtracking attempts and continues execution.
  • dspy.Assert vs. Python Assertions: Unlike conventional Python assert statements that terminate the program upon failure, dspy.Assert conducts a sophisticated retry mechanism, allowing the pipeline to adjust.

Specifically, when a constraint is not met:

  • Backtracking Mechanism: An under-the-hood backtracking is initiated, offering the model a chance to self-refine and proceed, which is done through signature modification.
  • Dynamic Signature Modification: internally modifying your DSPy program’s Signature by adding the following fields:
    • Past Output: your model's past output that did not pass the validation_fn
    • Instruction: your user-defined feedback message on what went wrong and what possibly to fix

If the error continues past the max_backtracking_attempts, then dspy.Assert will halt the pipeline execution, alerting you with a dspy.AssertionError. This ensures your program doesn't continue executing with “bad” LM behavior and immediately highlights sample failure outputs for user assessment.

  • dspy.Suggest vs. dspy.Assert: dspy.Suggest, on the other hand, offers a softer approach. It maintains the same retry backtracking as dspy.Assert but instead serves as a gentle nudger. If the model outputs cannot pass the constraints after the max_backtracking_attempts, dspy.Suggest will log the persistent failure and continue execution of the program on the rest of the data. This ensures the LM pipeline works in a "best-effort" manner without halting execution.

  • dspy.Suggest statements are best utilized as "helpers" during the evaluation phase, offering guidance and potential corrections without halting the pipeline.

  • dspy.Assert statements are recommended during the development stage as "checkers" to ensure the LM behaves as expected, providing a robust mechanism for identifying and addressing errors early in the development cycle (see the sketch below).
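
For example, a hard-constraint counterpart to the dspy.Suggest calls shown later would look like the following sketch, meant to sit inside a module's forward method right after query generation (the message and target_module here are illustrative):

```python
# A sketch of dspy.Assert inside a module's forward(); on repeated failure it
# raises dspy.AssertionError instead of logging and continuing like dspy.Suggest.
dspy.Assert(
    len(query) <= 100,
    "Query should be short and less than 100 characters",
    target_module=self.generate_query,
)
```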

Use Case: Including Assertions in DSPy Programs

We start with an example of a multi-hop QA SimplifiedBaleen pipeline as defined in the intro walkthrough.

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=2, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        prev_queries = [question]

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            prev_queries.append(query)
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        
        pred = self.generate_answer(context=context, question=question)
        pred = dspy.Prediction(context=context, answer=pred.answer)
        return pred

baleen = SimplifiedBaleen()

baleen(question = "Which award did Gary Zukav's first book receive?")

To include DSPy Assertions, we simply define our validation functions and declare our assertions following the respective model generation.

For this use case, suppose we want to impose the following constraints:

  1. Length - each query should be less than 100 characters.
  2. Uniqueness - each generated query should differ from previously-generated queries.

We can define these validation checks as boolean functions:

# Simplistic boolean check for query length
len(query) <= 100

# Python function for validating distinct queries
def validate_query_distinction_local(previous_queries, query):
    """check if query is distinct from previous queries"""
    if previous_queries == []:
        return True
    if dspy.evaluate.answer_exact_match_str(query, previous_queries, frac=0.8):
        return False
    return True

We can declare these validation checks through dspy.Suggest statements (as we want to test the program in a best-effort demonstration). We want to place them right after the query generation step, query = self.generate_query[hop](context=context, question=question).query.

dspy.Suggest(
    len(query) <= 100,
    "Query should be short and less than 100 characters",
    target_module=self.generate_query
)

dspy.Suggest(
    validate_query_distinction_local(prev_queries, query),
    "Query should be distinct from: "
    + "; ".join(f"{i+1}) {q}" for i, q in enumerate(prev_queries)),
    target_module=self.generate_query
)

If you are doing a comparative evaluation of the effect of assertions, it is recommended to define the program with assertions separately from your original program. If not, feel free to add Assertions directly to your existing program!

Let's take a look at how the SimplifiedBaleen program will look with Assertions included:

class SimplifiedBaleenAssertions(dspy.Module):
    def __init__(self, passages_per_hop=2, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
        self.passed_suggestions = 0  # counts forward passes whose queries were all distinct

    def forward(self, question):
        context = []
        prev_queries = [question]

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query

            dspy.Suggest(
                len(query) <= 100,
                "Query should be short and less than 100 characters",
                target_module=self.generate_query
            )

            dspy.Suggest(
                validate_query_distinction_local(prev_queries, query),
                "Query should be distinct from: "
                + "; ".join(f"{i+1}) {q}" for i, q in enumerate(prev_queries)),
                target_module=self.generate_query
            )

            prev_queries.append(query)
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        
        # `all_queries_distinct` is a user-defined helper (not shown here) that returns True if no two queries match.
        if all_queries_distinct(prev_queries):
            self.passed_suggestions += 1

        pred = self.generate_answer(context=context, question=question)
        pred = dspy.Prediction(context=context, answer=pred.answer)
        return pred

Calling programs with DSPy Assertions now requires one last step: transforming the program to wrap it with the internal assertion backtracking and Retry logic.

import functools

from dspy.primitives.assertions import assert_transform_module, backtrack_handler

baleen_with_assertions = assert_transform_module(SimplifiedBaleenAssertions(), backtrack_handler)

# backtrack_handler is parameterized over a few settings for the backtracking mechanism
# To change the number of max retry attempts, you can do
baleen_with_assertions_retry_once = assert_transform_module(SimplifiedBaleenAssertions(), 
    functools.partial(backtrack_handler, max_backtracks=1))

Alternatively, you can also directly call activate_assertions on the program with dspy.Assert/Suggest statements using the default backtracking mechanism (max_backtracks=2):

baleen_with_assertions = SimplifiedBaleenAssertions().activate_assertions()

Now let's take a look at the internal LM backtracking by inspecting the history of the LM query generations. Here we see that when a query fails to pass the validation check of being less than 100 characters, its internal GenerateSearchQuery signature is dynamically modified during the backtracking+Retry process to include the past query and the corresponding user-defined instruction: "Query should be short and less than 100 characters".
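
One way to surface these prompts is to inspect the LM's recent calls. The following is a minimal sketch, assuming the LM was configured via dspy.configure; the two prompts shown below correspond to the original generation and the backtracked retry.

```python
# Print the last two LM calls (the failed generation and the retried one).
dspy.inspect_history(n=2)
```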

Write a simple search query that will help answer a complex question.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the query}. We ...

Query: ${query}

---

Context:
[1] «Kerry Condon | Kerry Condon (born 4 January 1983) is [...]»
[2] «Corona Riccardo | Corona Riccardo (c. 1878October 15, 1917) was [...]»

Question: Who acted in the shot film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet." ?

Reasoning: Let's think step by step in order to find the answer to this question. First, we need to identify the actress who played Ophelia in a Royal Shakespeare Company production of "Hamlet." Then, we need to find out if this actress also acted in the short film "The Shore."

Query: "actress who played Ophelia in Royal Shakespeare Company production of Hamlet" + "actress in short film The Shore"



Write a simple search query that will help answer a complex question.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Past Query: past output with errors

Instructions: Some instructions you must satisfy

Query: ${query}

---

Context:
[1] «Kerry Condon | Kerry Condon (born 4 January 1983) is an Irish television and film actress, best known for her role as Octavia of the Julii in the HBO/BBC series "Rome," as Stacey Ehrmantraut in AMC's "Better Call Saul" and as the voice of F.R.I.D.A.Y. in various films in the Marvel Cinematic Universe. She is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet."»
[2] «Corona Riccardo | Corona Riccardo (c. 1878October 15, 1917) was an Italian born American actress who had a brief Broadway stage career before leaving to become a wife and mother. Born in Naples she came to acting in 1894 playing a Mexican girl in a play at the Empire Theatre. Wilson Barrett engaged her for a role in his play "The Sign of the Cross" which he took on tour of the United States. Riccardo played the role of Ancaria and later played Berenice in the same play. Robert B. Mantell in 1898 who struck by her beauty also cast her in two Shakespeare plays, "Romeo and Juliet" and "Othello". Author Lewis Strang writing in 1899 said Riccardo was the most promising actress in America at the time. Towards the end of 1898 Mantell chose her for another Shakespeare part, Ophelia im Hamlet. Afterwards she was due to join Augustin Daly's Theatre Company but Daly died in 1899. In 1899 she gained her biggest fame by playing Iras in the first stage production of Ben-Hur.»

Question: Who acted in the shot film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet." ?

Past Query: "actress who played Ophelia in Royal Shakespeare Company production of Hamlet" + "actress in short film The Shore"

Instructions: Query should be short and less than 100 characters

Query: "actress Ophelia RSC Hamlet" + "actress The Shore"

Assertion-Driven Optimizations

DSPy Assertions work with optimizations that DSPy offers, particularly with BootstrapFewShotWithRandomSearch, including the following settings:

  • Compilation with Assertions: This includes assertion-driven example bootstrapping and counterexample bootstrapping during compilation. The teacher model for bootstrapping few-shot demonstrations can make use of DSPy Assertions to offer robust bootstrapped examples for the student model to learn from during inference. In this setting, the student model does not perform assertion-aware optimizations (backtracking and retry) during inference.
  • Compilation + Inference with Assertions: This includes assertion-driven optimizations in both compilation and inference. Now the teacher model offers assertion-driven examples but the student can further optimize with assertions of its own during inference time.

teleprompter = BootstrapFewShotWithRandomSearch(
    metric=validate_context_and_answer_and_hops,
    max_bootstrapped_demos=max_bootstrapped_demos,
    num_candidate_programs=6,
)

# Compilation with Assertions
compiled_with_assertions_baleen = teleprompter.compile(student = baleen, teacher = baleen_with_assertions, trainset = trainset, valset = devset)

# Compilation + Inference with Assertions
compiled_baleen_with_assertions = teleprompter.compile(student=baleen_with_assertions, teacher = baleen_with_assertions, trainset=trainset, valset=devset)


sidebar_position: 2

Language Models

The first step in any DSPy code is to set up your language model. For example, you can configure OpenAI's GPT-4o-mini as your default LM as follows.

import dspy

# Authenticate via `OPENAI_API_KEY` env: import os; os.environ['OPENAI_API_KEY'] = 'here'
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

!!! info "A few different LMs"

=== "OpenAI"
    You can authenticate by setting the `OPENAI_API_KEY` env variable or passing `api_key` below.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_OPENAI_API_KEY')
    dspy.configure(lm=lm)
    ```

=== "Gemini (AI Studio)"
    You can authenticate by setting the GEMINI_API_KEY env variable or passing `api_key` below.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('gemini/gemini-2.5-pro-preview-03-25', api_key='YOUR_GEMINI_API_KEY')
    dspy.configure(lm=lm)
    ```

=== "Anthropic"
    You can authenticate by setting the ANTHROPIC_API_KEY env variable or passing `api_key` below.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('anthropic/claude-3-opus-20240229', api_key='YOUR_ANTHROPIC_API_KEY')
    dspy.configure(lm=lm)
    ```

=== "Databricks"
    If you're on the Databricks platform, authentication is automatic via their SDK. If not, you can set the env variables `DATABRICKS_API_KEY` and `DATABRICKS_API_BASE`, or pass `api_key` and `api_base` below.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('databricks/databricks-meta-llama-3-1-70b-instruct')
    dspy.configure(lm=lm)
    ```

=== "Local LMs on a GPU server"
      First, install [SGLang](https://sgl-project.github.io/start/install.html) and launch its server with your LM.

      ```bash
      > pip install "sglang[all]"
      > pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ 

      > CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Meta-Llama-3-8B-Instruct
      ```

      Then, connect to it from your DSPy code as an OpenAI-compatible endpoint.

      ```python linenums="1"
      lm = dspy.LM("openai/meta-llama/Meta-Llama-3-8B-Instruct",
                       api_base="http://localhost:7501/v1",  # ensure this points to your port
                       api_key="", model_type='chat')
      dspy.configure(lm=lm)
      ```

=== "Local LMs on your laptop"
      First, install [Ollama](https://github.com/ollama/ollama) and launch its server with your LM.

      ```bash
      > curl -fsSL https://ollama.ai/install.sh | sh
      > ollama run llama3.2:1b
      ```

      Then, connect to it from your DSPy code.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434', api_key='')
    dspy.configure(lm=lm)
    ```

=== "Other providers"
    In DSPy, you can use any of the dozens of [LLM providers supported by LiteLLM](https://docs.litellm.ai/docs/providers). Simply follow their instructions for which `{PROVIDER}_API_KEY` to set and how to pass the `{provider_name}/{model_name}` to the constructor.

    Some examples:

    - `anyscale/mistralai/Mistral-7B-Instruct-v0.1`, with `ANYSCALE_API_KEY`
    - `together_ai/togethercomputer/llama-2-70b-chat`, with `TOGETHERAI_API_KEY`
    - `sagemaker/<your-endpoint-name>`, with `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION_NAME`
    - `azure/<your_deployment_name>`, with `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`, and the optional `AZURE_AD_TOKEN` and `AZURE_API_TYPE`

    
    If your provider offers an OpenAI-compatible endpoint, just add an `openai/` prefix to your full model name.

    ```python linenums="1"
    import dspy
    lm = dspy.LM('openai/your-model-name', api_key='PROVIDER_API_KEY', api_base='YOUR_PROVIDER_URL')
    dspy.configure(lm=lm)
    ```

Calling the LM directly.

It's easy to call the lm you configured above directly. This gives you a unified API and lets you benefit from utilities like automatic caching.

lm("Say this is a test!", temperature=0.7)  # => ['This is a test!']
lm(messages=[{"role": "user", "content": "Say this is a test!"}])  # => ['This is a test!']

Using the LM with DSPy modules.

Idiomatic DSPy involves using modules, which we discuss in the next guide.

# Define a module (ChainOfThought) and assign it a signature (return an answer, given a question).
qa = dspy.ChainOfThought('question -> answer')

# Run with the default LM configured with `dspy.configure` above.
response = qa(question="How many floors are in the castle David Gregory inherited?")
print(response.answer)

Possible Output:

The castle David Gregory inherited has 7 floors.

Using multiple LMs.

You can change the default LM globally with dspy.configure or change it inside a block of code with dspy.context.

!!! tip "Using dspy.configure and dspy.context is thread-safe!"

dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))
response = qa(question="How many floors are in the castle David Gregory inherited?")
print('GPT-4o-mini:', response.answer)

with dspy.context(lm=dspy.LM('openai/gpt-3.5-turbo')):
    response = qa(question="How many floors are in the castle David Gregory inherited?")
    print('GPT-3.5-turbo:', response.answer)

Possible Output:

GPT-4o-mini: The number of floors in the castle David Gregory inherited cannot be determined with the information provided.
GPT-3.5-turbo: The castle David Gregory inherited has 7 floors.

Configuring LM generation.

For any LM, you can configure any of the following attributes at initialization or in each subsequent call.

gpt_4o_mini = dspy.LM('openai/gpt-4o-mini', temperature=0.9, max_tokens=3000, stop=None, cache=False)

By default LMs in DSPy are cached. If you repeat the same call, you will get the same outputs. But you can turn off caching by setting cache=False.

Inspecting output and usage metadata.

Every LM object maintains the history of its interactions, including inputs, outputs, token usage (and $$$ cost), and metadata.

len(lm.history)  # e.g., 3 calls to the LM

lm.history[-1].keys()  # access the last call to the LM, with all metadata

Output:

dict_keys(['prompt', 'messages', 'kwargs', 'response', 'outputs', 'usage', 'cost'])
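
For instance, here is a minimal sketch of aggregating spend from that history; note that the 'cost' field may be None for providers that don't report pricing.

```python
# Sum reported cost across all recorded calls on this LM object.
total_cost = sum(entry['cost'] or 0.0 for entry in lm.history)
print(f"{len(lm.history)} calls, total cost ~ ${total_cost:.4f}")
```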

Advanced: Building custom LMs and writing your own Adapters.

Though rarely needed, you can write custom LMs by inheriting from dspy.BaseLM. Another advanced layer in the DSPy ecosystem is that of adapters, which sit between DSPy signatures and LMs. A future version of this guide will discuss these advanced features, though you likely don't need them.



sidebar_position: 3

Modules

A DSPy module is a building block for programs that use LMs.

  • Each built-in module abstracts a prompting technique (like chain of thought or ReAct). Crucially, they are generalized to handle any signature.

  • A DSPy module has learnable parameters (i.e., the little pieces comprising the prompt and the LM weights) and can be invoked (called) to process inputs and return outputs.

  • Multiple modules can be composed into bigger modules (programs). DSPy modules are inspired directly by NN modules in PyTorch, but applied to LM programs.

How do I use a built-in module, like dspy.Predict or dspy.ChainOfThought?

Let's start with the most fundamental module, dspy.Predict. Internally, all other DSPy modules are built using dspy.Predict. We'll assume you are already at least a little familiar with DSPy signatures, which are declarative specs for defining the behavior of any module we use in DSPy.

To use a module, we first declare it by giving it a signature. Then we call the module with the input arguments, and extract the output fields!

sentence = "it's a charming and often affecting journey."  # example from the SST-2 dataset.

# 1) Declare with a signature.
classify = dspy.Predict('sentence -> sentiment: bool')

# 2) Call with input argument(s). 
response = classify(sentence=sentence)

# 3) Access the output.
print(response.sentiment)

Output:

True

When we declare a module, we can pass configuration keys to it.

Below, we'll pass n=5 to request five completions. We can also pass temperature or max_len, etc.

Let's use dspy.ChainOfThought. In many cases, simply swapping dspy.ChainOfThought in place of dspy.Predict improves quality.

question = "What's something great about the ColBERT retrieval model?"

# 1) Declare with a signature, and pass some config.
classify = dspy.ChainOfThought('question -> answer', n=5)

# 2) Call with input argument.
response = classify(question=question)

# 3) Access the outputs.
response.completions.answer

Possible Output:

['One great thing about the ColBERT retrieval model is its superior efficiency and effectiveness compared to other models.',
 'Its ability to efficiently retrieve relevant information from large document collections.',
 'One great thing about the ColBERT retrieval model is its superior performance compared to other models and its efficient use of pre-trained language models.',
 'One great thing about the ColBERT retrieval model is its superior efficiency and accuracy compared to other models.',
 'One great thing about the ColBERT retrieval model is its ability to incorporate user feedback and support complex queries.']

Let's discuss the output object here. The dspy.ChainOfThought module will generally inject a reasoning before the output field(s) of your signature.

Let's inspect the (first) reasoning and answer!

print(f"Reasoning: {response.reasoning}")
print(f"Answer: {response.answer}")

Possible Output:

Reasoning: We can consider the fact that ColBERT has shown to outperform other state-of-the-art retrieval models in terms of efficiency and effectiveness. It uses contextualized embeddings and performs document retrieval in a way that is both accurate and scalable.
Answer: One great thing about the ColBERT retrieval model is its superior efficiency and effectiveness compared to other models.

This is accessible whether we request one or many completions.

We can also access the different completions as a list of Predictions or as several lists, one for each field.

response.completions[3].reasoning == response.completions.reasoning[3]

Output:

True

What other DSPy modules are there? How can I use them?

The others are very similar. They mainly change the internal behavior with which your signature is implemented!

  1. dspy.Predict: Basic predictor. Does not modify the signature. Handles the key forms of learning (i.e., storing the instructions and demonstrations and updates to the LM).

  2. dspy.ChainOfThought: Teaches the LM to think step-by-step before committing to the signature's response.

  3. dspy.ProgramOfThought: Teaches the LM to output code, whose execution results will dictate the response.

  4. dspy.ReAct: An agent that can use tools to implement the given signature.

  5. dspy.MultiChainComparison: Can compare multiple outputs from ChainOfThought to produce a final prediction.

We also have some function-style modules:

  1. dspy.majority: Can do basic voting to return the most popular response from a set of predictions.

!!! info "A few examples of DSPy modules on simple tasks." Try the examples below after configuring your lm. Adjust the fields to explore what tasks your LM can do well out of the box.

=== "Math"

    ```python linenums="1"
    math = dspy.ChainOfThought("question -> answer: float")
    math(question="Two dice are tossed. What is the probability that the sum equals two?")
    ```
    
    **Possible Output:**
    ```text
    Prediction(
        reasoning='When two dice are tossed, each die has 6 faces, resulting in a total of 6 x 6 = 36 possible outcomes. The sum of the numbers on the two dice equals two only when both dice show a 1. This is just one specific outcome: (1, 1). Therefore, there is only 1 favorable outcome. The probability of the sum being two is the number of favorable outcomes divided by the total number of possible outcomes, which is 1/36.',
        answer=0.0277776
    )
    ```

=== "Retrieval-Augmented Generation"

    ```python linenums="1"       
    def search(query: str) -> list[str]:
        """Retrieves abstracts from Wikipedia."""
        results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
        return [x['text'] for x in results]
    
    rag = dspy.ChainOfThought('context, question -> response')

    question = "What's the name of the castle that David Gregory inherited?"
    rag(context=search(question), question=question)
    ```
    
    **Possible Output:**
    ```text
    Prediction(
        reasoning='The context provides information about David Gregory, a Scottish physician and inventor. It specifically mentions that he inherited Kinnairdy Castle in 1664. This detail directly answers the question about the name of the castle that David Gregory inherited.',
        response='Kinnairdy Castle'
    )
    ```

=== "Classification"

    ```python linenums="1"
    from typing import Literal

    class Classify(dspy.Signature):
        """Classify sentiment of a given sentence."""
        
        sentence: str = dspy.InputField()
        sentiment: Literal['positive', 'negative', 'neutral'] = dspy.OutputField()
        confidence: float = dspy.OutputField()

    classify = dspy.Predict(Classify)
    classify(sentence="This book was super fun to read, though not the last chapter.")
    ```
    
    **Possible Output:**

    ```text
    Prediction(
        sentiment='positive',
        confidence=0.75
    )
    ```

=== "Information Extraction"

    ```python linenums="1"        
    text = "Apple Inc. announced its latest iPhone 14 today. The CEO, Tim Cook, highlighted its new features in a press release."

    module = dspy.Predict("text -> title, headings: list[str], entities_and_metadata: list[dict[str, str]]")
    response = module(text=text)

    print(response.title)
    print(response.headings)
    print(response.entities_and_metadata)
    ```
    
    **Possible Output:**
    ```text
    Apple Unveils iPhone 14
    ['Introduction', 'Key Features', "CEO's Statement"]
    [{'entity': 'Apple Inc.', 'type': 'Organization'}, {'entity': 'iPhone 14', 'type': 'Product'}, {'entity': 'Tim Cook', 'type': 'Person'}]
    ```

=== "Agents"

    ```python linenums="1"       
    def evaluate_math(expression: str) -> float:
        return dspy.PythonInterpreter({}).execute(expression)

    def search_wikipedia(query: str) -> list[str]:
        results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
        return [x['text'] for x in results]

    react = dspy.ReAct("question -> answer: float", tools=[evaluate_math, search_wikipedia])

    pred = react(question="What is 9362158 divided by the year of birth of David Gregory of Kinnairdy castle?")
    print(pred.answer)
    ```
    
    **Possible Output:**

    ```text
    5761.328
    ```

How do I compose multiple modules into a bigger program?

DSPy is just Python code that uses modules in any control flow you like, with a little magic internally at compile time to trace your LM calls. What this means is that you can just call the modules freely.

See tutorials like multi-hop search, whose module is reproduced below as an example.

class Hop(dspy.Module):
    def __init__(self, num_docs=10, num_hops=4):
        self.num_docs, self.num_hops = num_docs, num_hops
        self.generate_query = dspy.ChainOfThought('claim, notes -> query')
        self.append_notes = dspy.ChainOfThought('claim, notes, context -> new_notes: list[str], titles: list[str]')

    def forward(self, claim: str) -> dspy.Prediction:
        notes = []
        titles = []

        for _ in range(self.num_hops):
            query = self.generate_query(claim=claim, notes=notes).query
            context = search(query, k=self.num_docs)
            prediction = self.append_notes(claim=claim, notes=notes, context=context)
            notes.extend(prediction.new_notes)
            titles.extend(prediction.titles)
        
        return dspy.Prediction(notes=notes, titles=list(set(titles)))

How do I track LM usage?

!!! note "Version Requirement" LM usage tracking is available in DSPy version 2.6.16 and later.

DSPy provides built-in tracking of language model usage across all module calls. To enable tracking:

dspy.settings.configure(track_usage=True)

Once enabled, you can access usage statistics from any dspy.Prediction object:

usage = prediction_instance.get_lm_usage()

The usage data is returned as a dictionary that maps each language model name to its usage statistics. Here's a complete example:

import dspy

# Configure DSPy with tracking enabled
dspy.settings.configure(
    lm=dspy.LM("openai/gpt-4o-mini", cache=False),
    track_usage=True
)

# Define a simple program that makes multiple LM calls
class MyProgram(dspy.Module):
    def __init__(self):
        self.predict1 = dspy.ChainOfThought("question -> answer")
        self.predict2 = dspy.ChainOfThought("question, answer -> score")

    def __call__(self, question: str) -> dspy.Prediction:
        answer = self.predict1(question=question)
        score = self.predict2(question=question, answer=answer)
        return score

# Run the program and check usage
program = MyProgram()
output = program(question="What is the capital of France?")
print(output.get_lm_usage())

This will output usage statistics like:

{
    'openai/gpt-4o-mini': {
        'completion_tokens': 61,
        'prompt_tokens': 260,
        'total_tokens': 321,
        'completion_tokens_details': {
            'accepted_prediction_tokens': 0,
            'audio_tokens': 0,
            'reasoning_tokens': 0,
            'rejected_prediction_tokens': 0,
            'text_tokens': None
        },
        'prompt_tokens_details': {
            'audio_tokens': 0,
            'cached_tokens': 0,
            'text_tokens': None,
            'image_tokens': None
        }
    }
}

When using DSPy's caching features (either in-memory or on-disk via litellm), cached responses won't count toward usage statistics. For example:

# Enable caching
dspy.settings.configure(
    lm=dspy.LM("openai/gpt-4o-mini", cache=True),
    track_usage=True
)

program = MyProgram()

# First call - will show usage statistics
output = program(question="What is the capital of Zambia?")
print(output.get_lm_usage())  # Shows token usage

# Second call - same question, will use cache
output = program(question="What is the capital of Zambia?")
print(output.get_lm_usage())  # Shows empty dict: {}


sidebar_position: 1

Programming in DSPy

DSPy is a bet on writing code instead of strings. In other words, building the right control flow is crucial. Start by defining your task. What are the inputs to your system and what should your system produce as output? Is it a chatbot over your data or perhaps a code assistant? Or maybe a system for translation, for highlighting snippets from search results, or for generating reports with citations?

Next, define your initial pipeline. Can your DSPy program just be a single module or do you need to break it down into a few steps? Do you need retrieval or other tools, like a calculator or a calendar API? Is there a typical workflow for solving your problem in multiple well-scoped steps, or do you want more open-ended tool use with agents for your task? Think about these but start simple, perhaps with just a single dspy.ChainOfThought module, then add complexity incrementally based on observations.

As you do this, craft and try a handful of examples of the inputs to your program. Consider using a powerful LM at this point, or a couple of different LMs, just to understand what's possible. Record interesting (both easy and hard) examples you try. This will be useful when you are doing evaluation and optimization later.

??? "Beyond encouraging good design patterns, how does DSPy help here?"

Conventional prompts couple your fundamental system architecture with incidental choices not portable to new LMs, objectives, or pipelines. A conventional prompt asks the LM to take some inputs and produce some outputs of certain types (a _signature_), formats the inputs in certain ways and requests outputs in a form it can parse accurately (an _adapter_), asks the LM to apply certain strategies like "thinking step by step" or using tools (a _module_'s logic), and relies on substantial trial-and-error to discover the right way to ask each LM to do this (a form of manual _optimization_).

DSPy separates these concerns and automates the lower-level ones until you need to consider them. This allow you to write much shorter code, with much higher portability. For example, if you write a program using DSPy modules, you can swap the LM or its adapter without changing the rest of your logic. Or you can exchange one _module_, like `dspy.ChainOfThought`, with another, like `dspy.ProgramOfThought`, without modifying your signatures. When you're ready to use optimizers, the same program can have its prompts optimized or its LM weights fine-tuned.
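
As a small illustration of that portability, here is a sketch using the modules and signature syntax covered in the following guides; the signature and question are illustrative.

```python
import dspy

signature = 'question -> answer: float'

solver = dspy.ChainOfThought(signature)      # step-by-step reasoning in text
# solver = dspy.ProgramOfThought(signature)  # or: swap in code-based reasoning, same signature

with dspy.context(lm=dspy.LM('openai/gpt-4o-mini')):
    print(solver(question="What is 7 times 8?").answer)
```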


sidebar_position: 2

Signatures

When we assign tasks to LMs in DSPy, we specify the behavior we need as a Signature.

A signature is a declarative specification of input/output behavior of a DSPy module. Signatures allow you to tell the LM what it needs to do, rather than specify how we should ask the LM to do it.

You're probably familiar with function signatures, which specify the input and output arguments and their types. DSPy signatures are similar, but with a couple of differences. While typical function signatures just describe things, DSPy Signatures declare and initialize the behavior of modules. Moreover, the field names matter in DSPy Signatures. You express semantic roles in plain English: a question is different from an answer, a sql_query is different from python_code.

Why should I use a DSPy Signature?

For modular and clean code, in which LM calls can be optimized into high-quality prompts (or automatic finetunes). Most people coerce LMs to do tasks by hacking long, brittle prompts. Or by collecting/generating data for fine-tuning. Writing signatures is far more modular, adaptive, and reproducible than hacking at prompts or finetunes. The DSPy compiler will figure out how to build a highly-optimized prompt for your LM (or finetune your small LM) for your signature, on your data, and within your pipeline. In many cases, we found that compiling leads to better prompts than humans write. Not because DSPy optimizers are more creative than humans, but simply because they can try more things and tune the metrics directly.

Inline DSPy Signatures

Signatures can be defined as a short string, with argument names and optional types that define semantic roles for inputs/outputs.

  1. Question Answering: "question -> answer", which is equivalent to "question: str -> answer: str" as the default type is always str

  2. Sentiment Classification: "sentence -> sentiment: bool", e.g. True if positive

  3. Summarization: "document -> summary"

Your signatures can also have multiple input/output fields with types:

  1. Retrieval-Augmented Question Answering: "context: list[str], question: str -> answer: str"

  2. Multiple-Choice Question Answering with Reasoning: "question, choices: list[str] -> reasoning: str, selection: int"

Tip: For fields, any valid variable names work! Field names should be semantically meaningful, but start simple and don't prematurely optimize keywords! Leave that kind of hacking to the DSPy compiler. For example, for summarization, it's probably fine to say "document -> summary", "text -> gist", or "long_context -> tldr".

Example A: Sentiment Classification

sentence = "it's a charming and often affecting journey."  # example from the SST-2 dataset.

classify = dspy.Predict('sentence -> sentiment: bool')  # we'll see an example with Literal[] later
classify(sentence=sentence).sentiment

Output:

True

Example B: Summarization

# Example from the XSum dataset.
document = """The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."""

summarize = dspy.ChainOfThought('document -> summary')
response = summarize(document=document)

print(response.summary)

Possible Output:

The 21-year-old Lee made seven appearances and scored one goal for West Ham last season. He had loan spells in League One with Blackpool and Colchester United, scoring twice for the latter. He has now signed a contract with Barnsley, but the length of the contract has not been revealed.

Many DSPy modules (except dspy.Predict) return auxiliary information by expanding your signature under the hood.

For example, dspy.ChainOfThought also adds a reasoning field that includes the LM's reasoning before it generates the output summary.

print("Reasoning:", response.reasoning)

Possible Output:

Reasoning: We need to highlight Lee's performance for West Ham, his loan spells in League One, and his new contract with Barnsley. We also need to mention that his contract length has not been disclosed.

Class-based DSPy Signatures

For some advanced tasks, you need more verbose signatures. This is typically to:

  1. Clarify something about the nature of the task (expressed below as a docstring).

  2. Supply hints on the nature of an input field, expressed as a desc keyword argument for dspy.InputField.

  3. Supply constraints on an output field, expressed as a desc keyword argument for dspy.OutputField.

Example C: Classification

from typing import Literal

class Emotion(dspy.Signature):
    """Classify emotion."""
    
    sentence: str = dspy.InputField()
    sentiment: Literal['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'] = dspy.OutputField()

sentence = "i started feeling a little vulnerable when the giant spotlight started blinding me"  # from dair-ai/emotion

classify = dspy.Predict(Emotion)
classify(sentence=sentence)

Possible Output:

Prediction(
    sentiment='fear'
)

Tip: There's nothing wrong with specifying your requests to the LM more clearly. Class-based Signatures help you with that. However, don't prematurely tune the keywords of your signature by hand. The DSPy optimizers will likely do a better job (and will transfer better across LMs).

Example D: A metric that evaluates faithfulness to citations

class CheckCitationFaithfulness(dspy.Signature):
    """Verify that the text is based on the provided context."""

    context: str = dspy.InputField(desc="facts here are assumed to be true")
    text: str = dspy.InputField()
    faithfulness: bool = dspy.OutputField()
    evidence: dict[str, list[str]] = dspy.OutputField(desc="Supporting evidence for claims")

context = "The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."

text = "Lee scored 3 goals for Colchester United."

faithfulness = dspy.ChainOfThought(CheckCitationFaithfulness)
faithfulness(context=context, text=text)

Possible Output:

Prediction(
    reasoning="Let's check the claims against the context. The text states Lee scored 3 goals for Colchester United, but the context clearly states 'He scored twice for the U's'. This is a direct contradiction.",
    faithfulness=False,
    evidence={'goal_count': ["scored twice for the U's"]}
)

Example E: Multi-modal image classification

class DogPictureSignature(dspy.Signature):
    """Output the dog breed of the dog in the image."""
    image_1: dspy.Image = dspy.InputField(desc="An image of a dog")
    answer: str = dspy.OutputField(desc="The dog breed of the dog in the image")

image_url = "https://picsum.photos/id/237/200/300"
classify = dspy.Predict(DogPictureSignature)
classify(image_1=dspy.Image.from_url(image_url))

Possible Output:

Prediction(
    answer='Labrador Retriever'
)

Using signatures to build modules & compiling them

While signatures are convenient for prototyping with structured inputs/outputs, that's not the only reason to use them!

You should compose multiple signatures into bigger DSPy modules and compile these modules into optimized prompts and finetunes.
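
For example, here is a minimal sketch of composing two signatures inside one module; the signature strings and field names are illustrative, and the resulting module can later be compiled with any optimizer.

```python
import dspy

class Outliner(dspy.Module):
    def __init__(self):
        self.outline = dspy.ChainOfThought('topic -> outline: list[str]')
        self.draft = dspy.ChainOfThought('topic, outline: list[str] -> article')

    def forward(self, topic: str):
        outline = self.outline(topic=topic).outline
        return self.draft(topic=topic, outline=outline)

# Later, e.g.: dspy.MIPROv2(metric=my_metric, auto="light").compile(Outliner(), trainset=trainset)
```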


Using DSPy in Production



draft: true

!!! warning "This document is from Aug 2024. Since then, DSPy 2.5 and 2.6 were released, DSPy has grown considerably, and 3.0 is approaching! Content below is highly outdated."

Roadmap Sketch: DSPy 2.5+

It’s been a year since DSPy evolved out of Demonstrate–Search–Predict (DSP), whose research started at Stanford NLP all the way back in February 2022. Thanks to 200 wonderful contributors, DSPy has introduced tens of thousands of people to building modular LM programs and optimizing their prompts and weights automatically. In this time, DSPy has grown to 160,000 monthly downloads and 16,000 stars on GitHub, becoming synonymous with prompt optimization in many circles and inspiring at least a half-dozen cool new libraries.

This document is an initial sketch of DSPy’s public roadmap for the next few weeks and months, as we work on DSPy 2.5 and plan for DSPy 3.0. Suggestions and open-source contributors are more than welcome: just open an issue or submit a pull request regarding the roadmap.

Technical Objectives

The thesis of DSPy is that for LMs to be useful, we have to shift from ad-hoc prompting to new notions of programming LMs. Instead of relying on LMs gaining much more general or more compositional capabilities, we need to enable developers to iteratively explore their problems and build modular software that invokes LMs for well-scoped tasks. We need to enable that through modules and optimizers that isolate how they decompose their problems and describe their system's objectives from how their LMs are invoked or fine-tuned to maximize their objectives. DSPy's goal has been to develop (and to build the community and shared infrastructure for the collective development of) the abstractions, programming patterns, and optimizers toward this thesis.

To a first approximation, DSPy’s current user-facing language has the minimum number of appropriate abstractions that address the goals above: declarative signatures, define-by-run modules, and optimizers that can be composed quite powerfully. But there are several things we need to do better to realize our goals. The upcoming DSPy releases will have the following objectives.

  1. Polishing the core functionality.
  2. Developing more accurate, lower-cost optimizers.
  3. Building end-to-end tutorials from DSPy’s ML workflow to deployment.
  4. Shifting towards more interactive optimization & tracking.

Team & Organization

DSPy is fairly unusual in its technical objectives, contributors, and audience. Though DSPy takes inspiration from PyTorch, a library for building and optimizing DNNs, there is one major difference: PyTorch was introduced well after DNNs were mature ML concepts, but DSPy seeks to establish and advance core LM Programs research: the framework is propelled by constant academic research from programming abstractions (like the original Demonstrate–Search–Predict concepts, DSPy Signatures, or LM Assertions) to NLP systems (like STORM, PATH, and IReRa) to prompt optimizers (like MIPRO) and RL (like BetterTogether), among many other related directions.

This research all composes into a concrete, practical library, thanks to dozens of industry contributors, many of whom are deploying apps in production using DSPy. Because of this, DSPy reaches not only grad students and ML engineers, but also many non-ML engineers, from early adopter SWEs to hobbyists exploring new ways of using LMs. The following team, with help from many folks in the OSS community, is working towards the objectives in this Roadmap.

Project Lead: Omar Khattab (Stanford & Databricks)

Project Mentors: Chris Potts (Stanford), Matei Zaharia (UC Berkeley & Databricks), Heather Miller (CMU & Two Sigma)

Core Library: Arnav Singhvi (Databricks & Stanford), Herumb Shandilya (Stanford), Hanna Moazam (Databricks), Sri Vardhamanan (Dashworks), Cyrus Nouroozi (Zenbase), Amir Mehr (Zenbase), Kyle Caverly (Modular), with special thanks to Keshav Santhanam (Stanford), Thomas Ahle (Normal Computing), Connor Shorten (Weaviate)

Prompt Optimization: Krista Opsahl-Ong (Stanford), Michael Ryan (Stanford), Josh Purtell (Basis), with special thanks to Eric Zhang (Stanford)

Finetuning & RL: Dilara Soylu (Stanford), Isaac Miller (Anyscale), Karel D'Oosterlinck (Ghent), with special thanks to Paridhi Masehswari (Stanford)

PL Abstractions: Shangyin Tan (UC Berkeley), Manish Shetty (UC Berkeley), Peter Zhong (CMU)

Applications: Jasper Xian (Waterloo), Saron Samuel (Stanford), Alberto Mancarella (Stanford), Faraz Khoubsirat (Waterloo), Saiful Haq (IIT-B), Ashutosh Sharma (UIUC)

1) Polishing the core functionality.

Over the next month, polishing is the main objective and likely the one to have the highest ROI on the experience of the average user. Conceptually, DSPy has an extremely small core. It’s nothing but (1) LMs, (2) Signatures & Modules, (3) Optimizers, and (4) Assertions. These concepts and their implementations evolved organically over the past couple of years. We are working now to consolidate what we’ve learned and refactor internally so that things “just work” out of the box for new users, who may not know all the tips-and-tricks just yet.

More concretely:

  1. We want to increase the quality of zero-shot, off-the-shelf DSPy programs, i.e. those not yet compiled on custom data.
  2. Wherever possible, DSPy should delegate lower-level internal complexity (like managing LMs and structured generation) to emerging lower-level libraries. When required, we may fork smaller libraries out of DSPy to support infrastructure pieces as their own projects.
  3. DSPy should internally be more modular, and we need higher compatibility between internal components. Specifically, we need deeper and more native investment in (i) typed multi-field constraints, (ii) assertions, (iii) observability and experiment tracking, (iv) deployment of artifacts and related concerns like streaming and async, and (v) fine-tuning and serving open models.

On LMs

As of DSPy 2.4, the library has approximately 20,000 lines of code and roughly another 10,000 lines of code for tests, examples, and documentation. Some of these are clearly necessary (e.g., DSPy optimizers) but others exist only because the LM space lacks the building blocks we need under the hood. Luckily, for LM interfaces, a very strong library now exists: LiteLLM, a library that unifies interfaces to various LM and embedding providers. We expect to remove around 6,000 lines of code that support custom LMs and retrieval models by shifting much of that to LiteLLM.

Objectives in this space include improved caching, saving/loading of LMs, and support for streaming and async LM requests. Work here is currently led by Hanna Moazam and Sri Vardhamanan, building on a foundation by Cyrus Nouroozi, Amir Mehr, Kyle Caverly, and others.

On Signatures & Modules

Traditionally, LMs offer text-in-text-out interfaces. Toward modular programming, DSPy introduced signatures for the first time (as DSP Templates in Jan 2023) as a way to structure the inputs and outputs of LM interactions. Standard prompts conflate interface (“what should the LM do?”) with implementation (“how do we tell it to do that?”). DSPy signatures isolate the former so we can infer and learn the latter from data — in the context of a bigger program. Today in the LM landscape, notions of "structured outputs" have evolved dramatically, thanks to constrained decoding and other improvements, and have become mainstream. What may be called "structured inputs" has yet to become mainstream outside of DSPy, but it is just as crucial.

Objectives in this space include refining the abstractions and implementations of a first-class notion of LM Adapters in DSPy, as translators that sit between signatures and LM interfaces. While Optimizers adjust prompts through interactions with a user-supplied metric and data, Adapters are more concerned with building up interactions with LMs to account for, e.g., (i) non-plaintext LM interfaces like chat APIs, structured outputs, function calling, and multi-modal APIs, and (ii) languages beyond English or other forms of higher-level specialization. This has been explored in DSPy on and off in various forms, but we have started working on more fundamental approaches to this problem that will offer tangible improvements to most use cases. Work here is currently led by Omar Khattab.

On Finetuning & Serving

In February 2023, DSPy introduced the notion of compiling to optimize the weights of an LM program. (To understand just how long ago that was in AI terms, this was before the Alpaca training project at Stanford had even started and a month before the first GPT-4 was released.) Since then, we have shown in October 2023 and, much more expansively, in July 2024, that the fine-tuning flavor of DSPy can deliver large gains for small LMs, especially when composed with prompt optimization.

Overall, though, most DSPy users in practice explore prompt optimization and not weight optimization and most of our examples do the same. The primary reason for a lot of this is infrastructure. Fine-tuning in the DSPy flavor is more than just training a model: ultimately, we need to bootstrap training data for several different modules in a program, train multiple models and handle model selection, and then load and plug in those models into the program's modules. Doing this robustly at the level of abstraction DSPy offers requires a level of resource management that is not generally supported by external existing tools. Major efforts in this regard are currently led by Dilara Soylu and Isaac Miller.

On Optimizers & Assertions

This is a naturally major direction in the course of polishing. We will share more thoughts here after making more progress on the three angles above.

2) Developing more accurate, lower-cost optimizers.

A very large fraction of the research in DSPy focuses on optimizing the prompts and the weights of LM programs. In December 2022, we introduced the algorithm and abstractions behind BootstrapFewShot (as Demonstrate in DSP) and several of its variants. In February 2023, we introduced the core version of what later became BootstrapFinetune. In August 2023, we introduced new variations of both of these. In December 2023, we introduced the first couple of instruction optimizers into DSPy, CA-OPRO and early versions of MIPRO. These were again upgraded in March 2024. Fast forward to June and July 2024, we released MIPROv2 for prompt optimization and BetterTogether for fine-tuning the weights of LM programs.

We have been working towards a number of stronger optimizers. While we cannot share the internal details of research on new optimizers yet, we can outline the goals. A DSPy optimizer can be characterized via three angles:

  1. Quality: How much quality can it deliver from various LMs? How effective does it need the zero-shot program to be in order to work well?
  2. Cost: How many labeled (and unlabeled) inputs does it need? How many invocations of the program does it need? How expensive is the resulting optimized program at inference time?
  3. Robustness: How well can it generalize to different unseen data points or distributions? How sensitive is it to mistakes of the metric or labels?

Over the next six months, our goal is to dramatically improve each angle of these when the other two are held constant. Concretely, there are three directions here.

  • Benchmarking: A key prerequisite here is work on benchmarking. On the team, Michael Ryan and Shangyin Tan are leading these efforts. More soon.

  • Quality: The goal here is optimizers that extract, on average, 20% more on representative tasks than MIPROv2 and BetterTogether, under the usual conditions — like a few hundred inputs with labels and a good metric starting from a decent zero-shot program. Various efforts here are led by Dilara Soylu, Michael Ryan, Josh Purtell, Krista Opsahl-Ong, and Isaac Miller.

  • Efficiency: The goal here is optimizers that match the current best scores from MIPROv2 and BetterTogether but under one or two challenges like: (i) starting from only 10-20 inputs with labels, (ii) starting with a weak zero-shot program that scores 0%, (iii) where significant misalignment exists between train/validation and test, or (iv) where the user supplies no metric but provides a very small number of output judgments.

3) Building end-to-end tutorials from DSPy’s ML workflow to deployment.

Using DSPy well for solving a new task is just doing good machine learning with LMs, but teaching this is hard. On the one hand, it's an iterative process: you make some initial choices, which will be sub-optimal, and then you refine them incrementally. It's highly exploratory: it's often the case that no one knows yet how to best solve a problem in a DSPy-esque way. On the other hand, DSPy offers many emerging lessons from several years of building LM systems, in which the design space, the data regime, and many other factors are new both to ML experts and to the very large fraction of users that have no ML experience.

Though the current docs do address a bunch of this in isolated ways, one thing we've learned is that we should separate teaching the core DSPy language (which is ultimately pretty small) from teaching the emerging ML workflow that works well in a DSPy-esque setting. As a natural extension of this, we need to place more emphasis on the steps before and after the explicit coding in DSPy, from data collection to deployment that serves and monitors the optimized DSPy program in practice. This is just starting, but efforts will be ramping up, led by Omar Khattab, Isaac Miller, and Herumb Shandilya.

4) Shifting towards more interactive optimization & tracking.

Right now, a DSPy user has a few ways to observe and tweak the process of optimization. They can study the prompts before, during, and after optimization using methods like inspect_history, built-in logging, and/or the metadata returned by optimizers. Similarly, they can rely on program.save and program.load to potentially adjust the optimized prompts by hand. Alternatively, they can use one of the many powerful observability integrations — like from Phoenix Arize, LangWatch, or Weights & Biases Weave — to observe in real time the process of optimization (e.g., scores, stack traces, successful & failed traces, and candidate prompts). DSPy encourages iterative engineering by adjusting the program, data, or metrics across optimization runs. For example, some optimizers allow “checkpointing” — e.g., if you optimize with BootstrapFewShotWithRandomSearch for 10 iterations and then increase to 15 iterations, the first 10 will be loaded from cache.

While these can accomplish a lot of goals, there are two limitations that future versions of DSPy will seek to address.

  1. In general, DSPy’s (i) observability, (ii) experiment tracking, (iii) cost management, and (iv) deployment of programs should become first-class concerns via integration with tools like MLflow. We will share more plans addressing this for DSPy 2.6 in the next 1-2 months.

  2. DSPy 3.0 will introduce new optimizers that prioritize ad-hoc, human-in-the-loop feedback. This is perhaps the only substantial paradigm shift we see as necessary in the foreseeable future in DSPy. It involves various research questions at the level of the abstractions, UI/HCI, and ML, so it is a longer-term goal that we will share more about in the next 3-4 months.


Async DSPy Programming

DSPy provides native support for asynchronous programming, allowing you to build more efficient and scalable applications. This guide will walk you through how to leverage async capabilities in DSPy, covering both built-in modules and custom implementations.

Why Use Async in DSPy?

Asynchronous programming in DSPy offers several benefits:

  • Improved performance through concurrent operations
  • Better resource utilization
  • Reduced waiting time for I/O-bound operations
  • Enhanced scalability for handling multiple requests

When Should I use Sync or Async?

Choosing between synchronous and asynchronous programming in DSPy depends on your specific use case. Here's a guide to help you make the right choice:

Use Synchronous Programming When

  • You're exploring or prototyping new ideas
  • You're conducting research or experiments
  • You're building small to medium-sized applications
  • You need simpler, more straightforward code
  • You want easier debugging and error tracking

Use Asynchronous Programming When:

  • You're building a high-throughput service (high QPS)
  • You're working with tools that only support async operations
  • You need to handle multiple concurrent requests efficiently
  • You're building a production service that requires high scalability

Important Considerations

While async programming offers performance benefits, it comes with some trade-offs:

  • More complex error handling and debugging
  • Potential for subtle, hard-to-track bugs
  • More complex code structure
  • Different code paths between IPython environments (Colab, Jupyter Lab, Databricks notebooks, ...) and the normal Python runtime

We recommend starting with synchronous programming for most development scenarios and switching to async only when you have a clear need for its benefits. This approach allows you to focus on the core logic of your application before dealing with the additional complexity of async programming.

Using Built-in Modules Asynchronously

Most DSPy built-in modules support asynchronous operations through the acall() method. This method maintains the same interface as the synchronous __call__ method but operates asynchronously.

Here's a basic example using dspy.Predict:

import dspy
import asyncio
import os

os.environ["OPENAI_API_KEY"] = "your_api_key"

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
predict = dspy.Predict("question->answer")

async def main():
    # Use acall() for async execution
    output = await predict.acall(question="why did a chicken cross the kitchen?")
    print(output)


asyncio.run(main())

Working with Async Tools

DSPy's Tool class seamlessly integrates with async functions. When you provide an async function to dspy.Tool, you can execute it using acall(). This is particularly useful for I/O-bound operations or when working with external services.

import asyncio
import dspy
import os

os.environ["OPENAI_API_KEY"] = "your_api_key"

async def foo(x):
    # Simulate an async operation
    await asyncio.sleep(0.1)
    print(f"I get: {x}")

# Create a tool from the async function
tool = dspy.Tool(foo)

async def main():
    # Execute the tool asynchronously
    await tool.acall(x=2)

asyncio.run(main())

Note: When using dspy.ReAct with tools, calling acall() on the ReAct instance will automatically execute all tools asynchronously using their acall() methods.
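
For instance, here is a minimal sketch of that pattern, assuming the same OpenAI setup as above and a hypothetical async weather tool (the tool name and its body are illustrative, not part of DSPy):

import asyncio
import os

import dspy

os.environ["OPENAI_API_KEY"] = "your_api_key"
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A hypothetical async tool; any async function wrapped in dspy.Tool works the same way.
async def get_weather(city: str) -> str:
    """Look up the current weather for a city."""
    await asyncio.sleep(0.1)  # stand-in for an async API or database call
    return f"Sunny in {city}"

react = dspy.ReAct("question -> answer", tools=[dspy.Tool(get_weather)])

async def main():
    # acall() on the agent runs the async tool via its acall() method under the hood
    result = await react.acall(question="What is the weather in Tokyo?")
    print(result.answer)

asyncio.run(main())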

Creating Custom Async DSPy Modules

To create your own async DSPy module, implement the aforward() method instead of forward(). This method should contain your module's async logic. Here's an example of a custom module that chains two async operations:

import dspy
import asyncio
import os

os.environ["OPENAI_API_KEY"] = "your_api_key"
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class MyModule(dspy.Module):
    def __init__(self):
        self.predict1 = dspy.ChainOfThought("question->answer")
        self.predict2 = dspy.ChainOfThought("answer->simplified_answer")

    async def aforward(self, question, **kwargs):
        # Execute predictions sequentially but asynchronously
        answer = await self.predict1.acall(question=question)
        return await self.predict2.acall(answer=answer.answer)


async def main():
    mod = MyModule()
    result = await mod.acall(question="Why did a chicken cross the kitchen?")
    print(result)


asyncio.run(main())


Please refer to this tutorial from Drew Breunig using DSPy.

This tutorial demonstrates a few aspects of using DSPy in a highly-accessible, concrete context for categorizing historic events with a tiny LM.



Tutorial: Deploying your DSPy program

This guide demonstrates two potential ways to deploy your DSPy program in production: FastAPI for lightweight deployments and MLflow for more production-grade deployments with program versioning and management.

Below, we'll assume you have the following simple DSPy program that you want to deploy. You can replace this with something more sophisticated.

import dspy

dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))
dspy_program = dspy.ChainOfThought("question -> answer")

Deploying with FastAPI

FastAPI offers a straightforward way to serve your DSPy program as a REST API. This is ideal when you have direct access to your program code and need a lightweight deployment solution.

> pip install fastapi uvicorn
> export OPENAI_API_KEY="your-openai-api-key"

Let's create a FastAPI application to serve your dspy_program defined above.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

import dspy

app = FastAPI(
    title="DSPy Program API",
    description="A simple API serving a DSPy Chain of Thought program",
    version="1.0.0"
)

# Define request model for better documentation and validation
class Question(BaseModel):
    text: str

# Configure your language model and 'asyncify' your DSPy program.
lm = dspy.LM("openai/gpt-4o-mini")
dspy.settings.configure(lm=lm, async_max_workers=4) # default is 8
dspy_program = dspy.ChainOfThought("question -> answer")
dspy_program = dspy.asyncify(dspy_program)

@app.post("/predict")
async def predict(question: Question):
    try:
        result = await dspy_program(question=question.text)
        return {
            "status": "success",
            "data": result.toDict()
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

In the code above, we call dspy.asyncify to convert the dspy program to run in async mode for high-throughput FastAPI deployments. Currently, this runs the dspy program in a separate thread and awaits its result.

By default, the limit of spawned threads is 8. Think of this like a worker pool. If you have 8 in-flight programs and call it once more, the 9th call will wait until one of the 8 returns. You can configure the async capacity using the new async_max_workers setting.

??? "Streaming, in DSPy 2.6.0+"

Streaming is also supported in DSPy 2.6.0+, available as a release candidate via `pip install -U --pre dspy`.

We can use `dspy.streamify` to convert the dspy program to a streaming mode. This is useful when you want to stream
the intermediate outputs (i.e. O1-style reasoning) to the client before the final prediction is ready. This uses
asyncify under the hood and inherits the execution semantics.

```python
import litellm
import ujson
from fastapi.responses import StreamingResponse

dspy_program = dspy.asyncify(dspy.ChainOfThought("question -> answer"))
streaming_dspy_program = dspy.streamify(dspy_program)

@app.post("/predict/stream")
async def stream(question: Question):
    async def generate():
        async for value in streaming_dspy_program(question=question.text):
            if isinstance(value, dspy.Prediction):
                data = {"prediction": value.labels().toDict()}
            elif isinstance(value, litellm.ModelResponse):
                data = {"chunk": value.json()}
            yield f"data: {ujson.dumps(data)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

# Since you're often going to want to stream the result of a DSPy program as server-sent events,
# we've included a helper function for that, which is equivalent to the code above.

from dspy.utils.streaming import streaming_response

@app.post("/predict/stream")
async def stream(question: Question):
    stream = streaming_dspy_program(question=question.text)
    return StreamingResponse(streaming_response(stream), media_type="text/event-stream")
```

Write your code to a file, e.g., fastapi_dspy.py. Then you can serve the app with:

> uvicorn fastapi_dspy:app --reload

It will start a local server at http://127.0.0.1:8000/. You can test it with the python code below:

import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"text": "What is the capital of France?"}
)
print(response.json())

You should see a response like the one below:

{'status': 'success', 'data': {'reasoning': 'The capital of France is a well-known fact, commonly taught in geography classes and referenced in various contexts. Paris is recognized globally as the capital city, serving as the political, cultural, and economic center of the country.', 'answer': 'The capital of France is Paris.'}}

Deploying with MLflow

We recommend deploying with MLflow if you are looking to package your DSPy program and deploy in an isolated environment. MLflow is a popular platform for managing machine learning workflows, including versioning, tracking, and deployment.

> pip install "mlflow>=2.18.0"

Let's spin up the MLflow tracking server, where we will store our DSPy program. The command below will start a local server at http://127.0.0.1:5000/.

> mlflow ui

Then we can define the DSPy program and log it to the MLflow server. "log" is an overloaded term in MLflow; here it basically means that we store the program information, along with its environment requirements, in the MLflow server. See the code below:

import dspy
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000/")
mlflow.set_experiment("deploy_dspy_program")

lm = dspy.LM("openai/gpt-4o-mini")
dspy.settings.configure(lm=lm)
dspy_program = dspy.ChainOfThought("question -> answer")

with mlflow.start_run():
    mlflow.dspy.log_model(
        dspy_program,
        "dspy_program",
        input_example={"messages": [{"role": "user", "content": "What is LLM agent?"}]},
        task="llm/v1/chat",
    )

We recommend setting task="llm/v1/chat" so that the deployed program automatically accepts input and generates output in the same format as the OpenAI chat API, which is a common interface for LM applications. Write the code above into a file, e.g., mlflow_dspy.py, and run it.

After you have logged the program, you can view the saved information in the MLflow UI. Open http://127.0.0.1:5000/, select the deploy_dspy_program experiment, and then select the run you just created. Under the Artifacts tab, you should see the logged program information, similar to the following screenshot:

MLflow UI

Grab your run id from the UI (or from the console output when you execute mlflow_dspy.py); you can then deploy the logged program with the following command:

> mlflow models serve -m runs:/{run_id}/model -p 6000

After the program is deployed, you can test it with the following command:

> curl http://127.0.0.1:6000/invocations -H "Content-Type:application/json"  --data '{"messages": [{"content": "what is 2 + 2?", "role": "user"}]}'

You should see a response like the one below:

{"choices": [{"index": 0, "message": {"role": "assistant", "content": "{\"reasoning\": \"The question asks for the sum of 2 and 2. To find the answer, we simply add the two numbers together: 2 + 2 = 4.\", \"answer\": \"4\"}"}, "finish_reason": "stop"}]}

For a complete guide on how to deploy a DSPy program with MLflow, and how to customize the deployment, please refer to the MLflow documentation.

Best Practices for MLflow Deployment

  1. Environment Management: Always specify your Python dependencies in a conda.yaml or requirements.txt file, as sketched after this list.
  2. Versioning: Use meaningful tags and descriptions for your model versions.
  3. Input Validation: Define clear input schemas and examples.
  4. Monitoring: Set up proper logging and monitoring for production deployments.
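
For point 1, a minimal sketch of pinning dependencies at logging time is shown below. It assumes the standard MLflow model-flavor keyword pip_requirements (passing a requirements.txt path, or a conda.yaml via conda_env, works similarly); the versions shown are placeholders to adjust for your environment:

with mlflow.start_run():
    mlflow.dspy.log_model(
        dspy_program,
        "dspy_program",
        # Assumed/standard MLflow kwarg: pin the packages the served model needs
        pip_requirements=["dspy>=2.5.0", "mlflow>=2.18.0"],
        input_example={"messages": [{"role": "user", "content": "What is LLM agent?"}]},
        task="llm/v1/chat",
    )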

For production deployments, consider using MLflow with containerization:

> mlflow models build-docker -m "runs:/{run_id}/model" -n "dspy-program"
> docker run -p 6000:8080 dspy-program

For a complete guide on production deployment options and best practices, refer to the MLflow documentation.


Welcome to DSPy tutorials! We've organized our tutorials into three main categories to help you get started:


Tutorial: Use MCP tools in DSPy

MCP, standing for Model Context Protocol, is an open protocol that standardizes how applications provide context to LLMs. Despite some development overhead, MCP offers a valuable opportunity to share tools, resources, and prompts with other developers regardless of the technical stack you are using. Likewise, you can use the tools built by other developers without rewriting code.

In this guide, we will walk you through how to use MCP tools in DSPy. For demonstration purposes, we will build an airline service agent that can help users book flights and modify or cancel existing bookings. This will rely on an MCP server with custom tools, but it should be easy to generalize to MCP servers built by the community.

??? "How to run this tutorial" This tutorial cannot be run in hosted IPython notebooks like Google Colab or Databricks notebooks. To run the code, you will need to follow the guide to write code on your local device. The code is tested on macOS and should work the same way in Linux environments.

Install Dependencies

Before starting, let's install the required dependencies:

pip install -U dspy mcp

MCP Server Setup

Let's first set up the MCP server for the airline agent, which contains:

  • A set of databases
    • User database, storing user information.
    • Flight database, storing flight information.
    • Ticket database, storing customer tickets.
  • A set of tools
    • fetch_flight_info: get flight information for specific dates.
    • fetch_itinerary: get information about booked itineraries.
    • book_itinerary: book a flight on behalf of the user.
    • modify_itinerary: modify an itinerary, either through flight changes or cancellation.
    • get_user_info: get user information.
    • file_ticket: file a backlog ticket for human assistance.

In your working directory, create a file mcp_server.py, and paste the following content into it:

import random
import string

from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel

# Create an MCP server
mcp = FastMCP("Airline Agent")


class Date(BaseModel):
    # LLMs often struggle to produce a valid `datetime.datetime`, so we use a simpler date structure
    year: int
    month: int
    day: int
    hour: int


class UserProfile(BaseModel):
    user_id: str
    name: str
    email: str


class Flight(BaseModel):
    flight_id: str
    date_time: Date
    origin: str
    destination: str
    duration: float
    price: float


class Itinerary(BaseModel):
    confirmation_number: str
    user_profile: UserProfile
    flight: Flight


class Ticket(BaseModel):
    user_request: str
    user_profile: UserProfile


user_database = {
    "Adam": UserProfile(user_id="1", name="Adam", email="[email protected]"),
    "Bob": UserProfile(user_id="2", name="Bob", email="[email protected]"),
    "Chelsie": UserProfile(user_id="3", name="Chelsie", email="[email protected]"),
    "David": UserProfile(user_id="4", name="David", email="[email protected]"),
}

flight_database = {
    "DA123": Flight(
        flight_id="DA123",
        origin="SFO",
        destination="JFK",
        date_time=Date(year=2025, month=9, day=1, hour=1),
        duration=3,
        price=200,
    ),
    "DA125": Flight(
        flight_id="DA125",
        origin="SFO",
        destination="JFK",
        date_time=Date(year=2025, month=9, day=1, hour=7),
        duration=9,
        price=500,
    ),
    "DA456": Flight(
        flight_id="DA456",
        origin="SFO",
        destination="SNA",
        date_time=Date(year=2025, month=10, day=1, hour=1),
        duration=2,
        price=100,
    ),
    "DA460": Flight(
        flight_id="DA460",
        origin="SFO",
        destination="SNA",
        date_time=Date(year=2025, month=10, day=1, hour=9),
        duration=2,
        price=120,
    ),
}

itinery_database = {}
ticket_database = {}


@mcp.tool()
def fetch_flight_info(date: Date, origin: str, destination: str):
    """Fetch flight information from origin to destination on the given date"""
    flights = []

    for flight_id, flight in flight_database.items():
        if (
            flight.date_time.year == date.year
            and flight.date_time.month == date.month
            and flight.date_time.day == date.day
            and flight.origin == origin
            and flight.destination == destination
        ):
            flights.append(flight)
    return flights


@mcp.tool()
def fetch_itinerary(confirmation_number: str):
    """Fetch a booked itinerary information from database"""
    return itinery_database.get(confirmation_number)


@mcp.tool()
def pick_flight(flights: list[Flight]):
    """Pick up the best flight that matches users' request."""
    sorted_flights = sorted(
        flights,
        key=lambda x: (
            x.get("duration") if isinstance(x, dict) else x.duration,
            x.get("price") if isinstance(x, dict) else x.price,
        ),
    )
    return sorted_flights[0]


def generate_id(length=8):
    chars = string.ascii_lowercase + string.digits
    return "".join(random.choices(chars, k=length))


@mcp.tool()
def book_itinerary(flight: Flight, user_profile: UserProfile):
    """Book a flight on behalf of the user."""
    confirmation_number = generate_id()
    while confirmation_number in itinery_database:
        confirmation_number = generate_id()
    itinery_database[confirmation_number] = Itinerary(
        confirmation_number=confirmation_number,
        user_profile=user_profile,
        flight=flight,
    )
    return confirmation_number, itinery_database[confirmation_number]


@mcp.tool()
def cancel_itinerary(confirmation_number: str, user_profile: UserProfile):
    """Cancel an itinerary on behalf of the user."""
    if confirmation_number in itinery_database:
        del itinery_database[confirmation_number]
        return
    raise ValueError("Cannot find the itinerary, please check your confirmation number.")


@mcp.tool()
def get_user_info(name: str):
    """Fetch the user profile from database with given name."""
    return user_database.get(name)


@mcp.tool()
def file_ticket(user_request: str, user_profile: UserProfile):
    """File a customer support ticket if this is something the agent cannot handle."""
    ticket_id = generate_id(length=6)
    ticket_database[ticket_id] = Ticket(
        user_request=user_request,
        user_profile=user_profile,
    )
    return ticket_id


if __name__ == "__main__":
    mcp.run()

Before we start the server, let's take a look at the code.

We first create a FastMCP instance, which is a utility that helps quickly build an MCP server:

mcp = FastMCP("Airline Agent")

Then we define our data structures, which in a real-world application would be the database schema, e.g.:

class Flight(BaseModel):
    flight_id: str
    date_time: Date
    origin: str
    destination: str
    duration: float
    price: float

Following that, we initialize our database instances. In a real-world application, these would be connectors to actual databases, but for simplicity, we just use dictionaries:

user_database = {
    "Adam": UserProfile(user_id="1", name="Adam", email="[email protected]"),
    "Bob": UserProfile(user_id="2", name="Bob", email="[email protected]"),
    "Chelsie": UserProfile(user_id="3", name="Chelsie", email="[email protected]"),
    "David": UserProfile(user_id="4", name="David", email="[email protected]"),
}

The next step is to define the tools and mark them with @mcp.tool() so that they are discoverable by MCP clients as MCP tools:

@mcp.tool()
def fetch_flight_info(date: Date, origin: str, destination: str):
    """Fetch flight information from origin to destination on the given date"""
    flights = []

    for flight_id, flight in flight_database.items():
        if (
            flight.date_time.year == date.year
            and flight.date_time.month == date.month
            and flight.date_time.day == date.day
            and flight.origin == origin
            and flight.destination == destination
        ):
            flights.append(flight)
    return flights

The last step is spinning up the server:

if __name__ == "__main__":
    mcp.run()

Now we have finished writing the server! Let's launch it:

python path_to_your_working_directory/mcp_server.py

Write a DSPy Program That Utilizes Tools in MCP Server

Now that the server is running, let's build the actual airline service agent which utilizes the MCP tools in our server to assist users. In your working directory, create a file named dspy_mcp_agent.py, and follow the guide to add code to it.

Gather Tools from MCP Servers

We first need to gather all available tools from the MCP server and make them usable by DSPy. DSPy provides an API dspy.Tool as the standard tool interface. Let's convert all the MCP tools to dspy.Tool.

We need to create an MCP client instance to communicate with the MCP server, fetch all available tools, and convert them to dspy.Tool using the static method from_mcp_tool:

import dspy
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Create server parameters for stdio connection
server_params = StdioServerParameters(
    command="python",  # Executable
    args=["path_to_your_working_directory/mcp_server.py"],
    env=None,
)

async def run():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the connection
            await session.initialize()
            # List available tools
            tools = await session.list_tools()

            # Convert MCP tools to DSPy tools
            dspy_tools = []
            for tool in tools.tools:
                dspy_tools.append(dspy.Tool.from_mcp_tool(session, tool))

            print(len(dspy_tools))
            print(dspy_tools[0].args)

if __name__ == "__main__":
    import asyncio

    asyncio.run(run())

With the code above, we have successfully collected all available MCP tools and converted them to DSPy tools.

Build a DSPy Agent to Handle Customer Requests

Now we will use dspy.ReAct to build the agent for handling customer requests. ReAct stands for "reasoning and acting," which asks the LLM to decide whether to call a tool or wrap up the process. If a tool is required, the LLM takes responsibility for deciding which tool to call and providing the appropriate arguments.

As usual, we need to create a dspy.Signature to define the input and output of our agent:

import dspy

class DSPyAirlineCustomerService(dspy.Signature):
    """You are an airline customer service agent. You are given a list of tools to handle user requests. You should decide the right tool to use in order to fulfill users' requests."""

    user_request: str = dspy.InputField()
    process_result: str = dspy.OutputField(
        desc=(
            "Message that summarizes the process result, and the information users need, "
            "e.g., the confirmation_number if it's a flight booking request."
        )
    )

And choose an LM for our agent:

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

Then we create the ReAct agent by passing the tools and signature into the dspy.ReAct API. We can now put together the complete code script:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

import dspy

# Create server parameters for stdio connection
server_params = StdioServerParameters(
    command="python",  # Executable
    args=["script_tmp/mcp_server.py"],  # Optional command line arguments
    env=None,  # Optional environment variables
)


class DSPyAirlineCustomerService(dspy.Signature):
    """You are an airline customer service agent. You are given a list of tools to handle user requests.
    You should decide the right tool to use in order to fulfill users' requests."""

    user_request: str = dspy.InputField()
    process_result: str = dspy.OutputField(
        desc=(
            "Message that summarizes the process result, and the information users need, "
            "e.g., the confirmation_number if it's a flight booking request."
        )
    )


dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))


async def run(user_request):
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the connection
            await session.initialize()
            # List available tools
            tools = await session.list_tools()

            # Convert MCP tools to DSPy tools
            dspy_tools = []
            for tool in tools.tools:
                dspy_tools.append(dspy.Tool.from_mcp_tool(session, tool))

            # Create the agent
            react = dspy.ReAct(DSPyAirlineCustomerService, tools=dspy_tools)

            result = await react.acall(user_request=user_request)
            print(result)


if __name__ == "__main__":
    import asyncio

    asyncio.run(run("please help me book a flight from SFO to JFK on 09/01/2025, my name is Adam"))

Note that we must call react.acall because MCP tools are async by default. Let's execute the script:

python path_to_your_working_directory/dspy_mcp_agent.py

You should see output similar to this:

Prediction(
    trajectory={'thought_0': 'I need to fetch flight information for Adam from SFO to JFK on 09/01/2025 to find available flights for booking.', 'tool_name_0': 'fetch_flight_info', 'tool_args_0': {'date': {'year': 2025, 'month': 9, 'day': 1, 'hour': 0}, 'origin': 'SFO', 'destination': 'JFK'}, 'observation_0': ['{"flight_id": "DA123", "date_time": {"year": 2025, "month": 9, "day": 1, "hour": 1}, "origin": "SFO", "destination": "JFK", "duration": 3.0, "price": 200.0}', '{"flight_id": "DA125", "date_time": {"year": 2025, "month": 9, "day": 1, "hour": 7}, "origin": "SFO", "destination": "JFK", "duration": 9.0, "price": 500.0}'], ..., 'tool_name_4': 'finish', 'tool_args_4': {}, 'observation_4': 'Completed.'},
    reasoning="I successfully booked a flight for Adam from SFO to JFK on 09/01/2025. I found two available flights, selected the more economical option (flight DA123 at 1 AM for $200), retrieved Adam's user profile, and completed the booking process. The confirmation number for the flight is 8h7clk3q.",
    process_result='Your flight from SFO to JFK on 09/01/2025 has been successfully booked. Your confirmation number is 8h7clk3q.'
)

The trajectory field contains the entire thinking and acting process. If you're curious about what's happening under the hood, check out the Observability Guide to set up MLflow, which visualizes every step happening inside dspy.ReAct!

Conclusion

In this guide, we built an airline service agent that utilizes a custom MCP server and the dspy.ReAct module. In the context of MCP support, DSPy provides a simple interface for interacting with MCP tools, giving you the flexibility to implement any functionality you need.


Tutorial: Debugging and Observability in DSPy

This guide demonstrates how to debug problems and improve observability in DSPy. Modern AI programs often involve multiple components, such as language models, retrievers, and tools. DSPy allows you to build and optimize such complex AI systems in a clean and modular way.

However, as systems grow more sophisticated, the ability to understand what your system is doing becomes critical. Without transparency, the prediction process can easily become a black box, making failures or quality issues difficult to diagnose and production maintenance challenging.

By the end of this tutorial, you'll understand how to debug an issue and improve observability using MLflow Tracing. You'll also explore how to build a custom logging solution using callbacks.

Define a Program

We'll start by creating a simple ReAct agent that uses ColBERTv2's Wikipedia dataset as a retrieval source. You can replace this with a more sophisticated program.

import dspy
from dspy.datasets import HotPotQA

lm = dspy.LM('openai/gpt-4o-mini')
colbert = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.configure(lm=lm, rm=colbert)

agent = dspy.ReAct("question -> answer", tools=[dspy.Retrieve(k=1)])

Now, let's ask the agent a simple question:

prediction = agent(question="Which baseball team does Shohei Ohtani play for?")
print(prediction.answer)
Shohei Ohtani plays for the Los Angeles Angels.

Oh, this is incorrect. He no longer plays for the Angels; he moved to the Dodgers and won the World Series in 2024! Let's debug the program and explore potential fixes.

Using inspect_history

DSPy provides the inspect_history() utility, which prints out all LLM invocations made so far:

# Print out 5 LLM calls
dspy.inspect_history(n=5)
[2024-12-01T10:23:29.144257]

System message:

Your input fields are:
1. `question` (str)

...

Response:

[[ ## Thought_5 ## ]]
The search results continue to be unhelpful and do not provide the current team for Shohei Ohtani in Major League Baseball. I need to conclude that he plays for the Los Angeles Angels based on prior knowledge, as the searches have not yielded updated information.

[[ ## Action_5 ## ]]
Finish[Los Angeles Angels] 

[[ ## completed ## ]]

The log reveals that the agent could not retrieve helpful information from the search tool. However, what exactly did the retriever return? While useful, inspect_history has some limitations:

  • In real-world systems, other components like retrievers, tools, and custom modules play significant roles, but inspect_history only logs LLM calls.
  • DSPy programs often make multiple LLM calls within a single prediction. A monolithic log history makes it hard to organize logs, especially when handling multiple questions.
  • Metadata such as parameters, latency, and the relationship between modules are not captured.

Tracing addresses these limitations and provides a more comprehensive solution.

Tracing

MLflow is an end-to-end machine learning platform that integrates seamlessly with DSPy to support best practices in LLMOps. Using MLflow's automatic tracing capability with DSPy is straightforward; no sign-up or API key is required. You just need to install MLflow and call mlflow.dspy.autolog() in your notebook or script.

pip install -U "mlflow>=2.18.0"
import mlflow

mlflow.dspy.autolog()

# This is optional. Create an MLflow Experiment to store and organize your traces.
mlflow.set_experiment("DSPy")

Now you're all set! Let's run your agent again:

agent(question="Which baseball team does Shohei Ohtani play for?")

MLflow automatically generates a trace for the prediction and records it in the experiment. To explore traces visually, launch the MLflow UI by the following command and access it in your browser:

mlflow ui --port 5000

DSPy MLflow Tracing

From the retriever step output, you can observe that it returned outdated information, indicating that Shohei Ohtani was still playing in the Japanese league, and that the final answer was based on the LLM's prior knowledge! We should update the dataset or add additional tools to ensure access to the latest information.

!!! info Learn more about MLflow

MLflow is an end-to-end LLMOps platform that offers extensive features like experiment tracking, evaluation, and deployment. To learn more about DSPy and MLflow integration, visit [this tutorial](../deployment/#deploying-with-mlflow).

For example, we can add a web search capability to the agent, using the Tavily web search API.

from dspy.predict.react import Tool
from tavily import TavilyClient

search_client = TavilyClient(api_key="<YOUR_TAVILY_API_KEY>")

def web_search(query: str) -> list[str]:
    """Run a web search and return the content from the top 5 search results"""
    response = search_client.search(query)
    return [r["content"] for r in response["results"]]

agent = dspy.ReAct("question -> answer", tools=[Tool(web_search)])

prediction = agent(question="Which baseball team does Shohei Ohtani play for?")
print(prediction.answer)
Los Angeles Dodgers

Building a Custom Logging Solution

Sometimes, you may want to implement a custom logging solution. For instance, you might need to log specific events triggered by a particular module. DSPy's callback mechanism supports such use cases. The BaseCallback class provides several handlers for customizing logging behavior:

| Handlers | Description |
| --- | --- |
| on_module_start / on_module_end | Triggered when a dspy.Module subclass is invoked. |
| on_lm_start / on_lm_end | Triggered when a dspy.LM subclass is invoked. |
| on_adapter_format_start / on_adapter_format_end | Triggered when a dspy.Adapter subclass formats the input prompt. |
| on_adapter_parse_start / on_adapter_parse_end | Triggered when a dspy.Adapter subclass post-processes the output text from an LM. |

Here’s an example of custom callback that logs the intermediate steps of a ReAct agent:

import dspy
from dspy.utils.callback import BaseCallback

# 1. Define a custom callback class that extends BaseCallback class
class AgentLoggingCallback(BaseCallback):

    # 2. Implement on_module_end handler to run a custom logging code.
    def on_module_end(self, call_id, outputs, exception):
        step = "Reasoning" if self._is_reasoning_output(outputs) else "Acting"
        print(f"== {step} Step ===")
        for k, v in outputs.items():
            print(f"  {k}: {v}")
        print("\n")

    def _is_reasoning_output(self, outputs):
        return any(k.startswith("Thought") for k in outputs.keys())

# 3. Set the callback to DSPy setting so it will be applied to program execution
dspy.configure(callbacks=[AgentLoggingCallback()])
== Reasoning Step ===
  Thought_1: I need to find the current team that Shohei Ohtani plays for in Major League Baseball.
  Action_1: Search[Shohei Ohtani current team 2023]

== Acting Step ===
  passages: ["Shohei Ohtani ..."]

...

!!! info Handling Inputs and Outputs in Callbacks

Be cautious when working with input or output data in callbacks. Mutating them in-place can modify the original data passed to the program, potentially leading to unexpected behavior. To avoid this, it’s strongly recommended to create a copy of the data before performing any operations that may alter it.
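
As a minimal sketch of this advice, the handler below works on a deep copy of outputs before transforming values for logging. It assumes outputs is the dict-like object shown in the example above, and the truncation logic is purely illustrative:

import copy

import dspy
from dspy.utils.callback import BaseCallback

class SafeLoggingCallback(BaseCallback):
    def on_module_end(self, call_id, outputs, exception):
        # Copy first, so the original outputs passed through the program are never mutated
        outputs_copy = copy.deepcopy(outputs)
        for k in outputs_copy:
            outputs_copy[k] = str(outputs_copy[k])[:200]  # truncate long values in the copy only
        print(outputs_copy)

dspy.configure(callbacks=[SafeLoggingCallback()])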


Output Refinement: BestOfN and Refine

Both BestOfN and Refine are DSPy modules designed to improve the reliability and quality of predictions by making multiple LM calls with different parameter settings. Both modules stop when they have reached N attempts or when the reward_fn returns a reward above the threshold.

BestOfN

BestOfN is a module that runs the provided module multiple times (up to N) with different temperature settings. It returns either the first prediction that passes a specified threshold or the one with the highest reward if none meets the threshold.

Basic Usage

Let's say we want the best chance of getting a one-word answer from the model. We can use BestOfN to try multiple temperature settings and return the best result.

import dspy

def one_word_answer(args, pred: dspy.Prediction) -> float:
    return 1.0 if len(pred.answer.split()) == 1 else 0.0

best_of_3 = dspy.BestOfN(
    module=dspy.ChainOfThought("question -> answer"), 
    N=3, 
    reward_fn=one_word_answer, 
    threshold=1.0
)

result = best_of_3(question="What is the capital of Belgium?")
print(result.answer)  # Brussels

Error Handling

By default, if the module encounters an error during an attempt, it will continue trying until it reaches N attempts. You can adjust this behavior with the fail_count parameter:

qa = dspy.ChainOfThought("question -> answer")

best_of_3 = dspy.BestOfN(
    module=qa,
    N=3,
    reward_fn=one_word_answer,
    threshold=1.0,
    fail_count=1
)

best_of_3(question="What is the capital of Belgium?")
# raises an error after the first failure

Refine

Refine extends the functionality of BestOfN by adding an automatic feedback loop. After each unsuccessful attempt (except the final one), it automatically generates detailed feedback about the module's performance and uses this feedback as hints for subsequent runs.

Basic Usage

import dspy

def one_word_answer(args, pred: dspy.Prediction) -> float:
    return 1.0 if len(pred.answer.split()) == 1 else 0.0

refine = dspy.Refine(
    module=dspy.ChainOfThought("question -> answer"), 
    N=3, 
    reward_fn=one_word_answer, 
    threshold=1.0
)

result = refine(question="What is the capital of Belgium?")
print(result.answer)  # Brussels

Error Handling

Like BestOfN, Refine will try up to N times by default, even if errors occur. You can control this with the fail_count parameter:

qa = dspy.ChainOfThought("question -> answer")

# Stop after the first error
refine = dspy.Refine(
    module=qa,
    N=3,
    reward_fn=one_word_answer,
    threshold=1.0,
    fail_count=1
)

Comparison: BestOfN vs. Refine

Both modules serve similar purposes but differ in their approach:

  • BestOfN simply tries different temperature settings and selects the best resulting prediction as defined by the reward_fn.
  • Refine adds a feedback loop, using the LM to generate detailed feedback about the module's own performance based on the previous prediction and the code in the reward_fn. This feedback is then used as hints for subsequent runs.

Practical Examples

Ensuring Factual Correctness

import dspy

class FactualityJudge(dspy.Signature):
    """Determine if a statement is factually accurate."""
    statement: str = dspy.InputField()
    is_factual: bool = dspy.OutputField()

factuality_judge = dspy.ChainOfThought(FactualityJudge)

def factuality_reward(args, pred: dspy.Prediction) -> float:
    statement = pred.answer
    result = factuality_judge(statement=statement)
    return 1.0 if result.is_factual else 0.0

refined_qa = dspy.Refine(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=factuality_reward,
    threshold=1.0
)

result = refined_qa(question="Tell me about Belgium's capital city.")
print(result.answer)

Summarization - Controlling Response Length

import dspy

def ideal_length_reward(args, pred: dspy.Prediction) -> float:
    """
    Reward the summary for being close to 75 words with a tapering off for longer summaries.
    """
    word_count = len(pred.summary.split())
    distance = abs(word_count - 75)
    return max(0.0, 1.0 - (distance / 125))

optimized_summarizer = dspy.BestOfN(
    module=dspy.ChainOfThought("text -> summary"),
    N=50,
    reward_fn=ideal_length_reward,
    threshold=0.9
)

result = optimized_summarizer(
    text="[Long text to summarize...]"
)
print(result.summary)

Migration from dspy.Suggest and dspy.Assert

BestOfN and Refine are the replacements for dspy.Suggest and dspy.Assert as of DSPy 2.6.
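
As a rough migration sketch (the old API below is shown for illustration only, since dspy.Suggest has been removed), a soft constraint that used to live inside a module as a Suggest call maps onto a reward function plus dspy.Refine:

import dspy

# Old style (dspy < 2.6), inside a module's forward():
#   dspy.Suggest(len(pred.answer.split()) == 1, "Answer should be exactly one word.")

# New style: express the same constraint as a reward function and wrap the module with Refine.
def one_word_answer(args, pred: dspy.Prediction) -> float:
    return 1.0 if len(pred.answer.split()) == 1 else 0.0

one_word_qa = dspy.Refine(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=one_word_answer,
    threshold=1.0,
)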


Please refer to this tutorial from the PAPILLON authors using DSPy.

This tutorial demonstrates a few aspects of using DSPy in a more advanced context:

  1. It builds a multi-stage dspy.Module that involves a small local LM using an external tool.
  2. It builds a multi-stage judge in DSPy, and uses it as a metric for evaluation.
  3. It uses this judge for optimizing the dspy.Module, using a large model as a teacher for a small local LM.

See the links on the side bar.


Tutorial: Saving and Loading your DSPy program

This guide demonstrates how to save and load your DSPy program. At a high level, there are two ways to save your DSPy program:

  1. Save the state of the program only, similar to weights-only saving in PyTorch.
  2. Save the whole program, including both the architecture and the state, which is supported by dspy>=2.6.0.

State-only Saving

State represents the DSPy program's internal state, including the signature, demos (few-shot examples), and other information like the lm to use for each dspy.Predict in the program. It also includes configurable attributes of other DSPy modules like k for dspy.retrievers.Retriever. To save the state of a program, use the save method and set save_program=False. You can choose to save the state to a JSON file or a pickle file. We recommend saving the state to a JSON file because it is safer and human-readable. But sometimes your program contains non-serializable objects like dspy.Image or datetime.datetime, in which case you should save the state to a pickle file.

Let's say we have compiled a program with some data, and we want to save the program for future usage:

import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

gsm8k = GSM8K()
gsm8k_trainset = gsm8k.train[:10]
dspy_program = dspy.ChainOfThought("question -> answer")

optimizer = dspy.BootstrapFewShot(metric=gsm8k_metric, max_bootstrapped_demos=4, max_labeled_demos=4, max_rounds=5)
compiled_dspy_program = optimizer.compile(dspy_program, trainset=gsm8k_trainset)

To save the state of your program to a JSON file:

compiled_dspy_program.save("./dspy_program/program.json", save_program=False)

To save the state of your program to a pickle file:

compiled_dspy_program.save("./dspy_program/program.pkl", save_program=False)

To load your saved state, you need to recreate the same program, then load the state using the load method.

loaded_dspy_program = dspy.ChainOfThought("question -> answer") # Recreate the same program.
loaded_dspy_program.load("./dspy_program/program.json")

assert len(compiled_dspy_program.demos) == len(loaded_dspy_program.demos)
for original_demo, loaded_demo in zip(compiled_dspy_program.demos, loaded_dspy_program.demos):
    # Loaded demo is a dict, while the original demo is a dspy.Example.
    assert original_demo.toDict() == loaded_demo
assert str(compiled_dspy_program.signature) == str(loaded_dspy_program.signature)

Or load the state from a pickle file:

loaded_dspy_program = dspy.ChainOfThought("question -> answer") # Recreate the same program.
loaded_dspy_program.load("./dspy_program/program.pkl")

assert len(compiled_dspy_program.demos) == len(loaded_dspy_program.demos)
for original_demo, loaded_demo in zip(compiled_dspy_program.demos, loaded_dspy_program.demos):
    # Loaded demo is a dict, while the original demo is a dspy.Example.
    assert original_demo.toDict() == loaded_demo
assert str(compiled_dspy_program.signature) == str(loaded_dspy_program.signature)

Whole Program Saving

Starting from dspy>=2.6.0, DSPy supports saving the whole program, including the architecture and the state. This feature is powered by cloudpickle, which is a library for serializing and deserializing Python objects.

To save the whole program, use the save method and set save_program=True, and specify a directory path to save the program instead of a file name. We require a directory path because we also save some metadata, e.g., the dependency versions along with the program itself.

compiled_dspy_program.save("./dspy_program/", save_program=True)

To load the saved program, directly use dspy.load method:

loaded_dspy_program = dspy.load("./dspy_program/")

assert len(compiled_dspy_program.demos) == len(loaded_dspy_program.demos)
for original_demo, loaded_demo in zip(compiled_dspy_program.demos, loaded_dspy_program.demos):
    # Loaded demo is a dict, while the original demo is a dspy.Example.
    assert original_demo.toDict() == loaded_demo
assert str(compiled_dspy_program.signature) == str(loaded_dspy_program.signature)

With whole program saving, you don't need to recreate the program; you can directly load the architecture along with the state. You can pick the suitable saving approach based on your needs.

Backward Compatibility

For dspy<2.7, we don't guarantee backward compatibility of saved programs. For example, if you save a program with dspy==2.5.35, please make sure to use the same version of DSPy to load it, otherwise the program may not work as expected. Loading a saved file with a different version of DSPy will likely not raise an error, but the performance could differ from when the program was saved.

Starting from dspy>=2.7, we will guarantee the backward compatibility of saved programs within major releases, i.e., programs saved with dspy==2.7.0 should be loadable with dspy==2.7.10.


Streaming

In this guide, we will walk you through how to enable streaming in your DSPy program. DSPy Streaming consists of two parts:

  • Output Token Streaming: Stream individual tokens as they're generated, rather than waiting for the complete response.
  • Intermediate Status Streaming: Provide real-time updates about the program's execution state (e.g., "Calling web search...", "Processing results...").

Output Token Streaming

DSPy's token streaming feature works with any module in your pipeline, not just the final output. The only requirement is that the streamed field must be of type str. To enable token streaming:

  1. Wrap your program with dspy.streamify
  2. Create one or more dspy.streaming.StreamListener objects to specify which fields to stream

Here's a basic example:

import os

import dspy

os.environ["OPENAI_API_KEY"] = "your_api_key"

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

predict = dspy.Predict("question->answer")

# Enable streaming for the 'answer' field
stream_predict = dspy.streamify(
    predict,
    stream_listeners=[dspy.streaming.StreamListener(signature_field_name="answer")],
)

To consume the streamed output:

import asyncio

async def read_output_stream():
    output_stream = stream_predict(question="Why did a chicken cross the kitchen?")

    async for chunk in output_stream:
        print(chunk)

asyncio.run(read_output_stream())

This will produce output like:

StreamResponse(predict_name='self', signature_field_name='answer', chunk='To')
StreamResponse(predict_name='self', signature_field_name='answer', chunk=' get')
StreamResponse(predict_name='self', signature_field_name='answer', chunk=' to')
StreamResponse(predict_name='self', signature_field_name='answer', chunk=' the')
StreamResponse(predict_name='self', signature_field_name='answer', chunk=' other')
StreamResponse(predict_name='self', signature_field_name='answer', chunk=' side of the frying pan!')
Prediction(
    answer='To get to the other side of the frying pan!'
)

Note: Since dspy.streamify returns an async generator, you must use it within an async context. If you're using an environment like Jupyter or Google Colab that already has an event loop (async context), you can use the generator directly.
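
For example, in a notebook where an event loop is already running, you can iterate over the generator directly instead of calling asyncio.run. This is a sketch reusing the stream_predict defined above:

# Inside a Jupyter/Colab cell, top-level `async for` works directly.
output_stream = stream_predict(question="Why did a chicken cross the kitchen?")

async for chunk in output_stream:
    print(chunk)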

You may have noticed that the above stream contains two different entities: StreamResponse and Prediction. StreamResponse is the wrapper over the streaming tokens of the field being listened to, which in this example is the answer field. Prediction is the program's final output.

In DSPy, streaming is implemented in a sidecar fashion: we enable streaming on the LM so that it emits a stream of tokens, and we send those tokens to a side channel that is continuously read by the user-defined listeners. Each listener keeps interpreting the stream and decides whether the signature_field_name it is listening to has started to appear and whether it has finalized. Once it decides that the field has appeared, the listener starts emitting tokens to the async generator that users read from. A listener's internal mechanism depends on the adapter behind the scenes, and because we usually cannot tell that a field has finalized until we see the next field, the listener buffers output tokens before sending them to the final generator; this is why the last chunk of type StreamResponse usually contains more than one token. The program's final output is also written to the stream, which is the Prediction chunk in the sample output above.

To handle these different types and implement custom logic:

import asyncio

async def read_output_stream():
    output_stream = stream_predict(question="Why did a chicken cross the kitchen?")

    return_value = None
    async for chunk in output_stream:
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(f"Output token of field {chunk.signature_field_name}: {chunk.chunk}")
        elif isinstance(chunk, dspy.Prediction):
            return_value = chunk
    return return_value


program_output = asyncio.run(read_output_stream())
print("Final output: ", program_output)

Understand StreamResponse

StreamResponse (dspy.streaming.StreamResponse) is the wrapper class of streaming tokens. It comes with 3 fields:

  • predict_name: the name of the predict that holds the signature_field_name. This name matches the keys returned by your_program.named_predictors(). In the code above, because answer comes from the predict module itself, predict_name shows up as self, which is the only key returned by predict.named_predictors().
  • signature_field_name: the output field these tokens map to. predict_name and signature_field_name together form the unique identifier of the field. We demonstrate how to stream multiple fields and handle duplicated field names later in this guide.
  • chunk: the value of the stream chunk.
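
As a small sketch of how these fields are typically used, the snippet below keys a buffer on (predict_name, signature_field_name) to reassemble each streamed field's text from its chunks:

from collections import defaultdict

import dspy

field_buffers = defaultdict(str)

async def collect(output_stream):
    final_prediction = None
    async for chunk in output_stream:
        if isinstance(chunk, dspy.streaming.StreamResponse):
            # (predict_name, signature_field_name) uniquely identifies the streamed field.
            field_buffers[(chunk.predict_name, chunk.signature_field_name)] += chunk.chunk
        elif isinstance(chunk, dspy.Prediction):
            final_prediction = chunk
    return final_prediction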

Streaming with Cache

When a cached result is found, the stream will skip individual tokens and only yield the final Prediction. For example:

Prediction(
    answer='To get to the other side of the dinner plate!'
)
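
As a sketch, with caching left enabled on the LM (the default), running the same streamified call twice should yield token chunks on the first attempt and only the final Prediction on the repeat:

import asyncio

async def run_twice():
    # Reuses the `stream_predict` defined earlier; the LM cache is assumed to be enabled (the default).
    for attempt in range(2):
        print(f"--- attempt {attempt} ---")
        async for chunk in stream_predict(question="Why did a chicken cross the kitchen?"):
            print(chunk)

asyncio.run(run_twice())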

Streaming Multiple Fields

You can monitor multiple fields by creating a StreamListener for each one. Here's an example with a multi-module program:

import asyncio

import dspy

lm = dspy.LM("openai/gpt-4o-mini", cache=False)
dspy.settings.configure(lm=lm)


class MyModule(dspy.Module):
    def __init__(self):
        super().__init__()

        self.predict1 = dspy.Predict("question->answer")
        self.predict2 = dspy.Predict("answer->simplified_answer")

    def forward(self, question: str, **kwargs):
        answer = self.predict1(question=question)
        simplified_answer = self.predict2(answer=answer)
        return simplified_answer


predict = MyModule()
stream_listeners = [
    dspy.streaming.StreamListener(signature_field_name="answer"),
    dspy.streaming.StreamListener(signature_field_name="simplified_answer"),
]
stream_predict = dspy.streamify(
    predict,
    stream_listeners=stream_listeners,
)

async def read_output_stream():
    output = stream_predict(question="why did a chicken cross the kitchen?")

    return_value = None
    async for chunk in output:
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(chunk)
        elif isinstance(chunk, dspy.Prediction):
            return_value = chunk
    return return_value

program_output = asyncio.run(read_output_stream())
print("Final output: ", program_output)

The output will look like:

StreamResponse(predict_name='predict1', signature_field_name='answer', chunk='To')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' get')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' to')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' the')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' other side of the recipe!')
StreamResponse(predict_name='predict2', signature_field_name='simplified_answer', chunk='To')
StreamResponse(predict_name='predict2', signature_field_name='simplified_answer', chunk=' reach')
StreamResponse(predict_name='predict2', signature_field_name='simplified_answer', chunk=' the')
StreamResponse(predict_name='predict2', signature_field_name='simplified_answer', chunk=' other side of the recipe!')
Final output:  Prediction(
    simplified_answer='To reach the other side of the recipe!'
)

Handling Duplicate Field Names

When streaming fields with the same name from different modules, specify both the predict and predict_name in the StreamListener:

import asyncio

import dspy

lm = dspy.LM("openai/gpt-4o-mini", cache=False)
dspy.settings.configure(lm=lm)


class MyModule(dspy.Module):
    def __init__(self):
        super().__init__()

        self.predict1 = dspy.Predict("question->answer")
        self.predict2 = dspy.Predict("question, answer->answer, score")

    def forward(self, question: str, **kwargs):
        answer = self.predict1(question=question)
        simplified_answer = self.predict2(answer=answer)
        return simplified_answer


predict = MyModule()
stream_listeners = [
    dspy.streaming.StreamListener(
        signature_field_name="answer",
        predict=predict.predict1,
        predict_name="predict1"
    ),
    dspy.streaming.StreamListener(
        signature_field_name="answer",
        predict=predict.predict2,
        predict_name="predict2"
    ),
]
stream_predict = dspy.streamify(
    predict,
    stream_listeners=stream_listeners,
)


async def read_output_stream():
    output = stream_predict(question="why did a chicken cross the kitchen?")

    return_value = None
    async for chunk in output:
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(chunk)
        elif isinstance(chunk, dspy.Prediction):
            return_value = chunk
    return return_value


program_output = asyncio.run(read_output_stream())
print("Final output: ", program_output)

The output will be like:

StreamResponse(predict_name='predict1', signature_field_name='answer', chunk='To')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' get')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' to')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' the')
StreamResponse(predict_name='predict1', signature_field_name='answer', chunk=' other side of the recipe!')
StreamResponse(predict_name='predict2', signature_field_name='answer', chunk="I'm")
StreamResponse(predict_name='predict2', signature_field_name='answer', chunk=' ready')
StreamResponse(predict_name='predict2', signature_field_name='answer', chunk=' to')
StreamResponse(predict_name='predict2', signature_field_name='answer', chunk=' assist')
StreamResponse(predict_name='predict2', signature_field_name='answer', chunk=' you')
StreamResponse(predict_name='predict2', signature_field_name='answer', chunk='! Please provide a question.')
Final output:  Prediction(
    answer="I'm ready to assist you! Please provide a question.",
    score='N/A'
)

Intermediate Status Streaming

Status streaming keeps users informed about the program's progress, especially useful for long-running operations like tool calls or complex AI pipelines. To implement status streaming:

  1. Create a custom status message provider by subclassing dspy.streaming.StatusMessageProvider
  2. Override the desired hook methods to provide custom status messages
  3. Pass your provider to dspy.streamify

Example:

class MyStatusMessageProvider(dspy.streaming.StatusMessageProvider):
    def lm_start_status_message(self, instance, inputs):
        return f"Calling LM with inputs {inputs}..."

    def lm_end_status_message(self, outputs):
        return f"LM finished with outputs: {outputs}!"

Available hooks:

  • lm_start_status_message: status message at the start of calling dspy.LM.
  • lm_end_status_message: status message at the end of calling dspy.LM.
  • module_start_status_message: status message at the start of calling a dspy.Module.
  • module_end_status_message: status message at the end of calling a dspy.Module.
  • tool_start_status_message: status message at the start of calling dspy.Tool.
  • tool_end_status_message: status message at the end of calling dspy.Tool.

Each hook should return a string containing the status message.

After creating the message provider, just pass it to dspy.streamify to enable both status message streaming and output token streaming, as in the example below. Intermediate status messages are represented by the class dspy.streaming.StatusMessage, so we need another condition check to capture them.

import asyncio

import dspy

lm = dspy.LM("openai/gpt-4o-mini", cache=False)
dspy.settings.configure(lm=lm)


class MyModule(dspy.Module):
    def __init__(self):
        super().__init__()

        self.tool = dspy.Tool(lambda x: 2 * x, name="double_the_number")
        self.predict = dspy.ChainOfThought("num1, num2->sum")

    def forward(self, num, **kwargs):
        num2 = self.tool(x=num)
        return self.predict(num1=num, num2=num2)


class MyStatusMessageProvider(dspy.streaming.StatusMessageProvider):
    def tool_start_status_message(self, instance, inputs):
        return f"Calling Tool {instance.name} with inputs {inputs}..."

    def tool_end_status_message(self, instance, outputs):
        return f"Tool finished with output: {outputs}!"


predict = MyModule()
stream_listeners = [
    # dspy.ChainOfThought has a built-in output field called "reasoning".
    dspy.streaming.StreamListener(signature_field_name="reasoning"),
]
stream_predict = dspy.streamify(
    predict,
    stream_listeners=stream_listeners,
)


async def read_output_stream():
    output = stream_predict(num=3)

    return_value = None
    async for chunk in output:
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(chunk)
        elif isinstance(chunk, dspy.Prediction):
            return_value = chunk
        elif isinstance(chunk, dspy.streaming.StatusMessage):
            print(chunk)
    return return_value


program_output = asyncio.run(read_output_stream())
print("Final output: ", program_output)

Sample output:

StatusMessage(message='Calling tool double_the_number...')
StatusMessage(message='Tool calling finished! Querying the LLM with tool calling results...')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk='To')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' find')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' the')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' sum')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' of')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' the')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' two')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' numbers')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=',')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' we')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' simply')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' add')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' them')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' together')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk='.')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' Here')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=',')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' ')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk='3')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' plus')
StreamResponse(predict_name='predict.predict', signature_field_name='reasoning', chunk=' 6 equals 9.')
Final output:  Prediction(
    reasoning='To find the sum of the two numbers, we simply add them together. Here, 3 plus 6 equals 9.',
    sum='9'
)

Synchronous Streaming

By default, calling a streamified DSPy program produces an async generator. To get a sync generator instead, set the flag async_streaming=False:

import os

import dspy

os.environ["OPENAI_API_KEY"] = "your_api_key"

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

predict = dspy.Predict("question->answer")

# Enable streaming for the 'answer' field
stream_predict = dspy.streamify(
    predict,
    stream_listeners=[dspy.streaming.StreamListener(signature_field_name="answer")],
    async_streaming=False,
)

output = stream_predict(question="why did a chicken cross the kitchen?")

program_output = None
for chunk in output:
    if isinstance(chunk, dspy.streaming.StreamResponse):
        print(chunk)
    elif isinstance(chunk, dspy.Prediction):
        program_output = chunk
print(f"Program output: {program_output}")

Tutorial: Agents

Let's walk through a quick example of setting up a dspy.ReAct agent with a couple of tools and optimizing it to conduct advanced browsing for multi-hop search.

Install the latest DSPy via pip install -U dspy and follow along.

Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to better understand DSPy's behavior. You can set up MLflow easily by following the four steps below.

MLflow Trace

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

Once you have completed the steps above, you can see traces for each program execution in the notebook. They provide great visibility into the model's behavior and help you understand DSPy's concepts better throughout the tutorial.

To learn more about the integration, visit MLflow DSPy Documentation as well.

In this tutorial, we'll use an extremely small LM, Meta's Llama-3.2-3B-Instruct, which has 3 billion parameters.

A model like this is not very reliable out of the box for long or complex agent loops. However, it's extremely fast and cheap to host, as it needs very little RAM.

You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.
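
For instance, if you serve the model locally with Ollama, the LM configuration might look like the sketch below; the provider prefix, model tag, and port are placeholders for whatever your local server exposes.

import dspy

# Hypothetical local setup: an Ollama server exposing a Llama-3.2-3B-Instruct model.
llama3b = dspy.LM(
    "ollama_chat/llama3.2:3b-instruct",  # placeholder model tag; match your local model name
    api_base="http://localhost:11434",   # default Ollama port; adjust for your setup
    temperature=0.7,
)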

In the snippet below, we'll configure our main LM as Llama-3.2-3B. We'll also set up a larger LM, i.e. GPT-4o, as a teacher that we'll invoke a very small number of times to help teach the small LM.

import dspy

llama3b = dspy.LM('<provider>/Llama-3.2-3B-Instruct', temperature=0.7)
gpt4o = dspy.LM('openai/gpt-4o', temperature=0.7)

dspy.configure(lm=llama3b)

Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim.

import random
from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="hover-nlp/hover", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
hover = [
    dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
    for x in hover
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids and not hpqa_ids.add(x["hpqa_id"])
]

random.Random(0).shuffle(hover)
trainset, devset, testset = hover[:100], hover[100:200], hover[650:]

Let's view an example of this task:

example = trainset[0]

print("Claim:", example.claim)
print("Pages that must be retrieved:", example.titles)
Claim: This director is known for his work on Miss Potter. The Academy of Motion Picture Arts and Sciences presents the award in which he was nominated for his work in "Babe".
Pages that must be retrieved: ['Miss Potter', 'Chris Noonan', 'Academy Award for Best Director']

Now, let's define a function to do the search in Wikipedia. We'll rely on a ColBERTv2 server that can search the "abstracts" (i.e., first paragraphs) of every article that existed in Wikipedia in 2017, which is the data used in HoVer.

DOCS = {}

def search(query: str, k: int) -> list[str]:
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=k)
    results = [x['text'] for x in results]

    for result in results:
        title, text = result.split(" | ", 1)
        DOCS[title] = text

    return results

Now, let's use the search function to define two tools for our ReAct agent:

def search_wikipedia(query: str) -> list[str]:
    """Returns top-5 results and then the titles of the top-5 to top-30 results."""

    topK = search(query, 30)
    titles, topK = [f"`{x.split(' | ')[0]}`" for x in topK[5:30]], topK[:5]
    return topK + [f"Other retrieved pages have titles: {', '.join(titles)}."]

def lookup_wikipedia(title: str) -> str:
    """Returns the text of the Wikipedia page, if it exists."""

    if title in DOCS:
        return DOCS[title]

    results = [x for x in search(title, 10) if x.startswith(title + " | ")]
    if not results:
        return f"No Wikipedia page found for title: {title}"
    return results[0]

Now, let's define the ReAct agent in DSPy. It's going to be super simple: it'll take a claim and produce a list of titles, i.e., titles: list[str].

We'll instruct it to find all Wikipedia titles that are needed to fact-check the claim.

instructions = "Find all Wikipedia titles relevant to verifying (or refuting) the claim."
signature = dspy.Signature("claim -> titles: list[str]", instructions)
react = dspy.ReAct(signature, tools=[search_wikipedia, lookup_wikipedia], max_iters=20)

Let's try it with a really simple claim to see if our tiny 3B model can do it!

react(claim="David Gregory was born in 1625.").titles[:3]
['David Gregory (physician)', 'David A. Gregory', 'David Harry Gregory']

Great. Now let's set up an evaluation metric, top5_recall.

It will return the fraction of the gold pages (which are always 3) that are retrieved in the top-5 titles returned by the agent.

def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)

    # If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
    if trace is not None:
        return recall >= 1.0
    
    # If we're just doing inference, just measure the recall.
    return recall

evaluate = dspy.Evaluate(devset=devset, metric=top5_recall, num_threads=16, display_progress=True, display_table=5)

Let's evaluate our off-the-shelf agent, with Llama-3.2-3B, to see how far we can go already.

This model is tiny, so it can fail fairly often. Let's wrap it in a try/except block to hide those errors.

def safe_react(claim: str):
    try:
        return react(claim=claim)
    except Exception as e:
        return dspy.Prediction(titles=[])

evaluate(safe_react)
  0%|          | 0/100 [00:00<?, ?it/s]

Average Metric: 8.00 / 100 (8.0%): 100%|██████████| 100/100 [05:22<00:00,  3.22s/it]

2024/12/17 14:09:47 INFO dspy.evaluate.evaluate: Average Metric: 7.999999999999997 / 100 (8.0%)
claim example_titles trajectory reasoning pred_titles top5_success
0 The Church of England's movement that inspired the Trinity Episcop... [Oxford Movement, Trinity Episcopal Church (Houghton, Michigan), S... {'thought_0': 'The claim suggests that there is a specific movemen... The search results seem to be a mix of different churches with sim... ['Trinity Episcopal Church (Houghton, Michigan)', 'Trinity Episcop... ✔️ [0.333]
1 Red, White & Crüe and this athlete both fight. The french fighter ... [Red, White &amp; Crüe, Mike Tyson, Bobby Stewart] NaN NaN []
2 The writer/director/actor from Glen or Glenda and Fernand Rivers s... [Ed Wood, Glen or Glenda, Fernand Rivers] NaN NaN []
3 The film by Sandi Sissel was released before The End of Suburbia. [Chicken Ranch (film), Sandi Sissel, The End of Suburbia] NaN NaN []
4 The actor who played captain hook in the live production with Tayl... [Christopher Walken, Taylor Louderman, Peter Pan Live!] NaN NaN []
... 95 more rows not displayed ...
8.0
Tracking Evaluation Results in MLflow Experiment

To track and visualize the evaluation results over time, you can record the results in MLflow Experiment.

import mlflow

with mlflow.start_run(run_name="agent_evaluation"):
    evaluate = dspy.Evaluate(
        devset=devset,
        metric=top5_recall,
        num_threads=16,
        display_progress=True,
        # To record the outputs and detailed scores to MLflow
        return_all_scores=True,
        return_outputs=True,
    )

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate(safe_react)

    # Log the aggregated score
    mlflow.log_metric("top5_recall", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Claim": [example.claim for example in eval_set],
            "Expected Titles": [example.titles for example in eval_set],
            "Predicted Titles": outputs,
            "Top 5 Recall": all_scores,
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit MLflow DSPy Documentation as well.

Wow. It only scores 8% in terms of recall. Not that good!

Let's now optimize the two prompts inside dspy.ReAct jointly to maximize the recall of our agent. This may take around 30 minutes and cost about $5 worth of calls to GPT-4o to optimize Llama-3.2-3B.

kwargs = dict(teacher_settings=dict(lm=gpt4o), prompt_model=gpt4o, max_errors=999)

tp = dspy.MIPROv2(metric=top5_recall, auto="medium", num_threads=16, **kwargs)
optimized_react = tp.compile(react, trainset=trainset, max_bootstrapped_demos=3, max_labeled_demos=0)

Let's now evaluate again, after optimization.

evaluate(optimized_react)
Average Metric: 41.67 / 100 (41.7%): 100%|██████████| 100/100 [03:00<00:00,  1.81s/it]

2024/12/17 15:12:06 INFO dspy.evaluate.evaluate: Average Metric: 41.66666666666667 / 100 (41.7%)
claim example_titles trajectory reasoning pred_titles top5_success
0 The Church of England's movement that inspired the Trinity Episcop... [Oxford Movement, Trinity Episcopal Church (Houghton, Michigan), S... {'thought_0': 'To verify the claim, I need to identify the Church ... The claim states that the Church of England's movement that inspir... ['Trinity Episcopal Church (Houghton, Michigan)', 'Church of All S... ✔️ [0.667]
1 Red, White & Crüe and this athlete both fight. The french fighter ... [Red, White &amp; Crüe, Mike Tyson, Bobby Stewart] {'thought_0': 'To verify the claim, I need to identify the French ... The claim states that Red, White & Crüe is a term applied to sport... [Bobby Stewart, Bernardin Ledoux Kingue Matam, Mötley Crüe, Milan ... ✔️ [0.333]
2 The writer/director/actor from Glen or Glenda and Fernand Rivers s... [Ed Wood, Glen or Glenda, Fernand Rivers] {'thought_0': 'To verify the claim, I need to identify the writer/... The claim states that Glen or Glenda and Fernand Rivers share the ... [Ed Wood, Bela Lugosi, Dolores Fuller] ✔️ [0.333]
3 The film by Sandi Sissel was released before The End of Suburbia. [Chicken Ranch (film), Sandi Sissel, The End of Suburbia] {'thought_0': 'To verify the claim, I need to find the release dat... The claim states that the film by Sandi Sissel was released before... [Sandi Sissel, The End of Suburbia (film)] ✔️ [0.333]
4 The actor who played captain hook in the live production with Tayl... [Christopher Walken, Taylor Louderman, Peter Pan Live!] {'thought_0': 'To verify the claim, I need to find the actor who p... The claim suggests that the actor who played Captain Hook in the l... [Cyril Ritchard, Ruth Connell]
... 95 more rows not displayed ...
41.67

Awesome. It looks like the system improved drastically from 8% recall to around 40% recall. That was a pretty straightforward approach, but DSPy gives you many tools to continue iterating on this from here.

Next, let's inspect the optimized prompts to understand what it has learned. We'll run one query and then inspect the last two prompts, which will show us the prompts used for both ReAct sub-modules: the one that runs the agentic loop and the one that prepares the final results. (Alternatively, if you enabled MLflow Tracing following the instructions above, you can see all the steps done by the agent, including LLM calls, prompts, and tool execution, in a rich tree view.)

optimized_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
['Bernard-Marie Koltès', 'Joe Orton']
dspy.inspect_history(n=2)
[2024-12-17T15:13:25.420335]

System message:

Your input fields are:
1. `claim` (str)
2. `trajectory` (str)

Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal[search_wikipedia, lookup_wikipedia, finish])
3. `next_tool_args` (dict[str, Any])

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## claim ## ]]
{claim}

[[ ## trajectory ## ]]
{trajectory}

[[ ## next_thought ## ]]
{next_thought}

[[ ## next_tool_name ## ]]
{next_tool_name}        # note: the value you produce must be one of: search_wikipedia; lookup_wikipedia; finish

[[ ## next_tool_args ## ]]
{next_tool_args}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "object"}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Find all Wikipedia titles relevant to verifying (or refuting) the claim.
        
        You will be given `claim` and your goal is to finish with `titles`.
        
        To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
        
        Thought can reason about the current situation, and Tool Name can be the following types:
        
        (1) search_wikipedia, whose description is <desc>Returns top-5 results and then the titles of the top-5 to top-30 results.</desc>. It takes arguments {'query': 'str'} in JSON format.
        (2) lookup_wikipedia, whose description is <desc>Returns the text of the Wikipedia page, if it exists.</desc>. It takes arguments {'title': 'str'} in JSON format.
        (3) finish, whose description is <desc>Signals that the final outputs, i.e. `titles`, are now available and marks the task as complete.</desc>. It takes arguments {} in JSON format.


User message:

[[ ## claim ## ]]
1990 Film that Khiladiyon Ka Khiladi is loosely based on stars this actor who is best known for martial arts action films.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
To verify the claim, I need to identify the 1990 film that "Khiladiyon Ka Khiladi" is loosely based on and the actor known for martial arts action films who starred in it. I will start by searching for information on "Khiladiyon Ka Khiladi" to find details about its inspiration.

[[ ## tool_name_0 ## ]]
search_wikipedia

[[ ## tool_args_0 ## ]]
{"query": "Khiladiyon Ka Khiladi"}

[[ ## observation_0 ## ]]
[1] «Khiladiyon Ka Khiladi | Khiladiyon Ka Khiladi (English: Player of Players) is a 1996 Indian action film starring Rekha in her first villain role, Akshay Kumar, Raveena Tandon and former WWE wrestlers "Crush" and Brian Lee as "The Undertaker". It was the 5th highest grossing movie of the year 1996 and was declared 'SuperHit' by Box Office India. It was the fourth installment in the Khiladi (film series). The movie is loosely based based on Hollywood film Lionheart»
[2] «Khiladi 420 | Khiladi 420 (English: "Con Player") is an Indian Hindi action film directed by Neeraj Vora and starring Akshay Kumar and Mahima Chaudhry. The film was written by Uttam Gudda and released on 29 December 2000. It is the seventh installment in the "Khiladi" series starring Kumar, which included "Khiladi" (1992), "Main Khiladi Tu Anari" (1994), "Sabse Bada Khiladi" (1995), "Khiladiyon Ka Khiladi" (1996), "Mr. and Mrs. Khiladi" (1997) and "International Khiladi" (1999).»
[3] «Khiladi (1992 film) | Khiladi (English: "Player" ) is a 1992 Indian suspense thriller film directed by Abbas Mustan. The film was Akshay Kumar's breakthrough role and also stars Ayesha Jhulka, Deepak Tijori, Sabeeha. While Prem Chopra, Shakti Kapoor, Anant Mahadevan and Johnny Lever played supporting roles. "Khiladi" was the first installment in the Khiladi (film series) which had "Khiladi" in the title and Akshay Kumar in the leading role. It was followed by "Main Khiladi Tu Anari" (1994), "Sabse Bada Khiladi" (1995), "Khiladiyon Ka Khiladi" (1996), "Mr. and Mrs. Khiladi" (1997), "International Khiladi" (1999), "Khiladi 420"(2000) and "Khiladi 786" (2012). Khiladi was critically and commercially success at the box-office and the tenth highest grossing film of 1992. It was Akshay Kumar's first successful movie and was declared a "Super Hit" at the box office. The basic premise of the story is similar to 1975 released movie Khel Khel Mein starring Rishi Kapoor and Neetu Singh. The film was remade in Kannada as "Aata Hudugaata".»
[4] «Khiladi (film series) | Khiladi series is a Bollywood action film series starring Akshay Kumar in the lead role. However, unlike other film series, other than having Akshay Kumar in lead role, and other than having the word "Khiladi" in the title, these films have nothing in common. The producers, directors and stories of these films are totally different. " Khiladi" (1992) was the first in a series of films which had Akshay Kumar in the title role and gave it his first breakthrough role. It was followed by "Main Khiladi Tu Anari" (1994), "Sabse Bada Khiladi" (1995), "Khiladiyon Ka Khiladi" (1996), "Mr. and Mrs. Khiladi" (1997), "International Khiladi" (1999) and "Khiladi 420" (2000), all featuring Kumar in the lead role. The latest film in the franchise is "Khiladi 786" (2012).»
[5] «Khiladi 786 | Khiladi 786 (खिलाड़ी 786) is a 2012 Indian Hindi Punjabi action comedy film directed by Ashish R Mohan, featuring Akshay Kumar in the title role alongside Asin playing the female lead. It features Himesh Reshammiya, Mithun Chakraborty, Raj Babbar and Mukesh Rishi in supporting roles. The film marks the return of Akshay Kumar to his "Khiladi" series after 12 years. It is mostly shot in Mumbai and Punjab. Khiladi786 is AkshayKumar 100th film.»
[6] «Other retrieved pages have titles: `Khiladi (2013 film)`, `Khiladi (1968 film)`, `International Khiladi`, `Khatron Ke Khiladi (1988 film)`, `Khatron Ke Khiladi (2001 film)`, `Umesh Mehra`, `Shatranj ke Khiladi`, `Main Khiladi Tu Anari`, `Fear Factor: Khatron Ke Khiladi (season 7)`, `Fear Factor: Khatron Ke Khiladi`, `Mr. and Mrs. Khiladi`, `Sabse Bada Khiladi`, `Ek Khiladi Bawan Pattey`, `Ek Khiladi Ek Haseena (TV series)`, `Do Khiladi`, `Ek Khiladi Ek Haseena (film)`, `Khel Khilari Ka`, `Fear Factor: Khatron Ke Khiladi (season 8)`, `Fear Factor: Khatron Ke Khiladi Darr Ka Blockbuster`, `Khilona (1942 film)`, `Fear Factor: Khatron Ke Khiladi Darr Ka Blockbuster Returns`, `Chhota Bheem: Dholakpur ke Khiladi`, `Khilani`, `Kandireega`, `Khilaaf`.»

[[ ## thought_1 ## ]]
The search results indicate that "Khiladiyon Ka Khiladi" is loosely based on the Hollywood film "Lionheart". I should now verify which actor, known for martial arts action films, starred in "Lionheart".

[[ ## tool_name_1 ## ]]
search_wikipedia

[[ ## tool_args_1 ## ]]
{"query": "Lionheart 1990 film"}

[[ ## observation_1 ## ]]
[1] «Lionheart (1990 film) | Lionheart (also known as Wrong Bet, A.W.O.L.: Absent Without Leave, Leon and Full Contact) is a 1990 action film, directed by Sheldon Lettich, starring Jean-Claude Van Damme and co-starring Brian Thompson, along with Harrison Page, Deborah Rennard, Lisa Pelikan, and Ashley Johnson.»
[2] «Truly, Madly, Deeply | Truly, Madly, Deeply is a 1990 British fantasy drama film made for the BBC's "Screen Two" series, by BBC Films, Lionheart and Winston Pictures. The film, written and directed by Anthony Minghella, stars Juliet Stevenson and Alan Rickman.»
[3] «Lionheart (1987 film) | Lionheart, also known as Lionheart: The Children's Crusade, is a 1987 adventure film directed by Franklin J. Schaffner and produced by Talia Shire and Stanley O'Toole. Shire's brother, Francis Ford Coppola, initially planned to direct the film but instead opted to be executive producer along with Shire's husband, Jack Schwartzman. The screenplay was written by Menno Meyjes and Richard Outten from a story by Meyjes. The composer Jerry Goldsmith wrote the score. The film was released in August 1987. It was distributed by Orion Pictures.»
[4] «Lionheart (2016 film) | Lionheart is a 2016 American boxing film short written and produced by Oscar DeRosa and Orlando Cicilia III. The film stars Oscar DeRosa and Marc Macaulay. The film portrays struggling professional boxer Max Rossi who is finally presented with the fight he's been waiting for that will launch his career to the next level but when he is suddenly confronted with a life-changing opportunity he must decide which path to take.»
[5] «Richard the Lionheart (2013 film) | Richard The Lionheart is a 2013 film, starring Chandler Maness as Richard the Lionheart, Malcolm McDowell as King Henry II and Andrea Zirio as Henry the Young.»
[6] «Other retrieved pages have titles: `Richard the Lionheart: Rebellion`, `The Brothers Lionheart (1977 film)`, `Kodama Simham`, `Jean-Claude Van Damme`, `Lionsheart`, `A Day for Lionhearts`, `The Lion in Winter (1968 film)`, `Lion Jagapathi Rao`, `Theatre of Blood`, `Young Lions (album)`, `Richard the Lion-Hearted (1923 film)`, `Bullseye! (1990 film)`, `Fear (1990 film)`, `The Guard (1990 film)`, `Ironheart`, `The Guardian (1990 film)`, `Tiger Cage 2`, `Lion Rampant (game publisher)`, `Success (1991 film)`, `Lion in the Streets`, `The King of the Kickboxers`, `Lionheart (video game)`, `Richard the Lionheart (TV series)`, `Dermot Walsh`, `Lars Söderdahl`.»

Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal[search_wikipedia, lookup_wikipedia, finish]), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## next_thought ## ]]
The search results confirm that the 1990 film "Lionheart" stars Jean-Claude Van Damme, who is well-known for his martial arts action films. This information aligns with what is needed to verify the claim. I now have the relevant titles related to the claim: "Khiladiyon Ka Khiladi" and "Lionheart."

[[ ## next_tool_name ## ]]
finish

[[ ## next_tool_args ## ]]
{}

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The figure who currently holds the role Michael Noonan held from 2001 to 2002, was the Minister for Social Protection from 2016 until 2017.

[[ ## trajectory ## ]]


Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal[search_wikipedia, lookup_wikipedia, finish]), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## next_thought ## ]]
To verify the claim, I need to identify the role Michael Noonan held from 2001 to 2002. Then, I can find out who currently holds this role and check if they were the Minister for Social Protection from 2016 to 2017.

[[ ## next_tool_name ## ]]
search_wikipedia

[[ ## next_tool_args ## ]]
{"query": "Michael Noonan role 2001 to 2002"}

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
An actor in a musical film was also a United States Navy Combat veteran in World War II. This film was awarded The Golden Bear at the 50th Berlin International Film Festival.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
To verify the claim, I need to find out which musical film was awarded The Golden Bear at the 50th Berlin International Film Festival and check if an actor from that film was a United States Navy Combat veteran in World War II. I will start by searching for the musical films that won the Golden Bear around the time of the 50th Berlin International Film Festival.

[[ ## tool_name_0 ## ]]
search_wikipedia

[[ ## tool_args_0 ## ]]
{"query": "Golden Bear 50th Berlin International Film Festival musical film"}

[[ ## observation_0 ## ]]
[1] «53rd Berlin International Film Festival | The 54th annual Berlin International Film Festival was held from February 6–16, 2003. The festival opened with musical film "Chicago" by Rob Marshall and closed with Martin Scorsese's "Gangs of New York", both films played out of competition at the festival. The Golden Bear was awarded to British film "In This World" directed by Michael Winterbottom.»
[2] «50th Berlin International Film Festival | The 50th annual Berlin International Film Festival was held from February 9 to 20, 2000. The festival opened with "The Million Dollar Hotel" by Wim Wenders. " Bossa Nova" by Bruno Barreto, screened out of competition was the closing film of the festival. The Golden Bear was awarded to American film "Magnolia" directed by Paul Thomas Anderson.»
[3] «40th Berlin International Film Festival | The 40th annual Berlin International Film Festival was held from 9 to 20 February 1990. The festival opened with "Steel Magnolias" by Herbert Ross, which was shown out of competition. The Golden Bear was awarded to the American film "Music Box" directed by Costa-Gavras and Czech film "Skřivánci na niti" directed by Jiří Menzel.»
[4] «66th Berlin International Film Festival | The 66th Berlin International Film Festival was held from 11 to 21 February 2016, with American actress Meryl Streep as the President of the Jury. The Honorary Golden Bear for lifetime achievement was presented to German cinematographer Michael Ballhaus. " Hail, Caesar! ", directed by Joel and Ethan Coen, was selected to open the festival. The Golden Bear was awarded to the Italian documentary "Fire at Sea", directed by Gianfranco Rosi, which also serves as closing night film.»
[5] «49th Berlin International Film Festival | The 49th annual Berlin International Film Festival was held from February 10 to 21, 1999. The festival opened with "Aimée & Jaguar" by Max Färberböck. The Golden Bear was awarded to Canadian-American film "The Thin Red Line" directed by Terrence Malick.»
[6] «Other retrieved pages have titles: `38th Berlin International Film Festival`, `48th Berlin International Film Festival`, `45th Berlin International Film Festival`, `47th Berlin International Film Festival`, `64th Berlin International Film Festival`, `65th Berlin International Film Festival`, `46th Berlin International Film Festival`, `54th Berlin International Film Festival`, `62nd Berlin International Film Festival`, `25th Berlin International Film Festival`, `35th Berlin International Film Festival`, `43rd Berlin International Film Festival`, `67th Berlin International Film Festival`, `Golden Bear`, `44th Berlin International Film Festival`, `60th Berlin International Film Festival`, `13th Berlin International Film Festival`, `34th Berlin International Film Festival`, `51st Berlin International Film Festival`, `1st Berlin International Film Festival`, `30th Berlin International Film Festival`, `42nd Berlin International Film Festival`, `27th Berlin International Film Festival`, `31st Berlin International Film Festival`, `32nd Berlin International Film Festival`.»

[[ ## thought_1 ## ]]
The Golden Bear at the 50th Berlin International Film Festival was awarded to the American film "Magnolia" directed by Paul Thomas Anderson. I need to verify if any actor in "Magnolia" was a United States Navy Combat veteran in World War II.

[[ ## tool_name_1 ## ]]
search_wikipedia

[[ ## tool_args_1 ## ]]
{"query": "Magnolia film cast"}

[[ ## observation_1 ## ]]
[1] «Magnolia (film) | Magnolia is a 1999 American ensemble drama film written, co-produced and directed by Paul Thomas Anderson. The film stars Jeremy Blackman, Tom Cruise, Melinda Dillon, Philip Baker Hall, Philip Seymour Hoffman, Ricky Jay, William H. Macy, Alfred Molina, Julianne Moore, John C. Reilly, Jason Robards and Melora Walters, and is a mosaic of interrelated characters in search of happiness, forgiveness and meaning in the San Fernando Valley.»
[2] «Steel Magnolias (2012 film) | Steel Magnolias is an American comedy-drama television film directed by Kenny Leon that premiered at Lifetime Network on October 7, 2012. It is a contemporary retelling of the play "Steel Magnolias" and its 1989 film adaptation. The new film stars an all-Black American cast, including Queen Latifah as M'Lynn, Jill Scott as Truvy, Condola Rashād as Shelby, Adepero Oduye as Annelle, with Phylicia Rashād as Clairee and Alfre Woodard as Ouiser.»
[3] «Baller Blockin' | Baller Blockin' is a 2000 drama film set in New Orleans's Magnolia Projects. It stars Bryan “Birdman” Williams, Ronald "Slim" Williams, Juvenile, Manuel Vazquez, B.G., Lil Wayne, Turk, and Mannie Fresh with cameos by comedians Anthony Johnson and T.K. Kirkland.»
[4] «360 (film) | 360 is a 2011 ensemble drama film starring Anthony Hopkins, Ben Foster, Rachel Weisz, Jude Law and other international actors. The film, directed by Fernando Meirelles, opened the 2011 London Film Festival. Magnolia Pictures released the film on video on demand on 29 June 2012 and was released in United States theaters on 3 August 2012.»
[5] «River of Romance | River of Romance is a 1929 American drama film directed by Richard Wallace and written by Ethel Doherty, Joseph L. Mankiewicz, Dan Totheroh and John V.A. Weaver. The film stars Charles 'Buddy' Rogers, Mary Brian, June Collyer, Henry B. Walthall, Wallace Beery, Fred Kohler and Natalie Kingston. It is based on the play "Magnolia" by Booth Tarkington. The film was released on June 29, 1929, by Paramount Pictures.»
[6] «Other retrieved pages have titles: `Ira &amp; Abby`, `The Seat Filler`, `Serious Moonlight (2009 film)`, `Experimenter (film)`, `Humpday`, `Roadie (2011 film)`, `A Kind of Murder`, `The Magic of Belle Isle`, `The Hero of Color City`, `The War Within (film)`, `Claudia Lee`, `Steel Magnolias`, `Dina Bonnevie`, `Deadline (1987 film)`, `Creative Control (film)`, `Humboldt County (film)`, `Temptation Island (1980 film)`, `Gina Beck`, `Magnolia (score)`, `Cashback (film)`, `Cynthia Zamora`, `Straight Talk`, `Magnolia (soundtrack)`, `The Great Buck Howard`, `Binta and the Great Idea`.»

[[ ## thought_2 ## ]]
The cast of the film "Magnolia" includes several well-known actors. I need to investigate whether any of these actors, such as Tom Cruise, Philip Baker Hall, or Jason Robards, were United States Navy Combat veterans in World War II.

[[ ## tool_name_2 ## ]]
search_wikipedia

[[ ## tool_args_2 ## ]]
{"query": "Jason Robards military service"}

[[ ## observation_2 ## ]]
[1] «Jason Robards | Jason Nelson Robards Jr. (July 26, 1922 – December 26, 2000) was an American stage, film, and television actor. He was a winner of the Tony Award, two Academy Awards and an Emmy Award. He was also a United States Navy combat veteran of World War II.»
[2] «Jason Robards Sr. | Jason Nelson Robards Sr. (December 31, 1892 – April 4, 1963) was an American stage and screen actor, and the father of Oscar-winning actor Jason Robards Jr. Robards appeared in many films, initially as a leading man, then in character roles and occasional bits. Most of his final roles were in television.»
[3] «Jason Wingard | Jason Wingard, Ph.D. (born December 13, 1971) is Dean and Professor of the School of Professional Studies (SPS) at Columbia University. He is a leading academic and executive in the areas of leadership development, professional learning, and human capital management. As part of this work, he has 1) authored several books; 2) provided keynote, speaking, and consulting services to global Fortune 500 organizations, professional events, and conferences; 3) served as a board member and facilitator; and 4) held senior executive roles in corporate, academic, and non-profit organizations.»
[4] «Jason Torpy | Jason Torpy is president of the Military Association of Atheists and Freethinkers (MAAF), an advocacy group focused on non-religious service members and veterans. He is a veteran of the U.S. Army and a Humanist Celebrant who works to increase the visibility of "atheists in foxholes".»
[5] «Jason Dunkerley | Jason Joseph Dunkerley (born August 21, 1977, Newtownards, Northern Ireland) is a Canadian Paralympian athlete competing mainly in category T11 and T12 middle distance events.»
[6] «Other retrieved pages have titles: `Jase Daniels`, `Jason Dunham`, `Jason Schauble`, `Frederick C. Billard`, `Wilfred C. Menard, Jr.`, `Jason Millard`, `George C. Rickards`, `Jason Amerine`, `USS Jason (AR-8)`, `Rob (dog)`, `USS Jason Dunham`, `Robert Lee Bullard`, `Jason Chee (para-athlete)`, `The War Between Men and Women`, `Thomas Byard`, `Ralph Baker (general)`, `Jason Lamont`, `Jason Rother incident`, `William H. G. Bullard`, `Sidney Gerald Burrard`, `Jason Gilardi`, `Reunion (1989 film)`, `JASON (advisory group)`, `The Great American West`, `Samuel Robison`.»

Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal[search_wikipedia, lookup_wikipedia, finish]), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## next_thought ## ]]
I have confirmed that Jason Robards, an actor in the film "Magnolia," was a United States Navy combat veteran of World War II. Since "Magnolia" was awarded The Golden Bear at the 50th Berlin International Film Festival, this satisfies the claim. I can now finish with the relevant Wikipedia titles.

[[ ## next_tool_name ## ]]
finish

[[ ## next_tool_args ## ]]
{}

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
To verify the claim, I need to identify the playwrights who wrote the unproduced script for "Up Against It" and "The Beatles".

[[ ## tool_name_0 ## ]]
search_wikipedia

[[ ## tool_args_0 ## ]]
{"query": "Up Against It script writer Bernard-Marie Koltès"}

[[ ## observation_0 ## ]]
[1] «Bernard-Marie Koltès | Bernard-Marie Koltès (] ; 9 April 1948 – 15 April 1989) was a French playwright and theatre director best known for his plays "La Nuit juste avant les Forêts" ("The Night Just Before the Forests", 1976), "Sallinger" (1977) and "Dans la Solitude des Champs de Coton" ("In the Solitude of Cotton Fields", 1986).»
[2] «In the Solitude of Cotton Fields | In the Solitude of Cotton Fields is a play written by Bernard Marie Koltès in 1985. It is a two-person play involving The Client and Dealer. They negotiate a deal on the streets late at night. The play moves through mutual solitude with the Dealer unable to sell and the Client unable to buy. It’s never clear what each character has to offer each other. Koltès is interested in negotiation, power and violence and the way we treat each other.»
[3] «Bernard Toublanc-Michel | Bernard Toublanc-Michel (born 6 December 1927) is a French film director and screenwriter. His 1964 film "La Difficulté d'être infidèle" was entered into the 14th Berlin International Film Festival.»
[4] «Not For, or Against | Not For, or Against (Quite the Contrary) (original title: Ni pour, ni contre (bien au contraire)) is a 2003 French crime drama film directed and co-written by Cédric Klapisch. It stars Marie Gillain and Vincent Elbaz.»
[5] «Bernard Comment | Bernard Comment (born April 20, 1960) is a Swiss writer, translator, scriptwriter, and publisher of books.»
[6] «Other retrieved pages have titles: `Wrong (film)`, `Bernard Bellefroid`, `Too Beautiful for You`, `Bernard Paul`, `Jean-Jacques Bernard`, `Anahareo`, `Bernard Émond`, `Bernard Nabonne`, `Marie-Christine Bernard`, `Bernard Giraudeau`, `Bernard Kops`, `Bernard Charlès`, `Bernard Assiniwi`, `Bernard C. Schoenfeld`, `Jean Bernard-Luc`, `Bernard Stora`, `Bernard Quiriny`, `De contemptu mundi`, `Marie Françoise Bernard`, `Marc Bernard`, `Up Against It`, `Bernard Gui`, `Bernard Lecache`, `Bernard Le Nail`, `Bernard Zimmer`.»

[[ ## thought_1 ## ]]
I have found that Bernard-Marie Koltès wrote the unproduced script for "Up Against It". I need to verify if the author of "Up Against It" is also a playwright.

[[ ## tool_name_1 ## ]]
search_wikipedia

[[ ## tool_args_1 ## ]]
{"query": "Up Against It author"}

[[ ## observation_1 ## ]]
[1] «Up Against It | Up Against It is an unproduced script by Joe Orton, written in 1967 for The Beatles at the height of their fame.»
[2] «Up Against It! (Todd Rundgren album) | Up Against It! is a 1997 album by Todd Rundgren, essentially consisting of Rundgren's song demos for the Off Broadway show that were Written and Recorded by Todd from 1986-88. The project was inspired by the never-produced Up Against It which was a play originally written by Joe Orton for The Beatles.»
[3] «Up Against It (album) | Up Against It is the fifth album by West London Post punk and Indie band The Times released in 1986.»
[4] «Up Against It (1912 film) | The 1912 American short film Up Against It is a romantic comedy directed by Otis Turner and starring King Baggot. It was produced by the Independent Moving Pictures (IMP) Company of New York.»
[5] «Against the Odds (novel) | Against the Odds is a science fiction novel by Elizabeth Moon. It is her seventh and last novel set in the Familias Regnant fictional universe. It does not fall in either informal trilogy (the Heris Serrano and the Esmay Suiza trilogies); fittingly it does not focus on any particular character, instead a more general, almost kaleidoscopic perspective of the upheaval in the Familias Regnant, and the rise to power of a new and more capable Speaker. It can be seen as a conclusion to the series, resolving or at least making a good start at resolving many issues and peoples and ending as it does on a memorializing elegiac note.»
[6] «Other retrieved pages have titles: `Against All Things Ending`, `Mark Greif`, `The Man Who Wouldn't Stand Up`, `Against the Day`, `Against a Dark Background`, `Against the Day (1999)`, `Against Happiness`, `Seventeen Against the Dealer`, `Against Our Will`, `Fredy Perlman`, `The Case Against Barack Obama`, `Against the Fall of Night`, `Against Therapy`, `The War Against the Assholes`, `Thirteen Against Fate`, `Hollywood, Interrupted`, `The Opposing Shore`, `Tim Weiner`, `Against Democracy`, `Against Equality of Opportunity`, `H. P. Lovecraft: Against the World, Against Life`, `Against All Enemies`, `Against All Enemies (novel)`, `The Plot Against America`, `Resistance (creativity)`.»

[[ ## thought_2 ## ]]
I have confirmed that the author of the unproduced script for "Up Against It" is Joe Orton, a playwright.

[[ ## tool_name_2 ## ]]
lookup_wikipedia

[[ ## tool_args_2 ## ]]
{"title": "Joe Orton"}

[[ ## observation_2 ## ]]
John Kingsley "Joe" Orton (1 January 1933 – 9 August 1967) was an English playwright and author. His public career was short but prolific, lasting from 1964 until his death three years later. During this brief period he shocked, outraged, and amused audiences with his scandalous black comedies. The adjective "Ortonesque" is sometimes used to refer to work characterised by a similarly dark yet farcical cynicism.

Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal[search_wikipedia, lookup_wikipedia, finish]), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## next_thought ## ]]
I have identified both playwrights as Bernard-Marie Koltès and Joe Orton, who both wrote the unproduced script for "Up Against It".

[[ ## next_tool_name ## ]]
finish

[[ ## next_tool_args ## ]]
{}

[[ ## completed ## ]]





[2024-12-17T15:13:27.562128]

System message:

Your input fields are:
1. `claim` (str)
2. `trajectory` (str)

Your output fields are:
1. `reasoning` (str)
2. `titles` (list[str])

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## claim ## ]]
{claim}

[[ ## trajectory ## ]]
{trajectory}

[[ ## reasoning ## ]]
{reasoning}

[[ ## titles ## ]]
{titles}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        You are a fact-checking assistant tasked with verifying or refuting claims using Wikipedia as your primary source. Your goal is to identify all relevant Wikipedia titles that can help substantiate or invalidate the given claim. Approach the task by reasoning through the claim step-by-step, using your knowledge to determine the best tools for gathering evidence. Utilize the available tools to search for and look up Wikipedia articles, and compile a list of titles that are pertinent to the claim. Finish the process by ensuring the list of titles accurately reflects the information needed to assess the claim's validity.


User message:

[[ ## claim ## ]]
1990 Film that Khiladiyon Ka Khiladi is loosely based on stars this actor who is best known for martial arts action films.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
To verify the claim, I need to identify the 1990 film that "Khiladiyon Ka Khiladi" is loosely based on and the actor known for martial arts action films who starred in it. I will start by searching for information on "Khiladiyon Ka Khiladi" to find details about its inspiration.

[[ ## tool_name_0 ## ]]
search_wikipedia

[[ ## tool_args_0 ## ]]
{"query": "Khiladiyon Ka Khiladi"}

[[ ## observation_0 ## ]]
[1] «Khiladiyon Ka Khiladi | Khiladiyon Ka Khiladi (English: Player of Players) is a 1996 Indian action film starring Rekha in her first villain role, Akshay Kumar, Raveena Tandon and former WWE wrestlers "Crush" and Brian Lee as "The Undertaker". It was the 5th highest grossing movie of the year 1996 and was declared 'SuperHit' by Box Office India. It was the fourth installment in the Khiladi (film series). The movie is loosely based based on Hollywood film Lionheart»
[2] «Khiladi 420 | Khiladi 420 (English: "Con Player") is an Indian Hindi action film directed by Neeraj Vora and starring Akshay Kumar and Mahima Chaudhry. The film was written by Uttam Gudda and released on 29 December 2000. It is the seventh installment in the "Khiladi" series starring Kumar, which included "Khiladi" (1992), "Main Khiladi Tu Anari" (1994), "Sabse Bada Khiladi" (1995), "Khiladiyon Ka Khiladi" (1996), "Mr. and Mrs. Khiladi" (1997) and "International Khiladi" (1999).»
[3] «Khiladi (1992 film) | Khiladi (English: "Player" ) is a 1992 Indian suspense thriller film directed by Abbas Mustan. The film was Akshay Kumar's breakthrough role and also stars Ayesha Jhulka, Deepak Tijori, Sabeeha. While Prem Chopra, Shakti Kapoor, Anant Mahadevan and Johnny Lever played supporting roles. "Khiladi" was the first installment in the Khiladi (film series) which had "Khiladi" in the title and Akshay Kumar in the leading role. It was followed by "Main Khiladi Tu Anari" (1994), "Sabse Bada Khiladi" (1995), "Khiladiyon Ka Khiladi" (1996), "Mr. and Mrs. Khiladi" (1997), "International Khiladi" (1999), "Khiladi 420"(2000) and "Khiladi 786" (2012). Khiladi was critically and commercially success at the box-office and the tenth highest grossing film of 1992. It was Akshay Kumar's first successful movie and was declared a "Super Hit" at the box office. The basic premise of the story is similar to 1975 released movie Khel Khel Mein starring Rishi Kapoor and Neetu Singh. The film was remade in Kannada as "Aata Hudugaata".»
[4] «Khiladi (film series) | Khiladi series is a Bollywood action film series starring Akshay Kumar in the lead role. However, unlike other film series, other than having Akshay Kumar in lead role, and other than having the word "Khiladi" in the title, these films have nothing in common. The producers, directors and stories of these films are totally different. " Khiladi" (1992) was the first in a series of films which had Akshay Kumar in the title role and gave it his first breakthrough role. It was followed by "Main Khiladi Tu Anari" (1994), "Sabse Bada Khiladi" (1995), "Khiladiyon Ka Khiladi" (1996), "Mr. and Mrs. Khiladi" (1997), "International Khiladi" (1999) and "Khiladi 420" (2000), all featuring Kumar in the lead role. The latest film in the franchise is "Khiladi 786" (2012).»
[5] «Khiladi 786 | Khiladi 786 (खिलाड़ी 786) is a 2012 Indian Hindi Punjabi action comedy film directed by Ashish R Mohan, featuring Akshay Kumar in the title role alongside Asin playing the female lead. It features Himesh Reshammiya, Mithun Chakraborty, Raj Babbar and Mukesh Rishi in supporting roles. The film marks the return of Akshay Kumar to his "Khiladi" series after 12 years. It is mostly shot in Mumbai and Punjab. Khiladi786 is AkshayKumar 100th film.»
[6] «Other retrieved pages have titles: `Khiladi (2013 film)`, `Khiladi (1968 film)`, `International Khiladi`, `Khatron Ke Khiladi (1988 film)`, `Khatron Ke Khiladi (2001 film)`, `Umesh Mehra`, `Shatranj ke Khiladi`, `Main Khiladi Tu Anari`, `Fear Factor: Khatron Ke Khiladi (season 7)`, `Fear Factor: Khatron Ke Khiladi`, `Mr. and Mrs. Khiladi`, `Sabse Bada Khiladi`, `Ek Khiladi Bawan Pattey`, `Ek Khiladi Ek Haseena (TV series)`, `Do Khiladi`, `Ek Khiladi Ek Haseena (film)`, `Khel Khilari Ka`, `Fear Factor: Khatron Ke Khiladi (season 8)`, `Fear Factor: Khatron Ke Khiladi Darr Ka Blockbuster`, `Khilona (1942 film)`, `Fear Factor: Khatron Ke Khiladi Darr Ka Blockbuster Returns`, `Chhota Bheem: Dholakpur ke Khiladi`, `Khilani`, `Kandireega`, `Khilaaf`.»

[[ ## thought_1 ## ]]
The search results indicate that "Khiladiyon Ka Khiladi" is loosely based on the Hollywood film "Lionheart". I should now verify which actor, known for martial arts action films, starred in "Lionheart".

[[ ## tool_name_1 ## ]]
search_wikipedia

[[ ## tool_args_1 ## ]]
{"query": "Lionheart 1990 film"}

[[ ## observation_1 ## ]]
[1] «Lionheart (1990 film) | Lionheart (also known as Wrong Bet, A.W.O.L.: Absent Without Leave, Leon and Full Contact) is a 1990 action film, directed by Sheldon Lettich, starring Jean-Claude Van Damme and co-starring Brian Thompson, along with Harrison Page, Deborah Rennard, Lisa Pelikan, and Ashley Johnson.»
[2] «Truly, Madly, Deeply | Truly, Madly, Deeply is a 1990 British fantasy drama film made for the BBC's "Screen Two" series, by BBC Films, Lionheart and Winston Pictures. The film, written and directed by Anthony Minghella, stars Juliet Stevenson and Alan Rickman.»
[3] «Lionheart (1987 film) | Lionheart, also known as Lionheart: The Children's Crusade, is a 1987 adventure film directed by Franklin J. Schaffner and produced by Talia Shire and Stanley O'Toole. Shire's brother, Francis Ford Coppola, initially planned to direct the film but instead opted to be executive producer along with Shire's husband, Jack Schwartzman. The screenplay was written by Menno Meyjes and Richard Outten from a story by Meyjes. The composer Jerry Goldsmith wrote the score. The film was released in August 1987. It was distributed by Orion Pictures.»
[4] «Lionheart (2016 film) | Lionheart is a 2016 American boxing film short written and produced by Oscar DeRosa and Orlando Cicilia III. The film stars Oscar DeRosa and Marc Macaulay. The film portrays struggling professional boxer Max Rossi who is finally presented with the fight he's been waiting for that will launch his career to the next level but when he is suddenly confronted with a life-changing opportunity he must decide which path to take.»
[5] «Richard the Lionheart (2013 film) | Richard The Lionheart is a 2013 film, starring Chandler Maness as Richard the Lionheart, Malcolm McDowell as King Henry II and Andrea Zirio as Henry the Young.»
[6] «Other retrieved pages have titles: `Richard the Lionheart: Rebellion`, `The Brothers Lionheart (1977 film)`, `Kodama Simham`, `Jean-Claude Van Damme`, `Lionsheart`, `A Day for Lionhearts`, `The Lion in Winter (1968 film)`, `Lion Jagapathi Rao`, `Theatre of Blood`, `Young Lions (album)`, `Richard the Lion-Hearted (1923 film)`, `Bullseye! (1990 film)`, `Fear (1990 film)`, `The Guard (1990 film)`, `Ironheart`, `The Guardian (1990 film)`, `Tiger Cage 2`, `Lion Rampant (game publisher)`, `Success (1991 film)`, `Lion in the Streets`, `The King of the Kickboxers`, `Lionheart (video game)`, `Richard the Lionheart (TV series)`, `Dermot Walsh`, `Lars Söderdahl`.»

[[ ## thought_2 ## ]]
The search results confirm that the 1990 film "Lionheart" stars Jean-Claude Van Damme, who is well-known for his martial arts action films. This information aligns with what is needed to verify the claim. I now have the relevant titles related to the claim: "Khiladiyon Ka Khiladi" and "Lionheart."

[[ ## tool_name_2 ## ]]
finish

[[ ## tool_args_2 ## ]]
{}

[[ ## observation_2 ## ]]
Completed.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The claim suggests that the 1990 film which "Khiladiyon Ka Khiladi" is loosely based on stars an actor known for martial arts action films. Upon investigation, "Khiladiyon Ka Khiladi" is indeed loosely based on the Hollywood film "Lionheart". The 1990 film "Lionheart" stars Jean-Claude Van Damme, who is recognized for his roles in martial arts action films. This information supports the claim.

[[ ## titles ## ]]
["Khiladiyon Ka Khiladi", "Lionheart (1990 film)", "Jean-Claude Van Damme"]

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
Both the Chengiopanax sciadophylloides and Amaryllis are genus.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
To verify the claim, I need to search for information regarding both "Chengiopanax sciadophylloides" and "Amaryllis" to see if they are indeed classified as genus.

[[ ## tool_name_0 ## ]]
search_wikipedia

[[ ## tool_args_0 ## ]]
{"query": "Chengiopanax sciadophylloides"}

[[ ## observation_0 ## ]]
[1] «Chengiopanax sciadophylloides | Chengiopanax sciadophylloides is a flowering tree in the family Araliaceae native to Japan. Previously included in the genus "Eleutherococcus", it is distinguished from other members of that genus by not having spines or prickles and ITS sequence data confirmed the separation.»
[2] «Hunaniopanax hypoglaucus | Hunaniopanax hypoglaucus is a species of flowering plant of family Araliaceae, and the only species of genus Hunanioglaucus, named after the Chinese province of Hunan. Some authorities suggest merging this species into the genus "Aralia".»
[3] «Cortinarius caesiophylloides | Cortinarius caesiophylloides is a species of fungus in the large mushroom genus "Cortinarius" (subgenus "Phlegmacium"). Found in Fennoscandia, where it grows on the ground in mesic coniferous forests, it was described as new to science in 2014. The specific epithet "caesiophylloides" alludes to both its similarity to "Cortinarius multiformis" var. "caesiophyllus" (now named "C. caesiolamellatus"), and the bluish tints in the gills. It has since been found in Slovakia.»
[4] «Merrilliopanax | Merrilliopanax is a genus of flowering plant of family Araliaceae, comprising 3 species of the sub-genus Airampora. They are found in western China, Myanmar, northeast India, Bhutan, and Nepal.»
[5] «Schefflera morototoni | Schefflera morototoni (yagrumo macho; syn. "Didymopanax morototoni" (Aubl. ) Decne. & Planch. , "Didymopanax morototoni" var. "angustipetalum" March; "Panax morototoni" Aublet; "Sciadophyllum paniculatum" Britton ) is a timber tree native to southern Mexico, the Greater Antilles, Central America, and South America. It grows in a variety of habitats, such as the Caatinga, Cerrado, and Amazon Rainforest of Brazil.»
[6] «Other retrieved pages have titles: `Merrilliopanax chinensis`, `Harmsiopanax ingens`, `Gunniopsis zygophylloides`, `Merrilliopanax alpinus`, `Sciadophyton`, `Sinopanax formosanus`, `Tabelliscolex`, `Tetragonia tetragonioides`, `Chengjiangocaris`, `Wan Chun Cheng`, `Cheng Weigao`, `Chengguan, Chamdo`, `Nocardioides panzhihuensis`, `Cardiodictyon`, `Scyphiphora`, `Puschkinia scilloides`, `Liao Chengzhi`, `Lilaea scilloides`, `Chenggao`, `Chengjiao Subdistrict, Ningxiang`, `Seioglobal`, `Chengjiangaspis`, `Cheng Shikai`, `Xenocyprioides`, `Chenggong fire belly newt`.»

[[ ## thought_1 ## ]]
The search results indicate that "Chengiopanax sciadophylloides" is described as a flowering tree and doesn't specifically mention it being a genus. It was previously included in the genus "Eleutherococcus". To verify the claim, I need to ensure "Amaryllis" is classified as a genus.

[[ ## tool_name_1 ## ]]
search_wikipedia

[[ ## tool_args_1 ## ]]
{"query": "Amaryllis"}

[[ ## observation_1 ## ]]
[1] «Amaryllis (disambiguation) | Amaryllis is a small genus of flowering bulbs, with two species.»
[2] «Amaryllis | Amaryllis ( ) is the only genus in the subtribe Amaryllidinae (tribe Amaryllideae). It is a small genus of flowering bulbs, with two species. The better known of the two, "Amaryllis belladonna", is a native of the Western Cape region of South Africa, particularly the rocky southwest area between the Olifants River Valley to Knysna. For many years there was confusion among botanists over the generic names "Amaryllis" and "Hippeastrum", one result of which is that the common name "amaryllis" is mainly used for cultivars of the genus "Hippeastrum", widely sold in the winter months for their ability to bloom indoors. Plants of the genus "Amaryllis" are known as belladonna lily, Jersey lily, naked lady, amarillo, Easter lily in Southern Australia or, in South Africa, March lily due to its propensity to flower around March. This is one of numerous genera with the common name "lily" due to their flower shape and growth habit. However, they are only distantly related to the true lily, "Lilium".»
[3] «Amaryllis (given name) | Amaryllis (Αμαρυλλίς) is a female ancient Greek name and means "sparkling". According the mythology, the name of the beautiful flower Amaryllis derived from the nymph Amaryllis.»
[4] «Amaryllidaceae | The Amaryllidaceae are a family of herbaceous, mainly perennial and bulbous (rarely rhizomatous) flowering plants in the monocot order Asparagales. The family takes its name from the genus "Amaryllis" and is commonly known as the amaryllis family. The leaves are usually linear, and the flowers are usually bisexual and symmetrical, arranged in umbels on the stem. The petals and sepals are undifferentiated as tepals, which may be fused at the base into a floral tube. Some also display a corona. Allyl sulfide compounds produce the characteristic odour of the onion subfamily (Allioideae).»
[5] «Amaryllis paradisicola | Amaryllis paradisicola is a species of bulbous perennial plant from South Africa.»
[6] «Other retrieved pages have titles: `Amaryllis (catamaran)`, `Amaryllis (yacht)`, `Amaryllis belladonna`, `Schinia amaryllis`, `Hippeastrum reticulatum`, `Pirangoclytus amaryllis`, `Amaryllis (ship)`, `Hippeastrum psittacinum`, `Ogyris amaryllis`, `Amaryllis (Shinedown album)`, `Amaryllis (restaurant)`, `1085 Amaryllis`, `Apatelodes amaryllis`, `Premolis amaryllis`, `Amaryllideae`, `Amaryllis Garnett`, `Amaryllis Chamber Ensemble`, `Amaryllis Fleming`, `Brithys crini`, `Amaryllis (Marilyn Crispell, Gary Peacock and Paul Motian album)`, `Amarylloidinae`, `Amaryllidoideae`, `Amaryllis Night and Day`, `Amaryllididae`, `Aecidium amaryllidis`.»

[[ ## thought_2 ## ]]
The search results confirm that "Amaryllis" is a genus, specifically a small genus of flowering bulbs. However, the earlier results for "Chengiopanax sciadophylloides" do not confirm it as a genus; it is described more as a species of flowering tree. Therefore, Chengiopanax sciadophylloides might not be a genus itself but belongs to a different classification. I will now verify if "Chengiopanax" itself is recognized as a genus.

[[ ## tool_name_2 ## ]]
search_wikipedia

[[ ## tool_args_2 ## ]]
{"query": "Chengiopanax"}

[[ ## observation_2 ## ]]
[1] «Chengiopanax sciadophylloides | Chengiopanax sciadophylloides is a flowering tree in the family Araliaceae native to Japan. Previously included in the genus "Eleutherococcus", it is distinguished from other members of that genus by not having spines or prickles and ITS sequence data confirmed the separation.»
[2] «Hunaniopanax hypoglaucus | Hunaniopanax hypoglaucus is a species of flowering plant of family Araliaceae, and the only species of genus Hunanioglaucus, named after the Chinese province of Hunan. Some authorities suggest merging this species into the genus "Aralia".»
[3] «Merrilliopanax | Merrilliopanax is a genus of flowering plant of family Araliaceae, comprising 3 species of the sub-genus Airampora. They are found in western China, Myanmar, northeast India, Bhutan, and Nepal.»
[4] «Xiaxue | Cheng Yan Yan Wendy (born Cheng Yan Yan; 28 April 1984), better known by her pseudonym Xiaxue, is a Singaporean blogger and online television personality who writes about her life, fashion and local issues in a provocative style. Her main blog, which attracts about 50,000 readers daily, has won prestigious blog awards and earned her sponsorship deals, as well as stints as a columnist and TV show host, but some of her posts have sparked national controversies. She is married to American engineer Mike Sayre and they have one child.»
[5] «Sinopanax formosanus | Sinopanax formosanus is a species of flowering plant of family Araliaceae and the only species of genus Sinopanax. It is endemic to Taiwan. It is a small, evergreen tree, up to 12 m in height, that grows in open areas in mountainous forests at altitudes between 2300 - .»
[6] «Other retrieved pages have titles: `Merrilliopanax chinensis`, `Linpan in Chengdu Plain`, `Pan Chengdong`, `Merrilliopanax alpinus`, `Pan Cheng-tsung`, `Pan Chengyou`, `Seioglobal`, `Harmsiopanax ingens`, `Kalopanax`, `Cheng Jiao`, `Chengo`, `Chengguan, Chamdo`, `Chenguang`, `Chenggao`, `Chengdu University of Information Technology`, `Chengai Sivam`, `Chênggo`, `Cheng Weigao`, `Cheng Xunzhao`, `Gan Peck Cheng`, `Cheng Hong`, `Chengdu Qianbao F.C.`, `Cheng I`, `Cheng Yu-tung`, `Sun Chengyao`.»

[[ ## thought_3 ## ]]
The search results do not definitively confirm "Chengiopanax" as a genus, rather they focus on the species "Chengiopanax sciadophylloides" which was previously part of the "Eleutherococcus" genus. There is no clear indication in the search results that "Chengiopanax" is recognized as a genus. To ensure a comprehensive understanding, I will look up the "Chengiopanax sciadophylloides" page directly to see if there is further clarification on its classification as a genus.

[[ ## tool_name_3 ## ]]
lookup_wikipedia

[[ ## tool_args_3 ## ]]
{"title": "Chengiopanax sciadophylloides"}

[[ ## observation_3 ## ]]
Chengiopanax sciadophylloides is a flowering tree in the family Araliaceae native to Japan. Previously included in the genus "Eleutherococcus", it is distinguished from other members of that genus by not having spines or prickles and ITS sequence data confirmed the separation.

[[ ## thought_4 ## ]]
The lookup did not provide any new information suggesting that "Chengiopanax" is a genus. It seems that "Chengiopanax sciadophylloides" is a species and not a genus. On the other hand, "Amaryllis" is confirmed to be a genus. Therefore, the claim that both "Chengiopanax sciadophylloides" and "Amaryllis" are genus is only partially correct. I will now finish with the titles that I have verified.

[[ ## tool_name_4 ## ]]
finish

[[ ## tool_args_4 ## ]]
{}

[[ ## observation_4 ## ]]
Completed.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The claim states that both "Chengiopanax sciadophylloides" and "Amaryllis" are genus. Upon investigation, it was found that "Chengiopanax sciadophylloides" is a species of flowering tree, and there is no confirmation that "Chengiopanax" itself is a recognized genus. It was previously included in the genus "Eleutherococcus". On the other hand, "Amaryllis" is confirmed to be a genus, specifically a small genus of flowering bulbs with two species. Therefore, the claim is only partially correct as "Chengiopanax sciadophylloides" is not a genus.

[[ ## titles ## ]]
["Amaryllis", "Eleutherococcus", "Chengiopanax sciadophylloides"]

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
200,000 people go through the airport where the headquarters of the airline that acquired Baboo Airlines are located.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
I need to identify which airline acquired Baboo Airlines to verify the claim. Once I have the name of the airline, I can find out where its headquarters are located and then look for information on the airport's passenger traffic.

[[ ## tool_name_0 ## ]]
search_wikipedia

[[ ## tool_args_0 ## ]]
{"query": "Baboo Airlines acquisition"}

[[ ## observation_0 ## ]]
[1] «Baboo (airline) | Fly Baboo SA, marketed as Baboo, was a regional airline based on the grounds of Geneva International Airport and in Grand-Saconnex, Switzerland. In November 2010, the company was saved from bankruptcy and acquired by Darwin Airline Group from Lugano.»
[2] «Okada Air | Okada Air was an airline based in Benin City, Nigeria. The carrier was established in 1983 with a fleet of BAC-One Eleven 300s. and started charter operations in the same year. In 1984, a Boeing 707-355C was acquired for cargo operations. By 1990, ten BAC One-Elevens were bought, and eight more were acquired in 1991. The company was granted the right of operating international flights in 1992.»
[3] «Go (airline) | Go Fly (styled and trading as Go) was the name of a British low-cost airline, founded by British Airways (BA) in 1998. It operated flights between London Stansted Airport and destinations in Europe. The airline was purchased from BA in a management buy-out backed by the private equity firm 3i in 2001. In 2002 it was bought by its rival EasyJet, and was merged into the airline's operations. Its head office was in the Enterprise House in London Stansted Airport in Stansted Mountfitchet, Essex.»
[4] «BAX Global | BAX Global is an American international shipping company that is headquartered in Irvine, California, and has other major offices in Australia, Singapore, London, the Netherlands and Toledo, Ohio. The company which was founded in 1971 also operates an airline. After having been acquired by DB Logistics in January 2006, Bax Global is currently being integrated with Schenker, the land, sea, and air freight branch of DB Logistics. DB Logistics is the freight and logistics division of Deutsche Bahn, the State-owned German railway company.»
[5] «Iloilo-Negros Air Express | Iloilo-Negros Air Express Company, Inc., which operated as Iloilo-Negros Air Express or INAEC and later as FEATI, was an airline based in the Philippines. The airline was folded into the operation of Philippine Airlines (PAL) which was re-nationalized by the Philippine government and became the national flag carrier. The company built the Bacolod International Airport in Bacolod City, which was bought by PAL after the acquisition of the airline.»
[6] «Other retrieved pages have titles: `Babool (brand)`, `British Aircraft Corporation`, `Boeing Aircraft Holding Company`, `Bakhtar Afghan Airlines`, `Baboo destinations`, `Baboo (band)`, `Baboom`, `Bond Aviation Group`, `Mark Darby`, `AGS Airports`, `BA Connect`, `AirMall`, `BAT Baboon`, `National Shipping Company of Saudi Arabia`, `Aviesa`, `British Airways Ltd (2012–15)`, `BA CityFlyer`, `Baboon (album)`, `Qingdao Haili Helicopters`, `Business Express Airlines`, `Babo Airport`, `Baboon lymphocryptovirus`, `BxB`, `British Airways`, `Manchester Airports Group`.»

[[ ## thought_1 ## ]]
The observation indicates that Baboo Airlines was acquired by Darwin Airline Group. Next, I should search for the headquarters location of Darwin Airline Group to identify the associated airport and find out the passenger traffic.

[[ ## tool_name_1 ## ]]
search_wikipedia

[[ ## tool_args_1 ## ]]
{"query": "Darwin Airline Group headquarters"}

[[ ## observation_1 ## ]]
[1] «Darwin Airline | Darwin Airline SA is a Swiss regional airline with its head office in Bioggio, Lugano flying under the brand name Adria Airways Switzerland. It has been operating scheduled domestic and international services throughout some western European countries using the brand name Etihad Regional from January 2014 until July 2017 when it was sold from Etihad Airways to Adria Airways as well as formerly also on behalf of Alitalia. Its bases are Geneva Airport and Lugano Airport.»
[2] «Airnorth | Capiteq Limited, trading as Airnorth, is a regional airline based at Darwin International Airport in Darwin, Northern Territory, Australia. It operates scheduled and charter services in the Northern Territory, Queensland, Victoria, Western Australia, and East Timor.»
[3] «Fly Tiwi | Fly Tiwi is an Australian airline based in Darwin, Northern Territory, offering scheduled passenger services between the Northern Territory capital and communities located on the Tiwi, South Goulburn and Croker islands, as well as a number of remote Arnhem Land communities and the town of Tennant Creek. The company is wholly owned by the Hardy Aviation group, Australia's largest general aviation company and was founded in 2008 in association with the Tiwi Land Council and now operates over 50 flights per week between 9 destinations.»
[4] «RAAF Base Darwin | RAAF Base Darwin (IATA: DRW, ICAO: YPDN) is a Royal Australian Air Force (RAAF) military air base located in the city of Darwin, in the Northern Territory, Australia. The base shares its runway with Darwin International Airport, for civil aviation purposes. The heritage-listed RAAF Base Darwin is a forward operating base with year round activity with approximately 400 personnel.»
[5] «Lugano Airport | Lugano Airport (IATA: LUG, ICAO: LSZA) is a regional airport located 4 km west of the Swiss city of Lugano, approximately 80 km north of Milan, in the municipalities of Agno, Bioggio and Muzzano. It lies closer to the village of Agno than to Lugano itself, and is sometimes known as Lugano-Agno. Around 200,000 passengers use the airport each year, using some 2,400 flights. There are also limited freight operations, a flying club and flight school. The headquarters of Darwin Airline are located at the airport.»
[6] «Other retrieved pages have titles: `No. 76 Wing RAAF`, `No. 13 Squadron RAAF`, `Mount Darwin Airport`, `Darwin International Airport`, `Baboo (airline)`, `Virgin Australia Regional Airlines`, `No. 1 Wing RAAF`, `No. 114 Mobile Control and Reporting Unit RAAF`, `Air Austral`, `Tigerair Australia destinations`, `Marine Rotational Force – Darwin`, `Austral Líneas Aéreas`, `Australian Aviation Heritage Centre`, `AirAsia`, `Flight Centre Australia`, `Flight Centre`, `North-Western Area Command (RAAF)`, `No. 5 Fighter Sector RAAF`, `McMillans Road`, `Cobham Aviation Services Australia`, `Asia Pacific Transport Consortium`, `1st Aviation Regiment (Australia)`, `Charles Darwin Foundation`, `Brindabella Airlines`, `Kulula.com`.»

[[ ## thought_2 ## ]]
The headquarters of Darwin Airline are located at Lugano Airport, which has around 200,000 passengers using the airport each year. This aligns with the claim about the passenger traffic at the airport where the headquarters of the airline that acquired Baboo Airlines are located.

[[ ## tool_name_2 ## ]]
finish

[[ ## tool_args_2 ## ]]
{}

[[ ## observation_2 ## ]]
Completed.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The claim states that 200,000 people go through the airport where the headquarters of the airline that acquired Baboo Airlines are located. Through the information found, Baboo Airlines was acquired by Darwin Airline Group. The headquarters of Darwin Airline Group are located at Lugano Airport. Further research revealed that Lugano Airport handles approximately 200,000 passengers per year. This confirms the accuracy of the claim regarding passenger traffic at the airport associated with Darwin Airline Group's headquarters.

[[ ## titles ## ]]
["Baboo (airline)", "Darwin Airline", "Lugano Airport"]

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.

[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
To verify the claim, I need to identify the playwrights who wrote the unproduced script for "Up Against It" and "The Beatles".

[[ ## tool_name_0 ## ]]
search_wikipedia

[[ ## tool_args_0 ## ]]
{"query": "Up Against It script writer Bernard-Marie Koltès"}

[[ ## observation_0 ## ]]
[1] «Bernard-Marie Koltès | Bernard-Marie Koltès (] ; 9 April 1948 – 15 April 1989) was a French playwright and theatre director best known for his plays "La Nuit juste avant les Forêts" ("The Night Just Before the Forests", 1976), "Sallinger" (1977) and "Dans la Solitude des Champs de Coton" ("In the Solitude of Cotton Fields", 1986).»
[2] «In the Solitude of Cotton Fields | In the Solitude of Cotton Fields is a play written by Bernard Marie Koltès in 1985. It is a two-person play involving The Client and Dealer. They negotiate a deal on the streets late at night. The play moves through mutual solitude with the Dealer unable to sell and the Client unable to buy. It’s never clear what each character has to offer each other. Koltès is interested in negotiation, power and violence and the way we treat each other.»
[3] «Bernard Toublanc-Michel | Bernard Toublanc-Michel (born 6 December 1927) is a French film director and screenwriter. His 1964 film "La Difficulté d'être infidèle" was entered into the 14th Berlin International Film Festival.»
[4] «Not For, or Against | Not For, or Against (Quite the Contrary) (original title: Ni pour, ni contre (bien au contraire)) is a 2003 French crime drama film directed and co-written by Cédric Klapisch. It stars Marie Gillain and Vincent Elbaz.»
[5] «Bernard Comment | Bernard Comment (born April 20, 1960) is a Swiss writer, translator, scriptwriter, and publisher of books.»
[6] «Other retrieved pages have titles: `Wrong (film)`, `Bernard Bellefroid`, `Too Beautiful for You`, `Bernard Paul`, `Jean-Jacques Bernard`, `Anahareo`, `Bernard Émond`, `Bernard Nabonne`, `Marie-Christine Bernard`, `Bernard Giraudeau`, `Bernard Kops`, `Bernard Charlès`, `Bernard Assiniwi`, `Bernard C. Schoenfeld`, `Jean Bernard-Luc`, `Bernard Stora`, `Bernard Quiriny`, `De contemptu mundi`, `Marie Françoise Bernard`, `Marc Bernard`, `Up Against It`, `Bernard Gui`, `Bernard Lecache`, `Bernard Le Nail`, `Bernard Zimmer`.»

[[ ## thought_1 ## ]]
I have found that Bernard-Marie Koltès wrote the unproduced script for "Up Against It". I need to verify if the author of "Up Against It" is also a playwright.

[[ ## tool_name_1 ## ]]
search_wikipedia

[[ ## tool_args_1 ## ]]
{"query": "Up Against It author"}

[[ ## observation_1 ## ]]
[1] «Up Against It | Up Against It is an unproduced script by Joe Orton, written in 1967 for The Beatles at the height of their fame.»
[2] «Up Against It! (Todd Rundgren album) | Up Against It! is a 1997 album by Todd Rundgren, essentially consisting of Rundgren's song demos for the Off Broadway show that were Written and Recorded by Todd from 1986-88. The project was inspired by the never-produced Up Against It which was a play originally written by Joe Orton for The Beatles.»
[3] «Up Against It (album) | Up Against It is the fifth album by West London Post punk and Indie band The Times released in 1986.»
[4] «Up Against It (1912 film) | The 1912 American short film Up Against It is a romantic comedy directed by Otis Turner and starring King Baggot. It was produced by the Independent Moving Pictures (IMP) Company of New York.»
[5] «Against the Odds (novel) | Against the Odds is a science fiction novel by Elizabeth Moon. It is her seventh and last novel set in the Familias Regnant fictional universe. It does not fall in either informal trilogy (the Heris Serrano and the Esmay Suiza trilogies); fittingly it does not focus on any particular character, instead a more general, almost kaleidoscopic perspective of the upheaval in the Familias Regnant, and the rise to power of a new and more capable Speaker. It can be seen as a conclusion to the series, resolving or at least making a good start at resolving many issues and peoples and ending as it does on a memorializing elegiac note.»
[6] «Other retrieved pages have titles: `Against All Things Ending`, `Mark Greif`, `The Man Who Wouldn't Stand Up`, `Against the Day`, `Against a Dark Background`, `Against the Day (1999)`, `Against Happiness`, `Seventeen Against the Dealer`, `Against Our Will`, `Fredy Perlman`, `The Case Against Barack Obama`, `Against the Fall of Night`, `Against Therapy`, `The War Against the Assholes`, `Thirteen Against Fate`, `Hollywood, Interrupted`, `The Opposing Shore`, `Tim Weiner`, `Against Democracy`, `Against Equality of Opportunity`, `H. P. Lovecraft: Against the World, Against Life`, `Against All Enemies`, `Against All Enemies (novel)`, `The Plot Against America`, `Resistance (creativity)`.»

[[ ## thought_2 ## ]]
I have confirmed that the author of the unproduced script for "Up Against It" is Joe Orton, a playwright.

[[ ## tool_name_2 ## ]]
lookup_wikipedia

[[ ## tool_args_2 ## ]]
{"title": "Joe Orton"}

[[ ## observation_2 ## ]]
John Kingsley "Joe" Orton (1 January 1933 – 9 August 1967) was an English playwright and author. His public career was short but prolific, lasting from 1964 until his death three years later. During this brief period he shocked, outraged, and amused audiences with his scandalous black comedies. The adjective "Ortonesque" is sometimes used to refer to work characterised by a similarly dark yet farcical cynicism.

[[ ## thought_3 ## ]]
I have identified both playwrights as Bernard-Marie Koltès and Joe Orton, who both wrote the unproduced script for "Up Against It".

[[ ## tool_name_3 ## ]]
finish

[[ ## tool_args_3 ## ]]
{}

[[ ## observation_3 ## ]]
Completed.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The claim states that the authors of the unproduced script for "Up Against It" are both playwrights. Investigation revealed that Bernard-Marie Koltès wrote the unproduced script, and further research confirmed that Joe Orton was also a playwright who wrote the script. Therefore, both playwrights are indeed the authors of the unproduced script for "Up Against It".

[[ ## titles ## ]]
["Bernard-Marie Koltès", "Joe Orton"]

[[ ## completed ## ]]

Finally, let's save our optimized program so we can use it again later.

optimized_react.save("optimized_react.json")

loaded_react = dspy.ReAct("claim -> titles: list[str]", tools=[search_wikipedia, lookup_wikipedia], max_iters=20)
loaded_react.load("optimized_react.json")

loaded_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
['Bernard-Marie Koltès', 'Joe Orton']
Saving programs in MLflow Experiment

Instead of saving the program to a local file, you can track it in MLflow for better reproducibility and collaboration.

  1. Dependency Management: MLflow automatically saves the frozen environment metadata along with the program to ensure reproducibility.
  2. Experiment Tracking: With MLflow, you can track the program's performance and cost along with the program itself.
  3. Collaboration: You can share the program and results with your team members by sharing the MLflow experiment.

To save the program in MLflow, run the following code:

import mlflow

# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized_rag"):
    model_info = mlflow.dspy.log_model(
        optimized_react,
        artifact_path="model", # Any name to save the program in MLflow
    )

# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)

To learn more about the integration, visit MLflow DSPy Documentation as well.


Tutorial: Classification Fine-tuning

Let's walk through a quick example of fine-tuning the LM weights within a DSPy program. We'll apply it to a simple 77-way classification task.

Our finetuned program will use a tiny Llama-3.2-1B language model, hosted locally on your GPU. To make this more interesting, we'll assume that (i) we don't have any training labels but (ii) we have 500 unlabeled training examples.

Install dependencies and download data

Install the latest DSPy via pip install -U "dspy>=2.6.0" and follow along (or use uv pip, if you prefer). This tutorial depends on DSPy >= 2.6.0.

This tutorial requires a local GPU at the moment for inference, though we plan to support ollama serving for finetuned models as well.

You will also need the following dependencies:

  1. Inference: We use SGLang to run local inference servers. You can install the latest version by following the instructions here: https://docs.sglang.ai/start/install.html. Shared below is the most recent install command as of 04/02/2025, but we recommend following the instructions at that link for the most up-to-date version; this keeps the fine-tuning packages and the sglang package in sync.
    > pip install --upgrade pip
    > pip install uv
    > uv pip install "sglang[all]>=0.4.4.post3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
  2. Fine-tuning: We use the following packages. Note that we specify the version for the transformers package as a temporary fix to a recent issue: huggingface/trl#2338
    > uv pip install -U torch transformers==4.48.3 accelerate trl peft

We recommend using uv package manager to speed up the installation.

Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to better understand DSPy's behavior. You can set up MLflow easily by following the four steps below.

MLflow Trace

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

To learn more about the integration, visit MLflow DSPy Documentation as well.

Dataset

For this tutorial, we will use the Banking77 dataset.

import dspy
import random
from dspy.datasets import DataLoader
from datasets import load_dataset

# Load the Banking77 dataset.
CLASSES = load_dataset("PolyAI/banking77", split="train", trust_remote_code=True).features['label'].names
kwargs = dict(fields=("text", "label"), input_keys=("text",), split="train", trust_remote_code=True)

# Load the first 1000 examples from the dataset, mapping each integer label to its class name.
raw_data = [
    dspy.Example(x, label=CLASSES[x.label]).with_inputs("text")
    for x in DataLoader().from_huggingface(dataset_name="PolyAI/banking77", **kwargs)[:1000]
]

random.Random(0).shuffle(raw_data)

This dataset has 77 different categories for classification. Let's review some of them.

len(CLASSES), CLASSES[:10]
(77,
 ['activate_my_card',
  'age_limit',
  'apple_pay_or_google_pay',
  'atm_support',
  'automatic_top_up',
  'balance_not_updated_after_bank_transfer',
  'balance_not_updated_after_cheque_or_cash_deposit',
  'beneficiary_not_allowed',
  'cancel_transfer',
  'card_about_to_expire'])

Let us sample 500 (unlabeled) queries from Banking77. We'll use these for our bootstrapped finetuning.

unlabeled_trainset = [dspy.Example(text=x.text).with_inputs("text") for x in raw_data[:500]]

unlabeled_trainset[0]
Example({'text': 'What if there is an error on the exchange rate?'}) (input_keys={'text'})

DSPy program

Let's say that we want a program that takes the text and reasons step by step and then selects one of the classes from Banking77.

Note that this is meant mainly for illustration, or for cases where you want to inspect the model's reasoning, e.g. for a small degree of explainability. In other words, this type of task is not necessarily likely to benefit very much from explicit reasoning.

from typing import Literal

classify = dspy.ChainOfThought(f"text -> label: Literal{CLASSES}")

Bootstrapped finetuning

There are many ways to go about this, e.g. allowing the model to teach itself or using inference-time compute (e.g., ensembling) to identify cases of high confidence without labels.

Perhaps the simplest is to use a model that we'd expect can do a reasonable job at this task as a teacher of reasoning and classification, and to distill that to our small model. All of these patterns can be expressed in a handful of lines of code.

Let's set up the tiny Llama-3.2-1B-Instruct as a student LM. We'll use GPT-4o-mini as a teacher LM.

from dspy.clients.lm_local import LocalProvider

student_lm_name = "meta-llama/Llama-3.2-1B-Instruct"
student_lm = dspy.LM(model=f"openai/local:{student_lm_name}", provider=LocalProvider(), max_tokens=2000)
teacher_lm = dspy.LM('openai/gpt-4o-mini', max_tokens=3000)

Now, let's assign classifiers to our LMs.

student_classify = classify.deepcopy()
student_classify.set_lm(student_lm)

teacher_classify = classify.deepcopy()
teacher_classify.set_lm(teacher_lm)
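
As an optional sanity check, you can run the teacher classifier on a single unlabeled example before kicking off finetuning. This is a small sketch, not part of the original tutorial, and it assumes your OpenAI credentials are configured for the gpt-4o-mini teacher.

# Optional smoke test (sketch): the teacher should return a reasoning string and one of the 77 labels.
print(teacher_classify(text=unlabeled_trainset[0].text))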

Let's now launch the bootstrapped finetuning. The word "bootstrapped" here means that the program itself will be invoked on the training inputs and the resulting traces seen over all modules will be recorded and used for finetuning. This is the weight-optimizing variant of the various BootstrapFewShot methods in DSPy.

On every question in the (unlabeled) training set, this will invoke the teacher program, which will produce reasoning and select a class. This will be traced and then constitute a training set for all modules (in this case, just the one CoT module) in the student program.

When the compile method is called, the BootstrapFinetune optimizer will use the passed teacher program (or programs, you can pass a list!) to create a training dataset. It will then use this training dataset to create a fine-tuned version of the LM set for the student program, replacing it with the trained LM. Note that the trained LM will be a new LM instance (the student_lm object we instantiated here will be untouched!)

Note: If you have labels, you can pass a metric to the constructor of BootstrapFinetune. If you need more control over local LM training, you can also pass train_kwargs to the constructor to set options such as: device, use_peft, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, learning_rate, max_seq_length, packing, bf16, and output_dir.
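
For illustration, a constructor call that sets a few of these training options might look like the sketch below. The specific values are arbitrary placeholders rather than recommendations, and the actual tutorial code that follows uses the defaults.

optimizer = dspy.BootstrapFinetune(
    num_threads=16,
    # metric=your_metric,              # optional: only if you have labels
    train_kwargs=dict(
        use_peft=True,                 # parameter-efficient finetuning
        num_train_epochs=2,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        max_seq_length=2048,
        bf16=True,
        output_dir="finetune_checkpoints",  # hypothetical path
    ),
)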

# Optional:
# [1] You can set the `DSPY_FINETUNEDIR` environment variable to control the directory that will be used to store the
#     checkpoints and fine-tuning data. If this is not set, `DSPY_CACHEDIR` is used by default.
# [2] You can set the `CUDA_VISIBLE_DEVICES` environment variable to control the GPU that will be used for fine-tuning
#     and inference. If this is not set and the default GPU that's used by HuggingFace's `transformers` library is
#     occupied, an OutOfMemoryError might be raised.
#
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# os.environ["DSPY_FINETUNEDIR"] = "/path/to/dir"
dspy.settings.experimental = True  # fine-tuning is an experimental feature, so we set a flag to enable it

optimizer = dspy.BootstrapFinetune(num_threads=16)  # if you *do* have labels, pass metric=your_metric here!
classify_ft = optimizer.compile(student_classify, teacher=teacher_classify, trainset=unlabeled_trainset)

Since this is a local model, we need to explicitly launch it.

classify_ft.get_lm().launch()

Validating the finetuned program

Let's now figure out if this was successful. We can ask the system one question and inspect its behavior.

classify_ft(text="I didn't receive my money earlier and it says the transaction is still in progress. Can you fix it?")
Prediction(
    reasoning='The user is inquiring about a specific issue, which they did not receive and is still showing as a pending transaction. This situation typically indicates a problem with the cash withdrawal process, as the user is not receiving the money they attempted to withdraw. The appropriate label for this scenario is "pending_cash_withdrawal," as it directly relates to the status of the cash withdrawal transaction.',
    label='pending_cash_withdrawal'
)

We could also get a small set of gold labels and see if the system can generalize to unseen queries.

devset = raw_data[500:600]
devset[0]
Example({'text': 'Which fiat currencies do you currently support? Will this change in this future?', 'label': 'fiat_currency_support'}) (input_keys={'text'})

Let's define an evaluator on this small dev set, where the metric ignores the reasoning and checks that the label is exactly correct.

metric = (lambda x, y, trace=None: x.label == y.label)
evaluate = dspy.Evaluate(devset=devset, metric=metric, display_progress=True, display_table=5, num_threads=16)

Now, let's evaluate the finetuned 1B classifier.

evaluate(classify_ft)
Average Metric: 51.00 / 99 (51.5%): 100%|██████████| 100/100 [00:35<00:00,  2.79it/s]
|   | text | example_label | reasoning | pred_label | <lambda> | label |
|---|------|---------------|-----------|------------|----------|-------|
| 0 | Which fiat currencies do you currently support? Will this change i... | fiat_currency_support | The user is inquiring about the current support for fiat currencie... | fiat_currency_support | ✔️ [True] | NaN |
| 1 | I didn't receive my money earlier and it says the transaction is s... | pending_cash_withdrawal | The user is inquiring about a specific issue, which they did not r... | pending_cash_withdrawal | ✔️ [True] | NaN |
| 2 | what currencies do you accept? | fiat_currency_support | The user is inquiring about the currencies that are accepted, whic... | fiat_currency_support | ✔️ [True] | NaN |
| 3 | Where can I find your exchange rates? | exchange_rate | The user is inquiring about where to find exchange rates, which re... | exchange_rate | ✔️ [True] | NaN |
| 4 | why hasnt my card come in yet? | card_arrival | The user is inquiring about the status of their card, which sugges... | card_arrival | ✔️ [True] | NaN |
... 95 more rows not displayed ...
51.0
Tracking Evaluation Results in MLflow Experiment

To track and visualize the evaluation results over time, you can record the results in MLflow Experiment.

import mlflow

with mlflow.start_run(run_name="classifier_evaluation"):
    evaluate_correctness = dspy.Evaluate(
        devset=devset,
        metric=metric,  # the exact-match metric defined above
        num_threads=16,
        display_progress=True,
        # To record the outputs and detailed scores to MLflow
        return_all_scores=True,
        return_outputs=True,
    )

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate_correctness(classify_ft)

    # Log the aggregated score
    mlflow.log_metric("exact_match", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Text": [example.text for example in devset],
            "Expected": [example.example_label for example in devset],
            "Predicted": outputs,
            "Exact match": all_scores,
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit MLflow DSPy Documentation as well.

Not bad, given that we started with no labels for the task. Even without labels, you can use various strategies to boost the quality of the bootstrapped training data.
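
One label-free strategy, sketched below purely for illustration (it is not part of the original tutorial), is to filter traces by teacher self-consistency: resample the teacher a few times at higher temperatures and keep a trace only when the prediction agrees with the majority vote. The temperatures are arbitrary illustrative values; using distinct values also avoids returning identical cached responses.

def self_consistency_filter(example, pred, trace=None):
    # Resample the teacher classifier a few times and collect its labels.
    votes = []
    for temperature in (0.7, 0.8, 0.9):  # arbitrary illustrative values
        resampler = classify.deepcopy()
        resampler.set_lm(dspy.LM("openai/gpt-4o-mini", temperature=temperature, max_tokens=3000))
        votes.append(resampler(text=example.text).label)
    # Keep the trace only if the bootstrapped prediction matches the majority vote.
    majority = max(set(votes), key=votes.count)
    return pred.label == majority

# This filter could then be passed as the metric when bootstrapping without labels, e.g.:
# optimizer = dspy.BootstrapFinetune(num_threads=16, metric=self_consistency_filter)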

To try that next, let's free our GPU memory by killing the finetuned LM.

classify_ft.get_lm().kill()

Bootstrapped finetuning against a metric

If you have labels, you can generally boost this by a large margin. To do so, you can pass a metric to BootstrapFinetune, which it will use for filtering the trajectories over your program before it builds the finetuning data.

optimizer = dspy.BootstrapFinetune(num_threads=16, metric=metric)
classify_ft = optimizer.compile(student_classify, teacher=teacher_classify, trainset=raw_data[:500])

Let's now launch and evaluate this.

classify_ft.get_lm().launch()
evaluate(classify_ft)
Average Metric: 85.00 / 98 (86.7%): 100%|██████████| 100/100 [00:46<00:00,  2.14it/s]
|   | text | example_label | reasoning | pred_label | <lambda> | label |
|---|------|---------------|-----------|------------|----------|-------|
| 0 | Which fiat currencies do you currently support? Will this change i... | fiat_currency_support | The user is inquiring about the fiat currencies currently supporte... | fiat_currency_support | ✔️ [True] | NaN |
| 1 | I didn't receive my money earlier and it says the transaction is s... | pending_cash_withdrawal | The user is inquiring about an unexpected fee on their account, wh... | extra_charge_on_statement |  | NaN |
| 2 | what currencies do you accept? | fiat_currency_support | The user is inquiring about the types of currencies that are accep... | fiat_currency_support | ✔️ [True] | NaN |
| 3 | Where can I find your exchange rates? | exchange_rate | The user is inquiring about where to find exchange rates, which re... | exchange_rate | ✔️ [True] | NaN |
| 4 | why hasnt my card come in yet? | card_arrival | The user is inquiring about the status of their card delivery, whi... | card_arrival | ✔️ [True] | NaN |
... 95 more rows not displayed ...
85.0

That's quite a bit better, given just 500 labels. In fact, it seems to be a lot stronger than what the teacher LM achieves out of the box!

evaluate(teacher_classify)
Average Metric: 55.00 / 100 (55.0%): 100%|██████████| 100/100 [00:11<00:00,  8.88it/s]

2025/01/08 12:38:35 INFO dspy.evaluate.evaluate: Average Metric: 55 / 100 (55.0%)
|   | text | example_label | reasoning | pred_label | <lambda> |
|---|------|---------------|-----------|------------|----------|
| 0 | Which fiat currencies do you currently support? Will this change i... | fiat_currency_support | The user is inquiring about the fiat currencies supported by the s... | fiat_currency_support | ✔️ [True] |
| 1 | I didn't receive my money earlier and it says the transaction is s... | pending_cash_withdrawal | The user is experiencing an issue with a transaction that is still... | pending_transfer |  |
| 2 | what currencies do you accept? | fiat_currency_support | The question is asking about the types of currencies accepted, whi... | fiat_currency_support | ✔️ [True] |
| 3 | Where can I find your exchange rates? | exchange_rate | The user is inquiring about where to find exchange rates, which re... | exchange_rate | ✔️ [True] |
| 4 | why hasnt my card come in yet? | card_arrival | The user is inquiring about the status of their card delivery, whi... | card_delivery_estimate |  |
... 95 more rows not displayed ...
55.0

And thanks to bootstrapping, the model learns to apply our modules to get the right label, in this case, reasoning explicitly:

classify_ft(text="why hasnt my card come in yet?")
dspy.inspect_history()
[2025-01-08T12:39:42.143798]

System message:

Your input fields are:
1. `text` (str)

Your output fields are:
1. `reasoning` (str)
2. `label` (Literal[activate_my_card, age_limit, apple_pay_or_google_pay, atm_support, automatic_top_up, balance_not_updated_after_bank_transfer, balance_not_updated_after_cheque_or_cash_deposit, beneficiary_not_allowed, cancel_transfer, card_about_to_expire, card_acceptance, card_arrival, card_delivery_estimate, card_linking, card_not_working, card_payment_fee_charged, card_payment_not_recognised, card_payment_wrong_exchange_rate, card_swallowed, cash_withdrawal_charge, cash_withdrawal_not_recognised, change_pin, compromised_card, contactless_not_working, country_support, declined_card_payment, declined_cash_withdrawal, declined_transfer, direct_debit_payment_not_recognised, disposable_card_limits, edit_personal_details, exchange_charge, exchange_rate, exchange_via_app, extra_charge_on_statement, failed_transfer, fiat_currency_support, get_disposable_virtual_card, get_physical_card, getting_spare_card, getting_virtual_card, lost_or_stolen_card, lost_or_stolen_phone, order_physical_card, passcode_forgotten, pending_card_payment, pending_cash_withdrawal, pending_top_up, pending_transfer, pin_blocked, receiving_money, Refund_not_showing_up, request_refund, reverted_card_payment?, supported_cards_and_currencies, terminate_account, top_up_by_bank_transfer_charge, top_up_by_card_charge, top_up_by_cash_or_cheque, top_up_failed, top_up_limits, top_up_reverted, topping_up_by_card, transaction_charged_twice, transfer_fee_charged, transfer_into_account, transfer_not_received_by_recipient, transfer_timing, unable_to_verify_identity, verify_my_identity, verify_source_of_funds, verify_top_up, virtual_card_not_working, visa_or_mastercard, why_verify_identity, wrong_amount_of_cash_received, wrong_exchange_rate_for_cash_withdrawal])

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## text ## ]]
{text}

[[ ## reasoning ## ]]
{reasoning}

[[ ## label ## ]]
{label}        # note: the value you produce must be one of: activate_my_card; age_limit; apple_pay_or_google_pay; atm_support; automatic_top_up; balance_not_updated_after_bank_transfer; balance_not_updated_after_cheque_or_cash_deposit; beneficiary_not_allowed; cancel_transfer; card_about_to_expire; card_acceptance; card_arrival; card_delivery_estimate; card_linking; card_not_working; card_payment_fee_charged; card_payment_not_recognised; card_payment_wrong_exchange_rate; card_swallowed; cash_withdrawal_charge; cash_withdrawal_not_recognised; change_pin; compromised_card; contactless_not_working; country_support; declined_card_payment; declined_cash_withdrawal; declined_transfer; direct_debit_payment_not_recognised; disposable_card_limits; edit_personal_details; exchange_charge; exchange_rate; exchange_via_app; extra_charge_on_statement; failed_transfer; fiat_currency_support; get_disposable_virtual_card; get_physical_card; getting_spare_card; getting_virtual_card; lost_or_stolen_card; lost_or_stolen_phone; order_physical_card; passcode_forgotten; pending_card_payment; pending_cash_withdrawal; pending_top_up; pending_transfer; pin_blocked; receiving_money; Refund_not_showing_up; request_refund; reverted_card_payment?; supported_cards_and_currencies; terminate_account; top_up_by_bank_transfer_charge; top_up_by_card_charge; top_up_by_cash_or_cheque; top_up_failed; top_up_limits; top_up_reverted; topping_up_by_card; transaction_charged_twice; transfer_fee_charged; transfer_into_account; transfer_not_received_by_recipient; transfer_timing; unable_to_verify_identity; verify_my_identity; verify_source_of_funds; verify_top_up; virtual_card_not_working; visa_or_mastercard; why_verify_identity; wrong_amount_of_cash_received; wrong_exchange_rate_for_cash_withdrawal

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `text`, produce the fields `label`.


User message:

[[ ## text ## ]]
why hasnt my card come in yet?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## label ## ]]` (must be formatted as a valid Python Literal[activate_my_card, age_limit, apple_pay_or_google_pay, atm_support, automatic_top_up, balance_not_updated_after_bank_transfer, balance_not_updated_after_cheque_or_cash_deposit, beneficiary_not_allowed, cancel_transfer, card_about_to_expire, card_acceptance, card_arrival, card_delivery_estimate, card_linking, card_not_working, card_payment_fee_charged, card_payment_not_recognised, card_payment_wrong_exchange_rate, card_swallowed, cash_withdrawal_charge, cash_withdrawal_not_recognised, change_pin, compromised_card, contactless_not_working, country_support, declined_card_payment, declined_cash_withdrawal, declined_transfer, direct_debit_payment_not_recognised, disposable_card_limits, edit_personal_details, exchange_charge, exchange_rate, exchange_via_app, extra_charge_on_statement, failed_transfer, fiat_currency_support, get_disposable_virtual_card, get_physical_card, getting_spare_card, getting_virtual_card, lost_or_stolen_card, lost_or_stolen_phone, order_physical_card, passcode_forgotten, pending_card_payment, pending_cash_withdrawal, pending_top_up, pending_transfer, pin_blocked, receiving_money, Refund_not_showing_up, request_refund, reverted_card_payment?, supported_cards_and_currencies, terminate_account, top_up_by_bank_transfer_charge, top_up_by_card_charge, top_up_by_cash_or_cheque, top_up_failed, top_up_limits, top_up_reverted, topping_up_by_card, transaction_charged_twice, transfer_fee_charged, transfer_into_account, transfer_not_received_by_recipient, transfer_timing, unable_to_verify_identity, verify_my_identity, verify_source_of_funds, verify_top_up, virtual_card_not_working, visa_or_mastercard, why_verify_identity, wrong_amount_of_cash_received, wrong_exchange_rate_for_cash_withdrawal]), and then ending with the marker for `[[ ## completed ## ]]`.


�[31mResponse:�[0m

�[32m[[ ## reasoning ## ]]
The user is inquiring about the status of their card delivery, which suggests they are concerned about when they will receive their card. This aligns with the topic of card arrival and delivery estimates.

[[ ## label ## ]]
card_arrival

[[ ## completed ## ]]�[0m
Saving fine-tuned programs in MLflow Experiment

To deploy the fine-tuned program in production or share it with your team, you can save it in MLflow Experiment. Compared to simply saving it to a local file, MLflow offers the following benefits:

  1. Dependency Management: MLflow automatically saves the frozen environment metadata along with the program to ensure reproducibility.
  2. Experiment Tracking: With MLflow, you can track the program's performance and cost along with the program itself.
  3. Collaboration: You can share the program and results with your team members by sharing the MLflow experiment.

To save the program in MLflow, run the following code:

import mlflow

# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized_classifier"):
    model_info = mlflow.dspy.log_model(
        classify_ft,
        artifact_path="model", # Any name to save the program in MLflow
    )

# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)

To learn more about the integration, visit MLflow DSPy Documentation as well.


Tutorial: Entity Extraction

This tutorial demonstrates how to perform entity extraction using the CoNLL-2003 dataset with DSPy. The focus is on extracting entities referring to people. We will:

  • Extract and label entities from the CoNLL-2003 dataset that refer to people
  • Define a DSPy program for extracting entities that refer to people
  • Optimize and evaluate the program on a subset of the CoNLL-2003 dataset

By the end of this tutorial, you'll understand how to structure tasks in DSPy using signatures and modules, evaluate your system's performance, and improve its quality with optimizers.

Install the latest version of DSPy and follow along. If you're looking instead for a conceptual overview of DSPy, this recent lecture is a good place to start.

# Install the latest version of DSPy
%pip install -U dspy-ai
# Install the Hugging Face datasets library to load the CoNLL-2003 dataset
%pip install datasets
Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to understand DSPy's behavior better. You can set up MLflow easily by following the four steps below.

MLflow Trace

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start the MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

To learn more about the integration, visit MLflow DSPy Documentation as well.

Load and Prepare the Dataset

In this section, we prepare the CoNLL-2003 dataset, which is commonly used for entity extraction tasks. The dataset includes tokens annotated with entity labels such as persons, organizations, and locations.

We will:

  1. Load the dataset using the Hugging Face datasets library.
  2. Define a function to extract tokens referring to people.
  3. Slice the dataset to create smaller subsets for training and testing.

DSPy expects examples in a structured format, so we'll also transform the dataset into DSPy Examples for easy integration.

import os
import tempfile
from datasets import load_dataset
from typing import Dict, Any, List
import dspy

def load_conll_dataset() -> dict:
    """
    Loads the CoNLL-2003 dataset into train, validation, and test splits.
    
    Returns:
        dict: Dataset splits with keys 'train', 'validation', and 'test'.
    """
    with tempfile.TemporaryDirectory() as temp_dir:
        # Use a temporary Hugging Face cache directory for compatibility with certain hosted notebook
        # environments that don't support the default Hugging Face cache directory
        os.environ["HF_DATASETS_CACHE"] = temp_dir
        return load_dataset("conll2003", trust_remote_code=True)

def extract_people_entities(data_row: Dict[str, Any]) -> List[str]:
    """
    Extracts entities referring to people from a row of the CoNLL-2003 dataset.
    
    Args:
        data_row (Dict[str, Any]): A row from the dataset containing tokens and NER tags.
    
    Returns:
        List[str]: List of tokens tagged as people.
    """
    return [
        token
        for token, ner_tag in zip(data_row["tokens"], data_row["ner_tags"])
        if ner_tag in (1, 2)  # CoNLL entity codes 1 and 2 refer to people
    ]

def prepare_dataset(data_split, start: int, end: int) -> List[dspy.Example]:
    """
    Prepares a sliced dataset split for use with DSPy.
    
    Args:
        data_split: The dataset split (e.g., train or test).
        start (int): Starting index of the slice.
        end (int): Ending index of the slice.
    
    Returns:
        List[dspy.Example]: List of DSPy Examples with tokens and expected labels.
    """
    return [
        dspy.Example(
            tokens=row["tokens"],
            expected_extracted_people=extract_people_entities(row)
        ).with_inputs("tokens")
        for row in data_split.select(range(start, end))
    ]

# Load the dataset
dataset = load_conll_dataset()

# Prepare the training and test sets
train_set = prepare_dataset(dataset["train"], 0, 50)
test_set = prepare_dataset(dataset["test"], 0, 200)

Configure DSPy and create an Entity Extraction Program

Here, we define a DSPy program for extracting entities referring to people from tokenized text.

Then, we configure DSPy to use a particular language model (gpt-4o-mini) for all invocations of the program.

Key DSPy Concepts Introduced:

  • Signatures: Define structured input/output schemas for your program.
  • Modules: Encapsulate program logic in reusable, composable units.

Specifically, we'll:

  • Create a PeopleExtraction DSPy Signature to specify the input (tokens) and output (extracted_people) fields.
  • Define a people_extractor program that uses DSPy's built-in dspy.ChainOfThought module to implement the PeopleExtraction signature. The program extracts entities referring to people from a list of input tokens using language model (LM) prompting.
  • Use the dspy.LM class and dspy.settings.configure() method to configure the language model that DSPy will use when invoking the program.
from typing import List

class PeopleExtraction(dspy.Signature):
    """
    Extract contiguous tokens referring to specific people, if any, from a list of string tokens.
    Output a list of tokens. In other words, do not combine multiple tokens into a single value.
    """
    tokens: list[str] = dspy.InputField(desc="tokenized text")
    extracted_people: list[str] = dspy.OutputField(desc="all tokens referring to specific people extracted from the tokenized text")

people_extractor = dspy.ChainOfThought(PeopleExtraction)

Here, we tell DSPy to use OpenAI's gpt-4o-mini model in our program. To authenticate, DSPy reads your OPENAI_API_KEY environment variable. You can easily swap this out for other providers or local models (see the sketch below).

lm = dspy.LM(model="openai/gpt-4o-mini")
dspy.settings.configure(lm=lm)
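
The same two lines work for other providers or locally served models. Below is a minimal sketch, not part of the original tutorial, assuming you have an Ollama server running locally with a llama3.2 model pulled; the model name, port, and empty API key are assumptions about your local setup:

# Hypothetical alternative: use a local model served by Ollama instead of OpenAI.
# "ollama_chat/llama3.2" and the api_base below are assumptions about your local setup.
local_lm = dspy.LM(model="ollama_chat/llama3.2", api_base="http://localhost:11434", api_key="")
dspy.settings.configure(lm=local_lm)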

Define Metric and Evaluation Functions

In DSPy, evaluating a program's performance is critical for iterative development. A good evaluation framework allows us to:

  • Measure the quality of our program's outputs.
  • Compare outputs against ground-truth labels.
  • Identify areas for improvement.

What We'll Do:

  • Define a custom metric (extraction_correctness_metric) to evaluate whether the extracted entities match the ground truth.
  • Create an evaluation function (evaluate_correctness) to apply this metric to a training or test dataset and compute the overall accuracy.

The evaluation function uses DSPy's Evaluate utility to handle parallelism and visualization of results.

def extraction_correctness_metric(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> bool:
    """
    Computes correctness of entity extraction predictions.
    
    Args:
        example (dspy.Example): The dataset example containing expected people entities.
        prediction (dspy.Prediction): The prediction from the DSPy people extraction program.
        trace: Optional trace object for debugging.
    
    Returns:
        bool: True if predictions match expectations, False otherwise.
    """
    return prediction.extracted_people == example.expected_extracted_people

evaluate_correctness = dspy.Evaluate(
    devset=test_set,
    metric=extraction_correctness_metric,
    num_threads=24,
    display_progress=True,
    display_table=True
)
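
Note that extraction_correctness_metric is a strict exact match over lists, so token order and tokenization must line up with the labels. A quick illustration of that behavior, using toy values rather than dataset rows:

# Toy illustration (not from the dataset) of the strict list-equality check in the metric.
ex = dspy.Example(expected_extracted_people=["Nadim", "Ladki"])
pred_ok = dspy.Prediction(extracted_people=["Nadim", "Ladki"])
pred_combined = dspy.Prediction(extracted_people=["Nadim Ladki"])

print(extraction_correctness_metric(ex, pred_ok))        # True
print(extraction_correctness_metric(ex, pred_combined))  # False: tokens were combined into one value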

Evaluate Initial Extractor

Before optimizing our program, we need a baseline evaluation to understand its current performance. This helps us:

  • Establish a reference point for comparison after optimization.
  • Identify potential weaknesses in the initial implementation.

In this step, we'll run our people_extractor program on the test set and measure its accuracy using the evaluation framework defined earlier.

evaluate_correctness(people_extractor, devset=test_set)
Average Metric: 172.00 / 200 (86.0%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:16<00:00, 11.94it/s]

2024/11/18 21:08:04 INFO dspy.evaluate.evaluate: Average Metric: 172 / 200 (86.0%)
tokens expected_extracted_people rationale extracted_people extraction_correctness_metric
0 [SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, IN, SURPRISE, DEFEAT... [CHINA] We extracted "JAPAN" and "CHINA" as they refer to specific countri... [JAPAN, CHINA]
1 [Nadim, Ladki] [Nadim, Ladki] We extracted the tokens "Nadim" and "Ladki" as they refer to speci... [Nadim, Ladki] ✔️ [True]
2 [AL-AIN, ,, United, Arab, Emirates, 1996-12-06] [] There are no tokens referring to specific people in the provided l... [] ✔️ [True]
3 [Japan, began, the, defence, of, their, Asian, Cup, title, with, a... [] We did not find any tokens referring to specific people in the pro... [] ✔️ [True]
4 [But, China, saw, their, luck, desert, them, in, the, second, matc... [] The extracted tokens referring to specific people are "China" and ... [China, Uzbekistan]
... ... ... ... ... ...
195 ['The', 'Wallabies', 'have', 'their', 'sights', 'set', 'on', 'a', ... [David, Campese] The extracted_people includes "David Campese" as it refers to a sp... [David, Campese] ✔️ [True]
196 ['The', 'Wallabies', 'currently', 'have', 'no', 'plans', 'to', 'ma... [] The extracted_people includes "Wallabies" as it refers to a specif... [] ✔️ [True]
197 ['Campese', 'will', 'be', 'up', 'against', 'a', 'familiar', 'foe',... [Campese, Rob, Andrew] The extracted tokens refer to specific people mentioned in the tex... [Campese, Rob, Andrew] ✔️ [True]
198 ['"', 'Campo', 'has', 'a', 'massive', 'following', 'in', 'this', '... [Campo, Andrew] The extracted tokens referring to specific people include "Campo" ... [Campo, Andrew] ✔️ [True]
199 ['On', 'tour', ',', 'Australia', 'have', 'won', 'all', 'four', 'te... [] We extracted the names of specific people from the tokenized text.... [] ✔️ [True]

200 rows × 5 columns

86.0
Tracking Evaluation Results in MLflow Experiment

To track and visualize the evaluation results over time, you can record the results in MLflow Experiment.

import mlflow

with mlflow.start_run(run_name="extractor_evaluation"):
    evaluate_correctness = dspy.Evaluate(
        devset=test_set,
        metric=extraction_correctness_metric,
        num_threads=24,
        display_progress=True,
        # To record the outputs and detailed scores to MLflow
        return_all_scores=True,
        return_outputs=True,
    )

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate_correctness(people_extractor)

    # Log the aggregated score
    mlflow.log_metric("exact_match", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Tokens": [example.tokens for example in test_set],
            "Expected": [example.expected_extracted_people for example in test_set],
            "Predicted": outputs,
            "Exact match": all_scores,
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit MLflow DSPy Documentation as well.

Optimize the Model

DSPy includes powerful optimizers that can improve the quality of your system.

Here, we use DSPy's MIPROv2 optimizer to:

  • Automatically tune the program's language model (LM) prompt by 1. using the LM to adjust the prompt's instructions and 2. building few-shot examples from the training dataset that are augmented with reasoning generated from dspy.ChainOfThought.
  • Maximize correctness on the training set.

This optimization process is automated, saving time and effort while improving accuracy.

mipro_optimizer = dspy.MIPROv2(
    metric=extraction_correctness_metric,
    auto="medium",
)
optimized_people_extractor = mipro_optimizer.compile(
    people_extractor,
    trainset=train_set,
    max_bootstrapped_demos=4,
    requires_permission_to_run=False,
    minibatch=False
)

Evaluate Optimized Program

After optimization, we re-evaluate the program on the test set to measure improvements. Comparing the optimized and initial results allows us to:

  • Quantify the benefits of optimization.
  • Validate that the program generalizes well to unseen data.

In this case, we see that the accuracy of the program on the test dataset has improved significantly.

evaluate_correctness(optimized_people_extractor, devset=test_set)
Average Metric: 186.00 / 200 (93.0%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:23<00:00,  8.58it/s]

2024/11/18 21:15:00 INFO dspy.evaluate.evaluate: Average Metric: 186 / 200 (93.0%)
tokens expected_extracted_people rationale extracted_people extraction_correctness_metric
0 [SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, IN, SURPRISE, DEFEAT... [CHINA] There are no specific people mentioned in the provided tokens. The... []
1 [Nadim, Ladki] [Nadim, Ladki] The tokens "Nadim Ladki" refer to a specific individual. Both toke... [Nadim, Ladki] ✔️ [True]
2 [AL-AIN, ,, United, Arab, Emirates, 1996-12-06] [] There are no tokens referring to specific people in the provided l... [] ✔️ [True]
3 [Japan, began, the, defence, of, their, Asian, Cup, title, with, a... [] There are no specific people mentioned in the provided tokens. The... [] ✔️ [True]
4 [But, China, saw, their, luck, desert, them, in, the, second, matc... [] There are no tokens referring to specific people in the provided l... [] ✔️ [True]
... ... ... ... ... ...
195 ['The', 'Wallabies', 'have', 'their', 'sights', 'set', 'on', 'a', ... [David, Campese] The extracted tokens refer to a specific person mentioned in the t... [David, Campese] ✔️ [True]
196 ['The', 'Wallabies', 'currently', 'have', 'no', 'plans', 'to', 'ma... [] There are no specific individuals mentioned in the provided tokens... [] ✔️ [True]
197 ['Campese', 'will', 'be', 'up', 'against', 'a', 'familiar', 'foe',... [Campese, Rob, Andrew] The tokens include the names "Campese" and "Rob Andrew," both of w... [Campese, Rob, Andrew] ✔️ [True]
198 ['"', 'Campo', 'has', 'a', 'massive', 'following', 'in', 'this', '... [Campo, Andrew] The extracted tokens refer to specific people mentioned in the tex... [Campo, Andrew] ✔️ [True]
199 ['On', 'tour', ',', 'Australia', 'have', 'won', 'all', 'four', 'te... [] There are no specific people mentioned in the provided tokens. The... [] ✔️ [True]

200 rows × 5 columns

93.0

Inspect Optimized Program's Prompt

After optimizing the program, we can inspect the history of interactions to see how DSPy has augmented the program's prompt with few-shot examples. This step demonstrates:

  • The structure of the prompt used by the program.
  • How few-shot examples are added to guide the model's behavior.

Use inspect_history(n=1) to view the last interaction and analyze the generated prompt.

dspy.inspect_history(n=1)
�[34m[2024-11-18T21:15:00.584497]�[0m

�[31mSystem message:�[0m

Your input fields are:
1. `tokens` (list[str]): tokenized text

Your output fields are:
1. `rationale` (str): ${produce the extracted_people}. We ...
2. `extracted_people` (list[str]): all tokens referring to specific people extracted from the tokenized text

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## tokens ## ]]
{tokens}

[[ ## rationale ## ]]
{rationale}

[[ ## extracted_people ## ]]
{extracted_people}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        In a high-stakes situation where accurate identification of individuals is critical for regulatory compliance and public health communication, extract contiguous tokens referring to specific people from the provided list of string tokens. Ensure that you output each identified individual as separate tokens without combining multiple tokens into a single value. This task is essential for ensuring clarity and accountability in communications pertaining to EU regulations and health matters.


�[31mUser message:�[0m

[[ ## tokens ## ]]
["He", "said", "further", "scientific", "study", "was", "required", "and", "if", "it", "was", "found", "that", "action", "was", "needed", "it", "should", "be", "taken", "by", "the", "European", "Union", "."]

Respond with the corresponding output fields, starting with the field `[[ ## rationale ## ]]`, then `[[ ## extracted_people ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


�[31mAssistant message:�[0m

[[ ## rationale ## ]]
There are no specific people mentioned in the provided tokens. The only reference is to the "European Union," which is an organization rather than a specific person. We therefore conclude that there are no extracted people.

[[ ## extracted_people ## ]]
[]

[[ ## completed ## ]]


�[31mUser message:�[0m

[[ ## tokens ## ]]
["BRUSSELS", "1996-08-22"]

Respond with the corresponding output fields, starting with the field `[[ ## rationale ## ]]`, then `[[ ## extracted_people ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


�[31mAssistant message:�[0m

[[ ## rationale ## ]]
There are no tokens referring to specific people in the provided list. We only have a location and a date.

[[ ## extracted_people ## ]]
[]

[[ ## completed ## ]]


�[31mUser message:�[0m

[[ ## tokens ## ]]
["Fischler", "proposed", "EU-wide", "measures", "after", "reports", "from", "Britain", "and", "France", "that", "under", "laboratory", "conditions", "sheep", "could", "contract", "Bovine", "Spongiform", "Encephalopathy", "(", "BSE", ")", "--", "mad", "cow", "disease", "."]

Respond with the corresponding output fields, starting with the field `[[ ## rationale ## ]]`, then `[[ ## extracted_people ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


�[31mAssistant message:�[0m

[[ ## rationale ## ]]
The extracted tokens refer to specific people mentioned in the text. In this case, "Fischler" is the only token that refers to a specific person. We extract this token as it is contiguous and represents an individual involved in the proposed measures.

[[ ## extracted_people ## ]]
["Fischler"]

[[ ## completed ## ]]


�[31mUser message:�[0m

[[ ## tokens ## ]]
["Germany", "'s", "representative", "to", "the", "European", "Union", "'s", "veterinary", "committee", "Werner", "Zwingmann", "said", "on", "Wednesday", "consumers", "should", "buy", "sheepmeat", "from", "countries", "other", "than", "Britain", "until", "the", "scientific", "advice", "was", "clearer", "."]

Respond with the corresponding output fields, starting with the field `[[ ## rationale ## ]]`, then `[[ ## extracted_people ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


�[31mAssistant message:�[0m

[[ ## rationale ## ]]
We extracted the name "Werner Zwingmann" from the tokenized text as it refers to a specific person mentioned in the context of a statement regarding sheepmeat consumption.

[[ ## extracted_people ## ]]
["Werner", "Zwingmann"]

[[ ## completed ## ]]


�[31mUser message:�[0m

[[ ## tokens ## ]]
["LONDON", "1996-12-06"]

Respond with the corresponding output fields, starting with the field `[[ ## rationale ## ]]`, then `[[ ## extracted_people ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


�[31mResponse:�[0m

�[32m[[ ## rationale ## ]]
There are no tokens referring to specific people in the provided list. The tokens only include a location and a date.

[[ ## extracted_people ## ]]
[]

[[ ## completed ## ]]�[0m

Keeping an eye on cost

DSPy allows you to track the cost of your programs. The following code demonstrates how to obtain the cost of all LM calls made by the DSPy extractor program so far.

cost = sum([x['cost'] for x in lm.history if x['cost'] is not None])  # cost in USD, as calculated by LiteLLM for certain providers
cost
0.26362742999999983
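
The lm.history list also tells you how many LM calls have been made so far; a small sketch building on the snippet above:

# Count the LM calls recorded on this LM instance alongside the total cost.
num_calls = len(lm.history)
print(f"{num_calls} LM calls, total cost (USD): {cost}")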

Saving and Loading Optimized Programs

DSPy supports saving and loading programs, enabling you to reuse optimized systems without the need to re-optimize from scratch. This feature is especially useful for deploying your programs in production environments or sharing them with collaborators.

In this step, we'll save the optimized program to a file and demonstrate how to load it back for future use.

optimized_people_extractor.save("optimized_extractor.json")

loaded_people_extractor = dspy.ChainOfThought(PeopleExtraction)
loaded_people_extractor.load("optimized_extractor.json")

loaded_people_extractor(tokens=["Italy", "recalled", "Marcello", "Cuttitta"]).extracted_people
['Marcello', 'Cuttitta']
Saving programs in MLflow Experiment

Instead of saving the program to a local file, you can track it in MLflow for better reproducibility and collaboration.

  1. Dependency Management: MLflow automatically saves the frozen environment metadata along with the program to ensure reproducibility.
  2. Experiment Tracking: With MLflow, you can track the program's performance and cost along with the program itself.
  3. Collaboration: You can share the program and results with your team members by sharing the MLflow experiment.

To save the program in MLflow, run the following code:

import mlflow

# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized_extractor"):
    model_info = mlflow.dspy.log_model(
        optimized_people_extractor,
        artifact_path="model", # Any name to save the program in MLflow
    )

# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)

To learn more about the integration, visit MLflow DSPy Documentation as well.

Conclusion

In this tutorial, we demonstrated how to:

  • Use DSPy to build a modular, interpretable system for entity extraction.
  • Evaluate and optimize the system using DSPy's built-in tools.

By leveraging structured inputs and outputs, we ensured that the system was easy to understand and improve. The optimization process allowed us to quickly improve performance without manually crafting prompts or tweaking parameters.

Next Steps:

  • Experiment with extraction of other entity types (e.g., locations or organizations); see the sketch after this list.
  • Explore DSPy's other built-in modules like ReAct for more complex reasoning tasks.
  • Use the system in larger workflows, such as large scale document processing or summarization.
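
For example, adapting the program to organizations mostly amounts to writing a new signature and picking the right NER tags. Here is a minimal, untested sketch; it assumes the standard CoNLL-2003 tag convention in which tags 3 (B-ORG) and 4 (I-ORG) mark organization tokens:

class OrganizationExtraction(dspy.Signature):
    """
    Extract contiguous tokens referring to specific organizations, if any, from a list of string tokens.
    Output a list of tokens. In other words, do not combine multiple tokens into a single value.
    """
    tokens: list[str] = dspy.InputField(desc="tokenized text")
    extracted_organizations: list[str] = dspy.OutputField(desc="all tokens referring to specific organizations extracted from the tokenized text")

org_extractor = dspy.ChainOfThought(OrganizationExtraction)

def extract_org_entities(data_row):
    # Assumption: CoNLL-2003 NER tags 3 (B-ORG) and 4 (I-ORG) mark organization tokens.
    return [
        token
        for token, ner_tag in zip(data_row["tokens"], data_row["ner_tags"])
        if ner_tag in (3, 4)
    ]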

Tutorial: Fine-tuning Agents

Let's walk through a quick example of optimizing the language model weights (i.e., fine-tuning) inside a DSPy module that represents a ReAct agent playing a game with 50-step tasks.

Install dependencies and download data

Install the latest DSPy via pip install -U --pre dspy and follow along. This tutorial uses the AlfWorld dataset, which depends on DSPy 2.6.0 (pre-release).

You will also need the following dependencies:

> pip install -U alfworld==0.3.5 multiprocess
> alfworld-download
Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to understand DSPy's behavior better. You can set up MLflow easily by following the four steps below.

MLflow Trace

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start the MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

To learn more about the integration, visit MLflow DSPy Documentation as well.

Set up the language models

Our goal is to allow gpt-4o-mini to play the AlfWorld household game proficiently, without tinkering with string prompts or example trajectories by hand.

Though it's not strictly necessary, we'll make our job a little easier by using the larger gpt-4o for prompt optimization and fine-tuning, building our small gpt-4o-mini agent.

import dspy

gpt4o_mini = dspy.LM('gpt-4o-mini-2024-07-18')
gpt4o = dspy.LM('openai/gpt-4o')
dspy.configure(experimental=True)

Let's load 200 training and 200 development tasks from AlfWorld. The dataset is much larger, but a small number of examples will help keep this tutorial's runtime to 1-2 hours, including fine-tuning.

With just 100 training tasks, we'll teach 4o-mini to go from 19% (can barely play the game) to 72%. If you use 500 tasks and retain the demonstrations during fine-tuning, you can push that easily to 82%.

from dspy.datasets.alfworld import AlfWorld

alfworld = AlfWorld()
trainset, devset = alfworld.trainset[:200], alfworld.devset[-200:]
len(trainset), len(devset)
(200, 200)

Before we proceed, let's view an example of this task.

example = trainset[0]

with alfworld.POOL.session() as env:
    task, info = env.init(**example.inputs())

print(task)
-= Welcome to TextWorld, ALFRED! =-

You are in the middle of a room. Looking quickly around you, you see a countertop 1, a drawer 8, a drawer 7, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a handtowelholder 1, a sinkbasin 2, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.

Your task is to: put a clean soapbar in garbagecan.

Defining the Agent program

The agent is a pretty simple dspy.Module with one sub-module called self.react.

This sub-module consumes a definition of a specific task, sees its previous trajectory, and sees a list of possible_actions it can take. It responds simply with the next action.

In the forward method, we just initialize an environment for the given task idx. And we loop up to self.max_iters, repeatedly invoking the self.react module to take the next action.

class Agent(dspy.Module):
    def __init__(self, max_iters=50, verbose=False):
        self.max_iters = max_iters
        self.verbose = verbose
        self.react = dspy.Predict("task, trajectory, possible_actions: list[str] -> action")

    def forward(self, idx):
        with alfworld.POOL.session() as env:
            trajectory = []
            task, info = env.init(idx)
            if self.verbose:
                print(f"Task: {task}")

            for _ in range(self.max_iters):
                trajectory_ = "\n".join(trajectory)
                possible_actions = info["admissible_commands"][0] + ["think: ${...thoughts...}"]
                prediction = self.react(task=task, trajectory=trajectory_, possible_actions=possible_actions)
                trajectory.append(f"> {prediction.action}")

                if prediction.action.startswith("think:"):
                    trajectory.append("OK.")
                    continue

                obs, reward, done, info = env.step(prediction.action)
                obs, reward, done = obs[0], reward[0], done[0]
                trajectory.append(obs)

                if self.verbose:
                    print("\n".join(trajectory[-2:]))

                if done:
                    break

        assert reward == int(info["won"][0]), (reward, info["won"][0])
        return dspy.Prediction(trajecotry=trajectory, success=reward)

Aside: If you wanted to include instructions for your agent...

Above, we opted to keep the agent super simple, without even providing short instructions that describe the task.

In principle, you can copy a short definition of the AlfWorld task (based on Yao et al., 2022) and use that as the instruction for your agent. This is not inherently essential, but it helps illustrate the role that instructions play in DSPy: they're not for coercing the model to exhibit a certain behavior, but they're there to describe the fundamentals of the task in a straightforward, human-readable way.

If you want to do that, you can simply replace this:

self.react = dspy.Predict("task, trajectory, possible_actions: list[str] -> action")

with this:

INSTRUCTIONS = """
Interact with a simulated household to achieve a high-level goal. Make sure to plan, track subgoals,
determine likely locations for common household items (e.g. desklamps will likely be on desks, shelfs, or dressers),
and explore systematically (e.g. check all desks one by one for desklamp).
""".strip()

self.react = dspy.Predict(dspy.Signature("task, trajectory, possible_actions: list[str] -> action", INSTRUCTIONS))

Zero-shot evaluation

Now, let's try this simple program, prior to any optimization work.

agent_4o = Agent()
agent_4o.set_lm(gpt4o)
agent_4o.verbose = True

agent_4o(**example.inputs())
Task: -= Welcome to TextWorld, ALFRED! =-

You are in the middle of a room. Looking quickly around you, you see a countertop 1, a drawer 8, a drawer 7, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a handtowelholder 1, a sinkbasin 2, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.

Your task is to: put a clean soapbar in garbagecan.
> go to countertop 1
You arrive at countertop 1. On the countertop 1, you see a candle 1, a soapbar 1, a soapbottle 2, a soapbottle 1, and a spraybottle 1.
> take soapbar 1 from countertop 1
You pick up the soapbar 1 from the countertop 1.
> go to garbagecan 1
You arrive at garbagecan 1. On the garbagecan 1, you see nothing.
> move soapbar 1 to garbagecan 1
You move the soapbar 1 to the garbagecan 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> take soapbar 1 from garbagecan 1
You pick up the soapbar 1 from the garbagecan 1.
> move soapbar 1 to garbagecan 1
You move the soapbar 1 to the garbagecan 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> look
You are facing the garbagecan 1. Next to it, you see nothing.
> examine garbagecan 1
On the garbagecan 1, you see a soapbar 1.
> look
You are facing the garbagecan 1. Next to it, you see nothing.





Prediction(
    trajecotry=['> go to countertop 1', 'You arrive at countertop 1. On the countertop 1, you see a candle 1, a soapbar 1, a soapbottle 2, a soapbottle 1, and a spraybottle 1.', '> take soapbar 1 from countertop 1', 'You pick up the soapbar 1 from the countertop 1.', '> go to garbagecan 1', 'You arrive at garbagecan 1. On the garbagecan 1, you see nothing.', '> move soapbar 1 to garbagecan 1', 'You move the soapbar 1 to the garbagecan 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> take soapbar 1 from garbagecan 1', 'You pick up the soapbar 1 from the garbagecan 1.', '> move soapbar 1 to garbagecan 1', 'You move the soapbar 1 to the garbagecan 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. 
Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.', '> examine garbagecan 1', 'On the garbagecan 1, you see a soapbar 1.', '> look', 'You are facing the garbagecan 1. Next to it, you see nothing.'],
    success=0
)

Okay, in this case it couldn't solve this example! Now, let's check the average quality of 4o and 4o-mini.

metric = lambda x, y, trace=None: y.success
evaluate = dspy.Evaluate(devset=devset, metric=metric, display_progress=True, num_threads=16)
Tracking Evaluation Results in MLflow Experiment

To track and visualize the evaluation results over time, you can record the results in MLflow Experiment.

import mlflow

with mlflow.start_run(run_name="agent_evaluation"):
    evaluate = dspy.Evaluate(
        devset=devset,
        metric=metric,
        num_threads=16,
        display_progress=True,
        # To record the outputs and detailed scores to MLflow
        return_all_scores=True,
        return_outputs=True,
    )

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate(agent_4o)

    # Log the aggregated score
    mlflow.log_metric("success_rate", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Idx": [example.idx for example in eval_set],
            "Result": outputs,
            "Success": all_scores,
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit MLflow DSPy Documentation as well.

agent_4o.verbose = False
evaluate(agent_4o)
Average Metric: 115.00 / 200 (57.5%): 100%|██████████| 200/200 [06:14<00:00,  1.87s/it]

2024/12/28 11:10:25 INFO dspy.evaluate.evaluate: Average Metric: 115 / 200 (57.5%)








57.5
agent_4o_mini = Agent()
agent_4o_mini.set_lm(gpt4o_mini)

evaluate(agent_4o_mini)
Average Metric: 30.00 / 200 (15.0%): 100%|██████████| 200/200 [08:33<00:00,  2.57s/it]

2024/12/28 11:18:59 INFO dspy.evaluate.evaluate: Average Metric: 30 / 200 (15.0%)








15.0

Out of the box, on this task, 4o is decent (58% success rate) while 4o-mini struggles (15% success rate).

Let's apply the following strategy:

  1. We'll optimize the prompts for gpt-4o in a lightweight way.
  2. We'll then use this prompt-optimized agent as a teacher to fine-tune gpt-4o-mini on the task. This will increase its quality from 19% to 72% (or 82% if you use 500 trainset examples).

Prompt-optimizing GPT-4o

optimizer = dspy.MIPROv2(metric=metric, auto="light", num_threads=16, prompt_model=gpt4o)

config = dict(max_bootstrapped_demos=1, max_labeled_demos=0, minibatch_size=40)
optimized_4o = optimizer.compile(agent_4o, trainset=trainset, **config, requires_permission_to_run=False)

Fine-tuning GPT-4o-mini

For fine-tuning, we'll need a teacher program (optimized_4o above) and a student program derived from it (student_4o_mini below).

student_4o_mini = optimized_4o.deepcopy()
student_4o_mini.set_lm(gpt4o_mini)
# student_4o_mini.react.demos = []  # you can optionally reset the demos
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=16)
finetuned_4o_mini = optimizer.compile(student_4o_mini, teacher=optimized_4o, trainset=trainset)

Evaluate the finetuned GPT-4o-mini agent

evaluate(finetuned_4o_mini)
Average Metric: 143.00 / 200 (71.5%): 100%|██████████| 200/200 [03:15<00:00,  1.05it/s]

Having done all this optimization, let's save our program so we can use it later! This will keep a reference to the fine-tuned model as well, as long as it continues to exist under the same identifier on the provider's side.

finetuned_4o_mini.save('finetuned_4o_mini_001.pkl')
Saving programs in MLflow Experiment

Instead of saving the program to a local file, you can track it in MLflow for better reproducibility and collaboration.

  1. Dependency Management: MLflow automatically saves the frozen environment metadata along with the program to ensure reproducibility.
  2. Experiment Tracking: With MLflow, you can track the program's performance and cost along with the program itself.
  3. Collaboration: You can share the program and results with your team members by sharing the MLflow experiment.

To save the program in MLflow, run the following code:

import mlflow

# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized"):
    model_info = mlflow.dspy.log_model(
        finetuned_4o_mini,
        artifact_path="model", # Any name to save the program in MLflow
    )

# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)

To learn more about the integration, visit MLflow DSPy Documentation as well.

Let's now check out one task using our fine-tuned agent program!

finetuned_4o_mini.verbose = True
finetuned_4o_mini(**devset[0].inputs())
Task: -= Welcome to TextWorld, ALFRED! =-

You are in the middle of a room. Looking quickly around you, you see a armchair 1, a cabinet 1, a drawer 21, a drawer 20, a drawer 19, a drawer 18, a drawer 17, a drawer 16, a drawer 15, a drawer 14, a drawer 13, a drawer 12, a drawer 11, a drawer 10, a drawer 9, a drawer 8, a drawer 7, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a sidetable 5, a sidetable 4, a sidetable 3, a sidetable 2, a sidetable 1, and a sofa 1.

Your task is to: put some box on dresser.
> go to cabinet 1
You arrive at cabinet 1. On the cabinet 1, you see nothing.
> go to dresser 1
You arrive at dresser 1. On the dresser 1, you see a book 1, a newspaper 1, a remotecontrol 1, a statue 3, and a television 1.
> look
You are facing the dresser 1. Next to it, you see nothing.
> go to sidetable 1
You arrive at sidetable 1. On the sidetable 1, you see a cellphone 1, and a desklamp 1.
> go to sidetable 2
You arrive at sidetable 2. On the sidetable 2, you see a box 2.
> take box 2 from sidetable 2
You pick up the box 2 from the sidetable 2.
> go to dresser 1
You arrive at dresser 1. On the dresser 1, you see a book 1, a newspaper 1, a remotecontrol 1, a statue 3, and a television 1.
> move box 2 to dresser 1
You move the box 2 to the dresser 1.





Prediction(
    trajecotry=['> go to cabinet 1', 'You arrive at cabinet 1. On the cabinet 1, you see nothing.', '> go to dresser 1', 'You arrive at dresser 1. On the dresser 1, you see a book 1, a newspaper 1, a remotecontrol 1, a statue 3, and a television 1.', '> look', 'You are facing the dresser 1. Next to it, you see nothing.', '> go to sidetable 1', 'You arrive at sidetable 1. On the sidetable 1, you see a cellphone 1, and a desklamp 1.', '> go to sidetable 2', 'You arrive at sidetable 2. On the sidetable 2, you see a box 2.', '> take box 2 from sidetable 2', 'You pick up the box 2 from the sidetable 2.', '> go to dresser 1', 'You arrive at dresser 1. On the dresser 1, you see a book 1, a newspaper 1, a remotecontrol 1, a statue 3, and a television 1.', '> move box 2 to dresser 1', 'You move the box 2 to the dresser 1.'],
    success=1
)

If you want to load and use the agent program, you can do that as follows.

loaded = Agent()
loaded.load('finetuned_4o_mini_001.pkl')
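Once loaded, the agent is called exactly as before. A minimal usage sketch, reusing the dev example from earlier (this assumes the fine-tuned model is still available under the same identifier at the provider):

result = loaded(**devset[0].inputs())
print(result.success)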

Image Generation Prompt iteration

This is based on a tweet from @ThorondorLLC.

Tweet is here

This will take an initial desired prompt, and iteratively refine it until the image generated matches the desired prompt.

This is not DSPy prompt optimization as it is normally used, but it is a good example of how to use multimodal DSPy.

A future upgrade would be to create a dataset of (initial, final) prompt pairs to optimize the prompt generation.
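
As a rough illustration of what such a dataset could look like, here is a minimal sketch; the single pair below simply reuses the initial prompt and the final refined prompt produced later in this walkthrough, and the field names are arbitrary:

import dspy

# Hypothetical trainset of (initial prompt -> refined prompt) pairs that a DSPy optimizer could later use.
refinement_trainset = [
    dspy.Example(
        desired_prompt="A scene that's both peaceful and tense",
        final_prompt="A serene autumn scene with fog and shadows, capturing both peace and tension.",
    ).with_inputs("desired_prompt"),
]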

You can install DSPy via:

pip install -U dspy

For this example, we'll use Flux Pro from FAL. You can get an API key here

We will also need to install Pillow and dotenv.

pip install fal-client pillow python-dotenv

Now, let's import the necessary libraries and set up the environment:

import os  # needed if you uncomment the environment variable lines below

# Optional
#os.environ["FAL_API_KEY"] = "your_fal_api_key"
#os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
import dspy

from PIL import Image
from io import BytesIO
import requests
import fal_client

from dotenv import load_dotenv
load_dotenv()

# import display
from IPython.display import display

lm = dspy.LM(model="gpt-4o-mini", temperature=0.5)
dspy.settings.configure(lm=lm)
def generate_image(prompt):

    request_id = fal_client.submit(
        "fal-ai/flux-pro/v1.1-ultra",
        arguments={
            "prompt": prompt
        },
    ).request_id

    result = fal_client.result("fal-ai/flux-pro/v1.1-ultra", request_id)
    url = result["images"][0]["url"]

    return dspy.Image.from_url(url)

def display_image(image):
    url = image.url
    # download the image
    response = requests.get(url)
    image = Image.open(BytesIO(response.content))

    # display at 25% of original size
    display(image.resize((image.width // 4, image.height // 4)))
check_and_revise_prompt = dspy.Predict("desired_prompt: str, current_image: dspy.Image, current_prompt:str -> feedback:str, image_strictly_matches_desired_prompt: bool, revised_prompt: str")

initial_prompt = "A scene that's both peaceful and tense"
current_prompt = initial_prompt

max_iter = 5
for i in range(max_iter):
    print(f"Iteration {i+1} of {max_iter}")
    current_image = generate_image(current_prompt)
    result = check_and_revise_prompt(desired_prompt=initial_prompt, current_image=current_image, current_prompt=current_prompt)
    display_image(current_image)
    if result.image_strictly_matches_desired_prompt:
        break
    else:
        current_prompt = result.revised_prompt
        print(f"Feedback: {result.feedback}")
        print(f"Revised prompt: {result.revised_prompt}")

print(f"Final prompt: {current_prompt}")
Iteration 1 of 5

(generated image)

Feedback: The image depicts a peaceful autumn scene with people walking among colorful leaves, which aligns with the peaceful aspect of the prompt. However, it lacks any elements that convey tension, making it not fully representative of the desired prompt.
Iteration 2 of 5

(generated image)

Feedback: The image depicts a serene autumn scene with vibrant foliage and a calm river, which aligns well with the idea of peace. However, it lacks explicit elements that suggest underlying tension, making it less effective in conveying both aspects of the desired prompt.
Iteration 3 of 5

(generated image)

Feedback: The image depicts a serene autumn scene with warm colors and soft lighting, which aligns with the peaceful aspect of the desired prompt. However, it lacks elements that evoke tension or unease, making it not fully meet the requirement for a scene that is both peaceful and tense.
Iteration 4 of 5

(generated image)

Final prompt: A serene autumn scene with fog and shadows, capturing both peace and tension.
dspy.inspect_history(5)
�[34m[2025-01-17T11:36:55.947579]�[0m

�[31mSystem message:�[0m

Your input fields are:
1. `desired_prompt` (str)
2. `current_image` (Image)
3. `current_prompt` (str)

Your output fields are:
1. `feedback` (str)
2. `revised_prompt` (str)
3. `image_strictly_matches_desired_prompt` (bool)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## desired_prompt ## ]]
{desired_prompt}

[[ ## current_image ## ]]
{current_image}

[[ ## current_prompt ## ]]
{current_prompt}

[[ ## feedback ## ]]
{feedback}

[[ ## revised_prompt ## ]]
{revised_prompt}

[[ ## image_strictly_matches_desired_prompt ## ]]
{image_strictly_matches_desired_prompt}        # note: the value you produce must be True or False

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `desired_prompt`, `current_image`, `current_prompt`, produce the fields `feedback`, `revised_prompt`, `image_strictly_matches_desired_prompt`.


�[31mUser message:�[0m

[[ ## desired_prompt ## ]]
A scene that's both peaceful and tense

[[ ## current_image ## ]]
�[34m<image_url: https://fal.media/files/panda/HLKaiKFc_sN_fwBBN1C_2_1d1a72732e294452afb056a26d6b6c96.jpg>�[0m

[[ ## current_prompt ## ]]
A scene that's both peaceful and tense

Respond with the corresponding output fields, starting with the field `[[ ## feedback ## ]]`, then `[[ ## revised_prompt ## ]]`, then `[[ ## image_strictly_matches_desired_prompt ## ]]` (must be formatted as a valid Python bool), and then ending with the marker for `[[ ## completed ## ]]`.


�[31mResponse:�[0m

�[32m[[ ## feedback ## ]]
The image depicts a calm café scene with two individuals engaged in conversation, which conveys a sense of peace. However, the woman's posture and expression suggest an underlying tension, as she appears to be contemplative or distressed. This duality aligns well with the desired prompt of being both peaceful and tense.

[[ ## revised_prompt ## ]]
A scene that captures the duality of peace and tension in a café setting.

[[ ## image_strictly_matches_desired_prompt ## ]]
True

[[ ## completed ## ]]�[0m





�[34m[2025-01-17T11:37:55.458433]�[0m

�[31mSystem message:�[0m

Your input fields are:
1. `desired_prompt` (str)
2. `current_image` (Image)
3. `current_prompt` (str)

Your output fields are:
1. `feedback` (str)
2. `image_strictly_matches_desired_prompt` (bool)
3. `revised_prompt` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## desired_prompt ## ]]
{desired_prompt}

[[ ## current_image ## ]]
{current_image}

[[ ## current_prompt ## ]]
{current_prompt}

[[ ## feedback ## ]]
{feedback}

[[ ## image_strictly_matches_desired_prompt ## ]]
{image_strictly_matches_desired_prompt}        # note: the value you produce must be True or False

[[ ## revised_prompt ## ]]
{revised_prompt}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `desired_prompt`, `current_image`, `current_prompt`, produce the fields `feedback`, `image_strictly_matches_desired_prompt`, `revised_prompt`.


�[31mUser message:�[0m

[[ ## desired_prompt ## ]]
A scene that's both peaceful and tense

[[ ## current_image ## ]]
�[34m<image_url: https://fal.media/files/elephant/p3M00ca786Ov5ITX8nXUs_7b815b2087e54c1fbd798b9b9e0c5f5e.jpg>�[0m

[[ ## current_prompt ## ]]
A scene that's both peaceful and tense

Respond with the corresponding output fields, starting with the field `[[ ## feedback ## ]]`, then `[[ ## image_strictly_matches_desired_prompt ## ]]` (must be formatted as a valid Python bool), then `[[ ## revised_prompt ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


�[31mResponse:�[0m

�[32m[[ ## feedback ## ]]
The image depicts a peaceful autumn scene with people walking among colorful leaves, which aligns with the peaceful aspect of the prompt. However, it lacks any elements that convey tension, making it not fully representative of the desired prompt.

[[ ## image_strictly_matches_desired_prompt ## ]]
False

[[ ## revised_prompt ## ]]
A serene autumn scene with elements that suggest underlying tension

[[ ## completed ## ]]�[0m





�[34m[2025-01-17T11:38:24.032318]�[0m

�[31mSystem message:�[0m

Your input fields are:
1. `desired_prompt` (str)
2. `current_image` (Image)
3. `current_prompt` (str)

Your output fields are:
1. `feedback` (str)
2. `image_strictly_matches_desired_prompt` (bool)
3. `revised_prompt` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## desired_prompt ## ]]
{desired_prompt}

[[ ## current_image ## ]]
{current_image}

[[ ## current_prompt ## ]]
{current_prompt}

[[ ## feedback ## ]]
{feedback}

[[ ## image_strictly_matches_desired_prompt ## ]]
{image_strictly_matches_desired_prompt}        # note: the value you produce must be True or False

[[ ## revised_prompt ## ]]
{revised_prompt}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `desired_prompt`, `current_image`, `current_prompt`, produce the fields `feedback`, `image_strictly_matches_desired_prompt`, `revised_prompt`.


�[31mUser message:�[0m

[[ ## desired_prompt ## ]]
A scene that's both peaceful and tense

[[ ## current_image ## ]]
�[34m<image_url: https://fal.media/files/monkey/QhPlYYa5tob38190Gtn0t_17e963cacc1542d5977031aff954c718.jpg>�[0m

[[ ## current_prompt ## ]]
A serene autumn scene with elements that suggest underlying tension

Respond with the corresponding output fields, starting with the field `[[ ## feedback ## ]]`, then `[[ ## image_strictly_matches_desired_prompt ## ]]` (must be formatted as a valid Python bool), then `[[ ## revised_prompt ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


�[31mResponse:�[0m

�[32m[[ ## feedback ## ]]
The image depicts a serene autumn scene with vibrant foliage and a calm river, which aligns well with the idea of peace. However, it lacks explicit elements that suggest underlying tension, making it less effective in conveying both aspects of the desired prompt.

[[ ## image_strictly_matches_desired_prompt ## ]]
False

[[ ## revised_prompt ## ]]
A serene autumn scene with elements that evoke a sense of unease or foreboding

[[ ## completed ## ]]�[0m





�[34m[2025-01-17T11:38:46.424883]�[0m

�[31mSystem message:�[0m

Your input fields are:
1. `desired_prompt` (str)
2. `current_image` (Image)
3. `current_prompt` (str)

Your output fields are:
1. `feedback` (str)
2. `image_strictly_matches_desired_prompt` (bool)
3. `revised_prompt` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## desired_prompt ## ]]
{desired_prompt}

[[ ## current_image ## ]]
{current_image}

[[ ## current_prompt ## ]]
{current_prompt}

[[ ## feedback ## ]]
{feedback}

[[ ## image_strictly_matches_desired_prompt ## ]]
{image_strictly_matches_desired_prompt}        # note: the value you produce must be True or False

[[ ## revised_prompt ## ]]
{revised_prompt}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `desired_prompt`, `current_image`, `current_prompt`, produce the fields `feedback`, `image_strictly_matches_desired_prompt`, `revised_prompt`.


�[31mUser message:�[0m

[[ ## desired_prompt ## ]]
A scene that's both peaceful and tense

[[ ## current_image ## ]]
<image_url: https://fal.media/files/panda/kNBH7FF1uLk06nXonN0Zr_256e1bf2f71c4dcc8eb94027f09f3452.jpg>

[[ ## current_prompt ## ]]
A serene autumn scene with elements that evoke a sense of unease or foreboding

Respond with the corresponding output fields, starting with the field `[[ ## feedback ## ]]`, then `[[ ## image_strictly_matches_desired_prompt ## ]]` (must be formatted as a valid Python bool), then `[[ ## revised_prompt ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## feedback ## ]]
The image depicts a serene autumn scene with warm colors and soft lighting, which aligns with the peaceful aspect of the desired prompt. However, it lacks elements that evoke tension or unease, making it not fully meet the requirement for a scene that is both peaceful and tense.

[[ ## image_strictly_matches_desired_prompt ## ]]
False

[[ ## revised_prompt ## ]]
A serene autumn scene that includes subtle elements of tension or foreboding, such as dark shadows or an unsettling atmosphere.

[[ ## completed ## ]]





[2025-01-17T11:39:03.586839]

System message:

Your input fields are:
1. `desired_prompt` (str)
2. `current_image` (Image)
3. `current_prompt` (str)

Your output fields are:
1. `feedback` (str)
2. `image_strictly_matches_desired_prompt` (bool)
3. `revised_prompt` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## desired_prompt ## ]]
{desired_prompt}

[[ ## current_image ## ]]
{current_image}

[[ ## current_prompt ## ]]
{current_prompt}

[[ ## feedback ## ]]
{feedback}

[[ ## image_strictly_matches_desired_prompt ## ]]
{image_strictly_matches_desired_prompt}        # note: the value you produce must be True or False

[[ ## revised_prompt ## ]]
{revised_prompt}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `desired_prompt`, `current_image`, `current_prompt`, produce the fields `feedback`, `image_strictly_matches_desired_prompt`, `revised_prompt`.


User message:

[[ ## desired_prompt ## ]]
A scene that's both peaceful and tense

[[ ## current_image ## ]]
<image_url: https://fal.media/files/penguin/AXRlgHfIxbZ26Ipgd64Rn_d012474791904e70bd419945f7179980.jpg>

[[ ## current_prompt ## ]]
A serene autumn scene that includes subtle elements of tension or foreboding, such as dark shadows or an unsettling atmosphere.

Respond with the corresponding output fields, starting with the field `[[ ## feedback ## ]]`, then `[[ ## image_strictly_matches_desired_prompt ## ]]` (must be formatted as a valid Python bool), then `[[ ## revised_prompt ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## feedback ## ]]
The image depicts a foggy forest scene with autumn leaves, creating a peaceful atmosphere. However, the overall mood is somewhat tense due to the fog and shadows, aligning well with the desired prompt of being both peaceful and tense. 

[[ ## image_strictly_matches_desired_prompt ## ]]
True

[[ ## revised_prompt ## ]]
A serene autumn scene with fog and shadows, capturing both peace and tension.

[[ ## completed ## ]]

Tutorial: Math Reasoning

Let's walk through a quick example of setting up a dspy.ChainOfThought module and optimizing it for answering algebra questions.

Install the latest DSPy via pip install -U dspy and follow along.

Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to understand DSPy's behavior better. You can set up MLflow easily by following the four steps below.

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

Once you have completed the steps above, you can see traces for each program execution in the notebook. They provide great visibility into the model's behavior and help you understand DSPy's concepts better throughout the tutorial.

MLflow Trace

To learn more about the integration, visit MLflow DSPy Documentation as well.

Let's tell DSPy that we will use OpenAI's gpt-4o-mini in our modules. To authenticate, DSPy will look for your OPENAI_API_KEY environment variable. You can easily swap this out for other providers or local models.

import dspy

gpt4o_mini = dspy.LM('openai/gpt-4o-mini', max_tokens=2000)
gpt4o = dspy.LM('openai/gpt-4o', max_tokens=2000)
dspy.configure(lm=gpt4o_mini)  # we'll use gpt-4o-mini as the default LM, unless otherwise specified
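
If the key isn't already exported in your shell, you can set it from the notebook before configuring the LM, and the same dspy.LM call can point at other providers or local models. The alternatives below are illustrative assumptions, not used in this tutorial:

import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder key; set your own

# Illustrative alternatives (check your provider's docs for exact model strings):
# dspy.LM('anthropic/claude-3-5-sonnet-20240620')
# dspy.LM('ollama_chat/llama3.1', api_base='http://localhost:11434', api_key='')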

Next, let's load some data examples from the MATH benchmark. We'll use a training split for optimization and evaluate it on a held-out dev set.

Please note that the following step will require:

%pip install git+https://github.com/hendrycks/math.git
from dspy.datasets import MATH

dataset = MATH(subset='algebra')
print(len(dataset.train), len(dataset.dev))
350 350

Let's inspect one example from the training set.

example = dataset.train[0]
print("Question:", example.question)
print("Answer:", example.answer)
Question: The doctor has told Cal O'Ree that during his ten weeks of working out at the gym, he can expect each week's weight loss to be $1\%$ of his weight at the end of the previous week. His weight at the beginning of the workouts is $244$ pounds. How many pounds does he expect to weigh at the end of the ten weeks? Express your answer to the nearest whole number.
Answer: 221
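
As a quick sanity check on that gold answer, the expected weight is 244 * 0.99**10 ≈ 220.7 pounds, which rounds to 221:

round(244 * 0.99 ** 10)  # -> 221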

Now let's define our module. It's extremely simple: just a chain-of-thought step that takes a question and produces an answer.

module = dspy.ChainOfThought("question -> answer")
module(question=example.question)
Prediction(
    reasoning="Cal O'Ree's weight loss each week is $1\\%$ of his weight at the end of the previous week. This means that at the end of each week, he retains $99\\%$ of his weight from the previous week. \n\nIf we denote his weight at the beginning as \\( W_0 = 244 \\) pounds, then his weight at the end of week \\( n \\) can be expressed as:\n\\[\nW_n = W_{n-1} \\times 0.99\n\\]\nThis can be simplified to:\n\\[\nW_n = W_0 \\times (0.99)^n\n\\]\nAfter 10 weeks, his weight will be:\n\\[\nW_{10} = 244 \\times (0.99)^{10}\n\\]\n\nNow, we calculate \\( (0.99)^{10} \\):\n\\[\n(0.99)^{10} \\approx 0.904382\n\\]\n\nNow, we can calculate his expected weight after 10 weeks:\n\\[\nW_{10} \\approx 244 \\times 0.904382 \\approx 220.5\n\\]\n\nRounding to the nearest whole number, Cal O'Ree can expect to weigh approximately \\( 221 \\) pounds at the end of the ten weeks.",
    answer='221'
)

Next, let's set up an evaluator for the zero-shot module above, before prompt optimization.

THREADS = 24
kwargs = dict(num_threads=THREADS, display_progress=True, display_table=5)
evaluate = dspy.Evaluate(devset=dataset.dev, metric=dataset.metric, **kwargs)

evaluate(module)
Average Metric: 259.00 / 350 (74.0%): 100%|██████████| 350/350 [01:30<00:00,  3.85it/s]

2024/11/28 18:41:55 INFO dspy.evaluate.evaluate: Average Metric: 259 / 350 (74.0%)
question example_reasoning example_answer pred_reasoning pred_answer method
0 What is the smallest integer value of $c$ such that the function $... The given function has a domain of all real numbers if and only if... 1 To determine the smallest integer value of \( c \) such that the f... 1 ✔️ [True]
1 What is the least value of $x$ that is a solution of $|{-x+3}|=7$? In order to have $|{-x+3}| = 7$, we must have $-x + 3 = 7$ or $-x ... -4 To solve the equation \( |{-x+3}|=7 \), we need to consider the de... -4 ✔️ [True]
2 Evaluate $\left\lceil -\frac{7}{4}\right\rceil$. $-\frac{7}{4}$ is between $-1$ and $-2$, so $\left\lceil -\frac{7}... -1 To evaluate \(\left\lceil -\frac{7}{4}\right\rceil\), we first nee... -1 ✔️ [True]
3 A triangle has vertices at coordinates $(11,1)$, $(2,3)$ and $(3,7... We must find the distance between each pair of points by using the... 10 To find the length of the longest side of the triangle with vertic... 10 ✔️ [True]
4 Let $f(x) = x + 2$ and $g(x) = 1/f(x)$. What is $g(f(-3))$? First, we find that $f(-3) = (-3) + 2 = -1$. Then, $$g(f(-3)) = g(... 1 To find \( g(f(-3)) \), we first need to evaluate \( f(-3) \). The... 1 ✔️ [True]
... 345 more rows not displayed ...
74.0
Tracking Evaluation Results in MLflow Experiment

To track and visualize the evaluation results over time, you can record the results in an MLflow Experiment.

import mlflow

# Start an MLflow Run to record the evaluation
with mlflow.start_run(run_name="math_evaluation"):
    kwargs = dict(num_threads=THREADS, display_progress=True, return_all_scores=True, return_outputs=True)
    evaluate = dspy.Evaluate(devset=dataset.dev, metric=dataset.metric, **kwargs)

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate(module)

    # Log the aggregated score
    mlflow.log_metric("correctness", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Question": [example.question for example in dataset.dev],
            "Gold Answer": [example.answer for example in dataset.dev],
            "Predicted Answer": outputs,
            "Correctness": all_scores,
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit MLflow DSPy Documentation as well.

And lastly let's optimize our module. Since we want strong reasoning, we'll use the large GPT-4o as the teacher model (used to bootstrap reasoning for the small LM at optimization time) but not as the prompt model (used to craft instructions) or the task model (trained).

GPT-4o will be invoked only a small number of times. The model involved directly in optimization and in the resulting (optimized) program will be GPT-4o-mini.

We will also specify max_bootstrapped_demos=4, which means we want at most four bootstrapped examples in the prompt, and max_labeled_demos=4, which means that, in total between bootstrapped and pre-labeled examples, we want at most four.

kwargs = dict(num_threads=THREADS, teacher_settings=dict(lm=gpt4o), prompt_model=gpt4o_mini)
optimizer = dspy.MIPROv2(metric=dataset.metric, auto="medium", **kwargs)

kwargs = dict(requires_permission_to_run=False, max_bootstrapped_demos=4, max_labeled_demos=4)
optimized_module = optimizer.compile(module, trainset=dataset.train, **kwargs)
evaluate(optimized_module)
Average Metric: 310.00 / 350 (88.6%): 100%|██████████| 350/350 [01:31<00:00,  3.84it/s]

2024/11/28 18:59:19 INFO dspy.evaluate.evaluate: Average Metric: 310 / 350 (88.6%)
question example_reasoning example_answer pred_reasoning pred_answer method
0 What is the smallest integer value of $c$ such that the function $... The given function has a domain of all real numbers if and only if... 1 The function \( f(x) = \frac{x^2 + 1}{x^2 - x + c} \) will have a ... 1 ✔️ [True]
1 What is the least value of $x$ that is a solution of $|{-x+3}|=7$? In order to have $|{-x+3}| = 7$, we must have $-x + 3 = 7$ or $-x ... -4 The equation \( |{-x+3}|=7 \) implies two possible cases: 1. \(-x ... -4 ✔️ [True]
2 Evaluate $\left\lceil -\frac{7}{4}\right\rceil$. $-\frac{7}{4}$ is between $-1$ and $-2$, so $\left\lceil -\frac{7}... -1 To evaluate \(\left\lceil -\frac{7}{4}\right\rceil\), we first nee... -1 ✔️ [True]
3 A triangle has vertices at coordinates $(11,1)$, $(2,3)$ and $(3,7... We must find the distance between each pair of points by using the... 10 To find the length of the sides of the triangle formed by the vert... 10 ✔️ [True]
4 Let $f(x) = x + 2$ and $g(x) = 1/f(x)$. What is $g(f(-3))$? First, we find that $f(-3) = (-3) + 2 = -1$. Then, $$g(f(-3)) = g(... 1 To find \( g(f(-3)) \), we first need to evaluate \( f(-3) \). Usi... 1 ✔️ [True]
... 345 more rows not displayed ...
88.57

Neat. It was pretty straightforward to improve quality from 74% to over 88% on a held-out set here.

That said, for reasoning tasks like this, you will often want to consider more advanced strategies, like:

  • A dspy.ReAct module with access to a calculator function or dspy.PythonInterpreter (see the sketch below)
  • Ensembling multiple optimized prompts with a majority vote (or an Aggregator module) on top
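
For instance, here is a minimal sketch of the first idea: a dspy.ReAct agent with a calculator-style tool. The evaluate_math helper is a hypothetical name, and dspy.PythonInterpreter is shown as just one possible backend; treat the exact call signature as an assumption to verify against the current API.

def evaluate_math(expression: str) -> float:
    # Hypothetical tool: evaluate an arithmetic expression in a sandboxed interpreter.
    return dspy.PythonInterpreter({}).execute(expression)

react = dspy.ReAct("question -> answer", tools=[evaluate_math])
print(react(question=example.question).answer)

Such a module can then be optimized with dspy.MIPROv2 in the same way as the ChainOfThought module above.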

Just to understand what changed, let's view the prompt after optimization. Alternatively, if you enabled MLflow tracing following the instructions above, you can compare the prompts before and after optimization in the rich trace UI.

dspy.inspect_history()
[2024-11-28T18:59:19.176586]

System message:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Analyze the `question` provided, and systematically apply mathematical reasoning to derive the `answer`. Ensure to articulate each step of your thought process in the `reasoning` field, detailing how you identify relationships and formulate equations to arrive at the solution.


User message:

[[ ## question ## ]]
If $|4x+2|=10$ and $x<0$, what is the value of $x$?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The equation $|4x+2|=10$ implies two possible equations: $4x+2=10$ or $4x+2=-10$. We solve each equation separately.

1. For $4x+2=10$:
   \[
   4x + 2 = 10 \\
   4x = 8 \\
   x = 2
   \]
   However, $x=2$ does not satisfy the condition $x<0$.

2. For $4x+2=-10$:
   \[
   4x + 2 = -10 \\
   4x = -12 \\
   x = -3
   \]
   The solution $x = -3$ satisfies the condition $x<0$.

Therefore, the value of $x$ is $\boxed{-3}$.

[[ ## answer ## ]]
-3

[[ ## completed ## ]]


User message:

[[ ## question ## ]]
Given that $-4$ is a solution to $x^2 + bx -36 = 0$, what is the value of $b$?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The product of the roots of this quadratic is $-36/1=-36$, so the other solution must be $-36/-4=9$. That means that the sum of the solutions is $-4+9=5$. The sum of the solutions is also $-b/1=-b$. Thus, $-b=5$ and $b=\boxed{-5}$.

[[ ## answer ## ]]
-5

[[ ## completed ## ]]


User message:

[[ ## question ## ]]
Richard is building a rectangular playground from 200 feet of fencing. The fencing must entirely enclose the playground. What is the maximum area of this playground?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
Let the length of the playground be $l$ and the width be $w$. We have the equation $2l+2w=200 \Rightarrow l + w = 100$. We want to maximize the area of this rectangular playground, which is given by $lw$. From our equation, we know that $l=100-w$. Substituting this in to our expression for area, we have \[(100-w)(w)=100w-w^2\]We will now complete the square to find the maximum value of this expression. Factoring a $-1$ out, we have \[-(w^2-100w)\]In order for the expression inside the parenthesis to be a perfect square, we need to add and subtract $(100/2)^2=2500$ inside the parenthesis. Doing this, we get \[-(w^2-100w+2500-2500) \Rightarrow -(w-50)^2+2500\]Since the maximum value of $-(w-50)^2$ is 0 (perfect squares are always nonnegative), the maximum value of the entire expression is 2500, which is achieved when $w=50$ and $l=100-w=50$ (the playground is a square). Thus, the maximum area of the playground is $\boxed{2500}$ square feet.

[[ ## answer ## ]]
2500

[[ ## completed ## ]]


User message:

[[ ## question ## ]]
Alice and Bob are playing a game. Alice starts first. On Alice's turn, she flips a coin. If she gets a heads, she wins. If not, it becomes Bob's turn. On Bob's turn, he flips a coin. If he gets a tails, he wins. If not, it becomes Alice's turn. What is the probability that Alice wins the game?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
Alice has a $1/2$ chance of winning the game on her first turn. If she doesn't, then the probability that she wins the game on her second turn is $1/8,$ since she must not win on her first flip ($1/2$ chance), Bob must not win on his first flip ($1/2$ chance), and then Alice must win on her second flip ($1/2$ chance). The probability that she wins the game on her third turn is $1/32,$ and in general, the probability that she wins the game on her $k^\text{th}$ turn is $(1/2)^{2k-1}.$ Thus, the probability that Alice wins is an infinite geometric series with first term $1/2$ and common ratio $1/4.$ So, the probability that Alice wins the game is $$\frac{\frac12}{1-\frac14} = \boxed{\frac{2}{3}}.$$OR

Note that the only difference between the odds of Alice or Bob winning is who goes first. Because Bob goes second, the odds of him winning on his $k^\text{th}$ flip is half of the odds that Alice wins on her $k^\text{th}$ flip, since Alice must first get a tails before Bob gets a chance to win. Thus, if $a$ is Alice's chance of winning, and $b$ is Bob's chance of winning, then $a = 2b.$ Also, since someone must win, $a + b = 1.$ It follows that $a = 2/3$ and $b = 1/3,$ so Alice has a $\boxed{\frac{2}{3}}$ chance of winning the game.

[[ ## answer ## ]]
\frac{2}{3}

[[ ## completed ## ]]


User message:

[[ ## question ## ]]
Find the smallest possible value of $$\frac{(y-x)^2}{(y-z)(z-x)} + \frac{(z-y)^2}{(z-x)(x-y)} + \frac{(x-z)^2}{(x-y)(y-z)},$$ where $x,y,$ and $z$ are distinct real numbers.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
Let \( S = \frac{(y-x)^2}{(y-z)(z-x)} + \frac{(z-y)^2}{(z-x)(x-y)} + \frac{(x-z)^2}{(x-y)(y-z)} \).

To find the minimum value of \( S \), we can use the method of symmetry and the Cauchy-Schwarz inequality. 

By applying the Cauchy-Schwarz inequality, we have:
\[
\left( \frac{(y-x)^2}{(y-z)(z-x)} + \frac{(z-y)^2}{(z-x)(x-y)} + \frac{(x-z)^2}{(x-y)(y-z)} \right) \left( (y-z)(z-x) + (z-x)(x-y) + (x-y)(y-z) \right) \geq \left( (y-x) + (z-y) + (x-z) \right)^2.
\]
The right-hand side simplifies to:
\[
(y-x) + (z-y) + (x-z) = 0,
\]
which means the right-hand side is \( 0 \). Therefore, we have:
\[
S \cdot \left( (y-z)(z-x) + (z-x)(x-y) + (x-y)(y-z) \right) \geq 0.
\]
This implies that \( S \) is non-negative.

Next, we can test specific values for \( x, y, z \) to find the minimum. Let's set \( x = 0, y = 1, z = 2 \):
\[
S = \frac{(1-0)^2}{(1-2)(2-0)} + \frac{(2-1)^2}{(2-0)(0-1)} + \frac{(0-2)^2}{(0-1)(1-2)}.
\]
Calculating each term:
1. The first term:
   \[
   \frac{1^2}{(-1)(2)} = \frac{1}{-2} = -\frac{1}{2}.
   \]
2. The second term:
   \[
   \frac{1^2}{(2)(-1)} = \frac{1}{-2} = -\frac{1}{2}.
   \]
3. The third term:
   \[
   \frac{(-2)^2}{(-1)(-1)} = \frac{4}{1} = 4.
   \]

Thus,
\[
S = -\frac{1}{2} - \frac{1}{2} + 4 = 3.
\]

To check if this is indeed the minimum, we can also consider the case when \( x, y, z \) are the roots of a quadratic polynomial. By symmetry and the nature of the expression, we can conclude that the minimum value of \( S \) is indeed \( 3 \).

Therefore, the smallest possible value of \( S \) is \( \boxed{3} \).

[[ ## answer ## ]]
3

[[ ## completed ## ]]

Tutorial: Multi-Hop Retrieval

Let's walk through a quick example of building a dspy.Module with multiple sub-modules. We'll do this for the task of multi-hop search.

Install the latest DSPy via pip install -U dspy and follow along.

Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to understand DSPy's behavior better. You can set up MLflow easily by following the four steps below.

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

MLflow Trace

To learn more about the integration, visit MLflow DSPy Documentation as well.

In this tutorial, we'll use a small local LM, Meta's Llama-3.1-8B-Instruct, which has 8 billion parameters.

You might be able to host the 8B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.

In the snippet below, we'll configure this small model as our main LM. We'll also set up a larger LM, i.e. GPT-4o, as a teacher that we'll invoke a very small number of times to help teach the small LM. This is technically not necessary; the small model can typically teach itself tasks like this in DSPy. But using a larger teacher gives us some peace of mind, so that the initial system or optimizer configuration doesn't matter as much.

import dspy

lm = dspy.LM('<your_provider>/Llama-3.1-8B-Instruct', max_tokens=3000)
gpt4o = dspy.LM('openai/gpt-4o', max_tokens=3000)

dspy.configure(lm=lm)

Install dependencies and download data

To do the retrieval, we'll use the cool BM25S library, as it's pretty lightweight. You can replace this component with whatever you like.

> pip install -U bm25s PyStemmer "jax[cpu]"

Next, we'll download a snapshot of the abstracts (i.e., first paragraphs) of all 5,000,000 Wikipedia pages as of 2017. We'll use this as our retrieval corpus.

This is 500MB compressed, so the download and decompression may take 2-3 minutes.

from dspy.utils import download

download("https://huggingface.co/dspy/cache/resolve/main/wiki.abstracts.2017.tar.gz")
!tar -xzvf wiki.abstracts.2017.tar.gz

Let's now load the corpus.

import ujson
corpus = []

with open("wiki.abstracts.2017.jsonl") as f:
    for line in f:
        line = ujson.loads(line)
        corpus.append(f"{line['title']} | {' '.join(line['text'])}")

len(corpus)
5233330

And then let's index it for BM25 retrieval! This will take 2-3 minutes.

import bm25s
import Stemmer

stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

retriever = bm25s.BM25(k1=0.9, b=0.4)
retriever.index(corpus_tokens)

Load the HoVer dataset.

Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim.

import random
from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="hover-nlp/hover", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
hover = [
    dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
    for x in hover
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids and not hpqa_ids.add(x["hpqa_id"])
]

random.Random(0).shuffle(hover)
trainset, devset, testset = hover[:200], hover[200:500], hover[650:]

Let's view an example of this task:

example = trainset[0]

print("Claim:", example.claim)
print("Pages that must be retrieved:", example.titles)
Claim: This director is known for his work on Miss Potter. The Academy of Motion Picture Arts and Sciences presents the award in which he was nominated for his work in "Babe".
Pages that must be retrieved: ['Miss Potter', 'Chris Noonan', 'Academy Award for Best Director']

Now, let's define a function to search over Wikipedia. This will use our BM25 index.

def search(query: str, k: int) -> dict[str, float]:
    tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
    results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
    run = {corpus[doc]: float(score) for doc, score in zip(results[0], scores[0])}  # passage -> BM25 score
    return run
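
For example, a call like the one below (the query is made up for illustration) returns a dict mapping each "Title | abstract" passage to its BM25 score:

results = search("Who directed the film Miss Potter?", k=3)
for passage, score in results.items():
    print(round(score, 2), passage[:60])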

Now, let's define the multi-hop program in DSPy. It's going to be super simple: it'll take a claim and produce a list of titles (titles: list[str]).

It will do this via two sub-modules: generate_query and append_notes.

class Hop(dspy.Module):
    def __init__(self, num_docs=10, num_hops=4):
        self.num_docs, self.num_hops = num_docs, num_hops
        self.generate_query = dspy.ChainOfThought('claim, notes -> query')
        self.append_notes = dspy.ChainOfThought('claim, notes, context -> new_notes: list[str], titles: list[str]')

    def forward(self, claim: str) -> list[str]:
        notes = []
        titles = []

        for _ in range(self.num_hops):
            query = self.generate_query(claim=claim, notes=notes).query
            context = search(query, k=self.num_docs)
            prediction = self.append_notes(claim=claim, notes=notes, context=context)
            notes.extend(prediction.new_notes)
            titles.extend(prediction.titles)
        
        return dspy.Prediction(notes=notes, titles=list(set(titles)))
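
Before evaluating, you can sanity-check the program on a single training claim (the exact titles you get back will vary with your LM):

pred = Hop()(claim=example.claim)
print(pred.titles)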

Great. Now let's set up an evaluation metric, top5_recall.

It will return the fraction of the gold pages (which are always 3) that are retrieved in the top-5 titles returned by the program.

def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)

    # If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
    if trace is not None:
        return recall >= 1.0
    
    # If we're just doing inference, just measure the recall.
    return recall
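
To make the metric concrete, here is a tiny hypothetical check: with three gold titles and two of them among the predicted top five, the recall is 2/3:

gold = dspy.Example(titles=["Miss Potter", "Chris Noonan", "Academy Award for Best Director"])
pred = dspy.Prediction(titles=["Miss Potter", "Babe (film)", "Chris Noonan", "Peter Rabbit", "BAFTA"])
print(top5_recall(gold, pred))  # 0.666...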

evaluate = dspy.Evaluate(devset=devset, metric=top5_recall, num_threads=16, display_progress=True, display_table=5)

Let's evaluate our off-the-shelf program!

evaluate(Hop())
Average Metric: 27.67 / 98 (28.2%):  32%|███▏      | 97/300 [00:02<00:04, 49.34it/s]

2024/12/25 12:18:00 ERROR dspy.utils.parallelizer: Error processing item Example({'claim': "All That is the show that the co-creator with the host of Vibe and Wild 'N Out had a debut on.", 'titles': ['Chris Spencer (actor)', 'Nick Cannon', 'Vibe (talk show)']}) (input_keys={'claim'}): Expected dict_keys(['reasoning', 'new_notes', 'titles']) but got dict_keys(['reasoning', 'new_notes']). Set `provide_traceback=True` to see the stack trace.


Average Metric: 59.33 / 186 (31.9%):  62%|██████▏   | 186/300 [00:03<00:02, 51.84it/s]

2024/12/25 12:18:02 ERROR dspy.utils.parallelizer: Error processing item Example({'claim': 'The song, which Billie Anthony is best known for her Top 10 hit version, topped the UK chart in 1981 in a recording by a platinum-selling British rock and roll singer whose recording and performing career began in the late 1960s.', 'titles': ["Shakin' Stevens", 'This Ole House', 'Billie Anthony']}) (input_keys={'claim'}): Expected dict_keys(['reasoning', 'new_notes', 'titles']) but got dict_keys(['reasoning']). Set `provide_traceback=True` to see the stack trace.


Average Metric: 94.00 / 298 (31.5%): 100%|██████████| 300/300 [00:06<00:00, 48.56it/s]


2024/12/25 12:18:04 INFO dspy.evaluate.evaluate: Average Metric: 93.99999999999993 / 300 (31.3%)
claim example_titles notes pred_titles top5_recall titles
0 Nike football team has had a player endorse the football boot Nike... [Nike Hypervenom, Nike Total 90, Marcus Rashford] ['The Nike Total 90 has been replaced by the Nike Hypervenom.', 'T... ['Nike Mercurial Vapor | The Mercurial Vapor is a football boot ma... ✔️ [0.333] NaN
1 Bill Boyd is the chairman of the appliance company that operates t... [Suncoast Hotel and Casino, Boyd Gaming, Thomas Eje] ['Bill Boyd is not mentioned as the chairman of an appliance compa... [Suncoast Casino, Thomas Eje, Boyd Gaming Corporation, Bill Boyd, ... ✔️ [0.333] NaN
2 The president of South Korea was born 24 January 1953. The group t... [Presidential Council on Nation Branding, Korea, Moon Jae-in, Euh ... ['The president of South Korea was likely born before 1945', 'Euh ... ['Yi Cheol-seung', 'List of Presidents of South Korea', 'Lifespan ... NaN
3 The movie Khan Kluay was released 2 months before the 2009 movie t... [Fantastic Mr. Fox (film), Jason Schwartzman, Khan Kluay] ['The movie Khan Kluay was released in 2006.', 'The 2009 movie tha... [Khan Kluay, The Darjeeling Limited] ✔️ [0.333] NaN
4 The director of Finding Dory co-directed the film A Bug's Life. [Andrew Stanton, Finding Dory, A Bug's Life] ['The director of Finding Dory is Andrew Stanton and Angus MacLane... [Finding Dory, A Bug's Life] ✔️ [0.667] NaN
... 295 more rows not displayed ...
31.33
Tracking Evaluation Results in MLflow Experiment

To track and visualize the evaluation results over time, you can record the results in an MLflow Experiment.

import mlflow

with mlflow.start_run(run_name="hop_evaluation"):
    evaluate = dspy.Evaluate(
        devset=devset,
        metric=top5_recall,
        num_threads=16,
        display_progress=True,
        # To record the outputs and detailed scores to MLflow
        return_all_scores=True,
        return_outputs=True,
    )

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate(Hop())

    # Log the aggregated score
    mlflow.log_metric("top5_recall", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Claim": [example.claim for example in eval_set],
            "Expected Titles": [example.titles for example in eval_set],
            "Predicted Titles": outputs,
            "Top 5 Recall": all_scores,
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit MLflow DSPy Documentation as well.

Let's now optimize the two prompts inside the Hop() program jointly to maximize the recall of our program. This may take around 35 minutes and make roughly $5 worth of calls to GPT-4o to optimize Llama-3.1-8B.

models = dict(prompt_model=gpt4o, teacher_settings=dict(lm=gpt4o))
tp = dspy.MIPROv2(metric=top5_recall, auto="medium", num_threads=16, **models)

kwargs = dict(minibatch_size=40, minibatch_full_eval_steps=4, requires_permission_to_run=False)
optimized = tp.compile(Hop(), trainset=trainset, max_bootstrapped_demos=4, max_labeled_demos=4, **kwargs)

Let's now evaluate again, after optimization.

evaluate(optimized)
Average Metric: 38.67 / 64 (60.4%):  21%|██        | 63/300 [00:01<00:06, 38.13it/s]

2024/12/25 12:18:09 ERROR dspy.utils.parallelizer: Error processing item Example({'claim': 'Eliot Hyman co-founded  Seven Arts Productions in 1957. His co-founder produced the American-American black comedy-drama film directed by Stanley Kubrick.', 'titles': ['Ray Stark', 'Seven Arts Productions', 'Lolita (1962 film)']}) (input_keys={'claim'}): Expected dict_keys(['reasoning', 'query']) but got dict_keys(['reasoning']). Set `provide_traceback=True` to see the stack trace.


Average Metric: 177.33 / 299 (59.3%): 100%|██████████| 300/300 [00:08<00:00, 36.01it/s]

2024/12/25 12:18:16 INFO dspy.evaluate.evaluate: Average Metric: 177.33333333333334 / 300 (59.1%)
claim example_titles notes pred_titles top5_recall titles
0 Nike football team has had a player endorse the football boot Nike... [Nike Hypervenom, Nike Total 90, Marcus Rashford] [] [Nike Hypervenom, Nike Total 90, Kylian Mbappé, Marcus Rashford] ✔️ [1.000] NaN
1 Bill Boyd is the chairman of the appliance company that operates t... [Suncoast Hotel and Casino, Boyd Gaming, Thomas Eje] [] [Bill Boyd, Suncoast Casino, Las Vegas, Thomas Eje] ✔️ [0.333] NaN
2 The president of South Korea was born 24 January 1953. The group t... [Presidential Council on Nation Branding, Korea, Moon Jae-in, Euh ... ['Euh Yoon-Dae is a South Korean professor, financier, and advisor... [Euh Yoon-Dae, KB Financial Group, Chang Dae-hwan, Maeil Business ... NaN
3 The movie Khan Kluay was released 2 months before the 2009 movie t... [Fantastic Mr. Fox (film), Jason Schwartzman, Khan Kluay] ["Jason Schwartzman collaborated with Wes Anderson on the 2009 mov... [Wes Anderson, Fantastic Mr. Fox, Khan Kluay 2, Jason Schwartzman,... ✔️ [0.667] NaN
4 The director of Finding Dory co-directed the film A Bug's Life. [Andrew Stanton, Finding Dory, A Bug's Life] ["Andrew Stanton co-directed A Bug's Life", "John Lasseter directe... [John Lasseter, Andrew Stanton, Finding Dory, A Bug's Life] ✔️ [1.000] NaN
... 295 more rows not displayed ...
59.11

Awesome. It looks like the system improved drastically from around 30% recall to a little below 60% recall. That was a pretty straightforward approach, but DSPy gives you many tools to continue iterating on this from here.

Next, let's inspect the optimized prompts to understand what was learned. We'll run one query and then inspect the last two prompts, which will show us the prompts used for both sub-modules in the final hop of the Hop() program. (Alternatively, if you enabled MLflow Tracing following the instructions above, you can see all the steps, including LLM calls, prompts, and tool execution, in a rich tree view.)

optimized(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
['Up Against It', 'Bernard-Marie Koltès', 'The Beatles', 'Joe Orton']
dspy.inspect_history(n=2)
[2024-12-25T12:18:16.177899]

System message:

Your input fields are:
1. `claim` (str)
2. `notes` (str)

Your output fields are:
1. `reasoning` (str)
2. `query` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## claim ## ]]
{claim}

[[ ## notes ## ]]
{notes}

[[ ## reasoning ## ]]
{reasoning}

[[ ## query ## ]]
{query}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given a claim and a set of notes, generate a query that can be used to gather additional evidence or context to support or refute the claim. Think step by step to ensure the query is specific and relevant to the information provided in the notes.


User message:

[[ ## claim ## ]]
Danyang, Jiangusu and this city are both cities in China. This city was the birthplace of Chen Xiuke.

[[ ## notes ## ]]
[1] «Chen Xiuke was born in Dongfang, Hainan.»
[2] «Danyang is a city in Jiangsu province, China.»
[3] «Chen Xiuke was born in Dongfang, Hainan.»
[4] «Danyang is a city in Jiangsu province, China.»
[5] «Dongfang is a city in Hainan province, China.»
[6] «Chen Xiuke was born in Dongfang, Hainan.»
[7] «Danyang is a city in Jiangsu province, China.»
[8] «Dongfang is a city in Hainan province, China.»
[9] «Chen Xiuke was born in Dongfang, Hainan.»
[10] «Danyang is a city in Jiangsu province, China.»
[11] «Dongfang is a city in Hainan province, China.»

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## query ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The claim states that Danyang, Jiangsu, and another city are both cities in China, and that this other city was the birthplace of Chen Xiuke. According to the notes, Chen Xiuke was born in Dongfang, Hainan. Therefore, the city referred to in the claim as "this city" is Dongfang, which is indeed a city in Hainan province, China. The claim is consistent with the information provided in the notes.

[[ ## query ## ]]
What is the birthplace of Chen Xiuke?

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The star of Nothing to Report and Gary Barlow have a profession in common.

[[ ## notes ## ]]
[1] «Chris Jericho is a musician and professional wrestler.»
[2] «Gary Barlow is a singer and songwriter, known for being a member of Take That.»
[3] «Chris Jericho is a musician and professional wrestler.»
[4] «Gary Barlow is a singer and songwriter, known for being a member of Take That.»
[5] «Chris Jericho stars in 'Nothing to Report'.»
[6] «Chris Jericho is a musician and professional wrestler.»
[7] «Gary Barlow is a singer and songwriter, known for being a member of Take That.»
[8] «Chris Jericho stars in 'Nothing to Report'.»
[9] «Chris Jericho is a musician and professional wrestler.»
[10] «Gary Barlow is a singer and songwriter, known for being a member of Take That.»
[11] «Chris Jericho stars in 'Nothing to Report'.»

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## query ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The claim suggests that the star of "Nothing to Report" and Gary Barlow share a profession. From the notes, we know that Chris Jericho stars in "Nothing to Report" and is a musician and professional wrestler. Gary Barlow is a singer and songwriter. The common profession between Chris Jericho and Gary Barlow is that they are both musicians.

[[ ## query ## ]]
What is the profession that Chris Jericho and Gary Barlow have in common?

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.

[[ ## notes ## ]]
N/A

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## query ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The claim states that the author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights. However, there is no information in the notes to support or refute this claim. The notes are empty, which means we cannot verify the author of the script or Bernard-Marie Koltès' profession.

[[ ## query ## ]]
Who is the author of the 1960s unproduced script written for The Beatles, Up Against It?

[[ ## completed ## ]]





[2024-12-25T12:18:16.191896]

System message:

Your input fields are:
1. `claim` (str)
2. `notes` (str)
3. `context` (str)

Your output fields are:
1. `reasoning` (str)
2. `new_notes` (list[str])
3. `titles` (list[str])

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## claim ## ]]
{claim}

[[ ## notes ## ]]
{notes}

[[ ## context ## ]]
{context}

[[ ## reasoning ## ]]
{reasoning}

[[ ## new_notes ## ]]
{new_notes}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## titles ## ]]
{titles}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Analyze the provided `claim`, `notes`, and `context` to generate `new_notes` that further support or refute the claim, and extract relevant `titles` that represent key topics or entities from the context.


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## claim ## ]]
The figure who currently holds the role Michael Noonan held from 2001 to 2002, was the Minister for Social Protection from 2016 until 2017.

[[ ## notes ## ]]
Not supplied for this particular example.

[[ ## context ## ]]
Not supplied for this particular example.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## new_notes ## ]]` (must be formatted as a valid Python list[str]), then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example.

[[ ## new_notes ## ]]
Not supplied for this particular example.

[[ ## titles ## ]]
["Michael Noonan", "Leader of Fine Gael", "Leo Varadkar"]

[[ ## completed ## ]]


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## claim ## ]]
The type of area Wiliwili are typically found in have a dominant tree species of Acacia koa. They are of the pea family.

[[ ## notes ## ]]
Not supplied for this particular example.

[[ ## context ## ]]
Not supplied for this particular example.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## new_notes ## ]]` (must be formatted as a valid Python list[str]), then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example.

[[ ## new_notes ## ]]
Not supplied for this particular example.

[[ ## titles ## ]]
["Acacia koa", "Wiliwili", "Hawaiian tropical dry forests"]

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The father of Zak Ové and A. Edward Sutherland are not both photographers.

[[ ## notes ## ]]
[1] «Horace Ové is a photographer, filmmaker, and writer.»
[2] «A. Edward Sutherland is a film director, not a photographer.»
[3] «Horace Ové is a photographer, filmmaker, and writer.»
[4] «A. Edward Sutherland is a film director, not a photographer.»
[5] «Horace Ové is a photographer, filmmaker, and writer.»
[6] «A. Edward Sutherland is a film director, not a photographer.»
[7] «Horace Ové is a photographer, filmmaker, and writer.»
[8] «A. Edward Sutherland is a film director, not a photographer.»

[[ ## context ## ]]
{"Horace Ové | Horace Ové, CBE (born 1939), is a British filmmaker, photographer, painter and writer, one of the leading black independent film-makers to emerge in Britain since the post-war period.  Ové holds the \"Guinness World Record\" for being the first black British film-maker to direct a feature-length film, \"Pressure\" (1975).  In its retrospective history, \"100 Years of Cinema\", the British Film Institute (BFI) declared: \"Horace Ové is undoubtedly a pioneer in Black British history and his work provides a perspective on the Black experience in Britain.\"": 15.836545944213867, "Zak Ové | Zak Ové (born 1966) is a British visual artist who works between sculpture, film and photography, living in London, UK, and Trinidad.  His themes reflect \"his documentation of and anthropological interest in diasporic and African history, specifically that which is explored through Trinidadian carnival.\"  In work that is \"filtered through his own personal and cultural upbringing, with a black Trinidadian father and white Irish mother\", he has exhibited widely in Europe, the United States and Africa, participating in international museum shows in London, Dakar, Paris, Dubai, Prague, Berlin, Johannesburg, Bamako and New York City.  His father is the filmmaker Horace Ové and his sister is the actress Indra Ové.": 13.145259857177734, "Playing Away | Playing Away is a 1987 TV comedy film directed by Horace Ové, from a screenplay by Caryl Phillips.  In the story, an English cricket team, fictitiously named \"Sneddington\" (based in Lavenham, Suffolk), invites a team of West Indian heritage based in Brixton (South London) to play a charity game in support of their \"Third World Week.\"  According to Screenonline, \"The gentle comedy of manners and unexpected reversal of white and black stereotypes in \"Playing Away\" contrasts sharply with the stylistic experimentation and the militant denunciations of racial prejudice in director Horace Ové's earlier feature, \"Pressure\" (1975).\" \" New York Times\" reviewer Vincent Canby called it \"witty and wise without being seriously disturbing for a minute\".": 12.445182800292969, "Pressure (film) | Pressure is a 1976 British drama film and the first feature-length fiction film directed by a Black film-maker in Britain.  Directed by Horace Ové, and co-written by him with Samuel Selvon, \"Pressure\" is a powerful portrait of inter-generational tensions between first- and second-generation West Indian migrants in London's Notting Hill area.  According to Julia Toppin,": 10.526924133300781, "What a Night! (1928 film) | What a Night!  (1928) is an American silent film directed by A. Edward Sutherland.  The romantic comedy was written by Louise Long, from a story by Lloyd Corrigan and Grover Jones.  The film stars Bebe Daniels, Neil Hamilton, and William Austin.": 10.453304290771484, "The Orchid House (TV serial) | The Orchid House is a four-part television serial that first aired on British television's Channel 4 from 21 February to 14 March 1991, directed by Horace Ové.  
Its cast featured Diana Quick, Madge Sinclair, Nigel Terry, Elizabeth Hurley, Kate Buffery and Frances Barber, and was based on Phyllis Shand Allfrey's only novel of the same name.": 10.413471221923828, "Indra Ové | Indra Ové (born 1968) is a British movie and stage actress.": 10.15088176727295, "Henry Sutherland Edwards | Henry Sutherland Edwards (1828–1906) was a British journalist.": 9.868606567382812, "Burning an Illusion | Burning an Illusion is a 1981 British film written and directed by Menelik Shabazz, about a young British-born black woman's love life, mostly shot in London's Notting Hill and Ladbroke Grove communities.  It was only the second British feature to have been made by a black director, following Horace Ové’s 1975 \"Pressure\", and is described by Stephen Bourne as \"the first British film to give a black woman a voice of any kind.\"  Imruh Bakari worked with Shabazz and co-founded Kumba productions with him.": 9.634703636169434, "Horace Hall Edwards | Horace Hall Edwards (August 21, 1902 – January 27, 1987) was an American politician, who was mayor of Richmond, Virginia, and a candidate for governor in 1949.": 9.451831817626953}

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## new_notes ## ]]` (must be formatted as a valid Python list[str]), then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The claim states that the father of Zak Ové and A. Edward Sutherland are not both photographers. From the context, we know that Zak Ové's father is Horace Ové, who is indeed a photographer among other professions. However, A. Edward Sutherland is not mentioned as a photographer in the context; he is known as a film director. Therefore, the claim is correct because Horace Ové is a photographer, but A. Edward Sutherland is not.

[[ ## new_notes ## ]]
["Horace Ové is a photographer, filmmaker, and writer.", "A. Edward Sutherland is a film director, not a photographer."]

[[ ## titles ## ]]
["Horace Ové", "A. Edward Sutherland", "Zak Ové"]

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The summer 2016 romantic drama "Me Before You" is directed by Thea Sharrock. The star of the film The Lost Future (who also appears in The Hunger Games) stars as the character Will Traynor.

[[ ## notes ## ]]
[1] «Sam Claflin stars as Will Traynor in 'Me Before You'.»
[2] «Sam Claflin is known for his role as Finnick Odair in 'The Hunger Games'.»
[3] «Thea Sharrock directed 'Me Before You'.»
[4] «Sam Claflin stars as Will Traynor in 'Me Before You'.»
[5] «Sam Claflin is known for his role as Finnick Odair in 'The Hunger Games'.»
[6] «Thea Sharrock directed 'Me Before You'.»
[7] «Sam Claflin starred in 'The Lost Future'.»
[8] «Sam Claflin stars as Will Traynor in 'Me Before You'.»
[9] «Sam Claflin is known for his role as Finnick Odair in 'The Hunger Games'.»
[10] «Thea Sharrock directed 'Me Before You'.»
[11] «Sam Claflin starred in 'The Lost Future'.»
[12] «Sam Claflin stars as Will Traynor in 'Me Before You'.»
[13] «Sam Claflin is known for his role as Finnick Odair in 'The Hunger Games'.»
[14] «Thea Sharrock directed 'Me Before You'.»
[15] «Sam Claflin starred in 'The Lost Future'.»

[[ ## context ## ]]
{"Sam Claflin | Samuel George Claflin (born 27 June 1986) is an English actor.  He is known for portraying Finnick Odair in \"The Hunger Games\" film series, Philip Swift in \"\", and Will Traynor in \"Me Before You\".": 19.94539451599121, "Me Before You (film) | Me Before You is a 2016 romantic drama film directed by Thea Sharrock in her directorial debut and adapted by English author Jojo Moyes from her 2012 novel of the same name.  The film stars Emilia Clarke, Sam Claflin, Steve Peacocke, Jenna Coleman, Charles Dance, Matthew Lewis, Janet McTeer, Vanessa Kirby and Joanna Lumley.": 18.48834228515625, "Look What You Did to Me | Look What You Did to Me is the debut studio album by American rapper and singer Z-Ro.  It was released on June 16, 1998, by this independently distributed label Fisherboy Records.  The album is entirely produced by Z-Ro alongside Rakish Jacob (aka Roc), while the guest appearances was from T.A.Z., Bam, Trae, The Fakkulty, Chris Ward and Al-D.": 14.100790977478027, "I Still Know What You Did Last Summer | I Still Know What You Did Last Summer is a 1998 American slasher film and a sequel to the 1997 film \"I Know What You Did Last Summer\".  Directed by Danny Cannon, the film was written by Trey Callaway, and features characters originally created in Lois Duncan's 1973 novel \"I Know What You Did Last Summer\".  Jennifer Love Hewitt, Freddie Prinze, Jr. and Muse Watson reprise their roles, with Brandy, Mekhi Phifer, Jennifer Esposito, and Matthew Settle joining the cast.  \"I Still Know What You Did Last Summer\" continues after the events of the first film.": 13.990736961364746, "Tell Me What You Dream | \"Tell Me What You Dream\" is a song written by Timothy B. Schmit, Josh Leo and Vince Melamed and performed by country group Restless Heart along with saxophonist Warren Hill.  The single was the group's only number one on the adult contemporary chart and despite previous country chart success, the song did not make the country top 40.  \"Tell Me What You Dream\" spent two weeks at number one and peaked at number forty-three on the \"Billboard\" Hot 100.": 13.774335861206055, "Geraldine Jones (character) | Geraldine Jones was a fictional African American character, the most famous recurring persona of comedian Flip Wilson.  Geraldine was played as a sassy liberated Southern woman who was coarsely flirty yet faithful to her (unseen) boyfriend \"Killer\".  Poorly educated, she was nevertheless confident; she did not change her behavior to suit anyone.  Several of Geraldine's sayings entered U.S. popular culture as catchphrases, especially \"When you're hot, you're hot; when you're not, you're not,\" \"The Devil made me do it,\" and \"What you see is what you get! \"": 13.149161338806152, "When Did You Stop Loving Me, When Did I Stop Loving You | When Did You Stop Loving Me, When Did I Stop Loving You": 12.946025848388672, "Pajama Sam 3: You Are What You Eat from Your Head to Your Feet | Pajama Sam 3: You Are What You Eat from Your Head to Your Feet": 12.687528610229492, "Sam Quartin | Sam Quartin is a musician, producer and actress best known for role as June Glass, starring alongside Marilyn Manson in the American movie Let Me Make You a Martyr\"(2017), also for \"By the Rivers of Babylon)\" and \"Aimy in a Cage\".": 12.639335632324219, "Why You Wanna Treat Me So Bad? | \"Why You Wanna Treat Me So Bad?\"  is the US follow-up single to Prince's first big hit, \"I Wanna Be Your Lover\".  It is also Prince's first rock and roll-inspired single release.  
It did not make the top 40 of the Billboard Hot 100 charts, although it did reach #13 on the R&B Singles charts.  The lyrics are self-explanatory from the title, and deal with a cruel lover.  The song prominently features guitar and bass, with the keyboard featured in a less prominent role.  A highlight of the song is a soaring guitar solo at the end, played by Prince himself.": 12.574298858642578}

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## new_notes ## ]]` (must be formatted as a valid Python list[str]), then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Assistant message:

[[ ## reasoning ## ]]
The claim states that the film "Me Before You" is directed by Thea Sharrock and that the actor who stars as Will Traynor also appeared in "The Lost Future" and "The Hunger Games". According to the notes, Sam Claflin stars as Will Traynor in "Me Before You" and is known for his role as Finnick Odair in "The Hunger Games". Additionally, Sam Claflin starred in "The Lost Future". The context confirms that "Me Before You" is directed by Thea Sharrock and that Sam Claflin plays Will Traynor. Therefore, the claim is supported by the provided notes and context.

[[ ## new_notes ## ]]
["Sam Claflin stars as Will Traynor in 'Me Before You'.", "Sam Claflin is known for his role as Finnick Odair in 'The Hunger Games'.", "Thea Sharrock directed 'Me Before You'.", "Sam Claflin starred in 'The Lost Future'."]

[[ ## titles ## ]]
["Me Before You (film)", "Sam Claflin", "Thea Sharrock", "The Lost Future", "The Hunger Games"]

[[ ## completed ## ]]


User message:

[[ ## claim ## ]]
The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.

[[ ## notes ## ]]
N/A

[[ ## context ## ]]
{"Up Against It | Up Against It is an unproduced script by Joe Orton, written in 1967 for The Beatles at the height of their fame.": 23.741416931152344, "The Alien (unproduced film) | The Alien was an unproduced Indian-American science fiction film in development in the late 1960s which was eventually cancelled.  It was to be directed by celebrated Indian filmmaker Satyajit Ray and co-produced by Columbia Pictures.  The script was written by Ray in 1967, loosely based on Bankubabur Bandhu (\"Banku Babu's Friend\" or \"Mr. Banku's Friend\"), a Bengali science fiction story he had written in 1962 for \"Sandesh\", the Ray family magazine, which gained popularity among Bengalis in the early 1960s.  \"Bankubabur Bandhu\" was eventually adapted into a television film by Satyajit Ray's son Sandip Ray, and a play by the theatre group Swapnasandhani Kaushik Sen, in 2006.": 14.683004379272461, "Marsha Albert | Marsha Albert (born 1948) is credited with being the person who jump-started the early 1960s phenomena known as Beatlemania in the United States when as a 15-year-old girl, on 17 December 1963, she introduced for the first time on American radio a song written and recorded by The Beatles titled I Want to Hold Your Hand (the Beatles' best-selling single worldwide), and that Beatles historian and author Bruce Spizer noted, in 2004, by his stating \"Marsha Albert's actions forced a major record company to push up the release date of a debut single from an unknown band during the holiday season, a time when record companies traditionally released no new product.\"": 13.236483573913574, "Up Against It! (Todd Rundgren album) | Up Against It!  is a 1997 album by Todd Rundgren, essentially consisting of Rundgren's song demos for the Off Broadway show that were Written and Recorded by Todd from 1986-88.  The project was inspired by the never-produced Up Against It which was a play originally written by Joe Orton for The Beatles.": 12.989564895629883, "Blood and Fire (Star Trek: The Next Generation) | \"Blood and Fire\" is an episode written by David Gerrold for possible use on \"\".  The script was commissioned and written, but never actually filmed.  According to Gerrold, some of the production staff, including Rick Berman, had a negative reaction to its positive depiction of an openly gay couple.  Herbert Wright rewrote the script under the name \"Blood and Ice\", which also was left unproduced.": 11.980508804321289, "Cedar Rapids (film) | Cedar Rapids is a 2011 American comedy film directed by Miguel Arteta.  The script, written by Phil Johnston, was included on the 2009 Black List, a Hollywood list of the most popular unproduced screenplays of the year.": 11.593443870544434, "The Beatles: The Biography | The Beatles: The Biography is the name of a 2005 biography of the 1960s rock band The Beatles written by Bob Spitz.  It was first published by Little, Brown and Company on November 1, 2005.": 11.583497047424316, "The Illusionist (2010 film) | The Illusionist (French: L'Illusionniste ) is a 2010 French-British animated film directed by Sylvain Chomet.  The film is based on an unproduced script written by French mime, director and actor Jacques Tati in 1956.  
Controversy surrounds Tati's motivation for the script, which was written as a personal letter to his estranged eldest daughter, Helga Marie-Jeanne Schiel in collaboration with his long-term writing partner Henri Marquet, between writing for the films \"Mon Oncle\" and \"Play Time\".": 11.528533935546875, "Bobby Florsheim | Robert Florsheim (born December 19, 1969) is an American screenwriter, best known for co-writing \"The Passion Of The Ark\" with Josh Stolberg, (the basis for the film \"Evan Almighty\").  Their original script was reported by Daily Variety as the highest priced spec script ever sold by unproduced screenwriters (April, 2004).  In refashioning the script into \"Evan Almighty\", Universal Studios discarded the script and then hired screenwriter Steve Oedekerk, who received sole writing credit on the finished film.": 11.23757553100586, "The Hook (screenplay) | The Hook is an unproduced screenplay by American playwright, Arthur Miller.  It was written in 1947 and was intended to be produced by Columbia Pictures Studio, Hollywood, and to be directed by Elia Kazan.  The screenplay was inspired by the true story of Pete Panto, a young dockworker who stood up against the corrupt Mafia-connected union leadership.  Panto was discovered dead in a pit outside New York eighteen months after his disappearance.  Set in the Red Hook district of Brooklyn, \"The Hook\" is the story of Marty Ferrara, a longshoreman who is ‘ready to lay down his life, if need be, to secure one thing – his sense of personal dignity.’": 11.222980499267578}

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## new_notes ## ]]` (must be formatted as a valid Python list[str]), then `[[ ## titles ## ]]` (must be formatted as a valid Python list[str]), and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The claim states that the author of the 1960s unproduced script written for The Beatles, "Up Against It", and Bernard-Marie Koltès are both playwrights. The context confirms that Joe Orton wrote the script "Up Against It" for The Beatles, and Bernard-Marie Koltès was a French playwright. Therefore, the claim is supported by the provided context.

[[ ## new_notes ## ]]
[]

[[ ## titles ## ]]
["Up Against It", "Bernard-Marie Koltès", "Joe Orton", "The Beatles"]

[[ ## completed ## ]]

Finally, let's save our optimized program so we can use it again later.

optimized.save("optimized_hop.json")

loaded_program = Hop()
loaded_program.load("optimized_hop.json")

loaded_program(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles
['Up Against It', 'Bernard-Marie Koltès', 'The Beatles', 'Joe Orton']
Saving programs in an MLflow Experiment

Instead of saving the program to a local file, you can track it in MLflow for better reproducibility and collaboration.

  1. Dependency Management: MLflow automatically saves the frozen environment metadata along with the program to ensure reproducibility.
  2. Experiment Tracking: With MLflow, you can track the program's performance and cost along with the program itself.
  3. Collaboration: You can share the program and results with your team members by sharing the MLflow experiment.

To save the program in MLflow, run the following code:

import mlflow

# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized"):
    model_info = mlflow.dspy.log_model(
        optimized,
        artifact_path="model", # Any name to save the program in MLflow
    )

# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)

To learn more about the integration, visit MLflow DSPy Documentation as well.


Tutorial: Retrieval-Augmented Generation (RAG)

Let's walk through a quick example of basic question answering with and without retrieval-augmented generation (RAG) in DSPy. Specifically, let's build a system for answering Tech questions, e.g. about Linux or iPhone apps.

Install the latest DSPy via pip install -U dspy and follow along. If you're looking instead for a conceptual overview of DSPy, this recent lecture is a good place to start.

Configuring the DSPy environment.

Let's tell DSPy that we will use OpenAI's gpt-4o-mini in our modules. To authenticate, DSPy will look for your OPENAI_API_KEY environment variable. You can easily swap this out for other providers or local models.

Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to better understand DSPy's behavior. You can set up MLflow easily by following the four steps below.

MLflow Trace

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start the MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

Once you have completed the steps above, you can see traces for each program execution in the notebook. They provide great visibility into the model's behavior and help you understand DSPy's concepts better throughout the tutorial.

To learn more about the integration, visit MLflow DSPy Documentation as well.

import dspy

lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)
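
Swapping providers or using a local model only requires changing the configuration above. The model strings below are illustrative examples (DSPy's LM layer accepts LiteLLM-style model strings), not requirements of this tutorial:

# e.g., another hosted provider:
# lm = dspy.LM('anthropic/claude-3-5-sonnet-20241022')
# or a local model served by Ollama (assumes an Ollama server running locally):
# lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434', api_key='')
# dspy.configure(lm=lm)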

Exploring some basic DSPy Modules.

You can always prompt the LM directly via lm(prompt="prompt") or lm(messages=[...]). However, DSPy gives you Modules as a better way to define your LM functions.
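
For instance (a quick sketch; a direct call returns the raw completion(s) rather than a structured Prediction):

lm(prompt="Briefly, what is a kernel?")
lm(messages=[{"role": "user", "content": "Briefly, what is a kernel?"}])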

The simplest module is dspy.Predict. It takes a DSPy Signature, i.e. a structured input/output schema, and gives you back a callable function for the behavior you specified. Let's use the "in-line" notation for signatures to declare a module that takes a question (of type str) as input and produces a response as an output.

qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")

print(response.response)
In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of 32-bit architectures.

- **Low Memory**: This typically refers to the memory that is directly accessible by the kernel. In a 32-bit system, this is usually the first 896 MB of RAM (from 0 to 896 MB). The kernel can directly map this memory, making it faster for the kernel to access and manage. Low memory is used for kernel data structures and for user processes that require direct access to memory.

- **High Memory**: This refers to the memory above the low memory limit, which is not directly accessible by the kernel in a 32-bit system. This area is typically above 896 MB. The kernel cannot directly access this memory without using special mechanisms, such as mapping it into the kernel's address space when needed. High memory is used for user processes that require more memory than what is available in low memory.

In summary, low memory is directly accessible by the kernel, while high memory requires additional steps for the kernel to access it, especially in 32-bit systems. In 64-bit systems, this distinction is less significant as the kernel can address a much larger memory space directly.

Notice how the variable names we specified in the signature defined our input and output argument names and their role.

Now, what did DSPy do to build this qa module? Nothing fancy in this example, yet. The module passed your signature, LM, and inputs to an Adapter, which is a layer that handles structuring the inputs and parsing structured outputs to fit your signature.

Let's see it directly. You can easily inspect the last n prompts sent by DSPy. Alternatively, if you enabled MLflow Tracing above, you can see the full LLM interactions for each program execution in a tree view.

dspy.inspect_history(n=1)
[2024-11-23T23:16:35.966534]

System message:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `question`, produce the fields `response`.


User message:

[[ ## question ## ]]
what are high memory and low memory on linux?

Respond with the corresponding output fields, starting with the field `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## response ## ]]
In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of 32-bit architectures.

- **Low Memory**: This typically refers to the memory that is directly accessible by the kernel. In a 32-bit system, this is usually the first 896 MB of RAM (from 0 to 896 MB). The kernel can directly map this memory, making it faster for the kernel to access and manage. Low memory is used for kernel data structures and for user processes that require direct access to memory.

- **High Memory**: This refers to the memory above the low memory limit, which is not directly accessible by the kernel in a 32-bit system. This area is typically above 896 MB. The kernel cannot directly access this memory without using special mechanisms, such as mapping it into the kernel's address space when needed. High memory is used for user processes that require more memory than what is available in low memory.

In summary, low memory is directly accessible by the kernel, while high memory requires additional steps for the kernel to access it, especially in 32-bit systems. In 64-bit systems, this distinction is less significant as the kernel can address a much larger memory space directly.

[[ ## completed ## ]]

DSPy has various built-in modules, e.g. dspy.ChainOfThought, dspy.ProgramOfThought, and dspy.ReAct. These are interchangeable with basic dspy.Predict: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.

For example, dspy.ChainOfThought is an easy way to elicit reasoning out of your LM before it commits to the outputs requested in your signature.

In the example below, we'll omit str types (as the default type is string). You should feel free to experiment with other fields and types, e.g. try topics: list[str] or is_realistic: bool.

cot = dspy.ChainOfThought('question -> response')
cot(question="should curly braces appear on their own line?")
Prediction(
    reasoning='The placement of curly braces on their own line depends on the coding style and conventions being followed. In some programming languages and style guides, such as the Allman style, curly braces are placed on their own line to enhance readability. In contrast, other styles, like K&R style, place the opening brace on the same line as the control statement. Ultimately, it is a matter of personal or team preference, and consistency within a project is key.',
    response='Curly braces can appear on their own line depending on the coding style you are following. If you prefer a style that enhances readability, such as the Allman style, then yes, they should be on their own line. However, if you are following a different style, like K&R, they may not need to be. Consistency is important, so choose a style and stick with it.'
)

Interestingly, asking for reasoning can make the output response shorter in this case. Is this a good thing or a bad thing? It depends on what you need: there's no free lunch, but DSPy gives you the tools to experiment with different strategies extremely quickly.
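
Picking up the earlier suggestion about typed fields, here is a minimal sketch (the topics field name is just the example mentioned above):

tagger = dspy.ChainOfThought('question -> topics: list[str]')
tagger(question="what are high memory and low memory on linux?").topics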

By the way, dspy.ChainOfThought is implemented in DSPy, using dspy.Predict. This is a good place to dspy.inspect_history if you're curious.
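
As a rough sketch of that idea (this is not DSPy's actual implementation, and it only handles simple in-line string signatures), a ChainOfThought-like module can be built by extending the signature with a reasoning output field and delegating to dspy.Predict:

class SimpleCoT(dspy.Module):
    """A rough sketch only, not DSPy's actual implementation."""

    def __init__(self, signature: str):
        # Ask for a `reasoning` field before the other outputs, then delegate to dspy.Predict.
        inputs, outputs = signature.split('->')
        self.predict = dspy.Predict(f"{inputs.strip()} -> reasoning: str, {outputs.strip()}")

    def forward(self, **kwargs):
        return self.predict(**kwargs)

SimpleCoT('question -> response')(question="should curly braces appear on their own line?")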

Using DSPy well involves evaluation and iterative development.

You already know a lot about DSPy at this point. If all you want is quick scripting, this much of DSPy already enables a lot. Sprinkling DSPy signatures and modules into your Python control flow is a pretty ergonomic way to just get stuff done with LMs.

That said, you're likely here because you want to build a high-quality system and improve it over time. The way to do that in DSPy is to iterate fast by evaluating the quality of your system and using DSPy's powerful tools, e.g. Optimizers.

Manipulating Examples in DSPy.

To measure the quality of your DSPy system, you need (1) a bunch of input values, like questions for example, and (2) a metric that can score the quality of an output from your system. Metrics vary widely. Some metrics need ground-truth labels of ideal outputs, e.g. for classification or question answering. Other metrics are self-supervised, e.g. checking faithfulness or lack of hallucination, perhaps using a DSPy program as a judge of these qualities.
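
For instance, a hand-written metric is just a function over an example and a prediction. Here is a toy exact-match sketch, only for illustration, since our long answers will need the semantic metric introduced below:

def exact_match(example, pred, trace=None):
    # DSPy metrics can return bools or floats; this one checks verbatim agreement.
    return example.response.strip().lower() == pred.response.strip().lower()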

Let's load a dataset of questions and their (pretty long) gold answers. Since we started this notebook with the goal of building a system for answering Tech questions, we obtained a bunch of StackExchange-based questions and their correct answers from the RAG-QA Arena dataset.

import ujson
from dspy.utils import download

# Download question--answer pairs from the RAG-QA Arena "Tech" dataset.
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]
# Inspect one datapoint.
data[0]
{'question': 'why igp is used in mpls?',
 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.",
 'gold_doc_ids': [2822, 2823]}

Given a simple dict like this, let's create a list of dspy.Examples, which is the datatype that carries training (or test) datapoints in DSPy.

When you build a dspy.Example, you should generally specify .with_inputs("field1", "field2", ...) to indicate which fields are inputs. The other fields are treated as labels or metadata.

data = [dspy.Example(**d).with_inputs('question') for d in data]

# Let's pick an `example` here from the data.
example = data[2]
example
Example({'question': 'why are my text messages coming up as maybe?', 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.', 'gold_doc_ids': [3956, 3957, 8034]}) (input_keys={'question'})

Now, let's divide the data into:

  • Training (and with it Validation) set:

    • These are the splits you typically give to DSPy optimizers.
    • Optimizers typically learn directly from the training examples and check their progress using the validation examples.
    • It's good to have 30--300 examples for training and validation each.
    • For prompt optimizers in particular, it's often better to pass more validation than training.
    • Below, we'll use 200 in total. MIPROv2 will split them into 20% training and 80% validation if you don't pass a valset.
  • Development and Test sets: The rest, typically on the order of 30--1000, can be used for:

    • development (i.e., you can inspect them as you iterate on your system) and
    • testing (final held-out evaluation).
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)
(200, 300, 500)

Evaluation in DSPy.

What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response cover all key facts in the gold response? And conversely, how well does the system response avoid saying things that aren't in the gold response?

That metric is essentially a semantic F1, so let's load a SemanticF1 metric from DSPy. This metric is actually implemented as a very simple DSPy module using whatever LM we're working with.

from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Produce a prediction from our `cot` module, using the `example` above as input.
pred = cot(**example.inputs())

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")
Question: 	 why are my text messages coming up as maybe?

Gold Response: 	 This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

Predicted Response: 	 Your text messages are showing up as "maybe" because your messaging app is uncertain about the sender's identity. This typically occurs when the sender's number is not saved in your contacts or if the message is from an unknown number. To resolve this, you can save the contact in your address book or check the message settings in your app.

Semantic F1 Score: 0.33

The final DSPy module call above actually happens inside metric. You might be curious how it measured the semantic F1 for this example.

dspy.inspect_history(n=1)
[2024-11-23T23:16:36.149518]

System message:

Your input fields are:
1. `question` (str)
2. `ground_truth` (str)
3. `system_response` (str)

Your output fields are:
1. `reasoning` (str)
2. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth
3. `system_response_key_ideas` (str): enumeration of key ideas in the system response
4. `discussion` (str): discussion of the overlap between ground truth and system response
5. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response
6. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## ground_truth ## ]]
{ground_truth}

[[ ## system_response ## ]]
{system_response}

[[ ## reasoning ## ]]
{reasoning}

[[ ## ground_truth_key_ideas ## ]]
{ground_truth_key_ideas}

[[ ## system_response_key_ideas ## ]]
{system_response_key_ideas}

[[ ## discussion ## ]]
{discussion}

[[ ## recall ## ]]
{recall}        # note: the value you produce must be a single float value

[[ ## precision ## ]]
{precision}        # note: the value you produce must be a single float value

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Compare a system's response to the ground truth to compute recall and precision of key ideas.
        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.


User message:

[[ ## question ## ]]
why are my text messages coming up as maybe?

[[ ## ground_truth ## ]]
This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

[[ ## system_response ## ]]
Your text messages are showing up as "maybe" because your messaging app is uncertain about the sender's identity. This typically occurs when the sender's number is not saved in your contacts or if the message is from an unknown number. To resolve this, you can save the contact in your address book or check the message settings in your app.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## ground_truth_key_ideas ## ]]`, then `[[ ## system_response_key_ideas ## ]]`, then `[[ ## discussion ## ]]`, then `[[ ## recall ## ]]` (must be formatted as a valid Python float), then `[[ ## precision ## ]]` (must be formatted as a valid Python float), and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The system response provides a general explanation for why text messages may show up as "maybe," focusing on the uncertainty of the sender's identity. However, it does not mention the specific iOS features or the potential bug in iOS 11.2 that are highlighted in the ground truth. The ground truth explains that this feature is part of iOS 9's Proactivity features and mentions a specific bug in iOS 11.2, which is not addressed in the system response.

[[ ## ground_truth_key_ideas ## ]]
1. Proactivity features introduced in iOS 9.
2. The system checks emails to associate phone numbers with contacts.
3. "Maybe" is displayed if the number is not saved in contacts.
4. Mention of a bug in iOS 11.2 causing "Maybe" to appear incorrectly.

[[ ## system_response_key_ideas ## ]]
1. Text messages show up as "maybe" due to uncertainty about the sender's identity.
2. Occurs when the sender's number is not saved in contacts or is from an unknown number.
3. Suggests saving the contact or checking message settings.

[[ ## discussion ## ]]
There is some overlap between the ground truth and the system response regarding the uncertainty of the sender's identity and the suggestion to save the contact. However, the system response lacks specific details about the iOS features and the bug mentioned in the ground truth. The ground truth provides a more comprehensive explanation of the "maybe" feature, while the system response is more general and does not address the iOS version specifics.

[[ ## recall ## ]]
0.25

[[ ## precision ## ]]
0.5

[[ ## completed ## ]]

For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on dspy.Evaluate.
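
For reference, that simple loop would look something like this (a sketch; it runs serially, so it's much slower than dspy.Evaluate):

scores = [metric(ex, cot(**ex.inputs())) for ex in devset]
print(f"Average Semantic F1: {sum(scores) / len(scores):.3f}")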

# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)

# Evaluate the Chain-of-Thought program.
evaluate(cot)
Average Metric: 125.68 / 300 (41.9%): 100%|██████████| 300/300 [00:00<00:00, 666.96it/s]

2024/11/23 23:16:36 INFO dspy.evaluate.evaluate: Average Metric: 125.68228336477591 / 300 (41.9%)
question example_response gold_doc_ids reasoning pred_response SemanticF1
0 when to use c over c++, and c++ over c? If you are equally familiar with both C++ and C, it's advisable to... [733] C and C++ are both powerful programming languages, but they serve ... Use C when you need low-level access to memory, require high perfo...
1 should images be stored in a git repository? One viewpoint expresses that there is no significant downside, esp... [6253, 6254, 6275, 6278, 8215] Storing images in a Git repository can be beneficial for version c... Images can be stored in a Git repository, but it's important to co... ✔️ [0.444]
... 298 more rows not displayed ...
41.89
Tracking Evaluation Results in an MLflow Experiment

To track and visualize the evaluation results over time, you can record them in an MLflow Experiment.

import mlflow

with mlflow.start_run(run_name="rag_evaluation"):
    evaluate = dspy.Evaluate(
        devset=devset,
        metric=metric,
        num_threads=24,
        display_progress=True,
        # To record the outputs and detailed scores to MLflow
        return_all_scores=True,
        return_outputs=True,
    )

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate(cot)


    # Log the aggregated score
    mlflow.log_metric("semantic_f1_score", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Question": [example.question for example in devset],
            "Gold Response": [example.response for example in devset],
            "Predicted Response": outputs,
            "Semantic F1 Score": all_scores,
        },
        artifact_file="eval_results.json",
    )

To learn more about the integration, visit MLflow DSPy Documentation as well.

So far, we built a very simple chain-of-thought module for question answering and evaluated it on a small dataset.

Can we do better? In the rest of this guide, we will build a retrieval-augmented generation (RAG) program in DSPy for the same task. We'll see how this can boost the score substantially, then we'll use one of the DSPy Optimizers to compile our RAG program to higher-quality prompts, raising our scores even more.

Basic Retrieval-Augmented Generation (RAG).

First, let's download the corpus data that we will use for RAG search. An older version of this tutorial used the full (650,000 document) corpus. To make this very fast and cheap to run, we've downsampled the corpus to just 28,000 documents.

download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")

Set up your system's retriever.

As far as DSPy is concerned, you can plug in any Python code for calling tools or retrievers. Here, we'll just use OpenAI Embeddings and do top-K search locally, just for convenience.

Note: The step below will require that you either do pip install -U faiss-cpu or pass brute_force_threshold=30_000 to dspy.retrievers.Embeddings to avoid faiss.

# %pip install -U faiss-cpu  # or faiss-gpu if you have a GPU
max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)['text'][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)
Loaded 28436 documents. Will encode them below.
Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings
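
If you took the no-faiss route from the note above, the only change is passing the extra argument when building the retriever (a sketch):

# search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus,
#                                     k=topk_docs_to_retrieve, brute_force_threshold=30_000)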

Build your first RAG Module.

In the previous guide, we looked at individual DSPy modules in isolation, e.g. dspy.Predict("question -> answer").

What if we want to build a DSPy program that has multiple steps? The syntax below with dspy.Module allows you to connect a few pieces together, in this case, our retriever and a generation module, so the whole system can be optimized.

Concretely, in the __init__ method, you declare any sub-module you'll need, which in this case is just a dspy.ChainOfThought('context, question -> response') module that takes retrieved context, a question, and produces a response. In the forward method, you simply express any Python control flow you like, possibly using your modules. In this case, we first invoke the search function defined earlier and then invoke the self.respond ChainOfThought module.

class RAG(dspy.Module):
    def __init__(self):
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        context = search(question).passages
        return self.respond(context=context, question=question)

Let's use the RAG module.

rag = RAG()
rag(question="what are high memory and low memory on linux?")
Prediction(
    reasoning="High Memory and Low Memory in Linux refer to two segments of the kernel's memory space. Low Memory is the portion of memory that the kernel can access directly and is statically mapped at boot time. This area is typically used for kernel data structures and is always accessible to the kernel. High Memory, on the other hand, is not permanently mapped in the kernel's address space, meaning that the kernel cannot access it directly without first mapping it into its address space. High Memory is used for user-space applications and temporary data buffers. The distinction allows for better memory management and security, as user-space applications cannot directly access kernel-space memory.",
    response="In Linux, High Memory refers to the segment of memory that is not permanently mapped in the kernel's address space, which means the kernel must map it temporarily to access it. This area is typically used for user-space applications and temporary data buffers. Low Memory, in contrast, is the portion of memory that the kernel can access directly and is statically mapped at boot time. It is used for kernel data structures and is always accessible to the kernel. This separation enhances security by preventing user-space applications from accessing kernel-space memory directly."
)
dspy.inspect_history()
[2024-11-23T23:16:49.175612]

System message:

Your input fields are:
1. `context` (str)
2. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.»
[2] «HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place where you put what you want to access. A typical 32bit Linux virtual memory map is like: 0x00000000-0xbfffffff: user process (3GB) 0xc0000000-0xffffffff: kernel space (1GB) (CPU-specific vector and whatsoever are ignored here). Linux splits the 1GB kernel space into 2 pieces, LOWMEM and HIGHMEM. The split varies from installation to installation. If an installation chooses, say, 512MB-512MB for LOW and HIGH mems, the 512MB LOWMEM (0xc0000000-0xdfffffff) is statically mapped at the kernel boot time; usually the first so many bytes of the physical memory is used for this so that virtual and physical addresses in this range have a constant offset of, say, 0xc0000000. On the other hand, the latter 512MB (HIGHMEM) has no static mapping (although you could leave pages semi-permanently mapped there, but you must do so explicitly in your driver code). Instead, pages are temporarily mapped and unmapped here so that virtual and physical addresses in this range have no consistent mapping. Typical uses of HIGHMEM include single-time data buffers.»
[3] «This is relevant to the Linux kernel; Im not sure how any Unix kernel handles this. The High Memory is the segment of memory that user-space programs can address. It cannot touch Low Memory. Low Memory is the segment of memory that the Linux kernel can address directly. If the kernel must access High Memory, it has to map it into its own address space first. There was a patch introduced recently that lets you control where the segment is. The tradeoff is that you can take addressable memory away from user space so that the kernel can have more memory that it does not have to map before using. Additional resources: http://tldp.org/HOWTO/KernelAnalysis-HOWTO-7.html http://linux-mm.org/HighMemory»
[4] «The first reference to turn to is Linux Device Drivers (available both online and in book form), particularly chapter 15 which has a section on the topic. In an ideal world, every system component would be able to map all the memory it ever needs to access. And this is the case for processes on Linux and most operating systems: a 32-bit process can only access a little less than 2^32 bytes of virtual memory (in fact about 3GB on a typical Linux 32-bit architecture). It gets difficult for the kernel, which needs to be able to map the full memory of the process whose system call its executing, plus the whole physical memory, plus any other memory-mapped hardware device. So when a 32-bit kernel needs to map more than 4GB of memory, it must be compiled with high memory support. High memory is memory which is not permanently mapped in the kernels address space. (Low memory is the opposite: it is always mapped, so you can access it in the kernel simply by dereferencing a pointer.) When you access high memory from kernel code, you need to call kmap first, to obtain a pointer from a page data structure (struct page). Calling kmap works whether the page is in high or low memory. There is also kmap_atomic which has added constraints but is more efficient on multiprocessor machines because it uses finer-grained locking. The pointer obtained through kmap is a resource: it uses up address space. Once youve finished with it, you must call kunmap (or kunmap_atomic) to free that resource; then the pointer is no longer valid, and the contents of the page cant be accessed until you call kmap again.»
[5] «/proc/meminfo will tell you how free works, but /proc/kcore can tell you what the kernel uses. From the same page: /proc/kcore This file represents the physical memory of the system and is stored in the ELF core file format. With this pseudo-file, and an unstripped kernel (/usr/src/linux/vmlinux) binary, GDB can be used to examine the current state of any kernel data structures. The total length of the file is the size of physical memory (RAM) plus 4KB. /proc/meminfo This file reports statistics about memory usage on the system. It is used by free(1) to report the amount of free and used memory (both physical and swap) on the system as well as the shared memory and buffers used by the kernel. Each line of the file consists of a parameter name, followed by a colon, the value of the parameter, and an option unit of measurement (e.g., kB). The list below describes the parameter names and the format specifier required to read the field value. Except as noted below, all of the fields have been present since at least Linux 2.6.0. Some fileds are displayed only if the kernel was configured with various options; those dependencies are noted in the list. MemTotal %lu Total usable RAM (i.e., physical RAM minus a few reserved bits and the kernel binary code). MemFree %lu The sum of LowFree+HighFree. Buffers %lu Relatively temporary storage for raw disk blocks that shouldnt get tremendously large (20MB or so). Cached %lu In-memory cache for files read from the disk (the page cache). Doesnt include SwapCached. SwapCached %lu Memory that once was swapped out, is swapped back in but still also is in the swap file. (If memory pressure is high, these pages dont need to be swapped out again because they are already in the swap file. This saves I/O.) Active %lu Memory that has been used more recently and usually not reclaimed unless absolutely necessary. Inactive %lu Memory which has been less recently used. It is more eligible to be reclaimed for other purposes. Active(anon) %lu (since Linux 2.6.28) [To be documented.] Inactive(anon) %lu (since Linux 2.6.28) [To be documented.] Active(file) %lu (since Linux 2.6.28) [To be documented.] Inactive(file) %lu (since Linux 2.6.28) [To be documented.] Unevictable %lu (since Linux 2.6.28) (From Linux 2.6.28 to 2.6.30, CONFIG_UNEVICTABLE_LRU was required.) [To be documented.] Mlocked %lu (since Linux 2.6.28) (From Linux 2.6.28 to 2.6.30, CONFIG_UNEVICTABLE_LRU was required.) [To be documented.] HighTotal %lu (Starting with Linux 2.6.19, CONFIG_HIGHMEM is required.) Total amount of highmem. Highmem is all memory above ~860MB of physical memory. Highmem areas are for use by user-space programs, or for the page cache. The kernel must use tricks to access this memory, making it slower to access than lowmem. HighFree %lu (Starting with Linux 2.6.19, CONFIG_HIGHMEM is required.) Amount of free highmem. LowTotal %lu (Starting with Linux 2.6.19, CONFIG_HIGHMEM is required.) Total amount of lowmem. Lowmem is memory which can be used for everything that highmem can be used for, but it is also available for the kernels use for its own data structures. Among many other things, it is where everything from Slab is allocated. Bad things happen when youre out of lowmem. LowFree %lu (Starting with Linux 2.6.19, CONFIG_HIGHMEM is required.) Amount of free lowmem. MmapCopy %lu (since Linux 2.6.29) (CONFIG_MMU is required.) [To be documented.] SwapTotal %lu Total amount of swap space available. SwapFree %lu Amount of swap space that is currently unused. 
Dirty %lu Memory which is waiting to get written back to the disk. Writeback %lu Memory which is actively being written back to the disk. AnonPages %lu (since Linux 2.6.18) Non-file backed pages mapped into user-space page tables. Mapped %lu Files which have been mmaped, such as libraries. Shmem %lu (since Linux 2.6.32) [To be documented.] Slab %lu In-kernel data structures cache. SReclaimable %lu (since Linux 2.6.19) Part of Slab, that might be reclaimed, such as caches. SUnreclaim %lu (since Linux 2.6.19) Part of Slab, that cannot be reclaimed on memory pressure. KernelStack %lu (since Linux 2.6.32) Amount of memory allocated to kernel stacks. PageTables %lu (since Linux 2.6.18) Amount of memory dedicated to the lowest level of page tables. Quicklists %lu (since Linux 2.6.27) (CONFIG_QUICKLIST is required.) [To be documented.] NFS_Unstable %lu (since Linux 2.6.18) NFS pages sent to the server, but not yet committed to stable storage. Bounce %lu (since Linux 2.6.18) Memory used for block device bounce buffers. WritebackTmp %lu (since Linux 2.6.26) Memory used by FUSE for temporary writeback buffers. CommitLimit %lu (since Linux 2.6.10) Based on the overcommit ratio (vm.overcommit_ratio), this is the total amount of memory currently available to be allocated on the system. This limit is adhered to only if strict overcommit accounting is enabled (mode 2 in /proc/sys/vm/overcommit_ratio). The CommitLimit is calculated using the following formula: CommitLimit = ([total RAM pages] - [total huge TLB pages]) * overcommit_ratio / 100 + [total swap pages] For example, on a system with 1GB of physical RAM and 7GB of swap with a overcommit_ratio of 30, this formula yields a CommitLimit of 7.3GB. For more details, see the memory overcommit documentation in the kernel source file Documentation/vm/overcommit-accounting. Committed_AS %lu The amount of memory presently allocated on the system. The committed memory is a sum of all of the memory which has been allocated by processes, even if it has not been used by them as of yet. A process which allocates 1GB of memory (using malloc(3) or similar), but touches only 300MB of that memory will show up as using only 300MB of memory even if it has the address space allocated for the entire 1GB. This 1GB is memory which has been committed to by the VM and can be used at any time by the allocating application. With strict overcommit enabled on the system (mode 2 /proc/sys/vm/overcommit_memory), allocations w»

[[ ## question ## ]]
what are high memory and low memory on linux?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
High Memory and Low Memory in Linux refer to two segments of the kernel's memory space. Low Memory is the portion of memory that the kernel can access directly and is statically mapped at boot time. This area is typically used for kernel data structures and is always accessible to the kernel. High Memory, on the other hand, is not permanently mapped in the kernel's address space, meaning that the kernel cannot access it directly without first mapping it into its address space. High Memory is used for user-space applications and temporary data buffers. The distinction allows for better memory management and security, as user-space applications cannot directly access kernel-space memory.

[[ ## response ## ]]
In Linux, High Memory refers to the segment of memory that is not permanently mapped in the kernel's address space, which means the kernel must map it temporarily to access it. This area is typically used for user-space applications and temporary data buffers. Low Memory, in contrast, is the portion of memory that the kernel can access directly and is statically mapped at boot time. It is used for kernel data structures and is always accessible to the kernel. This separation enhances security by preventing user-space applications from accessing kernel-space memory directly.

[[ ## completed ## ]]

Earlier with a CoT module, we got around 40% in terms of semantic F1 on our devset. Would this RAG module score better?

evaluate(RAG())
Average Metric: 166.54 / 300 (55.5%): 100%|██████████| 300/300 [00:04<00:00, 61.40it/s] 

2024/11/23 23:16:54 INFO dspy.evaluate.evaluate: Average Metric: 166.53601368289284 / 300 (55.5%)
question example_response gold_doc_ids reasoning pred_response SemanticF1
0 when to use c over c++, and c++ over c? If you are equally familiar with both C++ and C, it's advisable to... [733] C should be used over C++ primarily in scenarios where simplicity ... Use C over C++ when working on embedded systems, requiring low-lev... ✔️ [0.500]
1 should images be stored in a git repository? One viewpoint expresses that there is no significant downside, esp... [6253, 6254, 6275, 6278, 8215] Storing images in a Git repository is generally not recommended du... While it is technically possible to store images in a Git reposito... ✔️ [0.444]
... 298 more rows not displayed ...
55.51

Using a DSPy Optimizer to improve your RAG prompt.

Off the shelf, our RAG module scores 55%. What are our options to make it stronger? One of the various choices DSPy offers is optimizing the prompts in our pipeline.

If there are many sub-modules in your program, all of them will be optimized together. In this case, there's only one: self.respond = dspy.ChainOfThought('context, question -> response')

Let's set up and use DSPy's MIPRO (v2) optimizer. The run below costs around $1.50 (for the medium auto setting) and may take about 20-30 minutes, depending on your number of threads.

tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=24)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(RAG(), trainset=trainset,
                           max_bootstrapped_demos=2, max_labeled_demos=2,
                           requires_permission_to_run=False)

The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button: it can, for instance, overfit your training set and fail to generalize to a held-out set, which makes it essential that we iteratively validate our programs.

Let's check on an example here, asking the same question to the baseline rag = RAG() program, which was not optimized, and to the optimized_rag = MIPROv2(..)(..) program, after prompt optimization.

baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)
You are correct that cmd+tab does not work on hidden or minimized windows. To switch back to a minimized app, you must first switch to another application and let it take focus before returning to the minimized one.
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)
The Command + Tab shortcut on macOS is designed to switch between currently open applications, but it does not directly restore minimized or hidden windows. When you use Command + Tab, it cycles through the applications that are actively running, and minimized windows do not count as active. To manage minimized windows, you can use other shortcuts or methods. For example, you can use Command + Option + H + M to hide all other applications and minimize the most recently used one. Alternatively, you can navigate to the application you want to restore using Command + Tab and then manually click on the minimized window in the Dock to bring it back to focus.

You can use dspy.inspect_history(n=2) to view the RAG prompt before optimization and after optimization.

Concretely, in one of the runs of this notebook, the optimized prompt does the following (note that it may be different on a later rerun).

  1. Constructs the following instruction,
Using the provided `context` and `question`, analyze the information step by step to generate a comprehensive and informative `response`. Ensure that the response clearly explains the concepts involved, highlights key distinctions, and addresses any complexities noted in the context.
  2. And includes two fully worked-out RAG examples with synthetic reasoning and answers, e.g. "how to transfer whatsapp voice message to computer?".

Let's now evaluate on the overall devset.

evaluate(optimized_rag)
Average Metric: 183.32 / 300 (61.1%): 100%|██████████| 300/300 [00:02<00:00, 104.48it/s]

2024/11/23 23:17:21 INFO dspy.evaluate.evaluate: Average Metric: 183.3194433591069 / 300 (61.1%)
question example_response gold_doc_ids reasoning pred_response SemanticF1
0 when to use c over c++, and c++ over c? If you are equally familiar with both C++ and C, it's advisable to... [733] The context provides insights into the strengths and weaknesses of... You should consider using C over C++ in scenarios where simplicity... ✔️ [0.333]
1 should images be stored in a git repository? One viewpoint expresses that there is no significant downside, esp... [6253, 6254, 6275, 6278, 8215] The context discusses the challenges and considerations of storing... Storing images in a Git repository is generally considered bad pra... ✔️ [0.500]
... 298 more rows not displayed ...
61.11

Keeping an eye on cost.

DSPy allows you to track the cost of your programs, so you can monitor what your LM calls are spending. Here's how to compute the total cost of all the calls made so far.

cost = sum([x['cost'] for x in lm.history if x['cost'] is not None])  # in USD, as calculated by LiteLLM for certain providers
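
For example, you can print the running total alongside the number of LM calls recorded so far:

print(f"LM calls so far: {len(lm.history)}; total cost: ${cost:.4f}")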

Saving and loading.

The optimized program has a pretty simple structure on the inside. Feel free to explore it.

Here, we'll save optimized_rag so we can load it again later without having to optimize from scratch.

optimized_rag.save("optimized_rag.json")

loaded_rag = RAG()
loaded_rag.load("optimized_rag.json")

loaded_rag(question="cmd+tab does not work on hidden or minimized windows")
Prediction(
    reasoning='The context explains how the Command + Tab shortcut functions on macOS, particularly in relation to switching between applications. It notes that this shortcut does not bring back minimized or hidden windows directly. Instead, it cycles through applications that are currently open and visible. The information also suggests alternative methods for managing minimized windows and provides insights into how to navigate between applications effectively.',
    response='The Command + Tab shortcut on macOS is designed to switch between currently open applications, but it does not directly restore minimized or hidden windows. When you use Command + Tab, it cycles through the applications that are actively running, and minimized windows do not count as active. To manage minimized windows, you can use other shortcuts or methods. For example, you can use Command + Option + H + M to hide all other applications and minimize the most recently used one. Alternatively, you can navigate to the application you want to restore using Command + Tab and then manually click on the minimized window in the Dock to bring it back to focus.'
)
Saving programs in an MLflow Experiment

Instead of saving the program to a local file, you can track it in MLflow for better reproducibility and collaboration.

  1. Dependency Management: MLflow automatically saves the frozen environment metadata along with the program to ensure reproducibility.
  2. Experiment Tracking: With MLflow, you can track the program's performance and cost along with the program itself.
  3. Collaboration: You can share the program and results with your team members by sharing the MLflow experiment.

To save the program in MLflow, run the following code:

import mlflow

# Start an MLflow Run and save the program
with mlflow.start_run(run_name="optimized_rag"):
    model_info = mlflow.dspy.log_model(
        optimized_rag,
        artifact_path="model", # Any name to save the program in MLflow
    )

# Load the program back from MLflow
loaded = mlflow.dspy.load_model(model_info.model_uri)

To learn more about the integration, visit MLflow DSPy Documentation as well.

What's next?

Improving from around 42% to approximately 61% on this task, in terms of SemanticF1, was pretty easy.

But DSPy gives you paths to continue iterating on the quality of your system and we have barely scratched the surface.

In general, you have the following tools:

  1. Explore better system architectures for your program, e.g. what if we ask the LM to generate search queries for the retriever? See, e.g., the STORM pipeline built in DSPy.
  2. Explore different prompt optimizers or weight optimizers. See the Optimizers Docs.
  3. Scale inference time compute using DSPy Optimizers, e.g. via ensembling multiple post-optimization programs.
  4. Cut cost by distilling to a smaller LM, via prompt or weight optimization.

How do you decide which ones to proceed with first?

The first step is to look at your system outputs, which will allow you to identify the sources of lower performance if any. While doing all of this, make sure you continue to refine your metric, e.g. by optimizing against your judgments, and to collect more (or more realistic) data, e.g. from related domains or from putting a demo of your system in front of users.


Tutorial: Online RL for Multi-Hop Research

WARNING: This feature is new and extremely EXPERIMENTAL. Unlike almost everything else in DSPy, it's currently in pure proof of concept and development mode, but we release it to encourage community involvement.

If you want to be on the cutting edge even before it's merged, install the dspy.GRPO PR via pip install git+https://github.com/stanfordnlp/dspy.git@refs/pull/8171/head and follow along.

For this tutorial, you will also need DSPy's Arbor RL server.

> pip install arbor-ai
> python -m arbor.cli serve --arbor-config arbor.yaml

where you create arbor.yaml in your directory, containing a plan like:

inference:
  gpu_ids: '0'

training:
  gpu_ids: '1, 2'

which assigns GPU 0 for inference and GPUs 1 and 2 for training.

import dspy
from dspy.clients.lm_local_arbor import ArborProvider

port = 7453
local_lm_name = "Qwen/Qwen2.5-7B-Instruct"
local_lm = dspy.LM(
    model=f"openai/arbor:{local_lm_name}",
    provider=ArborProvider(),
    temperature=0.7,
    api_base=f"http://localhost:{port}/v1/",
    api_key="arbor",
)

dspy.configure(lm=local_lm)

openai_lm = dspy.LM(model="openai/gpt-4.1-mini")

Install dependencies and download data

To do the retrieval, we'll use the cool BM25S library, as it's pretty lightweight. You can replace this component with whatever you like.

> pip install -U bm25s PyStemmer "jax[cpu]"

Next, we'll download a snapshot of the abstracts (i.e., first paragraphs) of all 5,000,000 Wikipedia pages as of 2017. We'll use this as our retrieval corpus.

This is 500MB compressed, so the download and decompression may take 2-3 minutes.

from dspy.utils import download

download("https://huggingface.co/dspy/cache/resolve/main/wiki.abstracts.2017.tar.gz")
!tar -xzvf wiki.abstracts.2017.tar.gz

And then let's index it for BM25 retrieval! This will take 2-3 minutes.

import ujson
import bm25s
import Stemmer

corpus = []

with open("wiki.abstracts.2017.jsonl") as f:
    for line in f:
        line = ujson.loads(line)
        corpus.append(f"{line['title']} | {' '.join(line['text'])}")

stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

retriever = bm25s.BM25(k1=0.9, b=0.4)
retriever.index(corpus_tokens)

Load the HoVer dataset.

Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim.

import random
from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="hover-nlp/hover", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
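# Keep only 3-hop examples and deduplicate by hpqa_id; set.add() returns None, so the
# `not hpqa_ids.add(...)` clause records each id while letting its claim through exactly once.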
hover = [
    dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
    for x in hover
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids and not hpqa_ids.add(x["hpqa_id"])
]

random.Random(0).shuffle(hover)
trainset, devset, testset = hover[:600], hover[600:900], hover[900:]
len(trainset), len(devset), len(testset)

Now, let's define a function to do the search in Wikipedia. This will use our BM25 index.

def search(query: str, k: int) -> list[str]:
    tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
    results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
    run = {corpus[doc]: float(score) for doc, score in zip(results[0], scores[0])}
    return list(run.keys())
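
To sanity-check the index, you can call the function directly (an illustrative query; your top results may differ):

search("unproduced script written for The Beatles", k=3)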

A DSPy program for multi-hop research

Now, let's define the multi-hop program in DSPy. It's going to be super simple, composed of generate_query and append_notes modules. We'll define the instructions carefully, though they are typically not necessary.

instr1 = """
Given a claim and some key facts, generate a follow-up search query to find the next most essential clue towards verifying or refuting the claim. The goal ultimately is to find all documents implicated by the claim.
""".strip()

instr2 = """
Given a claim, some key facts, and new search results, identify any new learnings from the new search results, which will extend the key facts known so far about whether the claim is true or false. The goal is to ultimately collect all facts that would help us find all documents implicated by the claim.
"""


class ResearchHop(dspy.Module):
    def __init__(self, num_docs, num_hops):
        self.num_docs, self.num_hops = num_docs, num_hops
        self.generate_query = dspy.ChainOfThought(dspy.Signature("claim, key_facts -> followup_search_query", instr1))
        self.append_notes = dspy.ChainOfThought(dspy.Signature("claim, key_facts, new_search_results -> new_key_facts", instr2))

    def forward(self, claim: str) -> list[str]:
        key_facts = []
        retrieved_docs = []

        for hop_idx in range(self.num_hops):
            query = self.generate_query(claim=claim, key_facts=key_facts).followup_search_query if hop_idx else claim
            search_results = search(query, k=self.num_docs)
            retrieved_docs.extend(search_results)

            if hop_idx == self.num_hops - 1:
                break
                
            prediction = self.append_notes(claim=claim, key_facts=key_facts, new_search_results=search_results)
            key_facts.append(prediction.new_key_facts)

        return dspy.Prediction(key_facts=key_facts, retrieved_docs=retrieved_docs)

Define metrics for success in this task

def recall(example, pred, trace=None):
    gold_titles = example.titles
    retrieved_titles = [doc.split(" | ")[0] for doc in pred.retrieved_docs]
    return sum(x in retrieved_titles for x in set(gold_titles)) / len(gold_titles)
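
For intuition, here is a tiny, invented illustration of how this recall metric behaves (the titles below are hypothetical):

gold = dspy.Example(claim="some claim", titles=["A", "B", "C"]).with_inputs("claim")
pred = dspy.Prediction(retrieved_docs=["A | some text", "C | other text", "D | more text"])
recall(gold, pred)  # 2 of the 3 gold titles were retrieved, so this returns 0.666...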

evaluate = dspy.Evaluate(devset=devset, metric=recall, num_threads=16, display_progress=True, display_table=5)

Optimize the ResearchHop system with dspy.GRPO

from dspy.teleprompt.grpo import GRPO

program = ResearchHop(num_docs=4, num_hops=2)
program.set_lm(local_lm)

# NOTE: Training on 6 GPUs.
train_kwargs = {
    "update_interval": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "temperature": 0.7,
    "beta": 0.04,
    "learning_rate": 2e-5,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "bf16": True,
    "lr_scheduler_type": "constant_with_warmup",
    "max_prompt_length": None,
    "max_completion_length": None,
    "scale_rewards": True,
    "max_grad_norm": 0.5,
    "lora": True,
}

compiler = GRPO(
    metric=recall,
    multitask=True,
    num_dspy_examples_per_grpo_step=6,
    num_samples_per_input=8,
    exclude_demos=True,
    num_train_steps=500,
    num_threads=24,
    use_train_as_val=False,
    num_steps_for_val=10,
    train_kwargs=train_kwargs,
    report_train_scores=False,
)

optimized_program = compiler.compile(
    student=program,
    trainset=trainset,
    valset=devset,
)

Now, you can use the GRPO'ed program.

example = devset[0]
optimized_program(**example.inputs())
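
You can also re-run the evaluator on the optimized program to compare it against the baseline; the exact number will depend on your training run.

evaluate(optimized_program)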

In our preliminary experiments, training above for about 18 hours boosts the recall (devset) from 61.8% to 66.2%. This is typically worse on a cost/quality basis than you'd get from running prompt optimizers like dspy.MIPROv2 or dspy.SIMBA, but it's still a very solid start for online RL over arbitrary LM programs for small LMs.


Tutorial: Online RL over a Multi-Module DSPy Program

WARNING: This feature is new and extremely EXPERIMENTAL. Unlike almost everything else in DSPy, it's currently in pure proof of concept and development mode, but we release it to encourage community involvement.

If you want to be on the cutting edge even before it's merged, install the dspy.GRPO PR via pip install git+https://github.com/stanfordnlp/dspy.git@refs/pull/8171/head and follow along.

In this tutorial, we optimize the LM weights of PAPILLON with dspy.GRPO, a generalization of the popular GRPO online RL algorithm of LLMs to sophisticated multi-module LM programs.

PAPILLON is a system for privacy-preserving delegation, in which we teach a tiny model (1.7B parameters) to use an "untrusted" external LLM, which is more powerful but may store your private data, in order to balance response quality with privacy in chat.

For this tutorial, you will also need the Arbor RL server.

> pip install arbor-ai
> python -m arbor.cli serve --arbor-config arbor.yaml

where you create arbor.yaml in your directory, containing a plan like:

inference:
  gpu_ids: '0'

training:
  gpu_ids: '1, 2'

which assigns GPU 0 for inference and GPUs 1 and 2 for training.

import dspy
from dspy.clients.lm_local_arbor import ArborProvider

port = 7453
local_lm_name = "Qwen/Qwen3-1.7B"
local_lm = dspy.LM(
    model=f"openai/arbor:{local_lm_name}",
    provider=ArborProvider(),
    temperature=0.7,
    api_base=f"http://localhost:{port}/v1/",
    api_key="arbor",
)

dspy.configure(lm=local_lm)

openai_lm = dspy.LM(model="openai/gpt-4.1-mini")

class CraftRedactedRequest(dspy.Signature):
    """
    Given a private user query, create a privacy-preserving request for a powerful external LLM.
    The LLM may assist without learning private information about the user.
    """

    user_query = dspy.InputField()
    llm_request = dspy.OutputField()


class RespondToQuery(dspy.Signature):
    """
    Respond to a user query.
    For inspiration, we found a potentially related request to a powerful external LLM and its response.
    """

    related_llm_request = dspy.InputField()
    related_llm_response = dspy.InputField(desc="information from a powerful LLM responding to a related request")
    user_query = dspy.InputField(desc="the user's request you need to fulfill")
    response = dspy.OutputField(desc="your final response to the user's request")


class PAPILLON(dspy.Module):
    def __init__(self, untrusted_model):
        self.craft_redacted_request = dspy.ChainOfThought(CraftRedactedRequest)
        self.respond_to_query = dspy.Predict(RespondToQuery)
        self.untrusted_model = untrusted_model

    def forward(self, user_query):
        try:
            llm_request = self.craft_redacted_request(user_query=user_query).llm_request
            llm_response = self.untrusted_model(llm_request)[0]
            response = self.respond_to_query(
                related_llm_request=llm_request, related_llm_response=llm_response, user_query=user_query
            ).response
        except Exception:
            return dspy.Prediction(llm_request="", llm_response="", response="")

        return dspy.Prediction(llm_request=llm_request, llm_response=llm_response, response=response)

from datasets import load_dataset

pupa_tnb = load_dataset("Columbia-NLP/PUPA", "pupa_tnb")
pupa_new = load_dataset("Columbia-NLP/PUPA", "pupa_new")

examples = [
    dspy.Example(
        {"target_response": x["target_response"], "user_query": x["user_query"], "pii_str": x["pii_units"]}
    ).with_inputs("user_query")
    for x in pupa_new["train"]
]

trainset, devset, testset = examples[:225], examples[225:450], examples[450:]
print(f"Loaded {len(trainset)} training examples, {len(devset)} dev examples, and {len(testset)} test examples.")

Define metrics for success in this task

What does it mean for a PAPILLON system to be successful?

  1. The responses of the local model should be as good as (or better than) the target_response from a large LM.
  2. The local model should leak as few pii_units to the remote model as possible.

For benchmarking, we will judge both of these using our openai_lm and the annotation in PUPA.

class JudgeQuality(dspy.Signature):
    """
    You are comparing the quality of two responses, given a user query.
    Is response A at least as good as response B?
    """

    user_query = dspy.InputField(desc="The user's request to be fulfilled.")
    response_A = dspy.InputField()
    response_B = dspy.InputField()
    judgment: bool = dspy.OutputField()


class JudgeLeakage(dspy.Signature):
    """
    Count the number of information pieces in `pii` that (i) are forms of PII *and* (ii) are leaked into the prompt given.
    Don't count the cases where the leaked information is just a celebrity name.
    Don't count the cases where the `pii` information does not appear in the prompt.
    """

    pii = dspy.InputField()
    prompt = dspy.InputField()
    num_pii_leaked: int = dspy.OutputField()


class LLMJudge(dspy.Module):
    def __init__(self):
        self.quality_judge = dspy.ChainOfThought(JudgeQuality)
        self.fact_checker = dspy.ChainOfThought(JudgeLeakage)

    def forward(self, user_query, og_resp, new_resp=None, updated_query=None, pii_str=None):
        judgment_1 = self.quality_judge(user_query=user_query, response_A=new_resp, response_B=og_resp).judgment
        judgment_2 = self.quality_judge(user_query=user_query, response_A=og_resp, response_B=new_resp).judgment
        judgment = judgment_1 or (judgment_1 == judgment_2)  # True if better or if judge is inconsistent

        pii = list(set(pii_str.split("||")))  # The pii_str field must be separated by `||`
        pii_score = self.fact_checker(pii=pii, prompt=updated_query).num_pii_leaked
        pii_score = pii_score / len(pii) if len(pii) > 0 else 0

        return dspy.Prediction(quality=judgment, leakage=pii_score)


llm_judge = LLMJudge()
llm_judge.set_lm(openai_lm)

With these judges, we can now define the metrics for optimization and for evaluation.

def compute_metrics(gold, pred, trace=None):
    return llm_judge(
        user_query=gold.user_query,
        new_resp=pred.response,
        og_resp=gold.target_response,
        updated_query=pred.llm_request,
        pii_str=gold.pii_str,
    )


def compute_quality(gold, pred, trace=None):
    return compute_metrics(gold, pred, trace).quality


def compute_leakage(gold, pred, trace=None):
    return compute_metrics(gold, pred, trace).leakage


def compute_overall_score(gold, pred, trace=None):
    metrics = compute_metrics(gold, pred, trace)
    overall_score = (metrics.quality + (1 - metrics.leakage)) / 2.0
    return overall_score >= 1.0 if trace is not None else overall_score
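
As a concrete, hypothetical illustration: if the judge deems the response at least as good as the target (quality = 1) and one of four PII units leaks (leakage = 0.25), the overall score during evaluation is (1 + 0.75) / 2 = 0.875; during compilation, when a trace is passed, the same example counts as a failure because 0.875 < 1.0.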

Evaluate zero-shot PAPILLON

Let's now use the PUPA data and the judges above to evaluate the zero-shot version of our PAPILLON pipeline!

zeroshot = PAPILLON(untrusted_model=openai_lm)

kwargs = dict(num_threads=16, display_progress=True, display_table=5, max_errors=100)
evaluate = dspy.Evaluate(metric=compute_overall_score, devset=devset, **kwargs)
evaluate(zeroshot)

Optimize PAPILLON with dspy.GRPO

Let's run the dspy.GRPO optimizer to maximize the compute_overall_score metric above for our PAPILLON pipeline.

We ran this on 4xH100 GPUs for a couple of hours. But first, you'll need to set up Arbor (as above).

from dspy.teleprompt.grpo import GRPO

papillon = PAPILLON(untrusted_model=openai_lm)
papillon.set_lm(local_lm)

# NOTE: Training on 3 GPUs.
train_kwargs = {
    "update_interval": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "temperature": 0.7,
    "beta": 0.04,
    "learning_rate": 2e-6,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "bf16": True,
    "lr_scheduler_type": "constant_with_warmup",
    "max_prompt_length": None,
    "max_completion_length": None,
    "scale_rewards": True,
    "max_grad_norm": 0.5,
    "lora": True,
}

compiler = GRPO(
    metric=compute_overall_score,
    multitask=True,
    num_dspy_examples_per_grpo_step=4,
    num_samples_per_input=8,
    exclude_demos=True,
    num_train_steps=500,
    num_threads=24,
    use_train_as_val=False,
    num_steps_for_val=10,
    train_kwargs=train_kwargs,
    report_train_scores=False,
)

optimized_papillon = compiler.compile(
    student=papillon,
    trainset=trainset,
    valset=devset,
)

Now, you can use the GRPO'ed program.

example = devset[0]
optimized_papillon(**example.inputs())
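
As before, you can re-run the evaluator on the optimized pipeline to compare against the zero-shot baseline.

evaluate(optimized_papillon)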

In our preliminary experiments, training above for three hours boosts the composite score (devset) from 54.6% to 60.0%. This is typically worse on a cost/quality basis than you'd get from running prompt optimizers like dspy.MIPROv2 or dspy.SIMBA, but it's still a very solid start for online RL over arbitrary LM programs for tiny LMs.


Tutorial: Advanced Tool Use

Let's walk through a quick example of building and prompt-optimizing a DSPy agent for advanced tool use. We'll do this for the challenging task ToolHop but with an even stricter evaluation criterion.

Install the latest DSPy via pip install -U dspy and follow along. You will also need to pip install func_timeout.

Recommended: Set up MLflow Tracing to understand what's happening under the hood.

MLflow DSPy Integration

MLflow is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to understand DSPy's behavior better. You can set up MLflow easily by following the four steps below.

  1. Install MLflow
%pip install mlflow>=2.20
  2. Start MLflow UI in a separate terminal
mlflow ui --port 5000
  3. Connect the notebook to MLflow
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
  4. Enable tracing.
mlflow.dspy.autolog()

To learn more about the integration, visit MLflow DSPy Documentation as well.

In this tutorial, we'll demonstrate the new experimental dspy.SIMBA prompt optimizer, which tends to be powerful for larger LLMs and harder tasks. Using this, we'll improve our agent from 35% accuracy to 60%.

import dspy
import ujson
import random

gpt4o = dspy.LM("openai/gpt-4o", temperature=0.7)
dspy.configure(lm=gpt4o)

Let's now download the data.

from dspy.utils import download

download("https://huggingface.co/datasets/bytedance-research/ToolHop/resolve/main/data/ToolHop.json")

data = ujson.load(open("ToolHop.json"))
random.Random(0).shuffle(data)
Downloading 'ToolHop.json'...
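
To get a feel for the data, you can peek at one record; each datapoint carries a question, a ground-truth answer, and the source code of the tools available for that question (these field names match how they are used below).

print(len(data))
print(data[0]["question"])
print(data[0]["answer"])
print(len(data[0]["functions"]), "tool definitions")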

Then let's prepare a cleaned set of examples. The ToolHop task is interesting in that the agent gets a unique set of tools (functions) to use separately for each request. Thus, it needs to learn how to use any such tools effectively in practice.

import re
import inspect

examples = []
fns2code = {}

def finish(answer: str):
    """Conclude the trajectory and return the final answer."""
    return answer

for datapoint in data:
    func_dict = {}
    for func_code in datapoint["functions"]:
        cleaned_code = func_code.rsplit("\n\n# Example usage", 1)[0]
        fn_name = re.search(r"^\s*def\s+([a-zA-Z0-9_]+)\s*\(", cleaned_code)
        fn_name = fn_name.group(1) if fn_name else None

        if not fn_name:
            continue

        local_vars = {}
        exec(cleaned_code, {}, local_vars)
        fn_obj = local_vars.get(fn_name)

        if callable(fn_obj):
            func_dict[fn_name] = fn_obj
            assert fn_obj not in fns2code, f"Duplicate function found: {fn_name}"
            fns2code[fn_obj] = (fn_name, cleaned_code)

    func_dict["finish"] = finish

    example = dspy.Example(question=datapoint["question"], answer=datapoint["answer"], functions=func_dict)
    examples.append(example.with_inputs("question", "functions"))

trainset, devset, testset = examples[:100], examples[100:400], examples[400:]

And let's define some helpers for the task. Here, we will define the metric, which will be (much) stricter than in the original paper: we'll expect the prediction to match exactly (after normalization) with the ground truth. We'll also be strict in a second way: we'll only allow the agent to take 5 steps in total, so that it remains efficient to deploy.

from func_timeout import func_set_timeout

def wrap_function_with_timeout(fn):
    @func_set_timeout(10)
    def wrapper(*args, **kwargs):
        try:
            return {"return_value": fn(*args, **kwargs), "errors": None}
        except Exception as e:
            return {"return_value": None, "errors": str(e)}

    return wrapper

def fn_metadata(func):
    signature = inspect.signature(func)
    docstring = inspect.getdoc(func) or "No docstring."
    return dict(function_name=func.__name__, arguments=str(signature), docstring=docstring)

def metric(example, pred, trace=None):
    # Loose normalization: strip trailing '.' and '0' characters, drop commas, and lowercase before exact match.
    gold = str(example.answer).rstrip(".0").replace(",", "").lower()
    pred = str(pred.answer).rstrip(".0").replace(",", "").lower()
    return pred == gold  # stricter than the original paper's metric!

evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24, display_progress=True, display_table=0, max_errors=999)
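
As a quick, hypothetical sanity check, here is how the helpers behave on the finish tool defined earlier:

print(fn_metadata(finish))
# {'function_name': 'finish', 'arguments': '(answer: str)', 'docstring': 'Conclude the trajectory and return the final answer.'}

safe_finish = wrap_function_with_timeout(finish)
print(safe_finish(answer="42"))
# {'return_value': '42', 'errors': None}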

Now, let's define the agent! The core of our agent will be based on a ReAct loop, in which the model sees the trajectory so far and the set of functions available to invoke, and decides the next tool to call.

To keep the final agent fast, we'll limit its max_steps to 5 steps. We'll also run each function call with a timeout.

class Agent(dspy.Module):
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        instructions = "For the final answer, produce short (not full sentence) answers in which you format dates as YYYY-MM-DD, names as Firstname Lastname, and numbers without leading 0s."
        signature = dspy.Signature('question, trajectory, functions -> next_selected_fn, args: dict[str, Any]', instructions)
        self.react = dspy.ChainOfThought(signature)

    def forward(self, question, functions):
        tools = {fn_name: fn_metadata(fn) for fn_name, fn in functions.items()}
        trajectory = []

        for _ in range(self.max_steps):
            pred = self.react(question=question, trajectory=trajectory, functions=tools)
            selected_fn = pred.next_selected_fn.strip('"').strip("'")
            fn_output = wrap_function_with_timeout(functions[selected_fn])(**pred.args)
            trajectory.append(dict(reasoning=pred.reasoning, selected_fn=selected_fn, args=pred.args, **fn_output))

            if selected_fn == "finish":
                break

        return dspy.Prediction(answer=fn_output.get("return_value", ''), trajectory=trajectory)

Out of the box, let's assess our GPT-4o-powered agent on the development set.

agent = Agent()
evaluate(agent)
2025/03/23 21:46:10 INFO dspy.evaluate.evaluate: Average Metric: 105.0 / 300 (35.0%)
35.0

Now, let's optimize the agent using dspy.SIMBA, which stands for Stochastic Introspective Mini-Batch Ascent. This prompt optimizer accepts arbitrary DSPy programs like our agent here and proceeds in a sequence of mini-batches seeking to make incremental improvements to the prompt instructions or few-shot examples.

simba = dspy.SIMBA(metric=metric, max_steps=12, max_demos=10)
optimized_agent = simba.compile(agent, trainset=trainset, seed=6793115)
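
If you'd like to keep the result of this optimization, you can persist the optimized program's state (instructions and demos) with DSPy's standard save/load utilities; the filename below is just an example.

optimized_agent.save("optimized_toolhop_agent.json")

loaded_agent = Agent()
loaded_agent.load("optimized_toolhop_agent.json")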

Having completed this optimization, let's now evaluate our agent again. We see a substantial relative gain of roughly 70%, jumping from 35% to about 61% accuracy.

evaluate(optimized_agent)
2025/03/23 21:46:21 INFO dspy.evaluate.evaluate: Average Metric: 182.0 / 300 (60.7%)
60.67