langfuse LLM-as-a-Judge tutorial

LLM-as-a-Judge Setup in Langfuse

LLM-as-a-Judge (also called model-based evaluation) is a scalable way to use one LLM to automatically score and reason about the outputs of another LLM or agent. In the context of your CanvassAssistant (backed by OpenAI's Assistants API via Azure), this is ideal for tuning prompts and fixing issues like repetitive questions. For example, you can evaluate thread histories for redundancy, coherence, or completeness, generating scores (e.g., 0-1) with chain-of-thought explanations.

Langfuse provides managed LLM-as-a-Judge evaluators through its UI, with a catalog of pre-built templates (e.g., for Hallucination, Helpfulness, Toxicity) and support for customization. It integrates seamlessly with your traced runs from the previous setup (using @langfuse/openai or manual observations). Evaluations can run on production traces (e.g., your callEndpoint executions) or offline datasets.

If you need fully custom logic beyond the UI (e.g., specialized scoring for repetitiveness), use the Langfuse SDK. Below is a detailed guide based on Langfuse's documentation (as of October 2025).

Prerequisites

  • Langfuse project set up (from your earlier tracing integration).
  • Tracing enabled in your code (e.g., observeOpenAI(this.client) and/or manual spans in callEndpoint); a minimal sketch follows this list.
  • An LLM provider API key (e.g., OpenAI, Azure OpenAI, Anthropic) for the judge model—models like GPT-4o, Claude-3.5-Sonnet, or Llama-3.1-405B work best, as they support structured outputs (JSON mode) for reliable scoring.
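
If tracing isn't wired up yet, here is a minimal sketch of the wrapper approach, assuming the @langfuse/openai integration and environment variables for the Azure and Langfuse credentials (exact option names can vary slightly by SDK version):

    import { AzureOpenAI } from 'openai';
    import { observeOpenAI } from '@langfuse/openai';

    // Wrap the Azure OpenAI client so every call is traced in Langfuse.
    // Langfuse credentials are read from LANGFUSE_SECRET_KEY / LANGFUSE_PUBLIC_KEY / LANGFUSE_BASE_URL.
    const client = observeOpenAI(
      new AzureOpenAI({
        endpoint: process.env.AZURE_OPENAI_ENDPOINT,   // assumption: Azure settings live in env vars
        apiKey: process.env.AZURE_OPENAI_API_KEY,
        apiVersion: '2024-06-01',
      }),
      { generationName: 'CanvassAssistantRun', tags: ['canvass'] },  // names reused by the filters in Step 2
    );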

Step 1: Set Up LLM Connections

LLM connections power the judge model. This is a one-time global setup for your project.

  1. Go to Settings > LLM Connections in the Langfuse UI.
  2. Click + Add Connection.
  3. Select your provider (e.g., Azure OpenAI).
  4. Enter your API key, base URL (e.g., your Azure endpoint), and any defaults (e.g., model name like gpt-4o, temperature 0 for consistency).
  5. Test the connection.
    • Note: The model must support structured outputs (e.g., JSON) to parse scores reliably. If using Azure, ensure your deployment supports this.
    • You can add multiple connections and switch them later—historic evaluations remain unchanged.

Step 2: Create an LLM-as-a-Judge Evaluator in the UI

  1. Navigate to the Evaluators page in Langfuse.
  2. Click + Set up Evaluator.
  3. Set the Default Model: Choose from your LLM connections (e.g., GPT-4o). This applies to all evaluators unless overridden. You can pass extra params like max_tokens: 500 or temperature: 0.2.
  4. Pick an Evaluator:
    • Select from the catalog for ready-to-use templates (no prompt writing needed). Examples:
      • Hallucination: Checks if outputs are grounded in inputs.
      • Helpfulness: Rates usefulness and relevance.
      • Coherence: Assesses logical flow (useful for detecting repetitiveness).
      • Toxicity or Relevance: For other dimensions.
    • For custom dimensions (e.g., repetitiveness): if the catalog doesn't cover it, start from a similar template such as "Coherence". Langfuse allows custom templates; edit the prompt in the setup wizard to override the defaults.
      • Example Custom Prompt for Repetitive Questions:
        Analyze the assistant's questions in this conversation thread: {{messages}}.
        
        Score from 0 to 1, where:
        - 1: No repetitive questions; all are unique and build on prior info.
        - 0: High repetition; many questions re-ask known details.
        
        Deduct points for duplicates (e.g., re-asking from extracted_info).
        Reason step-by-step, then output JSON: {"score": <number>, "reason": "<explanation>"}
        
        • This uses chain-of-thought for explainable scores.
  5. Choose Data to Evaluate:
    • Traces (production data): Run on live callEndpoint executions.
      • Scope: "New traces only" (real-time monitoring) or "Existing traces" (backfill history).
      • Filters: Target specific traces, e.g., by name ("CanvassAssistantRun"), tags (add in code: span.update({ tags: ["canvass"] })), or metadata (e.g., threadId).
      • Preview: See sample traces from the last 24 hours matching your filters.
    • Datasets: For offline experiments (from Step 4 in previous response).
    • Sampling Rate: Set to 5-10% initially (e.g., "Sample 10% of matched traces") to control costs (~$0.01-0.05 per eval with GPT-4o).
  6. Map Variables:
    • Link trace/dataset fields to prompt placeholders.
      • E.g., {{input}}: Map to your prompt (from formatPayloadToPrompt).
      • {{output}}: Map to assistant response (e.g., responseContent).
      • {{messages}}: For threads, use JSONPath like $.messages (if logged as an array in traces).
      • For repetitiveness: Map {{messages}} to the full thread history (ensure your traces log this by adding span.update({ metadata: { messages: messages.data } }) after fetching messages; see the sketch after this list).
    • Preview: Langfuse shows a live populated prompt with real data. Cycle through examples to verify.
  7. Trigger the Evaluation:
    • Save the evaluator—it auto-runs on new traces (if selected).
    • For immediate testing: Send a new trace via your app (e.g., call callEndpoint).
    • For datasets: Run an experiment (see below).
    • Backfilling: If set to existing traces, it queues evals on history.
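
To make the filters in step 5 and the variable mapping in step 6 work, your traces need to carry the tags and the thread history. A minimal sketch of that tracing side, assuming a Langfuse client instance like the one constructed in the SDK section below, plus the currentThreadId and messages variables from callEndpoint:

    // Create (or reuse) a trace for this assistant run.
    const trace = langfuse.trace({
      id: currentThreadId,            // assumption: the thread ID doubles as the trace ID
      name: 'CanvassAssistantRun',    // matches the name filter in step 5
      tags: ['canvass'],              // matches the tag filter in step 5
    });

    const span = trace.span({ name: 'callEndpoint' });
    // ... run the assistant and fetch the thread messages ...
    span.update({ metadata: { messages: messages.data } });   // powers the {{messages}} mapping in step 6
    span.end();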

Step 3: Monitor and Iterate

  • View Results: Scores appear as badges in trace details (e.g., "Repetitiveness: 0.7" with hover for reasoning).
    • In Scores table: Filter by evaluator name, see averages, distributions.
    • In Dashboards: Aggregate over time, by prompt version (from prompt management), or environment.
  • Execution Traces: Every eval creates its own trace in a special environment (langfuse-llm-as-a-judge).
    • Access: Hover score > "View execution trace" or filter tracing table.
    • Debug: See exact prompt sent, judge response, tokens, latency, errors.
    • Statuses: Completed, Error (with retries), Delayed (rate limits), Pending.
  • Actions: Pause/resume/delete evaluators. Rerun on subsets.
  • Iterate: If scores are low (e.g., repetitiveness <0.8), tune your assistant's prompt (e.g., add "Avoid repeats using extracted_info") and compare versions via datasets.

Step 4: Using with Datasets for Prompt Tuning

To test prompt changes offline (e.g., fixing repetitiveness):

  1. Create a dataset from traces (UI or SDK; see the sketch after this list).
  2. Run an experiment: Select dataset > Run with prompt version > Apply evaluators.
  3. Compare scores side-by-side (e.g., v1 vs. v2 prompts).
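
Here is a minimal SDK sketch of this workflow, assuming the langfuse TypeScript client from the section below; the dataset name, example item, and run name are illustrative:

    // 1. Create a dataset and add an item (input/expectedOutput here are made-up examples).
    await langfuse.createDataset({ name: 'repetitive-threads' });
    await langfuse.createDatasetItem({
      datasetName: 'repetitive-threads',
      input: { prompt: 'Canvass thread where the voter already gave their address' },
      expectedOutput: 'Assistant asks only new, non-duplicate questions',
    });

    // 2. Run an experiment: execute the candidate prompt on each item and link
    //    the resulting trace to a named run so scores can be compared side-by-side.
    const dataset = await langfuse.getDataset('repetitive-threads');
    for (const item of dataset.items) {
      const trace = langfuse.trace({ name: 'prompt-v2-experiment', input: item.input });
      // ... call your assistant with the candidate prompt version here ...
      await item.link(trace, 'prompt-v2');
    }
    await langfuse.flushAsync();   // flush queued events before the script exits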

Custom Evaluators via SDK (for Advanced/Programmatic Setup)

If UI catalog doesn't suffice, use the TypeScript SDK for custom scoring. This lets you run LLM-as-a-Judge programmatically, e.g., after callEndpoint.

  1. Install/Update SDK: npm install langfuse.
  2. In code (e.g., after parsing response in callEndpoint):
    import { Langfuse } from 'langfuse';
    import type { AzureOpenAI } from 'openai';

    const langfuse = new Langfuse({
      secretKey: process.env.LANGFUSE_SECRET_KEY,
      publicKey: process.env.LANGFUSE_PUBLIC_KEY,
      baseUrl: process.env.LANGFUSE_BASE_URL,
    });

    // Example: custom LLM-as-a-Judge for repetitiveness.
    // The judge client is passed in explicitly (e.g., your AzureOpenAI instance)
    // so the function doesn't depend on `this`.
    async function evaluateRepetitiveness(
      judgeClient: AzureOpenAI,
      traceId: string,
      messages: any[],
      output: string,
      observationId?: string,
    ) {
      const judgeResponse = await judgeClient.chat.completions.create({
        model: 'gpt-4o',  // or from config
        messages: [
          { role: 'system', content: 'You are an evaluator. Analyze for repetitive questions.' },
          { role: 'user', content: `Thread: ${JSON.stringify(messages)}\nOutput: ${output}\nScore 0-1 on uniqueness (1 = no repeats). Reason first, then JSON: {"score": number, "reason": "text"}` }
        ],
        response_format: { type: 'json_object' },  // structured output for reliable parsing
        temperature: 0,
      });

      const result = JSON.parse(judgeResponse.choices[0].message.content ?? '{}');

      // Log the score to Langfuse, tied to the trace (and optionally a run/span)
      langfuse.score({
        traceId,                // from your trace (e.g., currentThreadId if set as traceId)
        name: 'repetitiveness',
        value: result.score,    // 0-1
        comment: result.reason,
        observationId,          // optional: link to a specific run/span
      });
    }

    // Call after a successful run (inside callEndpoint, where these variables are in scope)
    await evaluateRepetitiveness(this.client, currentThreadId, messages.data, responseContent, run.id);
    • This creates a score tied to the trace, with reasoning as a comment.
    • For batch evaluation: iterate over dataset items or fetched traces and call langfuse.score for each; the SDK queues and flushes score events in batches.
    • Trigger externally: Run this in a worker/queue for production traces (see the sketch below).
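
A minimal worker sketch, assuming langfuse.fetchTraces is available in your SDK version, that thread messages were logged to trace metadata as in Step 2, and that judgeClient is the same Azure/OpenAI client passed to evaluateRepetitiveness above:

    // Periodically score recent production traces tagged 'canvass'.
    const { data: traces } = await langfuse.fetchTraces({ tags: ['canvass'], limit: 50 });

    for (const trace of traces) {
      const threadMessages = (trace.metadata as any)?.messages ?? [];
      await evaluateRepetitiveness(
        judgeClient,                  // your AzureOpenAI (or other) judge client
        trace.id,
        threadMessages,
        String(trace.output ?? ''),
      );
    }

    await langfuse.flushAsync();      // make sure queued scores reach Langfuse before exiting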

Tips for Your Use Case (Repetitive Questions)

  • Map thread messages to detect duplicates (e.g., compare questions against extracted_info).
  • Start with catalog "Coherence" evaluator, customize prompt as above.
  • Costs: ~$0.01/eval; sample low initially.
  • Alternatives: If not using Langfuse, implement purely with OpenAI (similar to the SDK example), but Langfuse adds tracing/UI benefits.
  • Debug: If evals fail (e.g., no structured output), switch models or add retries.

For more, check Langfuse docs or GitHub discussions. If this doesn't match your intent, provide more details!
