LLM-as-a-Judge (also called model-based evaluation) is a scalable way to use one LLM to automatically score and reason about the outputs of another LLM or agent. In the context of your CanvassAssistant (backed by OpenAI's Assistants API via Azure), this is ideal for tuning prompts and fixing issues like repetitive questions. For example, you can evaluate thread histories for redundancy, coherence, or completeness, generating scores (e.g., 0-1) with chain-of-thought explanations.
Langfuse provides managed LLM-as-a-Judge evaluators through its UI, with a catalog of pre-built templates (e.g., for Hallucination, Helpfulness, Toxicity) and support for customization. It integrates seamlessly with your traced runs from the previous setup (using @langfuse/openai or manual observations). Evaluations can run on production traces (e.g., your callEndpoint executions) or offline datasets.
If you need fully custom logic beyond the UI (e.g., specialized scoring for repetitiveness), use the Langfuse SDK. Below is a detailed guide based on Langfuse's documentation (as of October 2025).
- Langfuse project set up (from your earlier tracing integration).
- Tracing enabled in your code (e.g., `observeOpenAI(this.client)` and/or manual spans in `callEndpoint`).
- An LLM provider API key (e.g., OpenAI, Azure OpenAI, Anthropic) for the judge model. Models like GPT-4o, Claude-3.5-Sonnet, or Llama-3.1-405B work best, as they support structured outputs (JSON mode) for reliable scoring.
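If the wrapper isn't in place yet, here is a minimal sketch of wiring it up, assuming the `observeOpenAI` helper from the v2 `langfuse` package (the newer `@langfuse/openai` package exposes an equivalent helper) and placeholder Azure environment variables; adapt the config to your existing client setup:

```typescript
import { AzureOpenAI } from "openai";
import { observeOpenAI } from "langfuse";

// Wrap the Azure OpenAI client so every call is traced in Langfuse.
// The endpoint/key/version values are placeholders for your own configuration.
const client = observeOpenAI(
  new AzureOpenAI({
    endpoint: process.env.AZURE_OPENAI_ENDPOINT,
    apiKey: process.env.AZURE_OPENAI_API_KEY,
    apiVersion: "2024-06-01",
  }),
  {
    generationName: "CanvassAssistantRun", // observation name used in traces
    tags: ["canvass"],                     // handy later for evaluator filters
  },
);
```

The rest of this guide assumes `this.client` in `callEndpoint` is such a wrapped client, so each run produces a trace the evaluators can attach scores to.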
LLM connections power the judge model. This is a one-time global setup for your project.
- Go to Settings > LLM Connections in the Langfuse UI.
- Click + Add Connection.
- Select your provider (e.g., Azure OpenAI).
- Enter your API key, base URL (e.g., your Azure endpoint), and any defaults (e.g., model name like `gpt-4o`, temperature 0 for consistency).
- Test the connection.
- Note: The model must support structured outputs (e.g., JSON) to parse scores reliably. If using Azure, ensure your deployment supports this.
- You can add multiple connections and switch them later—historic evaluations remain unchanged.
- Navigate to the Evaluators page in Langfuse.
- Click + Set up Evaluator.
- Set the Default Model: Choose from your LLM connections (e.g., GPT-4o). This applies to all evaluators unless overridden. You can pass extra params like `max_tokens: 500` or `temperature: 0.2`.
- Pick an Evaluator:
- Select from the catalog for ready-to-use templates (no prompt writing needed). Examples:
- Hallucination: Checks if outputs are grounded in inputs.
- Helpfulness: Rates usefulness and relevance.
- Coherence: Assesses logical flow (useful for detecting repetitiveness).
- Toxicity or Relevance: For other dimensions.
- For custom (e.g., repetitiveness): If not in catalog, use a similar one like "Coherence" as a base. Langfuse allows custom templates—edit the prompt in the setup wizard to override defaults.
- Example Custom Prompt for Repetitive Questions:

  ```text
  Analyze the assistant's questions in this conversation thread: {{messages}}.
  Score from 0 to 1, where:
  - 1: No repetitive questions; all are unique and build on prior info.
  - 0: High repetition; many questions re-ask known details.
  Deduct points for duplicates (e.g., re-asking from extracted_info).
  Reason step-by-step, then output JSON: {"score": <number>, "reason": "<explanation>"}
  ```

- This uses chain-of-thought for explainable scores.
- Choose Data to Evaluate:
- Traces (production data): Run on live `callEndpoint` executions.
  - Scope: "New traces only" (real-time monitoring) or "Existing traces" (backfill history).
  - Filters: Target specific traces, e.g., by name ("CanvassAssistantRun"), tags (add in code: `span.update({ tags: ["canvass"] })`), or metadata (e.g., `threadId`).
  - Preview: See sample traces from the last 24 hours matching your filters.
- Datasets: For offline experiments (from Step 4 in previous response).
- Sampling Rate: Set to 5-10% initially (e.g., "Sample 10% of matched traces") to control costs (~$0.01-0.05 per eval with GPT-4o).
- Map Variables:
- Link trace/dataset fields to prompt placeholders.
- E.g., `{{input}}`: Map to your prompt (from `formatPayloadToPrompt`); `{{output}}`: Map to the assistant response (e.g., `responseContent`); `{{messages}}`: For threads, use a JSONPath like `$.messages` (if logged as an array in traces).
- For repetitiveness: Map `{{messages}}` to the full thread history. Ensure your traces log this, e.g., add `span.update({ metadata: { messages: messages.data } })` after fetching messages (see the sketch after the results list below).
- Preview: Langfuse shows a live populated prompt with real data. Cycle through examples to verify.
- Trigger the Evaluation:
- Save the evaluator—it auto-runs on new traces (if selected).
- For immediate testing: Send a new trace via your app (e.g., call `callEndpoint`).
- For datasets: Run an experiment (see below).
- Backfilling: If set to existing traces, it queues evals on history.
- View Results: Scores appear as badges in trace details (e.g., "Repetitiveness: 0.7" with hover for reasoning).
- In Scores table: Filter by evaluator name, see averages, distributions.
- In Dashboards: Aggregate over time, by prompt version (from prompt management), or environment.
- Execution Traces: Every eval creates its own trace in a special environment (`langfuse-llm-as-a-judge`).
- Access: Hover score > "View execution trace" or filter the tracing table.
- Debug: See exact prompt sent, judge response, tokens, latency, errors.
- Statuses: Completed, Error (with retries), Delayed (rate limits), Pending.
- Actions: Pause/resume/delete evaluators. Rerun on subsets.
- Iterate: If scores are low (e.g., repetitiveness <0.8), tune your assistant's prompt (e.g., add "Avoid repeats using extracted_info") and compare versions via datasets.
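For the name filter and the `{{messages}}` mapping above to have something to match, the trace itself must carry the thread history. Here is a minimal sketch of what that could look like in `callEndpoint`, assuming the v2 `langfuse` SDK and the `messages.data` / thread-id values from the earlier discussion; the trace name, tag, and metadata shape are otherwise illustrative:

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads the LANGFUSE_* environment variables

// Called inside callEndpoint once the assistant run has finished and messages are fetched.
async function recordCanvassRun(
  threadId: string,
  prompt: string,
  threadMessages: unknown[],
  responseContent: string,
) {
  const trace = langfuse.trace({
    id: threadId,                // reuse the thread id so scores can later target this trace
    name: "CanvassAssistantRun", // matches the evaluator's name filter
    tags: ["canvass"],           // matches the evaluator's tag filter
    input: prompt,
    output: responseContent,
    // Full thread history, so the evaluator's {{messages}} variable has data to map to.
    metadata: { messages: threadMessages },
  });

  await langfuse.flushAsync();   // make sure the events are sent before the request ends
  return trace.id;
}
```

If you already create traces via `observeOpenAI`, you can instead attach the metadata to that existing trace; the key point is that the messages array ends up somewhere a JSONPath can reach.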
To test prompt changes offline (e.g., fixing repetitiveness):
- Create a dataset from traces (UI or SDK).
- Run an experiment: Select dataset > Run with prompt version > Apply evaluators.
- Compare scores side-by-side (e.g., v1 vs. v2 prompts).
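If you would rather build the dataset and run the comparison from code instead of the UI, a sketch with the v2 SDK's dataset helpers (`createDataset`, `createDatasetItem`, `getDataset`, `item.link`) could look like the following; the dataset name, item payload, and run name are illustrative placeholders, and exact signatures may vary by SDK version:

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// 1. Collect representative cases (e.g., threads that showed repetition) into a dataset.
await langfuse.createDataset({ name: "canvass-repetition-cases" });
await langfuse.createDatasetItem({
  datasetName: "canvass-repetition-cases",
  input: { prompt: "Canvass Jane Doe", extracted_info: { phone_number: "555-0100" } },
  expectedOutput: "Asks only for fields not already present in extracted_info",
});

// 2. For each prompt version, run the assistant over the items and link the traces as a run.
const dataset = await langfuse.getDataset("canvass-repetition-cases");
for (const item of dataset.items) {
  const trace = langfuse.trace({ name: "CanvassAssistantRun-experiment", input: item.input });
  // ... call the assistant with item.input and record its output on the trace here ...
  await item.link(trace, "prompt-v2"); // the run name identifies the prompt version under test
}
await langfuse.flushAsync();
```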
If UI catalog doesn't suffice, use the TypeScript SDK for custom scoring. This lets you run LLM-as-a-Judge programmatically, e.g., after callEndpoint.
- Install/Update SDK: `npm install langfuse`.
- In code (e.g., after parsing the response in `callEndpoint`):

  ```typescript
  import { Langfuse } from 'langfuse';

  const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY,
    baseUrl: process.env.LANGFUSE_BASE_URL,
  });

  // Example: Custom LLM-as-a-Judge for repetitiveness.
  // Assumes this lives on your service class, where this.client is the AzureOpenAI client.
  async function evaluateRepetitiveness(
    traceId: string,
    messages: any[],
    output: string,
    observationId?: string, // optional: link the score to a specific run/span
  ) {
    // Use your AzureOpenAI client as the judge (or another provider)
    const judgeResponse = await this.client.chat.completions.create({
      model: 'gpt-4o', // Or from config
      messages: [
        { role: 'system', content: 'You are an evaluator. Analyze for repetitive questions.' },
        {
          role: 'user',
          content: `Thread: ${JSON.stringify(messages)}\nOutput: ${output}\nScore 0-1 on uniqueness (1=no repeats). Reason first, then JSON: {"score": number, "reason": "text"}`,
        },
      ],
      response_format: { type: 'json_object' }, // For structured output
      temperature: 0,
    });

    const result = JSON.parse(judgeResponse.choices[0].message.content ?? '{}');

    // Log the score to Langfuse, tied to the trace
    await langfuse.score({
      traceId, // From your trace (e.g., currentThreadId if set as traceId)
      name: 'repetitiveness',
      value: result.score, // 0-1
      comment: result.reason,
      observationId, // Optional: link to the specific run/span
    });
  }

  // Call after a successful run
  await evaluateRepetitiveness(currentThreadId, messages.data, responseContent, run.id);
  ```
- This creates a score tied to the trace, with reasoning as a comment.
- For batch: Use `langfuse.createBatchScores` on datasets.
- Trigger externally: Run this in a worker/queue for production traces (a worker sketch follows this list).
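A minimal sketch of such a worker, assuming the v2 SDK's `fetchTraces` helper is available in your version and reusing the `evaluateRepetitiveness` function above; the filter values and metadata shape are illustrative:

```typescript
// Periodically score recent production traces with the custom judge defined above.
async function scoreRecentTraces() {
  const { data: traces } = await langfuse.fetchTraces({
    name: "CanvassAssistantRun", // same name used when creating traces
    limit: 20,                   // small batch per worker run to control judge costs
  });

  for (const trace of traces) {
    const messages = (trace.metadata as any)?.messages ?? []; // thread history logged earlier
    await evaluateRepetitiveness(trace.id, messages, String(trace.output ?? ""));
  }

  await langfuse.flushAsync(); // push queued scores before the worker exits
}
```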
- Map thread messages to detect duplicates (e.g., compare questions against `extracted_info`); a deterministic pre-check sketch follows this list.
- Start with the catalog "Coherence" evaluator and customize the prompt as above.
- Costs: ~$0.01/eval; sample low initially.
- Alternatives: If not using Langfuse, implement purely with OpenAI (similar to the SDK example), but Langfuse adds tracing/UI benefits.
- Debug: If evals fail (e.g., no structured output), switch models or add retries.
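As a cheap complement to the LLM judge, a deterministic pre-check can flag the most obvious repeats before (or alongside) model-based scoring. The sketch below assumes `extracted_info` is a record of already-collected fields and that the assistant's questions can be pulled from the thread messages; the field-to-phrase normalization and matching rules are purely illustrative:

```typescript
// Flag assistant questions that re-ask for fields already present in extracted_info.
function findRepeatedQuestions(
  assistantMessages: string[],
  extractedInfo: Record<string, unknown>,
): string[] {
  // Fields we already have values for, e.g. "phone_number" -> "phone number".
  const knownFields = Object.entries(extractedInfo)
    .filter(([, value]) => value !== null && value !== undefined && value !== "")
    .map(([key]) => key.toLowerCase().replace(/_/g, " "));

  // Pull individual questions out of the assistant's messages.
  const questions = assistantMessages
    .flatMap((msg) => msg.split(/(?<=[.!?])\s+/)) // rough sentence split
    .map((sentence) => sentence.trim().toLowerCase())
    .filter((sentence) => sentence.endsWith("?"));

  // A question counts as a repeat if it mentions a field we already collected.
  return questions.filter((q) => knownFields.some((field) => q.includes(field)));
}

// Example: phone_number is already known, so re-asking for it is flagged.
const repeats = findRepeatedQuestions(
  ["Thanks! What is your phone number?", "Which issue matters most to you?"],
  { phone_number: "555-0100" },
);
console.log(repeats); // ["what is your phone number?"]
```

The count of flagged questions can itself be logged as an additional Langfuse score next to the LLM judge's output.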
For more, check Langfuse docs or GitHub discussions. If this doesn't match your intent, provide more details!