[Evaluation Quick Start | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation): LLM should read this page when learning how to set up LangSmith evaluations, creating evaluation datasets, or implementing LLM-based evaluators. This page provides a step-by-step quick start guide for LangSmith's evaluation capabilities, covering installation, API key setup, dataset creation, defining evaluation targets, creating evaluators, and running evaluations in both Python and TypeScript.
[Evaluation concepts | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/concepts): LLM should read this page when wanting to understand LangSmith evaluation concepts, implementing evaluation strategies for LLM applications, or choosing appropriate metrics for different AI application types. This page covers LangSmith's evaluation framework including datasets, evaluators (human, heuristic, LLM-as-judge, pairwise), experiments, annotation queues, offline/online evaluation approaches, testing methodologies, and application-specific evaluation techniques for agents, RAG, summarization, and classification.
[Evaluation how-to guides | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides): LLM should read this page when looking for evaluation methods in LangSmith, needing to set up offline/online evaluations, or managing datasets for AI application testing. The page provides comprehensive how-to guides for LangSmith's evaluation features, covering offline evaluation setup, evaluator configuration, testing integration, online evaluation, experiment analysis, dataset management, and human feedback collection systems.
[Analyze a single experiment | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/analyze_single_experiment): LLM should read this page when analyzing experiment results in LangSmith, troubleshooting evaluation metrics, or learning how to interpret experiment visualizations. This page explains how to analyze experiment results in LangSmith including: navigating the experiment view, using heatmap visualization, sorting/filtering results, viewing different table formats (compact/full/diff), accessing traces, examining evaluator runs, grouping by metadata, working with repetitions, and comparing experiments.
[Annotate traces and runs inline | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/annotate_traces_inline): LLM should read this page when needing to annotate traces in LangSmith, adding manual feedback to application traces, or setting up inline annotation workflows. This page explains how to manually annotate traces in LangSmith by adding feedback, comments, and scores to any run in a trace, including intermediate spans, using the Annotate feature in the trace view.
[Use annotation queues | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/annotation_queues): LLM should read this page when needing to understand annotation queues in LangSmith, setting up human-in-the-loop feedback processes, or organizing systematic review of model outputs. This page explains how to create and use annotation queues in LangSmith, including creating queues with rubrics, assigning runs to queues, configuring multi-annotator workflows with reservations, and reviewing runs within a queue interface with keyboard shortcuts.
[How to run an evaluation asynchronously | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/async): LLM should read this page when needing to implement asynchronous evaluations in Python, handling concurrent evaluation requests, or working with async functions in LangSmith. This page explains how to use the `aevaluate()` function in LangSmith's Python SDK to run evaluations asynchronously, with sample code showing async function creation, handling concurrency, and integration with datasets for evaluation.
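A minimal sketch of the async flow described above, assuming a recent `langsmith` Python SDK where `aevaluate` is importable from the top-level package and evaluators can declare `outputs`/`reference_outputs` arguments; the dataset name `"my-dataset"` and the target logic are placeholders.

```python
import asyncio

from langsmith import aevaluate


async def my_app(inputs: dict) -> dict:
    # Placeholder async target; a real app would await an LLM call here.
    await asyncio.sleep(0)
    return {"answer": f"echo: {inputs['question']}"}


def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # Simple heuristic evaluator comparing the generated answer to the reference.
    return outputs["answer"] == reference_outputs["answer"]


async def main():
    # max_concurrency caps how many examples are processed in parallel.
    results = await aevaluate(
        my_app,
        data="my-dataset",  # placeholder dataset name
        evaluators=[exact_match],
        max_concurrency=4,
    )
    print(results.experiment_name)


asyncio.run(main())
```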
[Log user feedback | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/attach_user_feedback): LLM should read this page when needing to implement user feedback collection for LLM applications or wanting to integrate trace-based feedback with LangSmith. This page explains how to log user feedback to LangSmith traces using both Python and TypeScript, allowing developers to attach feedback scores and comments to any part of an application trace for analysis and evaluation.
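A short sketch of attaching user feedback to a run with the Python SDK; the `run_id` below is a placeholder you would normally capture from your tracing context.

```python
from langsmith import Client

client = Client()

# Placeholder run ID; in practice this comes from the traced run you want to score.
run_id = "00000000-0000-0000-0000-000000000000"

# Attach a thumbs-up style score and a comment to that run.
client.create_feedback(
    run_id,
    key="user-score",
    score=1.0,
    comment="Helpful answer",
)
```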
[How to audit evaluator scores | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/audit_evaluator_scores): LLM should read this page when auditing AI evaluation scores, needing to correct LLM judge assessments, or implementing score correction workflows. This page explains how to audit and correct LLM-as-judge evaluator scores in LangSmith using three methods: through the comparison view UI, the runs table UI, or programmatically via the SDK (Python or TypeScript) using the update_feedback function.
[How to bind an evaluator to a dataset in the UI | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/bind_evaluator_to_dataset): LLM should read this page when learning how to configure evaluators for datasets in LangSmith UI, when setting up automatic evaluation, or when creating custom evaluation functions. This page provides step-by-step instructions for binding evaluators to datasets in the LangSmith UI, covering both LLM-as-judge evaluators and custom code evaluators, with examples of implementation, restrictions, and visualization of evaluation results.
[How to compare experiment results | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/compare_experiment_results): LLM should read this page when comparing experiment results in LangSmith, analyzing differences between LLM application iterations, or interpreting metrics from multiple evaluations. This page provides a comprehensive guide to using LangSmith's experiment comparison view, including opening the comparison view, adjusting table displays, analyzing regressions/improvements, filtering results, setting baseline experiments, viewing traces and detailed views, and creating summary charts with customizable metadata labels.
[How to create few-shot evaluators | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/create_few_shot_evaluators): LLM should read this page when creating few-shot evaluators in LangSmith, setting up corrective learning for LLM evaluators, or managing evaluation feedback cycles. This guide explains how to create few-shot evaluators that improve over time using human corrections as examples, including how to set up mustache-template prompts with the {{Few-shot examples}} variable, make corrections that populate into a dedicated dataset, and access/edit the few-shot examples dataset.
[How to define a custom evaluator | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/custom_evaluator): LLM should read this page when needing to create custom evaluators for LangSmith, understanding how to return different types of metrics, or implementing LLM-as-judge evaluators. The page explains how to define custom evaluator functions in LangSmith for both Python and TypeScript, including the required argument specifications, supported return types, and examples of different evaluation approaches such as exact matching, conciseness scoring, and using LLMs as judges.
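A minimal exact-match custom evaluator in Python, assuming a recent `langsmith` SDK where evaluators can take named `outputs`/`reference_outputs` arguments and a boolean return is logged as a 0/1 score; the target function and dataset name are placeholders.

```python
from langsmith import evaluate


def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Boolean returns are recorded as a 0/1 metric under the function's name.
    return outputs["answer"] == reference_outputs["answer"]


def target(inputs: dict) -> dict:
    # Placeholder application under test.
    return {"answer": inputs["question"].strip()}


evaluate(
    target,
    data="my-dataset",  # placeholder dataset name
    evaluators=[correct],
)
```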
[How to evaluate on a split / filtered view of a dataset | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/dataset_subset): LLM should read this page when evaluating subsets of datasets, applying filters to datasets, or running evaluations on specific dataset splits. This page explains how to evaluate filtered views of datasets by using list_examples with metadata filters and how to run evaluations on specific dataset splits (like "test" or "training") using the LangSmith evaluation framework in both Python and TypeScript.
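A sketch of evaluating only a split or a metadata-filtered subset, assuming the Python SDK's `list_examples` supports `splits` and `metadata` filters and that `evaluate` accepts an iterable of examples as `data`; the dataset name, metadata key, and target are placeholders.

```python
from langsmith import Client, evaluate

client = Client()

# Only the "test" split of the dataset ("my-dataset" is a placeholder name).
test_split = client.list_examples(dataset_name="my-dataset", splits=["test"])

# Alternatively, filter by example metadata (hypothetical metadata key "topic").
billing_examples = client.list_examples(
    dataset_name="my-dataset",
    metadata={"topic": "billing"},
)

# Pass the filtered examples directly as the evaluation data.
evaluate(
    lambda inputs: {"answer": "placeholder"},  # placeholder target
    data=test_split,
    evaluators=[],
)
```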
[How to evaluate on a specific dataset version | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/dataset_version): LLM should read this page when needing to evaluate models on specific dataset versions, managing dataset versioning in LangSmith, or implementing version-specific evaluations. This page explains how to evaluate on a specific dataset version by using the list_examples/listExamples method with the as_of/asOf parameter to specify which version to use during evaluation.
[How to define a target function to evaluate | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/define_target): LLM should read this page when needing to configure a target function for evaluation in LangSmith, implementing evaluations for LLM applications, or setting up testing frameworks for AI components. This page explains how to define target functions for LangSmith evaluations, including the function signature requirements, examples for single LLM calls, non-LLM components, and complete applications/agents with code samples in Python and TypeScript.
[How to download experiment results as a CSV | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/download_experiment_results_as_csv): LLM should read this page when needing to download experiment data from LangSmith, wanting to export evaluation results, or analyzing LangSmith experiment outcomes offline. This page explains how to download experiment results as a CSV file from LangSmith by clicking the download icon at the top of the experiment view, located to the left of the "Compact" toggle.
[How to evaluate an existing experiment (Python only) | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_existing_experiment): LLM should read this page when needing to evaluate existing experiments or add evaluation metrics to previously run experiments. This page explains how to use the Python SDK to apply evaluators to existing experiments using the evaluate() method with an experiment name/ID instead of a target function.
[How to run an evaluation | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_llm_application): LLM should read this page when learning how to evaluate LLM applications, setting up evaluation pipelines, or debugging model performance. This page explains the process of running evaluations in LangSmith, covering how to define an application for testing, create datasets with labeled examples, define custom evaluators, run evaluation jobs with the evaluate() method, and analyze results in the LangSmith UI.
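An end-to-end sketch of the workflow the guide above describes — dataset, target, evaluator, then `evaluate()` — using placeholder names (`"qa-smoke-test"`) and trivial logic; assumes a recent `langsmith` Python SDK.

```python
from langsmith import Client, evaluate

client = Client()

# 1. Create a small dataset of labeled examples.
dataset = client.create_dataset(dataset_name="qa-smoke-test")
client.create_examples(
    inputs=[{"question": "What is 2 + 2?"}],
    outputs=[{"answer": "4"}],
    dataset_id=dataset.id,
)


# 2. The application under test (placeholder logic).
def target(inputs: dict) -> dict:
    return {"answer": "4"}


# 3. A simple custom evaluator.
def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]


# 4. Run the evaluation; results appear as an experiment in the LangSmith UI.
results = evaluate(
    target,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="qa-smoke-test",
)
```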
[How to evaluate an application's intermediate steps | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_on_intermediate_steps): LLM should read this page when evaluating intermediate steps of AI applications, working with RAG pipelines, or creating custom evaluators for complex systems. This guide explains how to evaluate intermediate steps in LLM applications, with examples using a Wikipedia-based RAG pipeline, creating custom evaluators for retrieval relevance and hallucination detection, and implementing evaluation through LangSmith with both Python and TypeScript code samples.
[How to run pairwise evaluations | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_pairwise): LLM should read this page when comparing outputs from multiple experiments against each other, setting up pairwise evaluations, or implementing LLM-as-judge comparisons. This page explains how to run pairwise evaluations in LangSmith, covering the evaluate() function arguments, defining custom pairwise evaluators, handling evaluator inputs/outputs, running evaluations, and viewing results in the LangSmith UI.
[Run an evaluation with large file inputs | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_with_attachments): LLM should read this page when working with multimodal file inputs, processing large attachments in LangSmith, or creating evaluations with file uploads. This page explains how to run LangSmith evaluations with large file attachments, including creating examples with attachments via SDK or UI, defining target functions that use these files, building custom evaluators that process attachments, and managing attachments through updates and versioning.
[How to export filtered traces from experiment to dataset | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/export_filtered_traces_to_dataset): LLM should read this page when needing to save specific evaluation results as datasets, filtering experiment traces by evaluation criteria, or creating new datasets from filtered runs. This guide demonstrates how to export filtered traces from a LangSmith experiment to a dataset by navigating to experiment traces, applying filters based on evaluation criteria, selecting desired runs, and using the "Add to Dataset" feature.
[How to fetch performance metrics for an experiment | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/fetch_perf_metrics_experiment): LLM should read this page when needing to extract performance metrics from LangSmith experiments or when analyzing evaluation results programmatically. This page explains how to fetch and interpret experiment performance metrics using the LangSmith SDK, including details on latency, token usage, costs, feedback statistics, and error rates, with code examples in both Python and TypeScript.
[How to filter experiments in the UI | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/filter_experiments_ui): LLM should read this page when needing to filter LangSmith experiments, looking for ways to organize evaluation results, or wanting to compare specific experiment metrics in the UI. This page explains how to filter experiments in LangSmith's UI by adding metadata to experiments during creation and using the filtering interface to narrow down results by model provider, prompt type, feedback scores, and other criteria.
[Dynamic few shot example selection | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/index_datasets_for_dynamic_few_shot_example_selection): LLM should read this page when looking for dynamic few-shot example selection, searching datasets based on input similarity, or implementing retrieval-based prompting. This page explains how to configure LangSmith datasets for dynamic few-shot example selection, including prerequisites (paid team plan, KV store data type), indexing datasets for search, testing search quality in the playground, and implementing this feature in applications using code snippets in Python and TypeScript.
[How to evaluate a langchain runnable | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/langchain_runnable): LLM should read this page when evaluating LangChain Runnable objects, setting up evaluation pipelines, or using LangSmith for model testing. This page explains how to evaluate LangChain Runnable objects (like chains, retrievers, and models) using LangSmith, with code examples in Python and TypeScript showing how to define a chain, create evaluation metrics, and run evaluations against datasets.
[How to evaluate a langgraph graph | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/langgraph): LLM should read this page when evaluating langgraph applications, creating evaluators for agent workflows, or running evaluations on graph nodes. This guide explains how to evaluate langgraph graphs, covering end-to-end evaluations, evaluating intermediate steps, and testing individual nodes with examples for creating datasets, defining evaluators, and analyzing results.
[How to define an LLM-as-a-judge evaluator | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/llm_as_judge): LLM should read this page when implementing LLM-as-a-judge evaluators, setting up evaluation systems for conversational AI, or creating custom evaluators for language models. The page explains how to define an LLM-as-a-judge evaluator using the LangSmith SDK, including custom evaluator implementation with code examples that use OpenAI GPT models to evaluate the consistency of reasoning in AI-generated responses.
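A hedged sketch of an LLM-as-a-judge custom evaluator: a judge model grades conciseness with a yes/no prompt. The prompt wording, judge model choice, dataset name, and target are illustrative placeholders, not the docs' exact example.

```python
from langsmith import evaluate
from openai import OpenAI

oai = OpenAI()


def concise(inputs: dict, outputs: dict) -> bool:
    # Ask a judge model whether the answer is concise; returns True/False as the score.
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    "Is the following answer concise? Reply YES or NO only.\n\n"
                    f"Question: {inputs['question']}\nAnswer: {outputs['answer']}"
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")


evaluate(
    lambda inputs: {"answer": "4"},  # placeholder target
    data="my-dataset",               # placeholder dataset name
    evaluators=[concise],
)
```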
[How to run an evaluation locally (beta, Python only) | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/local): LLM should read this page when learning how to run evaluations locally without uploading results to LangSmith, when testing prompts quickly, or when validating target/evaluator functions. The page explains how to use the Python SDK's evaluate() function with upload_results=False parameter to run evaluations locally, with an example showing how to define evaluators, test a chatbot on a dataset, and analyze results locally with pandas.
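A minimal sketch of the local-only flow, assuming the beta `upload_results` flag and the `to_pandas()` helper on the results object in a recent `langsmith` SDK; the dataset name and target are placeholders.

```python
from langsmith import evaluate


def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]


# upload_results=False keeps everything local: no experiment or feedback is
# written to LangSmith.
results = evaluate(
    lambda inputs: {"answer": "4"},  # placeholder target
    data="my-dataset",               # placeholder dataset name
    evaluators=[correct],
    upload_results=False,
)

# Inspect the results locally with pandas.
df = results.to_pandas()
print(df.head())
```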
[How to manage datasets in the UI | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_in_application): LLM should read this page when needing to create or manage datasets in LangSmith UI, looking for ways to add examples to datasets, or wanting to organize evaluation data. This guide covers dataset creation (from CSV or empty), adding examples (from traced runs, annotation queues, UI input, or synthetic generation), exporting datasets, creating dataset splits, editing example metadata, and filtering examples in the LangSmith UI.
[How to manage datasets programmatically | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_programmatically): LLM should read this page when needing to programmatically create, manage, query, or update datasets in LangSmith, or when working with dataset examples in Python or TypeScript. This page documents how to use the LangSmith SDK to: create datasets from lists, traces, CSV files, or pandas DataFrames; fetch and filter datasets; query examples using various criteria; and update examples individually or in bulk.
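A short sketch of programmatic dataset management with the Python SDK: create a dataset, bulk-add examples, and read them back. The dataset name, description, and example contents are placeholders.

```python
from langsmith import Client

client = Client()

# Create a dataset and add examples in bulk.
dataset = client.create_dataset(
    dataset_name="faq-dataset",
    description="FAQ questions with reference answers",
)
client.create_examples(
    inputs=[
        {"question": "What is LangSmith?"},
        {"question": "What is an evaluator?"},
    ],
    outputs=[
        {"answer": "An observability and evaluation platform for LLM apps."},
        {"answer": "A function that scores application outputs."},
    ],
    dataset_id=dataset.id,
)

# Fetch examples back, e.g. to inspect or update them.
for example in client.list_examples(dataset_name="faq-dataset"):
    print(example.inputs, example.outputs)
```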
[How to return categorical vs numerical metrics | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/metric_type): LLM should read this page when implementing evaluators that need to distinguish between metric types, when creating custom metrics in LangSmith, or when configuring how metric results are displayed. This page explains how to return categorical vs numerical metrics in LangSmith custom evaluators, showing code examples in Python and TypeScript for both metric types and their proper return formats.
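A sketch of the two return shapes, assuming the dict convention where `"value"` marks a categorical metric and `"score"` marks a numerical one; the metric names and scoring logic are illustrative.

```python
def tone(outputs: dict) -> dict:
    # Categorical metric: return a "value"; shown as a label in the UI.
    label = "formal" if "please" in outputs["answer"].lower() else "casual"
    return {"key": "tone", "value": label}


def brevity(outputs: dict) -> dict:
    # Numerical metric: return a "score"; aggregated as a number in the UI.
    return {"key": "brevity", "score": 1.0 / (1 + len(outputs["answer"].split()))}
```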
[How to return multiple scores in one evaluator | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/multiple_scores): LLM should read this page when implementing multiple evaluation metrics in a single evaluator, needing to optimize cost/time with LLM judges, or working with custom evaluators in LangSmith. The page explains how to return multiple scores from a single evaluator function in both Python and TypeScript, with code examples showing how to structure the return value as a list of dictionaries containing metric names and scores.
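A sketch of the multi-score return shape, assuming the `{"results": [...]}` convention; the scoring logic is a placeholder (a real evaluator might call one LLM judge and extract several scores from its single response to save cost).

```python
def rag_quality(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Return several named metrics from one evaluator call.
    return {
        "results": [
            {
                "key": "correctness",
                "score": float(outputs["answer"] == reference_outputs["answer"]),
            },
            {
                "key": "answer_length",
                "score": len(outputs["answer"].split()),
            },
        ]
    }
```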
[How to use prebuilt evaluators | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/prebuilt_evaluators): LLM should read this page when needing to use prebuilt evaluators in LangSmith, implementing LLM-as-a-judge for evaluation, or integrating evaluation into testing frameworks. This page explains how to use the openevals package for ready-made LangSmith evaluators, covering setup requirements, implementation with Python and TypeScript, and integration with pytest/Vitest for running evaluations that automatically log results as feedback.
[How to run evals with pytest (beta) | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest): LLM should read this page when setting up evaluation with pytest, integrating LangSmith with Python testing, or implementing test caching for LLM applications. The page explains how to use LangSmith's pytest plugin for evaluation, covering installation, test definition, logging inputs/outputs/feedback, test suite organization, caching requests, rich terminal outputs, parameterization, and assertion utilities like the expect API.
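A hedged sketch of the beta pytest integration, assuming a recent `langsmith` SDK that ships the `pytest.mark.langsmith` marker and the `langsmith.testing` logging helpers; the test content is a trivial placeholder.

```python
import pytest
from langsmith import testing as t


@pytest.mark.langsmith  # logs this test as a run in a LangSmith experiment
def test_addition_answer():
    question = "What is 2 + 2?"
    t.log_inputs({"question": question})

    answer = "4"  # placeholder for a real application call
    t.log_outputs({"answer": answer})

    # Standard pytest assertions still determine pass/fail.
    assert answer == "4"
```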
[How to handle model rate limits | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/rate_limiting): LLM should read this page when handling rate limit errors in LangSmith evaluations, implementing throttling for LLM API calls, or optimizing concurrent model requests. This page covers three main approaches to handling model rate limits: using langchain RateLimiters to control request frequency, implementing retrying with exponential backoff, and limiting max_concurrency to reduce parallel API calls.
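A sketch of the rate-limiter approach, assuming `langchain-core`'s `InMemoryRateLimiter` and a chat model that accepts the `rate_limiter` parameter (available in recent `langchain-core`/`langchain-openai` versions); the request rate and model are placeholder choices.

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

# Allow roughly 5 requests per second across the evaluation run.
rate_limiter = InMemoryRateLimiter(
    requests_per_second=5,
    check_every_n_seconds=0.1,
    max_bucket_size=10,
)

llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)

# A complementary lever is capping parallelism in the evaluation itself, e.g.:
# evaluate(target, data="my-dataset", evaluators=[...], max_concurrency=2)
```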
[Renaming an experiment | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/renaming_experiment): LLM should read this page when needing to rename LangSmith experiments, understanding experiment name management, or finding UI navigation options for experiment renaming. This page explains two methods for renaming experiments in LangSmith: using the Playground interface where you can edit names in the table header, and using the pencil icon in the Experiments view. Experiment names must be unique per workspace.
[How to evaluate with repetitions | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/repetition): LLM should read this page when configuring experiment repetitions, analyzing non-deterministic LLM outputs, or interpreting repetition results in LangSmith. This page explains how to configure multiple repetitions for evaluations to reduce noise in non-deterministic systems, set the num_repetitions parameter in evaluate functions, and view repetition results including averages and standard deviations in the LangSmith UI.
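A one-line illustration of the `num_repetitions` parameter; the dataset name and target are placeholders.

```python
from langsmith import evaluate

# Run each example 3 times to average out non-determinism; the UI then shows
# per-example means and standard deviations across repetitions.
evaluate(
    lambda inputs: {"answer": "4"},  # placeholder target
    data="my-dataset",               # placeholder dataset name
    evaluators=[],
    num_repetitions=3,
)
```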
[How to use the REST API | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/run_evals_api_only): LLM should read this page when using the LangSmith REST API directly, implementing evaluations without SDKs, or working in a non-Python/JavaScript environment. The page explains how to use the LangSmith REST API to create datasets, run evaluations, and implement pairwise experiments, with complete code examples using Python's requests library.
[How to run an evaluation from the prompt playground | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/run_evaluation_from_prompt_playground): LLM should read this page when running evaluations from the LangSmith prompt playground, testing prompts across multiple inputs without coding, or understanding how to create experiments in LangSmith UI. This page explains how to run evaluations using LangSmith's prompt playground: navigating to the playground, switching to a dataset, starting an experiment, viewing results, and adding evaluation scores by binding evaluators to datasets or using the SDK programmatically.
[Set up feedback criteria | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/set_up_feedback_criteria): LLM should read this page when setting up feedback systems in LangSmith, configuring evaluation criteria, or creating custom feedback tags. This page explains how to create and configure feedback criteria in LangSmith, covering both continuous feedback (with min/max numerical values) and categorical feedback (with predefined categories mapped to scores).
[How to share or unshare a dataset publicly | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/share_dataset): LLM should read this page when needing to share datasets publicly, managing dataset access permissions, or understanding public sharing implications in LangSmith. The page explains how to share and unshare datasets publicly in LangSmith, including using the Share button, accessing shared datasets via links, viewing permissions for shared datasets, and methods to unshare datasets through the UI.
[How to define a summary evaluator | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/summary): LLM should read this page when needing to create evaluators that run across entire experiments, designing summary metrics like F1 scores, or implementing experiment-level evaluation metrics. This page explains how to define summary evaluators in LangSmith for metrics that operate on entire experiments rather than individual runs, including function signature requirements, available arguments, and output formats with Python and TypeScript examples.
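A sketch of an experiment-level metric, assuming a recent `langsmith` SDK where summary evaluators can take list-valued `outputs`/`reference_outputs` arguments and are passed via `summary_evaluators`; the metric, dataset name, and target are placeholders.

```python
from langsmith import evaluate


def pass_rate(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    # Runs once over the whole experiment rather than once per example.
    correct = sum(
        o["answer"] == r["answer"] for o, r in zip(outputs, reference_outputs)
    )
    return {"key": "pass_rate", "score": correct / len(outputs)}


evaluate(
    lambda inputs: {"answer": "4"},  # placeholder target
    data="my-dataset",               # placeholder dataset name
    evaluators=[],
    summary_evaluators=[pass_rate],
)
```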
[How to upload experiments run outside of LangSmith with the REST API | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/upload_existing_experiments): LLM should read this page when needing to upload experiments to LangSmith that were run outside the platform or when integrating external evaluation systems with LangSmith's visualization capabilities. This page explains how to use the REST API to upload externally-run experiments to LangSmith, including the request body schema, key considerations for dataset management, a working Python example using the requests library, and instructions for viewing uploaded experiments in the LangSmith UI.
[How to use off-the-shelf evaluators (Python only) | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/use_langchain_off_the_shelf_evaluators_old): LLM should read this page when needing to use pre-built evaluators in LangSmith with Python, understanding evaluation options for different types of responses, or configuring evaluator parameters. This page documents how to use LangChain's off-the-shelf evaluators in LangSmith including QA evaluators, criteria evaluators, labeled criteria evaluators, string/embedding distance metrics, using custom LLMs for evaluation, and handling multiple input/output fields.
[How to version a dataset | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/version_datasets): LLM should read this page when needing to understand dataset versioning in LangSmith, managing dataset history, or tagging specific dataset versions. This page explains how LangSmith automatically creates new dataset versions when examples are added/updated/deleted, shows how to view past versions, and demonstrates how to tag versions with human-readable names like "prod" through both the UI and Python SDK.
[How to run evals with Vitest/Jest (beta) | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest): LLM should read this page when implementing testing for LLM applications in JavaScript/TypeScript or when setting up automated evaluation of language models with Vitest/Jest. This page provides a comprehensive guide to running evaluations with Vitest/Jest in LangSmith, including setup instructions for both frameworks, defining test suites, logging outputs, tracing feedback, parameterizing tests across multiple examples, and configuring test behavior with options for skipping, focusing, and dry-run mode.
[Evaluation tutorials | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/tutorials): LLM should read this page when seeking evaluation tutorials for LLM applications, learning about assessment methodologies, or finding guides for testing specific AI systems. This page provides a collection of LangSmith evaluation tutorials covering how to evaluate chatbots, RAG applications, complex agents, ReAct agents with testing frameworks, and run backtests on agent versions.
[Evaluate a complex agent | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/tutorials/agents): LLM should read this page when evaluating complex agents, understanding agent evaluation techniques, or implementing evaluation for multi-step reasoning systems. This page demonstrates how to build and evaluate a complex customer support agent with LangSmith, covering three key evaluation types: final response evaluation, trajectory evaluation, and single-step evaluation, with code examples for each approach.
[Run backtests on a new version of an agent | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/tutorials/backtesting): LLM should read this page when testing new versions of AI agents, setting up backtesting workflows, or comparing model performance against historical data. This page explains how to run backtests on new agent versions using LangSmith, covering the process of converting production traces to datasets, defining evaluators, benchmarking systems, and analyzing comparative results to identify improvement opportunities before deployment.
[Evaluate a chatbot | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/tutorials/evaluation): LLM should read this page when learning how to evaluate chatbots with LangSmith, setting up automated evaluation systems, or comparing model performances. This page provides a complete tutorial on evaluating a chatbot using LangSmith, covering dataset creation, metric definition, running evaluations with different models/prompts, comparing results, tracking performance over time, and setting up automated testing in CI/CD pipelines.
[Evaluate a RAG application | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/tutorials/rag): LLM should read this page when evaluating RAG applications, implementing metrics for RAG evaluation, or designing comprehensive evaluation frameworks for retrieval-based systems. This tutorial demonstrates how to evaluate RAG applications using LangSmith, covering dataset creation, running evaluations, and implementing four key evaluation metrics: correctness (answer vs reference), relevance (response vs input), groundedness (response vs retrieved docs), and retrieval relevance (retrieved docs vs input).
[Running SWE-bench with LangSmith | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/tutorials/swe-benchmark): LLM should read this page when evaluating code agents on SWE-bench, implementing benchmarking for code generation, or setting up automated evaluation of coding tasks. This page provides a complete tutorial for running the SWE-bench benchmark with LangSmith, covering dataset loading, uploading to LangSmith, running a prediction function, evaluating code patches in Docker containers, and sending evaluation results back to LangSmith for analysis.
[Test a ReAct agent with Pytest/Vitest and LangSmith | 🦜️🛠️ LangSmith](https://docs.smith.langchain.com/evaluation/tutorials/testing): LLM should read this page when testing a ReAct agent using automated testing frameworks or integrating LangSmith with Pytest/Vitest/Jest. This page explains how to test LLM applications using LangSmith with popular testing frameworks, including setting up the environment, creating a stock information agent with tools, writing comprehensive tests for tool usage accuracy, and evaluating response groundedness with LLM-as-a-judge.