LangSmith (by LangChain) and LangFuse are powerful observability and analytics tools for LLM workflows. You can use them to track, analyze, and improve model performance, and even synthesize better training data using real user interactions.
This guide covers:
- How to connect LangSmith or LangFuse
- How to log successful Q&A pairs
- How to use logged data for fine-tuning or RAG
- Sample integration script
- Dashboard template ideas
- End-to-end feedback loop
- To connect LangSmith, install LangChain and the LangSmith SDK:

  ```bash
  pip install langchain langsmith
  ```
- Set environment variables:

  ```bash
  export LANGCHAIN_TRACING_V2=true
  export LANGCHAIN_API_KEY="your_langsmith_api_key"
  export LANGCHAIN_PROJECT="your_project_name"
  ```
- Wrap your chains:

  ```python
  from langsmith import traceable

  @traceable(name="user_question")
  def answer_question(prompt):
      return llm(prompt)  # llm is your model or chain callable
  ```
- To connect LangFuse, install its SDK:

  ```bash
  npm install langfuse
  ```
- Use their Node/TS SDK in your backend logic:

  ```ts
  import { Langfuse } from "langfuse";

  const langfuse = new Langfuse({ publicKey: '...', secretKey: '...' });

  // Record one trace per user query; userInput/modelOutput are your own values.
  langfuse.trace({
    name: "query",
    input: userInput,
    output: modelOutput,
  });
  ```
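If your backend is in Python instead, LangFuse also ships a Python SDK. A minimal sketch along the same lines (v2-style API; the credentials and input/output values below are placeholders):

```python
# pip install langfuse
from langfuse import Langfuse

# Placeholder credentials; use your own keys and host.
langfuse = Langfuse(
    public_key="your_public_key",
    secret_key="your_secret_key",
    host="https://cloud.langfuse.com",
)

# Record one trace per user query.
langfuse.trace(
    name="query",
    input="What are the store hours?",
    output="We are open 9am-6pm, Monday through Saturday.",
)

# Events are buffered; flush before the process exits.
langfuse.flush()
```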
You can tag data based on feedback:
- Use metadata:
  ```python
  trace_metadata = {
      "user_feedback": "👍",
      "task_type": "qa",
      "user_id": "123",
  }
  ```
- Mark high-quality interactions with tags like `success=true`
- Use feedback tagging or custom properties:
  ```ts
  langfuse.trace({
    name: "answer_check",
    input: question,
    output: answer,
    metadata: { user_score: 5, was_helpful: true },
  });
  ```
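On the LangSmith side, the same kind of feedback can be attached to a traced run through the SDK's feedback API. A minimal sketch, assuming you already have the run's ID from your tracing context:

```python
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# run_id is a hypothetical placeholder; obtain it from your tracing context.
client.create_feedback(
    run_id="your-run-id",
    key="user_score",
    score=1,                      # e.g. 1 for thumbs-up, 0 for thumbs-down
    comment="Answer was helpful",
)
```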
Once you've tracked a set of high-quality examples:
- Export from LangSmith (UI or API); a minimal export sketch follows this list:
  - Filter by tag: `success=true`
  - Extract prompt/response pairs
- Format for fine-tuning:

  ```json
  { "prompt": "How do I reset my password?", "completion": "Go to settings and click 'Reset Password'." }
  ```

- Use for supervised fine-tuning or to enrich your RAG index.
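Here is that minimal export sketch. It assumes runs were tagged `success=true`, that your SDK version supports this filter syntax, and that the input/output key names match how your traced function was called; adjust to your own schema:

```python
import json
from langsmith import Client

client = Client()

# Pull runs from the project that were tagged as successful.
# The filter string and key names below are assumptions.
runs = client.list_runs(
    project_name="your_project_name",
    filter='has(tags, "success=true")',
)

# Write prompt/completion pairs in a fine-tuning-friendly JSONL format.
with open("finetune_dataset.jsonl", "w") as f:
    for run in runs:
        prompt = (run.inputs or {}).get("prompt") or (run.inputs or {}).get("message")
        completion = (run.outputs or {}).get("output")
        if prompt and completion:
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```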
A sample LangSmith integration script:

```python
from langchain.chat_models import ChatOpenAI
from langsmith import traceable

chat = ChatOpenAI()

# Every call to this function is traced to your LangSmith project.
@traceable(name="chat_response")
def respond_to_user(message: str):
    return chat.predict(message)

# Example usage
response = respond_to_user("What are the store hours?")
print(response)
```
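To make the export step easier, tags and metadata can be attached at trace time; `@traceable` accepts `tags` and `metadata` keyword arguments. A variant of the script above (the tag and metadata names are illustrative):

```python
from langchain.chat_models import ChatOpenAI
from langsmith import traceable

chat = ChatOpenAI()

# Tags and metadata set here appear on the run in LangSmith and can later
# serve as export filters (tag/metadata names below are illustrative).
@traceable(name="chat_response", tags=["qa"], metadata={"task_type": "qa"})
def respond_to_user(message: str):
    return chat.predict(message)
```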
A good dashboard might include:
Success Rate Panel
- Total requests
- Percentage of successful/approved answers
Latency Panel
- Average response time
- P95 response time
Feedback Panel
- Most common 👍 / 👎 reasons
- Histogram of user ratings
Top Queries Table
- Query text
- User score or outcome
- Timestamp + tags
Drop-off Funnel
- Steps from initial query to completion or re-ask
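Most of these panels boil down to simple aggregations over exported run records. A rough sketch of the success-rate and latency numbers, assuming each exported record carries a latency value and a success flag (the field names are hypothetical):

```python
import statistics

# Illustrative records; in practice these come from your
# LangSmith/LangFuse export (field names are assumptions).
runs = [
    {"latency_s": 1.2, "success": True},
    {"latency_s": 0.8, "success": True},
    {"latency_s": 3.5, "success": False},
]

total = len(runs)
success_rate = sum(r["success"] for r in runs) / total
latencies = sorted(r["latency_s"] for r in runs)

# P95: the value below which 95% of response times fall.
p95 = latencies[min(int(0.95 * total), total - 1)]

print(f"Total requests: {total}")
print(f"Success rate: {success_rate:.0%}")
print(f"Average latency: {statistics.mean(latencies):.2f}s")
print(f"P95 latency: {p95:.2f}s")
```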
To close the loop from user interaction to model improvement:
- Capture Live Interactions using LangSmith or LangFuse.
- Tag & Rate Responses via:
  - Thumbs up/down buttons
  - 1–5 star rating
  - Free-text user comments
- Log & Store Data into:
  - LangSmith's project dashboard
  - LangFuse logs with metadata
- Aggregate Insights through analytics dashboards.
- Export High-Quality Pairs and curate a fine-tuning dataset.
- Retrain or Augment your model or RAG index.
- Monitor Improvements over time via version comparison.
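For the "Retrain or Augment" step, curated pairs can also be folded into a RAG index rather than (or alongside) a fine-tune. A minimal sketch using LangChain's FAISS integration, assuming the JSONL file produced by the export sketch earlier (import paths vary across LangChain versions):

```python
import json

# Requires the langchain-community, langchain-openai, and faiss-cpu packages;
# import paths are for recent LangChain releases and may differ in yours.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load the curated Q&A pairs exported earlier.
with open("finetune_dataset.jsonl") as f:
    pairs = [json.loads(line) for line in f]

# Index each answer together with its question so retrieval surfaces proven answers.
texts = [f"Q: {p['prompt']}\nA: {p['completion']}" for p in pairs]
index = FAISS.from_texts(texts, OpenAIEmbeddings())
index.save_local("curated_qa_index")
```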
Optional additions:
- Slack or Discord bots to request feedback inline
- Airtable or Notion databases for tagging, editing, and review
- Auto-notify the team when accuracy drops
- Track latency, response quality, fallback usage.
- Identify high dropout paths or frequent re-asks.
- Use heatmaps or feedback tags to guide your iteration roadmap.
LangSmith and LangFuse allow you to go beyond black-box LLM usage. With traceable observability and feedback tagging, you can:
- Identify what works
- Create datasets from real usage
- Improve LLM accuracy through continuous learning
Use observability as a data refinery to evolve smarter models.