@decagondev
Created June 30, 2025 16:06

Multi-Agent System Loop Control and Output Evaluation in LangGraph

  1. Can we determine if/when a multi-agent system will loop around (other than through an observability platform), and can a supervisor node reject output until it has gone through a specified number of loops?
  2. Is there a way to measure if and how much better or worse the output is given different recursion levels?

Determining Looping and Supervisor Control in Multi-Agent Systems

Can We Know If/When a Multi-Agent System Will Loop (Without an Observability Platform)?

Yes, it is possible to determine if and when a multi-agent system in LangGraph will loop without relying solely on an observability platform (e.g., LangSmith tracing enabled via LANGCHAIN_TRACING_V2). This can be achieved through the design of the graph’s structure, state management, and conditional edge logic, as demonstrated in the notebook and supported by the class materials.

  • Graph Structure and Conditional Edges:

    • LangGraph’s conditional edges, managed by a router function (notebook: router), control whether the workflow loops back to another agent or proceeds to termination (__end__). The router function examines the state (e.g., messages, tool calls, or “FINAL ANSWER”) to decide the next step.
    • Notebook Example: The router function checks:
      • If last_message.tool_calls exists, it routes to call_tool.
      • If "FINAL ANSWER" is in the message, it routes to __end__.
      • Otherwise, it continues to another agent (e.g., Researcher to chart_generator or vice versa).
    • Determining Loops: By inspecting the graph’s conditional edge logic and state, you can predict looping behavior. For example, in the notebook, looping occurs when the chart_generator needs more data (no “FINAL ANSWER”) or the Researcher makes additional tool calls, creating a cycle between Researcher, chart_generator, and call_tool.
    • Class Context: The slides emphasize conditional edges (“Edges can be… conditional”) and the supervisor concept, which can enforce looping based on state criteria, aligning with the notebook’s dynamic routing.
  • State Inspection:

    • The state dictionary (notebook: AgentState with messages and sender) tracks the workflow’s progress. You can analyze the state at any node to determine if conditions for looping are met (e.g., incomplete data or missing “FINAL ANSWER”).
    • Example: If the chart_generator’s output lacks sufficient data, the router routes back to Researcher, indicating a loop. This can be checked programmatically by examining the state’s messages or sender.
  • Explicit Loop Conditions:

    • You can design the graph to include explicit conditions for looping, such as an iteration counter in the state. For instance, the class’s hands-on objective (“rewrites the report n number of times”) implies a loop controlled by a counter or quality criteria.
    • Implementation: Add an iteration field to the state and check it in the router or a supervisor node to enforce looping until a condition is met.
  • Without Observability Platform:

    • While observability platforms (e.g., LangChain tracing) provide runtime insights, you can predict looping statically by:
      • Reviewing the graph’s structure (nodes and edges).
      • Analyzing the router function’s logic (notebook: def router(state)).
      • Adding logging or print statements in nodes to inspect state transitions (e.g., print(state["messages"])).
    • Example: In the notebook, you can add debug prints in the router to log when it routes to Researcher, chart_generator, or call_tool, revealing loop patterns; a minimal router sketch follows this list.
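
As a concrete illustration, here is a minimal sketch of a notebook-style router extended with an iteration counter and a debug print; the MAX_ITERATIONS cap and the print statement are additions for illustration, not part of the notebook:

    from typing import Literal

    MAX_ITERATIONS = 5  # illustrative hard cap on loops, read from the state below

    def router(state) -> Literal["call_tool", "__end__", "continue"]:
        # Inspect the shared state to predict the next hop -- no tracing platform needed
        last_message = state["messages"][-1]
        print(f"[router] sender={state['sender']}, iteration={state.get('iteration', 0)}")  # debug print
        if state.get("iteration", 0) >= MAX_ITERATIONS:
            return "__end__"  # explicit loop bound carried in the state
        if last_message.tool_calls:
            return "call_tool"  # the agent requested a tool, so the workflow loops through call_tool
        if "FINAL ANSWER" in last_message.content:
            return "__end__"  # termination condition reached
        return "continue"  # hand off to the other agent, i.e., another pass of the loop

Reading this function top to bottom tells you exactly which state conditions produce a loop and which terminate it, which is the static analysis described above.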

Can a Supervisor Node Reject Output Until X Loops?

Yes, a supervisor node can reject output and enforce looping until a specified number of iterations (x) are completed, leveraging LangGraph’s state and conditional edges. This aligns with the class’s supervisor concept (slides: “The supervisor concept”) and the iterative nature of the hands-on objective.

  • Supervisor Node Role:

    • The supervisor node evaluates the state and decides whether to continue looping or terminate. In the notebook, the router acts as a pseudo-supervisor by checking for “FINAL ANSWER” or tool calls, but a dedicated supervisor node can enforce stricter loop control.
    • Implementation:
      • Add an iteration counter to the state (e.g., extend AgentState with iteration: int).
      • Create a supervisor node that checks the iteration count and rejects output (routes back to another node) until x loops are completed.
      • Example Code:
        import operator
        from typing import Annotated, Literal, Sequence, TypedDict
        from langchain_core.messages import AIMessage, BaseMessage

        class AgentState(TypedDict):
            messages: Annotated[Sequence[BaseMessage], operator.add]
            sender: str
            iteration: int  # Track loop count

        def supervisor_node(state):
            # Return the incremented counter in the update dict (rather than
            # mutating the state in place) so LangGraph persists it
            iteration = state.get("iteration", 0) + 1
            max_loops = 3  # Example: enforce 3 loops
            if iteration < max_loops:
                return {"messages": [AIMessage(content="Continue looping", name="Supervisor")], "sender": "Supervisor", "iteration": iteration}
            return {"messages": [AIMessage(content="FINAL ANSWER", name="Supervisor")], "sender": "Supervisor", "iteration": iteration}
        
        def supervisor_router(state) -> Literal["Researcher", "__end__"]:
            # Terminate only once the supervisor has emitted "FINAL ANSWER"
            if state["messages"][-1].content == "FINAL ANSWER":
                return "__end__"
            return "Researcher"  # Loop back to Researcher for another pass
        
        workflow.add_node("supervisor", supervisor_node)
        workflow.add_conditional_edges("supervisor", supervisor_router, {"Researcher": "Researcher", "__end__": END})
    • Behavior: The supervisor increments iteration each time it’s called and routes back to Researcher until iteration >= max_loops, then outputs “FINAL ANSWER” to terminate.
  • Class Context:

    • The hands-on objective (“rewrites the report n number of times”) directly supports this approach, as it implies a supervisor enforcing a fixed number of iterations.
    • The slides’ “supervisor concept” suggests a node that orchestrates workflow, which can include rejecting outputs until criteria (e.g., x loops) are met.
  • Practical Considerations:

    • State Management: Ensure the state includes an iteration field or similar to track loops.
    • Termination Conditions: Combine loop counts with quality checks (e.g., feedback scores) to avoid unnecessary iterations; a compile-and-run sketch with a hard recursion cap follows this list.
    • Debugging: Use logging or tracing (README: LANGCHAIN_TRACING_V2) to verify the supervisor’s loop enforcement.
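
As a backstop for these termination conditions, the compiled graph can also be invoked with LangGraph’s recursion_limit, which caps the total number of graph steps regardless of the supervisor’s own logic. A minimal sketch, assuming the supervisor workflow above has also had its agent nodes and entry point registered as in the notebook; the starting message and limit value are illustrative:

    from langchain_core.messages import HumanMessage

    graph = workflow.compile()
    # recursion_limit caps total graph steps; LangGraph stops with an error if it is
    # exceeded, so a runaway loop fails fast even if the supervisor never emits "FINAL ANSWER".
    initial_state = {"messages": [HumanMessage(content="Write the report.")], "sender": "user", "iteration": 0}
    result = graph.invoke(initial_state, config={"recursion_limit": 25})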

Measuring Output Quality Across Recursion Levels

Is There a Way to Measure If and How Much Better or Worse the Output Is Given Different Recursion Levels?

Yes, you can measure whether and how much the output improves or worsens across recursion levels in a LangGraph multi-agent system by incorporating evaluation mechanisms into the graph, such as quality metrics, human feedback, or automated scoring. This is particularly relevant for iterative workflows like the class’s report-writing task or the notebook’s charting task.

Approaches to Measure Output Quality

  1. Automated Quality Metrics:

    • Method: Define a node to evaluate output quality using metrics like completeness, accuracy, or clarity. For example:
      • For the notebook’s charting task, a node could check if the chart includes all requested data (e.g., 25 years of GDP) or meets formatting requirements (e.g., red bars).
      • For the class’s report-writing task, a node could use an LLM to score the report’s coherence or factual accuracy.
    • Implementation:
      • Add an evaluation_node that uses an LLM to score the output (e.g., ChatOpenAI to rate the chart or report on a scale of 1-10).
      • Store scores in the state (e.g., state["scores"] = [score1, score2, ...]) to track quality across iterations.
      • Example Code:
        def evaluation_node(state):
            last_output = state["messages"][-1].content
            prompt = f"Rate the quality of this output from 1 to 10 for completeness and clarity. Reply with a single integer only: {last_output}"
            score = int(model.invoke(prompt).content.strip())  # assumes the model follows the integer-only instruction
            # Return the appended score list in the update dict so LangGraph persists it
            return {"messages": [AIMessage(content=f"Quality score: {score}", name="Evaluator")], "sender": "Evaluator", "scores": state.get("scores", []) + [score]}
    • Analysis: Compare state["scores"] across iterations to assess improvement (e.g., increasing scores) or degradation.
  2. Human-in-the-Loop Evaluation:

    • Method: As discussed in a prior answer, a human-in-the-loop node can collect feedback on output quality (e.g., “Is the chart clear?” or “Does the report need revision?”).
    • Implementation:
      • Add a human_review_node (see prior answer) to prompt users for a quality score or qualitative feedback (e.g., “Rate 1-10” or “Needs more data”); a minimal sketch appears after this list.
      • Store feedback in the state (e.g., state["human_scores"] or state["feedback"]).
      • Example: In the notebook, after the chart_generator produces a chart, a human node could rate its quality, and the state tracks scores across loops.
    • Class Context: Aligns with the class’s iterative objective, where human feedback could replace automated feedback in the report-writing task.
  3. Comparison Against a Baseline:

    • Method: Compare outputs at different recursion levels against a reference (e.g., a ground-truth dataset for GDP or a sample report).
    • Implementation:
      • Add a node to compute similarity metrics (e.g., cosine similarity for text outputs or data overlap for charts) between the current output and a baseline.
      • Store metrics in the state to track changes over iterations.
      • Example: For the notebook’s GDP chart, compare the chart’s data points to a known GDP dataset to measure accuracy (see the baseline-comparison sketch after this list).
  4. Supervisor Node with Quality Threshold:

    • Method: The supervisor node can reject outputs until a quality threshold is met, in addition to loop counts. This combines loop control with quality evaluation.
    • Implementation:
      • Extend the supervisor node to check both iteration and a quality score (from an evaluation node or human feedback).
      • Example Code:
        def supervisor_node(state):
            iteration = state.get("iteration", 0) + 1
            quality_score = state["scores"][-1] if state.get("scores") else 0
            # Keep looping until at least 3 loops have run AND the latest score is >= 8
            if iteration < 3 or quality_score < 8:
                return {"messages": [AIMessage(content="Continue looping", name="Supervisor")], "sender": "Supervisor", "iteration": iteration}
            return {"messages": [AIMessage(content="FINAL ANSWER", name="Supervisor")], "sender": "Supervisor", "iteration": iteration}
    • Analysis: Track state["scores"] to determine if quality improves with more loops (e.g., higher scores) or plateaus.
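
For item 2 above, a minimal human_review_node sketch, assuming a console workflow where input() stands in for a real review UI and the state has been extended with human_scores and feedback list fields:

    from langchain_core.messages import HumanMessage

    def human_review_node(state):
        last_output = state["messages"][-1].content
        print(f"Latest output:\n{last_output}")
        score = int(input("Rate this output 1-10: "))  # blocking console prompt (illustrative)
        comment = input("Feedback (e.g., 'needs more data'): ")
        return {
            "messages": [HumanMessage(content=f"Human score: {score}. {comment}", name="Human")],
            "sender": "Human",
            "human_scores": state.get("human_scores", []) + [score],  # assumes human_scores in the state schema
            "feedback": state.get("feedback", []) + [comment],        # assumes feedback in the state schema
        }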

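For item 3 above, a small baseline-comparison sketch; load_reference_gdp and extract_gdp_values are hypothetical helpers standing in for however the ground-truth data and the agent’s output are actually exposed, and baseline_accuracy is an assumed extra state field:

    def baseline_comparison_node(state):
        reference_gdp = load_reference_gdp()  # hypothetical helper returning {year: value}
        produced = extract_gdp_values(state["messages"][-1].content)  # hypothetical parser of the agent output
        # Fraction of reference years reproduced within 1% of the true value
        matches = sum(
            1 for year, value in reference_gdp.items()
            if year in produced and abs(produced[year] - value) <= 0.01 * abs(value)
        )
        accuracy = matches / len(reference_gdp)
        return {"baseline_accuracy": state.get("baseline_accuracy", []) + [accuracy]}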
Measuring Improvement or Degradation

  • Quantitative Metrics:
    • Scores: Plot or compare scores (e.g., state["scores"]) across iterations to identify trends (e.g., increasing scores indicate improvement).
    • Error Rates: For the notebook’s GDP task, measure data accuracy (e.g., percentage of correct GDP values) per iteration.
    • Convergence: Check if scores stabilize (e.g., minimal change after 3 loops), indicating diminishing returns.
  • Qualitative Feedback:
    • Use human feedback (e.g., “more readable” or “incomplete data”) to assess subjective improvements, stored in state["feedback"].
  • Statistical Analysis:
    • Compute metrics like mean, variance, or trend slopes for quality scores across iterations to quantify improvement or degradation (a small helper is sketched after this list).
    • Example: If state["scores"] = [5, 6, 8, 8], an upward trend suggests improvement, while a plateau at 8 suggests no further gains.
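
A small helper for the statistical analysis above; the trend slope comes from a simple linear fit and the plateau check from the last two scores (both thresholds are illustrative):

    import numpy as np

    def summarize_scores(scores):
        """Summarize quality scores collected across iterations."""
        slope = float(np.polyfit(range(len(scores)), scores, 1)[0]) if len(scores) > 1 else 0.0
        return {
            "mean": float(np.mean(scores)),
            "variance": float(np.var(scores)),
            "trend_slope": slope,  # > 0 suggests improvement across iterations
            "plateaued": len(scores) >= 2 and abs(scores[-1] - scores[-2]) < 1,  # little change in the last loop
        }

For the example above, summarize_scores([5, 6, 8, 8]) reports a positive slope with a plateau at the end, matching the “improvement, then diminishing returns” reading.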

Practical Considerations

  • State Design: Ensure the state includes fields for tracking iterations and quality metrics (e.g., iteration, scores, feedback).
  • Evaluation Node: Implement automated or human evaluation nodes to generate consistent metrics, avoiding subjective bias.
  • Termination Criteria: Combine loop counts with quality thresholds to balance iteration and output quality, preventing infinite loops.
  • Debugging: Use tracing (README: LANGCHAIN_TRACING_V2) or logging to monitor scores and state changes, helping analyze quality trends.
  • Class Context: The class’s iterative objective (“n number of times”) suggests tracking quality across loops, as in the report-writing task, where a supervisor could evaluate report quality per iteration.

Example Integration in the Notebook

  • Modified State:
    class AgentState(TypedDict):
        messages: Annotated[Sequence[BaseMessage], operator.add]
        sender: str
        iteration: int
        scores: List[int]
  • Workflow:
    • Add an evaluation_node to score the chart_generator’s output (e.g., chart completeness).
    • Modify the supervisor_node to reject outputs until iteration >= 3 or scores[-1] >= 8.
    • Track scores to compare quality across recursion levels (e.g., scores = [6, 7, 8] for 3 loops).
  • Analysis: After running graph.stream, print state["scores"] to assess whether quality improves (e.g., rising scores) or worsens (e.g., fluctuating scores); a full wiring sketch follows.
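
Putting these pieces together, a wiring sketch under the following assumptions: research_node and chart_node are the agent nodes defined in the notebook, the tool-calling edges are omitted for brevity, and the entry point, starting message, and recursion limit are illustrative:

    from langgraph.graph import StateGraph, END
    from langchain_core.messages import HumanMessage

    workflow = StateGraph(AgentState)
    workflow.add_node("Researcher", research_node)        # research agent from the notebook
    workflow.add_node("chart_generator", chart_node)      # chart agent from the notebook
    workflow.add_node("Evaluator", evaluation_node)
    workflow.add_node("supervisor", supervisor_node)

    workflow.add_edge("Researcher", "chart_generator")
    workflow.add_edge("chart_generator", "Evaluator")     # every chart is scored
    workflow.add_edge("Evaluator", "supervisor")          # supervisor sees iteration and the latest score
    workflow.add_conditional_edges(
        "supervisor", supervisor_router,
        {"Researcher": "Researcher", "__end__": END},     # reject -> research and redraw again
    )
    workflow.set_entry_point("Researcher")
    graph = workflow.compile()

    final_state = graph.invoke(
        {"messages": [HumanMessage(content="Chart 25 years of GDP.")],
         "sender": "user", "iteration": 0, "scores": []},
        config={"recursion_limit": 30},
    )
    print(final_state["scores"])  # e.g., [6, 7, 8] would indicate rising quality across three loops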

Conclusion

  1. Looping Determination: You can determine if/when a multi-agent system will loop by analyzing the graph’s conditional edges (e.g., notebook’s router) and state (e.g., iteration counter), without relying solely on observability platforms. A supervisor node can enforce looping until x iterations by tracking the state and using conditional edges, aligning with the class’s iterative objective.
  2. Output Quality Measurement: You can measure output quality across recursion levels using automated metrics (e.g., LLM scoring), human feedback, or baseline comparisons, storing results in the state. Trends in quality scores indicate improvement or degradation, supporting the class’s focus on iterative workflows.

These capabilities leverage LangGraph’s flexible state management and conditional routing, making it ideal for controlling and evaluating multi-agent systems, as seen in the notebook and class materials.
