@imaurer
Last active May 29, 2025 16:08

GitHub Issue: Enable Schema-Formatted Output for Tool-Using Chains

1. Problem:

Currently, the llm CLI's --schema option applies to the direct output of a single Language Model (LLM) call. When tools (via --tool or --functions) are used, the LLM engages in a multi-step chain (e.g., ReAct pattern) where intermediate outputs are tool call requests or textual reasoning. There's no direct way to specify that the final, user-visible result of such a multi-step, tool-using chain should conform to a user-defined schema. The existing --schema option doesn't automatically apply to the culmination of this chain.
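For illustration, the desired end state is something like the following invocation (hypothetical: the search_docs function, the concise schema string, and the prompt are made up for this example; --schema uses llm's concise schema syntax):

llm --functions '
def search_docs(q: str) -> str:
    "Stand-in for a real search tool; returns a canned result."
    return "llm documentation: https://llm.datasette.io/"
' \
  --schema 'name, url, summary' \
  'Find the documentation homepage for the llm project and summarize it'

The intent is that the final answer, produced after any tool calls, comes back as JSON with name, url, and summary fields; today --schema only constrains the model's direct output for a single call.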

2. Alternatives Considered:

  • A. New CLI Option: Introduce a distinct option (e.g., --final-schema or --output-schema) that specifies the schema of the final output after a tool chain. This would keep the existing --schema behavior for direct, single-turn schema output and make the post-chain formatting explicit.
  • B. Overload Existing --schema (Implicit Deferral): Modify the behavior of the existing --schema option. When tools are also specified, --schema would be implicitly understood to apply to the final output of the entire tool chain, requiring an additional LLM call at the end to format the accumulated information. If no tools are present, --schema behaves as it does currently.

3. Suggested Approach (from discussion):

The preferred approach discussed was Alternative B: Overload Existing --schema. The rationale is that from a user's perspective, if they ask for tools and a schema, their intent is for the final output of the entire operation to be schema-formatted. This approach aims for a simpler user interface by reusing the existing --schema flag, with llm handling the "when" of schema application.

4. Edge Cases and Challenges (with the suggested approach):

  • Clarity of Conditional Behavior: Users must understand that --schema behaves differently (becomes a post-processing step) when tools are active.
  • Context for Final Formatting: Determining the optimal context (full history vs. last utterance vs. summary) to feed into the final schema-formatting LLM call is crucial for quality and token efficiency.
  • Token Limits & Cost: The additional (+1) LLM call for formatting adds latency and cost, and the context for this call could be large.
  • Streaming UX: While the tool chain might stream intermediate thoughts/actions, the final schema-formatted output (especially JSON) will likely arrive as a single block.
  • Error Handling: Failures in the implicit final formatting step need to be clearly distinguishable from errors in the tool-using chain.
  • Debugging: Users may need a way to inspect the context passed to the implicit final formatting call if the output is unsatisfactory.
  • Prompt Engineering for Final Call: The internal prompt used by llm for the final formatting call is fixed; users cannot easily tune it.
  • chain_limit Interaction: If the tool chain is cut short by chain_limit, the context for final formatting might be incomplete or suboptimal.

5. Preferred Solution Attributes/Components (High-Level Design for Implicit Deferral):

This section outlines the conceptual components and areas of existing code that would need modification, using only existing llm class/method names where applicable.

  • Deferred Schema Application Logic:

    • Core Change Location: The primary logic for deferring and then applying the schema would reside within the llm.models.ChainResponse (and its asynchronous counterpart llm.models.AsyncChainResponse).
    • Detection: These classes would detect if both tools are active (via llm.models.Prompt.tools) and a schema is provided (passed from llm.cli.prompt to the llm.models.Model.chain method, and then to ChainResponse).
  • Chain Execution Modification:

    • Tool-Using Phase: The existing loop within ChainResponse that handles tool calls (iterating through Response objects, executing tools via Response.execute_tool_calls, and feeding results back) would run as usual, up to the chain_limit or until the model stops calling tools.
    • Context Aggregation: After the tool-using phase, ChainResponse would internally aggregate a comprehensive context string from all preceding llm.models.Response objects within the chain (including user prompts, system prompts, tool calls, tool results, and intermediate assistant textual outputs).
  • Final Formatting Call:

    • Trigger: If a schema was deferred, ChainResponse initiates a new, final LLM call.
    • Prompt Construction: An internal, standardized prompt will be constructed (e.g., "Based on the following interaction, format the final answer into the schema: {schema_description}. Interaction: {aggregated_context}").
    • Execution: This uses the existing llm.models.Model.prompt (or llm.models.Conversation.prompt) method of the same model instance used for the chain, but this time with the schema parameter set to the user's originally supplied (and deferred) schema. Tools will not be enabled for this specific formatting call.
  • Output Handling:

    • Streaming: If the original llm command requested streaming, the chunks from this final formatting call will be yielded. If not, its complete text (which should be the schema-formatted string) will be returned by ChainResponse.text().
    • Intermediate Output: Consideration is needed for how (and whether) intermediate thoughts/actions from the tool chain are streamed to the user when the final output is a single schema block. The current design suggests streaming intermediates if the final output is also streamed.
  • Logging Integration:

    • Additional Response: The final schema-formatting LLM call must be logged as an additional llm.models.Response object within the same llm.models.Conversation (sharing the conversation_id).
    • Distinction: The llm.models.Prompt for this logged formatting response should clearly reflect its purpose (e.g., its prompt.prompt text would be the internal formatting instruction).
  • CLI Layer (llm.cli.prompt function):

    • Option Passing: The llm.cli.prompt function will continue to parse the --schema option. This schema object will be passed to llm.models.Model.chain (or llm.models.Conversation.chain). The decision to use it immediately or defer it happens within the model layer based on whether tools are also active.
    • chain_limit: The existing --chain-limit applies to the tool-using phase. The final formatting call is an additional (+1) call, effectively outside this limit.

Simon asked on Discord:

Maybe could do it by having a bunch of earlier tool calls that just dump information into context and then a final reply that schemas it up?

Yes, Simon's suggestion is a very practical way to think about the flow and aligns well with the "implicit deferral" approach we've been discussing. Let's break down what that implies for the design and how it maps to the components.

Simon's Suggestion Deconstructed:

  • "A bunch of earlier tool calls that just dump information into context":

    • This accurately describes the ReAct or general tool-using loop. The model reasons, decides it needs information, calls a tool.
    • llm executes the tool.
    • The tool's output (the "dumped information") is fed back to the model as an "Observation" or ToolResult. This becomes part of the ongoing conversation history (the "context").
    • This loop repeats, accumulating information in the conversation history.
  • "and then a final reply that schemas it up":

    • This is the crucial "+1" step.
    • Once the model stops requesting tools (or chain_limit cuts the loop short), the information-gathering phase is complete and the model would naturally produce a final textual reply based on the accumulated context.
    • The modification here is: instead of taking that final textual reply directly, if a --schema was specified alongside tools, llm would intercept at this point. It would take the accumulated context (which includes all tool interactions and the model's intermediate thoughts/replies) and then issue a new, specific instruction to the LLM: "Now, take all this information and format it according to this schema."

How this aligns with the "Implicitly Deferred Schema" Design:

This is essentially the same core idea:

  1. Tool-Using Phase:

    • llm.models.ChainResponse manages the loop.
    • Each llm.models.Response within this phase logs its prompt, any tool calls it makes, and any textual output.
    • llm.models.Prompt objects for subsequent turns in the chain contain tool_results from previous steps.
    • This continues until chain_limit or the model provides a textual answer without requesting more tools.
  2. Context Aggregation (Implicit):

    • The llm.models.Conversation object (or the history maintained within ChainResponse) inherently stores this sequence of prompts, tool interactions, and textual replies. This is the "context" Simon refers to.
    • The _gather_context_for_final_formatting() method we designed earlier would be responsible for serializing this conversation history into a string suitable for the final LLM call.
  3. Final Schema-Formatting Call (The "Final Reply that Schemas It Up"):

    • llm.models.ChainResponse, upon completion of the tool-using phase and if a --schema was initially provided (now acting as a _deferred_schema), triggers this.
    • It makes a new call to llm.models.Model.prompt():
      • Prompt: "Based on the preceding conversation history: {aggregated_context}, provide the information structured according to the following schema: {schema_definition}."
      • Schema: The user's original --schema argument.
      • No Tools: Tools are not enabled for this specific formatting call. Its sole job is structuring.

Key Points Arising from Simon's Phrasing:

  • "Dumping information into context": This highlights the importance of the _gather_context_for_final_formatting() method. It needs to be good at creating a useful textual representation of the multi-turn interaction for the final formatting model. This might involve:

    • Clear delineation of user queries, assistant thoughts/actions, tool calls, and tool results.
    • Potentially summarizing or truncating very long histories if they exceed token limits for the final call (though this adds another layer of complexity/cost); see the sketch after this list.
  • "Final reply that schemas it up": This reinforces that it's a distinct step. The model first completes its reasoning/tool-using process (potentially ending with a natural language summary), and then that output (or the whole context leading to it) is subject to a separate schema-enforcing instruction.
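As a rough illustration of that token-limit point, the context aggregation step could apply a crude size budget before handing the context to the formatting call. This is a minimal sketch, not part of the design above; the 8,000-character budget and the keep-the-tail strategy are arbitrary stand-ins for real token counting or summarization:

def truncate_context(context: str, max_chars: int = 8000) -> str:
    """Trim an over-long aggregated context to a rough size budget.

    Characters are a crude proxy for tokens here; a real implementation
    would count tokens for the target model and might summarize instead.
    """
    if len(context) <= max_chars:
        return context
    # Keep the tail of the interaction: later turns usually contain the
    # model's own summary, which the formatting call should weight most.
    return "[...earlier interaction truncated...]\n" + context[-max_chars:]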

Refinement of _gather_context_for_final_formatting() in light of Simon's phrasing:

The method should aim to create a coherent narrative of the interaction. Instead of just a raw dump, it might look more like this pseudo-format:

User: Initial query...
Assistant (Thought): I need to do X.
Assistant (Action): Tool_A[param=value]
Tool Result (Tool_A): Output of Tool_A
Assistant (Thought): Based on Tool_A, I now know Y. I need to do Z.
Assistant (Action): Tool_B[param=value2]
Tool Result (Tool_B): Output of Tool_B
Assistant: The final answer based on my investigation is [natural language summary, if the model produced one before being intercepted for schema formatting].

Then the prompt for the final schema call would be:

"Given the above interaction, extract and structure the pertinent information into the following JSON schema: {...schema...}. If the model provided a final natural language summary, prioritize information from that summary, using the preceding interaction for detail and context."

This way, if the tool-using chain did culminate in a good textual answer, that answer is heavily weighted in the formatting step, rather than the formatting model having to re-derive everything from scratch from the whole log.

No fundamental change to the previously outlined design, but Simon's phrasing helps clarify the purpose and input for that final (+1) schema-formatting call. The "context" is indeed the history of the tool interactions and model utterances, and the "final reply" is the output of instructing the model to process that context against the schema.


Okay, to provide Simon with the most focused and helpful starting point, I'll aim for minimal but illustrative code snippets. The goal is to show the intent and the key touchpoints within the existing llm structure, rather than a fully polished, production-ready implementation.

Here's a breakdown of the minimal code snippets, focusing on the core logic for the "implicitly deferred schema" approach:

Core Idea: If Model.chain() (or Conversation.chain()) is called with tools enabled and a schema is provided, that schema is stored. After the regular tool-using chain finishes, an additional LLM call is made using this stored schema and the accumulated context.


I. llm/models.py - Modifications to _BaseChainResponse, ChainResponse, and AsyncChainResponse

This is where the bulk of the new logic will reside.

# llm/models.py
# ... (existing imports) ...
import asyncio  # used by the log_to_db simplification below
import json  # ensure json is imported (used to serialize tool-call arguments)
# ... (Prompt, Response, Tool, etc. classes as they exist) ...

class _BaseChainResponse:
    # ... (existing attributes like prompt, model, stream, conversation, _key, _responses) ...
    _deferred_schema: Optional[Union[Dict, type[BaseModel]]] = None # NEW attribute
    _final_formatting_response: Optional[Any] = None # NEW: To store the +1 response

    def __init__(
        self,
        prompt: Prompt,
        model: "_BaseModel",
        stream: bool,
        conversation: _BaseConversation,
        # ... (existing parameters: key, chain_limit, before_call, after_call) ...
        # ADD new parameter:
        deferred_schema: Optional[Union[Dict, type[BaseModel]]] = None,
    ):
        # ... (existing __init__ body) ...
        self.prompt = prompt # Contains original tools and options, but initial schema might be None
        self.model = model
        self.stream = stream
        self._key = key
        self._responses = []
        self._final_formatting_response = None
        self.conversation = conversation
        self.chain_limit = chain_limit
        self.before_call = before_call
        self.after_call = after_call
        self._deferred_schema = deferred_schema # Store the schema intended for final output

    def _gather_context_for_final_formatting(self) -> str:
        """
        Minimal example: concatenates previous prompts and textual responses.
        Needs refinement for clarity, tool calls/results, and token limits.
        """
        context_parts = []
        if not self._responses: # Should ideally not happen if we reached this stage
            return "No previous interactions recorded for formatting."

        for r_idx, response_obj in enumerate(self._responses):
            if response_obj.prompt.system:
                context_parts.append(f"System (Turn {r_idx+1}): {response_obj.prompt.system}")
            if response_obj.prompt.prompt:
                context_parts.append(f"User (Turn {r_idx+1}): {response_obj.prompt.prompt}")
            
            # Represent tool calls and results concisely
            if response_obj.prompt.tool_results: # Results fed INTO this turn
                for tr_idx, tr in enumerate(response_obj.prompt.tool_results):
                    context_parts.append(f"  Tool Result {tr_idx+1} (for call ID {tr.tool_call_id}): {tr.output}")
            
            # What the assistant said OR did in this turn
            assistant_actions = []
            text_output = response_obj.text_or_raise()
            tool_calls_made = response_obj.tool_calls_or_raise()

            if text_output.strip() and not tool_calls_made: # Purely textual response
                assistant_actions.append(text_output)
            elif tool_calls_made:
                for tc_idx, tc in enumerate(tool_calls_made):
                    assistant_actions.append(f"  Tool Call {tc_idx+1}: {tc.name}({json.dumps(tc.arguments)}) -> ID {tc.tool_call_id}")
            
            if assistant_actions:
                 context_parts.append(f"Assistant (Turn {r_idx+1}):\n" + "\n".join(assistant_actions))

        return "\n---\n".join(context_parts)

    def log_to_db(self, db): # Illustrative modification
        # Original loop for tool-chain responses
        for response_obj in self._responses:
            # ... (simplified existing logging logic for response_obj) ...
            # Ensure to use response_obj.text_or_raise() and .tool_calls_or_raise()
            # to handle potentially un-iterated async responses before logging
            actual_response_to_log = response_obj
            if isinstance(response_obj, AsyncResponse) and not response_obj._done:
                # This is tricky; log_to_db might be called before full async iteration.
                # For simplicity in this snippet, we assume it's resolved or needs resolving.
                # A robust solution would await/resolve it if needed.
                actual_response_to_log = asyncio.run(response_obj.to_sync_response()) # Simplification
            elif isinstance(response_obj, Response) and not response_obj._done:
                response_obj._force() # Ensure sync response is iterated
            
            actual_response_to_log.log_to_db(db) # Assuming Response.log_to_db exists

        # Log the final formatting response if it exists
        if self._final_formatting_response:
            final_response_to_log = self._final_formatting_response
            if isinstance(final_response_to_log, AsyncResponse) and not final_response_to_log._done:
                final_response_to_log = asyncio.run(final_response_to_log.to_sync_response())
            elif isinstance(final_response_to_log, Response) and not final_response_to_log._done:
                final_response_to_log._force()

            final_response_to_log.log_to_db(db)


class ChainResponse(_BaseChainResponse):
    _responses: List["Response"]
    _final_formatting_response: Optional["Response"] = None

    def _execute_tool_chain_and_prepare_formatting(self) -> None:
        """
        Internal helper to run the tool chain and store intermediate responses.
        This method will fully iterate the tool chain.
        """
        if self._responses: # Already executed
            return

        prompt = self.prompt
        count = 0
        # ... (Logic from existing ChainResponse.responses() generator to iterate tool calls) ...
        # Simplified:
        current_prompt_obj = self.prompt
        while True:
            if self.chain_limit and count >= self.chain_limit:
                break
            
            response_for_turn = Response(
                current_prompt_obj, self.model, self.stream, 
                conversation=self.conversation, key=self._key
            )
            self._responses.append(response_for_turn)
            
            # Force iteration to get tool calls and text
            for _ in response_for_turn: pass 
            response_for_turn._force() # Ensure it's fully processed

            count += 1
            tool_results = response_for_turn.execute_tool_calls(
                before_call=self.before_call, after_call=self.after_call
            )

            if tool_results:
                current_prompt_obj = Prompt(
                    "", self.model, tools=self.prompt.tools, # Pass original tools
                    tool_results=tool_results, options=self.prompt.options
                )
            else:
                break # No more tool calls, chain ends

    def __iter__(self) -> Iterator[str]:
        self._execute_tool_chain_and_prepare_formatting() # Ensure tool chain runs

        if self._deferred_schema:
            context = self._gather_context_for_final_formatting()
            formatting_instruction = (
                f"Based on the preceding interaction and gathered information:\n{context}\n\n"
                f"Please provide the final, consolidated answer strictly according to the following schema."
            )
            
            # Use original prompt's options, but override schema and clear tools
            final_prompt_options = self.prompt.options
            
            final_formatting_prompt_obj = Prompt(
                formatting_instruction,
                self.model,
                schema=self._deferred_schema,
                options=final_prompt_options,
                tools=[] # No tools for this final formatting step
            )
            
            self._final_formatting_response = Response(
                final_formatting_prompt_obj,
                self.model,
                self.stream, # Use original stream setting
                conversation=self.conversation, # For logging
                key=self._key,
            )
            yield from self._final_formatting_response
        
        elif self._responses: # No deferred schema, yield last textual part of tool chain
            last_tool_chain_response = self._responses[-1]
            # If already streamed by _execute_tool_chain_and_prepare_formatting (if it were a generator)
            # this might double-yield. text() is safer.
            # The __iter__ needs to be the single source of truth for yielding.
            # For simplicity, assume text() gives the final output of the chain here.
            # A more robust __iter__ would handle streaming from tool chain OR final step.
            # THIS PART IS SIMPLIFIED FOR BREVITY - a real __iter__ is more complex.
            if not last_tool_chain_response.tool_calls_or_raise():
                 yield last_tool_chain_response.text_or_raise()


    def text(self) -> str:
        # This will run __iter__ and collect all chunks
        return "".join(self)


class AsyncChainResponse(_BaseChainResponse): # Illustrative async version
    _responses: List["AsyncResponse"]
    _final_formatting_response: Optional["AsyncResponse"] = None

    async def _execute_tool_chain_and_prepare_formatting(self) -> None:
        if self._responses: return
        # ... (async equivalent of ChainResponse._execute_tool_chain_and_prepare_formatting) ...
        # Iterate using `async for _ in response_for_turn: pass`
        # and `await response_for_turn._force()`
        # Simplified:
        current_prompt_obj = self.prompt
        count = 0
        while True:
            if self.chain_limit and count >= self.chain_limit:
                break
            
            response_for_turn = AsyncResponse(
                current_prompt_obj, self.model, self.stream, 
                conversation=self.conversation, key=self._key
            )
            self._responses.append(response_for_turn)
            
            async for _ in response_for_turn: pass
            await response_for_turn._force()

            count += 1
            tool_results = await response_for_turn.execute_tool_calls(
                before_call=self.before_call, after_call=self.after_call
            )

            if tool_results:
                current_prompt_obj = Prompt(
                    "", self.model, tools=self.prompt.tools,
                    tool_results=tool_results, options=self.prompt.options
                )
            else:
                break

    async def __aiter__(self) -> AsyncIterator[str]:
        await self._execute_tool_chain_and_prepare_formatting()

        if self._deferred_schema:
            context = self._gather_context_for_final_formatting() # Sync part
            formatting_instruction = (
                f"Based on the preceding interaction and gathered information:\n{context}\n\n"
                f"Please provide the final, consolidated answer strictly according to the following schema."
            )
            final_prompt_options = self.prompt.options
            final_formatting_prompt_obj = Prompt(
                formatting_instruction,
                self.model,
                schema=self._deferred_schema,
                options=final_prompt_options,
                tools=[]
            )
            self._final_formatting_response = AsyncResponse(
                final_formatting_prompt_obj,
                self.model,
                self.stream,
                conversation=self.conversation,
                key=self._key,
            )
            async for chunk in self._final_formatting_response:
                yield chunk
        elif self._responses:
            last_tool_chain_response = self._responses[-1]
            if not await last_tool_chain_response.tool_calls(): # tool_calls() is async
                 async for chunk in last_tool_chain_response: # stream its content
                     yield chunk


    async def text(self) -> str:
        chunks = []
        async for chunk in self:
            chunks.append(chunk)
        return "".join(chunks)

# Modify Model.chain() and AsyncModel.chain() (and Conversation variants)
# to pass the `schema` as `deferred_schema` to ChainResponse if tools are active.

class _Model(_BaseModel): # Or _BaseModel if chain is there
    # ...
    def chain(
        self,
        prompt: Optional[str] = None,
        system: Optional[str] = None,
        # ... other existing args ...
        schema: Optional[Union[dict, type[BaseModel]]] = None, 
        tools: Optional[List[ToolDef]] = None, # Make sure `tools` is explicitly here
        # ... other existing args ...
    ) -> ChainResponse: # Or AsyncChainResponse for AsyncModel
        
        # Determine effective tools (passed arguments or conversation's tools)
        # This logic might be slightly different for Model.chain vs Conversation.chain
        effective_tools = _wrap_tools(
            tools
            or getattr(
                self.conversation if hasattr(self, "conversation") else None,
                "tools",
                [],
            )
        )

        current_prompt_schema = None
        deferred_schema_for_chain = None
        
        if effective_tools and schema:
            # Tools are active, so this schema is for the *final* output
            deferred_schema_for_chain = schema
        elif not effective_tools and schema:
            # No tools, schema applies to this direct call
            current_prompt_schema = schema
            
        # The prompt object passed to ChainResponse should have its 'schema'
        # field set to current_prompt_schema (which will be None if deferred).
        # The 'tools' field should contain effective_tools.
        # The deferred_schema_for_chain is passed separately to ChainResponse constructor.
        
        prompt_obj_for_chain = Prompt(
            prompt,
            model=self, # self is the model instance
            system=system,
            schema=current_prompt_schema, # Schema for *this turn* if not deferred
            tools=effective_tools, # All available tools
            # ... other Prompt fields like attachments, fragments, options ...
            options=self.Options(**(options or {})) # Assuming options is passed or available
        )

        return ChainResponse( # Or AsyncChainResponse
            prompt_obj_for_chain, # This is the initial prompt for the chain
            model=self,
            # ... other existing ChainResponse args like stream, conversation, key ...
            deferred_schema=deferred_schema_for_chain, # Pass the potentially deferred schema
        )
# Similar for AsyncModel.chain, Conversation.chain, AsyncConversation.chain

II. llm/cli.py - How prompt command passes schema

The llm.cli.prompt function already resolves schema_input to a schema dictionary. This schema dictionary just needs to be correctly passed to the model.chain() or conversation.chain() call.

# llm/cli.py
# In the `prompt` command function, where `kwargs_for_chain_or_prompt` is built:

    # ... (schema is already resolved from schema_input) ...

    # kwargs for the main model call (prompt or chain)
    # This 'schema' will be interpreted by Model.chain to become either
    # the direct schema for the first turn OR the deferred_schema for ChainResponse.
    call_kwargs = {
        "prompt": prompt_text_from_user_or_stdin, # The actual prompt string
        "system": system,
        "attachments": resolved_attachments,
        "fragments": resolved_fragments,
        "system_fragments": resolved_system_fragments,
        "schema": schema, # <--- This is the key part
        "tools": tool_implementations, # List of Tool objects/classes
        # ... any other relevant parameters for .prompt() or .chain() ...
    }
    
    # If tool_implementations exist, we are in a chain scenario.
    # Otherwise, it's a direct prompt.
    if tool_implementations:
        # Add chain-specific parameters if any, e.g., chain_limit
        call_kwargs["chain_limit"] = chain_limit
        if tools_debug: call_kwargs["after_call"] = _debug_tool_call
        if tools_approve: call_kwargs["before_call"] = _approve_tool_call
        
        # Use conversation.chain or model.chain
        prompt_method_to_call = conversation.chain if conversation else model.chain
    else:
        # Use conversation.prompt or model.prompt
        prompt_method_to_call = conversation.prompt if conversation else model.prompt

    # The `key` and model-specific `options` are often passed as **kwargs
    # to the prompt_method_to_call.
    # `validated_options` is already prepared in cli.py
    
    response = prompt_method_to_call(
        **call_kwargs, 
        **validated_options, # Contains model-specific options like temperature
        key=key_from_cli_or_env # The API key
    )
    # ...

Explanation of Snippets:

  1. _BaseChainResponse:

    • Gains _deferred_schema and _final_formatting_response attributes.
    • The constructor now accepts deferred_schema.
    • _gather_context_for_final_formatting() is a placeholder for the crucial logic of creating a good textual summary of the tool-chain interaction.
    • log_to_db is updated to also log _final_formatting_response.
  2. ChainResponse (and by extension AsyncChainResponse):

    • _execute_tool_chain_and_prepare_formatting(): This new internal method encapsulates running the original tool-use loop and populating self._responses.
    • __iter__() (and __aiter__):
      • First, ensures the tool chain part is executed by calling _execute_tool_chain_and_prepare_formatting().
      • Then, checks self._deferred_schema.
      • If set, it calls _gather_context_for_final_formatting(), creates a new instructional Prompt with this context and the _deferred_schema, and creates/yields from self._final_formatting_response (a new Response or AsyncResponse instance).
      • If no _deferred_schema, it (simplified here) yields the text from the last response in self._responses if that response was purely textual. A more robust version would stream the chunks of the tool chain if self.stream is true and no deferred schema is present.
    • text(): Simply joins the output of __iter__().
  3. _Model.chain() (and AsyncModel.chain() etc.):

    • The chain() method (and its variants in Conversation) now intelligently decides: if tools are active and schema is also provided, it passes the schema to ChainResponse as deferred_schema. Otherwise (no tools, but schema given), it sets schema on the initial Prompt object for direct schema output.
  4. llm/cli.py:

    • The prompt command function passes the schema (resolved from user input) directly to the model.chain() or conversation.chain() method. The model layer then handles whether it's a direct or deferred schema.

This structure tries to centralize the new "deferred schema" logic within the ChainResponse classes, keeping the Model.chain() and cli.py changes relatively minor (mostly about passing the schema parameter correctly). The _gather_context... method is the piece that would require the most careful implementation and iteration to ensure good results.
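For completeness, here is roughly how the deferred path might be exercised from the Python API once wired up. This is a sketch under the assumptions above: the schema= argument to chain() behaving as proposed is the new piece, passing a plain function in tools= is assumed to be accepted (as the _wrap_tools call in the sketch suggests), and the lookup_population tool, model name, and schema are placeholders:

import llm

def lookup_population(country: str) -> str:
    "Toy tool standing in for a real data source."
    return {"France": "68 million", "Japan": "124 million"}.get(country, "unknown")

model = llm.get_model("gpt-4o-mini")  # any tool-capable model

# Tools are active, so per the proposal the schema is deferred: the chain runs
# its tool-calling turns first, then a final +1 call formats the result.
chain_response = model.chain(
    "What are the populations of France and Japan?",
    tools=[lookup_population],
    schema={
        "type": "object",
        "properties": {
            "france_population": {"type": "string"},
            "japan_population": {"type": "string"},
        },
    },
)
print(chain_response.text())  # expected to be JSON matching the schema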
