Ruminate: Taking Control of AI Reasoning Speed

How we built a proxy to make reasoning AI models faster and more predictable

The Problem: All-or-Nothing Reasoning

Modern AI models like Qwen3 and DeepSeek R1 have a cool feature called "reasoning" or "thinking" mode. When enabled, they work through problems step-by-step in a <think>...</think> block before giving you the final answer. This dramatically improves accuracy on complex tasks.
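
For example, a thinking-mode response to a shape-recognition question looks roughly like this (an illustrative transcript, not actual model output):

<think>
The figure has five sides and they all look equal in length, so this should be a
regular pentagon. Let me double-check the vertex count... yes, five.
</think>
The shape is a pentagon.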

But there's a catch: it's all-or-nothing. You either get no reasoning (fast but often wrong) or unlimited reasoning (accurate but unpredictably slow).

Here's what we saw testing a geometric shape recognition task:

  • No reasoning: 27% accuracy, ~3 tokens, super fast
  • Full reasoning: 75% accuracy, ~2000 tokens average, but some responses took 8000+ tokens!

That 300x variance in response time makes reasoning models impractical for real applications. You can't tell users "this might take 1 second or 30 seconds, we'll see!"

The Solution: Staged Reasoning with Ruminate

What if we could give the AI a "thinking budget" and gentle nudges to wrap up when time is running short? That's exactly what our proxy server "Ruminate" does.

Instead of unlimited thinking, Ruminate breaks reasoning into stages:

Stage 1: Initial Thinking (The "Ideal" Budget)

"Here's your ideal time to think through this problem thoroughly."

Stage 2: Soft Warning (The Adaptive Buffer)

"Time's getting short, make sure you're on the right track."

Stage 3: Hard Warning (Last Chance)

"Really need to wrap this up now, summarize your thoughts."

Stage 4: Emergency Termination

"Time's up, give your best answer based on what you've thought so far."

How It Works

Ruminate sits between your application and the AI model as a proxy server. When you send a chat request, it:

  1. Converts your chat messages into a text completion prompt
  2. Adds reasoning tags and makes multiple API calls to the real model
  3. Injects helpful prompts at each stage ("considering the limited time...")
  4. Monitors for natural completion (when the model closes </think> on its own)
  5. Returns the final response as a standard chat completion
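
Steps 1 through 3 come down to splicing text into a plain completions prompt. Here's a stripped-down sketch of that splicing (it mirrors the full proxy code at the end of this post; the backend call is shown as a comment):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
messages = [{"role": "user", "content": "What shape has five equal sides?"}]

# 1. Chat messages -> text completion prompt
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 2. Open a thinking block and ask the backend for at most initial_think tokens
prompt += "\n<think>\n"
# completion, used_tokens = POST {backend}/v1/completions with {"prompt": prompt, "max_tokens": initial_think}

# 3. If the model hasn't closed </think> on its own, inject the soft-warning nudge and continue
prompt += "\nConsidering the limited time by the user, I'd better make sure I'm on the right track."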

The key insight: most good reasoning happens early. If we can catch the runaway cases while letting normal reasoning complete naturally, we get the best of both worlds.

Real Results: The Numbers Don't Lie

Testing on geometric shape recognition with 275 test cases each:

Configuration            Accuracy  95% CI   Avg Tokens  P95 Tokens
No reasoning             27.3%     ±5.2%    3           3
Full reasoning           75.3%     ±5.1%    2,036       4,532
Ruminate [200,200,200]   57.5%     ±5.8%    568         686
Ruminate [1000,600,400]  68.8%     ±5.7%    1,333       2,067

The Statistical Story

The confidence intervals tell a compelling story. The two Ruminate configurations have intervals of similar width (±5.8% and ±5.7%), but their point estimates differ by more than 11 points (57.5% vs 68.8%), so the intervals barely overlap. This suggests we're seeing a real performance difference, not just random variation.

More importantly, look at the predictability gains:

  • Full reasoning: P95 of 4,532 tokens (worst case roughly 2.2x the average)
  • Ruminate [1000,600,400]: P95 of 2,067 tokens (worst case roughly 1.6x the average)

We're getting 91% of full reasoning accuracy (68.8% vs 75.3%) while cutting worst-case response time in half.

The Key Insight: TERMINATE Rate as a KPI

Here's what we discovered: the frequency of hitting the final "emergency termination" stage is a crucial metric. When the AI is forced to stop mid-thought, quality suffers. When it completes reasoning naturally within the budget, quality stays high.

This gives us a principled way to tune the system:

  • Too many terminations? → Increase budgets
  • Response times too high? → Decrease budgets
  • Sweet spot: < 15% termination rate with controlled P95 times

The [1000,600,400] configuration significantly reduced termination events compared to [200,200,200], which explains the accuracy jump while maintaining reasonable response times.

A Framework for Task-Specific Optimization

We realized the three budgets serve different purposes:

  • initial_think: The "ideal" budget based on task complexity
  • hard_warn: Small fixed "emergency escape" budget
  • soft_warn: The tuning parameter that adapts to the model's reasoning patterns

For different tasks:

  • Simple math: [300, 150, 100]
  • Complex reasoning: [1000, 500, 200]
  • Multi-step proofs: [2000, 1000, 300]

You could even automate this: run a calibration phase for each task type, measure termination rates, and auto-tune the soft_warn parameter.
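
A rough sketch of what that calibration could look like, assuming the proxy below is running on port 8001 and using the termination_reason field it returns (the prompt list and the step size are placeholders):

import requests

CALIBRATION_PROMPTS = ["<task prompt 1>", "<task prompt 2>"]  # placeholder: representative prompts for this task type

def termination_rate(reason_control):
    """Fraction of calibration prompts that hit the emergency TERMINATE stage."""
    terminated = 0
    for prompt in CALIBRATION_PROMPTS:
        reply = requests.post("http://localhost:8001/v1/chat/completions", json={
            "model": "Qwen/Qwen3-4B",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024,
            "reason_control": reason_control,
        }).json()
        if reply["termination_reason"] == "hard_terminated":
            terminated += 1
    return terminated / len(CALIBRATION_PROMPTS)

# Grow soft_warn until fewer than 15% of runs are hard-terminated
budgets = [1000, 200, 400]  # [initial_think, soft_warn, hard_warn]
while termination_rate(budgets) > 0.15:
    budgets[1] += 200  # soft_warn is the tuning knob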

The Speed-Accuracy Tradeoff Curve

What's exciting is that Ruminate gives you granular control over the speed-accuracy tradeoff:

27% accuracy → 57% accuracy → 69% accuracy → 75% accuracy
   3 tokens     568 tokens    1,333 tokens   2,036 tokens

Instead of a binary choice, you can dial in exactly where you want to be on this curve based on your application's needs.

Why This Matters

Ruminate transforms reasoning models from research curiosities into practical tools. Instead of choosing between "fast but dumb" or "smart but unpredictable," you can dial in exactly the speed/accuracy tradeoff your application needs.

This opens up reasoning models for:

  • Interactive applications (where response time matters)
  • Batch processing (where you can budget compute precisely)
  • Production systems (where predictability is crucial)

Try It Yourself

The full code is available at [GitHub link]. It's built with Python asyncio and FastAPI, and designed to work with any OpenAI-compatible API.

Key features:

  • Drop-in replacement for /v1/chat/completions
  • Configurable reasoning budgets via request parameters
  • Detailed metrics including termination reasons
  • Works with any model that supports thinking tags
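
For instance, a minimal client call against the proxy below (running on port 8001; the endpoint, parameter names, and termination_reason field all come from that code):

import requests

response = requests.post("http://localhost:8001/v1/chat/completions", json={
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "How many sides does a heptagon have?"}],
    "max_tokens": 1024,
    "reason_control": [1000, 600, 400],  # [initial_think, soft_warn, hard_warn]
}).json()

print(response["choices"][0]["message"]["content"])
print("termination reason:", response["termination_reason"])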

What's Next?

We're exploring:

  • Adaptive budgets that learn from previous requests
  • Task-specific profiles that auto-configure for different problem types
  • Multi-model support beyond just Qwen
  • Streaming responses during the final answer phase

The future of AI isn't just about making models smarter—it's about making them smarter and more controllable. Ruminate is a step toward AI that thinks as much as you need, when you need it.

import asyncio
import aiohttp
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from transformers import AutoTokenizer


@dataclass
class ReasoningConfig:
    initial_think: int
    soft_warn: int
    hard_warn: int
    reason_initial_text: str = ""
    reason_soft_text: str = "\nConsidering the limited time by the user, I'd better make sure I'm on the right track."
    reason_hard_text: str = "\nConsidering the limited time by the user, let me summarize my thoughts and finish up."
    reason_terminate_text: str = "\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now."


class ReasoningProxy:
    def __init__(self, backend_url: str, model_name: str):
        self.backend_url = backend_url.rstrip('/')
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    async def _complete(self, session: aiohttp.ClientSession, prompt: str, max_tokens: int, **kwargs) -> tuple[str, int]:
        """Make a completion request and return (completion_text, tokens_used)"""
        payload = {
            "prompt": prompt,
            "max_tokens": max_tokens,
            **kwargs
        }
        # print("upstream completion:", payload)
        async with session.post(f"{self.backend_url}/v1/completions", json=payload) as response:
            # print(await response.text())
            response.raise_for_status()
            result = await response.json()
            completion_text = result["choices"][0]["text"]
            tokens_used = result["usage"]["completion_tokens"]
            return completion_text, tokens_used

    async def process_chat_completion(self, messages: List[Dict], reason_control: List[int],
                                      max_tokens: int, reason_initial_text: str = "",
                                      reason_soft_text: str = "\nConsidering the limited time by the user, I'd better make sure I'm on the right track.",
                                      reason_hard_text: str = "\nConsidering the limited time by the user, let me summarize my thoughts and finish up.",
                                      reason_terminate_text: str = "\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now.",
                                      **kwargs) -> Dict[str, Any]:
        # Stage 0: Setup
        initial_prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        initial_think, soft_warn, hard_warn = reason_control
        current_prompt = initial_prompt
        termination_reason = "natural"
        skip_to_answer = False

        async with aiohttp.ClientSession() as session:
            # Stage 1: THINK
            current_prompt += "\n<think>\n"
            if initial_think > 0 and not skip_to_answer:
                current_prompt += reason_initial_text
                completion, used_tokens = await self._complete(session, current_prompt, initial_think, **kwargs)
                current_prompt += completion
                if used_tokens < initial_think:
                    # Natural completion
                    skip_to_answer = True
                elif "</think>" in completion:
                    # Found end of thinking
                    skip_to_answer = True
                # Otherwise continue to SOFT stage
            else:
                current_prompt += "\n</think>\n"
                skip_to_answer = True

            # Stage 2: SOFT
            if soft_warn > 0 and not skip_to_answer:
                current_prompt += reason_soft_text
                completion, used_tokens = await self._complete(session, current_prompt, soft_warn, **kwargs)
                current_prompt += completion
                if used_tokens < soft_warn:
                    # Natural completion
                    skip_to_answer = True
                elif "</think>" in completion:
                    # Found end of thinking
                    skip_to_answer = True
                # Otherwise continue to HARD stage

            # Stage 3: HARD
            if hard_warn > 0 and not skip_to_answer:
                print('HARD reached: ', current_prompt)
                current_prompt += reason_hard_text
                completion, used_tokens = await self._complete(session, current_prompt, hard_warn, **kwargs)
                current_prompt += completion
                if used_tokens < hard_warn:
                    # Natural completion
                    skip_to_answer = True
                elif "</think>" in completion:
                    # Found end of thinking
                    skip_to_answer = True
                # Otherwise continue to TERMINATE stage

            # Stage 4: TERMINATE
            if not skip_to_answer:
                print('TERMINATE reached: ', current_prompt)
                current_prompt += reason_terminate_text + "\n</think>\n\n"
                completion, used_tokens = await self._complete(session, current_prompt, max_tokens, **kwargs)
                current_prompt += completion
                termination_reason = "hard_terminated"
            else:
                # Stage 5: ANSWER
                completion, used_tokens = await self._complete(session, current_prompt, max_tokens, **kwargs)
                current_prompt += completion

        # Stage 6: DONE - return response
        final_response = current_prompt[len(initial_prompt):]
        return {
            "id": "chatcmpl-reasoning-proxy",
            "object": "chat.completion",
            "model": self.model_name,
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": final_response
                },
                "finish_reason": "stop"
            }],
            "usage": {
                "prompt_tokens": len(self.tokenizer.encode(initial_prompt)),
                "completion_tokens": len(self.tokenizer.encode(final_response)),
                "total_tokens": len(self.tokenizer.encode(current_prompt))
            },
            "termination_reason": termination_reason
        }


# FastAPI server
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Global proxy instance - configure these for your setup
BACKEND_URL = "http://localhost:3333"  # Your actual model server
MODEL_NAME = "Qwen/Qwen3-4B"  # Update as needed
proxy = ReasoningProxy(BACKEND_URL, MODEL_NAME)


class ChatCompletionRequest(BaseModel):
    messages: List[Dict[str, str]]
    model: str
    max_tokens: int
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    min_p: Optional[float] = None
    top_k: Optional[float] = None
    reason_control: Optional[List[int]] = [0, 0, 0]  # Default values
    reason_initial_text: Optional[str] = ""
    reason_soft_text: Optional[str] = "\nConsidering the limited time by the user, I'd better make sure I'm on the right track."
    reason_hard_text: Optional[str] = "\nConsidering the limited time by the user, let me summarize my thoughts and finish up."
    reason_terminate_text: Optional[str] = "\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now."


@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    try:
        # Extract reasoning parameters
        reason_params = {
            "reason_control": request.reason_control,
            "reason_initial_text": request.reason_initial_text,
            "reason_soft_text": request.reason_soft_text,
            "reason_hard_text": request.reason_hard_text,
            "reason_terminate_text": request.reason_terminate_text
        }
        # Extract other parameters (filter out reason_* and messages/max_tokens)
        other_params = {}
        for field_name, field_value in request.model_dump().items():
            if not field_name.startswith("reason_") and field_name not in ["messages", "max_tokens"]:
                if field_value is not None:
                    other_params[field_name] = field_value
        result = await proxy.process_chat_completion(
            messages=request.messages,
            max_tokens=request.max_tokens,
            **reason_params,
            **other_params
        )
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)