How we built a proxy to make reasoning AI models faster and more predictable
Modern AI models like Qwen3 and DeepSeek R1 have a cool feature called "reasoning" or "thinking" mode. When enabled, they work through problems step-by-step in a `<think>...</think>` block before giving you the final answer. This dramatically improves accuracy on complex tasks.
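For example, a reasoning-mode answer to "Is 97 prime?" looks roughly like this (abbreviated, illustrative output, not a real transcript):

```
<think>
97 is odd, not divisible by 3 (9+7=16), not by 5, not by 7 (7*13=91, 7*14=98),
and 11^2 > 97, so there are no larger factors to check.
</think>
Yes, 97 is prime.
```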
But there's a catch: it's all-or-nothing. You either get no reasoning (fast but often wrong) or unlimited reasoning (accurate but unpredictably slow).
Here's what we saw testing a geometric shape recognition task:
- No reasoning: 27% accuracy, ~3 tokens, super fast
- Full reasoning: 75% accuracy, ~2000 tokens average, but some responses took 8000+ tokens!
That 300x variance in response time makes reasoning models impractical for real applications. You can't tell users "this might take 1 second or 30 seconds, we'll see!"
What if we could give the AI a "thinking budget" and gentle nudges to wrap up when time is running short? That's exactly what our proxy server "Ruminate" does.
Instead of unlimited thinking, Ruminate breaks reasoning into stages:
"Here's your ideal time to think through this problem thoroughly."
"Time's getting short, make sure you're on the right track."
"Really need to wrap this up now, summarize your thoughts."
"Time's up, give your best answer based on what you've thought so far."
Ruminate sits between your application and the AI model as a proxy server. When you send a chat request, it:
- Converts your chat messages into a text completion prompt
- Adds reasoning tags and makes multiple API calls to the real model
- Injects helpful prompts at each stage ("considering the limited time...")
- Monitors for natural completion (when the model closes `</think>` on its own)
- Returns the final response as a standard chat completion
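Stripped of error handling and streaming, the core of that flow is a handful of text-completion calls with capped `max_tokens`, stopping early whenever the model reaches `</think>` on its own. A simplified sketch, not Ruminate's actual code: the endpoint address, stop-sequence handling, and the final-answer budget are illustrative, and `stages` is the schedule from the sketch above.

```python
import httpx

async def staged_reasoning(prompt: str, stages, base_url: str, model: str) -> str:
    """Run the thinking phase in budgeted stages, then collect the final answer.

    `prompt` is assumed to be the chat messages already rendered into a text prompt.
    """
    text = prompt + "<think>\n"                     # open the thinking block ourselves
    async with httpx.AsyncClient(base_url=base_url, timeout=120) as client:
        for stage in stages:
            text += stage.nudge                     # inject this stage's time-pressure prompt
            resp = await client.post("/v1/completions", json={
                "model": model, "prompt": text,
                "max_tokens": stage.budget,         # cap thinking tokens for this stage
                "stop": ["</think>"],               # halt as soon as reasoning finishes
            })
            choice = resp.json()["choices"][0]
            text += choice["text"]
            if choice["finish_reason"] == "stop":   # model finished reasoning naturally
                text += "</think>\n"
                break
        else:
            # Budgets exhausted: force-terminate the thinking block.
            text += "\n</think>\nTime's up, give your best answer based on what you've thought so far.\n"
        # One final call to produce the visible answer after the </think> block.
        resp = await client.post("/v1/completions", json={
            "model": model, "prompt": text, "max_tokens": 512,
        })
        return resp.json()["choices"][0]["text"]
```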
The key insight: most good reasoning happens early. If we can catch the runaway cases while letting normal reasoning complete naturally, we get the best of both worlds.
Testing on geometric shape recognition with 275 test cases each:
| Configuration | Accuracy | 95% CI | Avg Tokens | P95 Tokens |
|---|---|---|---|---|
| No reasoning | 27.3% | ±5.2% | 3 | 3 |
| Full reasoning | 75.3% | ±5.1% | 2,036 | 4,532 |
| Ruminate [200,200,200] | 57.5% | ±5.8% | 568 | 686 |
| Ruminate [1000,600,400] | 68.8% | ±5.7% | 1,333 | 2,067 |
The confidence intervals tell a compelling story. The two Ruminate configurations' intervals barely overlap: 57.5% ±5.8% spans roughly 51.7-63.3%, while 68.8% ±5.7% spans roughly 63.1-74.5%. With that little overlap, the gap between the point estimates looks like a real performance difference, not just random variation.
More importantly, look at the predictability gains:
- Full reasoning: P95 of 4,532 tokens (about 2.2x the average)
- Ruminate [1000,600,400]: P95 of 2,067 tokens (about 1.6x the average)
We're getting 91% of full reasoning's accuracy (68.8% vs 75.3%) while cutting the P95 token count by more than half (2,067 vs 4,532).
Here's what we discovered: the frequency of hitting the final "emergency termination" stage is a crucial metric. When the AI is forced to stop mid-thought, quality suffers. When it completes reasoning naturally within the budget, quality stays high.
This gives us a principled way to tune the system:
- Too many terminations? → Increase budgets
- Response times too high? → Decrease budgets
- Sweet spot: < 15% termination rate with controlled P95 times
The [1000,600,400] configuration significantly reduced termination events compared to [200,200,200], which explains the accuracy jump while maintaining reasonable response times.
We realized the three budgets serve different purposes:
- `initial_think`: the "ideal" budget based on task complexity
- `hard_warn`: a small, fixed "emergency escape" budget
- `soft_warn`: the tuning parameter that adapts to the model's reasoning patterns
For different tasks:
- Simple math: `[300, 150, 100]`
- Complex reasoning: `[1000, 500, 200]`
- Multi-step proofs: `[2000, 1000, 300]`
You could even automate this: run a calibration phase for each task type, measure termination rates, and auto-tune the soft_warn parameter.
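A sketch of what that auto-tuning could look like, assuming a budget array ordered `[initial_think, soft_warn, hard_warn]` and a hypothetical `run_with_budgets` helper that calls the proxy and reports why each thinking phase ended (neither is part of Ruminate today); the 15% target comes from the sweet spot above, while the step size and cap are illustrative:

```python
async def calibrate_soft_warn(samples, budgets, target_rate=0.15, step=100, max_soft=2000):
    """Grow soft_warn until fewer than target_rate of calibration runs hit forced termination."""
    initial_think, soft_warn, hard_warn = budgets
    while soft_warn <= max_soft:
        forced = 0
        for prompt in samples:
            # run_with_budgets is a hypothetical helper that sends one request through
            # the proxy and returns metadata, including the termination reason.
            result = await run_with_budgets(prompt, [initial_think, soft_warn, hard_warn])
            if result["termination_reason"] == "forced":
                forced += 1
        if forced / len(samples) < target_rate:        # under the ~15% sweet spot
            return [initial_think, soft_warn, hard_warn]
        soft_warn += step                              # too many forced stops: give more room
    return [initial_think, soft_warn, hard_warn]       # give up at the cap
```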
What's exciting is that Ruminate gives you granular control over the speed-accuracy tradeoff:
27% accuracy (3 tokens) → 57% (568 tokens) → 69% (1,333 tokens) → 75% (2,036 tokens)
Instead of a binary choice, you can dial in exactly where you want to be on this curve based on your application's needs.
Ruminate transforms reasoning models from research curiosities into practical tools, replacing the binary choice between "fast but dumb" and "smart but unpredictable" with a tunable dial.
This opens up reasoning models for:
- Interactive applications (where response time matters)
- Batch processing (where you can budget compute precisely)
- Production systems (where predictability is crucial)
The full code is available [GitHub link]. It's built with Python asyncio and FastAPI, designed to work with any OpenAI-compatible API.
Key features:
- Drop-in replacement for `/v1/chat/completions`
- Configurable reasoning budgets via request parameters
- Detailed metrics including termination reasons
- Works with any model that supports thinking tags
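Because the proxy speaks the standard chat-completions protocol, an existing OpenAI client can be pointed at it directly. A hedged sketch of what usage could look like; the `reasoning_budgets` field name, the model name, and the port are placeholders, so check the repository for the actual parameters:

```python
from openai import OpenAI

# Point the standard client at the Ruminate proxy instead of the model server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "How many sides does a heptagon have?"}],
    extra_body={"reasoning_budgets": [1000, 600, 400]},  # hypothetical parameter name
)
print(resp.choices[0].message.content)
```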
We're exploring:
- Adaptive budgets that learn from previous requests
- Task-specific profiles that auto-configure for different problem types
- Multi-model support beyond just Qwen
- Streaming responses during the final answer phase
The future of AI isn't just about making models smarter—it's about making them smarter and more controllable. Ruminate is a step toward AI that thinks as much as you need, when you need it.