How we built a proxy to make reasoning AI models faster and more predictable
Modern AI models like Qwen3 and DeepSeek-R1 have a cool feature called "reasoning" or "thinking" mode. When it's enabled, the model works through the problem step by step inside a <think>...</think> block before giving you the final answer, and that extra deliberation markedly improves accuracy on complex tasks like math, code, and multi-step planning.
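To make that concrete, here's roughly what a thinking-mode completion looks like when you call an OpenAI-compatible server (vLLM, llama.cpp, and friends) hosting one of these models. The base URL and model name below are placeholders, and the exact way the reasoning is returned varies by server; the sketch assumes it comes back inline in the message content, wrapped in <think> tags, as described above.

```python
# Minimal sketch: querying a reasoning model through an OpenAI-compatible
# endpoint. The base_url and model name are placeholders for whatever
# server you happen to be running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

print(resp.choices[0].message.content)
# Typical shape of the output (reasoning first, then the answer):
# <think>
# 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.
# </think>
# The answer is 408.
```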
But there's a catch: it's all-or-nothing. You either get no reasoning (fast but often wrong) or unlimited reasoning (accurate but unpredictably slow).
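Concretely, the only knob most stacks give you is a boolean. The sketch below shows the two extremes side by side; the `chat_template_kwargs.enable_thinking` flag is how Qwen3-style models typically expose the switch on vLLM-like servers, so treat the exact parameter name as an assumption about your serving setup.

```python
# All-or-nothing: the only control is a boolean that turns thinking on or off.
# `enable_thinking` is the Qwen3/vLLM-style switch; other stacks name it differently.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
question = [{"role": "user", "content": "Plan a migration across 3 services."}]

# Fast, but skips the step-by-step reasoning entirely.
fast_but_shallow = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=question,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# Accurate, but nothing caps how many tokens the model spends inside
# <think> before it starts the real answer.
accurate_but_slow = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=question,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
```

There's no middle setting between those two calls, and that gap is what the rest of this post is about.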