@noperator
Created June 23, 2026 16:47

In fruits.py, the model is allowed to generate normally until it emits </think>. Our vLLM request specifies response_format={"type": "structural_tag", ...} so that </think> acts as the boundary where constrained decoding begins.

The relevant part is:

END = "ekYyhUMDCGhE65J5"

response_format={
    "type": "structural_tag",
    "structures": [
        {
            "begin": "</think>",
            "schema": schema,
            "end": END,
        }
    ],
    "triggers": ["</think>"],
}

After </think>, vLLM constrains generation to match the JSON schema. In this example, the schema only allows:

{ "fruit": "apple" }

Or the same object with "banana" or "orange"; no other value and no extra key is accepted. The random-looking END string is a private delimiter that the structural tag requires immediately after the schema-constrained JSON. The request also sets stop=[END], so the delimiter halts generation but is normally not included in the returned output. This means the final JSON is produced by constrained decoding after the model’s natural <think>...</think> block, not merely by asking the model nicely to output JSON.
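As a rough sketch, the same request fragment could be assembled for any schema with a pair of small helpers. The names `new_delimiter` and `structural_tag_format` are hypothetical, and generating the delimiter with `secrets` is an assumption; the delimiter in this example is simply hard-coded. The dict shape mirrors the response_format shown above.

```python
import secrets
import string


def new_delimiter(length: int = 16) -> str:
    """Return a random alphanumeric string that is unlikely to occur in model output."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def structural_tag_format(schema: dict, end: str, trigger: str = "</think>") -> dict:
    """Build a response_format dict in the shape used by this example."""
    return {
        "type": "structural_tag",
        "structures": [{"begin": trigger, "schema": schema, "end": end}],
        "triggers": [trigger],
    }


end = new_delimiter()
fmt = structural_tag_format({"type": "object"}, end)
# Pass response_format=fmt and stop=[end] in the same request, so the
# delimiter both closes the constrained region and halts generation.
```

Keeping the delimiter in one variable and passing it to both places avoids the two values drifting apart.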

In this example, VibeThinker-3B emits its normal <think> block, and vLLM constrains the post-</think> suffix to strict JSON.

$ python fruits.py
RAW:
<think>The user says: "Pick one fruit. Keep your reasoning brief." The request is ambiguous: they want me to pick a fruit and give brief reasoning. Probably they want me to choose a fruit and give a short reason why I'm picking it? Or maybe they want me to pick a fruit as a response, with brief reasoning. There's no further context. The user says "Pick one fruit. Keep your reasoning brief." So I need to respond with a fruit name and a brief reason. Possibly they want me to pick a fruit that fits some hidden criteria? The instruction is minimal. I need to produce a short answer: e.g., "I pick the banana because it's popular." Or maybe they want me to pick a fruit and then reason why it's a good example of something. But no context. So I'll just pick a fruit, maybe "Apple". Reason: it's widely recognized and used in many examples. Or "Banana" because it's easy to mention. The user wants reasoning brief, so maybe a one-sentence reason.

Thus answer: "I choose the banana because it's convenient and widely available." Or "I pick the strawberry because of its sweet flavor." I'll choose something.

Given the instruction: "Pick one fruit. Keep your reasoning brief." I'd respond: "I'll pick the banana – it's popular, easy to handle, and represents variety of flavors." That's brief enough.

But maybe they want me to pick a fruit and then reason why it's the best example of something like "fruit for a healthy diet"? Not sure. The simplest: "I choose the apple because it's nutritious and widely consumed." That is a brief reasoning.

I'll answer: "Apple – it's a common, nutritious fruit." Or "Banana – convenient and versatile."

I think I'll answer: "Banana – easy to eat, provides potassium, and a great example of a simple fruit." That is brief.

Thus final answer: "Banana – it's convenient, nutritious, and a perfect example of a simple fruit."</think>{ "fruit": "banana" }

JSON TEXT:
{ "fruit": "banana" }

parsed: {'fruit': 'banana'}
OK
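
The RAW text above can be separated into the reasoning and the JSON payload with a small helper. This is only a minimal sketch: `split_output` is a hypothetical name, and it assumes the response contains a single </think> boundary and that the END delimiter may or may not already have been stripped by `stop`.

```python
import json


def split_output(text: str, end: str, boundary: str = "</think>"):
    """Split raw model text into (reasoning, parsed JSON payload)."""
    if boundary not in text:
        raise ValueError("boundary not found; the constraint never activated")
    reasoning, after = text.split(boundary, 1)
    # Drop the delimiter and anything after it, in case it was not stripped.
    payload = after.split(end, 1)[0].strip()
    # Remove the opening tag from the reasoning, if the model emitted one.
    if reasoning.startswith("<think>"):
        reasoning = reasoning[len("<think>"):]
    return reasoning.strip(), json.loads(payload)


raw = '<think>Banana is convenient.</think>{ "fruit": "banana" }'
reasoning, parsed = split_output(raw, end="ekYyhUMDCGhE65J5")
print(reasoning)
print(parsed)
```

Returning the reasoning separately makes it easy to log or discard it while passing only the parsed object downstream.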

For reference, I ran vLLM like this. A few of the parameters are left over from some evaluations I was doing, and I haven't tried to trim them down for this example; I just want to provide a known-working configuration.

vllm serve WeiboAI/VibeThinker-3B \
    --dtype auto \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 100 \
    --max-num-batched-tokens 32768 \
    --enable-prefix-caching \
    --generation-config vllm \
    --override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":-1,"max_new_tokens":8192}'

#!/usr/bin/env python3
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
model = client.models.list().data[0].id

END = "ekYyhUMDCGhE65J5"

schema = {
    "type": "object",
    "properties": {
        "fruit": {
            "type": "string",
            "enum": ["apple", "banana", "orange"],
        }
    },
    "required": ["fruit"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Pick one fruit. Keep your reasoning brief.",
        }
    ],
    temperature=0,
    max_tokens=1024,
    stop=[END],
    response_format={
        "type": "structural_tag",
        "structures": [
            {
                "begin": "</think>",
                "schema": schema,
                "end": END,
            }
        ],
        "triggers": ["</think>"],
    },
)

text = resp.choices[0].message.content
print("RAW:")
print(text)

if "</think>" not in text:
    raise RuntimeError("Model never emitted </think>; structural constraint never activated.")

after = text.split("</think>", 1)[1]

# With stop=[END], vLLM usually removes END from returned text.
# This split keeps the parser safe if behavior/config changes.
json_text = after.split(END, 1)[0].strip()
print("\nJSON TEXT:")
print(json_text)

parsed = json.loads(json_text)
assert isinstance(parsed, dict)
assert set(parsed.keys()) == {"fruit"}
assert parsed["fruit"] in schema["properties"]["fruit"]["enum"]
print("\nparsed:", parsed)
print("OK")