In fruits.py, the model is allowed to generate normally until it emits </think>. Our vLLM request specifies response_format={"type": "structural_tag", ...} so that </think> acts as the boundary where constrained decoding begins.
The relevant part is:
END = "ekYyhUMDCGhE65J5"
response_format={
"type": "structural_tag",
"structures": [
{
"begin": "</think>",
"schema": schema,
"end": END,
}
],
"triggers": ["</think>"],
}After </think>, vLLM constrains generation to match the JSON schema. In this example, the schema only allows:
{ "fruit": "apple" }Or { "fruit": "banana" }, etc. The random-looking END string is a private delimiter that vLLM appends after the schema-constrained JSON. The request also uses stop=[END], so the delimiter is used to stop generation but is not included in the returned output.
This means the final JSON is produced by constrained decoding after the model’s natural <think>...</think> block, not merely by asking the model nicely to output JSON.
In this example, VibeThinker-3B emits its normal <think> block, and vLLM constrains the post-</think> suffix to strict JSON.
$ python fruits.py
RAW:
<think>The user says: "Pick one fruit. Keep your reasoning brief." The request is ambiguous: they want me to pick a fruit and give brief reasoning. Probably they want me to choose a fruit and give a short reason why I'm picking it? Or maybe they want me to pick a fruit as a response, with brief reasoning. There's no further context. The user says "Pick one fruit. Keep your reasoning brief." So I need to respond with a fruit name and a brief reason. Possibly they want me to pick a fruit that fits some hidden criteria? The instruction is minimal. I need to produce a short answer: e.g., "I pick the banana because it's popular." Or maybe they want me to pick a fruit and then reason why it's a good example of something. But no context. So I'll just pick a fruit, maybe "Apple". Reason: it's widely recognized and used in many examples. Or "Banana" because it's easy to mention. The user wants reasoning brief, so maybe a one-sentence reason.
Thus answer: "I choose the banana because it's convenient and widely available." Or "I pick the strawberry because of its sweet flavor." I'll choose something.
Given the instruction: "Pick one fruit. Keep your reasoning brief." I'd respond: "I'll pick the banana – it's popular, easy to handle, and represents variety of flavors." That's brief enough.
But maybe they want me to pick a fruit and then reason why it's the best example of something like "fruit for a healthy diet"? Not sure. The simplest: "I choose the apple because it's nutritious and widely consumed." That is a brief reasoning.
I'll answer: "Apple – it's a common, nutritious fruit." Or "Banana – convenient and versatile."
I think I'll answer: "Banana – easy to eat, provides potassium, and a great example of a simple fruit." That is brief.
Thus final answer: "Banana – it's convenient, nutritious, and a perfect example of a simple fruit."</think>{ "fruit": "banana" }
JSON TEXT:
{ "fruit": "banana" }
parsed: {'fruit': 'banana'}
OK
For reference, I ran vLLM like this. There are a few extra params here from some evaluations I was doing, and I haven't tried to trim them down for this example. Just want to provide a known-working configuration.
vllm serve WeiboAI/VibeThinker-3B \
--dtype auto \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 100 \
--max-num-batched-tokens 32768 \
--enable-prefix-caching \
--generation-config vllm \
--override-generation-config {"temperature":1.0,"top_p":0.95,"top_k":-1,"max_new_tokens":8192}