For Qwen3 and compatible models · Based on the Qwen3-Embedding Technical Report
ChatML (Chat Markup Language) is a simple format that labels who is speaking in a conversation with an AI. It wraps each message in special tokens so the model always knows: is this an instruction from the app? A question from the user? Or a response it already gave?
Without ChatML, the model gets one big block of text with no way to tell voices apart. With it, every message has a clear owner — and that makes conversations reliable, safe, and easy to debug.
Qwen3, the model family this manual focuses on, uses ChatML as its native format. It adds a few extra tags on top (like <think> and <tool_call>) to support reasoning and tool use.
Every ChatML message follows the same pattern: <|im_start|>role → content → <|im_end|>. Here are all the tokens and tags you'll encounter:
| Token / Tag | What it does |
|---|---|
| <|im_start|> | Opens a new message block. Always followed by a role name on the same line. |
| <|im_end|> | Closes the current message block. Placed at the very end of a message. |
| system | Role for the app's setup instructions. Tells the AI how to behave. |
| user | Role for the human's message. What the person typed. |
| assistant | Role for the AI's reply. Used in history to show previous responses. |
| tool | Role for tool/function output (e.g. search results, calculator data). |
| <think> | Qwen3 extension. Opening tag for the model's internal reasoning. |
| </think> | Qwen3 extension. Closing tag for internal reasoning. Comes before the reply. |
| <tool_call> | Qwen3 extension. Wraps a JSON function call the model wants to make. |
| <tool_response> | Qwen3 extension. Wraps the result returned by a tool. |
ChatML conversations are built from messages. Each message belongs to one of four roles:
| Role | Who writes it | Purpose |
|---|---|---|
| system | App developer | Always first. Sets behavior, persona, rules. Usually just one. |
| user | Human | The person's message. Repeats every turn. |
| assistant | AI model | The model's reply. Previous replies go here as history. |
| tool | External tool | Data returned by a function call (e.g. search, calculator). |
Every message follows this exact pattern:
<|im_start|>role
Your message content goes here.
<|im_end|>
A minimal working conversation looks like this:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
Notice the three-part structure: the system block sets up the AI, the user block asks the question, and the assistant block holds the answer.
For ongoing conversations, each new round adds more blocks. The model has no memory on its own, so you must re-send the full history every time you call it.
<|im_start|>system
You are a helpful cooking assistant.<|im_end|>
<|im_start|>user
What can I make with eggs and cheese?<|im_end|>
<|im_start|>assistant
You could make an omelette, a frittata, or scrambled eggs with cheese.<|im_end|>
<|im_start|>user
How do I make a frittata?<|im_end|>
<|im_start|>assistant
← model generates here
The last open <|im_start|>assistant tag without a closing <|im_end|> is the model's cue to start generating its next response.
Qwen3 can show its reasoning before giving a final answer. This happens inside <think> … </think> tags, which appear inside the assistant block.
<|im_start|>user
If I have 12 apples and give away a third, how many do I have?<|im_end|>
<|im_start|>assistant
<think>
A third of 12 is 4. So 12 minus 4 equals 8.
</think>
You would have 8 apples left.<|im_end|>
The text inside <think> is the model's internal scratchpad. Only the text after </think> is the actual response the user sees. You can turn this off by passing enable_thinking=False in your code.
When the model needs to call an external function (like a web search or calculator), it emits a <tool_call> block. Your app runs the function, then sends back a <tool_response> block with the result.
<|im_start|>user
What's the weather in Tokyo right now?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": "18°C", "condition": "Partly cloudy"}
</tool_response><|im_end|>
<|im_start|>assistant
It's currently 18°C and partly cloudy in Tokyo.<|im_end|>
Tool responses are sent back as a special user message. The model then reads the data and writes its final reply.
When you call a ChatML-compatible API (like OpenAI, Qwen, or most open-source models), you don't write the raw token strings. You send a JSON array of message objects. The API converts it for you.
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2 + 2?"},
{"role": "assistant", "content": "2 + 2 equals 4."},
{"role": "user", "content": "And 3 + 3?"}
]
The API takes this array and turns it into the <|im_start|> / <|im_end|> format automatically before sending it to the model.
| Rule | Why it matters |
|---|---|
| Always close every block | A missing <|im_end|> will confuse the model and break output. |
| System message goes first | Put it at the top, before any user/assistant messages. |
| Send the full history | Models have no memory. Re-send every prior message each turn. |
| Keep roles accurate | Don't put user text in the system role. It defeats the security benefit. |
| Be specific in system prompts | Vague instructions produce inconsistent behavior. |
| One system message is enough | Too many system blocks can cause unpredictable behavior. |
ChatML was originally introduced by OpenAI and is now widely adopted across many model families including Qwen, SmolLM, and others. Qwen3-specific features (<think>, <tool_call>) are Alibaba/Qwen extensions to the base format.