ChatML - Quick Reference Manual

For Qwen3 and compatible models · Based on the Qwen3-Embedding Technical Report

What is ChatML?

ChatML (Chat Markup Language) is a simple format that labels who is speaking in a conversation with an AI. It wraps each message in special tokens so the model always knows: is this an instruction from the app? A question from the user? Or a response it already gave?

Without ChatML, the model gets one big block of text with no way to tell voices apart. With it, every message has a clear owner — and that makes conversations reliable, safe, and easy to debug.

Qwen3, the model family this manual focuses on, uses ChatML as its native format. It adds a few extra tags on top (like <think> and <tool_call>) to support reasoning and tool use.

All Tokens at a Glance

Every ChatML message follows the same pattern: <|im_start|>role → content → <|im_end|>. Here are all the tokens and tags you'll encounter:

Token / Tag	What it does
<|im_start|>	Opens a new message block. Always followed by a role name on the same line.
<|im_end|>	Closes the current message block. Placed at the very end of a message.
system	Role for the app's setup instructions. Tells the AI how to behave.
user	Role for the human's message. What the person typed.
assistant	Role for the AI's reply. Used in history to show previous responses.
tool	Role for tool/function output (e.g. search results, calculator data).
<think>	Qwen3 extension. Opening tag for the model's internal reasoning.
</think>	Qwen3 extension. Closing tag for internal reasoning. Comes before the reply.
<tool_call>	Qwen3 extension. Wraps a JSON function call the model wants to make.
<tool_response>	Qwen3 extension. Wraps the result returned by a tool.

The Four Roles

ChatML conversations are built from messages. Each message belongs to one of four roles:

Role	Who writes it	Purpose
system	App developer	Always first. Sets behavior, persona, rules. Usually just one.
user	Human	The person's message. Repeats every turn.
assistant	AI model	The model's reply. Previous replies go here as history.
tool	External tool	Data returned by a function call (e.g. search, calculator).

Basic Structure

Every message follows this exact pattern:

<|im_start|>role
Your message content goes here.<|im_end|>

A minimal working conversation looks like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

Notice the three-part structure: the system block sets up the AI, the user block asks the question, and the assistant block holds the answer.
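The pattern above can be sketched in a few lines of Python. This is a minimal illustration, not a real chat template: the helper names `render_message` and `render_conversation` are hypothetical, and the exact whitespace (one newline after the role, one after `<|im_end|>`) is an assumption that a model's own template may handle differently.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_message(role: str, content: str) -> str:
    """Wrap one message as: <|im_start|>role, newline, content, <|im_end|>, newline."""
    return f"{IM_START}{role}\n{content}{IM_END}\n"


def render_conversation(messages: List[Dict[str, str]]) -> str:
    """Concatenate the rendered blocks in the order they were given."""
    return "".join(render_message(m["role"], m["content"]) for m in messages)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
print(render_conversation(conversation))
```

In practice you would let the model's tokenizer or the serving API do this rendering; writing it by hand is mainly useful for understanding and debugging.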

Multi-Turn Conversations

For ongoing conversations, each new round adds more blocks. The model has no memory on its own, so you must re-send the full history every time you call it.

<|im_start|>system
You are a helpful cooking assistant.<|im_end|>
<|im_start|>user
What can I make with eggs and cheese?<|im_end|>
<|im_start|>assistant
You could make an omelette, a frittata, or scrambled eggs with cheese.<|im_end|>
<|im_start|>user
How do I make a frittata?<|im_end|>
<|im_start|>assistant
← model generates here

The last open <|im_start|>assistant tag without a closing <|im_end|> is the model's cue to start generating its next response.
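As a rough sketch of that loop: the code below keeps a history list, re-renders all of it on every turn, and ends the prompt with an open assistant block. The names `build_prompt`, `chat_turn`, and `fake_generate` are hypothetical, and `fake_generate` is only a stand-in for a real model call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_prompt(history: List[Message]) -> str:
    """Render the full history, then leave an assistant block open as the cue."""
    blocks = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in history]
    return "".join(blocks) + "<|im_start|>assistant\n"


def chat_turn(history: List[Message], user_text: str,
              generate: Callable[[str], str]) -> str:
    """Append the user message, send the whole history, then store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = generate(build_prompt(history))
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_generate(prompt: str) -> str:
    """Stand-in for a model: reports how many user blocks the prompt contained."""
    return f"I can see {prompt.count('<|im_start|>user')} user message(s)."


history: List[Message] = [
    {"role": "system", "content": "You are a helpful cooking assistant."}
]
print(chat_turn(history, "What can I make with eggs and cheese?", fake_generate))
print(chat_turn(history, "How do I make a frittata?", fake_generate))
```

The second call sees both user messages only because the first exchange was stored in `history` and sent again.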

Qwen3 Thinking Mode

Qwen3 can show its reasoning before giving a final answer. This happens inside <think> … </think> tags, which appear inside the assistant block.

<|im_start|>user
If I have 12 apples and give away a third, how many do I have?<|im_end|>
<|im_start|>assistant
<think>
A third of 12 is 4. So 12 minus 4 equals 8.
</think>

You would have 8 apples left.<|im_end|>

The text inside <think> is the model's internal scratchpad. Only the text after </think> is the response the user sees. To turn thinking off, pass enable_thinking=False when applying the chat template (for example, to tokenizer.apply_chat_template in Hugging Face Transformers).
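If your app receives the raw assistant text, it needs to separate the scratchpad from the reply before showing anything to the user. A minimal sketch, assuming at most one <think> block at the start of the text; `split_thinking` is a hypothetical helper name:

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(assistant_text: str) -> Tuple[str, str]:
    """Return (reasoning, reply). Reasoning is empty when no <think> block exists."""
    match = THINK_RE.search(assistant_text)
    if match is None:
        return "", assistant_text.strip()
    reasoning = match.group(1).strip()
    reply = assistant_text[match.end():].strip()
    return reasoning, reply


raw = (
    "<think>\n"
    "A third of 12 is 4. So 12 minus 4 equals 8.\n"
    "</think>\n"
    "\n"
    "You would have 8 apples left."
)
reasoning, reply = split_thinking(raw)
print("reasoning:", reasoning)
print("reply:", reply)
```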

Tool Calls (Function Calling)

When the model needs to call an external function (like a web search or calculator), it emits a <tool_call> block. Your app runs the function, then sends back a <tool_response> block with the result.

<|im_start|>user
What's the weather in Tokyo right now?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": "18°C", "condition": "Partly cloudy"}
</tool_response><|im_end|>
<|im_start|>assistant
It's currently 18°C and partly cloudy in Tokyo.<|im_end|>

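In the raw token stream shown here, the tool result travels inside a user block, wrapped in <tool_response> tags. At the message level you still label it with the tool role; Qwen's chat template renders a tool message this way. The model then reads the data and writes its final reply.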
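The app-side half of this exchange might look like the sketch below: pull the JSON out of each <tool_call>, run the matching function, and wrap the result in a <tool_response> block. Everything named here (`extract_tool_calls`, `format_tool_response`, the `TOOLS` registry, and a `get_weather` that returns canned data) is hypothetical and only for illustration.

```python
import json
import re
from typing import Any, Callable, Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(assistant_text: str) -> List[Dict[str, Any]]:
    """Parse the JSON payload of every <tool_call> block in the assistant text."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(assistant_text)]


def format_tool_response(result: Any) -> str:
    """Wrap a tool result the way the example above does: inside a user block."""
    payload = json.dumps(result, ensure_ascii=False)
    return f"<|im_start|>user\n<tool_response>\n{payload}\n</tool_response><|im_end|>\n"


def get_weather(city: str) -> Dict[str, str]:
    """Hypothetical tool; returns canned data instead of calling a weather service."""
    return {"temperature": "18°C", "condition": "Partly cloudy"}


# Registry mapping tool names to the functions the app is willing to run.
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}

assistant_text = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n'
    "</tool_call>"
)
for call in extract_tool_calls(assistant_text):
    result = TOOLS[call["name"]](**call["arguments"])
    print(format_tool_response(result))
```

Looking the name up in an explicit registry, rather than evaluating whatever the model emits, keeps the model limited to functions you chose to expose.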

Using ChatML via API

When you call a chat API that accepts role-tagged messages (OpenAI's, Qwen's, or most servers for open-source models), you don't write the raw token strings. You send a JSON array of message objects, and the API converts it for you.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 equals 4."},
    {"role": "user", "content": "And 3 + 3?"}
]

The API takes this array and turns it into the <|im_start|> / <|im_end|> format automatically before sending it to the model.
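Going the other way can help when debugging, for instance when a log contains the raw prompt and you want the message array back. The following is a rough sketch with a hypothetical `parse_chatml` helper; it assumes well-formed, closed blocks and does not handle a trailing open assistant block.

```python
import re
from typing import Dict, List

# One closed block: <|im_start|>role, newline, content, <|im_end|>
BLOCK_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(raw: str) -> List[Dict[str, str]]:
    """Recover a list of {"role", "content"} dicts from raw ChatML text."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in BLOCK_RE.findall(raw)
    ]


raw = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is 2 + 2?<|im_end|>\n"
    "<|im_start|>assistant\n2 + 2 equals 4.<|im_end|>\n"
    "<|im_start|>user\nAnd 3 + 3?<|im_end|>\n"
)
messages = parse_chatml(raw)
for message in messages:
    print(message)
```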

Tips & Common Mistakes

Rule	Why it matters
Always close every block	A missing <|im_end|> will confuse the model and break output.
System message goes first	Put it at the top, before any user/assistant messages.
Send the full history	Models have no memory. Re-send every prior message each turn.
Keep roles accurate	Don't put user text in the system role. It defeats the security benefit.
Be specific in system prompts	Vague instructions produce inconsistent behavior.
One system message is enough	Too many system blocks can cause unpredictable behavior.
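Several of these rules can be checked mechanically before a prompt is sent. The sketch below is a hypothetical `lint_chatml` helper covering only the structural rules (closed blocks, known roles, system first, a single system message). It treats a trailing open assistant block as valid, since that is the generation cue.

```python
import re
from typing import List

VALID_ROLES = {"system", "user", "assistant", "tool"}
TOKEN_RE = re.compile(r"<\|im_start\|>(\w*)|<\|im_end\|>")


def lint_chatml(raw: str) -> List[str]:
    """Return a list of problems found in raw ChatML text; empty means none."""
    problems: List[str] = []
    roles: List[str] = []
    open_block = False
    for match in TOKEN_RE.finditer(raw):
        if match.group(0).startswith("<|im_start|>"):
            if open_block:
                problems.append(f"block '{roles[-1]}' is missing <|im_end|>")
            role = match.group(1)
            if role not in VALID_ROLES:
                problems.append(f"unknown role '{role}'")
            roles.append(role)
            open_block = True
        else:
            if not open_block:
                problems.append("<|im_end|> without a matching <|im_start|>")
            open_block = False
    # A trailing open assistant block is the generation cue, so it is allowed.
    if open_block and roles[-1] != "assistant":
        problems.append(f"block '{roles[-1]}' is missing <|im_end|>")
    if "system" in roles and roles[0] != "system":
        problems.append("system message is not first")
    if roles.count("system") > 1:
        problems.append("more than one system message")
    return problems


bad = "<|im_start|>user\nHi\n<|im_start|>system\nBe brief.<|im_end|>\n"
for problem in lint_chatml(bad):
    print(problem)
```

Rules about content, such as keeping user text out of the system role or writing specific instructions, still need human review.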

ChatML was originally introduced by OpenAI and is now widely adopted across many model families including Qwen, SmolLM, and others. Qwen3-specific features (<think>, <tool_call>) are Alibaba/Qwen extensions to the base format.

