@vassvik
Last active February 28, 2026 14:57

Getting Started with the Claude API

The Claude API lets you send text to a model and get text back. That's it. There's no session, no memory, no connection between calls — every request is independent.

Disclaimer: this is based on a day of hands-on experimentation, not deep expertise. I don't speak with authority on exactly how things work internally, but everything here reflects what I observed in practice. All outputs shown are real API responses.

To follow along, you need:

  • Python 3.8+
  • pip install anthropic
  • An API key from console.anthropic.com (pay-per-token, separate from any subscription)

1. Your First API Call

from anthropic import Anthropic

client = Anthropic(api_key="sk-ant-YOUR_KEY_HERE")

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    messages=[{"role": "user", "content": "How many 'e' are there in strawberry?"}],
)

print(response)

Here's the full response object:

{
  "id": "msg_016x6AvBAQabiUV8BeECc39h",
  "type": "message",
  "role": "assistant",
  "model": "claude-haiku-4-5-20251001",
  "content": [
    {
      "type": "text",
      "text": "To count the letter 'e' in \"strawberry\", I'll examine each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time** (in the 7th position)."
    }
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 19,
    "output_tokens": 63,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0
  }
}

Let's break that down.

The request

Three parameters are required:

| Parameter | What it does |
| --- | --- |
| model | Which model to use (see pricing table below) |
| max_tokens | Hard cap on how many tokens the model can generate |
| messages | A list of {role, content} objects — the conversation |

Messages have role set to user or assistant. Conventionally they alternate, but the API doesn't strictly enforce ordering — consecutive messages with the same role work fine. The only hard rule on Sonnet and Opus 4.6 is that the last message must be user. This means you can't do "prefilling" — ending with an assistant message to force the model to continue from a specific starting point (e.g. prefilling {" to force JSON output). Haiku still allows prefilling; on newer models, use structured output or system instructions instead.
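
For reference, here is what a prefill-style request looks like as plain data. This is a sketch with a hypothetical helper (build_prefill_request is my own name, not an SDK function), usable only on models that still accept a trailing assistant message:

```python
# Sketch of a prefill request. The final assistant message seeds the reply:
# the model continues from that text instead of starting fresh.
def build_prefill_request(user_prompt: str, prefill: str) -> dict:
    """Keyword arguments for client.messages.create() with an assistant prefill."""
    return {
        "model": "claude-haiku-4-5-20251001",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": user_prompt},
            # On models that reject trailing assistant messages, this request fails.
            {"role": "assistant", "content": prefill},
        ],
    }

request = build_prefill_request("List three fruits as a JSON array.", '["')
```

With this prefill, the model's reply continues from `["`, so concatenating the prefill and the response text should yield parseable JSON.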

There are also optional parameters:

| Parameter | What it does |
| --- | --- |
| system | A string (or list of text blocks) that sets model behavior. Just another text field with no special mechanics — it's conventional to put instructions here. |
| temperature | Controls randomness (0.0 = nearly deterministic, 1.0 = default) |
| top_p / top_k | Alternative sampling controls — truncate the probability distribution instead of reshaping it |
| tools | Tool definitions the model can "call" (structured output — your code executes them) |
| thinking | Enable extended thinking (chain-of-thought reasoning) |
| stream | Stream the response token-by-token instead of waiting for the full thing |

The response

The response content is always a list of typed blocks. For a simple text response, it's a single text block. When thinking is enabled, you'll also get thinking blocks. When the model wants to call a tool, you get tool_use blocks.

stop_reason tells you why the model stopped:

  • end_turn — it finished naturally
  • max_tokens — it hit the max_tokens cap (response is truncated)
  • tool_use — it wants you to execute a tool call and send the result back
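
In client code this usually becomes a small dispatch on stop_reason. A minimal sketch (the action names are my own, not from the SDK):

```python
def next_step(stop_reason: str) -> str:
    """Map a stop_reason to the caller's next action."""
    actions = {
        "end_turn": "show_response",        # finished naturally
        "max_tokens": "handle_truncation",  # raise max_tokens or accept partial output
        "tool_use": "run_tools",            # execute the tool calls, send results back
    }
    return actions.get(stop_reason, "unknown")

print(next_step("max_tokens"))  # handle_truncation
```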

Tokens and cost

The model operates on tokens, not characters or words. Tokens are subword pieces from a fixed vocabulary — common words like "the" are one token, rare words get split into multiple. The tokenizer handles any UTF-8 text, just more or less efficiently.

The usage field tells you exactly what you were charged for:

  • input_tokens — tokens in your request (messages + system + tool definitions)
  • output_tokens — tokens the model generated

Cost = input_tokens * input_price + output_tokens * output_price. For Haiku at $1/$5 per million tokens:

19 input * $1/1M + 63 output * $5/1M = $0.000019 + $0.000315 = $0.000334

About a thirtieth of a cent.
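
That arithmetic generalizes to a one-line helper (prices quoted per million tokens, matching the pricing table below):

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """API cost in dollars, with prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# The Haiku call above: 19 input and 63 output tokens at $1/$5 per MTok.
print(f"${cost_usd(19, 63, 1, 5):.6f}")  # $0.000334
```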

The model is nondeterministic

Run the same call 10 times and you'll get different responses each time:

for i in range(10):
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": "How many 'e' are there in strawberry?"}],
    )
    print(f"{i+1}. {response.content[0].text}")
 1. To count the letter 'e' in "strawberry", I'll examine each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time** (in the 7th position).
 2. To count the letter 'e' in "strawberry", I'll go through each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time**.
 3. To count the letter 'e' in the word "strawberry", I'll examine each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time**.
 4. To count the letter 'e' in "strawberry", I'll go through each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time**.       <-- same as #2
 5. To count the letter 'e' in "strawberry", I'll examine each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time** (in the 7th position).  <-- same as #1
 6. To count the 'e' in "strawberry", I'll go through each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time** in "strawberry".
 7. To count the letter 'e' in "strawberry", I'll go through each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time**.       <-- same as #2
 8. To count the letter 'e' in the word "strawberry", I'll examine each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time**. <-- same as #3
 9. To count the letter 'e' in "strawberry", I'll go through each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time**.       <-- same as #2
10. To count the letter 'e' in "strawberry", I'll go through each letter:\n\ns-t-r-a-w-b-e-r-r-y\n\nThe letter 'e' appears **1 time**.       <-- same as #2

Same answer every time, but the wording varies — and some responses are identical to each other. "How many letters in strawberry" used to be a classic LLM gotcha — older models would confidently miscount. I ran several dozen attempts and even Haiku (the cheapest model) got it right every single time, always spelling the word out letter by letter before answering. Whether the spelling-out is what makes it reliable or the whole sequence is just a trained pattern at this point is hard to say — but either way, the failure mode is gone.


2. Models and Cost

Three model tiers, trading cost for capability:

| Model | ID | Input | Output | Speed |
| --- | --- | --- | --- | --- |
| Haiku | claude-haiku-4-5-20251001 | $1/MTok | $5/MTok | Fast |
| Sonnet | claude-sonnet-4-6 | $3/MTok | $15/MTok | Medium |
| Opus | claude-opus-4-6 | $5/MTok | $25/MTok | Slower |

Let's compare them on a different question — one that any human would answer without hesitation, but that has a subtle linguistic ambiguity:

from anthropic import Anthropic

client = Anthropic(api_key="sk-ant-YOUR_KEY_HERE")

prompt = "You need to wash your car, but the car wash is 50 meters away. Do you drive or do you walk?"

for model in ["claude-haiku-4-5-20251001", "claude-sonnet-4-6", "claude-opus-4-6"]:
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"\n--- {model} ---")
    print(response.content[0].text)
    print(f"Tokens: in={response.usage.input_tokens} out={response.usage.output_tokens}")

Haiku (83 output tokens):

I'd walk -- it's only 50 meters (about 150 feet, or a half-block).

Driving would be:

  • Slower than walking for such a short distance
  • Pointless (the car's already dirty, so no benefit to driving it there)
  • A waste of fuel and effort

Sonnet (214 output tokens):

You walk (or at least, you don't drive there to get it washed). [...]

Actually the straightforward answer is: you drive the car there, because that's how car washes work — the car needs to get to the wash regardless of the distance.

Opus (156 output tokens):

Most people would drive, and it makes the most sense because:

  • You need the car at the car wash — that's the whole point
  • Whether it's an automatic wash or self-serve, the car needs to be there
  • Walking there doesn't accomplish anything since you'd have to go back for the car anyway

Any human would answer "drive" without hesitation — you need the car at the car wash. Haiku answered "walk," treating it purely as a travel distance question. Sonnet waffled but eventually got there. Opus answered "drive" cleanly.

More capable models handle these kinds of questions better — or you can use thinking to give cheaper models more room to reason, as we'll see next.


3. Extended Thinking

Extended thinking gives the model a scratch pad to reason through problems before answering. You enable it by adding a thinking parameter:

from anthropic import Anthropic

client = Anthropic(api_key="sk-ant-YOUR_KEY_HERE")

prompt = "You need to wash your car, but the car wash is 50 meters away. Do you drive or do you walk?"

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=8192,
    messages=[{"role": "user", "content": prompt}],
    thinking={"type": "enabled", "budget_tokens": 4000},
)

for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("\n=== RESPONSE ===")
        print(block.text)

Haiku with thinking (268 output tokens):

=== THINKING ===
This is a straightforward practical question. Let me think about the pros
and cons:

Walk:
- 50 meters is quite short (about a 30-45 second walk)
- Saves gas/fuel
- No need to start the engine
[...]

The logical answer is to walk. 50 meters is a very short distance [...]

=== RESPONSE ===
I'd **walk**.

50 meters is only about a minute's walk—short enough that driving uses more
energy and fuel than it saves. Plus, you'd need to drive back anyway.

Still wrong — Haiku with thinking listed pros and cons but never considered that the car needs to be there. More thinking tokens didn't help because the model never explored the right angle.

Now the same thing with Sonnet:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    messages=[{"role": "user", "content": prompt}],
    thinking={"type": "enabled", "budget_tokens": 8000},
)

Sonnet with thinking (248 output tokens):

=== THINKING ===
This is a fun logical/practical question. Let me think about it carefully.

If the car wash is 50 meters away, and you need to wash your car, you would
drive the car there - because the whole point is to wash the car. You can't
wash your car at the car wash if you walk there without the car.
[...]

=== RESPONSE ===
The answer is pretty clearly **you drive** — because you need the **car** to
be at the car wash to wash it. Walking there without the car doesn't
accomplish the goal.

Thinking fixed Sonnet. Without thinking, it waffled. With thinking, it caught the constraint immediately: the car needs to be at the car wash.

What's happening mechanically

The model generates text token by token. Each token becomes context for the next. When it writes "this is a trick question, let me think carefully" in its thinking block, those tokens steer subsequent reasoning — similar to how writing out intermediate steps on paper helps you catch errors.

Thinking blocks appear before the text response in response.content. Thinking tokens count toward max_tokens and you're billed for them. On Claude 4 models, the thinking trace you see is a summary of the internal reasoning — you're billed for the full (hidden) trace.

Thinking modes

There are two ways to enable thinking:

Fixed budget (used above):

thinking={"type": "enabled", "budget_tokens": 10000}

You set a cap; the model uses what it needs up to that limit. The minimum is 1,024 tokens.

Adaptive (recommended for newer models):

thinking={"type": "adaptive"}

Model decides when and how much to think. Can be combined with an effort parameter:

thinking={"type": "adaptive"},
output_config={"effort": "medium"}  # low, medium, high (default), max (API only)

This is important: thinking tokens share the max_tokens budget with the actual response. With high effort, the model routinely spends tens of thousands of tokens thinking on complex prompts. If thinking consumes the entire budget, you get a truncated or empty response (stop_reason: "max_tokens") — and you still pay for all those thinking tokens. This is a common issue in practice. Claude Code defaults to 32K max_tokens (CLAUDE_CODE_MAX_OUTPUT_TOKENS), and even Opus 4.6's 128K max can get consumed by thinking on open-ended problems. medium or even low effort works well for most use cases — especially on Opus, which is capable enough to handle most tasks without deep thinking.
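
One defensive pattern when using a fixed budget (my own sketch, not an official recommendation) is to derive budget_tokens from max_tokens so the visible response always has headroom:

```python
def thinking_budget(max_tokens: int, response_reserve: int = 2048) -> int:
    """Pick a budget_tokens value that leaves `response_reserve` tokens for the
    visible response. 1,024 is the API's minimum fixed thinking budget."""
    return max(1024, max_tokens - response_reserve)

budget = thinking_budget(8192)  # 6144
request_kwargs = {
    "max_tokens": 8192,
    "thinking": {"type": "enabled", "budget_tokens": budget},
}
```

The budget is a cap, not a target, so this guard only matters on prompts where the model actually uses most of it.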

The cost of thinking

Thinking can get expensive. Comparing the car wash calls:

| Model | Thinking | Output tokens (billed) | Visible response | Cost | Correct? |
| --- | --- | --- | --- | --- | --- |
| Haiku | off | 83 | 83 | $0.0004 | No |
| Haiku | 4k budget | 268 | ~80 | $0.0014 | No |
| Sonnet | off | 214 | 214 | $0.0032 | Waffled |
| Sonnet | 8k budget | 248 | ~120 | $0.0037 | Yes |
| Opus | off | 156 | 156 | $0.0039 | Yes |

The "Output Tokens (billed)" column includes thinking tokens — you pay for all of them even though most aren't visible in the response. On Claude 4 models, the thinking trace you see is a summary; the actual internal reasoning is larger.

The cost difference is small here — partly because it's a simple prompt, and partly because Sonnet's waffling without thinking produced a long response anyway. On complex problems, thinking can get expensive.

Thinking flipped Sonnet from waffling to immediately correct. Meanwhile, Opus got it right without thinking at all, and Haiku got it wrong even with thinking. The tradeoff: pay for a more capable model upfront, or pay for more thinking tokens on a cheaper model. For most tasks, low effort with adaptive thinking is a good default.


4. The "Agent" is Just a Loop

If you've used Claude Code, you've used an agent. As far as I can tell, there's no special agent API — it's essentially just a loop:

messages = []  # the entire "memory" lives client-side in this list

while True:
    user_input = input("> ")
    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=4096, messages=messages
    )

    assistant_message = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_message})
    print(assistant_message)

Every iteration is a fresh, stateless API call. The "memory" is just the messages list that you keep appending to and sending back each time. The model doesn't remember anything — you're replaying the full conversation on every call.

When tool use is involved (the model asking to read a file, run a command, etc.), there's an inner loop:

while True:
    response = client.messages.create(...)

    if response.stop_reason == "tool_use":
        # Model wants to call a tool — execute it, send results back
        tool_results = execute_tools(response.content)
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
    else:
        break  # Model is done, show the final response

Claude Code appears to work this way. The model suggests file reads and edits, the client executes them locally and sends results back. The model never touches your filesystem directly. The client can refuse any tool call — that's what permission prompts are.
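
That refusal step reduces to a client-side gate before execution. A toy sketch with a hypothetical allowlist (the tool names are made up for illustration):

```python
ALLOWED_TOOLS = {"read_file", "list_directory"}  # hypothetical allowlist

def permitted(tool_name: str) -> bool:
    """Client-side permission check: the model only *suggests* tool calls;
    the client decides whether to execute them."""
    return tool_name in ALLOWED_TOOLS

print(permitted("read_file"), permitted("run_command"))  # True False
```

An interactive client would replace the static set with a prompt to the user, which is effectively what Claude Code's permission dialogs are.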

You control everything

Since the API is stateless and every call is just "here's the full context, give me a response," you have control over what the model sees:

  • Prune context: drop irrelevant earlier messages to save tokens
  • Edit history: fix mistakes in earlier turns before sending the next request
  • Branch: take a conversation, send two different follow-ups — now you have two threads that share a common history
  • Merge: pull context from multiple threads into one request
  • Route: maintain separate cached contexts (one with docs loaded, one with code loaded) and route questions to whichever already has the right context

The tradeoff is that you're replaying the full conversation every call, so input tokens grow with each turn. The API has prompt caching (pay 0.1x for repeated prefixes), but you need to manage it carefully — what to cache, what to prune, when to start fresh. More control means more responsibility.
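
Branching, for instance, is plain list manipulation, since the history is just data you own. A sketch (each branch would then be sent to messages.create separately):

```python
history = [
    {"role": "user", "content": "Summarize this design doc: ..."},
    {"role": "assistant", "content": "The doc proposes ..."},
]

# Two follow-ups sharing the same prefix. A shared prefix is also exactly
# what prompt caching rewards, since cached prefix reads bill at 0.1x.
branch_a = history + [{"role": "user", "content": "List the main risks."}]
branch_b = history + [{"role": "user", "content": "Draft an announcement."}]
```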


5. Cost Reality Check

Claude's API pricing is among the most expensive of any vendor. But the subscription pricing is subsidized relative to what the equivalent API usage would actually cost.

Consider Claude Code. Even a fresh session with nothing going on has a baseline context cost per turn — system prompt, tool definitions, and conversation scaffolding all count as input tokens. With caching, a single baseline turn costs roughly $0.015-0.020 on Opus, $0.010-0.015 on Sonnet, and less than $0.01 on Haiku. That's before you've done anything useful. Additional turns cost around $0.02 early on, and can grow to $0.20 later in a session even with 99% cache hit rates, as the conversation history accumulates.

For Opus, a single session will frequently accumulate $10-20 in equivalent API costs before Claude Code starts auto-compacting the context (at 85% capacity). A Max subscription at $100/month is heavily subsidized relative to what that usage would actually cost at API rates — a Max20 at $200/month doubly so.

But here's the flip side: with the API, you control exactly what goes into each request. A custom script with a focused system prompt and no tool definitions has a per-turn cost that's a fraction of Claude Code's baseline. For narrowly-scoped tasks — batch processing, classification, extraction — a well-optimized API integration can be much cheaper than subscription usage for the equivalent work.


This tutorial covered the basics: sending messages, understanding the response, comparing models, and using extended thinking. For advanced topics like tool use, streaming, caching, and structured output, see the official docs.
