@foklepoint
Last active April 6, 2026 07:15
Browser Use v3 deterministic rerun caching bug repro — scripts never saved to workspace, LLM cost on every run
#!/usr/bin/env python3
"""
Browser Use v3 — Deterministic Rerun Caching Bug Report + Repro
================================================================
Filed by: Saurabh Sharma (foklepoint)
Date: 2026-04-06
Agent: Claude Opus 4.6 (via Claude Code CLI) — this entire investigation
was done by an AI agent debugging a $65 bill from one day of usage.
TL;DR
-----
Deterministic rerun caching (the @{{param}} feature) does not work.
workspace.files(prefix="scripts/") is always empty after runs. Every
session runs full LLM inference. This cost us $46 in LLM charges across
119 sessions in a single day that should have been ~$0 after the first
few template-building runs.
Tested: SDK 3.4.2, REST API, their exact HackerNews docs example, fresh
workspaces, explicit cache_script=True — no combination produces cached
scripts. Also found a pagination bug on the sessions list endpoint.
WHAT WE WERE BUILDING
---------------------
A LinkedIn outreach automation for a conference (Stanford FutureLaw).
The script (linkedin.py) wraps Browser Use sessions as CLI commands:
python linkedin.py visit-profile https://linkedin.com/in/someone/
python linkedin.py like-post https://linkedin.com/feed/update/urn:li:activity:123/
python linkedin.py send-dm https://linkedin.com/in/someone/ "Hey..."
Each command creates a BU session with @{{param}} template syntax, so
the first run builds a cached script and subsequent runs replay it at
$0 LLM cost. That's the theory from the docs. In practice, no scripts
are ever saved and every run is full LLM inference.
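A minimal sketch of how such a wrapper can map CLI commands to templated tasks (the names below are hypothetical; linkedin.py itself is not included in this gist):

```python
# Hypothetical command -> task-template table for a wrapper like
# linkedin.py. The literal @{{...}} brackets are what Browser Use's
# auto-detect is documented to look for; the wrapper substitutes the
# CLI argument inside them.
TEMPLATES = {
    "visit-profile": "Go to @{{url}}. Confirm the profile loaded.",
    "like-post": "Open @{{url}} and like the post.",
}

def build_task(command: str, url: str) -> str:
    # Keep the @{{...}} wrapper intact so every run of a given
    # command shares one cacheable template shape.
    return TEMPLATES[command].replace("@{{url}}", "@{{" + url + "}}")

print(build_task("visit-profile", "https://linkedin.com/in/someone/"))
# Go to @{{https://linkedin.com/in/someone/}}. Confirm the profile loaded.
```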
INVESTIGATION TIMELINE (2026-04-06)
------------------------------------
1. User noticed $65 BU bill from one day. Asked agent to investigate.
2. Agent queried BU sessions API. Hit pagination bug first — used
pageNumber/pageSize params (from OpenAPI spec), got 580 "sessions"
that were actually 20 unique IDs repeated 29x each. Wasted ~30 min
on wrong analysis before discovering correct params are page/page_size.
3. With correct pagination: 119 real sessions, $46.02 LLM inference.
Broke down by source:
- Script building/testing (2 AM): 16 sessions, $7.24
- Discovery cron (5 AM): 45 sessions, $17.00
- Engagement cron (8 AM): 48 sessions, $18.86
- Engagement cron run 2 (4 PM): 10 sessions, $3.23
4. Noticed linkedin.py used f-string @{{{var}}} which produces @{value}
(single braces) instead of @{{value}} (double braces). Fixed it.
But caching still didn't work.
5. Tested with correct @{{value}} syntax via REST API:
- visit-profile with cacheScript=true, 3 runs with different URLs
- Run 1: $0.096 LLM, 15 steps (expected: build cache)
- Run 2: $0.044 LLM, 8 steps (expected: $0, cached)
- Run 3: $0.055 LLM, 10 steps (expected: $0, cached)
- workspace.files(prefix="scripts/"): EMPTY
6. Tested with their exact HackerNews docs example via REST API:
- Run 1: $0.005 LLM, success=True
- Run 2: $0.002 LLM, success=False
- scripts/: EMPTY
7. Installed browser-use-sdk 3.4.2, switched to official SDK.
Tested exact docs example with SDK:
- Fresh workspace created via client.workspaces.create()
- Run 1: $0.020 LLM, 3 steps, success=True
- Run 2: $0.005 LLM, 1 step, success=None
- await client.workspaces.files(ws_id, prefix="scripts/"): EMPTY
- All workspace files: just output JSON files, no scripts/
8. Tried existing workspace, explicit cache_script=True, different
param values — all the same. Zero scripts ever saved.
9. Conclusion: caching feature does not work via SDK or REST API.
The workspace never receives scripts/ files regardless of config.
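The f-string brace bug from step 4 reproduces in isolation: Python f-strings collapse doubled braces into literals, so emitting a literal @{{...}} wrapper around an interpolated value takes four braces on each side.

```python
# Python f-strings treat {{ as an escaped literal "{", so the
# "obvious" three-brace form yields single braces around the value.
url = "https://linkedin.com/in/someone/"

broken = f"@{{{url}}}"      # {{ -> "{", {url} -> value, }} -> "}"
fixed = f"@{{{{{url}}}}}"   # {{{{ -> "{{", {url} -> value, }}}} -> "}}"

print(broken)  # @{https://linkedin.com/in/someone/}
print(fixed)   # @{{https://linkedin.com/in/someone/}}
```

Plain concatenation (`"@{{" + url + "}}"`) sidesteps the brace counting entirely.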
FULL SESSION IDs FROM OUR TESTS
--------------------------------
All on 2026-04-06, model=gemini-3-flash unless noted.
SDK tests (fresh workspace 34e0cb2a-151f-48e3-975a-1821347b96bc):
HN run 1: LLM=$0.0198, steps=3, success=True — no script saved
HN run 2: LLM=$0.0053, steps=1, success=None — no script saved
SDK tests (existing workspace 6eb952a1-f281-4ae4-bfd7-1da580f7cd34):
HN run 1: LLM=$0.0064, steps=2, success=True — no script saved
HN run 2: LLM=$0.0060, steps=1, success=None — no script saved
REST API tests (existing workspace 6eb952a1):
HN run 1: LLM=$0.0047, steps=1, success=True — no script saved
HN run 2: LLM=$0.0023, steps=1, success=False — no script saved
REST API LinkedIn visit-profile tests (cacheScript=true):
Session 8192a428 (Roland Vogl): LLM=$0.0962, 15 steps
Session e60e247e (Simon Agar): LLM=$0.0436, 8 steps <- should be $0
Session 23e22cba (Pablo Arredondo): LLM=$0.0548, 10 steps <- should be $0
Scripts in workspace: 0
REST API LinkedIn visit-profile tests (@{{}} auto-detect, no explicit cacheScript):
Session a68787ed (Dazza Greenwood): LLM=$0.0886, 13 steps
Session 164331d5 (Hugh Carlson): LLM=$0.0313, 4 steps <- should be $0
Scripts in workspace: 0
REST API test without @{{}} (broken f-string, single braces):
Session 68e0af46 (Dazza Greenwood): LLM=$0.0171, 4 steps
(This was our original bug — f-string produced @{value} not @{{value}})
SDK visit-profile test (confirmed working otherwise):
Session via linkedin.py: LLM=$0.0107, 3 steps, success=True, correct JSON output
(SDK works fine for running tasks, just no caching)
SECONDARY BUG: Sessions List Pagination
----------------------------------------
The /api/v3/sessions endpoint has two pagination interfaces. One is broken.
BROKEN (but accepted without error):
GET /api/v3/sessions?pageSize=50&pageNumber=1 -> returns 20 sessions
GET /api/v3/sessions?pageSize=50&pageNumber=2 -> returns SAME 20 sessions
GET /api/v3/sessions?pageSize=50&pageNumber=3 -> returns SAME 20 sessions
...
Every page returns the same 20 IDs. Looping until empty produces
infinite pages. We got 580 "sessions" (20 unique x 29 pages before
we stopped) which led to wildly wrong cost analysis.
The pageSize param is also ignored — always returns 20 regardless
of the value passed.
WORKING (but not in OpenAPI spec):
GET /api/v3/sessions?page=1&page_size=100 -> returns up to 100 unique sessions
GET /api/v3/sessions?page=2&page_size=100 -> returns remaining sessions
Response includes {"total": 119} which is correct.
The OpenAPI spec at /api/v3/openapi.json documents pageNumber/pageSize,
but those params silently produce the wrong results shown above; the
page/page_size params that actually work are undocumented.
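A defensive client-side workaround (a sketch, not an official helper; it only assumes each page is a list of session dicts with an "id" key, as observed above) is to deduplicate by session ID and stop as soon as a page contributes nothing new. This terminates cleanly even against the broken pageNumber/pageSize behavior, which would otherwise loop forever.

```python
def fetch_all_sessions(fetch_page, max_pages=100):
    """Paginate defensively: stop when a page adds no new session IDs.

    fetch_page(page) must return a list of session dicts with an "id"
    key (e.g. the "sessions" array from GET /api/v3/sessions).
    """
    seen = {}
    for page in range(1, max_pages + 1):
        sessions = fetch_page(page)
        new = [s for s in sessions if s["id"] not in seen]
        for s in new:
            seen[s["id"]] = s
        # An empty page OR a page of pure repeats means we're done.
        if not new:
            break
    return list(seen.values())

# Demo: a stub that mimics the broken endpoint (every page returns
# the same 20 sessions). The loop stops after the first repeat page.
stuck_page = lambda page: [{"id": i} for i in range(20)]
print(len(fetch_all_sessions(stuck_page)))  # 20
```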
COST IMPACT
-----------
Our production usage on 2026-04-05 (one day):
119 real BU sessions for LinkedIn outreach automation.
All on bu-ultra (claude-opus-4.6) — we've since switched to gemini-3-flash.
Total BU bill: $65.05
- BU Agent LLM Inference: $46.02 (70.7%) <- would be ~$0 if caching worked
- Skill Creation: $14.00 (21.5%) <- warm-cache attempts
- Proxy Data: $10.60 (16.3%)
- Browser Sessions: $0.31 (0.5%)
- Skill Execution: $0.12 (0.2%)
- Skill Creation Refund: -$6.00 (-9.2%)
Most expensive individual sessions (from dashboard):
"Connect with Roland Vogl": $2.25 (19m 29s!)
"Extract LinkedIn comments and reactions": $1.52
"Connect with Julie Chapman": $1.39
"Connect with Max Junestrand": $1.24
"Extract LinkedIn messages": $1.22 (35m 49s!)
"Message Anna Podolskaya": $1.04
"LinkedIn people search": $1.05
If caching had worked, the $46 LLM line would be ~$2-5 (first runs
only for ~14 unique templates). The remaining 100+ sessions would
replay cached scripts at $0 LLM.
ACCOUNT DETAILS
---------------
Account plan: subscription_50 ($50/mo, active)
Project ID: 371e993f-098f-4ce6-8735-97d7345f61b0
Profile ID: 1e1993bc-ca24-4acb-8df0-998cc7a273cb
Workspace IDs tested:
- 6eb952a1-f281-4ae4-bfd7-1da580f7cd34 (existing "My Files")
- 34e0cb2a-151f-48e3-975a-1821347b96bc (fresh "cache-test")
SDK: browser-use-sdk 3.4.2
Python: 3.13.7
OS: macOS Darwin 25.0.0
DOCS REFERENCED
---------------
- https://docs.browser-use.com/cloud/agent/cache-script
"Second call — cached script, different param ($0 LLM, ~5s)"
"No agent, no LLM."
- https://docs.browser-use.com/cloud/agent/quickstart
SDK usage examples
- OpenAPI spec at /api/v3/openapi.json
RunTaskRequest schema shows cacheScript field:
"null (default): auto-detected — enabled when the task contains
@{{value}} brackets and a workspace is attached."
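Per the auto-detect rule quoted above, the client side of the contract is just embedding @{{value}} brackets in the task. A quick local check for whether a task string should trigger auto-detection (a sketch; the server's actual detection logic is not public) can be done with a regex:

```python
import re

# Matches @{{...}} template parameters; non-greedy so multiple
# params in one task are captured separately.
PARAM_RE = re.compile(r"@\{\{(.+?)\}\}")

def template_params(task: str) -> list[str]:
    """Return the @{{...}} parameter values embedded in a task string."""
    return PARAM_RE.findall(task)

task = "Go to @{{https://example.com}}. Extract the page title."
print(template_params(task))  # ['https://example.com']
```

Note that single-brace @{value} output from the broken f-string does not match, which is why the original linkedin.py runs never qualified for auto-detection in the first place.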
HOW TO RUN THIS REPRO
---------------------
pip install browser-use-sdk
export BROWSER_USE_API_KEY=bu_your_key
python bu-cache-repro.py
Cost: ~$0.05-0.10 total (4 cheap gemini-3-flash sessions on example.com)
No LinkedIn or auth needed — uses example.com only.
Creates a fresh workspace, runs 3 sessions, checks for cached scripts.
"""
import asyncio
import os
import json
import urllib.request
# ── Check environment ──────────────────────────────────────────────
API_KEY = os.environ.get("BROWSER_USE_API_KEY")
if not API_KEY:
    print("ERROR: Set BROWSER_USE_API_KEY environment variable")
    print(" export BROWSER_USE_API_KEY=bu_your_key")
    raise SystemExit(1)
# ====================================================================
# PART 1: SDK-based repro (recommended — uses official browser-use-sdk)
# ====================================================================
async def test_sdk_caching():
    """Test caching via the official Python SDK."""
    from browser_use_sdk.v3 import AsyncBrowserUse
    client = AsyncBrowserUse()
    # ── Create fresh workspace ─────────────────────────────────────
    print("=" * 70)
    print("PART 1: SDK CACHING TEST")
    print("=" * 70)
    workspace = await client.workspaces.create(name="cache-repro-test")
    ws_id = str(workspace.id)
    print(f" Fresh workspace: {ws_id}")
    # ── Run 1: Should build cached script ──────────────────────────
    print()
    print(" --- Run 1: First call (should build cache) ---")
    print(' Task: "Go to @{{https://example.com}}. Extract the page title."')
    result1 = await client.run(
        "Go to @{{https://example.com}}. Extract the page title. Return as JSON.",
        workspace_id=ws_id,
        model="gemini-3-flash",
    )
    print(f" Session: {result1.id}")
    print(f" Steps: {result1.step_count}, LLM: ${result1.llm_cost_usd}, Success: {result1.is_task_successful}")
    # ── Check for scripts ──────────────────────────────────────────
    files = await client.workspaces.files(ws_id, prefix="scripts/")
    print(f" Scripts in workspace: {len(files.files)}")
    all_files = await client.workspaces.files(ws_id)
    print(f" All files: {[f.path for f in all_files.files]}")
    # ── Run 2: Different param, should hit cache ($0 LLM) ─────────
    print()
    print(" --- Run 2: Different param (should be cached, $0 LLM) ---")
    print(' Task: "Go to @{{https://example.org}}. Extract the page title."')
    result2 = await client.run(
        "Go to @{{https://example.org}}. Extract the page title. Return as JSON.",
        workspace_id=ws_id,
        model="gemini-3-flash",
    )
    print(f" Session: {result2.id}")
    print(f" Steps: {result2.step_count}, LLM: ${result2.llm_cost_usd}, Success: {result2.is_task_successful}")
    # ── Run 3: Explicit cache_script=True ──────────────────────────
    print()
    print(" --- Run 3: Explicit cache_script=True ---")
    result3 = await client.run(
        "Go to @{{https://www.iana.org/domains/reserved}}. Extract the page title. Return as JSON.",
        workspace_id=ws_id,
        model="gemini-3-flash",
        cache_script=True,
    )
    print(f" Session: {result3.id}")
    print(f" Steps: {result3.step_count}, LLM: ${result3.llm_cost_usd}, Success: {result3.is_task_successful}")
    # ── Final workspace check ──────────────────────────────────────
    files_final = await client.workspaces.files(ws_id, prefix="scripts/")
    all_final = await client.workspaces.files(ws_id)
    print()
    print(f" Scripts after all 3 runs: {len(files_final.files)}")
    print(f" All workspace files: {[f.path for f in all_final.files]}")
    # ── Verdict ────────────────────────────────────────────────────
    print()
    llm1 = float(result1.llm_cost_usd or 0)
    llm2 = float(result2.llm_cost_usd or 0)
    llm3 = float(result3.llm_cost_usd or 0)
    print(f" Run 1 LLM: ${llm1:.4f} (first run, expected: >$0)")
    print(f" Run 2 LLM: ${llm2:.4f} (cached rerun, expected: $0)")
    print(f" Run 3 LLM: ${llm3:.4f} (explicit cache_script=True, expected: $0)")
    print(f" Cached scripts: {len(files_final.files)} (expected: >=1)")
    if len(files_final.files) == 0 and llm2 > 0.001:
        print()
        print(" *** BUG CONFIRMED: No scripts saved. LLM cost on every run. ***")
        return False
    else:
        print()
        print(" Caching appears to be working!")
        return True
# ====================================================================
# PART 2: Pagination bug repro
# ====================================================================
def test_pagination_bug():
    """Demonstrate the sessions list pagination bug."""
    print()
    print("=" * 70)
    print("PART 2: SESSIONS LIST PAGINATION BUG")
    print("=" * 70)
    headers = {"Content-Type": "application/json", "X-Browser-Use-API-Key": API_KEY}
    # ── Broken params (pageNumber/pageSize) ────────────────────────
    print()
    print(" --- BROKEN: pageNumber/pageSize ---")
    ids_broken = set()
    total_returned = 0
    for pg in range(1, 4):
        url = f"https://api.browser-use.com/api/v3/sessions?pageSize=50&pageNumber={pg}"
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=15) as resp:
            data = json.loads(resp.read())
        sessions = data.get("sessions", [])
        new_ids = set(s["id"] for s in sessions) - ids_broken
        ids_broken.update(s["id"] for s in sessions)
        total_returned += len(sessions)
        print(f" Page {pg}: {len(sessions)} returned, {len(new_ids)} new unique (total unique: {len(ids_broken)})")
    print(f" Total returned across 3 pages: {total_returned}")
    print(f" Unique IDs: {len(ids_broken)}")
    if total_returned > len(ids_broken) * 1.5:
        print(f" *** BUG: {total_returned - len(ids_broken)} duplicate entries across pages ***")
    # ── Working params (page/page_size) ────────────────────────────
    print()
    print(" --- WORKING: page/page_size ---")
    ids_working = set()
    total_returned_2 = 0
    for pg in range(1, 4):
        url = f"https://api.browser-use.com/api/v3/sessions?page={pg}&page_size=100"
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=15) as resp:
            data = json.loads(resp.read())
        sessions = data.get("sessions", [])
        total_api = data.get("total", "?")
        new_ids = set(s["id"] for s in sessions) - ids_working
        ids_working.update(s["id"] for s in sessions)
        total_returned_2 += len(sessions)
        print(f" Page {pg}: {len(sessions)} returned, {len(new_ids)} new unique (total unique: {len(ids_working)}, api total: {total_api})")
        if not sessions:
            break
    print(f" Total unique sessions: {len(ids_working)}")
# ====================================================================
# PART 3: Billing check
# ====================================================================
def check_billing():
    """Show current account balance."""
    print()
    print("=" * 70)
    print("PART 3: ACCOUNT STATUS")
    print("=" * 70)
    headers = {"Content-Type": "application/json", "X-Browser-Use-API-Key": API_KEY}
    url = "https://api.browser-use.com/api/v3/billing/account"
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=15) as resp:
        data = json.loads(resp.read())
    print(f" Plan: {data.get('planInfo', {}).get('planName', '?')}")
    print(f" Balance: ${data.get('totalCreditsBalanceUsd', '?')}")
    print(f" Project ID: {data.get('projectId', '?')}")
# ====================================================================
# Main
# ====================================================================
if __name__ == "__main__":
    print("Browser Use v3 — Cache Bug Repro")
    print("SDK: browser-use-sdk 3.4.2")
    print("Docs: https://docs.browser-use.com/cloud/agent/cache-script")
    print()
    cache_works = asyncio.run(test_sdk_caching())
    test_pagination_bug()
    check_billing()
    print()
    print("=" * 70)
    if not cache_works:
        print("FINAL: Caching is broken. No scripts saved to workspace.")
        print("Every session runs full LLM inference regardless of:")
        print(" - @{{param}} syntax in task")
        print(" - workspace_id provided")
        print(" - explicit cache_script=True")
        print(" - fresh vs existing workspace")
        print(" - SDK vs REST API")
    else:
        print("FINAL: Caching worked! (If you're seeing this, BU may have fixed it.)")
    print("=" * 70)