Last active
April 6, 2026 07:15
-
-
Save foklepoint/5cff3557fab7b8fde347afcda829ca1f to your computer and use it in GitHub Desktop.
Browser Use v3 deterministic rerun caching bug repro — scripts never saved to workspace, LLM cost on every run
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| #!/usr/bin/env python3 | |
| """ | |
| Browser Use v3 — Deterministic Rerun Caching Bug Report + Repro | |
| ================================================================ | |
| Filed by: Saurabh Sharma (foklepoint) | |
| Date: 2026-04-06 | |
| Agent: Claude Opus 4.6 (via Claude Code CLI) — this entire investigation | |
| was done by an AI agent debugging a $65 bill from one day of usage. | |
| TL;DR | |
| ----- | |
| Deterministic rerun caching (the @{{param}} feature) does not work. | |
| workspace.files(prefix="scripts/") is always empty after runs. Every | |
| session runs full LLM inference. This cost us $46 in LLM charges across | |
| 119 sessions in a single day that should have been ~$0 after the first | |
| few template-building runs. | |
| Tested: SDK 3.4.2, REST API, their exact HackerNews docs example, fresh | |
| workspaces, explicit cache_script=True — no combination produces cached | |
| scripts. Also found a pagination bug on the sessions list endpoint. | |
| WHAT WE WERE BUILDING | |
| --------------------- | |
| A LinkedIn outreach automation for a conference (Stanford FutureLaw). | |
| The script (linkedin.py) wraps Browser Use sessions as CLI commands: | |
| python linkedin.py visit-profile https://linkedin.com/in/someone/ | |
| python linkedin.py like-post https://linkedin.com/feed/update/urn:li:activity:123/ | |
| python linkedin.py send-dm https://linkedin.com/in/someone/ "Hey..." | |
| Each command creates a BU session with @{{param}} template syntax, so | |
| the first run builds a cached script and subsequent runs replay it at | |
| $0 LLM cost. That's the theory from the docs. In practice, no scripts | |
| are ever saved and every run is full LLM inference. | |
| INVESTIGATION TIMELINE (2026-04-06) | |
| ------------------------------------ | |
| 1. User noticed $65 BU bill from one day. Asked agent to investigate. | |
| 2. Agent queried BU sessions API. Hit pagination bug first — used | |
| pageNumber/pageSize params (from OpenAPI spec), got 580 "sessions" | |
| that were actually 20 unique IDs repeated 29x each. Wasted ~30 min | |
| on wrong analysis before discovering correct params are page/page_size. | |
| 3. With correct pagination: 119 real sessions, $46.02 LLM inference. | |
| Broke down by source: | |
| - Script building/testing (2 AM): 16 sessions, $7.24 | |
| - Discovery cron (5 AM): 45 sessions, $17.00 | |
| - Engagement cron (8 AM): 48 sessions, $18.86 | |
| - Engagement cron run 2 (4 PM): 10 sessions, $3.23 | |
| 4. Noticed linkedin.py used f-string @{{{var}}} which produces @{value} | |
| (single braces) instead of @{{value}} (double braces). Fixed it. | |
| But caching still didn't work. | |
| 5. Tested with correct @{{value}} syntax via REST API: | |
| - visit-profile with cacheScript=true, 3 runs with different URLs | |
| - Run 1: $0.096 LLM, 15 steps (expected: build cache) | |
| - Run 2: $0.044 LLM, 8 steps (expected: $0, cached) | |
| - Run 3: $0.055 LLM, 10 steps (expected: $0, cached) | |
| - workspace.files(prefix="scripts/"): EMPTY | |
| 6. Tested with their exact HackerNews docs example via REST API: | |
| - Run 1: $0.005 LLM, success=True | |
| - Run 2: $0.002 LLM, success=False | |
| - scripts/: EMPTY | |
| 7. Installed browser-use-sdk 3.4.2, switched to official SDK. | |
| Tested exact docs example with SDK: | |
| - Fresh workspace created via client.workspaces.create() | |
| - Run 1: $0.020 LLM, 3 steps, success=True | |
| - Run 2: $0.005 LLM, 1 step, success=None | |
| - await client.workspaces.files(ws_id, prefix="scripts/"): EMPTY | |
| - All workspace files: just output JSON files, no scripts/ | |
| 8. Tried existing workspace, explicit cache_script=True, different | |
| param values — all the same. Zero scripts ever saved. | |
| 9. Conclusion: caching feature does not work via SDK or REST API. | |
| The workspace never receives scripts/ files regardless of config. | |
| FULL SESSION IDs FROM OUR TESTS | |
| -------------------------------- | |
| All on 2026-04-06, model=gemini-3-flash unless noted. | |
| SDK tests (fresh workspace 34e0cb2a-151f-48e3-975a-1821347b96bc): | |
| HN run 1: LLM=$0.0198, steps=3, success=True — no script saved | |
| HN run 2: LLM=$0.0053, steps=1, success=None — no script saved | |
| SDK tests (existing workspace 6eb952a1-f281-4ae4-bfd7-1da580f7cd34): | |
| HN run 1: LLM=$0.0064, steps=2, success=True — no script saved | |
| HN run 2: LLM=$0.0060, steps=1, success=None — no script saved | |
| REST API tests (existing workspace 6eb952a1): | |
| HN run 1: LLM=$0.0047, steps=1, success=True — no script saved | |
| HN run 2: LLM=$0.0023, steps=1, success=False — no script saved | |
| REST API LinkedIn visit-profile tests (cacheScript=true): | |
| Session 8192a428 (Roland Vogl): LLM=$0.0962, 15 steps | |
| Session e60e247e (Simon Agar): LLM=$0.0436, 8 steps <- should be $0 | |
| Session 23e22cba (Pablo Arredondo): LLM=$0.0548, 10 steps <- should be $0 | |
| Scripts in workspace: 0 | |
| REST API LinkedIn visit-profile tests (@{{}} auto-detect, no explicit cacheScript): | |
| Session a68787ed (Dazza Greenwood): LLM=$0.0886, 13 steps | |
| Session 164331d5 (Hugh Carlson): LLM=$0.0313, 4 steps <- should be $0 | |
| Scripts in workspace: 0 | |
| REST API test without @{{}} (broken f-string, single braces): | |
| Session 68e0af46 (Dazza Greenwood): LLM=$0.0171, 4 steps | |
| (This was our original bug — f-string produced @{value} not @{{value}}) | |
| SDK visit-profile test (confirmed working otherwise): | |
| Session via linkedin.py: LLM=$0.0107, 3 steps, success=True, correct JSON output | |
| (SDK works fine for running tasks, just no caching) | |
| SECONDARY BUG: Sessions List Pagination | |
| ---------------------------------------- | |
| The /api/v3/sessions endpoint has two pagination interfaces. One is broken. | |
| BROKEN (but accepted without error): | |
| GET /api/v3/sessions?pageSize=50&pageNumber=1 -> returns 20 sessions | |
| GET /api/v3/sessions?pageSize=50&pageNumber=2 -> returns SAME 20 sessions | |
| GET /api/v3/sessions?pageSize=50&pageNumber=3 -> returns SAME 20 sessions | |
| ... | |
| Every page returns the same 20 IDs. Looping until empty produces | |
| infinite pages. We got 580 "sessions" (20 unique x 29 pages before | |
| we stopped) which led to wildly wrong cost analysis. | |
| The pageSize param is also ignored — always returns 20 regardless | |
| of the value passed. | |
| WORKING (but not in OpenAPI spec): | |
| GET /api/v3/sessions?page=1&page_size=100 -> returns up to 100 unique sessions | |
| GET /api/v3/sessions?page=2&page_size=100 -> returns remaining sessions | |
| Response includes {"total": 119} which is correct. | |
| The OpenAPI spec at /api/v3/openapi.json documents page/page_size | |
| as the correct params, but pageNumber/pageSize are also silently | |
| accepted and produce wrong results. | |
| COST IMPACT | |
| ----------- | |
| Our production usage on 2026-04-05 (one day): | |
| 119 real BU sessions for LinkedIn outreach automation. | |
| All on bu-ultra (claude-opus-4.6) — we've since switched to gemini-3-flash. | |
| Total BU bill: $65.05 | |
| - BU Agent LLM Inference: $46.02 (70.7%) <- would be ~$0 if caching worked | |
| - Skill Creation: $14.00 (21.5%) <- warm-cache attempts | |
| - Proxy Data: $10.60 (16.3%) | |
| - Browser Sessions: $0.31 (0.5%) | |
| - Skill Execution: $0.12 (0.2%) | |
| - Skill Creation Refund: -$6.00 (-9.2%) | |
| Most expensive individual sessions (from dashboard): | |
| "Connect with Roland Vogl": $2.25 (19m 29s!) | |
| "Extract LinkedIn comments and reactions": $1.52 | |
| "Connect with Julie Chapman": $1.39 | |
| "Connect with Max Junestrand": $1.24 | |
| "Extract LinkedIn messages": $1.22 (35m 49s!) | |
| "Message Anna Podolskaya": $1.04 | |
| "LinkedIn people search": $1.05 | |
| If caching had worked, the $46 LLM line would be ~$2-5 (first runs | |
| only for ~14 unique templates). The remaining 100+ sessions would | |
| replay cached scripts at $0 LLM. | |
| ACCOUNT DETAILS | |
| --------------- | |
| Account plan: subscription_50 ($50/mo, active) | |
| Project ID: 371e993f-098f-4ce6-8735-97d7345f61b0 | |
| Profile ID: 1e1993bc-ca24-4acb-8df0-998cc7a273cb | |
| Workspace IDs tested: | |
| - 6eb952a1-f281-4ae4-bfd7-1da580f7cd34 (existing "My Files") | |
| - 34e0cb2a-151f-48e3-975a-1821347b96bc (fresh "cache-test") | |
| SDK: browser-use-sdk 3.4.2 | |
| Python: 3.13.7 | |
| OS: macOS Darwin 25.0.0 | |
| DOCS REFERENCED | |
| --------------- | |
| - https://docs.browser-use.com/cloud/agent/cache-script | |
| "Second call — cached script, different param ($0 LLM, ~5s)" | |
| "No agent, no LLM." | |
| - https://docs.browser-use.com/cloud/agent/quickstart | |
| SDK usage examples | |
| - OpenAPI spec at /api/v3/openapi.json | |
| RunTaskRequest schema shows cacheScript field: | |
| "null (default): auto-detected — enabled when the task contains | |
| @{{value}} brackets and a workspace is attached." | |
| HOW TO RUN THIS REPRO | |
| --------------------- | |
| pip install browser-use-sdk | |
| export BROWSER_USE_API_KEY=bu_your_key | |
| python bu-cache-repro.py | |
| Cost: ~$0.05-0.10 total (4 cheap gemini-3-flash sessions on example.com) | |
| No LinkedIn or auth needed — uses example.com only. | |
| Creates a fresh workspace, runs 3 sessions, checks for cached scripts. | |
| """ | |
| import asyncio | |
| import os | |
| import json | |
| import time | |
| import urllib.request | |
| # ── Check environment ────────────────────────────────────────────── | |
| API_KEY = os.environ.get("BROWSER_USE_API_KEY") | |
| if not API_KEY: | |
| print("ERROR: Set BROWSER_USE_API_KEY environment variable") | |
| print(" export BROWSER_USE_API_KEY=bu_your_key") | |
| raise SystemExit(1) | |
| # ==================================================================== | |
| # PART 1: SDK-based repro (recommended — uses official browser-use-sdk) | |
| # ==================================================================== | |
| async def test_sdk_caching(): | |
| """Test caching via the official Python SDK.""" | |
| from browser_use_sdk.v3 import AsyncBrowserUse | |
| client = AsyncBrowserUse() | |
| # ── Create fresh workspace ───────────────────────────────────── | |
| print("=" * 70) | |
| print("PART 1: SDK CACHING TEST") | |
| print("=" * 70) | |
| workspace = await client.workspaces.create(name="cache-repro-test") | |
| ws_id = str(workspace.id) | |
| print(f" Fresh workspace: {ws_id}") | |
| # ── Run 1: Should build cached script ────────────────────────── | |
| print() | |
| print(" --- Run 1: First call (should build cache) ---") | |
| print(' Task: "Go to @{{https://example.com}}. Extract the page title."') | |
| result1 = await client.run( | |
| "Go to @{{https://example.com}}. Extract the page title. Return as JSON.", | |
| workspace_id=ws_id, | |
| model="gemini-3-flash", | |
| ) | |
| print(f" Session: {result1.id}") | |
| print(f" Steps: {result1.step_count}, LLM: ${result1.llm_cost_usd}, Success: {result1.is_task_successful}") | |
| # ── Check for scripts ────────────────────────────────────────── | |
| files = await client.workspaces.files(ws_id, prefix="scripts/") | |
| print(f" Scripts in workspace: {len(files.files)}") | |
| all_files = await client.workspaces.files(ws_id) | |
| print(f" All files: {[f.path for f in all_files.files]}") | |
| # ── Run 2: Different param, should hit cache ($0 LLM) ───────── | |
| print() | |
| print(" --- Run 2: Different param (should be cached, $0 LLM) ---") | |
| print(' Task: "Go to @{{https://example.org}}. Extract the page title."') | |
| result2 = await client.run( | |
| "Go to @{{https://example.org}}. Extract the page title. Return as JSON.", | |
| workspace_id=ws_id, | |
| model="gemini-3-flash", | |
| ) | |
| print(f" Session: {result2.id}") | |
| print(f" Steps: {result2.step_count}, LLM: ${result2.llm_cost_usd}, Success: {result2.is_task_successful}") | |
| # ── Run 3: Explicit cache_script=True ────────────────────────── | |
| print() | |
| print(" --- Run 3: Explicit cache_script=True ---") | |
| result3 = await client.run( | |
| "Go to @{{https://www.iana.org/domains/reserved}}. Extract the page title. Return as JSON.", | |
| workspace_id=ws_id, | |
| model="gemini-3-flash", | |
| cache_script=True, | |
| ) | |
| print(f" Session: {result3.id}") | |
| print(f" Steps: {result3.step_count}, LLM: ${result3.llm_cost_usd}, Success: {result3.is_task_successful}") | |
| # ── Final workspace check ────────────────────────────────────── | |
| files_final = await client.workspaces.files(ws_id, prefix="scripts/") | |
| all_final = await client.workspaces.files(ws_id) | |
| print() | |
| print(f" Scripts after all 3 runs: {len(files_final.files)}") | |
| print(f" All workspace files: {[f.path for f in all_final.files]}") | |
| # ── Verdict ──────────────────────────────────────────────────── | |
| print() | |
| llm1 = float(result1.llm_cost_usd or 0) | |
| llm2 = float(result2.llm_cost_usd or 0) | |
| llm3 = float(result3.llm_cost_usd or 0) | |
| print(f" Run 1 LLM: ${llm1:.4f} (first run, expected: >$0)") | |
| print(f" Run 2 LLM: ${llm2:.4f} (cached rerun, expected: $0)") | |
| print(f" Run 3 LLM: ${llm3:.4f} (explicit cache_script=True, expected: $0)") | |
| print(f" Cached scripts: {len(files_final.files)} (expected: >=1)") | |
| if len(files_final.files) == 0 and llm2 > 0.001: | |
| print() | |
| print(" *** BUG CONFIRMED: No scripts saved. LLM cost on every run. ***") | |
| return False | |
| else: | |
| print() | |
| print(" Caching appears to be working!") | |
| return True | |
| # ==================================================================== | |
| # PART 2: Pagination bug repro | |
| # ==================================================================== | |
| def test_pagination_bug(): | |
| """Demonstrate the sessions list pagination bug.""" | |
| print() | |
| print("=" * 70) | |
| print("PART 2: SESSIONS LIST PAGINATION BUG") | |
| print("=" * 70) | |
| headers = {"Content-Type": "application/json", "X-Browser-Use-API-Key": API_KEY} | |
| # ── Broken params (pageNumber/pageSize) ──────────────────────── | |
| print() | |
| print(" --- BROKEN: pageNumber/pageSize ---") | |
| ids_broken = set() | |
| total_returned = 0 | |
| for pg in range(1, 4): | |
| url = f"https://api.browser-use.com/api/v3/sessions?pageSize=50&pageNumber={pg}" | |
| req = urllib.request.Request(url, headers=headers) | |
| with urllib.request.urlopen(req, timeout=15) as resp: | |
| data = json.loads(resp.read()) | |
| sessions = data.get("sessions", []) | |
| new_ids = set(s["id"] for s in sessions) - ids_broken | |
| ids_broken.update(s["id"] for s in sessions) | |
| total_returned += len(sessions) | |
| print(f" Page {pg}: {len(sessions)} returned, {len(new_ids)} new unique (total unique: {len(ids_broken)})") | |
| print(f" Total returned across 3 pages: {total_returned}") | |
| print(f" Unique IDs: {len(ids_broken)}") | |
| if total_returned > len(ids_broken) * 1.5: | |
| print(f" *** BUG: {total_returned - len(ids_broken)} duplicate entries across pages ***") | |
| # ── Working params (page/page_size) ──────────────────────────── | |
| print() | |
| print(" --- WORKING: page/page_size ---") | |
| ids_working = set() | |
| total_returned_2 = 0 | |
| for pg in range(1, 4): | |
| url = f"https://api.browser-use.com/api/v3/sessions?page={pg}&page_size=100" | |
| req = urllib.request.Request(url, headers=headers) | |
| with urllib.request.urlopen(req, timeout=15) as resp: | |
| data = json.loads(resp.read()) | |
| sessions = data.get("sessions", []) | |
| total_api = data.get("total", "?") | |
| new_ids = set(s["id"] for s in sessions) - ids_working | |
| ids_working.update(s["id"] for s in sessions) | |
| total_returned_2 += len(sessions) | |
| print(f" Page {pg}: {len(sessions)} returned, {len(new_ids)} new unique (total unique: {len(ids_working)}, api total: {total_api})") | |
| if not sessions: | |
| break | |
| print(f" Total unique sessions: {len(ids_working)}") | |
| # ==================================================================== | |
| # PART 3: Billing check | |
| # ==================================================================== | |
| def check_billing(): | |
| """Show current account balance.""" | |
| print() | |
| print("=" * 70) | |
| print("PART 3: ACCOUNT STATUS") | |
| print("=" * 70) | |
| headers = {"Content-Type": "application/json", "X-Browser-Use-API-Key": API_KEY} | |
| url = "https://api.browser-use.com/api/v3/billing/account" | |
| req = urllib.request.Request(url, headers=headers) | |
| with urllib.request.urlopen(req, timeout=15) as resp: | |
| data = json.loads(resp.read()) | |
| print(f" Plan: {data.get('planInfo', {}).get('planName', '?')}") | |
| print(f" Balance: ${data.get('totalCreditsBalanceUsd', '?')}") | |
| print(f" Project ID: {data.get('projectId', '?')}") | |
| # ==================================================================== | |
| # Main | |
| # ==================================================================== | |
| if __name__ == "__main__": | |
| print("Browser Use v3 — Cache Bug Repro") | |
| print(f"SDK: browser-use-sdk 3.4.2") | |
| print(f"Docs: https://docs.browser-use.com/cloud/agent/cache-script") | |
| print() | |
| cache_works = asyncio.run(test_sdk_caching()) | |
| test_pagination_bug() | |
| check_billing() | |
| print() | |
| print("=" * 70) | |
| if not cache_works: | |
| print("FINAL: Caching is broken. No scripts saved to workspace.") | |
| print("Every session runs full LLM inference regardless of:") | |
| print(" - @{{param}} syntax in task") | |
| print(" - workspace_id provided") | |
| print(" - explicit cache_script=True") | |
| print(" - fresh vs existing workspace") | |
| print(" - SDK vs REST API") | |
| else: | |
| print("FINAL: Caching worked! (If you're seeing this, BU may have fixed it.)") | |
| print("=" * 70) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment