@foklepoint
Last active April 6, 2026 07:15
Browser Use v3 deterministic rerun caching bug repro — scripts never saved to workspace, LLM cost on every run
#!/usr/bin/env python3
"""
Browser Use v3 — Deterministic Rerun Caching Bug Report + Repro
================================================================
Filed by: Saurabh Sharma (foklepoint)
Date: 2026-04-06
Agent: Claude Opus 4.6 (via Claude Code CLI) — this entire investigation
was done by an AI agent debugging a $65 bill from one day of usage.
TL;DR
-----
Deterministic rerun caching (the @{{param}} feature) does not work.
workspace.files(prefix="scripts/") is always empty after runs. Every
session runs full LLM inference. This cost us $46 in LLM charges across
119 sessions in a single day that should have been ~$0 after the first
few template-building runs.
Tested: SDK 3.4.2, REST API, their exact HackerNews docs example, fresh
workspaces, explicit cache_script=True — no combination produces cached
scripts. Also found a pagination bug on the sessions list endpoint.
WHAT WE WERE BUILDING
---------------------
A LinkedIn outreach automation for a conference (Stanford FutureLaw).
The script (linkedin.py) wraps Browser Use sessions as CLI commands:
python linkedin.py visit-profile https://linkedin.com/in/someone/
python linkedin.py like-post https://linkedin.com/feed/update/urn:li:activity:123/
python linkedin.py send-dm https://linkedin.com/in/someone/ "Hey..."
Each command creates a BU session with @{{param}} template syntax, so
the first run builds a cached script and subsequent runs replay it at
$0 LLM cost. That's the theory from the docs. In practice, no scripts
are ever saved and every run is full LLM inference.
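A minimal sketch of how such a wrapper can map CLI commands to templated tasks (the names below are hypothetical; linkedin.py itself is not included in this gist):

```python
# Hypothetical command -> task-template table for a wrapper like
# linkedin.py. The literal @{{...}} brackets are what Browser Use's
# auto-detect is documented to look for; the wrapper substitutes the
# CLI argument inside them.
TEMPLATES = {
    "visit-profile": "Go to @{{url}}. Confirm the profile loaded.",
    "like-post": "Open @{{url}} and like the post.",
}

def build_task(command: str, url: str) -> str:
    # Keep the @{{...}} wrapper intact so every run of a given
    # command shares one cacheable template shape.
    return TEMPLATES[command].replace("@{{url}}", "@{{" + url + "}}")

print(build_task("visit-profile", "https://linkedin.com/in/someone/"))
# Go to @{{https://linkedin.com/in/someone/}}. Confirm the profile loaded.
```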
INVESTIGATION TIMELINE (2026-04-06)
------------------------------------
1. User noticed $65 BU bill from one day. Asked agent to investigate.
2. Agent queried BU sessions API. Hit pagination bug first — used
pageNumber/pageSize params (from OpenAPI spec), got 580 "sessions"
that were actually 20 unique IDs repeated 29x each. Wasted ~30 min
on wrong analysis before discovering correct params are page/page_size.
3. With correct pagination: 119 real sessions, $46.02 LLM inference.
Broke down by source:
- Script building/testing (2 AM): 16 sessions, $7.24
- Discovery cron (5 AM): 45 sessions, $17.00
- Engagement cron (8 AM): 48 sessions, $18.86
- Engagement cron run 2 (4 PM): 10 sessions, $3.23
4. Noticed linkedin.py used f-string @{{{var}}} which produces @{value}
(single braces) instead of @{{value}} (double braces). Fixed it.
But caching still didn't work.
5. Tested with correct @{{value}} syntax via REST API:
- visit-profile with cacheScript=true, 3 runs with different URLs
- Run 1: $0.096 LLM, 15 steps (expected: build cache)
- Run 2: $0.044 LLM, 8 steps (expected: $0, cached)
- Run 3: $0.055 LLM, 10 steps (expected: $0, cached)
- workspace.files(prefix="scripts/"): EMPTY
6. Tested with their exact HackerNews docs example via REST API:
- Run 1: $0.005 LLM, success=True
- Run 2: $0.002 LLM, success=False
- scripts/: EMPTY
7. Installed browser-use-sdk 3.4.2, switched to official SDK.
Tested exact docs example with SDK:
- Fresh workspace created via client.workspaces.create()
- Run 1: $0.020 LLM, 3 steps, success=True
- Run 2: $0.005 LLM, 1 step, success=None
- await client.workspaces.files(ws_id, prefix="scripts/"): EMPTY
- All workspace files: just output JSON files, no scripts/
8. Tried existing workspace, explicit cache_script=True, different
param values — all the same. Zero scripts ever saved.
9. Conclusion: caching feature does not work via SDK or REST API.
The workspace never receives scripts/ files regardless of config.
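The f-string brace bug from step 4 reproduces in isolation: Python f-strings collapse doubled braces into literals, so emitting a literal @{{...}} wrapper around an interpolated value takes four braces on each side.

```python
# Python f-strings treat {{ as an escaped literal "{", so the
# "obvious" three-brace form yields single braces around the value.
url = "https://linkedin.com/in/someone/"

broken = f"@{{{url}}}"      # {{ -> "{", {url} -> value, }} -> "}"
fixed = f"@{{{{{url}}}}}"   # {{{{ -> "{{", {url} -> value, }}}} -> "}}"

print(broken)  # @{https://linkedin.com/in/someone/}
print(fixed)   # @{{https://linkedin.com/in/someone/}}
```

Plain concatenation (`"@{{" + url + "}}"`) sidesteps the brace counting entirely.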
FULL SESSION IDs FROM OUR TESTS
--------------------------------
All on 2026-04-06, model=gemini-3-flash unless noted.
SDK tests (fresh workspace 34e0cb2a-151f-48e3-975a-1821347b96bc):
HN run 1: LLM=$0.0198, steps=3, success=True — no script saved
HN run 2: LLM=$0.0053, steps=1, success=None — no script saved
SDK tests (existing workspace 6eb952a1-f281-4ae4-bfd7-1da580f7cd34):
HN run 1: LLM=$0.0064, steps=2, success=True — no script saved
HN run 2: LLM=$0.0060, steps=1, success=None — no script saved
REST API tests (existing workspace 6eb952a1):
HN run 1: LLM=$0.0047, steps=1, success=True — no script saved
HN run 2: LLM=$0.0023, steps=1, success=False — no script saved
REST API LinkedIn visit-profile tests (cacheScript=true):
Session 8192a428 (Roland Vogl): LLM=$0.0962, 15 steps
Session e60e247e (Simon Agar): LLM=$0.0436, 8 steps <- should be $0
Session 23e22cba (Pablo Arredondo): LLM=$0.0548, 10 steps <- should be $0
Scripts in workspace: 0
REST API LinkedIn visit-profile tests (@{{}} auto-detect, no explicit cacheScript):
Session a68787ed (Dazza Greenwood): LLM=$0.0886, 13 steps
Session 164331d5 (Hugh Carlson): LLM=$0.0313, 4 steps <- should be $0
Scripts in workspace: 0
REST API test without @{{}} (broken f-string, single braces):
Session 68e0af46 (Dazza Greenwood): LLM=$0.0171, 4 steps
(This was our original bug — f-string produced @{value} not @{{value}})
SDK visit-profile test (confirmed working otherwise):
Session via linkedin.py: LLM=$0.0107, 3 steps, success=True, correct JSON output
(SDK works fine for running tasks, just no caching)
SECONDARY BUG: Sessions List Pagination
----------------------------------------
The /api/v3/sessions endpoint has two pagination interfaces. One is broken.
BROKEN (but accepted without error):
GET /api/v3/sessions?pageSize=50&pageNumber=1 -> returns 20 sessions
GET /api/v3/sessions?pageSize=50&pageNumber=2 -> returns SAME 20 sessions
GET /api/v3/sessions?pageSize=50&pageNumber=3 -> returns SAME 20 sessions
...
Every page returns the same 20 IDs. Looping until empty produces
infinite pages. We got 580 "sessions" (20 unique x 29 pages before
we stopped) which led to wildly wrong cost analysis.
The pageSize param is also ignored — always returns 20 regardless
of the value passed.
WORKING (but not in OpenAPI spec):
GET /api/v3/sessions?page=1&page_size=100 -> returns up to 100 unique sessions
GET /api/v3/sessions?page=2&page_size=100 -> returns remaining sessions
Response includes {"total": 119} which is correct.
The OpenAPI spec at /api/v3/openapi.json documents pageNumber/pageSize,
but those params silently produce the wrong results shown above; the
page/page_size params that actually work are undocumented.
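A defensive client-side workaround (a sketch, not an official helper; it only assumes each page is a list of session dicts with an "id" key, as observed above) is to deduplicate by session ID and stop as soon as a page contributes nothing new. This terminates cleanly even against the broken pageNumber/pageSize behavior, which would otherwise loop forever.

```python
def fetch_all_sessions(fetch_page, max_pages=100):
    """Paginate defensively: stop when a page adds no new session IDs.

    fetch_page(page) must return a list of session dicts with an "id"
    key (e.g. the "sessions" array from GET /api/v3/sessions).
    """
    seen = {}
    for page in range(1, max_pages + 1):
        sessions = fetch_page(page)
        new = [s for s in sessions if s["id"] not in seen]
        for s in new:
            seen[s["id"]] = s
        # An empty page OR a page of pure repeats means we're done.
        if not new:
            break
    return list(seen.values())

# Demo: a stub that mimics the broken endpoint (every page returns
# the same 20 sessions). The loop stops after the first repeat page.
stuck_page = lambda page: [{"id": i} for i in range(20)]
print(len(fetch_all_sessions(stuck_page)))  # 20
```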
COST IMPACT
-----------
Our production usage on 2026-04-05 (one day):
119 real BU sessions for LinkedIn outreach automation.
All on bu-ultra (claude-opus-4.6) — we've since switched to gemini-3-flash.
Total BU bill: $65.05
- BU Agent LLM Inference: $46.02 (70.7%) <- would be ~$0 if caching worked
- Skill Creation: $14.00 (21.5%) <- warm-cache attempts
- Proxy Data: $10.60 (16.3%)
- Browser Sessions: $0.31 (0.5%)
- Skill Execution: $0.12 (0.2%)
- Skill Creation Refund: -$6.00 (-9.2%)
Most expensive individual sessions (from dashboard):
"Connect with Roland Vogl": $2.25 (19m 29s!)
"Extract LinkedIn comments and reactions": $1.52
"Connect with Julie Chapman": $1.39
"Connect with Max Junestrand": $1.24
"Extract LinkedIn messages": $1.22 (35m 49s!)
"Message Anna Podolskaya": $1.04
"LinkedIn people search": $1.05
If caching had worked, the $46 LLM line would be ~$2-5 (first runs
only for ~14 unique templates). The remaining 100+ sessions would
replay cached scripts at $0 LLM.
ACCOUNT DETAILS
---------------
Account plan: subscription_50 ($50/mo, active)
Project ID: 371e993f-098f-4ce6-8735-97d7345f61b0
Profile ID: 1e1993bc-ca24-4acb-8df0-998cc7a273cb
Workspace IDs tested:
- 6eb952a1-f281-4ae4-bfd7-1da580f7cd34 (existing "My Files")
- 34e0cb2a-151f-48e3-975a-1821347b96bc (fresh "cache-test")
SDK: browser-use-sdk 3.4.2
Python: 3.13.7
OS: macOS Darwin 25.0.0
DOCS REFERENCED
---------------
- https://docs.browser-use.com/cloud/agent/cache-script
"Second call — cached script, different param ($0 LLM, ~5s)"
"No agent, no LLM."
- https://docs.browser-use.com/cloud/agent/quickstart
SDK usage examples
- OpenAPI spec at /api/v3/openapi.json
RunTaskRequest schema shows cacheScript field:
"null (default): auto-detected — enabled when the task contains
@{{value}} brackets and a workspace is attached."
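Per the auto-detect rule quoted above, the client side of the contract is just embedding @{{value}} brackets in the task. A quick local check for whether a task string should trigger auto-detection (a sketch; the server's actual detection logic is not public) can be done with a regex:

```python
import re

# Matches @{{...}} template parameters; non-greedy so multiple
# params in one task are captured separately.
PARAM_RE = re.compile(r"@\{\{(.+?)\}\}")

def template_params(task: str) -> list[str]:
    """Return the @{{...}} parameter values embedded in a task string."""
    return PARAM_RE.findall(task)

task = "Go to @{{https://example.com}}. Extract the page title."
print(template_params(task))  # ['https://example.com']
```

Note that single-brace @{value} output from the broken f-string does not match, which is why the original linkedin.py runs never qualified for auto-detection in the first place.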
HOW TO RUN THIS REPRO
---------------------
pip install browser-use-sdk
export BROWSER_USE_API_KEY=bu_your_key
python bu-cache-repro.py
Cost: ~$0.05-0.10 total (4 cheap gemini-3-flash sessions on example.com)
No LinkedIn or auth needed — uses example.com only.
Creates a fresh workspace, runs 3 sessions, checks for cached scripts.
"""
import asyncio
import os
import json
import urllib.request
# ── Check environment ──────────────────────────────────────────────
API_KEY = os.environ.get("BROWSER_USE_API_KEY")
if not API_KEY:
    print("ERROR: Set BROWSER_USE_API_KEY environment variable")
    print(" export BROWSER_USE_API_KEY=bu_your_key")
    raise SystemExit(1)
# ====================================================================
# PART 1: SDK-based repro (recommended — uses official browser-use-sdk)
# ====================================================================
async def test_sdk_caching():
    """Test caching via the official Python SDK."""
    from browser_use_sdk.v3 import AsyncBrowserUse
    client = AsyncBrowserUse()
    # ── Create fresh workspace ─────────────────────────────────────
    print("=" * 70)
    print("PART 1: SDK CACHING TEST")
    print("=" * 70)
    workspace = await client.workspaces.create(name="cache-repro-test")
    ws_id = str(workspace.id)
    print(f" Fresh workspace: {ws_id}")
    # ── Run 1: Should build cached script ──────────────────────────
    print()
    print(" --- Run 1: First call (should build cache) ---")
    print(' Task: "Go to @{{https://example.com}}. Extract the page title."')
    result1 = await client.run(
        "Go to @{{https://example.com}}. Extract the page title. Return as JSON.",
        workspace_id=ws_id,
        model="gemini-3-flash",
    )
    print(f" Session: {result1.id}")
    print(f" Steps: {result1.step_count}, LLM: ${result1.llm_cost_usd}, Success: {result1.is_task_successful}")
    # ── Check for scripts ──────────────────────────────────────────
    files = await client.workspaces.files(ws_id, prefix="scripts/")
    print(f" Scripts in workspace: {len(files.files)}")
    all_files = await client.workspaces.files(ws_id)
    print(f" All files: {[f.path for f in all_files.files]}")
    # ── Run 2: Different param, should hit cache ($0 LLM) ─────────
    print()
    print(" --- Run 2: Different param (should be cached, $0 LLM) ---")
    print(' Task: "Go to @{{https://example.org}}. Extract the page title."')
    result2 = await client.run(
        "Go to @{{https://example.org}}. Extract the page title. Return as JSON.",
        workspace_id=ws_id,
        model="gemini-3-flash",
    )
    print(f" Session: {result2.id}")
    print(f" Steps: {result2.step_count}, LLM: ${result2.llm_cost_usd}, Success: {result2.is_task_successful}")
    # ── Run 3: Explicit cache_script=True ──────────────────────────
    print()
    print(" --- Run 3: Explicit cache_script=True ---")
    result3 = await client.run(
        "Go to @{{https://www.iana.org/domains/reserved}}. Extract the page title. Return as JSON.",
        workspace_id=ws_id,
        model="gemini-3-flash",
        cache_script=True,
    )
    print(f" Session: {result3.id}")
    print(f" Steps: {result3.step_count}, LLM: ${result3.llm_cost_usd}, Success: {result3.is_task_successful}")
    # ── Final workspace check ──────────────────────────────────────
    files_final = await client.workspaces.files(ws_id, prefix="scripts/")
    all_final = await client.workspaces.files(ws_id)
    print()
    print(f" Scripts after all 3 runs: {len(files_final.files)}")
    print(f" All workspace files: {[f.path for f in all_final.files]}")
    # ── Verdict ────────────────────────────────────────────────────
    print()
    llm1 = float(result1.llm_cost_usd or 0)
    llm2 = float(result2.llm_cost_usd or 0)
    llm3 = float(result3.llm_cost_usd or 0)
    print(f" Run 1 LLM: ${llm1:.4f} (first run, expected: >$0)")
    print(f" Run 2 LLM: ${llm2:.4f} (cached rerun, expected: $0)")
    print(f" Run 3 LLM: ${llm3:.4f} (explicit cache_script=True, expected: $0)")
    print(f" Cached scripts: {len(files_final.files)} (expected: >=1)")
    if len(files_final.files) == 0 and llm2 > 0.001:
        print()
        print(" *** BUG CONFIRMED: No scripts saved. LLM cost on every run. ***")
        return False
    else:
        print()
        print(" Caching appears to be working!")
        return True
# ====================================================================
# PART 2: Pagination bug repro
# ====================================================================
def test_pagination_bug():
    """Demonstrate the sessions list pagination bug."""
    print()
    print("=" * 70)
    print("PART 2: SESSIONS LIST PAGINATION BUG")
    print("=" * 70)
    headers = {"Content-Type": "application/json", "X-Browser-Use-API-Key": API_KEY}
    # ── Broken params (pageNumber/pageSize) ────────────────────────
    print()
    print(" --- BROKEN: pageNumber/pageSize ---")
    ids_broken = set()
    total_returned = 0
    for pg in range(1, 4):
        url = f"https://api.browser-use.com/api/v3/sessions?pageSize=50&pageNumber={pg}"
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=15) as resp:
            data = json.loads(resp.read())
        sessions = data.get("sessions", [])
        new_ids = set(s["id"] for s in sessions) - ids_broken
        ids_broken.update(s["id"] for s in sessions)
        total_returned += len(sessions)
        print(f" Page {pg}: {len(sessions)} returned, {len(new_ids)} new unique (total unique: {len(ids_broken)})")
    print(f" Total returned across 3 pages: {total_returned}")
    print(f" Unique IDs: {len(ids_broken)}")
    if total_returned > len(ids_broken) * 1.5:
        print(f" *** BUG: {total_returned - len(ids_broken)} duplicate entries across pages ***")
    # ── Working params (page/page_size) ────────────────────────────
    print()
    print(" --- WORKING: page/page_size ---")
    ids_working = set()
    total_returned_2 = 0
    for pg in range(1, 4):
        url = f"https://api.browser-use.com/api/v3/sessions?page={pg}&page_size=100"
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=15) as resp:
            data = json.loads(resp.read())
        sessions = data.get("sessions", [])
        total_api = data.get("total", "?")
        new_ids = set(s["id"] for s in sessions) - ids_working
        ids_working.update(s["id"] for s in sessions)
        total_returned_2 += len(sessions)
        print(f" Page {pg}: {len(sessions)} returned, {len(new_ids)} new unique (total unique: {len(ids_working)}, api total: {total_api})")
        if not sessions:
            break
    print(f" Total unique sessions: {len(ids_working)}")
# ====================================================================
# PART 3: Billing check
# ====================================================================
def check_billing():
    """Show current account balance."""
    print()
    print("=" * 70)
    print("PART 3: ACCOUNT STATUS")
    print("=" * 70)
    headers = {"Content-Type": "application/json", "X-Browser-Use-API-Key": API_KEY}
    url = "https://api.browser-use.com/api/v3/billing/account"
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=15) as resp:
        data = json.loads(resp.read())
    print(f" Plan: {data.get('planInfo', {}).get('planName', '?')}")
    print(f" Balance: ${data.get('totalCreditsBalanceUsd', '?')}")
    print(f" Project ID: {data.get('projectId', '?')}")
# ====================================================================
# Main
# ====================================================================
if __name__ == "__main__":
    print("Browser Use v3 — Cache Bug Repro")
    print("SDK: browser-use-sdk 3.4.2")
    print("Docs: https://docs.browser-use.com/cloud/agent/cache-script")
    print()
    cache_works = asyncio.run(test_sdk_caching())
    test_pagination_bug()
    check_billing()
    print()
    print("=" * 70)
    if not cache_works:
        print("FINAL: Caching is broken. No scripts saved to workspace.")
        print("Every session runs full LLM inference regardless of:")
        print(" - @{{param}} syntax in task")
        print(" - workspace_id provided")
        print(" - explicit cache_script=True")
        print(" - fresh vs existing workspace")
        print(" - SDK vs REST API")
    else:
        print("FINAL: Caching worked! (If you're seeing this, BU may have fixed it.)")
    print("=" * 70)