@chrisdavidmiles
Created April 17, 2026 03:57

Verbatim Test Suite Quality Audit

Phase 1 — System Map

What it is: A CLI tool (teams) that extracts Microsoft Teams conversations as markdown/JSON. Single-file Python script (~2000 lines) using Playwright to drive a headless Chromium browser against the Teams web app.

Entry points: One CLI command (teams) with mutually exclusive modes:

  1. teams <url> — Extract a specific conversation
  2. teams --find <query> — Global search across Teams
  3. teams --recent — List sidebar conversations
  4. teams --login — Open browser for manual login

Core domain:

  • URL detection — Classify a Teams URL into one of 5 content types (chat, thread, channel, meeting_chat, transcript)
  • Authentication — Login flow (TOTP, account picker, password), session persistence, mid-extraction session expiry detection and recovery
  • Extraction — Navigate to Teams URL, execute JavaScript in the DOM, scroll virtualized lists, deduplicate, collect messages
  • Search — Drive Teams search UI, intercept Substrate API responses, parse structured results, apply filters
  • Formatting — Convert extracted data to markdown or JSON
  • Caching — URL-keyed JSON cache with refresh/no-cache flags
  • Orchestration — Multi-URL extraction, expand search results, meeting chat → transcript chaining
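
The URL-detection rules can be sketched from the patterns listed above (a hypothetical reconstruction; `detect_url_type`, the exact substrings, and the precedence order are assumptions, not the script's actual code):

```python
from urllib.parse import unquote

def detect_url_type(url: str) -> str:
    # Illustrative patterns only; the real script's rules are more involved.
    u = unquote(url)
    if "recap" in u:                      # meeting recap / transcript pages
        return "transcript"
    if "meeting_" in u:                   # 19:meeting_...@thread.v2
        return "meeting_chat"
    if "@thread.tacv2" in u:              # channel namespace
        return "thread" if "/l/message/" in u else "channel"
    if "@thread.v2" in u and "/l/message/" in u:
        return "chat"                     # direct message deep link
    return "chat"                         # fallback for unrecognized URLs
```

The precedence matters: meeting URLs also contain `@thread.v2`, so the `meeting_` check must run before the generic chat check.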

External dependencies: Microsoft Teams web app (the entire DOM structure), Substrate search API (substrate.office.com), SharePoint iframes (cross-origin transcripts), Microsoft login (login.microsoftonline.com), TOTP (pyotp).

Critical paths (things that, if broken, make the tool useless):

  1. Auth flow — Can't do anything without a valid session
  2. Chat extraction JS payload — Wrong selectors = no messages or corrupt output
  3. Thread extraction JS payload — Same, different DOM
  4. Search API interception and parsing — No URLs in search results without this
  5. URL type detection — Wrong type = wrong extraction function = crash or garbage
  6. Scroll + deduplicate loop — Miss messages or get duplicates in long conversations

CI: GitHub Actions runs unit tests only (no E2E — requires real Teams session). Tests Python 3.12 and 3.13. No coverage thresholds. No retry config. No flaky test markers.

Recent bugs (from git log):

  • 09b21f8 Fix DM extraction — DOM walk was wrong, found wrong author/timestamp. No unit test existed for the JS payload; this was caught manually.
  • 75ce940 Fix person filter — intermediate API responses were polluting results. No unit test covered the filter-then-keep-last-response logic; caught manually.
  • 07e3868 Fix empty-query search — `not args.find` is True for `""`. No unit test; caught during E2E testing.
  • b93c4f7 Fix Chrome extension submodule — irrelevant to current code.

All three real bugs were in code with zero test coverage. The tests that existed at the time would not have caught any of them.


Phase 2 — Behavior Inventory

Domain: URL Detection

| Behavior | Criticality |
| --- | --- |
| Correctly classifies DM chat URLs | High |
| Correctly classifies channel thread URLs (tacv2) | High |
| Correctly classifies meeting recap/transcript URLs | High |
| Correctly classifies meeting chat URLs (meeting_ prefix) | High |
| Correctly classifies channel landing page URLs | High |
| Correctly classifies direct message links (thread.v2 + /l/message/) | High |
| Handles URL-encoded characters in path segments | Medium |
| Falls through gracefully for unrecognized Teams URLs | Medium |

Domain: Authentication

| Behavior | Criticality |
| --- | --- |
| Detects login page and fills email/password | Catastrophic |
| Handles "Pick an account" page | High |
| Enters TOTP code with 35s cooldown | High |
| Detects login errors (bad password, rate limit) | High |
| Detects "Stay signed in?" and clicks through | Medium |
| Detects mid-extraction session expiry (redirect to login) | Catastrophic |
| Detects auth banner (stale MSAL tokens) | High |
| Re-login + resume after session expiry | High |
| Fails gracefully when env vars are missing | Medium |
| Validates login actually completed (not stuck on login page) | High |
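
The TOTP entry delegates to pyotp, but the underlying RFC 6238 computation is small enough to sketch with the standard library (the 35s cooldown is the script's own pacing between attempts, not part of the algorithm):

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, at=None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP over a time-derived counter."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if at is None else at) // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)
```

At time 59 with the RFC 4226 test secret (`12345678901234567890` base32-encoded), this yields "287082", matching the published test vector.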

Domain: Extraction — Chat

| Behavior | Criticality |
| --- | --- |
| Extracts messages with author, timestamp, content from DM DOM | Catastrophic |
| Scrolls virtualized list to load historical messages | High |
| Deduplicates messages by ID across scroll iterations | High |
| Sorts messages by ID (chronological order) | High |
| Stops scrolling after N no-change iterations | Medium |
| Respects --days cutoff for chat messages | High |
| Handles empty chat pane (no messages) | Medium |
| Handles missing author (anonymous/system messages) | Medium |
| Quick mode: extracts visible messages only | Medium |

Domain: Extraction — Channel Thread

| Behavior | Criticality |
| --- | --- |
| Extracts thread messages from channel-replies-viewport | Catastrophic |
| Handles subject line on parent message | Medium |
| Scrolls DOWN to load replies (opposite of chat) | High |
| Accumulates messages across scroll iterations via ID map | High |
| Respects --days cutoff with channel timestamp format | High |

Domain: Extraction — Transcript

| Behavior | Criticality |
| --- | --- |
| Finds correct SharePoint iframe (xplatplugins over streamembed) | High |
| Extracts speaker, timestamp, text from transcript cells | High |
| Scrolls virtualized transcript list | High |
| Clicks Transcript tab if not already selected | Medium |

Domain: Extraction — Meeting Chat

| Behavior | Criticality |
| --- | --- |
| Extracts chat messages first, then checks for Recap tab | High |
| Clicks Recap tab and extracts transcript | High |
| Handles meetings with no recap available | Medium |
| Returns combined dict with messages + transcript | High |

Domain: Extraction — Channel Listing

| Behavior | Criticality |
| --- | --- |
| Lists thread subjects from channel page | Medium |
| Builds thread URLs from channel URL + message IDs | High |
| Scrolls to load more threads | Medium |

Domain: Search

| Behavior | Criticality |
| --- | --- |
| Enters search text and clicks Messages tab | High |
| Intercepts Substrate API responses | Catastrophic |
| Parses API response structure (EntitySets → ResultSets → Results → Source) | Catastrophic |
| Builds Teams deep link URLs from thread_id + message_id | High |
| Deduplicates results by InternetMessageId | High |
| Handles missing fields in API response gracefully | Medium |
| Filters bot MRIs from DisplayTo channel names | Medium |
| Falls back to HitHighlightedSummary when Preview is empty | Medium |
| Applies date filter (preset labels and custom date picker) | High |
| Applies person filter (type name, click suggestion) | High |
| Keeps only last API response after filters are applied | High |
| Scrolls + clicks "Show more" for pagination | Medium |
| Falls back to DOM scraping when API interception fails | Medium |
| Empty query with --from uses person name as search text | High |
| --expand: navigates to result URLs and extracts context | Medium |

Domain: Formatting

| Behavior | Criticality |
| --- | --- |
| Markdown: author header, quoted content, separators | High |
| Markdown: consecutive messages from same author collapse header | Medium |
| Markdown: date change inserts date separator | Low |
| Markdown: subject rendered as ### heading | Low |
| Markdown: empty content messages skipped | Medium |
| Markdown: grep filters by content/author/subject (case-insensitive) | High |
| Markdown: sort newest reverses message order | Medium |
| Markdown: multiline content each line gets > prefix | Medium |
| Markdown: channel thread timestamp format handled | Low |
| Markdown: transcript speaker continuation (same speaker, no repeat header) | Medium |
| Markdown: search results with/without URL, with/without expanded context | Medium |
| Markdown: channel threads list with subjects and URLs | Medium |
| Markdown: recent conversations with type labels and URLs | Medium |
| JSON: messages serialized as JSON array | High |
| JSON: meeting_chat has messages + transcript keys | Medium |
| JSON: unicode preserved | Low |
| JSON: multi-URL output as list of {url, type, data} | Medium |

Domain: Caching

| Behavior | Criticality |
| --- | --- |
| Cache key is stable for same URL (normalized) | High |
| Trailing slash and fragment are stripped for key | Medium |
| Different URLs get different keys | High |
| Round-trip write then read returns same data | High |
| Missing cache file returns None | Medium |
| Corrupt cache file returns None (not crash) | Medium |
| --refresh re-extracts and overwrites cache | Medium |
| --no-cache skips both read and write | Medium |
| --quick mode disables cache reads | Medium |
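
A minimal sketch of the normalization these behaviors imply (function name and hash choice are assumptions; the real script may differ):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def cache_key(url: str) -> str:
    # Strip the fragment and any trailing slash, then hash the rest,
    # so equivalent URLs map to the same cache file.
    parts = urlsplit(url)
    normalized = urlunsplit(
        (parts.scheme, parts.netloc, parts.path.rstrip("/"), parts.query, "")
    )
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]
```

Hashing sidesteps filesystem-unsafe characters in Teams URLs while keeping the key stable across the trailing-slash and fragment variants the tests exercise.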

Domain: Cookie Parsing

| Behavior | Criticality |
| --- | --- |
| Parses Netscape cookie format | Low |
| Strips leading dots from domains | Low |
| Filters to Microsoft domains only | Low |
| Skips comments and blank lines | Low |
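
The four behaviors above fit in a few lines; this sketch assumes the standard seven-field Netscape cookies.txt layout (the domain allowlist shown is illustrative):

```python
def parse_netscape_cookies(text: str) -> list[dict]:
    # Netscape cookies.txt fields, tab-separated:
    # domain, flag, path, secure, expiry, name, value
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            continue
        domain = fields[0].lstrip(".")  # strip leading dots
        if not domain.endswith(("microsoft.com", "microsoftonline.com", "office.com")):
            continue  # keep Microsoft domains only
        cookies.append(
            {"domain": domain, "path": fields[2], "name": fields[5], "value": fields[6]}
        )
    return cookies
```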

Domain: CLI Orchestration

| Behavior | Criticality |
| --- | --- |
| Multiple URLs reuse one browser session | Medium |
| Launcher interstitial ("Use the web app instead") handled | High |
| PWA cache cleared to detect stale auth | High |
| Output to file with -o | Low |
| Debug mode opens visible browser | Low |

Phase 3 — Test Audit: What's Load-Bearing, What's Theater

Summary: 98 unit tests, 32 E2E tests

130 total tests. Of those, 32 E2E tests are skipped in CI (they require a real Teams session and urls.env). That means CI runs 98 tests. Let me examine what those 98 actually test.

test_url_detection.py (10 tests) — Mostly load-bearing

These are solid. Each test provides a realistic URL and asserts the correct type string. They cover the 5 return values and edge cases (URL encoding, query params, fragment paths).

Gap: No test for URLs that don't match any pattern — e.g., https://outlook.office365.com/something or a completely malformed URL. The fallback to "chat" is untested. Minor.

test_cookies.py (4 tests) — Load-bearing but low-criticality

Cookie parsing is a fallback auth method. These tests are solid for what they cover. The code they test is straightforward string splitting.

Verdict: Real tests, but they protect a non-critical code path. Cookie-based auth is secondary to the persistent profile approach.

test_search.py (14 tests) — Load-bearing

All tests for _days_to_date_filter. Good boundary testing (1, 2, 3, 5, 7, 8, 14, 30, 31, 32, 60, 90 days). Each asserts the correct preset label or custom date.

Gap: Only tests the pure mapping function. The actual application of the filter (clicking the dropdown, selecting the option, clicking Apply) is untested at the unit level — and it can't be, since it requires a browser. The E2E tests cover this path when urls.env is present.
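
From the boundaries tested, the mapping presumably looks something like this (a hypothetical reconstruction; the preset labels and thresholds are guesses from the test names, not the real `_days_to_date_filter`):

```python
import datetime

def days_to_date_filter(days: int):
    # Small windows map to Teams' preset filter labels; anything else
    # falls through to a custom cutoff date for the date picker.
    presets = {1: "Today", 2: "Yesterday", 7: "Last week", 30: "Last month"}
    if days in presets:
        return ("preset", presets[days])
    cutoff = datetime.date.today() - datetime.timedelta(days=days)
    return ("custom", cutoff.isoformat())
```

The boundary values in the test list (7 vs. 8, 30 vs. 31) are exactly the points where a preset/custom mapping like this would flip, which is why that boundary testing is valuable.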

test_cache.py (8 tests) — Load-bearing

Good: round-trip, missing file, corrupt file, URL normalization. Uses tempfile.TemporaryDirectory to avoid side effects. The CACHE_DIR monkey-patching is a bit fragile but works.

Minor anti-pattern: test_round_trip (line 37-57) mutates module-level state (teams.CACHE_DIR). A test crash before the finally block would leave global state dirty for later tests in the same process — pytest does not isolate tests into separate processes by default, so a fixture-based reset would be safer.
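
The idiomatic fix is pytest's `monkeypatch` fixture or a context manager, either of which guarantees restoration even when the test body raises. A stdlib sketch of the latter:

```python
import contextlib
import tempfile

@contextlib.contextmanager
def patched_cache_dir(module):
    # Restores module-level CACHE_DIR even if the test body raises,
    # avoiding the dirty-global-state failure mode described above.
    with tempfile.TemporaryDirectory() as tmp:
        old = module.CACHE_DIR
        module.CACHE_DIR = tmp
        try:
            yield tmp
        finally:
            module.CACHE_DIR = old
```

With pytest available, `monkeypatch.setattr(teams, "CACHE_DIR", tmp)` achieves the same guarantee with less code.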

test_markdown.py (58 tests) — Mixed: mostly load-bearing, some shallow

Load-bearing tests:

  • TestMessagesToMarkdown (11 tests): Good behavioral tests. Checks author headers, separators, date changes, grep filtering, sorting, multiline quoting, subject headings, empty content skipping.
  • TestTranscriptToMarkdown (4 tests): Speaker continuation, speaker change, grep filter.
  • TestSearchToMarkdown (5 tests): URL rendering, expanded context with <details>.
  • TestParseSearchApiResults (10 tests): This is the highest-value unit test class. Covers the Substrate API parsing: field extraction, deduplication, missing fields, bot MRI filtering, fallback fields, multi-response merging, missing message ID.
  • TestBuildTeamsUrl (4 tests): URL construction for chat vs. channel vs. meeting threads.

Shallow tests:

  1. TestFormatOutput — mostly a routing test, not a behavior test (test_markdown.py:112-189):

    def test_chat_markdown(self):
        data = [{"author": "Alice", "timestamp": "", "content": "Hello"}]
        output = teams._format_output("chat", data, use_json=False)
        assert "**Alice**" in output
        assert "> Hello" in output

    This test is load-bearing only for the routing logic ("does _format_output('chat', ...) call messages_to_markdown?"). The actual formatting is already tested in TestMessagesToMarkdown. Seven tests that mostly verify the same thing: "does the dispatcher dispatch?" These aren't bad, but they're duplicative.

  2. TestToJson tests are shallow (test_markdown.py:612-658):

    def test_messages_to_json(self):
        output = teams._to_json(messages)
        parsed = json.loads(output)
        assert len(parsed) == 2
        assert parsed[0]["author"] == "Alice"

    _to_json is literally `json.dumps(data, indent=2, ensure_ascii=False)`. Five tests for json.dumps. The test_json_preserves_unicode test is the only one providing non-trivial value; the rest are testing the standard library.

test_e2e.py (32 tests) — Load-bearing but unavailable in CI

These are real E2E tests that call the CLI via subprocess against live Teams. They're the only tests that exercise:

  • Browser lifecycle (launch, navigate, close)
  • Authentication
  • DOM extraction JavaScript payloads
  • Scroll loops
  • Cache behavior end-to-end
  • Search with real API interception
  • Multi-URL extraction
  • Meeting chat → transcript chaining
  • Sidebar listing with filters

Critical problem: All 32 are skipped in CI. They only run on the developer's machine when urls.env and ~/.teams-cli/profile exist. This means CI tests zero percent of the critical paths. The 98 tests that CI runs test formatting, URL parsing, cache key generation, and cookie parsing — all of which are supporting functions, not the core value.

Anti-pattern — Duplicate test name (test_e2e.py:459-476):

def test_recent_filter_channels(self):  # line 459
    ...
def test_recent_filter_channels(self):  # line 468
    ...

Two methods with the same name in TestRecentConversations. Python silently overwrites the first with the second. The first test (line 459) never runs. It's dead code that inflates the count.

Weak assertions in E2E (test_e2e.py:37-41):

def _assert_has_messages(stdout, label="output"):
    assert stdout.strip(), f"{label} is empty"
    assert "**" in stdout, f"No author names found in {label}"
    assert ">" in stdout, f"No message content found in {label}"

This helper is used by most extraction tests. It checks that something bold and something quoted exists in the output. It doesn't verify message count, author accuracy, content completeness, or structural correctness. A regression that corrupts all timestamps, drops half the messages, or garbles author names would pass these assertions. That said, for E2E tests against live data whose content changes, this level of assertion is defensible — you can't assert on specific content.
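
Without asserting on specific content, the helper could still check structure: count the author headers and confirm quoted lines exist, using the markdown contract the unit tests already rely on (bold author header, `> ` content lines). A sketch (`min_messages` and the helper name are hypothetical):

```python
import re

def assert_message_structure(stdout: str, min_messages: int = 1, label: str = "output"):
    # Count author headers and quoted content lines instead of merely
    # checking that "**" and ">" appear somewhere in the output.
    headers = re.findall(r"^\*\*(.+?)\*\*", stdout, flags=re.MULTILINE)
    quoted = re.findall(r"^> .+", stdout, flags=re.MULTILINE)
    assert len(headers) >= min_messages, (
        f"{label}: expected >= {min_messages} messages, found {len(headers)}"
    )
    assert quoted, f"{label}: no quoted content lines"
    assert all(h.strip() for h in headers), f"{label}: empty author name"
```

A regression that drops half the messages would now fail the count check, while the assertions still avoid depending on live message content.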


Phase 4 — Coverage Matrix

| Behavior | Crit. | Unit | Integration | E2E | Notes |
| --- | --- | --- | --- | --- | --- |
| URL detection (all 5 types) | High | Solid | Absent | Absent | 10 good unit tests |
| Auth: login flow | Catastrophic | Absent | Absent | Shallow | E2E tests exercise it implicitly but don't assert on it |
| Auth: mid-extraction expiry | Catastrophic | Absent | Absent | Absent | No test intentionally triggers expiry |
| Auth: auth banner detection | High | Absent | Absent | Absent | Only tested if it happens to occur during E2E run |
| Chat extraction (DOM JS) | Catastrophic | Absent | Absent | Shallow | E2E asserts "has messages", no JS payload unit test |
| Thread extraction (DOM JS) | Catastrophic | Absent | Absent | Shallow | Same — E2E checks for bold + quoted text |
| Transcript extraction (iframe) | High | Absent | Absent | Shallow | E2E only |
| Meeting chat → transcript | High | Shallow | Absent | Shallow | Unit tests cover formatting, not extraction logic |
| Channel listing | Medium | Shallow | Absent | Shallow | Unit tests cover markdown formatting only |
| Scroll + deduplicate loop | High | Absent | Absent | Shallow | E2E test with scrolling exists but weak assertions |
| Search: API interception | Catastrophic | Absent | Absent | Shallow | No unit test for the interception itself |
| Search: API response parsing | Catastrophic | Solid | Absent | Shallow | 10 good unit tests for _parse_search_api_results |
| Search: date filter application | High | Absent | Absent | Shallow | _days_to_date_filter tested, but UI application isn't |
| Search: person filter | High | Absent | Absent | Shallow | E2E test exists |
| Search: empty query + --from | High | Absent | Absent | Shallow | Bug was fixed but no unit test prevents regression |
| Search: pagination/scroll | Medium | Absent | Absent | Absent | No test verifies "Show more" clicking works |
| Expand search results | Medium | Absent | Absent | Shallow | E2E test checks for "context" key |
| Markdown: messages | High | Solid | Absent | Absent | 11 tests covering formatting behaviors |
| Markdown: transcript | Medium | Solid | Absent | Absent | 4 tests |
| Markdown: search results | Medium | Solid | Absent | Absent | 5 tests including expanded context |
| Markdown: channel threads | Medium | Solid | Absent | Absent | 3 tests |
| Markdown: recent convos | Medium | Solid | Absent | Absent | 4 tests |
| JSON output | High | Shallow | Absent | Shallow | Tests json.dumps, not meaningful |
| Cache: key generation | High | Solid | Absent | Absent | 5 tests with normalization edge cases |
| Cache: read/write | High | Solid | Absent | Shallow | 3 unit + 3 E2E tests |
| Cache: --refresh / --no-cache | Medium | Absent | Absent | Shallow | E2E only |
| Cookie parsing | Low | Solid | Absent | Absent | 4 tests |
| URL construction (deep links) | High | Solid | Absent | Absent | build_teams_url + _build_thread_url |
| Multi-URL extraction | Medium | Absent | Absent | Shallow | 2 E2E tests |
| Launcher interstitial | High | Absent | Absent | Implicit | Happens during navigation, not explicitly tested |
| PWA cache clearing | High | Absent | Absent | Absent | CDP call, no test at any level |
| _format_output routing | Medium | Shallow | Absent | Absent | Tests dispatcher, not behavior |
| --days cutoff in extraction | High | Absent | Absent | Absent | Date filtering in extract_chat/extract_thread untested |
| Sidebar filter buttons (CSS overflow workaround) | Medium | Absent | Absent | Shallow | E2E tests one filter |
| list_recent thread ID → URL | Medium | Solid | Absent | Shallow | _chat_url_from_thread_id unit tested |

Phase 5 — Structural Problems

1. All critical paths are untestable in CI

The entire extraction pipeline — everything that touches a browser — runs only locally. CI validates formatting and URL parsing. If someone breaks CHAT_EXTRACT_JS, CHANNEL_COLLECT_JS, the scroll loops, or the auth flow, CI passes green.

This isn't fixable without a mock Teams server or recorded browser sessions (Playwright's `route` interception or `route_from_har` replay). The system's dependency on a live Teams session with real data makes traditional E2E testing in CI impractical. This is an architectural constraint, not a laziness problem.

What would help: Playwright HAR recording of successful extractions, replayed in CI. This would cover navigation, DOM extraction, and API interception without a live session. It's non-trivial but it's the only way to get CI coverage of the critical paths.
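
Short of full replay, even one recorded HAR file is useful as a fixture source, since HAR is plain JSON: pull the recorded Substrate response bodies and feed them to the parser tests (field layout follows the HAR format; the helper name is hypothetical):

```python
import json

def substrate_bodies_from_har(har_path: str) -> list[dict]:
    # A HAR file is JSON: log.entries[] with request/response pairs.
    # Extract the bodies of recorded Substrate search responses so they
    # can serve as fixtures for _parse_search_api_results-style tests.
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    bodies = []
    for entry in har["log"]["entries"]:
        if "substrate.office.com" in entry["request"]["url"]:
            text = entry["response"]["content"].get("text")
            if text:
                bodies.append(json.loads(text))
    return bodies
```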

2. JavaScript payloads have zero unit testing

Six JS payloads (CHAT_EXTRACT_JS, CHANNEL_COLLECT_JS, TRANSCRIPT_COLLECT_JS, SEARCH_EXTRACT_JS, RECENT_CHATS_JS, CHANNEL_THREADS_JS) are the core extraction logic. They're inline strings in Python. There are no tests that verify their behavior against sample DOM structures.

These payloads have already caused production bugs (09b21f8 — DM extraction walking up to wrong container). They're the most fragile part of the system because Microsoft can change the Teams DOM at any time.

What would help: jsdom or a lightweight DOM test harness that feeds sample HTML fragments to these JS functions and verifies the output structure. This would at least catch regressions in the JS logic itself (not selector changes, which require live DOM).
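
A cheaper variant needs no JS runtime at all: save a known-good DOM fragment as an HTML fixture and assert that the hooks the payloads select on still appear in it, so a fixture refresh flags selector drift. A sketch using only the standard library (`data-tid` is a placeholder, not the real selector):

```python
from html.parser import HTMLParser

class SelectorProbe(HTMLParser):
    # Collects attribute values so a test can assert the hooks the JS
    # payloads rely on are still present in a saved DOM fixture.
    def __init__(self):
        super().__init__()
        self.testids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-tid":  # placeholder attribute name
                self.testids.add(value)

def assert_selectors_present(html: str, required: set):
    probe = SelectorProbe()
    probe.feed(html)
    missing = required - probe.testids
    assert not missing, f"fixture no longer contains: {sorted(missing)}"
```

This doesn't catch live DOM changes, but it does catch the fixture and the payloads drifting apart silently.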

3. No contract testing of the Substrate API

_parse_search_api_results has good unit tests with hand-crafted JSON. But if Substrate changes its response schema (renames EntitySets, changes Source structure), nothing catches it until a live search fails. A recorded API response snapshot stored as a fixture would serve as a contract test.
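
The contract test can be as simple as walking the recorded fixture along the nesting named above and failing loudly on drift (the inline `fixture` stands in for a recorded response file):

```python
def assert_substrate_shape(response: dict):
    # Walk EntitySets -> ResultSets -> Results -> Source; any rename in
    # the schema raises KeyError or AssertionError instead of silently
    # producing empty search results.
    for entity_set in response["EntitySets"]:
        for result_set in entity_set["ResultSets"]:
            for result in result_set["Results"]:
                assert "Source" in result, "Substrate schema drift: missing Source"

# Stand-in for a recorded response stored as a test fixture.
fixture = {"EntitySets": [{"ResultSets": [{"Results": [{"Source": {"Preview": "hi"}}]}]}]}
assert_substrate_shape(fixture)
```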

4. No test fixtures for realistic/adversarial data

All unit test data uses clean fixtures:

{"author": "Alice", "timestamp": "2026-01-01", "content": "Hello"}

Real Teams messages have:

  • Empty author (system messages, bots)
  • HTML entities in content (already stripped by innerText, but the test doesn't verify this assumption)
  • Unicode/emoji in author names and content
  • Extremely long messages (paste of a log file)
  • Messages with only attachments (no text content)
  • Nested > quotes (markdown-in-markdown)
  • Newlines within author names (Teams sometimes renders "Name\nPronouns")

The test_json_preserves_unicode test is the only nod toward non-ASCII data.
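
A fixture list covering these cases costs little to maintain. The sketch below pairs it with a minimal stand-in for the quoting step so the invariant (every content line gets a `> ` prefix) is checkable; the real formatter would be substituted in:

```python
# Adversarial message fixtures mirroring the cases listed above.
ADVERSARIAL_MESSAGES = [
    {"author": "", "timestamp": "2026-01-01", "content": "joined the chat"},       # system message
    {"author": "Ana 🚀", "timestamp": "2026-01-01", "content": "emoji ✅ café"},    # unicode/emoji
    {"author": "Name\nshe/her", "timestamp": "2026-01-01", "content": "hi"},       # newline in author
    {"author": "Bob", "timestamp": "2026-01-01", "content": "> quoted\n> reply"},  # nested quotes
    {"author": "Bob", "timestamp": "2026-01-01", "content": ""},                   # attachment-only
    {"author": "Bob", "timestamp": "2026-01-01", "content": "x" * 50_000},         # pasted log file
]

def quote(content: str) -> str:
    # Minimal stand-in for the markdown quoting step: every line prefixed.
    return "\n".join("> " + line for line in content.splitlines())

for msg in ADVERSARIAL_MESSAGES:
    quoted = quote(msg["content"])
    assert all(line.startswith("> ") for line in quoted.splitlines())
```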

5. Duplicate test method name

TestRecentConversations.test_recent_filter_channels appears twice (test_e2e.py:459 and :468). The first one is silently overwritten by Python. This should be a lint error.


Phase 6 — The Honest Smell Report

If 30% of the tests were deleted at random, would the remaining suite still catch most real regressions?

Yes. The suite is not bloated — it's lopsided. The 98 CI tests are concentrated in formatting and pure functions. Deleting 30% would remove some boundary tests for _days_to_date_filter or some of the _to_json tests, neither of which would meaningfully reduce regression detection. The formatting tests are solid but they're protecting the least-likely-to-break part of the system.

Were the last bugs in tested or untested code?

All three real bugs (DM extraction, person filter, empty query) were in completely untested code. The tests that existed at the time provided zero signal about these failures. The test suite was green while the tool was broken.

Ratio of behavior tests to implementation-detail tests?

Roughly 85/15. Most tests do assert on behavioral outputs (given this input, does the output contain these strings?). The TestFormatOutput class is the main offender — it's testing internal dispatch, not user-facing behavior. The TestToJson class is testing json.dumps.

Could a new engineer learn what the system does from the tests?

Partially. The test file names and class names are descriptive. The unit tests show what data goes in and what markdown comes out, which teaches the formatting contract. But the tests provide zero insight into how extraction works, what the auth flow looks like, or what the search interaction involves — because none of that is unit-tested.


The Headline

The suite has two layers:

  1. Formatting and pure functions — Well-tested. 98 tests in CI. Would catch regressions in markdown output, URL parsing, API response parsing, cache key generation, and date filter mapping.

  2. Everything that makes the tool work — Untested in CI. Auth, browser navigation, DOM extraction, scrolling, API interception, filter application, session recovery. 32 E2E tests exist but only run on one developer's machine.

The suite's honest batting average: every real production bug so far was in the untested layer. The tested layer has never been the source of a bug, because it's straightforward formatting code.

That doesn't mean the formatting tests are worthless — they'd catch a refactor regression. But the test suite's coverage is inverse to the system's risk profile. The most complex, most fragile, most DOM-dependent code has no automated safety net in CI.
