@chrisdavidmiles
Created April 17, 2026 03:57

Verbatim Test Suite Quality Audit

Phase 1 — System Map

What it is: A CLI tool (teams) that extracts Microsoft Teams conversations as markdown/JSON. Single-file Python script (~2000 lines) using Playwright to drive a headless Chromium browser against the Teams web app.

Entry points: One CLI command (teams) with mutually exclusive modes:

  1. teams <url> — Extract a specific conversation
  2. teams --find <query> — Global search across Teams
  3. teams --recent — List sidebar conversations
  4. teams --login — Open browser for manual login

Core domain:

  • URL detection — Classify a Teams URL into one of 5 content types (chat, thread, channel, meeting_chat, transcript)
  • Authentication — Login flow (TOTP, account picker, password), session persistence, mid-extraction session expiry detection and recovery
  • Extraction — Navigate to Teams URL, execute JavaScript in the DOM, scroll virtualized lists, deduplicate, collect messages
  • Search — Drive Teams search UI, intercept Substrate API responses, parse structured results, apply filters
  • Formatting — Convert extracted data to markdown or JSON
  • Caching — URL-keyed JSON cache with refresh/no-cache flags
  • Orchestration — Multi-URL extraction, expand search results, meeting chat → transcript chaining
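
The URL-detection rules can be sketched from the patterns listed above (a hypothetical reconstruction; `detect_url_type`, the exact substrings, and the precedence order are assumptions, not the script's actual code):

```python
from urllib.parse import unquote

def detect_url_type(url: str) -> str:
    # Illustrative patterns only; the real script's rules are more involved.
    u = unquote(url)
    if "recap" in u:                      # meeting recap / transcript pages
        return "transcript"
    if "meeting_" in u:                   # 19:meeting_...@thread.v2
        return "meeting_chat"
    if "@thread.tacv2" in u:              # channel namespace
        return "thread" if "/l/message/" in u else "channel"
    if "@thread.v2" in u and "/l/message/" in u:
        return "chat"                     # direct message deep link
    return "chat"                         # fallback for unrecognized URLs
```

The precedence matters: meeting URLs also contain `@thread.v2`, so the `meeting_` check must run before the generic chat check.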

External dependencies: Microsoft Teams web app (the entire DOM structure), Substrate search API (substrate.office.com), SharePoint iframes (cross-origin transcripts), Microsoft login (login.microsoftonline.com), TOTP (pyotp).

Critical paths (things that, if broken, make the tool useless):

  1. Auth flow — Can't do anything without a valid session
  2. Chat extraction JS payload — Wrong selectors = no messages or corrupt output
  3. Thread extraction JS payload — Same, different DOM
  4. Search API interception and parsing — No URLs in search results without this
  5. URL type detection — Wrong type = wrong extraction function = crash or garbage
  6. Scroll + deduplicate loop — Miss messages or get duplicates in long conversations

CI: GitHub Actions runs unit tests only (no E2E — requires real Teams session). Tests Python 3.12 and 3.13. No coverage thresholds. No retry config. No flaky test markers.

Recent bugs (from git log):

  • 09b21f8 Fix DM extraction — DOM walk was wrong, found wrong author/timestamp. No unit test existed for the JS payload; this was caught manually.
  • 75ce940 Fix person filter — intermediate API responses were polluting results. No unit test covered the filter-then-keep-last-response logic; caught manually.
  • 07e3868 Fix empty-query search — `not args.find` is True for `""`. No unit test; caught during E2E testing.
  • b93c4f7 Fix Chrome extension submodule — irrelevant to current code.

All three real bugs were in code with zero test coverage. The tests that existed at the time would not have caught any of them.


Phase 2 — Behavior Inventory

Domain: URL Detection

| Behavior | Criticality |
| --- | --- |
| Correctly classifies DM chat URLs | High |
| Correctly classifies channel thread URLs (tacv2) | High |
| Correctly classifies meeting recap/transcript URLs | High |
| Correctly classifies meeting chat URLs (meeting_ prefix) | High |
| Correctly classifies channel landing page URLs | High |
| Correctly classifies direct message links (thread.v2 + /l/message/) | High |
| Handles URL-encoded characters in path segments | Medium |
| Falls through gracefully for unrecognized Teams URLs | Medium |

Domain: Authentication

| Behavior | Criticality |
| --- | --- |
| Detects login page and fills email/password | Catastrophic |
| Handles "Pick an account" page | High |
| Enters TOTP code with 35s cooldown | High |
| Detects login errors (bad password, rate limit) | High |
| Detects "Stay signed in?" and clicks through | Medium |
| Detects mid-extraction session expiry (redirect to login) | Catastrophic |
| Detects auth banner (stale MSAL tokens) | High |
| Re-login + resume after session expiry | High |
| Fails gracefully when env vars are missing | Medium |
| Validates login actually completed (not stuck on login page) | High |
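
The TOTP entry delegates to pyotp, but the underlying RFC 6238 computation is small enough to sketch with the standard library (the 35s cooldown is the script's own pacing between attempts, not part of the algorithm):

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, at=None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP over a time-derived counter."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if at is None else at) // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)
```

At time 59 with the RFC 4226 test secret (`12345678901234567890` base32-encoded), this yields "287082", matching the published test vector.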

Domain: Extraction — Chat

| Behavior | Criticality |
| --- | --- |
| Extracts messages with author, timestamp, content from DM DOM | Catastrophic |
| Scrolls virtualized list to load historical messages | High |
| Deduplicates messages by ID across scroll iterations | High |
| Sorts messages by ID (chronological order) | High |
| Stops scrolling after N no-change iterations | Medium |
| Respects --days cutoff for chat messages | High |
| Handles empty chat pane (no messages) | Medium |
| Handles missing author (anonymous/system messages) | Medium |
| Quick mode: extracts visible messages only | Medium |

Domain: Extraction — Channel Thread

| Behavior | Criticality |
| --- | --- |
| Extracts thread messages from channel-replies-viewport | Catastrophic |
| Handles subject line on parent message | Medium |
| Scrolls DOWN to load replies (opposite of chat) | High |
| Accumulates messages across scroll iterations via ID map | High |
| Respects --days cutoff with channel timestamp format | High |

Domain: Extraction — Transcript

| Behavior | Criticality |
| --- | --- |
| Finds correct SharePoint iframe (xplatplugins over streamembed) | High |
| Extracts speaker, timestamp, text from transcript cells | High |
| Scrolls virtualized transcript list | High |
| Clicks Transcript tab if not already selected | Medium |

Domain: Extraction — Meeting Chat

| Behavior | Criticality |
| --- | --- |
| Extracts chat messages first, then checks for Recap tab | High |
| Clicks Recap tab and extracts transcript | High |
| Handles meetings with no recap available | Medium |
| Returns combined dict with messages + transcript | High |

Domain: Extraction — Channel Listing

| Behavior | Criticality |
| --- | --- |
| Lists thread subjects from channel page | Medium |
| Builds thread URLs from channel URL + message IDs | High |
| Scrolls to load more threads | Medium |

Domain: Search

| Behavior | Criticality |
| --- | --- |
| Enters search text and clicks Messages tab | High |
| Intercepts Substrate API responses | Catastrophic |
| Parses API response structure (EntitySets → ResultSets → Results → Source) | Catastrophic |
| Builds Teams deep link URLs from thread_id + message_id | High |
| Deduplicates results by InternetMessageId | High |
| Handles missing fields in API response gracefully | Medium |
| Filters bot MRIs from DisplayTo channel names | Medium |
| Falls back to HitHighlightedSummary when Preview is empty | Medium |
| Applies date filter (preset labels and custom date picker) | High |
| Applies person filter (type name, click suggestion) | High |
| Keeps only last API response after filters are applied | High |
| Scrolls + clicks "Show more" for pagination | Medium |
| Falls back to DOM scraping when API interception fails | Medium |
| Empty query with --from uses person name as search text | High |
| --expand: navigates to result URLs and extracts context | Medium |

Domain: Formatting

| Behavior | Criticality |
| --- | --- |
| Markdown: author header, quoted content, separators | High |
| Markdown: consecutive messages from same author collapse header | Medium |
| Markdown: date change inserts date separator | Low |
| Markdown: subject rendered as ### heading | Low |
| Markdown: empty content messages skipped | Medium |
| Markdown: grep filters by content/author/subject (case-insensitive) | High |
| Markdown: sort newest reverses message order | Medium |
| Markdown: multiline content each line gets > prefix | Medium |
| Markdown: channel thread timestamp format handled | Low |
| Markdown: transcript speaker continuation (same speaker, no repeat header) | Medium |
| Markdown: search results with/without URL, with/without expanded context | Medium |
| Markdown: channel threads list with subjects and URLs | Medium |
| Markdown: recent conversations with type labels and URLs | Medium |
| JSON: messages serialized as JSON array | High |
| JSON: meeting_chat has messages + transcript keys | Medium |
| JSON: unicode preserved | Low |
| JSON: multi-URL output as list of {url, type, data} | Medium |

Domain: Caching

| Behavior | Criticality |
| --- | --- |
| Cache key is stable for same URL (normalized) | High |
| Trailing slash and fragment are stripped for key | Medium |
| Different URLs get different keys | High |
| Round-trip write then read returns same data | High |
| Missing cache file returns None | Medium |
| Corrupt cache file returns None (not crash) | Medium |
| --refresh re-extracts and overwrites cache | Medium |
| --no-cache skips both read and write | Medium |
| --quick mode disables cache reads | Medium |
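
A minimal sketch of the normalization these behaviors imply (function name and hash choice are assumptions; the real script may differ):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def cache_key(url: str) -> str:
    # Strip the fragment and any trailing slash, then hash the rest,
    # so equivalent URLs map to the same cache file.
    parts = urlsplit(url)
    normalized = urlunsplit(
        (parts.scheme, parts.netloc, parts.path.rstrip("/"), parts.query, "")
    )
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]
```

Hashing sidesteps filesystem-unsafe characters in Teams URLs while keeping the key stable across the trailing-slash and fragment variants the tests exercise.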

Domain: Cookie Parsing

| Behavior | Criticality |
| --- | --- |
| Parses Netscape cookie format | Low |
| Strips leading dots from domains | Low |
| Filters to Microsoft domains only | Low |
| Skips comments and blank lines | Low |
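
The four behaviors above fit in a few lines; this sketch assumes the standard seven-field Netscape cookies.txt layout (the domain allowlist shown is illustrative):

```python
def parse_netscape_cookies(text: str) -> list[dict]:
    # Netscape cookies.txt fields, tab-separated:
    # domain, flag, path, secure, expiry, name, value
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            continue
        domain = fields[0].lstrip(".")  # strip leading dots
        if not domain.endswith(("microsoft.com", "microsoftonline.com", "office.com")):
            continue  # keep Microsoft domains only
        cookies.append(
            {"domain": domain, "path": fields[2], "name": fields[5], "value": fields[6]}
        )
    return cookies
```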

Domain: CLI Orchestration

| Behavior | Criticality |
| --- | --- |
| Multiple URLs reuse one browser session | Medium |
| Launcher interstitial ("Use the web app instead") handled | High |
| PWA cache cleared to detect stale auth | High |
| Output to file with -o | Low |
| Debug mode opens visible browser | Low |

Phase 3 — Test Audit: What's Load-Bearing, What's Theater

Summary: 98 unit tests, 32 E2E tests

130 total tests. Of those, 32 E2E tests are skipped in CI (they require a real Teams session and urls.env). That means CI runs 98 tests. Let me examine what those 98 actually test.

test_url_detection.py (10 tests) — Mostly load-bearing

These are solid. Each test provides a realistic URL and asserts the correct type string. They cover the 5 return values and edge cases (URL encoding, query params, fragment paths).

Gap: No test for URLs that don't match any pattern — e.g., https://outlook.office365.com/something or a completely malformed URL. The fallback to "chat" is untested. Minor.

test_cookies.py (4 tests) — Load-bearing but low-criticality

Cookie parsing is a fallback auth method. These tests are solid for what they cover. The code they test is straightforward string splitting.

Verdict: Real tests, but they protect a non-critical code path. Cookie-based auth is secondary to the persistent profile approach.

test_search.py (14 tests) — Load-bearing

All tests for _days_to_date_filter. Good boundary testing (1, 2, 3, 5, 7, 8, 14, 30, 31, 32, 60, 90 days). Each asserts the correct preset label or custom date.

Gap: Only tests the pure mapping function. The actual application of the filter (clicking the dropdown, selecting the option, clicking Apply) is untested at the unit level — and it can't be, since it requires a browser. The E2E tests cover this path when urls.env is present.
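
From the boundaries tested, the mapping presumably looks something like this (a hypothetical reconstruction; the preset labels and thresholds are guesses from the test names, not the real `_days_to_date_filter`):

```python
import datetime

def days_to_date_filter(days: int):
    # Small windows map to Teams' preset filter labels; anything else
    # falls through to a custom cutoff date for the date picker.
    presets = {1: "Today", 2: "Yesterday", 7: "Last week", 30: "Last month"}
    if days in presets:
        return ("preset", presets[days])
    cutoff = datetime.date.today() - datetime.timedelta(days=days)
    return ("custom", cutoff.isoformat())
```

The boundary values in the test list (7 vs. 8, 30 vs. 31) are exactly the points where a preset/custom mapping like this would flip, which is why that boundary testing is valuable.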

test_cache.py (8 tests) — Load-bearing

Good: round-trip, missing file, corrupt file, URL normalization. Uses tempfile.TemporaryDirectory to avoid side effects. The CACHE_DIR monkey-patching is a bit fragile but works.

Minor anti-pattern: test_round_trip (line 37-57) mutates module-level state (teams.CACHE_DIR). A test crash before the finally block would leave global state dirty for later tests in the same process — pytest does not isolate tests into separate processes by default, so a fixture-based reset would be safer.
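
The idiomatic fix is pytest's `monkeypatch` fixture or a context manager, either of which guarantees restoration even when the test body raises. A stdlib sketch of the latter:

```python
import contextlib
import tempfile

@contextlib.contextmanager
def patched_cache_dir(module):
    # Restores module-level CACHE_DIR even if the test body raises,
    # avoiding the dirty-global-state failure mode described above.
    with tempfile.TemporaryDirectory() as tmp:
        old = module.CACHE_DIR
        module.CACHE_DIR = tmp
        try:
            yield tmp
        finally:
            module.CACHE_DIR = old
```

With pytest available, `monkeypatch.setattr(teams, "CACHE_DIR", tmp)` achieves the same guarantee with less code.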

test_markdown.py (58 tests) — Mixed: mostly load-bearing, some shallow

Load-bearing tests:

  • TestMessagesToMarkdown (11 tests): Good behavioral tests. Checks author headers, separators, date changes, grep filtering, sorting, multiline quoting, subject headings, empty content skipping.
  • TestTranscriptToMarkdown (4 tests): Speaker continuation, speaker change, grep filter.
  • TestSearchToMarkdown (5 tests): URL rendering, expanded context with <details>.
  • TestParseSearchApiResults (10 tests): This is the highest-value unit test class. Covers the Substrate API parsing: field extraction, deduplication, missing fields, bot MRI filtering, fallback fields, multi-response merging, missing message ID.
  • TestBuildTeamsUrl (4 tests): URL construction for chat vs. channel vs. meeting threads.

Shallow tests:

  1. TestFormatOutput — mostly a routing test, not a behavior test (test_markdown.py:112-189):

    def test_chat_markdown(self):
        data = [{"author": "Alice", "timestamp": "", "content": "Hello"}]
        output = teams._format_output("chat", data, use_json=False)
        assert "**Alice**" in output
        assert "> Hello" in output

    This test is load-bearing only for the routing logic ("does _format_output('chat', ...) call messages_to_markdown?"). The actual formatting is already tested in TestMessagesToMarkdown. Seven tests that mostly verify the same thing: "does the dispatcher dispatch?" These aren't bad, but they're duplicative.

  2. TestToJson tests are shallow (test_markdown.py:612-658):

    def test_messages_to_json(self):
        output = teams._to_json(messages)
        parsed = json.loads(output)
        assert len(parsed) == 2
        assert parsed[0]["author"] == "Alice"

    _to_json is literally `json.dumps(data, indent=2, ensure_ascii=False)`. Five tests for json.dumps. The test_json_preserves_unicode test is the only one providing non-trivial value; the rest are testing the standard library.

test_e2e.py (32 tests) — Load-bearing but unavailable in CI

These are real E2E tests that call the CLI via subprocess against live Teams. They're the only tests that exercise:

  • Browser lifecycle (launch, navigate, close)
  • Authentication
  • DOM extraction JavaScript payloads
  • Scroll loops
  • Cache behavior end-to-end
  • Search with real API interception
  • Multi-URL extraction
  • Meeting chat → transcript chaining
  • Sidebar listing with filters

Critical problem: All 32 are skipped in CI. They only run on the developer's machine when urls.env and ~/.teams-cli/profile exist. This means CI tests zero percent of the critical paths. The 98 tests that CI runs test formatting, URL parsing, cache key generation, and cookie parsing — all of which are supporting functions, not the core value.

Anti-pattern — Duplicate test name (test_e2e.py:459-476):

def test_recent_filter_channels(self):  # line 459
    ...
def test_recent_filter_channels(self):  # line 468
    ...

Two methods with the same name in TestRecentConversations. Python silently overwrites the first with the second. The first test (line 459) never runs. It's dead code that inflates the count.

Weak assertions in E2E (test_e2e.py:37-41):

def _assert_has_messages(stdout, label="output"):
    assert stdout.strip(), f"{label} is empty"
    assert "**" in stdout, f"No author names found in {label}"
    assert ">" in stdout, f"No message content found in {label}"

This helper is used by most extraction tests. It checks that something bold and something quoted exists in the output. It doesn't verify message count, author accuracy, content completeness, or structural correctness. A regression that corrupts all timestamps, drops half the messages, or garbles author names would pass these assertions. That said, for E2E tests against live data whose content changes, this level of assertion is defensible — you can't assert on specific content.
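
Without asserting on specific content, the helper could still check structure: count the author headers and confirm quoted lines exist, using the markdown contract the unit tests already rely on (bold author header, `> ` content lines). A sketch (`min_messages` and the helper name are hypothetical):

```python
import re

def assert_message_structure(stdout: str, min_messages: int = 1, label: str = "output"):
    # Count author headers and quoted content lines instead of merely
    # checking that "**" and ">" appear somewhere in the output.
    headers = re.findall(r"^\*\*(.+?)\*\*", stdout, flags=re.MULTILINE)
    quoted = re.findall(r"^> .+", stdout, flags=re.MULTILINE)
    assert len(headers) >= min_messages, (
        f"{label}: expected >= {min_messages} messages, found {len(headers)}"
    )
    assert quoted, f"{label}: no quoted content lines"
    assert all(h.strip() for h in headers), f"{label}: empty author name"
```

A regression that drops half the messages would now fail the count check, while the assertions still avoid depending on live message content.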


Phase 4 — Coverage Matrix

| Behavior | Crit. | Unit | Integration | E2E | Notes |
| --- | --- | --- | --- | --- | --- |
| URL detection (all 5 types) | High | Solid | Absent | Absent | 10 good unit tests |
| Auth: login flow | Catastrophic | Absent | Absent | Shallow | E2E tests exercise it implicitly but don't assert on it |
| Auth: mid-extraction expiry | Catastrophic | Absent | Absent | Absent | No test intentionally triggers expiry |
| Auth: auth banner detection | High | Absent | Absent | Absent | Only tested if it happens to occur during E2E run |
| Chat extraction (DOM JS) | Catastrophic | Absent | Absent | Shallow | E2E asserts "has messages", no JS payload unit test |
| Thread extraction (DOM JS) | Catastrophic | Absent | Absent | Shallow | Same — E2E checks for bold + quoted text |
| Transcript extraction (iframe) | High | Absent | Absent | Shallow | E2E only |
| Meeting chat → transcript | High | Shallow | Absent | Shallow | Unit tests cover formatting, not extraction logic |
| Channel listing | Medium | Shallow | Absent | Shallow | Unit tests cover markdown formatting only |
| Scroll + deduplicate loop | High | Absent | Absent | Shallow | E2E test with scrolling exists but weak assertions |
| Search: API interception | Catastrophic | Absent | Absent | Shallow | No unit test for the interception itself |
| Search: API response parsing | Catastrophic | Solid | Absent | Shallow | 10 good unit tests for _parse_search_api_results |
| Search: date filter application | High | Absent | Absent | Shallow | _days_to_date_filter tested, but UI application isn't |
| Search: person filter | High | Absent | Absent | Shallow | E2E test exists |
| Search: empty query + --from | High | Absent | Absent | Shallow | Bug was fixed but no unit test prevents regression |
| Search: pagination/scroll | Medium | Absent | Absent | Absent | No test verifies "Show more" clicking works |
| Expand search results | Medium | Absent | Absent | Shallow | E2E test checks for "context" key |
| Markdown: messages | High | Solid | Absent | Absent | 11 tests covering formatting behaviors |
| Markdown: transcript | Medium | Solid | Absent | Absent | 4 tests |
| Markdown: search results | Medium | Solid | Absent | Absent | 5 tests including expanded context |
| Markdown: channel threads | Medium | Solid | Absent | Absent | 3 tests |
| Markdown: recent convos | Medium | Solid | Absent | Absent | 4 tests |
| JSON output | High | Shallow | Absent | Shallow | Tests json.dumps, not meaningful |
| Cache: key generation | High | Solid | Absent | Absent | 5 tests with normalization edge cases |
| Cache: read/write | High | Solid | Absent | Shallow | 3 unit + 3 E2E tests |
| Cache: --refresh / --no-cache | Medium | Absent | Absent | Shallow | E2E only |
| Cookie parsing | Low | Solid | Absent | Absent | 4 tests |
| URL construction (deep links) | High | Solid | Absent | Absent | build_teams_url + _build_thread_url |
| Multi-URL extraction | Medium | Absent | Absent | Shallow | 2 E2E tests |
| Launcher interstitial | High | Absent | Absent | Implicit | Happens during navigation, not explicitly tested |
| PWA cache clearing | High | Absent | Absent | Absent | CDP call, no test at any level |
| _format_output routing | Medium | Shallow | Absent | Absent | Tests dispatcher, not behavior |
| --days cutoff in extraction | High | Absent | Absent | Absent | Date filtering in extract_chat/extract_thread untested |
| Sidebar filter buttons (CSS overflow workaround) | Medium | Absent | Absent | Shallow | E2E tests one filter |
| list_recent thread ID → URL | Medium | Solid | Absent | Shallow | _chat_url_from_thread_id unit tested |

Phase 5 — Structural Problems

1. All critical paths are untestable in CI

The entire extraction pipeline — everything that touches a browser — runs only locally. CI validates formatting and URL parsing. If someone breaks CHAT_EXTRACT_JS, CHANNEL_COLLECT_JS, the scroll loops, or the auth flow, CI passes green.

This isn't fixable without a mock Teams server or recorded browser sessions (Playwright's `route` interception or `route_from_har` replay). The system's dependency on a live Teams session with real data makes traditional E2E testing in CI impractical. This is an architectural constraint, not a laziness problem.

What would help: Playwright HAR recording of successful extractions, replayed in CI. This would cover navigation, DOM extraction, and API interception without a live session. It's non-trivial but it's the only way to get CI coverage of the critical paths.
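
Short of full replay, even one recorded HAR file is useful as a fixture source, since HAR is plain JSON: pull the recorded Substrate response bodies and feed them to the parser tests (field layout follows the HAR format; the helper name is hypothetical):

```python
import json

def substrate_bodies_from_har(har_path: str) -> list[dict]:
    # A HAR file is JSON: log.entries[] with request/response pairs.
    # Extract the bodies of recorded Substrate search responses so they
    # can serve as fixtures for _parse_search_api_results-style tests.
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    bodies = []
    for entry in har["log"]["entries"]:
        if "substrate.office.com" in entry["request"]["url"]:
            text = entry["response"]["content"].get("text")
            if text:
                bodies.append(json.loads(text))
    return bodies
```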

2. JavaScript payloads have zero unit testing

Six JS payloads (CHAT_EXTRACT_JS, CHANNEL_COLLECT_JS, TRANSCRIPT_COLLECT_JS, SEARCH_EXTRACT_JS, RECENT_CHATS_JS, CHANNEL_THREADS_JS) are the core extraction logic. They're inline strings in Python. There are no tests that verify their behavior against sample DOM structures.

These payloads have already caused production bugs (09b21f8 — DM extraction walking up to wrong container). They're the most fragile part of the system because Microsoft can change the Teams DOM at any time.

What would help: jsdom or a lightweight DOM test harness that feeds sample HTML fragments to these JS functions and verifies the output structure. This would at least catch regressions in the JS logic itself (not selector changes, which require live DOM).
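
A cheaper variant needs no JS runtime at all: save a known-good DOM fragment as an HTML fixture and assert that the hooks the payloads select on still appear in it, so a fixture refresh flags selector drift. A sketch using only the standard library (`data-tid` is a placeholder, not the real selector):

```python
from html.parser import HTMLParser

class SelectorProbe(HTMLParser):
    # Collects attribute values so a test can assert the hooks the JS
    # payloads rely on are still present in a saved DOM fixture.
    def __init__(self):
        super().__init__()
        self.testids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-tid":  # placeholder attribute name
                self.testids.add(value)

def assert_selectors_present(html: str, required: set):
    probe = SelectorProbe()
    probe.feed(html)
    missing = required - probe.testids
    assert not missing, f"fixture no longer contains: {sorted(missing)}"
```

This doesn't catch live DOM changes, but it does catch the fixture and the payloads drifting apart silently.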

3. No contract testing of the Substrate API

_parse_search_api_results has good unit tests with hand-crafted JSON. But if Substrate changes its response schema (renames EntitySets, changes Source structure), nothing catches it until a live search fails. A recorded API response snapshot stored as a fixture would serve as a contract test.
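
The contract test can be as simple as walking the recorded fixture along the nesting named above and failing loudly on drift (the inline `fixture` stands in for a recorded response file):

```python
def assert_substrate_shape(response: dict):
    # Walk EntitySets -> ResultSets -> Results -> Source; any rename in
    # the schema raises KeyError or AssertionError instead of silently
    # producing empty search results.
    for entity_set in response["EntitySets"]:
        for result_set in entity_set["ResultSets"]:
            for result in result_set["Results"]:
                assert "Source" in result, "Substrate schema drift: missing Source"

# Stand-in for a recorded response stored as a test fixture.
fixture = {"EntitySets": [{"ResultSets": [{"Results": [{"Source": {"Preview": "hi"}}]}]}]}
assert_substrate_shape(fixture)
```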

4. No test fixtures for realistic/adversarial data

All unit test data uses clean fixtures:

{"author": "Alice", "timestamp": "2026-01-01", "content": "Hello"}

Real Teams messages have:

  • Empty author (system messages, bots)
  • HTML entities in content (already stripped by innerText, but the test doesn't verify this assumption)
  • Unicode/emoji in author names and content
  • Extremely long messages (paste of a log file)
  • Messages with only attachments (no text content)
  • Nested > quotes (markdown-in-markdown)
  • Newlines within author names (Teams sometimes renders "Name\nPronouns")

The test_json_preserves_unicode test is the only nod toward non-ASCII data.
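
A fixture list covering these cases costs little to maintain. The sketch below pairs it with a minimal stand-in for the quoting step so the invariant (every content line gets a `> ` prefix) is checkable; the real formatter would be substituted in:

```python
# Adversarial message fixtures mirroring the cases listed above.
ADVERSARIAL_MESSAGES = [
    {"author": "", "timestamp": "2026-01-01", "content": "joined the chat"},       # system message
    {"author": "Ana 🚀", "timestamp": "2026-01-01", "content": "emoji ✅ café"},    # unicode/emoji
    {"author": "Name\nshe/her", "timestamp": "2026-01-01", "content": "hi"},       # newline in author
    {"author": "Bob", "timestamp": "2026-01-01", "content": "> quoted\n> reply"},  # nested quotes
    {"author": "Bob", "timestamp": "2026-01-01", "content": ""},                   # attachment-only
    {"author": "Bob", "timestamp": "2026-01-01", "content": "x" * 50_000},         # pasted log file
]

def quote(content: str) -> str:
    # Minimal stand-in for the markdown quoting step: every line prefixed.
    return "\n".join("> " + line for line in content.splitlines())

for msg in ADVERSARIAL_MESSAGES:
    quoted = quote(msg["content"])
    assert all(line.startswith("> ") for line in quoted.splitlines())
```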

5. Duplicate test method name

TestRecentConversations.test_recent_filter_channels appears twice (test_e2e.py:459 and :468). The first one is silently overwritten by Python. This should be a lint error.


Phase 6 — The Honest Smell Report

If 30% of the tests were deleted at random, would the remaining suite still catch most real regressions?

Yes. The suite is not bloated — it's lopsided. The 98 CI tests are concentrated in formatting and pure functions. Deleting 30% would remove some boundary tests for _days_to_date_filter or some of the _to_json tests, neither of which would meaningfully reduce regression detection. The formatting tests are solid but they're protecting the least-likely-to-break part of the system.

Were the last bugs in tested or untested code?

All three real bugs (DM extraction, person filter, empty query) were in completely untested code. The tests that existed at the time provided zero signal about these failures. The test suite was green while the tool was broken.

Ratio of behavior tests to implementation-detail tests?

Roughly 85/15. Most tests do assert on behavioral outputs (given this input, does the output contain these strings?). The TestFormatOutput class is the main offender — it's testing internal dispatch, not user-facing behavior. The TestToJson class is testing json.dumps.

Could a new engineer learn what the system does from the tests?

Partially. The test file names and class names are descriptive. The unit tests show what data goes in and what markdown comes out, which teaches the formatting contract. But the tests provide zero insight into how extraction works, what the auth flow looks like, or what the search interaction involves — because none of that is unit-tested.


The Headline

The suite has two layers:

  1. Formatting and pure functions — Well-tested. 98 tests in CI. Would catch regressions in markdown output, URL parsing, API response parsing, cache key generation, and date filter mapping.

  2. Everything that makes the tool work — Untested in CI. Auth, browser navigation, DOM extraction, scrolling, API interception, filter application, session recovery. 32 E2E tests exist but only run on one developer's machine.

The suite's honest batting average: every real production bug so far was in the untested layer. The tested layer has never been the source of a bug, because it's straightforward formatting code.

That doesn't mean the formatting tests are worthless — they'd catch a refactor regression. But the test suite's coverage is inverse to the system's risk profile. The most complex, most fragile, most DOM-dependent code has no automated safety net in CI.
