What it is: A CLI tool (teams) that extracts Microsoft Teams conversations as markdown/JSON. Single-file Python script (~2000 lines) using Playwright to drive a headless Chromium browser against the Teams web app.
Entry points: One CLI command (teams) with mutually exclusive modes:
- `teams <url>` — Extract a specific conversation
- `teams --find <query>` — Global search across Teams
- `teams --recent` — List sidebar conversations
- `teams --login` — Open browser for manual login
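A minimal sketch of how mutually exclusive modes like these are typically wired with `argparse`. The flag names come from the list above, but the structure (a required mutually exclusive group with an optional positional) is an assumption, not the script's actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch: the real script's wiring may differ.
    parser = argparse.ArgumentParser(prog="teams")
    group = parser.add_mutually_exclusive_group(required=True)
    # An optional positional (nargs="?") can participate in the group.
    group.add_argument("url", nargs="?", help="Teams conversation URL to extract")
    group.add_argument("--find", metavar="QUERY", help="global search across Teams")
    group.add_argument("--recent", action="store_true", help="list sidebar conversations")
    group.add_argument("--login", action="store_true", help="open browser for manual login")
    return parser

args = build_parser().parse_args(["--recent"])
```

Passing two modes at once (e.g. `--recent --find x`) makes `argparse` exit with a usage error, which matches the "mutually exclusive" contract.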
Core domain:
- URL detection — Classify a Teams URL into one of 5 content types (chat, thread, channel, meeting_chat, transcript)
- Authentication — Login flow (TOTP, account picker, password), session persistence, mid-extraction session expiry detection and recovery
- Extraction — Navigate to Teams URL, execute JavaScript in the DOM, scroll virtualized lists, deduplicate, collect messages
- Search — Drive Teams search UI, intercept Substrate API responses, parse structured results, apply filters
- Formatting — Convert extracted data to markdown or JSON
- Caching — URL-keyed JSON cache with refresh/no-cache flags
- Orchestration — Multi-URL extraction, expand search results, meeting chat → transcript chaining
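A hedged sketch of what the URL classifier might look like, using only the URL markers this report mentions elsewhere (`tacv2`, the `meeting_` thread prefix, `/l/message/`). The real function's patterns, ordering, and selectors almost certainly differ:

```python
from urllib.parse import unquote

def detect_url_type(url: str) -> str:
    """Hypothetical classifier: map a Teams URL to one of the 5 content types."""
    u = unquote(url)  # handle URL-encoded path segments
    if "recap" in u or "transcript" in u:
        return "transcript"       # meeting recap / transcript pages
    if "meeting_" in u:
        return "meeting_chat"     # meeting thread IDs carry a meeting_ prefix
    if "tacv2" in u and "/message/" in u:
        return "thread"           # a specific message inside a channel
    if "tacv2" in u:
        return "channel"          # channel landing page
    return "chat"                 # documented fallback: treat as a DM chat
```

The final `return "chat"` mirrors the fallback behavior noted later in this report — unrecognized URLs fall through to the chat extractor.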
External dependencies: Microsoft Teams web app (the entire DOM structure), Substrate search API (substrate.office.com), SharePoint iframes (cross-origin transcripts), Microsoft login (login.microsoftonline.com), TOTP (pyotp).
Critical paths (things that, if broken, make the tool useless):
- Auth flow — Can't do anything without a valid session
- Chat extraction JS payload — Wrong selectors = no messages or corrupt output
- Thread extraction JS payload — Same, different DOM
- Search API interception and parsing — No URLs in search results without this
- URL type detection — Wrong type = wrong extraction function = crash or garbage
- Scroll + deduplicate loop — Miss messages or get duplicates in long conversations
CI: GitHub Actions runs unit tests only (no E2E — requires real Teams session). Tests Python 3.12 and 3.13. No coverage thresholds. No retry config. No flaky test markers.
Recent bugs (from git log):
- `09b21f8` Fix DM extraction — DOM walk was wrong, found wrong author/timestamp. No unit test existed for the JS payload; this was caught manually.
- `75ce940` Fix person filter — intermediate API responses were polluting results. No unit test covered the filter-then-keep-last-response logic; caught manually.
- `07e3868` Fix empty-query search — `not args.find` is True for `""`. No unit test; caught during E2E testing.
- `b93c4f7` Fix Chrome extension submodule — irrelevant to current code.
All three real bugs were in code with zero test coverage. The tests that existed at the time would not have caught any of them.
| Behavior | Criticality |
|---|---|
| Correctly classifies DM chat URLs | High |
| Correctly classifies channel thread URLs (tacv2) | High |
| Correctly classifies meeting recap/transcript URLs | High |
| Correctly classifies meeting chat URLs (meeting_ prefix) | High |
| Correctly classifies channel landing page URLs | High |
| Correctly classifies direct message links (thread.v2 + /l/message/) | High |
| Handles URL-encoded characters in path segments | Medium |
| Falls through gracefully for unrecognized Teams URLs | Medium |
| Behavior | Criticality |
|---|---|
| Detects login page and fills email/password | Catastrophic |
| Handles "Pick an account" page | High |
| Enters TOTP code with 35s cooldown | High |
| Detects login errors (bad password, rate limit) | High |
| Detects "Stay signed in?" and clicks through | Medium |
| Detects mid-extraction session expiry (redirect to login) | Catastrophic |
| Detects auth banner (stale MSAL tokens) | High |
| Re-login + resume after session expiry | High |
| Fails gracefully when env vars are missing | Medium |
| Validates login actually completed (not stuck on login page) | High |
| Behavior | Criticality |
|---|---|
| Extracts messages with author, timestamp, content from DM DOM | Catastrophic |
| Scrolls virtualized list to load historical messages | High |
| Deduplicates messages by ID across scroll iterations | High |
| Sorts messages by ID (chronological order) | High |
| Stops scrolling after N no-change iterations | Medium |
| Respects --days cutoff for chat messages | High |
| Handles empty chat pane (no messages) | Medium |
| Handles missing author (anonymous/system messages) | Medium |
| Quick mode: extracts visible messages only | Medium |
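The scroll, deduplicate, sort, and stop-after-N-stale-iterations rows above describe one loop. A simulated sketch of that pattern — `fetch_visible` stands in for the real "scroll the virtualized list, then read the DOM" Playwright step, and the field names are assumptions:

```python
def collect_with_dedup(fetch_visible, max_stale=3):
    """Accumulate messages by ID across scroll iterations; stop after
    max_stale iterations that yield nothing new (stand-in for the real loop)."""
    seen = {}
    stale = 0
    while stale < max_stale:
        batch = fetch_visible()  # one "scroll + read DOM" step
        new = [m for m in batch if m["id"] not in seen]
        for m in new:
            seen[m["id"]] = m
        stale = 0 if new else stale + 1
    # sort by ID, assumed chronological for Teams message IDs
    return sorted(seen.values(), key=lambda m: m["id"])

# Simulate two overlapping "scroll windows", then an exhausted list.
windows = iter([
    [{"id": 3, "content": "c"}, {"id": 2, "content": "b"}],
    [{"id": 2, "content": "b"}, {"id": 1, "content": "a"}],
])

def fetch_visible():
    return next(windows, [])

messages = collect_with_dedup(fetch_visible)
# ids come back deduplicated and in chronological order: [1, 2, 3]
```

A bug in either the dedup map or the stale counter shows up immediately with fixtures like this — which is exactly the kind of unit coverage the real loop lacks.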
| Behavior | Criticality |
|---|---|
| Extracts thread messages from channel-replies-viewport | Catastrophic |
| Handles subject line on parent message | Medium |
| Scrolls DOWN to load replies (opposite of chat) | High |
| Accumulates messages across scroll iterations via ID map | High |
| Respects --days cutoff with channel timestamp format | High |
| Behavior | Criticality |
|---|---|
| Finds correct SharePoint iframe (xplatplugins over streamembed) | High |
| Extracts speaker, timestamp, text from transcript cells | High |
| Scrolls virtualized transcript list | High |
| Clicks Transcript tab if not already selected | Medium |
| Behavior | Criticality |
|---|---|
| Extracts chat messages first, then checks for Recap tab | High |
| Clicks Recap tab and extracts transcript | High |
| Handles meetings with no recap available | Medium |
| Returns combined dict with messages + transcript | High |
| Behavior | Criticality |
|---|---|
| Lists thread subjects from channel page | Medium |
| Builds thread URLs from channel URL + message IDs | High |
| Scrolls to load more threads | Medium |
| Behavior | Criticality |
|---|---|
| Enters search text and clicks Messages tab | High |
| Intercepts Substrate API responses | Catastrophic |
| Parses API response structure (EntitySets → ResultSets → Results → Source) | Catastrophic |
| Builds Teams deep link URLs from thread_id + message_id | High |
| Deduplicates results by InternetMessageId | High |
| Handles missing fields in API response gracefully | Medium |
| Filters bot MRIs from DisplayTo channel names | Medium |
| Falls back to HitHighlightedSummary when Preview is empty | Medium |
| Applies date filter (preset labels and custom date picker) | High |
| Applies person filter (type name, click suggestion) | High |
| Keeps only last API response after filters are applied | High |
| Scrolls + clicks "Show more" for pagination | Medium |
| Falls back to DOM scraping when API interception fails | Medium |
| Empty query with --from uses person name as search text | High |
| --expand: navigates to result URLs and extracts context | Medium |
| Behavior | Criticality |
|---|---|
| Markdown: author header, quoted content, separators | High |
| Markdown: consecutive messages from same author collapse header | Medium |
| Markdown: date change inserts date separator | Low |
| Markdown: subject rendered as ### heading | Low |
| Markdown: empty content messages skipped | Medium |
| Markdown: grep filters by content/author/subject (case-insensitive) | High |
| Markdown: sort newest reverses message order | Medium |
| Markdown: multiline content gets a > prefix on each line | Medium |
| Markdown: channel thread timestamp format handled | Low |
| Markdown: transcript speaker continuation (same speaker, no repeat header) | Medium |
| Markdown: search results with/without URL, with/without expanded context | Medium |
| Markdown: channel threads list with subjects and URLs | Medium |
| Markdown: recent conversations with type labels and URLs | Medium |
| JSON: messages serialized as JSON array | High |
| JSON: meeting_chat has messages + transcript keys | Medium |
| JSON: unicode preserved | Low |
| JSON: multi-URL output as list of {url, type, data} | Medium |
| Behavior | Criticality |
|---|---|
| Cache key is stable for same URL (normalized) | High |
| Trailing slash and fragment are stripped for key | Medium |
| Different URLs get different keys | High |
| Round-trip write then read returns same data | High |
| Missing cache file returns None | Medium |
| Corrupt cache file returns None (not crash) | Medium |
| --refresh re-extracts and overwrites cache | Medium |
| --no-cache skips both read and write | Medium |
| --quick mode disables cache reads | Medium |
| Behavior | Criticality |
|---|---|
| Parses Netscape cookie format | Low |
| Strips leading dots from domains | Low |
| Filters to Microsoft domains only | Low |
| Skips comments and blank lines | Low |
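The four behaviors above describe straightforward Netscape `cookies.txt` handling (seven tab-separated fields per line). A self-contained sketch — the dict field names and the exact domain allow-list are this sketch's assumptions, not the tool's:

```python
def parse_netscape_cookies(text: str) -> list[dict]:
    """Sketch of Netscape cookies.txt parsing: skip comments/blanks, strip
    leading dots from domains, keep only Microsoft-related domains."""
    allowed = ("microsoft.com", "microsoftonline.com", "office.com")  # assumed list
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not a valid Netscape-format row
        domain = fields[0].lstrip(".")
        if not domain.endswith(allowed):
            continue  # non-Microsoft domain
        # Netscape order: domain, flag, path, secure, expiry, name, value
        cookies.append({"domain": domain, "name": fields[5], "value": fields[6]})
    return cookies

sample = (
    "# Netscape HTTP Cookie File\n"
    ".login.microsoftonline.com\tTRUE\t/\tTRUE\t0\tESTSAUTH\txyz\n"
    "example.com\tTRUE\t/\tFALSE\t0\tother\tv\n"
)
cookies = parse_netscape_cookies(sample)
```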
| Behavior | Criticality |
|---|---|
| Multiple URLs reuse one browser session | Medium |
| Launcher interstitial ("Use the web app instead") handled | High |
| PWA cache cleared to detect stale auth | High |
| Output to file with -o | Low |
| Debug mode opens visible browser | Low |
130 total tests. Of those, 32 E2E tests are skipped in CI (they require a real Teams session and urls.env). That means CI runs 98 tests. Let me examine what those 98 actually test.
These are solid. Each test provides a realistic URL and asserts the correct type string. They cover the 5 return values and edge cases (URL encoding, query params, fragment paths).
Gap: No test for URLs that don't match any pattern — e.g., https://outlook.office365.com/something or a completely malformed URL. The fallback to "chat" is untested. Minor.
Cookie parsing is a fallback auth method. These tests are solid for what they cover. The code they test is straightforward string splitting.
Verdict: Real tests, but they protect a non-critical code path. Cookie-based auth is secondary to the persistent profile approach.
All tests for _days_to_date_filter. Good boundary testing (1, 2, 3, 5, 7, 8, 14, 30, 31, 32, 60, 90 days). Each asserts the correct preset label or custom date.
Gap: Only tests the pure mapping function. The actual application of the filter (clicking the dropdown, selecting the option, clicking Apply) is untested at the unit level — and it can't be, since it requires a browser. The E2E tests cover this path when urls.env is present.
Good: round-trip, missing file, corrupt file, URL normalization. Uses tempfile.TemporaryDirectory to avoid side effects. The CACHE_DIR monkey-patching is a bit fragile but works.
Minor anti-pattern: `test_round_trip` (lines 37-57) mutates module-level state (`teams.CACHE_DIR`). A test crash before the `finally` block would leave global state dirty. The blast radius is small, but note that pytest runs the whole suite in one process by default, so a dirty `CACHE_DIR` could leak into later tests in the same run.
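A less fragile alternative to a manual `finally` is `unittest.mock.patch.object` (or pytest's `monkeypatch` fixture), which restores the attribute even when the test body raises. A stand-in sketch — `types.SimpleNamespace` here substitutes for the real `teams` module:

```python
import tempfile
import types
from unittest import mock

# Stand-in for the real module; the actual test would patch teams.CACHE_DIR.
teams = types.SimpleNamespace(CACHE_DIR="/original")

def test_round_trip():
    with tempfile.TemporaryDirectory() as tmp, \
         mock.patch.object(teams, "CACHE_DIR", tmp):
        assert teams.CACHE_DIR == tmp  # patched only inside the context
        # ... write + read the cache here ...
    assert teams.CACHE_DIR == "/original"  # restored even if the body raised

test_round_trip()
```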
Load-bearing tests:
- `TestMessagesToMarkdown` (11 tests): Good behavioral tests. Checks author headers, separators, date changes, grep filtering, sorting, multiline quoting, subject headings, empty-content skipping.
- `TestTranscriptToMarkdown` (4 tests): Speaker continuation, speaker change, grep filter.
- `TestSearchToMarkdown` (5 tests): URL rendering, expanded context with `<details>`.
- `TestParseSearchApiResults` (10 tests): This is the highest-value unit test class. Covers the Substrate API parsing: field extraction, deduplication, missing fields, bot MRI filtering, fallback fields, multi-response merging, missing message ID.
- `TestBuildTeamsUrl` (4 tests): URL construction for chat vs. channel vs. meeting threads.
Shallow tests:
- `TestFormatOutput` — mostly a routing test, not a behavior test (test_markdown.py:112-189):

  ```python
  def test_chat_markdown(self):
      data = [{"author": "Alice", "timestamp": "", "content": "Hello"}]
      output = teams._format_output("chat", data, use_json=False)
      assert "**Alice**" in output
      assert "> Hello" in output
  ```

  This test is load-bearing only for the routing logic ("does `_format_output('chat', ...)` call `messages_to_markdown`?"). The actual formatting is already tested in `TestMessagesToMarkdown`. Seven tests mostly verify the same thing: "does the dispatcher dispatch?" They aren't bad, but they're duplicative.

- `TestToJson` tests are shallow (test_markdown.py:612-658):

  ```python
  def test_messages_to_json(self):
      output = teams._to_json(messages)
      parsed = json.loads(output)
      assert len(parsed) == 2
      assert parsed[0]["author"] == "Alice"
  ```

  `_to_json` is literally `json.dumps(data, indent=2, ensure_ascii=False)`. Five tests for `json.dumps`. The `test_json_preserves_unicode` test is the only one providing non-trivial value; the rest are testing the standard library.
These are real E2E tests that call the CLI via subprocess against live Teams. They're the only tests that exercise:
- Browser lifecycle (launch, navigate, close)
- Authentication
- DOM extraction JavaScript payloads
- Scroll loops
- Cache behavior end-to-end
- Search with real API interception
- Multi-URL extraction
- Meeting chat → transcript chaining
- Sidebar listing with filters
Critical problem: All 32 are skipped in CI. They only run on the developer's machine when urls.env and ~/.teams-cli/profile exist. This means CI tests zero percent of the critical paths. The 98 tests that CI runs test formatting, URL parsing, cache key generation, and cookie parsing — all of which are supporting functions, not the core value.
Anti-pattern — Duplicate test name (test_e2e.py:459-476):

```python
def test_recent_filter_channels(self):  # line 459
    ...

def test_recent_filter_channels(self):  # line 468
    ...
```

Two methods with the same name in `TestRecentConversations`. Python silently overwrites the first with the second, so the first test (line 459) never runs. It's dead code that inflates the count.
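Duplicate method names are easy to flag mechanically — linters catch this as pyflakes/Ruff F811 or Pylint `function-redefined` — and a small `ast`-based check works too:

```python
import ast
from collections import Counter

def duplicate_methods(source: str) -> list[str]:
    """Return method names defined more than once within the same class body."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            names = Counter(
                n.name for n in node.body if isinstance(n, ast.FunctionDef)
            )
            dupes += [f"{node.name}.{name}" for name, c in names.items() if c > 1]
    return dupes

src = """
class TestRecentConversations:
    def test_recent_filter_channels(self): ...
    def test_recent_filter_channels(self): ...
"""
assert duplicate_methods(src) == ["TestRecentConversations.test_recent_filter_channels"]
```

Enabling the lint rule in CI is the cheaper fix; the `ast` version is just the same check spelled out.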
Weak assertions in E2E (test_e2e.py:37-41):

```python
def _assert_has_messages(stdout, label="output"):
    assert stdout.strip(), f"{label} is empty"
    assert "**" in stdout, f"No author names found in {label}"
    assert ">" in stdout, f"No message content found in {label}"
```

This helper is used by most extraction tests. It checks that something bold and something quoted exists in the output. It doesn't verify message count, author accuracy, content completeness, or structural correctness. A regression that corrupts all timestamps, drops half the messages, or garbles author names would pass these assertions. That said, for E2E tests against live data whose content changes, this level of assertion is defensible — you can't assert on specific content.
| Behavior | Crit. | Unit | Integration | E2E | Notes |
|---|---|---|---|---|---|
| URL detection (all 5 types) | High | Solid | Absent | Absent | 10 good unit tests |
| Auth: login flow | Catastrophic | Absent | Absent | Shallow | E2E tests exercise it implicitly but don't assert on it |
| Auth: mid-extraction expiry | Catastrophic | Absent | Absent | Absent | No test intentionally triggers expiry |
| Auth: auth banner detection | High | Absent | Absent | Absent | Only tested if it happens to occur during E2E run |
| Chat extraction (DOM JS) | Catastrophic | Absent | Absent | Shallow | E2E asserts "has messages", no JS payload unit test |
| Thread extraction (DOM JS) | Catastrophic | Absent | Absent | Shallow | Same — E2E checks for bold + quoted text |
| Transcript extraction (iframe) | High | Absent | Absent | Shallow | E2E only |
| Meeting chat → transcript | High | Shallow | Absent | Shallow | Unit tests cover formatting, not extraction logic |
| Channel listing | Medium | Shallow | Absent | Shallow | Unit tests cover markdown formatting only |
| Scroll + deduplicate loop | High | Absent | Absent | Shallow | E2E test with scrolling exists but weak assertions |
| Search: API interception | Catastrophic | Absent | Absent | Shallow | No unit test for the interception itself |
| Search: API response parsing | Catastrophic | Solid | Absent | Shallow | 10 good unit tests for _parse_search_api_results |
| Search: date filter application | High | Absent | Absent | Shallow | _days_to_date_filter tested, but UI application isn't |
| Search: person filter | High | Absent | Absent | Shallow | E2E test exists |
| Search: empty query + --from | High | Absent | Absent | Shallow | Bug was fixed but no unit test prevents regression |
| Search: pagination/scroll | Medium | Absent | Absent | Absent | No test verifies "Show more" clicking works |
| Expand search results | Medium | Absent | Absent | Shallow | E2E test checks for "context" key |
| Markdown: messages | High | Solid | Absent | Absent | 11 tests covering formatting behaviors |
| Markdown: transcript | Medium | Solid | Absent | Absent | 4 tests |
| Markdown: search results | Medium | Solid | Absent | Absent | 5 tests including expanded context |
| Markdown: channel threads | Medium | Solid | Absent | Absent | 3 tests |
| Markdown: recent convos | Medium | Solid | Absent | Absent | 4 tests |
| JSON output | High | Shallow | Absent | Shallow | Tests json.dumps, not meaningful |
| Cache: key generation | High | Solid | Absent | Absent | 5 tests with normalization edge cases |
| Cache: read/write | High | Solid | Absent | Shallow | 3 unit + 3 E2E tests |
| Cache: --refresh / --no-cache | Medium | Absent | Absent | Shallow | E2E only |
| Cookie parsing | Low | Solid | Absent | Absent | 4 tests |
| URL construction (deep links) | High | Solid | Absent | Absent | build_teams_url + _build_thread_url |
| Multi-URL extraction | Medium | Absent | Absent | Shallow | 2 E2E tests |
| Launcher interstitial | High | Absent | Absent | Implicit | Happens during navigation, not explicitly tested |
| PWA cache clearing | High | Absent | Absent | Absent | CDP call, no test at any level |
| `_format_output` routing | Medium | Shallow | Absent | Absent | Tests dispatcher, not behavior |
| --days cutoff in extraction | High | Absent | Absent | Absent | Date filtering in extract_chat/extract_thread untested |
| Sidebar filter buttons (CSS overflow workaround) | Medium | Absent | Absent | Shallow | E2E tests one filter |
| `list_recent` thread ID → URL | Medium | Solid | Absent | Shallow | `_chat_url_from_thread_id` unit tested |
The entire extraction pipeline — everything that touches a browser — runs only locally. CI validates formatting and URL parsing. If someone breaks CHAT_EXTRACT_JS, CHANNEL_COLLECT_JS, the scroll loops, or the auth flow, CI passes green.
This isn't fixable without a mock Teams server or recorded browser sessions (Playwright's `route` interception or HAR replay). The system's dependency on a live Teams session with real data makes traditional E2E testing in CI impractical. This is an architectural constraint, not a laziness problem.
What would help: Playwright HAR recording of successful extractions, replayed in CI. This would cover navigation, DOM extraction, and API interception without a live session. It's non-trivial but it's the only way to get CI coverage of the critical paths.
Six JS payloads (CHAT_EXTRACT_JS, CHANNEL_COLLECT_JS, TRANSCRIPT_COLLECT_JS, SEARCH_EXTRACT_JS, RECENT_CHATS_JS, CHANNEL_THREADS_JS) are the core extraction logic. They're inline strings in Python. There are no tests that verify their behavior against sample DOM structures.
These payloads have already caused production bugs (09b21f8 — DM extraction walking up to wrong container). They're the most fragile part of the system because Microsoft can change the Teams DOM at any time.
What would help: jsdom or a lightweight DOM test harness that feeds sample HTML fragments to these JS functions and verifies the output structure. This would at least catch regressions in the JS logic itself (not selector changes, which require live DOM).
_parse_search_api_results has good unit tests with hand-crafted JSON. But if Substrate changes its response schema (renames EntitySets, changes Source structure), nothing catches it until a live search fails. A recorded API response snapshot stored as a fixture would serve as a contract test.
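A contract test along those lines could pin the response shape with a recorded fixture. A sketch with an invented minimal payload — the real Substrate response is far larger, and a real contract test would load a captured JSON file rather than an inline string:

```python
import json

# Invented minimal fixture standing in for a recorded Substrate response.
FIXTURE = json.dumps({
    "EntitySets": [{
        "ResultSets": [{
            "Results": [
                {"Source": {"Preview": "hello", "InternetMessageId": "<id1>"}}
            ]
        }]
    }]
})

def assert_substrate_shape(raw: str) -> None:
    """Fail loudly if the EntitySets → ResultSets → Results → Source path changes."""
    data = json.loads(raw)
    for entity_set in data["EntitySets"]:
        for result_set in entity_set["ResultSets"]:
            for result in result_set["Results"]:
                assert "Source" in result, "Results entry lost its Source key"

assert_substrate_shape(FIXTURE)
```

When a live search fails, re-recording the fixture and diffing it against the old one immediately shows which part of the schema moved.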
All unit test data uses clean fixtures:

```python
{"author": "Alice", "timestamp": "2026-01-01", "content": "Hello"}
```

Real Teams messages have:
- Empty author (system messages, bots)
- HTML entities in content (already stripped by `innerText`, but the tests don't verify this assumption)
- Unicode/emoji in author names and content
- Extremely long messages (paste of a log file)
- Messages with only attachments (no text content)
- Nested `>` quotes (markdown-in-markdown)
- Newlines within author names (Teams sometimes renders "Name\nPronouns")
The test_json_preserves_unicode test is the only nod toward non-ASCII data.
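A shared fixture list of messy messages would let every formatter test run against realistic data. The dicts below mirror the clean `{"author", "timestamp", "content"}` shape from the existing tests; the values are invented examples of the edge cases listed above:

```python
# Hypothetical edge-case fixtures for the formatter tests.
MESSY_MESSAGES = [
    {"author": "", "timestamp": "2026-01-01", "content": "system notice"},           # empty author
    {"author": "Zoë 🚀", "timestamp": "2026-01-01", "content": "emoji & unicode"},   # non-ASCII
    {"author": "Alice", "timestamp": "2026-01-01", "content": "line1\n" * 5000},     # huge paste
    {"author": "Bob", "timestamp": "2026-01-01", "content": ""},                     # attachment-only
    {"author": "Carol", "timestamp": "2026-01-01", "content": "> nested quote"},     # markdown-in-markdown
    {"author": "Dan\nHe/Him", "timestamp": "2026-01-01", "content": "pronoun row"},  # newline in author
]

# Each formatter test could then sweep the list and assert it neither
# crashes nor drops messages, e.g.:
#   output = teams.messages_to_markdown(MESSY_MESSAGES)
#   assert output.count("pronoun row") == 1
```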
TestRecentConversations.test_recent_filter_channels appears twice (test_e2e.py:459 and :468). The first one is silently overwritten by Python. This should be a lint error.
If 30% of the tests were deleted at random, would the remaining suite still catch most real regressions?
Yes. The suite is not bloated — it's lopsided. The 98 CI tests are concentrated in formatting and pure functions. Deleting 30% would remove some boundary tests for _days_to_date_filter or some of the _to_json tests, neither of which would meaningfully reduce regression detection. The formatting tests are solid but they're protecting the least-likely-to-break part of the system.
Were the last bugs in tested or untested code?
All three real bugs (DM extraction, person filter, empty query) were in completely untested code. The tests that existed at the time provided zero signal about these failures. The test suite was green while the tool was broken.
Ratio of behavior tests to implementation-detail tests?
Roughly 85/15. Most tests do assert on behavioral outputs (given this input, does the output contain these strings?). The TestFormatOutput class is the main offender — it's testing internal dispatch, not user-facing behavior. The TestToJson class is testing json.dumps.
Could a new engineer learn what the system does from the tests?
Partially. The test file names and class names are descriptive. The unit tests show what data goes in and what markdown comes out, which teaches the formatting contract. But the tests provide zero insight into how extraction works, what the auth flow looks like, or what the search interaction involves — because none of that is unit-tested.
The suite has two layers:
-
Formatting and pure functions — Well-tested. 98 tests in CI. Would catch regressions in markdown output, URL parsing, API response parsing, cache key generation, and date filter mapping.
-
Everything that makes the tool work — Untested in CI. Auth, browser navigation, DOM extraction, scrolling, API interception, filter application, session recovery. 32 E2E tests exist but only run on one developer's machine.
The suite's honest batting average: every real production bug so far was in the untested layer. The tested layer has never been the source of a bug, because it's straightforward formatting code.
That doesn't mean the formatting tests are worthless — they'd catch a refactor regression. But the test suite's coverage is inverse to the system's risk profile. The most complex, most fragile, most DOM-dependent code has no automated safety net in CI.