Goal: Deliver a blunt, evidence‑backed production‑readiness assessment of the Extractor project and a minimal patch set (unified diffs) with tests and doc updates. Keep changes surgical. Ship safety first.
Reviewer persona & tone
- Principal SRE/DevEx + AppSec; fluent with Python/uv, Typer/FastAPI, Vite/React, ArangoDB, CI.
- Be terse, specific, and fail‑closed. Unverified claims must be marked 🔴 (blocking) or 🟡 (needs proof). No hand‑waving.
Project context (declare at top of your report)
- Project: Extractor — Self‑Correcting Agentic Document Processing System (multi‑stage pipeline + tabbed UX)
- Repo root: /home/graham/workspace/experiments/extractor
- Date: <YYYY‑MM‑DD>
- Doc anchors (this repo):
  - Happy Path & CI: README.md (Quick Start, CI quick start), AGENTS.md (agent workflow/gates)
  - Pipeline overview: docs/PIPELINE_RUNBOOK.md, docs/SMOKES_GUIDE.md
  - UX prototype: prototypes/tabbed/html/README.md, prototypes/tabbed/docs/workflow.md
  - Make targets: Makefile (CI, smokes, pipeline runs, bundles)
Inputs you have
- This review bundle (code + docs) assembled via our bundler. Treat docs as claims requiring evidence in code and tests. Mark missing/out‑of‑date artifacts as P0 doc‑debt with explicit fixes.
How to build the bundle (reference for you; include paths in the report)
- Pipeline focus (recommended first pass):

      mkdir -p scripts/artifacts && \
      python3 scripts/tools/copy_selected_files.py \
        --root src/extractor/pipeline \
        --output scripts/artifacts/extractor_pipeline_bundle.txt

- UX (tabbed prototype) focus:

      mkdir -p scripts/artifacts && \
      python3 scripts/tools/copy_selected_files.py \
        --root prototypes/tabbed \
        --output scripts/artifacts/tabbed_bundle.txt

- Optional combined bundle:

      cat scripts/artifacts/extractor_pipeline_bundle.txt \
        scripts/artifacts/tabbed_bundle.txt \
        > scripts/artifacts/extractor_review_bundle.txt
Research requirements (competitive landscape)
- Develop the landscape using YOUR tools (don’t rely on our links). Deliver 6–10 entries with dated citations. Lenses:
- PDF/Document extraction pipelines (Marker/Surya, pdfplumber, Unstructured, DocAI/Document AI)
- Table extraction/repair (Camelot lattice/stream, Tabula, DeepDeSRT)
- Agentic pipelines for documents (self‑correcting/multi‑stage; quality gates)
- Graph/knowledge construction from docs (ArangoDB, FAISS, GraphRAG‑style patterns)
- Include a short research log (5–10 bullet queries + links) that we can reproduce.
Focus areas & acceptance criteria (explicit pass/fail)
- Shared (apply to pipeline and UX)
  - Subprocess safety: Prefer argv; avoid shell=True in mutating paths (document any necessary allow‑list).
  - Secrets: Never embed secrets in URLs; env/header only; grep/scanner passes.
  - Artifacts: Deterministic outputs under scripts/artifacts/ or data/results/... with timestamps.
  - Observability: Clear logs; failures show actionable messages; resource sampling present when enabled.
- Pipeline specifics (Stages 01–14)
  - Offline policy: Stages 01–09 run offline (no DB writes); DB I/O confined to 10–12. Flag any violations.
  - Import hygiene: No import‑time side effects; CLIs wired through build_cli() factories; lazy step imports via src/extractor/pipeline/steps/__init__.py remain intact.
  - JSON strictness: LLM stages must either use provider JSON mode or guarded parsing (clean_json_string), fail‑closed on invalid JSON; include minimal tests.
  - Concurrency/limits: Respect semaphores, timeouts, and cache in litellm_call; no stampedes; bounded memory.
  - Table integrity: Stage 05 cells sanitized minimally; Stage 07 merging honors Stage 05 column order; no invented rows/cols.
  - Reflow prompt: Stage 07’s strict schema rules enforced; figures/tables propagated with references and captions.
  - Flattening & ordering: Stage 10 preserves reading order, emits deterministic _key and doc/section IDs.
  - Graph edges: Stage 11 edge schema validation passes; weights in [0,1]; self‑edges only where specified.
  - Degradation: If Arango/FAISS absent, stages 10–12 degrade with clear messages and skip safely.
- UX specifics (Tabbed prototype)
  - Health gate: No Vite overlay; no console.error/pageerror; /main renders with [data-testid="page-label"] present.
  - Toolbar/Canvas: Top toolbar does not occlude canvas; pointer draw works (N → drag) and chip highlight appears on selection.
  - Thumbnails: Left‑rail and bottom filmstrip modes render; selector presence validated by smokes.
  - Artifacts: Health/log screenshot pairs saved under scripts/artifacts/ and linked in the report.
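
As one illustration of the JSON strictness criterion above, a fail‑closed parser might look like the sketch below. The names `InvalidLLMJson` and `parse_llm_json_strict` are hypothetical and are not taken from the repo; the project's own `clean_json_string` may behave differently. The sketch tolerates a single Markdown fence around the payload and otherwise raises rather than guessing.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


class InvalidLLMJson(ValueError):
    """Raised when an LLM response is not the expected JSON object (hypothetical name)."""


def parse_llm_json_strict(raw: str, required_keys: frozenset) -> dict:
    """Parse an LLM response as a JSON object or raise; never return a best guess."""
    text = raw.strip()
    # Tolerate exactly one wrapper: a Markdown code fence around the payload.
    if text.startswith(FENCE):
        lines = text.splitlines()
        if len(lines) < 2 or lines[-1].strip() != FENCE:
            raise InvalidLLMJson("unterminated code fence")
        text = "\n".join(lines[1:-1])
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise InvalidLLMJson(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise InvalidLLMJson(f"expected object, got {type(data).__name__}")
    missing = set(required_keys) - set(data)
    if missing:
        raise InvalidLLMJson(f"missing keys: {sorted(missing)}")
    return data
```

A minimal test for an LLM stage would then assert that prose, arrays, and objects with missing keys all raise, and that only a well-formed object is returned.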
Required commands to tie to readiness (run locally; attach key artifact paths)
- Dev CI gate (servers + suite):

      BASE_URL=http://127.0.0.1:8080 \
      BROWSERLESS_DISCOVERY_URL=http://127.0.0.1:9222/json/version \
      make ci

- Pipeline (offline path):

      make smokes-pipeline-offline
      make run-all-offline

- Happy path and full runs (as available in your environment):

      make steps-happy
      make quick-pipeline
      # Full requires API keys + Arango
      make pipeline-full
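
If the readiness commands are driven from a script rather than typed by hand, the argv-only rule from the Shared focus area applies to the driver too. The sketch below is one way to do that; `run_gate` is a hypothetical helper, not something in the repo, and the default timeout is an arbitrary assumption.

```python
import os
import subprocess
import sys
from typing import Optional


def run_gate(
    argv: list,
    extra_env: Optional[dict] = None,
    timeout_s: float = 1800.0,  # assumed ceiling; tune per target
) -> subprocess.CompletedProcess:
    """Run one readiness command as an argv list (no shell) and capture its output."""
    # Variables are passed through env=, not interpolated into a shell string.
    env = {**os.environ, **(extra_env or {})}
    return subprocess.run(
        argv,
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # caller inspects returncode and reports the failure itself
    )
```

For example, the Dev CI gate would be `run_gate(["make", "ci"], {"BASE_URL": "http://127.0.0.1:8080", "BROWSERLESS_DISCOVERY_URL": "http://127.0.0.1:9222/json/version"})`, with a non-zero `returncode` treated as a failed gate.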
Output format (strict)
- Executive verdict (1–2 paragraphs)
- Readiness: 🔴/🟡/✅
- Top 5 “will break in prod” with file:line or CLI anchors
- Competitive landscape matrix (≥6 rows, dated citations)
- Project assessment by focus area (finding → evidence → minimal patch/diff → tests/smokes → doc diffs)
- Per‑file code review frame (Critical/Medium/Hygiene/Strengths)
- Patches grouped into ≤3 minimal PRs (unified diffs only; no refactors)
- Test plan (deterministic smokes: pipeline + UX)
- Research log & citations
- Readiness gate & score (0–100; rubric below)
- Submission checklist (all must be true):
  - Offline policy holds (01–09 no DB writes)
  - Mutating paths avoid shell=True (except documented allow‑list)
  - No secrets in URLs; scanner passes
  - JSON strictness enforced in LLM stages; invalid JSON fails‑closed
  - Table merge rules honored; no invented rows/cols
  - UX health gate passes (overlay/console errors/pageerror = fail)
  - Patches include smokes/doc diffs; CLIs preserved
  - Readiness targets produce artifacts matching docs
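
For the "No secrets in URLs" checklist item, a reviewer without a dedicated scanner could approximate the check with a sketch like the one below. It is not the project's scanner: `find_url_secrets` and the `SECRET_PARAMS` deny-list are hypothetical, and the findings deliberately omit the secret value itself.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Loose URL matcher: scheme://... up to whitespace, quotes, or angle brackets.
URL_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://[^\s'\"<>]+")
# Hypothetical deny-list of query parameter names; extend to match the project.
SECRET_PARAMS = frozenset(
    {"api_key", "apikey", "token", "access_token", "password", "secret"}
)


def find_url_secrets(text: str) -> list:
    """Return findings for URLs that carry credentials in userinfo or the query string."""
    findings = []
    for match in URL_RE.finditer(text):
        try:
            parts = urlsplit(match.group(0))
        except ValueError:
            continue  # malformed URL (e.g. bad IPv6 literal); not scanned here
        if parts.password:
            findings.append(f"userinfo password in {parts.scheme}://{parts.hostname}")
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            if name.lower() in SECRET_PARAMS:
                findings.append(
                    f"query parameter '{name}' in {parts.scheme}://{parts.hostname}"
                )
    return findings
```

Run over each file in the bundle, any non-empty result would count as a failed checklist item until the credential is moved to an environment variable or header.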
Readiness scoring rubric (0–100)
- Safety 30, Offline Policy 15, JSON Strictness 10, Concurrency/Perf 10, Artifacts/Repro 10, Graph/DB Degradation 10, UX Health 10, Docs/DX 5.
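
The rubric weights sum to 100, so a score can be computed as a weighted sum. The sketch below assumes each area is graded as a fraction in [0, 1] of its maximum; that grading scheme and the `readiness_score` helper are assumptions, since the rubric only fixes the maximum points per area.

```python
# Maximum points per area, copied from the rubric line above.
RUBRIC = {
    "Safety": 30,
    "Offline Policy": 15,
    "JSON Strictness": 10,
    "Concurrency/Perf": 10,
    "Artifacts/Repro": 10,
    "Graph/DB Degradation": 10,
    "UX Health": 10,
    "Docs/DX": 5,
}
assert sum(RUBRIC.values()) == 100


def readiness_score(fractions: dict) -> float:
    """Combine per-area fractions in [0, 1] into a 0-100 score; unscored areas count as 0."""
    unknown = set(fractions) - set(RUBRIC)
    if unknown:
        raise KeyError(f"unknown rubric areas: {sorted(unknown)}")
    total = 0.0
    for area, weight in RUBRIC.items():
        fraction = fractions.get(area, 0.0)  # fail-closed: missing evidence earns nothing
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{area}: fraction {fraction} outside [0, 1]")
        total += weight * fraction
    return total
```

Under this assumption, full marks on Safety and Offline Policy with nothing else scored gives 45.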
Constraints & non‑negotiables
- Keep changes minimal and surgical; no new runtime deps unless a P0 requires it.
- Preserve existing CLIs (build_cli() entry points, Typer apps) unless a P0 forces change; document migrations.
- If evidence is missing, mark 🟡 and state the proof required.
Crucial code to include INLINE in the report (full functions where noted)
- Steps package loader: src/extractor/pipeline/steps/__init__.py (lazy alias loader __getattr__, FULL)
- Stage 01: 01_annotation_processor.py
  - build_cli() (FULL)
  - extract_annotations_data(...) core loop (sufficient excerpt)
  - SYSTEM_PROMPT and relevant feature extractors (_gridline_features, _detect_numbering) (FULL)
- Stage 02: 02_marker_extractor.py (CLI wiring + extraction harness excerpt)
- Stage 04: 04_section_builder.py
  - analyze_section_numbering, derive_section_depth, build_sections_from_blocks (FULL)
- Stage 05: 05_table_extractor.py
  - generate_pandas_metrics, sanitize_cell, strategy map and I/O wiring (FULL)
- Stage 06: 06_figure_extractor.py
  - describe_image_with_llm, extract_and_describe_figure (FULL)
- Stage 07: 07_reflow_section.py
  - PROMPT_STRICT_REQUIREMENTS, build_reflow_prompt, build_section_context_text, LLM call wrapper (FULL)
- Stage 07½: 07_requirements_miner.py
  - _mine_from_paragraph, _mine_from_table, _summarize (FULL)
- Stage 08: 08_lean4_theorem_prover.py
  - identify_requirements_in_section and the Lean CLI integration block (excerpt OK)
- Stage 09: 09_section_summarizer.py
  - summarize_section (FULL)
- Stage 10: 10_arangodb_exporter.py
  - _derive_doc_set_and_revision, _fast_embedding, _collect_section_contexts, flattening emission (excerpt OK)
- Stage 11: 11_arango_create_graph.py
  - _validate_edges, sample relationship builders (FULL)
- Stage 12: 12_insert_annotations.py
  - ensure_graph and insertion loop (excerpt OK)
- Utilities (selected):
  - utils/litellm_call.py: litellm_call(...) (FULL) and shutdown helper
  - utils/json_utils.py: clean_json_string(...) (FULL)
  - utils/diagnostics.py: resource sampler start/stop + event helpers (excerpt OK)
  - utils/unified_conversion.py: entry build_unified_document_from_reflow (excerpt OK)
UX acceptance anchors to verify (must attach artifacts)
- From repo root with live servers:

      npm run ux:check   # launches local Chrome
      BROWSERLESS_WS=ws://127.0.0.1:9222/devtools/browser \
      npm run ux:check:cdp   # CDP attach
- Required evidence in the report:
  - Screenshot path: scripts/artifacts/<route>_*.png
  - Log path: scripts/artifacts/ux_check_*.log
  - Summary line items: toolbarClear=true, pointerDrawOk=true, selector checks
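
The summary line items can be checked mechanically before they are quoted in the report. The sketch below assumes the log contains plain `name=true` / `name=false` tokens, which is a guess at the format of `ux_check_*.log`; `ux_summary_ok` is a hypothetical helper. A flag that is missing from the log is treated as failed.

```python
import re

# Flags named in the required evidence list; selector checks are not covered here.
REQUIRED_FLAGS = ("toolbarClear", "pointerDrawOk")


def ux_summary_ok(log_text: str) -> dict:
    """Map each required flag to True only if it appears and every occurrence is 'true'."""
    results = {}
    for flag in REQUIRED_FLAGS:
        values = re.findall(rf"\b{re.escape(flag)}=(true|false)\b", log_text)
        # Fail-closed: no occurrence, or any 'false', means the flag does not pass.
        results[flag] = bool(values) and all(v == "true" for v in values)
    return results
```

A report would then attach the log path and mark the UX health item 🔴 if any value in the returned mapping is false.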
Submission & TTL
- Deliver a SINGLE secret GitHub Gist containing the report (analysis, code excerpts, diffs, test plan, research log).
- Keep it accessible for ≥15 minutes after sending. We will mirror and delete per policy.
- Provide unified diffs ONLY (we apply them); include exact artifact paths you generated.
## Attachments (local paths)
- scripts/artifacts/extractor_pipeline_bundle.txt
- scripts/artifacts/tabbed_bundle.txt
- scripts/artifacts/extractor_review_bundle.txt