
@ranguard
Created March 19, 2026 17:30
Leo Review: Epic #77 CSV Import (requirements stage) — testing updated compact output format

Leo Review: Epic #77 — CSV Import Endpoint

Changes requested · requirements · 2026-03-19

Well-structured epic with clear scope boundaries, validated user need (6/6 interviews), explicit IN/OUT lists, and sensible phasing. Five blockers, covering data-model readiness, a risky dependency choice, and an unresolved PII policy question, need resolution before child tasks begin.

Blockers

  • [Data Model Impact] Epic depends on #74 (Work, Edition, UserCopy) which isn't implemented — current codebase only has placeholder collections/items tables. Any import code written now will need destructive rework when #74 lands. Fix: Hard-gate this epic: no child task starts until #74 is merged and schema changes #15–#19 plus WANT_TO_READ status are present in @bib/db. ⚠️ Needs Leo's input

  • [Data Model Impact] Five schema changes (#15–#19) are listed as dependencies but have no consolidated migration plan. Applying them piecemeal risks partial states (e.g. import_metadata exists but SentimentTag.source doesn't). Fix: Define a single atomic Drizzle migration covering all five changes plus WANT_TO_READ. For #16 (star_rating integer→decimal), use a two-step approach: add a star_rating_decimal column, backfill it, then drop the original column in a follow-up migration. Verify SQLite REAL affinity works with Drizzle's type mapping. ⚠️ Needs Leo's input

  • [Third-Party Dependencies] fuzzyset.js is proposed for fuzzy matching but is effectively unmaintained since 2020, has a bus factor of 1, no TypeScript types, and sits in the critical matching path (determines which books land in a user's library). Fix: Evaluate fuse.js (actively maintained, typed) or implement Jaro-Winkler/trigram similarity directly — the thresholds are already defined, so a small pure-TypeScript implementation is realistic. Either way, wrap behind a BookMatcher interface in @bib/core so the algorithm is swappable. ⚠️ Needs Leo's input
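
A minimal sketch of what the swappable BookMatcher could look like in @bib/core, using a dependency-free character-trigram similarity. The MatchCandidate/MatchResult shapes, the 0.7/0.3 title/author weighting, and the default threshold are illustrative assumptions, not values from the epic:

```typescript
interface MatchCandidate {
  id: string;
  title: string;
  author: string;
}

interface MatchResult {
  candidate: MatchCandidate;
  score: number; // 0..1
}

// The swappable port: trigram today, fuse.js or pg_trgm tomorrow.
interface BookMatcher {
  match(title: string, author: string, candidates: MatchCandidate[]): MatchResult | null;
}

// Character trigrams over a normalised string (lowercased, whitespace collapsed,
// padded so word boundaries contribute grams).
function trigrams(s: string): Set<string> {
  const norm = ` ${s.toLowerCase().replace(/\s+/g, " ").trim()} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= norm.length; i++) grams.add(norm.slice(i, i + 3));
  return grams;
}

// Jaccard index over trigram sets: cheap and tolerant of small typos.
function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  return shared / (ta.size + tb.size - shared);
}

class TrigramBookMatcher implements BookMatcher {
  constructor(private threshold = 0.8) {}

  match(title: string, author: string, candidates: MatchCandidate[]): MatchResult | null {
    let best: MatchResult | null = null;
    for (const c of candidates) {
      // Title weighted higher than author; weights are an assumption.
      const score = 0.7 * trigramSimilarity(title, c.title)
                  + 0.3 * trigramSimilarity(author, c.author);
      if (score >= this.threshold && (!best || score > best.score)) {
        best = { candidate: c, score };
      }
    }
    return best;
  }
}
```

Because callers only see the BookMatcher interface, replacing the algorithm later requires no changes in the import pipeline.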

  • [Security] Review text ingested from CSV is user-authored narrative that may contain sensitive content. No policy on whether it's classified as PII, stored encrypted, or excluded from AI enrichment pipelines. Fix: Classify review_text as PII. Confirm it is excluded from any Claude API enrichment calls (or require explicit user consent). Document retention and access policy before implementation. ⚠️ Needs Leo's input

  • [Security, Data Model, Cost Model] import_metadata jsonb stores unmapped CSV fields with no sanitisation, schema, size boundary, or expiry. An attacker can embed arbitrarily large payloads; at scale, unbounded jsonb across 450M rows creates significant storage cost with no feature value after initial reconciliation. Fix: Define a Zod schema in @bib/shared per source format. Enforce max key count and per-value byte limit. Add a 90-day TTL (purge after successful re-import or expiry). Strip to primitives only — no nested objects.
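
A dependency-free sketch of the guard rails described in that fix (in practice this would sit behind a Zod schema per source format in @bib/shared). The specific limits — 32 keys, 1 KiB per value — are illustrative assumptions:

```typescript
type Primitive = string | number | boolean | null;

const MAX_KEYS = 32;           // assumed cap, per the "max key count" fix
const MAX_VALUE_BYTES = 1024;  // assumed per-value byte limit

function sanitiseImportMetadata(raw: Record<string, unknown>): Record<string, Primitive> {
  const out: Record<string, Primitive> = {};
  const enc = new TextEncoder();
  let kept = 0;
  for (const [key, value] of Object.entries(raw)) {
    if (kept >= MAX_KEYS) break; // enforce max key count
    // Primitives only: nested objects and arrays are dropped outright.
    const isPrimitive = value === null
      || ["string", "number", "boolean"].includes(typeof value);
    if (!isPrimitive) continue;
    // Oversized string payloads are rejected rather than truncated.
    if (typeof value === "string" && enc.encode(value).length > MAX_VALUE_BYTES) continue;
    out[key] = value as Primitive;
    kept++;
  }
  return out;
}
```

Rejecting rather than truncating oversized values keeps the stored jsonb unambiguous: a field is either present intact or absent.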

Concerns

  • [Architectural Fit, Separation of Concerns, Portability] Background enrichment queue is referenced throughout but no queue infrastructure exists in the current architecture. SQS is named directly, coupling requirements to AWS. Fix: Define an EnrichmentQueue interface in @bib/core (enqueue/ack/nack). Name it "enrichment queue" in requirements, with SQS as current implementation choice. Resolve the queue pattern as part of #75/#98 dependency before Task 7. ⚠️ Needs Leo's input
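
A sketch of that EnrichmentQueue port. The method names follow the enqueue/ack/nack shape named above; the message type and the in-memory adapter are assumptions for illustration — SQS would be one production adapter behind the same interface:

```typescript
interface EnrichmentMessage {
  id: string;
  userCopyId: string;
}

interface EnrichmentQueue {
  enqueue(msg: EnrichmentMessage): Promise<void>;
  receive(): Promise<EnrichmentMessage | null>;
  ack(id: string): Promise<void>;  // processed successfully, remove
  nack(id: string): Promise<void>; // processing failed, redeliver
}

// In-memory adapter: useful for tests and local development.
class InMemoryEnrichmentQueue implements EnrichmentQueue {
  private pending: EnrichmentMessage[] = [];
  private inFlight = new Map<string, EnrichmentMessage>();

  async enqueue(msg: EnrichmentMessage): Promise<void> {
    this.pending.push(msg);
  }

  async receive(): Promise<EnrichmentMessage | null> {
    const msg = this.pending.shift() ?? null;
    if (msg) this.inFlight.set(msg.id, msg);
    return msg;
  }

  async ack(id: string): Promise<void> {
    this.inFlight.delete(id);
  }

  async nack(id: string): Promise<void> {
    const msg = this.inFlight.get(id);
    if (msg) {
      this.inFlight.delete(id);
      this.pending.unshift(msg); // make visible again for redelivery
    }
  }
}
```

Code in @bib/core then depends only on the interface, which is exactly what keeps the AWS coupling out of the requirements.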

  • [Data Model Impact, Architectural Fit] SentimentTag is in the Intelligence layer (shared), but user_imported source value blurs the boundary — one user's StoryGraph moods could pollute shared sentiment data if isolation isn't enforced. Fix: Either add a non-null owner_id FK to SentimentTag for user_imported rows and exclude from shared queries, or move imported moods to UserCopyTag with tag_type = 'imported_mood' (cleaner layer separation). ⚠️ Needs Leo's input

  • [Cost Model] Free tier stores ALL imported books (e.g. 2,000) with zero revenue. At 1.5M users × 300 books, import_metadata alone is ~1.35TB. Neon bills on storage. Fix: Consider capping stored rows for free tier, or stripping import_metadata after a grace period. At minimum, benchmark storage cost per free user against conversion rate assumptions. ⚠️ Needs Leo's input
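
A back-of-envelope check of the ~1.35TB figure. It holds if the average import_metadata payload is about 3KB per row, which is an assumption on top of the user and per-user book counts stated in the concern:

```typescript
const users = 1_500_000;       // 1.5M users, from the concern above
const booksPerUser = 300;      // average imported books per user
const avgJsonbBytes = 3_000;   // assumed average import_metadata size

const rows = users * booksPerUser;       // total UserCopy rows
const totalBytes = rows * avgJsonbBytes;
const totalTb = totalBytes / 1e12;       // decimal terabytes
console.log({ rows, totalTb });
```

If real payloads average closer to 1KB the cost drops to ~450GB, so measuring the actual per-row jsonb size during the spike would sharpen this estimate considerably.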

  • [Scope & YAGNI] BookBuddy parser is included because of one user's test dataset, not validated segment demand. Zero interviewees use BookBuddy. Fix: Defer BookBuddy to a follow-on epic. Use Kerrie's data converted to Biblii template format for testing. Revisit if users request it.

  • [Scope & YAGNI] Re-import with per-field change detection (Task 8) is a full sub-feature built speculatively. No evidence users will iterate on CSV imports after initial migration. Fix: Descope to dedup-only for v1 (skip existing books, append new ones). Add per-field diffing when users request it.

  • [Separation of Concerns, Architectural Fit] No package mapping for 6–8 child tasks. Without explicit assignment, individual tasks may independently decide where to place code, fragmenting logic across packages. Fix: Add a "Package mapping" section: parsers and matching in @bib/core, Zod schemas in @bib/shared, HTTP endpoints in @bib/api, repository implementations in @bib/db.

  • [Third-Party Dependencies] No CSV parser library is specified. Node has no built-in CSV parser, and naive splitting breaks on quoted fields with embedded commas (common in book data). Fix: Decide on csv-parse or papaparse before implementation. Wrap behind a CsvReader interface. Add this decision to the spike.
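
To make the failure mode concrete, here is a minimal RFC 4180-style line parser next to naive splitting. This is a sketch to motivate choosing csv-parse or papaparse, not a replacement for them (no multi-line quoted fields, no CRLF handling, no streaming):

```typescript
const line = `"Harry Potter and the Philosopher's Stone","Rowling, J.K.",5`;

// Naive split: the comma inside "Rowling, J.K." produces four fields
// instead of three, silently corrupting the author column.
const naive = line.split(",");

function parseCsvLine(input: string): string[] {
  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (inQuotes) {
      if (ch === '"') {
        if (input[i + 1] === '"') { field += '"'; i++; } // escaped quote ""
        else inQuotes = false;                           // closing quote
      } else field += ch;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(field);
      field = "";
    } else field += ch;
  }
  fields.push(field);
  return fields;
}
```

A proper library also handles quoted fields spanning lines, BOMs, and backpressure on large files, which is why the fix recommends picking one in the spike rather than growing a hand-rolled parser.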

  • [Security] 10MB file limit is transport-level only. No validation of row count, column count, or per-field length — a file under 10MB can contain pathological input causing memory exhaustion. Fix: Enforce max row count (e.g. 10,000), max columns per row, and max per-field byte length in the parse layer.
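
A sketch of those parse-layer checks, applied per row as the parser emits it. The 10,000-row cap comes from the fix above; the column and per-field byte caps are illustrative assumptions:

```typescript
const MAX_ROWS = 10_000;        // from the fix above
const MAX_COLUMNS = 64;         // assumed cap
const MAX_FIELD_BYTES = 10_240; // assumed 10 KiB per field

class CsvLimitError extends Error {}

// Called for each parsed row before it reaches matching or storage.
function checkRow(rowIndex: number, fields: string[]): void {
  if (rowIndex >= MAX_ROWS) {
    throw new CsvLimitError(`row limit exceeded (${MAX_ROWS})`);
  }
  if (fields.length > MAX_COLUMNS) {
    throw new CsvLimitError(`column limit exceeded at row ${rowIndex}`);
  }
  const enc = new TextEncoder();
  for (const f of fields) {
    if (enc.encode(f).length > MAX_FIELD_BYTES) {
      throw new CsvLimitError(`field too large at row ${rowIndex}`);
    }
  }
}
```

Enforcing these inside the streaming parse loop means a pathological file fails fast instead of being fully buffered before rejection.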

  • [Scalability] 30-second Lambda timeout for 2,000-book preview is tight with no headroom for cold starts or slow Neon connections. No recovery path for partial failures during confirm phase. Fix: Benchmark at 500 and 2,000 rows. Add an import_jobs tracking row so confirm can be retried safely, skipping already-committed batches.
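
A sketch of how the import_jobs row makes the confirm phase retry-safe: each batch commit is recorded against the job, so a re-run after a timeout skips batches that already landed. The ImportJobStore shape is an assumption — in the epic this would back onto an import_jobs table in @bib/db:

```typescript
interface ImportJobStore {
  committedBatches(jobId: string): Promise<Set<number>>;
  markCommitted(jobId: string, batch: number): Promise<void>;
}

async function confirmImport(
  jobId: string,
  batches: string[][], // pre-chunked rows to insert
  store: ImportJobStore,
  commitBatch: (rows: string[]) => Promise<void>,
): Promise<number> {
  const done = await store.committedBatches(jobId);
  let committed = 0;
  for (let i = 0; i < batches.length; i++) {
    if (done.has(i)) continue;           // landed in a previous attempt
    await commitBatch(batches[i]);
    await store.markCommitted(jobId, i); // record before moving on
    committed++;
  }
  return committed;
}
```

Recording the batch index immediately after each commit keeps the worst case after a crash to one re-applied batch, which the dedup pass then absorbs.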

  • [Data Model, Scalability] No composite indices specified for 450M-row UserCopy table. Free-tier 50-book query and import dedup both need (owner_id, reading_status). Fix: Define CREATE INDEX on user_copies(owner_id, reading_status, collection_status) in the migration. Add EXPLAIN ANALYZE baseline tests.

  • [Data Model] WANT_TO_READ vs TBR distinction is mentioned but not defined. Import status mapping depends on this being resolved. Fix: Define the full reading_status enum before any parser maps CSV values to it.

Suggestions

  • Fuzzy matching accuracy needs a golden-file test fixture (30–50 real title/author pairs with expected outcomes) to catch regressions → add to Task 5 acceptance criteria (Testability)
  • ISBN checksum validation before catalogue lookup prevents lookup amplification → add ISBN-10/13 validator to @bib/shared (Security)
  • Half-star rating may only be needed for StoryGraph — scope to format-specific mapping rather than platform-wide change (Scope & YAGNI)
  • Encoding fallback can use built-in TextDecoder with 'iso-8859-1' label — note in spike to prevent unnecessary iconv-lite dependency (Third-Party Dependencies)
  • Lambda GB-second cost per import should be benchmarked at target scale to validate margin (Cost Model)
  • pg_trgm migration path is Postgres-specific — note that in-memory matching remains the portable baseline (Portability)
  • Extract format detection as a standalone pure function separate from parser classes for independent testing (Testability)
  • Centralise free-tier enforcement in a single query helper rather than scattering limits across endpoints (Scalability)
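
The ISBN suggestion above is small enough to sketch directly. Checksum validators for both forms, as pure functions suitable for @bib/shared (hyphens and spaces are stripped before validation; no external dependencies):

```typescript
// ISBN-10: weighted sum with weights 10..1 must be divisible by 11;
// 'X' is allowed only as the final check digit (value 10).
function isValidIsbn10(raw: string): boolean {
  const s = raw.replace(/[-\s]/g, "").toUpperCase();
  if (!/^\d{9}[\dX]$/.test(s)) return false;
  let sum = 0;
  for (let i = 0; i < 10; i++) {
    const digit = s[i] === "X" ? 10 : Number(s[i]);
    sum += digit * (10 - i);
  }
  return sum % 11 === 0;
}

// ISBN-13: alternating 1/3 weights, sum must be divisible by 10.
function isValidIsbn13(raw: string): boolean {
  const s = raw.replace(/[-\s]/g, "");
  if (!/^\d{13}$/.test(s)) return false;
  let sum = 0;
  for (let i = 0; i < 13; i++) {
    sum += Number(s[i]) * (i % 2 === 0 ? 1 : 3);
  }
  return sum % 10 === 0;
}
```

Running these before any catalogue lookup rejects mistyped or fabricated ISBNs locally, which is what prevents the lookup amplification the suggestion describes.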

Passed

No lens returned a clean pass — all 10 raised at least one suggestion. This is expected for a feature of this scope at requirements stage.
