bonau/apple-podcast-ripper-mvp-plan-v3-extended.md

Created May 20, 2026 08:23

apple-podcast-ripper-mvp-plan-v3-extended.md

System Analysis and Architecture Design Document (v3 Implementation Evolution)

Date: 2026-05-20
Depth: Complete coverage of L1 to L5
Background: Following the deployment and hardening of the local Kubernetes (k3s) cluster, this document records the special technical considerations, the security defenses, and the portability provisions that keep a future migration to the GCP Vertex AI GPU platform open. It serves as an architectural guide for the transition from MVP to production grade.

1. Executive Summary

This document is based on a dual-track deployment that uses local Kubernetes and, optionally, GCP Serverless. The core change is that a self-hosted, security-hardened K8s cluster replaces the original design, which was heavily coupled to third-party SaaS. The implementation specifically addresses WSL2 rootless container network limitations, the K8s NetworkPolicy synchronization race, IP-level SSRF protection, and protection against memory exhaustion when Redis is unavailable. The Worker is fully decoupled, so it can be ported to GPU platforms such as GCP Vertex AI for high-performance transcription without modifying core business code.

2. Special Technical Considerations

During the practical delivery of the local K8s Stack (k3s on Podman WSL2), several highly challenging platform constraints and network boundary issues were resolved:

2.1 WSL2 Rootless Podman Network Boundary Restrictions

Problem: In the WSL2 rootless Podman environment, port forwarding on the host's 127.0.0.1 is not visible to the containerd running inside the Kubernetes cluster, since the two run in separate processes. This prevented loopback connectivity during local private Registry and Ingress testing.
Implementation Consideration:
- The registry registration host was changed to registry.localhost:5000, and internal cluster resolution is handled through containerd registries.yaml mirror mappings.
- API smoke tests, Ingress verification, and local CLI tools dynamically obtain the internal container IP of the k3s-server (10.89.x.x), and perform HTTP calls directly using the Host: header, bypassing the WSL2 loopback limitation.
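The Host-header approach above can be sketched with the standard library alone. The IP address, host name, and path below are hypothetical placeholders, and `build_ingress_request` is an illustrative helper, not a function from the project.

```python
import urllib.request


def build_ingress_request(node_ip: str, host: str, path: str = "/") -> urllib.request.Request:
    """Target the node IP directly while presenting the Ingress host name."""
    url = f"http://{node_ip}{path}"
    # The Ingress controller routes on the Host header; the TCP connection
    # itself goes to the container IP instead of 127.0.0.1.
    return urllib.request.Request(url, headers={"Host": host})


# Hypothetical values: a k3s-server container IP and an Ingress host name.
req = build_ingress_request("10.89.0.5", "api.localhost", "/healthz")
# urllib.request.urlopen(req, timeout=5) would perform the actual call.
```

In practice the IP would be discovered at run time (for example by inspecting the k3s-server container) rather than hard-coded.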

2.2 kube-router NetworkPolicy Concurrency Sync Race

Problem: When a new Pod starts up and immediately initiates external or cross-service network connections, there is a synchronization latency of several hundred milliseconds before the kube-router NetworkPolicy controller writes the rules to iptables. This caused initContainers (such as Alembic migrations) or database initialization Jobs to encounter Connection Refused immediately upon startup.
Implementation Consideration:
- Inside the Alembic initContainer and the postgres initialization scripts, a pg_isready retry loop (up to 30 retries, with a 2-second interval) was explicitly encapsulated to provide a soft tolerance cushion during the kube-router network policy synchronization window.
- The same bounded retry logic was implemented in the MinIO bucket pre-provisioning Job.
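A minimal sketch of such a bounded retry loop, assuming the 30-attempt, 2-second settings quoted above. The helper name `wait_until_ready` and the injectable `sleep` argument are illustrative choices, not the project's actual code.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    retries: int = 30,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it succeeds; return the number of attempts used.

    Raises TimeoutError if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            # Fixed pause that rides out the NetworkPolicy sync window.
            sleep(interval)
    raise TimeoutError(f"dependency not ready after {retries} attempts")
```

For Postgres, `check` could wrap a `pg_isready` invocation and return whether its exit code was 0; the same helper would fit the MinIO Job with a different probe.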

2.3 Strict Security Hardening (Pod Security Standards Restricted)

Problem: To comply with the cluster's Restricted Pod Security Standard, all workloads must run as non-root and with read-only root filesystems.
Implementation Consideration:
- runAsNonRoot: true is set, with the API using UID 1000 and the Frontend using UID 101 (nginx).
- readOnlyRootFilesystem: true is enabled, mounting necessary runtime temporary writes (such as /tmp, /var/cache/nginx) as emptyDir memory disks.
- Since faster-whisper requires downloading approximately 2Gi of model weights, sharing the model cache with the tmpfs memory disk would easily trigger OOM. Therefore, the Worker's /home/app/.cache/huggingface is specifically mounted as a disk-backed emptyDir, and limits.ephemeral-storage is set to 3Gi to prevent the Pod from being evicted by the Kubelet Eviction Manager.
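The settings above map onto a handful of manifest fields, shown here as Python dictionaries that could be dumped to JSON or YAML. This is a sketch only: the volume names are hypothetical, and `allowPrivilegeEscalation`, `capabilities`, and `seccompProfile` are not mentioned in the text but are included on the assumption that the Restricted profile is enforced, since it also requires them.

```python
import json

# Container-level security context for the API (UID 1000 per the text;
# the Frontend would use 101).
API_SECURITY_CONTEXT = {
    "runAsNonRoot": True,
    "runAsUser": 1000,
    "readOnlyRootFilesystem": True,
    # Assumed additions required by the Restricted profile:
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}

WORKER_VOLUMES = [
    # Small scratch space kept in memory (tmpfs).
    {"name": "tmp", "emptyDir": {"medium": "Memory"}},
    # Model cache on node disk so the weights do not count against RAM.
    {"name": "hf-cache", "emptyDir": {}},
]

WORKER_VOLUME_MOUNTS = [
    {"name": "tmp", "mountPath": "/tmp"},
    {"name": "hf-cache", "mountPath": "/home/app/.cache/huggingface"},
]

# Disk-backed emptyDir usage counts against this limit.
WORKER_RESOURCES = {"limits": {"ephemeral-storage": "3Gi"}}

print(json.dumps(API_SECURITY_CONTEXT, indent=2))
```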

2.4 Deep SSRF Defense (DNS Security Guard)

Problem: Relying solely on regular expressions to validate Apple Podcast URLs cannot prevent malicious 302 redirects (HTTP Redirect) and internal port scanning attacks.
Implementation Consideration:
- Implemented DNS query interception in net_guard.py. Before the Worker downloads audio files and before the API queries the iTunes API, the target URL's domain name is resolved. If the target IP falls within private networks (RFC1918), CGNAT, loopback, or reserved addresses, it is immediately blocked.
- The downloader explicitly configures follow_redirects=False. For every redirect (Location header), it extracts the new URL and re-runs the DNS SSRF check instead of relying on httpx's automatic redirect following.
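The two measures above can be sketched with the standard library. This is not the content of net_guard.py: the names `is_blocked_ip`, `assert_public_url`, and `fetch_with_guard` are hypothetical, and the HTTP call is abstracted as an injected `fetch` function (assumed to return a status code and an optional Location value without following redirects) so the sketch stays self-contained.

```python
import ipaddress
import socket
from typing import Callable, Iterable, Optional, Tuple
from urllib.parse import urljoin, urlsplit

CGNAT = ipaddress.ip_network("100.64.0.0/10")
Resolver = Callable[[str], Iterable[str]]
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def is_blocked_ip(raw: str) -> bool:
    """True if the address is private, CGNAT, loopback, or otherwise non-public."""
    ip = ipaddress.ip_address(raw)
    # Unwrap IPv4-mapped IPv6 (::ffff:10.0.0.1) so it cannot bypass the checks.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
        or ip in CGNAT
    )


def system_resolver(host: str) -> Iterable[str]:
    """Resolve a host name to all of its addresses via the OS resolver."""
    infos = socket.getaddrinfo(host, None)
    # Drop any IPv6 zone suffix ("fe80::1%eth0") before parsing.
    return {info[4][0].split("%")[0] for info in infos}


def assert_public_url(url: str, resolve: Resolver = system_resolver) -> None:
    """Raise ValueError unless every address the host resolves to is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    addresses = list(resolve(parts.hostname))
    if not addresses or any(is_blocked_ip(a) for a in addresses):
        raise ValueError(f"blocked destination: {parts.hostname!r}")


def fetch_with_guard(url: str, fetch: Fetcher,
                     resolve: Resolver = system_resolver,
                     max_redirects: int = 5) -> str:
    """Follow redirects by hand, re-checking every hop; return the final URL."""
    for _ in range(max_redirects + 1):
        assert_public_url(url, resolve)
        status, location = fetch(url)
        if status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        return url
    raise ValueError("too many redirects")
```

A check-then-connect design like this still leaves a small DNS-rebinding window between the lookup and the actual connection; pinning the connection to the checked address would close it, at the cost of extra plumbing.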

2.5 Rate Limiting and Memory Prevention Design (Rate Limit with LocalBackstop)

Problem: If Redis fails or is subjected to a DDoS attack, the rate-limiting components on the API side could consume excessive memory, leading to an OOM crash of the API Pod itself.
Implementation Consideration:
- Adopted Redis Lua scripts to implement atomic sliding-window rate limiting.
- Introduced LocalBackstop, a double-ended queue with a maximum memory limit (maxlen), as a fallback in-memory cache when the Redis connection is lost. This ensures that the Pod will not suffer an OOM crash due to rate-limiting log or CSP report accumulation when Redis is offline.
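A minimal sketch of a memory-bounded fallback limiter in the spirit of LocalBackstop, assuming a sliding window over a `deque` with `maxlen`. The class body, constructor parameters, and `allow` method are illustrative guesses; the Redis Lua path is not shown.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class LocalBackstop:
    """In-process sliding-window limiter with a hard cap on stored events."""

    def __init__(self, limit: int, window_s: float, maxlen: int = 10_000) -> None:
        self.limit = limit
        self.window_s = window_s
        # deque(maxlen=...) silently drops the oldest entry when full,
        # so memory use stays bounded no matter how many requests arrive.
        self.events: Deque[Tuple[float, str]] = deque(maxlen=maxlen)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s
        # Expire events that have slid out of the window.
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        hits = sum(1 for _, k in self.events if k == key)
        if hits >= self.limit:
            return False
        self.events.append((now, key))
        return True
```

The trade-off of the hard cap is that, under a flood, the oldest events are discarded early and some clients are under-counted; the limiter degrades toward permissive rather than letting the Pod run out of memory.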

3. CPU Transcription and Flexible Design for Future Migration to GCP Vertex AI

Although CPU transcription is adopted in the MVP phase to simplify the local architecture and ensure zero fixed costs, the architecture achieves complete decoupling, preserving excellent flexibility for migration to a GPU platform:

3.1 Worker Architecture with Fully Separated Responsibilities

The existing podcast-worker codebase features highly granular separation of duties:

Task Consumption: main.py performs a blocking listen on the Redis queue to retrieve tasks.
Business Flow: job.py controls the entire Happy Path of downloading, transcoding, transcribing, uploading, and updating the database.
Transcription Engine: transcribe.py encapsulates model loading and transcription for faster-whisper.
Storage Layer: object_storage.py serves as a storage abstraction Facade, simultaneously supporting the MinIO S3 API and GCP native Cloud Storage (GCS).
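The storage facade idea can be illustrated as follows. This is a sketch, not the actual object_storage.py: the `ObjectStorage` protocol, the `make_storage` factory, and the `"minio"` default value are assumptions, and an in-memory stand-in replaces the real S3 and Cloud Storage clients so the example runs without cloud SDKs.

```python
import os
from typing import Dict, Mapping, Protocol


class ObjectStorage(Protocol):
    """The only interface the business flow needs to know about."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStorage:
    """Stand-in backend used here in place of MinIO or GCS clients."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def make_storage(env: Mapping[str, str] = os.environ) -> ObjectStorage:
    """Pick a backend from STORAGE_BACKEND; callers only see ObjectStorage."""
    backend = env.get("STORAGE_BACKEND", "minio")
    if backend not in ("minio", "gcs"):
        raise ValueError(f"unknown STORAGE_BACKEND: {backend!r}")
    # A real implementation would return an S3-compatible client wrapper
    # for "minio" and a Cloud Storage client wrapper for "gcs".
    return InMemoryStorage(label=backend)
```

Because the job flow depends only on `put` and `get`, switching backends is a configuration change rather than a code change.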

3.2 Technical Path for Migration to GCP Vertex AI

When traffic increases and transcription acceleration (e.g., using GPU) is required, the following path can be adopted without changing the core transcription business logic (transcribe.py):

                     [ Task Submission API ]
                               ↓ (Triggered by GCP Pub/Sub or Cloud SQL)
             ┌─────────────────┴─────────────────┐
             ▼                                   ▼
    [ GCP Cloud Function ]                [ Vertex AI Custom Job ] (GPU)
    - Lightweight logic, audio download   - Launches dedicated GPU container
    - Fast response & preprocessing       - Executes Python transcribe.py
             └─────────────────┬─────────────────┘
                               ▼
                   [ Google Cloud Storage ] (GCS)

Seamless Container Migration: Since podcast-worker itself is a fully packaged Docker Image, we can push this image to GCP Artifact Registry to serve as the runtime environment for a Vertex AI Custom Container.
Vertex AI Custom Job / Pipelines Triggering:
- Upon task submission, the API can directly invoke the GCP SDK to launch a Vertex AI Custom Job (specifying a GPU instance like NVIDIA T4/L4).
- Once the Job starts, it executes the same podcast_worker.job.process_one as the local stack, but by setting environment variables, faster-whisper can be configured to use device="cuda" and compute_type="float16". This reduces the transcription time of a long program from 20 minutes on CPU to under 2 minutes on GPU.
Consistent Cloud Storage & Database: Since the codebase natively supports STORAGE_BACKEND=gcs and CLOUDSQL_CONNECTION_NAME, the Vertex AI GPU job can write directly back to Cloud SQL (PostgreSQL) and GCS upon completion. This remains completely transparent to the frontend and the API.
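The environment-driven device switch described above might look like the sketch below. The variable names `WHISPER_DEVICE` and `WHISPER_COMPUTE_TYPE` and the `int8` CPU default are assumptions made for illustration; only `device="cuda"` and `compute_type="float16"` come from the text.

```python
import os
from typing import Dict, Mapping


def whisper_runtime_options(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return keyword arguments for the model loader based on the environment."""
    device = env.get("WHISPER_DEVICE", "cpu")
    # float16 is the GPU setting named in the text; int8 is an assumed
    # CPU-friendly default.
    default_compute = "float16" if device == "cuda" else "int8"
    return {
        "device": device,
        "compute_type": env.get("WHISPER_COMPUTE_TYPE", default_compute),
    }
```

The returned dictionary could be passed straight to the model constructor in transcribe.py, for example `WhisperModel(model_name, **whisper_runtime_options())`, so the same image runs unchanged on CPU nodes and on a Vertex AI GPU job.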

4. Single-Cloud Decision and Self-Hosted Strategy

The project has decided to adhere strictly to a single-cloud model (K8s and GCP ecosystem) in the MVP and Phase 2 stages, avoiding a hybrid multi-cloud SaaS model (AWS + Supabase + Vercel + Modal). The technical reasons and architectural decisions are as follows:

Credential and Key Management Overhead: A multi-cloud approach introduces AWS IAM, Supabase Auth/API keys, Modal tokens, and Vercel integrations. In the absence of a unified, automated CLI (e.g., dedicated CLI software to automatically create and synchronize all accounts and secrets with a single command), the cognitive load of configuring Secrets locally and in the cloud is extremely high, and significantly expands the attack surface for credential leaks.
Cross-Cloud Latency and Stability: If the API is on Vercel (Lambda), the DB is on Supabase (Postgres), the Worker is on Modal (GPU), and transcripts are stored in AWS S3, a single transcription request involves multiple cross-cloud network round trips. This not only significantly slows down response times but also increases the risk of single points of failure (SPOF) across multiple third-party platforms.
Perfect Equivalence of the GCP Native Ecosystem: The implementation has proven that the local K8s Stack (PG + Redis + MinIO) maps perfectly 1:1 to the GCP ecosystem (Cloud SQL + Pub/Sub + GCS), ensuring extremely low migration and management costs.

5. Summary

This v3 implementation hardens the local K8s cluster and establishes a low-friction upgrade path to GCP native serverless and Vertex AI GPU. The design fits a lightweight local MVP while allowing expansion to cloud-based compute as the project evolves.

apple-podcast-ripper-mvp-plan.md

System Analysis Report: Podcast Ripper Online (MVP Planning)

Analysis Date: 2026-03-10
Analysis Depth: Complete (L1 to L5)
Analysis Purpose: Evaluate the feasibility of transforming the existing Podcast Ripper PoC (local Notebook) into an online service, and plan the MVP architecture and implementation path.

I am the claude-4.6-sonnet-medium-thinking model, with data last updated in early 2025.
System Time: 2026-03-10 00:43:45 CST

Executive Summary

This system aims to transform the existing Podcast Ripper PoC (local Notebook) into an installation-free Web service, where users only need to input an Apple Podcast URL to asynchronously obtain Traditional Chinese transcripts. The core challenges lie in the long-running nature of Whisper transcription and the GPU cold start in Serverless environments, both of which directly impact the user experience. The most important architectural decision is selecting a platform that supports long-running Serverless GPU tasks (such as Modal.com) and decoupling the frontend and backend using a task queue. It is recommended to clarify first whether the target audience of the service requires authentication, as this decision will affect the entire access control and cost model design.

L1 | Business Background and Goals

Core Problem

Users currently need to install a local Python environment and manually run a Notebook to obtain Podcast transcripts. This imposes a high barrier to entry for non-technical users.

This system aims to solve: allowing anyone to obtain Podcast transcripts through a browser without installing any software.

Business Value

Direct Benefit: Lowers the barrier for users from "requiring a technical background to use" to "accessible to anyone," expanding the potential user base by more than 10 times.
Indirect Benefit: Serves as a starting point for a personal brand or a SaaS product; validates the market's willingness to pay for such services; accumulates real usage data to guide subsequent feature optimizations.

Key Stakeholders

Role	What They Care About	Influence	Remarks
General Users (Non-technical)	Ease of operation, reasonable waiting time, quality of results	High	Primary beneficiaries
Developers (Maintainers)	Low operations & maintenance costs, scalable architecture, easy debugging	High	The developer themselves
Podcast Copyright Holders	How their content is used	Low (no direct contact currently)	Potential source of legal risk

Success Definition

Response time from user URL input to "successful task submission" < 2 seconds
End-to-end completion time (including cold start) for a single episode < 15 minutes (for a typical show of 30-60 minutes in length)
Fixed monthly system cost at zero traffic < USD $5 (Serverless pay-as-you-go goal)
Accuracy of transcription results reaches PoC standard (Traditional Chinese output, no significant decline in accuracy)

L2 | Business Logic

Core Business Rules

Necessity of Asynchrony: Transcribing a 60-minute Podcast episode is estimated to take 5-15 minutes. An HTTP request cannot wait, so a "Submit Task → Get TaskID → Poll or Notify" model must be adopted.

Task State Machine (Mandatory):

[PENDING] → [PROCESSING] → [COMPLETED]
                         ↘ [FAILED]
[PENDING] → [CANCELLED] (Active cancellation by user, can be deferred in MVP)
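The state machine above can be enforced in code with an explicit transition table. The sketch below is one possible encoding; the names `TaskState`, `ALLOWED`, and `transition` are illustrative.

```python
from enum import Enum


class TaskState(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Each state maps to the set of states it may move to; terminal states map to nothing.
ALLOWED = {
    TaskState.PENDING: {TaskState.PROCESSING, TaskState.CANCELLED},
    TaskState.PROCESSING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
    TaskState.CANCELLED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the new state, or raise ValueError for an illegal move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table in one place means a Worker that crashes and retries cannot, for instance, move a COMPLETED task back to PROCESSING.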

Single Episode Limit: The MVP phase will only process one episode per request to avoid complex resource estimation.
Retention Period of Results: Transcripts are automatically deleted after 24 hours in S3 (implemented via S3 Object Lifecycle rules) to avoid long-term storage fees.
Handling Duplicate Submissions: If a COMPLETED task already exists for the same Episode GUID within 24 hours, the cached result is returned directly (preventing duplicate billing).
Failure Compensation: No automatic retries for failed tasks (MVP), but the failure reason must be clearly displayed to the user.
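The duplicate-submission rule can be sketched as a lookup over stored tasks. The `Task` dataclass and `find_cached` function are hypothetical; a real implementation would most likely express the same filter as a database query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

CACHE_TTL = timedelta(hours=24)


@dataclass
class Task:
    task_id: str
    episode_guid: str
    status: str
    completed_at: Optional[datetime] = None
    download_url: Optional[str] = None


def find_cached(tasks: Iterable[Task], guid: str, now: datetime) -> Optional[Task]:
    """Return the newest COMPLETED task for this GUID finished within 24 hours."""
    fresh = [
        t for t in tasks
        if t.episode_guid == guid
        and t.status == "completed"
        and t.completed_at is not None
        and now - t.completed_at < CACHE_TTL
    ]
    return max(fresh, key=lambda t: t.completed_at, default=None)
```

Only COMPLETED tasks count as cache hits here, so a FAILED attempt never blocks a fresh submission for the same episode.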

Main Business Process

User Inputs URL
      ↓
[Validate URL Format] → Fail → Return 400 Error Immediately
      ↓ Success
[Query RSS Feed via iTunes API] → Fail → Return Error Immediately (URL does not exist or is not a Podcast)
      ↓
[Parse RSS, Retrieve Episode Info for the Specified Episode Number]
      ↓
[Check Cache: Does a COMPLETED result already exist for the same GUID?]
      ├─ Yes → Return download link directly
      └─ No ↓
[Create Task Record (PENDING), Return TaskID]
      ↓
[Push Task to Message Queue]
      ↓ (Asynchronous)
[Worker Receives Task]
      ↓
[Update State → PROCESSING]
      ↓
[Download Audio File → Run Whisper → Simplified-to-Traditional Chinese Conversion → Upload to S3]
      ├─ Success → Update State to COMPLETED + Record Download URL
      └─ Fail → Update State to FAILED + Record Error Message
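The first two steps of the flow (URL validation and the iTunes lookup) can be sketched as below. It assumes that Apple Podcasts show URLs carry the numeric ID as a path segment of the form `id<digits>`; the function names and the exact-host allowlist are illustrative.

```python
import re
from typing import Optional
from urllib.parse import urlencode, urlsplit

ALLOWED_HOSTS = {"podcasts.apple.com"}
_ID_RE = re.compile(r"/id(\d+)(?:/|$)")


def parse_podcast_id(url: str) -> Optional[str]:
    """Return the numeric podcast ID, or None if the URL is not acceptable."""
    parts = urlsplit(url)
    # Compare the parsed host exactly; a string prefix check would also
    # accept hosts such as podcasts.apple.com.evil.example.
    if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
        return None
    match = _ID_RE.search(parts.path)
    return match.group(1) if match else None


def lookup_url(podcast_id: str) -> str:
    """Build the iTunes Lookup API request URL for a podcast ID."""
    query = urlencode({"id": podcast_id, "entity": "podcast"})
    return f"https://itunes.apple.com/lookup?{query}"
```

A `None` result maps to the immediate 400 response in the flow; on success, the RSS feed address would be read from the lookup response before the feed is parsed.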

Data Entities (Conceptual Level)

TranscriptionTask: The lifecycle unit of a transcription request, containing TaskID, state, Episode GUID, download URL, error message, creation time, and completion time.
Episode: Single episode information parsed from the RSS Feed, containing title, release date, audio URL, GUID, and podcast channel.
PodcastChannel: Channel information, containing Apple Podcast ID, channel title, and RSS Feed URL.

⚠️ Logical Risk Points

GUID Cache Boundary Issue: If the same program GUID is republished (audio content updated but GUID remains unchanged), the cache mechanism will return the old transcript. The MVP accepts this limitation for now, and a force_refresh parameter can be added later.
Polling Strategy Undefined: How many seconds between each poll from the frontend? What is the maximum polling limit? How is it handled when the limit is exceeded? The original report did not specify.
Discussion on Missing User Authentication: Does a public service need protections other than Rate Limiting? Malicious users can bypass simple IP restrictions (via VPN).

L3 | System Architecture

Architecture Pattern

Selected: Event-driven Serverless Architecture
Reasoning:

Traffic is highly irregular (personal tool); pay-as-you-go Serverless is more cost-effective than a constantly running server.
Transcription tasks naturally fit the asynchronous Worker model.
Zero Ops overhead, aligned with the core requirements of the developer (maintainer).

Main Trade-offs:

	Gains	Losses
Serverless	Low fixed cost, auto-scaling, Zero Ops	Cold start latency, complex debugging, vendor lock-in risk
GPU Platform (Modal)	Long-running execution, GPU acceleration	Additional learning curve, extra third-party dependency

System Components

┌─────────────────────────────────────────────────────────────┐
│                        User's Browser                       │
└────────────────────────┬────────────────────────────────────┘
                         │ HTTPS
┌────────────────────────▼────────────────────────────────────┐
│             Frontend (Next.js on Vercel)                    │
│  - Form Input / TaskID Display / Polling / Download Links   │
└────────────────────────┬────────────────────────────────────┘
                         │ REST API
┌────────────────────────▼────────────────────────────────────┐
│             API Layer (Next.js API Routes / FastAPI)         │
│  POST /tasks              Create Task                       │
│  GET  /tasks/{id}         Query Task Status                 │
└───────────┬─────────────────────────┬───────────────────────┘
            │ Write Task State        │ Push Message
┌───────────▼───────────┐  ┌──────────▼──────────────────────┐
│   Database (Supabase  │  │   Message Queue (SQS / Supabase │
│   Postgres)           │  │   Realtime / Modal Queue)       │
│  - TranscriptionTask  │  └──────────┬──────────────────────┘
│  - Task State Query   │             │ Consume Task
└───────────▲───────────┘  ┌──────────▼──────────────────────┐
            │ Update State │   Worker (Modal.com GPU)        │
            └───────────────┤  1. Download Audio              │
                            │  2. Whisper Transcription      │
                            │  3. opencc conversion          │
                            │  4. Upload to S3                │
                            └──────────┬──────────────────────┘
                                       │
                            ┌──────────▼──────────────────────┐
                            │   Storage (AWS S3)              │
                            │  - Temp Audio (Processing)      │
                            │  - Final Transcript (24h TTL)   │
                            └─────────────────────────────────┘

Integration Boundaries

External System	Integration Method	Data Flow	Risk Level	Remarks
iTunes Lookup API	REST (GET)	Outward, Read-only	Medium	Apple has no public SLA, subject to change without warning
RSS Feed (various Podcast hosts)	HTTP GET	Outward, Read-only	Medium	Formats vary slightly; parsing tolerance needs enhancement
Modal.com	Python SDK / Webhook	Bidirectional	High	Core dependency, significant vendor lock-in
AWS S3	AWS SDK	Outward, Read-write	Low	Mature and stable
Supabase	REST / SDK	Bidirectional	Low	Can be replaced with any Postgres instance

Non-Functional Requirements

Metric	Target	Remarks
API Response Time (Create Task)	< 2 seconds	Excludes transcription duration
End-to-End Transcription Time (1-hour show)	< 15 minutes	Includes GPU cold start
Monthly Fixed Cost (Zero Traffic)	< USD $5	Serverless target
Availability Target	99.5% (soft)	Personal tool, not financial grade
Transcript File Retention	24 hours	S3 Lifecycle automatic deletion

L4 | Technical Implementation

Tech Stack Selection

Category	Choice	Alternatives	Reason for Selection
Frontend Framework	Next.js	Vue + Nuxt	Native integration with Vercel; API Routes can also serve as the BFF layer.
Frontend Deployment	Vercel	S3 + CloudFront	Zero-config CI/CD, natively supports Next.js.
GPU Worker Platform	Modal.com	Replicate, RunPod	Supports long-running tasks, native Python, relatively fast cold starts, and flexible containerization.
Message Queue	SQS or Modal Queue	RabbitMQ, Supabase Realtime	Mature integration between SQS and Lambda/Modal; if using Modal entirely, its built-in Queue can be used directly.
Database	Supabase (Postgres)	Firebase, DynamoDB	Postgres offers flexible structured queries, Supabase provides an out-of-the-box REST API; avoids the schema complexity of DynamoDB.
Object Storage	AWS S3	Supabase Storage, Cloudflare R2	Mature and stable, Lifecycle Policy natively supports TTL.
Transcription Engine	OpenAI Whisper (large-v3 / turbo)	Google STT, AssemblyAI	PoC compatible, offline control, transparent cost.
Chinese Conversion	opencc-python-reimplemented	Manual post-processing	Already validated in the PoC.

API Design (Core Endpoints)

POST /api/tasks
  Body: { "podcast_url": "https://podcasts.apple.com/...", "episode_index": 0 }
  Response 202: { "task_id": "uuid", "status": "pending" }
  Response 400: { "error": "INVALID_URL", "message": "..." }
  Response 429: { "error": "RATE_LIMITED", "retry_after": 3600 }

GET /api/tasks/{task_id}
  Response 200: {
    "task_id": "uuid",
    "status": "pending|processing|completed|failed",
    "created_at": "ISO8601",
    "completed_at": "ISO8601 | null",
    "download_url": "https://s3.../result.tar.gz | null",
    "expires_at": "ISO8601 | null",
    "error": "DOWNLOAD_FAILED | TRANSCRIPTION_FAILED | ... | null"
  }
  Response 404: { "error": "TASK_NOT_FOUND" }
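As an illustration of the status payload, the sketch below builds the GET /api/tasks/{task_id} body with unset fields serialized as null. The helper names are hypothetical, and timestamps are assumed to be timezone-aware.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional

STATUSES = {"pending", "processing", "completed", "failed"}


def iso(ts: Optional[datetime]) -> Optional[str]:
    """Render an aware datetime as ISO 8601 in UTC, or pass None through."""
    return ts.astimezone(timezone.utc).isoformat() if ts is not None else None


def task_status_body(
    task_id: str,
    status: str,
    created_at: datetime,
    completed_at: Optional[datetime] = None,
    download_url: Optional[str] = None,
    expires_at: Optional[datetime] = None,
    error: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the response payload; fields that are not set stay None (JSON null)."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {
        "task_id": task_id,
        "status": status,
        "created_at": iso(created_at),
        "completed_at": iso(completed_at),
        "download_url": download_url,
        "expires_at": iso(expires_at),
        "error": error,
    }


body = task_status_body("demo-id", "pending", datetime(2026, 3, 10, tzinfo=timezone.utc))
print(json.dumps(body, indent=2))
```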

Performance Critical Path

Predicted Bottleneck 1: GPU Cold Start
- Modal.com cold starts may take 30-90 seconds while the container image is pulled.
- Mitigation Strategy: Set keep_warm=1 (named min_containers in newer Modal SDK releases) to keep one warm instance (adds a small fixed cost), or explicitly inform users on the frontend: "First launch may take up to 2 minutes."
Predicted Bottleneck 2: Long Audio Download
- A 60-minute MP3 is approximately 50-100MB; download time depends on the Podcast host's bandwidth.
- Mitigation Strategy: Stream downloads directly to Whisper (no need to write to disk first), or parallelize download and Whisper input pipeline.
Predicted Bottleneck 3: Polling Frequency
- Blindly polling once per second from the frontend imposes unnecessary pressure on the API.
- Mitigation Strategy: Poll at 5-second intervals initially, then step up to 30 seconds, then 60 seconds (a stepped backoff); stop polling after a maximum of 30 minutes and prompt the user to check back manually.
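The tiered polling intervals can be expressed as a small generator. The text does not say when each tier begins, so the 1-minute and 5-minute thresholds below are assumptions; the frontend would implement the same schedule in its own language.

```python
from typing import Iterator


def poll_delays(max_total_s: int = 30 * 60) -> Iterator[int]:
    """Yield seconds to wait before each poll until the time budget is spent."""
    elapsed = 0
    while True:
        if elapsed < 60:          # assumed: first minute polls every 5 s
            delay = 5
        elif elapsed < 5 * 60:    # assumed: up to 5 minutes polls every 30 s
            delay = 30
        else:                     # afterwards polls every 60 s
            delay = 60
        if elapsed + delay > max_total_s:
            return
        elapsed += delay
        yield delay
```

With these thresholds a task that takes the full 30 minutes costs 45 status requests, compared with 1,800 for once-per-second polling.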

Security Considerations

Threat	Measure
Malicious bulk submission of tasks consuming GPU budget	IP-based Rate Limiting (3 times per IP per hour), implementable via Vercel Edge Middleware or Upstash Redis.
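SSRF Attack (submitting internal network URLs)	Parse the URL and require its host to exactly match an allowlisted domain such as `podcasts.apple.com`; a prefix check alone would also accept hosts like `podcasts.apple.com.evil.example`.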
S3 Signed URL Leakage	Use Pre-signed URLs (valid for 1 hour) instead of exposing permanent public links.
Unauthorized API Endpoint Access	The MVP can adopt API Key (Header) authentication or remain directly public (paired with Rate Limiting). If user accounts are required, adopt Supabase Auth (Google OAuth).
CORS	Restrict API access exclusively to Vercel deployment domains.

Observability

Logs: Record structured logs (JSON) for every task status transition (PENDING → PROCESSING → COMPLETED/FAILED), including task_id, duration, and error_code.
Metrics (Supabase queries or Modal Dashboard):
- Daily task success rate
- Average transcription duration
- Distribution of failure reasons
Alerts: Send email alerts when Modal.com costs exceed defined thresholds (configure a Billing Alert).
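For the Logs item above, a structured transition record could be produced as in this sketch. The field names beyond task_id, duration, and error_code are illustrative, as is the `transition_log` helper.

```python
import json
import time
from typing import Optional


def transition_log(
    task_id: str,
    old: str,
    new: str,
    started_at: float,
    now: Optional[float] = None,
    error_code: Optional[str] = None,
) -> str:
    """Return one JSON log line describing a task status transition."""
    now = time.time() if now is None else now
    record = {
        "event": "task_transition",
        "task_id": task_id,
        "from": old,
        "to": new,
        "duration_s": round(now - started_at, 3),
        "error_code": error_code,
    }
    # sort_keys keeps the line stable, which helps when diffing or grepping logs.
    return json.dumps(record, sort_keys=True)
```

Emitting one such line per transition is enough to derive the listed metrics (success rate, average duration, failure distribution) from the logs alone.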

L5 | Project Realities and Constraints

Existing Constraints

Technical Debt: The existing PoC is a Jupyter Notebook; it must be refactored into modular Python functions before deployment to Modal. This step represents the largest upfront cost for the MVP.
Non-negotiable Decisions: Whisper remains the transcription engine (matching the PoC) to ensure offline control and predictable costs; commercial APIs are not considered.
Single Maintainer Constraint: Architectural complexity must remain within the range of a single maintainer to avoid introducing too many third-party services.

Implementation Risk Matrix

Risk	Probability	Impact	Mitigation Strategy
Modal.com price increase or service outage	Medium	High	Package worker logic into a Docker Image to allow easy migration to RunPod / Replicate.
iTunes Lookup API format change	Medium	High	Add integration tests, fail fast, and notify the maintainer upon anomalies.
Diverse audio URL formats (non-standard RSS)	High	Medium	Strengthen RSS parsing tolerance and explicitly document known incompatible formats.
GPU cold start exceeds user's expectation	High	Medium	Explicit progress indications on the frontend; evaluate keep_warm costs.
Copyright complaints	Low	High	Retain a disclaimer; do not store original audio files longer than the processing duration; only provide transcripts.
Cost explosion (malicious or viral spread)	Low	High	Rate Limiting + Billing Alert + emergency service kill-switch.

Phased Recommendations

Phase 1 (MVP, estimated 2-4 weeks)
- Scope: Single-page frontend + Modal Webhook Worker + Supabase task state + S3 storage.
- Goal: End-to-end Happy Path available, no authentication, IP-based Rate Limiting.
- Exclusions: User accounts, task history, push notifications, audio chunking.
Phase 2 (Stabilization, based on feedback)
- Integrate user authentication (Google OAuth via Supabase Auth).
- Task history (personalized).
- Email notifications (upon task completion).
- Support choosing specific episodes (currently only supports the latest episode).
Phase 3 (Feature Expansion)
- Audio chunking for parallel transcription (processing ultra-long programs).
- Summarization feature (integrating LLMs).
- Support for more Podcast platforms (Spotify, SoundCloud).

Key Findings and Recommendations

🔴 Must Address (High Risk)

Incompletely Defined Task State Machine → Before starting implementation, define all states and transition conditions, particularly the classification of FAILED errors (download failure vs. transcription failure), as this directly impacts the error messages and retry logic shown to users.
GPU Cold Start Experience Gap → Design explicit progress feedback on the frontend (e.g., progress bars, estimated wait time indicators, "first launch is slower" notice), otherwise users will assume the system is broken and abandon it.
Mandatory Pre-Launch Billing Alerts → Configure cost threshold alerts on Modal.com and AWS to prevent unexpected bills caused by a single malicious user.

🟡 Recommended Improvements (Medium Risk)

GUID Collision in Cache Mechanism → Explicitly document this limitation as a "known trade-off," and explain it in the documentation. A force_refresh parameter can be introduced later to resolve this.
Clarify Polling Strategy Specifications → Define frontend polling frequency, maximum wait time, and timeout handling before development to avoid post-implementation UX issues.
RSS Parsing Fault Tolerance → The PoC's RSS parsing targets specific formats. Different Podcast hosts vary in their RSS formats (enclosure tag placement, audio format, etc.). Enhance tolerance and build a test suite.

🟢 Long-term Optimization (Low Risk)

Modal.com Vendor Lock-in → Package the Worker logic as a Docker Image + standardized entrypoint function from the beginning to reduce future migration costs.
Observability Enhancements → Add structured logs and a task success rate dashboard as soon as the MVP launches to drive subsequent optimizations with data.
OpenCC Version Pinning → Ensure the opencc-python-reimplemented version is pinned in requirements.txt to avoid conversion rule shifts from version upgrades.

Assumptions and To-Be-Confirmed (TBC) Items

#	Assumption / TBC Item	Impact if Assumption is Incorrect	Verification Method
1	Modal.com's GPU Worker can complete transcription of a 60-minute show in under 15 minutes.	Needs chunking/parallelization, significantly increasing architectural complexity.	Run a 60-minute test audio on Modal to measure actual duration.
2	Apple's iTunes Lookup API can be stably used without authentication.	Requires direct RSS parsing or other methods to obtain Feed URLs.	Call the API repeatedly and observe its rate-limit behavior.
3	The service is for public use, requiring no user accounts (MVP phase).	If accounts are required, Supabase Auth must be added, increasing development time by 2-3 weeks.	Confirm the MVP target user scope with stakeholders.
4	Supabase's free tier can support the MVP traffic.	May require a paid tier, increasing fixed costs.	Confirm Supabase free tier database connection limits and API call caps.
5	S3 Pre-signed URL's 1-hour validity is sufficient for users to download.	Requires extending validity or adopting alternative distribution methods.	Adjust based on user testing feedback.

Differences from the Original Report

The original report (generated by gemini-3-flash-preview) covered the core architectural ideas but left room for enhancements in the following dimensions:

Dimension	Original Report	This Revised Version
Document Structure	Lacks executive summary and assumption verification table.	Fully complies with standard templates.
Task State Machine	Only mentioned "polling," did not define states or transitions.	Full state machine diagram + error flow.
L3 Architecture Diagram	Single line text: `User -> SQS -> Lambda -> S3`.	Detailed ASCII component diagram + integration boundary risk table.
L4 Tech Selection	Bulleted list of reasons without a comparison table.	Structured selection table + API specifications + security threat mapping table.
L5 Project Realities	Completely missing.	Risk matrix + three-phase delivery plan.
Priority of Recommendations	Uncategorized.	Categorized into three priority levels: 🔴/🟡/🟢.
Cache Logic	Only mentioned "retaining for 24 hours."	Included GUID cache hit logic + boundary issue explanation.

This report is generated based on existing PoC information and a system analyst's perspective. It is recommended to perform actual performance tests on Modal.com with the developer before finalizing the architecture.