@sunfmin
Last active April 2, 2026 04:31

Percev — Product Specification

Version: 1.0 Draft Date: 2026-03-25 Author: Felix Sun Status: Draft


Problem Statement

Knowledge workers spend 15-25 hours per week in meetings, presentations, and video calls, yet retain only a fraction of what was discussed. Existing solutions (Otter.ai, Granola, Limitless) require sending audio to the cloud, raising privacy concerns for organizations handling sensitive data. There is no native macOS app that records screen + audio, transcribes in real-time, and generates AI summaries — all locally on the user's machine.

Who experiences this: Engineers, PMs, designers, executives — anyone who attends meetings, watches technical presentations, or reviews recorded content and needs to recall specific details later.

Cost of not solving: Lost context, duplicated discussions, missed action items, and the growing unease of sending private meeting audio to third-party cloud services.


Goals

  1. Privacy-first capture: All transcription and AI processing runs locally on Apple Silicon — no audio or video ever leaves the machine unless the user explicitly exports
  2. Zero-friction recording: Start capturing a meeting or video in under 2 seconds from the menu bar — no app switching, no configuration
  3. Real-time comprehension: Live subtitles with <2s latency so users can follow along with foreign-language content or in noisy environments
  4. Total recall: Full-text search across all recordings so users can find "that thing someone said about the API redesign" in seconds
  5. Actionable output: AI-generated summaries with key moments, decisions, and action items (produced by Claude Code in the embedded terminal, see P0-9), not just a wall of transcript text

Non-Goals

  1. Cloud transcription service — We are not building a cloud API or SaaS. All processing is local. Cloud sync (e.g., iCloud) may come later but is not v1.
  2. Real-time collaboration — No shared transcripts, live co-editing, or team features in v1. This is a single-user tool first.
  3. Mobile app (iOS) — macOS only. iPhone/iPad lacks the GPU horsepower for local whisper inference at acceptable latency.
  4. Speaker diarization — Identifying who said what is desirable but technically hard with local-only processing. Deferred to v2.
  5. Video editing — We capture video for playback context, not for editing. No trimming, filters, or effects. (The Editor tab composes clips from segments but is not a full video editor.)

User Stories

Knowledge Worker (Primary Persona)

  • As a knowledge worker, I want to start recording my screen and audio with one click from the menu bar so that I can capture meetings without disrupting my workflow
  • As a knowledge worker, I want to see real-time subtitles inside the app window so that I can follow conversations in noisy environments or in languages I'm less fluent in
  • As a knowledge worker, I want to browse my past recordings with synced video and transcript so that I can review what was discussed in yesterday's standup
  • As a knowledge worker, I want AI-generated summaries of my recordings so that I can get the key points without rewatching an entire hour-long meeting
  • As a knowledge worker, I want to search across all my transcripts by keyword so that I can find the exact moment someone discussed a specific topic
  • As a knowledge worker, I want to export a transcript as SRT/TXT/Markdown so that I can share meeting notes with my team
  • As a knowledge worker, I want to select which window to record so that I only capture the Zoom call, not my entire desktop

Privacy-Conscious User

  • As a privacy-conscious user, I want all transcription to happen locally on my Mac so that my meeting audio never leaves my machine
  • As a privacy-conscious user, I want a clear indicator when recording is active so that I always know when Percev is capturing
  • As a privacy-conscious user, I want to delete recordings permanently so that sensitive content can be fully removed

Power User

  • As a power user, I want to use Claude Code in the embedded terminal to ask questions, summarize, or run custom workflows on my transcripts — without leaving the app
  • As a power user, I want to select multiple segments across recordings, compose them in the Editor with insights cards, and export a self-contained video clip to share with colleagues
  • As a power user, I want to choose between whisper model sizes (small/medium/large) so that I can trade off between speed and accuracy based on my hardware

Requirements

Must-Have (P0)

P0-0: First Launch Setup

  • On first launch, a setup wizard guides the user through:
    1. Recording consent disclaimer (one-time, see Open Questions #5)
    2. Whisper model download — default: Small (~466MB). User can choose a different size. Download runs with a progress bar showing percentage, speed, and estimated time remaining
    3. Screen Recording permission — prompt to grant macOS Screen Recording access
    4. Microphone permission — optional, prompt with explanation
  • If user skips or cancels download, the app reminds them from the menu bar with a badge
  • Acceptance criteria:
    • Setup wizard appears on first launch only
    • Whisper model download shows progress bar with %, speed (MB/s), and ETA
    • User can change whisper model size during setup
    • Downloads are resumable if interrupted (don't restart from zero)
    • App transitions to menu bar after setup completes
    • If downloads are skipped, menu bar shows a badge prompting the user to complete setup
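
The progress and resume criteria above can be sketched as follows. This is a minimal illustration in Python, not the app's implementation; the function names are hypothetical, and resuming assumes the model host honors standard HTTP Range requests.

```python
def download_progress(bytes_done, bytes_total, elapsed_s):
    """Return (percent, speed_mb_per_s, eta_seconds or None).

    eta is None until at least one byte has arrived, because no
    speed estimate exists yet.
    """
    percent = 100.0 * bytes_done / bytes_total
    if bytes_done <= 0 or elapsed_s <= 0:
        return percent, 0.0, None
    speed_mb_per_s = bytes_done / elapsed_s / 1_000_000
    # Remaining bytes divided by average bytes/second so far.
    eta_s = (bytes_total - bytes_done) * elapsed_s / bytes_done
    return percent, speed_mb_per_s, eta_s


def resume_headers(bytes_on_disk):
    """HTTP headers that continue a partial download instead of restarting."""
    if bytes_on_disk <= 0:
        return {}
    return {"Range": f"bytes={bytes_on_disk}-"}
```

For a resumed download, the speed figure is only meaningful if `bytes_done` and `elapsed_s` both count the current session; the percentage should still use the total bytes on disk.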

P0-1: Menu Bar Presence

  • Percev lives in the macOS menu bar as a lightweight status icon and also appears in the Dock
  • Click menu bar icon to reveal dropdown: Start/Stop recording (with ⌘⇧R shortcut indicator), Open Percev, Join Discord, Quit
  • Click Dock icon to bring the main app window to front
  • Recording indicator: menu bar icon changes to a red dot with elapsed recording time (e.g., ● 12:34) when actively recording — rendered as a non-template NSImage so the red color is visible
  • Acceptance criteria:
    • App launches with both a menu bar icon and a Dock icon
    • Menu dropdown shows Start/Stop recording with keyboard shortcut, Open Percev, Join Discord, and Quit
    • Global keyboard shortcut to start/stop recording (default: ⌘⇧R, configurable)
    • Menu bar icon shows red dot + elapsed time during active recording

P0-2: Window Picker

  • Window picker is only shown when screen recording is enabled; if the user starts a recording with video off, no picker is shown and recording begins immediately
  • The picker can also be triggered mid-session from the menu bar dropdown or the main window by enabling video on an active recording
  • Smart detection: auto-suggests meeting apps (Zoom, Teams, Lark, Google Meet, Slack Huddle) if they're running
  • Shows window thumbnails with app name and title for easy identification
  • Async thumbnail loading: window list appears immediately with placeholder icons; thumbnails load in the background and replace placeholders as they become available — picker is never blocked waiting for thumbnails
  • Acceptance criteria:
    • When video is off, recording starts immediately with no picker shown
    • Enabling video from menu bar or main window while idle triggers the window picker before starting
    • Enabling video during an active recording triggers the picker to select a window and begins screen capture without interrupting audio/transcript
    • Picker opens instantly — window list (app name + title) is shown before any thumbnails are fetched
    • Each window card shows a grey placeholder graphic until its thumbnail is ready
    • Thumbnails load asynchronously and fade in as they resolve, without shifting layout
    • Picker shows all visible windows with thumbnails
    • Meeting app windows are surfaced at the top
    • User can switch target window during an active recording
    • "Full screen" option captures the entire display
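
One way to surface meeting apps at the top is a stable sort on a "is this a meeting window" flag, which leaves the remaining windows in their original order. The sketch below is illustrative Python; the substring matching rule and the function names are assumptions, and only the app list comes from the spec.

```python
# App names taken from the smart-detection bullet above.
MEETING_APPS = ("Zoom", "Teams", "Lark", "Google Meet", "Slack")


def is_meeting_window(app_name, title):
    """Heuristic: a known meeting app name appears in the app name or title."""
    haystack = f"{app_name} {title}".lower()
    return any(name.lower() in haystack for name in MEETING_APPS)


def order_for_picker(windows):
    """windows: list of (app_name, title) tuples.

    sorted() is stable, and False sorts before True, so meeting windows
    come first and every other window keeps its relative position.
    """
    return sorted(windows, key=lambda w: not is_meeting_window(*w))
```

Plain substring matching will produce false positives (a document titled "Teams roadmap", for example), so a real implementation would more likely match on bundle identifiers.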

P0-3: Screen + Audio Recording

  • Captures the targeted window as video (H.265/HEVC, hardware-encoded via VideoToolbox) using ScreenCaptureKit
  • Video is optional (default: off) — toolbar has a video on/off toggle button. Audio + transcript are always captured. When video is off, NO video.mp4 is created (file existence = hasVideo).
  • Video toggle + permission flow: When user toggles video ON, check screen recording permission. If not granted, prompt for it via the macOS system dialog. If permission denied, show a message and keep video off.
  • Window Picker — shown when video is enabled and recording starts (see P0-2). If video is off, recording starts immediately with no picker.
  • Camera recording is a separate opt-in — when enabled, the front camera is recorded to a separate video file (camera.mp4) in the recording directory, independent of screen recording
  • Simultaneously captures system audio (what the user hears) via ScreenCaptureKit
  • Optionally captures microphone input (user's voice)
  • No virtual audio devices needed — ScreenCaptureKit captures system audio natively regardless of output device
  • No metadata.json — all recording metadata is derived from the file system:
    • Date: parsed from directory name (yyyy-MM-dd-HHmmss-title)
    • Duration: read from audio.wav via AVAsset
    • hasVideo: video.mp4 file exists and is >10KB
    • hasCamera: camera.mp4 file exists and is >10KB
    • hasMicrophone: mic.wav file exists and is >10KB
    • Window title: stored in .title text file (one line), falls back to directory name
  • Acceptance criteria:
    • HEVC video recording with VFR (only encode when screen changes) with <5% CPU overhead on M1 or newer
    • Toolbar has video on/off toggle button (persisted to settings)
    • Toggling video ON checks screen recording permission, prompts if needed
    • When video is OFF, NO video.mp4 file is created
    • When video is ON, window picker is shown before recording starts
    • Camera recording is opt-in and saves to a separate camera.mp4 file in the same recording directory
    • Camera and screen recording can be enabled/disabled independently
    • System audio captured regardless of output device (speakers, headphones, Bluetooth, AirPods)
    • Microphone capture is opt-in with clear permission handling
    • Recording continues even if the target window is minimized or occluded
    • Only ONE ScreenCaptureKit audio stream at a time (enforce single-instance)
    • Clicking "Start Recording" auto-selects the active recording in the sidebar
    • Recording directory appears in library immediately (no metadata.json dependency)
    • All metadata derived from file system: date from dir name, duration from audio, hasVideo from file existence
    • If app is force-quit during recording, the recording appears in library on next launch
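
The file-system-derived metadata rules above can be sketched as below. This is illustrative Python rather than the app's code: `RecordingInfo` and `read_recording` are hypothetical names, and duration is left out because the spec reads it from audio.wav via AVAsset.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional

MIN_MEDIA_BYTES = 10 * 1024  # the ">10KB" threshold from the spec

# yyyy-MM-dd-HHmmss-title, e.g. 2026-03-25-143025-standup
DIR_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{6})-(.+)$")


@dataclass
class RecordingInfo:
    started_at: datetime
    title: str
    has_video: bool
    has_camera: bool
    has_microphone: bool


def _present(path: Path) -> bool:
    """A media file counts only if it exists and is larger than 10KB."""
    return path.is_file() and path.stat().st_size > MIN_MEDIA_BYTES


def read_recording(directory: Path) -> Optional[RecordingInfo]:
    """Derive metadata from names and files only; None if not a recording."""
    match = DIR_PATTERN.match(directory.name)
    if match is None:
        return None
    try:
        started_at = datetime.strptime(match.group(1), "%Y-%m-%d-%H%M%S")
    except ValueError:
        return None
    # Window title lives in a one-line .title file; fall back to the dir name.
    title = ""
    title_file = directory / ".title"
    if title_file.is_file():
        lines = title_file.read_text(encoding="utf-8").splitlines()
        title = lines[0].strip() if lines else ""
    return RecordingInfo(
        started_at=started_at,
        title=title or directory.name,
        has_video=_present(directory / "video.mp4"),
        has_camera=_present(directory / "camera.mp4"),
        has_microphone=_present(directory / "mic.wav"),
    )
```

Because nothing is written at stop time, a recording interrupted by a force-quit still resolves to a valid `RecordingInfo` on the next scan.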

P0-4: Key Frame Extraction

  • During or after recording, automatically detect and save key frames when significant visual changes occur (slide transitions, screen switches, new content)
  • Detection: compare consecutive frame histograms; a large histogram difference means new content. No AI needed.
  • Key frames saved as JPEGs in a keyframes/ subdirectory with timestamp filenames
  • These frames serve two purposes:
    1. Visual timeline in the playback view (scrub through key moments)
    2. Claude Code can read them (multimodal) — enabling "Identify Key Moments" quick-action button
  • Acceptance criteria:
    • Key frames extracted automatically on recording completion (or during recording)
    • Frame detection uses histogram comparison — lightweight, no ML required
    • Key frames saved as keyframes/HH-MM-SS.jpg in the recording directory
    • Duplicate/near-identical frames are deduplicated (threshold configurable)
    • Key frames appear as a visual timeline strip in the playback view
    • Clicking a key frame thumbnail seeks the video to that timestamp
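
A minimal sketch of histogram-based detection, assuming frames have already been downscaled to 8-bit grayscale pixel lists. The bin count, the distance measure (half the L1 distance), and the 0.3 threshold are illustrative assumptions, not values from the spec; treating the first frame as a key frame is also an assumption.

```python
def histogram(pixels, bins=16):
    """Normalized histogram of 8-bit grayscale values (0-255)."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // 256] += 1
    total = len(pixels) or 1
    return [c / total for c in counts]


def histogram_distance(a, b):
    """Half the L1 distance: 0.0 for identical, 1.0 for disjoint histograms."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def keyframe_name(seconds):
    """Timestamp filename in the keyframes/HH-MM-SS.jpg convention."""
    s = int(seconds)
    return f"{s // 3600:02d}-{s % 3600 // 60:02d}-{s % 60:02d}.jpg"


def detect_keyframes(frames, threshold=0.3):
    """frames: iterable of (timestamp_seconds, grayscale_pixels).

    A frame becomes a key frame when its histogram differs from the
    previous frame's histogram by more than the threshold.
    """
    names = []
    previous = None
    for timestamp, pixels in frames:
        current = histogram(pixels)
        if previous is None or histogram_distance(previous, current) > threshold:
            names.append(keyframe_name(timestamp))
        previous = current
    return names
```

Comparing only consecutive frames misses slow cross-fades; comparing against the last saved key frame instead would catch those and would also help with the de-duplication criterion.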

P0-5: Real-Time Transcription

  • Two-tier architecture using whisper.cpp C API with Metal GPU acceleration:
    • Tier 1 (Partial): Every 2 seconds, transcribe the latest 2s audio chunk. Each chunk becomes its own append-only partial segment — previously rendered partial text is never modified, only new chunks are appended. Displayed as a single flowing block (consecutive partial chunks joined with spaces, single timestamp above the first partial).
    • Tier 2 (Final): Every 20 seconds, re-transcribe ALL accumulated audio (up to 30s) for complete, punctuated sentences. Cut at sentence boundaries, carry remainder to next pass. All partial segments are removed and replaced by the final.
  • Language: always auto-detected from audio — no settings, no locale guessing. Whisper auto-detects via language = nil (do NOT set detect_language = true — it produces 0 segments). No initial_prompt — it causes hallucination/repetition.
  • Chinese text normalization: whisper outputs mixed Simplified/Traditional Chinese. Default: convert to Simplified via macOS CFStringTransform with the ICU transform "Hant-Hans" (Traditional to Simplified; "Hans-Hant" is the opposite direction, used for the Traditional setting) for full Unicode coverage. User can change in toolbar menu (Off / Simplified / Traditional). Normalization is applied at display time (in TranscriptTextView), so both live and playback transcripts are normalized regardless of how they were stored.
  • Acceptance criteria:
    • Partial text appears within 2 seconds of speech
    • Each 2s partial chunk is a separate append-only segment — previously rendered text is never modified
    • Consecutive partial chunks display as one flowing block with spaces between chunks
    • Final sentences appear within 20-25 seconds with proper punctuation
    • Final segments are 20s+ of audio, cut at sentence boundaries, remainder carried over
    • Incomplete sentences carry over to the next transcription window (no cut-off mid-sentence)
    • Silence is handled gracefully (no phantom text, no timing drift)
    • No initial_prompt on any tier (prevents hallucination and repetition)
    • Total audio context never exceeds 30s (whisper hard limit)
    • Language auto-detected from audio — Chinese audio produces Chinese text, not English
    • Chinese text normalization defaults to Simplified, uses CFStringTransform for full coverage
    • All transcript text is selectable and copyable — user selections on finalized text are preserved during live partial updates
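
The Tier 2 cut-and-carry rule can be sketched as follows, assuming each final pass yields time-stamped segments. The Python below is illustrative only: `split_final` is a hypothetical name, the punctuation set is an assumption, and the 10-second carry limit comes from the 20s + 10s = 30s budget described under Key Technical Constraints.

```python
SENTENCE_ENDINGS = (".", "!", "?", "。", "！", "？")
MAX_CARRY_S = 10.0  # 20s final interval + 10s carryover stays within 30s


def split_final(segments, window_start_s, window_end_s):
    """Decide what to finalize and where the next pass should start.

    segments: list of (start_s, end_s, text) in time order from one pass.
    Returns (final_segments, carry_from_s).
    """
    cut = None
    for i, (_, _, text) in enumerate(segments):
        if text.rstrip().endswith(SENTENCE_ENDINGS):
            cut = i  # remember the last segment that ends a sentence
    if cut is not None:
        carry_from = segments[cut][1]
        if window_end_s - carry_from <= MAX_CARRY_S:
            # Normal case: finalize up to the boundary, carry the rest.
            return segments[: cut + 1], carry_from
    elif window_end_s - window_start_s <= MAX_CARRY_S:
        # No complete sentence yet and the window is short: carry it all.
        return [], window_start_s
    # Carrying would exceed the budget: flush everything, even mid-sentence.
    return list(segments), window_end_s
```

With segments ending at 8.0s ("We shipped v2."), 17.5s ("The API redesign is next.") and 20.0s ("Then we"), the first two are finalized and the next pass starts at 17.5s, so the unfinished clause is re-transcribed with more context.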

P0-6: Real-Time Subtitle Panel

  • Subtitle panel embedded inside the app window (not a floating overlay), rendered using STTextView (TextKit 2 editor component) in read-only mode for true cross-segment text selection
  • All segments rendered as a single NSAttributedString document — users can drag-select across any combination of finalized and partial text
  • Selection preservation: Uses NSTextStorage common-prefix diff for incremental updates — only appends/replaces changed characters at the end. User selections on earlier text are never disturbed during live updates.
  • Shows all finalized sentences for the current recording session + current partial text
  • New sentences appear at the bottom; panel auto-scrolls to follow the latest text
  • Layout: During active recording, header shows compact single line (red dot + title). Transcript panel fills the entire detail area. Editorial typography: timestamps above text blocks (70% font size, monospaced digits), generous proportional spacing (40% line height, 80% block spacing). Text soft-wraps with no horizontal scrollbar.
  • Font customization: Toolbar "Aa" menu with "Choose Font..." opens macOS native NSFontPanel. Custom font name and point size are persisted to settings and applied to ALL text (timestamps, finals, partials). "Reset to System Font" reverts to default. Font changes trigger full attribute refresh across the entire document.
  • Chinese normalization applied at display time in TranscriptTextView, configurable in toolbar menu (Off / Simplified / Traditional).
  • Acceptance criteria:
    • All sentences for the session are retained and scrollable
    • Panel auto-scrolls to the latest sentence as new text arrives
    • Scrolling back pauses auto-scroll; reaching the bottom resumes it
    • Text is readable against both light and dark backgrounds
    • Partial text shows as flowing block (consecutive chunks joined, single timestamp)
    • User can toggle panel visibility with a keyboard shortcut
    • User can choose any system font and size via macOS font panel
    • Font changes apply to ALL text — timestamps, finals, and partials
    • All transcript text is selectable and copyable across all segments
    • User text selection is preserved during live 2-second partial updates
    • During active recording, header is compact (red dot + title, one line)
    • During active recording, transcript panel fills available space (no video area)
    • Chinese normalization (Off/Simplified/Traditional) configurable from toolbar menu
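
The common-prefix diff behind selection preservation can be illustrated with plain strings. This sketch is Python, not the NSTextStorage code itself, and the function names are hypothetical; the point is that only the tail after the shared prefix is ever replaced, so ranges the user selected earlier in the document keep their offsets.

```python
def common_prefix_length(old, new):
    """Number of leading characters that old and new share."""
    limit = min(len(old), len(new))
    i = 0
    while i < limit and old[i] == new[i]:
        i += 1
    return i


def tail_edit(old, new):
    """Return (start, delete_count, insert_text) that turns old into new
    by touching only the text after the common prefix."""
    start = common_prefix_length(old, new)
    return start, len(old) - start, new[start:]


def apply_edit(old, edit):
    """Apply a tail edit; text before `start` is left untouched."""
    start, delete_count, insert_text = edit
    return old[:start] + insert_text + old[start + delete_count:]
```

When a new partial chunk is appended, the edit degenerates to a pure insertion at the end (`delete_count` is 0). When finals replace partials, only the partial block at the tail is rewritten. A Swift version would need to count in UTF-16 units to match NSTextStorage ranges.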

P0-7: Recording Library

  • All data stored as plain files in the user-configurable home directory (default: ~/Percev/)
  • No database: the file system is the source of truth. Library view scans the root directory and derives each recording's metadata from its subdirectory (directory name, file existence, and the .title file; see P0-3)
  • Home directory structure:
    ~/Percev/                              # user-configurable in settings
    ├── CLAUDE.md                          # auto-generated, explains data format for Claude Code
    ├── 2026-03-25-143025-standup/         # {date}-{HHmmss}-{title} for uniqueness
    │   ├── transcript.jsonl               # timestamped transcript lines
    │   ├── audio.wav                      # system audio (16kHz mono WAV)
    │   ├── mic.wav                        # microphone audio (16kHz mono WAV, optional)
    │   ├── video.mp4                      # screen recording (optional)
    │   ├── camera.mp4                     # camera recording (optional)
    │   ├── .title                         # window title, one line (see P0-3)
    │   ├── thumbnail.jpg                  # first non-blank frame
    │   └── keyframes/                     # auto-extracted key frames
    │       ├── 00-02-15.jpg
    │       ├── 00-08-42.jpg
    │       └── ...
    ├── 2026-03-25-160530-design-review/
    │   ├── ...
    └── .percev/                           # hidden folder for app internals
        ├── settings.json                  # app preferences
        └── models/
            ├── whisper-small.bin          # whisper model
            └── ...
    <!-- Updated: 2026-03-30 — Directory format includes HHmmss for uniqueness. Mic audio is separate file. Removed embedding model and search index from .percev/ (deferred to P2). -->
    
  • Library view shows recordings sorted by date, with title, duration, and thumbnail
  • Acceptance criteria:
    • Active recording appears in library immediately when recording starts (with "Recording in Progress" indicator)
    • Recordings appear in library immediately after stopping
    • Thumbnail generated from first non-blank frame
    • Auto-title from first meaningful transcript text or window title
    • Delete recording permanently removes the entire recording directory
    • Storage usage displayed in settings
    • Home directory is configurable in settings (moving existing data is handled automatically)
    • Library rebuilds correctly from file system (no hidden database state)
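
Rebuilding the library from the file system amounts to one directory scan. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, hidden entries such as .percev/ are skipped, and anything whose name does not start with a yyyy-MM-dd-HHmmss stamp is ignored.

```python
from datetime import datetime
from pathlib import Path

STAMP_LENGTH = len("2026-03-25-143025")  # 17 characters


def list_recordings(home):
    """Return recording directories under `home`, newest first."""
    found = []
    for entry in Path(home).iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue  # skips CLAUDE.md and the hidden .percev/ folder
        try:
            started = datetime.strptime(entry.name[:STAMP_LENGTH], "%Y-%m-%d-%H%M%S")
        except ValueError:
            continue  # not a recording directory
        found.append((started, entry))
    found.sort(key=lambda item: item[0], reverse=True)
    return [entry for _, entry in found]


def storage_bytes(home):
    """Total size of every file under `home`, for the settings display."""
    return sum(f.stat().st_size for f in Path(home).rglob("*") if f.is_file())
```

Since the sort key comes from the directory name, an in-progress recording shows up as soon as its directory exists, before any media file has been finalized.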

P0-8: Playback with Synced Transcript

  • Video in separate floating window — managed by VideoWindowManager singleton. Video player is NOT embedded in the main pane; it opens as a floating NSWindow beside the main window. This gives the transcript maximum vertical space.
  • VideoWindowManager behavior:
    • show(): reuses a single persistent NSWindow, swaps the AVPlayer when switching recordings. Window never closes/reopens on recording switch — just content swap.
    • hide(): closes the window when switching to an audio-only recording. Does NOT change isEnabled preference.
    • toggle(): user action — saves frame + toggles isEnabled preference.
    • X button: saves frame + sets isEnabled = false.
    • isEnabled is a global preference that persists across recording switches. If user closes the video window, it stays closed for subsequent recordings until they reopen it.
    • Window frame (position + size) saved via NSWindow.saveFrame(usingName:) on user actions only. Restored on next open. Smart positioning beside main window only on first-ever open.
    • contentAspectRatio locked to video's natural ratio (async-loaded from track metadata).
    • Video window shows inline AVPlayer controls. Bidirectional sync with transcript controls via KVO on player.timeControlStatus.
  • Audio-only recordings: no video window, no video toggle button. hasVideo is derived from file existence (video.mp4 present and >10KB, see P0-3); no video.mp4 is created when video is off.
  • Clicking a transcript timestamp jumps playback to that position (custom seekTimeKey attribute + NSClickGestureRecognizer, not .link to avoid blue styling).
  • Current spoken line is highlighted during playback (accent background color).
  • Playback controls: compact bar with skip ±10s, play/pause, speed menu (pill-shaped 1x button), video toggle icon.
  • Delete confirmation: trash button shows alert dialog before deleting.
  • Acceptance criteria:
    • Transcript scrolls automatically to follow playback position
    • Click any timestamp to seek playback to that moment
    • Playback speed: 0.5x, 1x, 1.5x, 2x (compact menu button)
    • Keyboard shortcuts for play/pause (Space), skip ±10s (←/→)
    • Video opens in a separate floating window, positioned beside main window
    • Video window persists across recording switches (global preference)
    • Video window hidden for audio-only recordings, reopens for next video recording
    • Video window frame (size + position) remembered across sessions
    • Video window aspect ratio locked to prevent black bars
    • Video and transcript controls bidirectionally synced
    • Delete recording requires confirmation dialog
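
Highlighting the current line and following playback both reduce to finding the last segment that starts at or before the player position. A binary search keeps that cheap even for long transcripts. The sketch below is illustrative Python with hypothetical function names; `start_times` is assumed to be the sorted list of segment start times in seconds.

```python
from bisect import bisect_right


def current_line_index(start_times, position_s):
    """Index of the segment being spoken at position_s, or None if
    playback is still before the first segment."""
    i = bisect_right(start_times, position_s) - 1
    return i if i >= 0 else None


def skip(position_s, delta_s, duration_s):
    """Skip forward or back (e.g. +10 or -10 seconds), clamped to the recording."""
    return max(0.0, min(duration_s, position_s + delta_s))
```

The same lookup serves both directions of the sync: clicking a timestamp seeks the player to `start_times[i]`, and each player time update maps back to the index to highlight and scroll to.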

P0-9: Embedded Terminal with Claude Code

  • Percev does NOT include built-in AI features (no summaries, no chat, no LLM)
  • Instead, an embedded terminal panel lives side-by-side with the transcript and video player
  • Users bring their own Claude subscription and run Claude Code directly inside Percev
  • Each recording is stored as a well-structured directory directly under the Percev home directory:
    ~/Percev/2026-03-25-143025-standup/
    ├── transcript.jsonl      # timestamped transcript lines
    ├── audio.wav             # system audio (16kHz mono WAV)
    ├── mic.wav               # microphone audio (optional, 16kHz mono WAV)
    ├── video.mp4             # screen recording (optional)
    ├── camera.mp4            # camera recording (optional)
    └── .title                # window title, one line (see P0-3)
    
  • Auto-launch Claude Code on recording selection — when a recording is selected, Percev auto-starts a new terminal session in the recording's directory and launches claude automatically. If the user switches to a different recording, the existing Claude session is killed and a new one starts in the new directory. Users type directly in the SwiftTerm terminal (no separate input field).
  • Implementation: SwiftTerm (open-source terminal emulator) with LocalProcessTerminalView
  • Timestamp linking: parse Claude Code output for timestamps (e.g., [00:12:34]) and make them clickable to jump the video player to that moment
  • Layout: vertical split on left, terminal on right — the left side shows video player on top and synced transcript on bottom (VSplitView). The right side is the embedded Claude Code terminal (resizable HSplitView). This layout gives equal prominence to content review (left) and AI interaction (right).
  • Quick-Action Buttons — toolbar buttons above the terminal that send pre-built prompts to Claude Code with one click:
    • 📝 Summarize — generate a structured summary (key topics, decisions, action items)
    • Action Items — extract action items with assignees and deadlines
    • Ask — opens a text input for a custom question, sends it to Claude Code with the transcript as context
    • 📧 Follow-up Email — draft a follow-up email from the meeting
    • 🌐 Translate — translate the transcript to a selected language
    • 🖼️ Key Moments — send key frame images + transcript to Claude Code to identify and describe the most important visual moments (slide content, diagrams, code shown on screen)
    • 🔄 Compare — select another recording and compare what changed between meetings
    • Users can also type freely in the terminal for any custom workflow
  • Button implementation: each button constructs a claude CLI command with the appropriate prompt and pipes it to the embedded terminal (e.g., claude "Summarize the meeting transcript in transcript.jsonl. Include key topics, decisions, and action items.")
  • Users can customize or add their own quick-action buttons in settings (custom prompt templates)
  • Acceptance criteria:
    • Quick-action buttons are visible in a toolbar above the embedded terminal
    • Each button sends a pre-built prompt to Claude Code — no typing required
    • "Ask" button opens a text input field for custom questions
    • Users can add/edit/reorder custom quick-action buttons in settings
    • Buttons are disabled when Claude Code is not installed or no recording is selected
    • Selecting a recording auto-starts a terminal session and launches Claude Code in the recording directory
    • Switching recordings kills the existing Claude session and starts a new one in the new directory
    • Timestamps in terminal output (e.g., [00:12:34]) are clickable and seek the video player
    • Split view is resizable; terminal can be toggled visible/hidden (keyboard shortcut)
    • Recording directory is human-readable and Claude Code-friendly
    • JSONL transcript format includes timestamps, text, and language per line
    • A CLAUDE.md file is auto-generated in the recordings root directory explaining the data format
    • Recordings directory path is configurable in settings
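
Timestamp linking can be sketched as a scan for bracketed HH:MM:SS tokens. The Python below is illustrative: only the `[00:12:34]` shape comes from the spec, the function name is hypothetical, and real terminal output would need ANSI escape sequences stripped or skipped before matching.

```python
import re

# Matches tokens such as [00:12:34]; minutes and seconds are limited to 00-59.
TIMESTAMP = re.compile(r"\[(\d{2}):([0-5]\d):([0-5]\d)\]")


def find_timestamps(text):
    """Return (start_index, end_index, seconds) for every timestamp token.

    The index pair marks the clickable range; seconds is the seek target.
    """
    results = []
    for match in TIMESTAMP.finditer(text):
        hours, minutes, seconds = (int(g) for g in match.groups())
        results.append((match.start(), match.end(), hours * 3600 + minutes * 60 + seconds))
    return results
```

For the line `Decision at [00:12:34], follow-up at [01:02:03].` this yields seek targets of 754 and 3723 seconds.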

P0-10: Whisper Model Selection

  • Settings: choose whisper model size
    • Small (~466MB, fastest, good for real-time)
    • Medium (~1.5GB, balanced, better accuracy)
    • Large-v3 (~3.1GB, best accuracy, slower)
  • In-app model download with progress
  • Acceptance criteria:
    • Model download shows progress and estimated time
    • User warned if selected model may cause >2s partial latency on their hardware
    • Model switch takes effect on next recording (not mid-recording)

Deferred Features

See spec-p2.md for deferred features including:

  • P2: Editor, Semantic Search, Speaker Diarization, iCloud Sync

Success Metrics

No in-app telemetry — consistent with our privacy-first positioning. Metrics are gathered from external signals and community feedback only.

Sales & Distribution (LemonSqueezy)

| Metric | Target | Stretch | Measurement |
|---|---|---|---|
| Downloads (first month) | 5,000 | 15,000 | LemonSqueezy analytics |
| Paid conversions (first month) | 250 | 750 | LemonSqueezy sales data |
| Refund rate | <5% | <2% | LemonSqueezy |
| Revenue (first 3 months) | $5,000 | $15,000 | LemonSqueezy |

Community & Feedback

| Metric | Target | Stretch | Measurement |
|---|---|---|---|
| GitHub stars (if open-source) | 1,000 | 5,000 | GitHub |
| Community members (Discord) | 500 | 2,000 | Discord |
| Bug reports resolved | >80% within 1 week | >90% | GitHub issues |
| Social media mentions | 50/month | 200/month | Manual tracking |

Technical Architecture

Core Components

┌──────────────────────────────────────────────────┐
│                  Percev.app                        │
│                                                    │
│  ┌──────────────┐  ┌───────────────────────────┐  │
│  │  Menu Bar UI  │  │   Recording Engine         │  │
│  │  (SwiftUI)    │  │                           │  │
│  │               │  │  ScreenCaptureKit         │  │
│  │  • Start/Stop │  │  ├─ Video (HEVC)          │  │
│  │  • Window Pick│  │  ├─ System Audio (PCM)    │  │
│  │  • Status     │  │  └─ Mic Audio (PCM)       │  │
│  └──────────────┘  └───────────┬───────────────┘  │
│                                │                    │
│  ┌──────────────┐  ┌──────────▼────────────────┐  │
│  │  Subtitle     │  │  Transcription Engine      │  │
│  │  Panel        │  │                           │  │
│  │  (SwiftUI,    │  │  whisper.cpp C API        │  │
│  │   in-window)  │◄─│  ├─ 2s partials (live)    │  │
│  │               │  │  └─ 20s finals (sentences)│  │
│  └──────────────┘  └───────────┬───────────────┘  │
│                                │                    │
│  ┌──────────────┐  ┌──────────▼────────────────┐  │
│  │  Library &    │  │  Storage (Plain Files)      │  │
│  │  Playback     │  │  ~/Percev/ (configurable)  │  │
│  │  (SwiftUI)    │◄─│                           │  │
│  │               │  │  <recording-name>/         │  │
│  │  • Search     │  │  ├─ transcript.jsonl       │  │
│  │  (deferred    │  │  ├─ audio.wav              │  │
│  │   to P2)      │  │  ├─ video.mp4              │  │
│  │               │  │  └─ .title                 │  │
│  │               │  │                           │  │
│  │               │  │  .percev/settings.json     │  │
│  └──────────────┘  └───────────┬───────────────┘  │
│                                │                    │
│                    ┌──────────▼────────────────┐  │
│                    │  Embedded Terminal (P0-9)    │  │
│                    │  (SwiftTerm / PTY)         │  │
│                    │                           │  │
│                    │  Claude Code runs in-app   │  │
│                    │  ├─ Auto-cd to recording   │  │
│                    │  ├─ Reads JSONL transcripts │  │
│                    │  ├─ Clickable timestamps   │  │
│                    │  └─ Any custom AI workflow  │  │
│                    └───────────────────────────┘  │
└──────────────────────────────────────────────────┘

Key Technical Constraints

  1. Single ScreenCaptureKit stream — macOS allows only one audio capture stream at a time. Multiple Percev instances or competing apps will conflict.
  2. Whisper 30s context limit — The transcription engine must ensure final interval (20s) + sentence carryover (up to 10s) never exceeds 30s.
  3. Metal GPU required — whisper.cpp inference runs on Metal. Intel Macs are not supported.
  4. Memory pressure — whisper medium model uses ~2.6GB VRAM. Combined with video encoding, total memory overhead should be profiled across M1 (8GB) through M4 (up to 128GB) configurations.
  5. Screen Recording permission — macOS requires explicit user consent. App must handle the permission flow gracefully with clear instructions.

Storage Estimates

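| Content | Size per hour | Notes |
|---|---|---|
| Video (H.265/HEVC, VFR, 1080p) | ~100MB | VFR + constrained VBR, 200-300 Kbps avg |
| Audio (WAV, 16kHz mono) | ~115MB | Or ~15MB as AAC |
| Transcript (JSONL) | ~200KB | Text is tiny |
| Metadata (.title file) | <1KB | Tiny |
| Total per hour (video on) | ~115MB with AAC audio, ~215MB with WAV audio | Video off: ~15MB with AAC audio, ~115MB with WAV audio |

At 2 hours of recording per day with AAC audio: ~7GB/month (video on) or ~1GB/month (video off). With the 16kHz WAV audio that P0-7 stores, the same usage comes to ~13GB/month (video on) or ~7GB/month (video off), excluding the optional mic.wav and camera.mp4.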
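
The per-hour and per-month figures follow from bitrate arithmetic, sketched here in Python. The 250 Kbps video rate is the midpoint of the stated 200-300 Kbps range; 16-bit samples for the WAV file and a 32 kbps AAC rate are assumptions chosen because they reproduce the ~115MB and ~15MB figures.

```python
def mb_per_hour(kbps):
    """Convert an average bitrate in kilobits/second to megabytes per hour."""
    return kbps * 1000 / 8 * 3600 / 1_000_000


def gb_per_month(mb_hour, hours_per_day=2, days=30):
    """Monthly storage in gigabytes for a given hourly size."""
    return mb_hour * hours_per_day * days / 1000


VIDEO_KBPS = 250                 # midpoint of the 200-300 Kbps average
WAV_KBPS = 16_000 * 16 / 1000    # 16 kHz x 16-bit mono = 256 kbps (16-bit assumed)
AAC_KBPS = 32                    # assumed AAC speech bitrate

video_mb = mb_per_hour(VIDEO_KBPS)   # about 112.5 MB/hour
wav_mb = mb_per_hour(WAV_KBPS)       # about 115.2 MB/hour
aac_mb = mb_per_hour(AAC_KBPS)       # about 14.4 MB/hour
```

Whether the monthly total lands near 7GB or 13GB depends almost entirely on the audio format, since uncompressed 16kHz WAV is roughly as large as the HEVC video itself.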

HEVC Encoding Settings

| Property | Value | Notes |
|---|---|---|
| Codec | H.265/HEVC via VideoToolbox | Hardware-accelerated on all Apple Silicon |
| Profile | Main, Auto Level | Sufficient for 8-bit screen content |
| Average bitrate | 200–300 Kbps | Constrained VBR, sufficient for mostly-static screens |
| Peak bitrate | 1.0–1.5 Mbps | Burst for transitions/scrolling |
| Keyframe interval | 5 seconds | Time-based, not frame-count (since VFR) |
| B-frames | Disabled | Reduces latency, negligible benefit for screen content |
| Frame rate | VFR (variable) | Only encode when screen content changes; ScreenCaptureKit delivers frames on-change natively. Typical avg ~2-5 fps for meetings, up to 15fps during scrolling/video |
| Max frame rate | 15 fps cap | Prevents excessive encoding during fast motion |
| Real-time mode | On | Prioritizes low latency over compression |
| Pixel format | Full-range YUV (NV12) | Better text fidelity than video-range |
| Container | .mov (QuickTime) | Best macOS-native support for HEVC |

Open Questions

| # | Question | Owner | Blocking? |
|---|---|---|---|
| 1 | App Store or direct download? Resolved: Ship v1.0 as direct download (Developer ID notarized) to avoid App Store sandbox conflicts with ScreenCaptureKit system audio capture. Revisit App Store submission for v1.1+ once entitlement requirements are confirmed. ScreenCaptureKit handles Screen Recording permission automatically via OS-level consent dialog; no special code needed. | Engineering | No |
| 2 | Freemium model? Resolved: App has no built-in AI features; AI is delegated to Claude Code (user's own subscription). Monetization via LemonSqueezy (international, friendly to China-based developers). Pricing model TBD but no AI cost to us. | Product | No |
| 3 | Should video be optional? Resolved: Yes, video is optional (default: off, per P0-3). Users toggle it from the toolbar; audio + transcript are always captured. | Product | No |
| 4 | MLX model performance on M1 (8GB)? Resolved: No local LLM needed; AI is delegated to Claude Code. Only whisper.cpp runs locally. | Engineering | No |
| 5 | Legal: recording consent notices? Resolved: Show a one-time consent/disclaimer on first app launch explaining that the user is responsible for informing meeting participants where required by law. No per-recording prompts. The existing menu bar recording indicator (P0-1) serves as a visual reminder. | Legal | No |
| 6 | Support Intel Macs? Resolved: No. Apple Silicon only. Metal GPU required for whisper.cpp and HEVC hardware encoding. | Product | No |
| 7 | Optimal video codec? Resolved: Use H.265/HEVC via VideoToolbox hardware encoder. ~40-50% smaller than H.264 at equivalent text quality, near-zero CPU (dedicated media engine on all Apple Silicon). ~100 MB/hour at 1080p with VFR. Constrained VBR at 200-300 Kbps average, 1.0-1.5 Mbps peak. See Technical Architecture for full settings. | Engineering | No |

Competitive Landscape

| Product | Model | Strengths | Weaknesses vs. Percev |
|---|---|---|---|
| Otter.ai | Cloud SaaS, $17/mo | Polished UX, speaker ID, integrations | All audio sent to cloud, subscription cost, meetings-only |
| Granola | Cloud, $10/mo | Clean meeting notes, AI summaries | Meetings-only, cloud-dependent, no video |
| Limitless (ex-Rewind) | Local + Cloud | Full screen recording, search | Requires pendant ($99) for best experience, cloud AI |
| Loom | Cloud, freemium | Async video sharing, screen recording | No transcription focus, cloud storage, no real-time subtitles |
| macOS Dictation | Local | Built-in, free | No recording, no search, no summaries, single-language |

Percev's differentiation:

  1. Fully local — no cloud, no subscription for core features, no privacy concerns
  2. Any content — not just meetings: YouTube, lectures, podcasts, presentations
  3. Real-time subtitles: live transcription with 2s latency, shown in an in-window subtitle panel
  4. Native macOS — SwiftUI, Metal GPU, ScreenCaptureKit — feels like an Apple app
  5. Open AI layer — no built-in AI lock-in. Recordings are plain files (JSONL + WAV + MP4) that Claude Code or any tool can read. Users bring their own AI and can build any workflow on top of their data