data-ingestion

The podcast pipeline at ~/data-ingestion/. Ingests RSS feeds → triages by relevance → transcribes on Modal A10G → extracts claims via Parallel API → produces *.raw.json + digest markdown that the kb consumes.

Sister system to garmin-warehouse — separate repo (git init'd 2026-05-04 at commit 46ce4fd, 1,779 files, 9.2MB, no secrets) but they share data via the kb.

Quick orientation

  • Location: ~/data-ingestion/
  • Run command for full sync: ./scripts/podcast-sync.sh (also wired to launchd Sun 6am PT)
  • Output: insights/<show>/<ep>.raw.json (and .raw.meta.json prompt-version sidecar), plus digests/ markdown that swap-podcast and kb consume
  • Cost: ~$2-3 per show backfill + ~$0/week ongoing (Modal serverless + Parallel API)
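
For orientation only, a rough guess at the shape of one <ep>.raw.json, inferred from the stages below; the field names are assumptions, not the actual schema:

    # Hypothetical shape of insights/<show>/<ep>.raw.json; field names are
    # illustrative guesses, not the real schema.
    example_raw = {
        "episode": {"show": "swap", "title": "...", "published": "..."},
        "relevance": 4,                                  # triage.py score (1-5)
        "claims": [
            {
                "text": "...",                           # one extracted claim
                "studies": [{"pmid": "...", "doi": None}],  # resolved by research.py
            }
        ],
    }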

Pipeline stages

   RSS feed                                         │
       │                                            │
       ▼                                            │
   triage.py: rate ep relevance via LLM (1-5)       │  per-show YAML
       │  ▼                                         │  in podcasts/
       │  reject if < cutoff                        │  drives behavior
       ▼                                            │
   modal_transcribe.py: Whisper-large on Modal A10G │
       │  ~3 min/ep, parallel-friendly              │
       ▼                                            │
   research.py: Parallel API → claims + studies     │
       │  resolves PMID/DOI for cited studies       │
       ▼                                            │
   <ep>.raw.json + <ep>.raw.meta.json sidecar       │
       │                                            │
       ▼                                            │
   swap.py compiles per-ep digest markdown          │
       │                                            │
       ▼                                            │
   kb/load.py picks up new raw.json files           │
       │  via rglob, registered by show YAML        │
       ▼                                            │
   DuckDB kb has claims, studies, embeddings        │
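
A minimal sketch of how the stages chain per episode. batch.py is the real entry point; the stage function names and config fields here are assumptions, and the real pipeline writes atomically (see Idempotency below):

    # Per-episode flow, heavily simplified; stage function names are assumptions.
    import json
    from pathlib import Path

    def process_episode(ep, show_cfg):
        score = triage_episode(ep)                       # triage.py: LLM relevance, 1-5
        if score < show_cfg["cutoff"]:                   # per-show YAML drives the cutoff
            return None                                  # rejected before any paid work
        transcript = transcribe_on_modal(ep)             # modal_transcribe.py: Whisper on A10G
        claims = extract_claims(transcript)              # research.py: Parallel API claims + studies
        out = Path("insights", show_cfg["slug"], f"{ep.slug}.raw.json")
        out.parent.mkdir(parents=True, exist_ok=True)
        # real pipeline: atomic write + completeness guard (see Idempotency)
        out.write_text(json.dumps({"relevance": score, "claims": claims}, indent=2))
        return out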

Per-show config

Each show has a YAML at podcasts/<show>.yaml. Adding a new show:

  1. Drop YAML in podcasts/
  2. kb/load.py auto-picks up the show on next run (uses an insights-prefix → show registry built from all YAMLs)
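
A minimal sketch of what that prefix → show registry could look like, assuming each YAML carries a slug matching its directory under insights/ (the field names are assumptions):

    # Hypothetical reconstruction of the insights-prefix -> show registry
    # kb/load.py builds from podcasts/*.yaml; the "slug" field is an assumption.
    from pathlib import Path
    import yaml

    def build_show_registry(podcasts_dir="podcasts"):
        registry = {}
        for cfg_path in sorted(Path(podcasts_dir).glob("*.yaml")):
            cfg = yaml.safe_load(cfg_path.read_text())
            registry[cfg.get("slug", cfg_path.stem)] = cfg
        return registry

    # e.g. registry["swap"] tells the loader how to treat insights/swap/*.raw.json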

Live shows (as of 2026-05-04):

  • swap.yaml — Some Work, All Play (main + bonus, 174 + 118 eps)
  • real-science-of-sport.yaml — Tucker / Finch (97 eps)
  • running-effect.yaml — The Running Effect (58)
  • letsrun.yaml — LetsRun.com's Track Talk (58)
  • morning-shakeout.yaml — The Morning Shakeout (23, all ≥4-relevance)
  • coffee-club.yaml — Coffee Club / OAC (11)
  • strength-running.yaml — The Strength Running Podcast (5, all ≥5-relevance)

Smaller counts reflect higher relevance cutoffs in the show YAML, not missing data. Lowering the cutoff (e.g. to ≥3) backfills more episodes if needed.

Critical files

  • batch.py: Run the pipeline across all shows; the main entry
  • research.py: Parallel API extraction (claims + study resolution)
  • modal_transcribe.py: Modal A10G Whisper transcription
  • triage.py: Pre-research LLM-based relevance scoring (filters Q&A/off-topic episodes before we pay to research them)
  • letsrun_threads.py + dedup_forum_scrape.py: Forum-specific scraping path (Letsrun isn't a podcast feed)
  • notify.py: Telegram + Resend notifications when sync finishes
  • find_misses.py + retry_misses.py: Recovery for episodes that failed mid-pipeline
  • hot_briefing.py: Cross-show synthesis: "what did multiple shows talk about this week"
  • research_radar.py: Newer-paper auto-discovery via PubMed cited-by
  • otq_leaderboard.py: Specific OTQ-2028-relevant content tracker
  • monitor_setup.py: Configures the Parallel API monitors (long-running batch jobs)
  • scripts/podcast-sync.sh: Sun 6am PT cron entry point
  • scripts/build_followups_summary.py: Builds findings/_followups_summary.md from PubMed cited-by graph
  • IDEMPOTENCY.md: Doc of the atomic-write + _raw_json_is_complete() guards across pipeline stages

Forum mode (Letsrun)

Letsrun threads use a different extraction prompt than podcasts: extract_claims(mode="forum"). The default podcast prompt drops Q&A-style and quoted content, which is exactly the shape of forum threads — so without forum mode, Letsrun extraction silently produces near-empty raw.json files. See feedback memory.
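
A sketch of the mode switch; the mode="forum" argument is real per the note above, but the prompt text and selection logic below are assumptions:

    # Only the mode="forum" argument is documented; prompts and structure are assumptions.
    PODCAST_PROMPT = "Extract claims; ignore Q&A banter and quoted material ..."
    FORUM_PROMPT = "Extract claims; Q&A-style and quoted posts ARE the content ..."

    def extract_claims(text, mode="podcast"):
        prompt = FORUM_PROMPT if mode == "forum" else PODCAST_PROMPT
        # hand prompt + text to the Parallel API and return structured claims
        return run_parallel_extraction(prompt, text)     # hypothetical helper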

Cost guards

New podcast YAML → 3-episode cap on first run. Known YAML → 25-episode cap. Orchestrator timeouts sit just above Modal container cap. See feedback memory.
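
A minimal sketch of the cap selection; the 3/25 values are from this page, the seen-before heuristic is an assumption:

    from pathlib import Path

    FIRST_RUN_CAP = 3     # brand-new podcast YAML: cheap sanity pass
    KNOWN_SHOW_CAP = 25   # established show: bounded batch per run

    def episode_cap(show_slug, insights_dir="insights"):
        """Assumed heuristic: a show is 'known' once any raw.json exists for it."""
        seen_before = any(Path(insights_dir, show_slug).glob("*.raw.json"))
        return KNOWN_SHOW_CAP if seen_before else FIRST_RUN_CAP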

Idempotency

The pipeline is crash-safe: re-running mid-failure produces the same final state. Mechanism:

  • Atomic writes (tmp file, then os.replace) at every layer; see the sketch below
  • _raw_json_is_complete() guard: an in-progress raw.json never overwrites a complete one
  • Prompt-version sidecar: <ep>.raw.meta.json records which prompt version produced the raw.json. kb/stale_prompts.py finds claims needing re-extraction after prompt edits.
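
A minimal sketch of the atomic-write pattern plus a guard in the spirit of _raw_json_is_complete(); the completeness criteria and sidecar fields are assumptions:

    import json
    import os
    import tempfile
    from pathlib import Path

    REQUIRED_KEYS = {"claims", "relevance"}   # assumed; the real criteria live in the repo

    def _raw_json_is_complete(path: Path) -> bool:
        """Treat a raw.json as complete only if it parses and has the expected keys."""
        try:
            data = json.loads(path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return False
        return REQUIRED_KEYS <= data.keys()

    def write_raw_json(path: Path, payload: dict, prompt_version: str):
        if _raw_json_is_complete(path):
            return                            # never clobber a finished episode
        fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp, path)                 # atomic rename on the same filesystem
        # prompt-version sidecar so kb/stale_prompts.py can flag stale extractions
        meta = path.parent / path.name.replace(".raw.json", ".raw.meta.json")
        meta.write_text(json.dumps({"prompt_version": prompt_version}))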

21 dedicated idempotency tests cover this.

Triage (pre-research) vs Triage (kb application queue)

Two different things both called "triage":

  1. ~/data-ingestion/triage.py — pre-research relevance scoring. Filters episodes BEFORE they cost money to research. "Is this episode worth the $0.05?"
  2. ~/garmin-warehouse/kb/triage.py — kb application-queue TUI. Walks claims surfaced by watches.yaml and lets Casey apply/dismiss them per finding. "Should this claim show up in bloodwork_baseline.md?"

The Worker UI's inline triage (commit 2 of UI plan) hits #2, not #1.