data-ingestion

The podcast pipeline at ~/data-ingestion/. Ingests RSS feeds → triages by relevance → transcribes on Modal A10G → extracts claims via Parallel API → produces *.raw.json + digest markdown that the kb consumes.

Sister system to garmin-warehouse — separate repo (git init'd 2026-05-04 at commit 46ce4fd, 1,779 files, 9.2MB, no secrets) but they share data via the kb.

Quick orientation

  • Location: ~/data-ingestion/
  • Run command for full sync: ./scripts/podcast-sync.sh (also wired to launchd Sun 6am PT)
  • Output: insights/<show>/<ep>.raw.json (and .raw.meta.json prompt-version sidecar), plus digests/ markdown that swap-podcast and kb consume
  • Cost: ~$2-3 per show backfill + ~$0/week ongoing (Modal serverless + Parallel API)
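
For orientation only, a rough guess at the shape of one <ep>.raw.json, inferred from the stages below; the field names are assumptions, not the actual schema:

    # Hypothetical shape of insights/<show>/<ep>.raw.json; field names are
    # illustrative guesses, not the real schema.
    example_raw = {
        "episode": {"show": "swap", "title": "...", "published": "..."},
        "relevance": 4,                                  # triage.py score (1-5)
        "claims": [
            {
                "text": "...",                           # one extracted claim
                "studies": [{"pmid": "...", "doi": None}],  # resolved by research.py
            }
        ],
    }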

Pipeline stages

   RSS feed                                         │
       │                                            │
       ▼                                            │
   triage.py: rate ep relevance via LLM (1-5)       │  per-show YAML
       │  ▼                                         │  in podcasts/
       │  reject if < cutoff                        │  drives behavior
       ▼                                            │
   modal_transcribe.py: Whisper-large on Modal A10G │
       │  ~3 min/ep, parallel-friendly              │
       ▼                                            │
   research.py: Parallel API → claims + studies     │
       │  resolves PMID/DOI for cited studies       │
       ▼                                            │
   <ep>.raw.json + <ep>.raw.meta.json sidecar       │
       │                                            │
       ▼                                            │
   swap.py compiles per-ep digest markdown          │
       │                                            │
       ▼                                            │
   kb/load.py picks up new raw.json files           │
       │  via rglob, registered by show YAML        │
       ▼                                            │
   DuckDB kb has claims, studies, embeddings        │
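
A minimal sketch of how the stages chain per episode. batch.py is the real entry point; the stage function names and config fields here are assumptions, and the real pipeline writes atomically (see Idempotency below):

    # Per-episode flow, heavily simplified; stage function names are assumptions.
    import json
    from pathlib import Path

    def process_episode(ep, show_cfg):
        score = triage_episode(ep)                       # triage.py: LLM relevance, 1-5
        if score < show_cfg["cutoff"]:                   # per-show YAML drives the cutoff
            return None                                  # rejected before any paid work
        transcript = transcribe_on_modal(ep)             # modal_transcribe.py: Whisper on A10G
        claims = extract_claims(transcript)              # research.py: Parallel API claims + studies
        out = Path("insights", show_cfg["slug"], f"{ep.slug}.raw.json")
        out.parent.mkdir(parents=True, exist_ok=True)
        # real pipeline: atomic write + completeness guard (see Idempotency)
        out.write_text(json.dumps({"relevance": score, "claims": claims}, indent=2))
        return out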

Per-show config

Each show has a YAML at podcasts/<show>.yaml. Adding a new show:

  1. Drop YAML in podcasts/
  2. kb/load.py auto-picks up the show on next run (uses an insights-prefix → show registry built from all YAMLs)
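
A minimal sketch of what that prefix → show registry could look like, assuming each YAML carries a slug matching its directory under insights/ (the field names are assumptions):

    # Hypothetical reconstruction of the insights-prefix -> show registry
    # kb/load.py builds from podcasts/*.yaml; the "slug" field is an assumption.
    from pathlib import Path
    import yaml

    def build_show_registry(podcasts_dir="podcasts"):
        registry = {}
        for cfg_path in sorted(Path(podcasts_dir).glob("*.yaml")):
            cfg = yaml.safe_load(cfg_path.read_text())
            registry[cfg.get("slug", cfg_path.stem)] = cfg
        return registry

    # e.g. registry["swap"] tells the loader how to treat insights/swap/*.raw.json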

Live shows (as of 2026-05-04):

  • swap.yaml — Some Work, All Play (main + bonus, 174 + 118 eps)
  • real-science-of-sport.yaml — Tucker / Finch (97 eps)
  • running-effect.yaml — The Running Effect (58)
  • letsrun.yaml — LetsRun.com's Track Talk (58)
  • morning-shakeout.yaml — The Morning Shakeout (23, all ≥4-relevance)
  • coffee-club.yaml — Coffee Club / OAC (11)
  • strength-running.yaml — The Strength Running Podcast (5, all ≥5-relevance)

Smaller counts reflect higher relevance cutoffs in the show YAML, not missing data. Lowering the cutoff (e.g. to ≥3) backfills more episodes if needed.

Critical files

  • batch.py: Run the pipeline across all shows; the main entry
  • research.py: Parallel API extraction (claims + study resolution)
  • modal_transcribe.py: Modal A10G Whisper transcription
  • triage.py: Pre-research LLM-based relevance scoring (filters Q&A/off-topic episodes before we pay to research them)
  • letsrun_threads.py + dedup_forum_scrape.py: Forum-specific scraping path (Letsrun isn't a podcast feed)
  • notify.py: Telegram + Resend notifications when sync finishes
  • find_misses.py + retry_misses.py: Recovery for episodes that failed mid-pipeline
  • hot_briefing.py: Cross-show synthesis: "what did multiple shows talk about this week"
  • research_radar.py: Newer-paper auto-discovery via PubMed cited-by
  • otq_leaderboard.py: Specific OTQ-2028-relevant content tracker
  • monitor_setup.py: Configures the Parallel API monitors (long-running batch jobs)
  • scripts/podcast-sync.sh: Sun 6am PT cron entry point
  • scripts/build_followups_summary.py: Builds findings/_followups_summary.md from PubMed cited-by graph
  • IDEMPOTENCY.md: Doc of the atomic-write + _raw_json_is_complete() guards across pipeline stages

Forum mode (Letsrun)

Letsrun threads use a different extraction prompt than podcasts: extract_claims(mode="forum"). The default podcast prompt drops Q&A-style and quoted content, which is exactly the shape of forum threads — so without forum mode, Letsrun extraction silently produces near-empty raw.json files. See feedback memory.
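
A sketch of the mode switch; the mode="forum" argument is real per the note above, but the prompt text and selection logic below are assumptions:

    # Only the mode="forum" argument is documented; prompts and structure are assumptions.
    PODCAST_PROMPT = "Extract claims; ignore Q&A banter and quoted material ..."
    FORUM_PROMPT = "Extract claims; Q&A-style and quoted posts ARE the content ..."

    def extract_claims(text, mode="podcast"):
        prompt = FORUM_PROMPT if mode == "forum" else PODCAST_PROMPT
        # hand prompt + text to the Parallel API and return structured claims
        return run_parallel_extraction(prompt, text)     # hypothetical helper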

Cost guards

New podcast YAML → 3-episode cap on first run. Known YAML → 25-episode cap. Orchestrator timeouts sit just above Modal container cap. See feedback memory.
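
A minimal sketch of the cap selection; the 3/25 values are from this page, the seen-before heuristic is an assumption:

    from pathlib import Path

    FIRST_RUN_CAP = 3     # brand-new podcast YAML: cheap sanity pass
    KNOWN_SHOW_CAP = 25   # established show: bounded batch per run

    def episode_cap(show_slug, insights_dir="insights"):
        """Assumed heuristic: a show is 'known' once any raw.json exists for it."""
        seen_before = any(Path(insights_dir, show_slug).glob("*.raw.json"))
        return KNOWN_SHOW_CAP if seen_before else FIRST_RUN_CAP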

Idempotency

The pipeline is crash-safe: re-running mid-failure produces the same final state. Mechanism:

  • Atomic writes (tmp file, then os.replace) at every layer; see the sketch below
  • _raw_json_is_complete() guard: an in-progress raw.json never overwrites a complete one
  • Prompt-version sidecar: <ep>.raw.meta.json records which prompt version produced the raw.json. kb/stale_prompts.py finds claims needing re-extraction after prompt edits.
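
A minimal sketch of the atomic-write pattern plus a guard in the spirit of _raw_json_is_complete(); the completeness criteria and sidecar fields are assumptions:

    import json
    import os
    import tempfile
    from pathlib import Path

    REQUIRED_KEYS = {"claims", "relevance"}   # assumed; the real criteria live in the repo

    def _raw_json_is_complete(path: Path) -> bool:
        """Treat a raw.json as complete only if it parses and has the expected keys."""
        try:
            data = json.loads(path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return False
        return REQUIRED_KEYS <= data.keys()

    def write_raw_json(path: Path, payload: dict, prompt_version: str):
        if _raw_json_is_complete(path):
            return                            # never clobber a finished episode
        fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp, path)                 # atomic rename on the same filesystem
        # prompt-version sidecar so kb/stale_prompts.py can flag stale extractions
        meta = path.parent / path.name.replace(".raw.json", ".raw.meta.json")
        meta.write_text(json.dumps({"prompt_version": prompt_version}))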

21 dedicated idempotency tests cover this.

Triage (pre-research) vs Triage (kb application queue)

Two different things both called "triage":

  1. ~/data-ingestion/triage.py — pre-research relevance scoring. Filters episodes BEFORE they cost money to research. "Is this episode worth the $0.05?"
  2. ~/garmin-warehouse/kb/triage.py — kb application-queue TUI. Walks claims surfaced by watches.yaml and lets Casey apply/dismiss them per finding. "Should this claim show up in bloodwork_baseline.md?"

The Worker UI's inline triage (commit 2 of UI plan) hits #2, not #1.