# data-ingestion
The podcast pipeline at `~/data-ingestion/`. Ingests RSS feeds → triages
by relevance → transcribes on Modal A10G → extracts claims via Parallel
API → produces `*.raw.json` + digest markdown that the kb consumes.
Sister system to garmin-warehouse — separate repo (git init'd
2026-05-04 at commit 46ce4fd, 1,779 files, 9.2MB, no secrets) but they
share data via the kb.
## Quick orientation

- Location: `~/data-ingestion/`
- Run command for full sync: `./scripts/podcast-sync.sh` (also wired to launchd, Sun 6am PT)
- Output: `insights/<show>/<ep>.raw.json` (and `.raw.meta.json` prompt-version sidecar), plus `digests/` markdown that swap-podcast and the kb consume
- Cost: ~$2-3 per show backfill + ~$0/week ongoing (Modal serverless + Parallel API)
## Pipeline stages

```
RSS feed
   │
   ▼
triage.py: rate episode relevance via LLM (1-5)      ◄── per-show YAML in
   │                                                     podcasts/ drives
   ├── reject if < cutoff                                behavior
   ▼
modal_transcribe.py: Whisper-large on Modal A10G
   │   (~3 min/ep, parallel-friendly)
   ▼
research.py: Parallel API → claims + studies
   │   (resolves PMID/DOI for cited studies)
   ▼
<ep>.raw.json + <ep>.raw.meta.json sidecar
   │
   ▼
swap.py: compiles per-ep digest markdown
   │
   ▼
kb/load.py: picks up new raw.json files
   │   (via rglob, registered by show YAML)
   ▼
DuckDB kb: claims, studies, embeddings
```
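The stage order above can be sketched as a tiny orchestrator. Everything here is a stand-in: `CUTOFF`, the function bodies, and the dict shapes are assumptions for illustration; the real signatures live in `triage.py`, `modal_transcribe.py`, and `research.py`.

```python
# Minimal sketch of the stage order; names mirror the stage scripts,
# but signatures and return shapes are invented for illustration.
CUTOFF = 4  # per-show relevance cutoff, normally read from podcasts/<show>.yaml

def triage(episode):
    # Stand-in for triage.py: the real version rates relevance 1-5 via an LLM.
    return episode.get("relevance", 1)

def run_pipeline(episode):
    score = triage(episode)
    if score < CUTOFF:
        return None  # rejected before any paid transcription/research happens
    transcript = f"transcript of {episode['title']}"    # modal_transcribe.py
    claims = [f"claim about {episode['title']}"]        # research.py
    return {"episode": episode["title"], "relevance": score,
            "transcript": transcript, "claims": claims}  # -> <ep>.raw.json
```

The key property to notice is the order: triage runs first so a low-relevance episode never reaches the paid transcription and research stages.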
## Per-show config

Each show has a YAML at `podcasts/<show>.yaml`. Adding a new show:

- Drop the YAML in `podcasts/`
- `kb/load.py` auto-picks up the show on next run (uses an insights-prefix → show registry built from all YAMLs)
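A minimal sketch of the insights-prefix → show registry idea. The dicts stand in for parsed `podcasts/<show>.yaml` files, and the field names (`show`, `insights_prefix`) are assumptions, not the real schema in `kb/load.py`:

```python
# Stand-ins for parsed podcasts/<show>.yaml files (field names assumed).
show_configs = [
    {"show": "swap", "insights_prefix": "insights/swap"},
    {"show": "letsrun", "insights_prefix": "insights/letsrun"},
]

# Prefix -> show registry, built once from all YAMLs.
registry = {cfg["insights_prefix"]: cfg["show"] for cfg in show_configs}

def show_for(raw_json_path):
    # Map a raw.json path back to its show via the prefix registry.
    for prefix, show in registry.items():
        if raw_json_path.startswith(prefix):
            return show
    return None
```

Because the registry is rebuilt from all YAMLs on each run, dropping a new YAML into `podcasts/` is the only registration step needed.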
Live shows (as of 2026-05-04):

- `swap.yaml` — Some Work, All Play (main + bonus, 174 + 118 eps)
- `real-science-of-sport.yaml` — Tucker / Finch (97 eps)
- `running-effect.yaml` — The Running Effect (58 eps)
- `letsrun.yaml` — LetsRun.com's Track Talk (58 eps)
- `morning-shakeout.yaml` — The Morning Shakeout (23 eps, all ≥4-relevance)
- `coffee-club.yaml` — Coffee Club / OAC (11 eps)
- `strength-running.yaml` — The Strength Running Podcast (5 eps, all ≥5-relevance)
Smaller counts reflect higher relevance cutoffs in the show's YAML, not missing data. Bumping the cutoff (e.g. to ≥3) backfills more eps if needed.
## Critical files

| Path | What it does |
|---|---|
| `batch.py` | Run the pipeline across all shows; the main entry point |
| `research.py` | Parallel API extraction (claims + study resolution) |
| `modal_transcribe.py` | Modal A10G Whisper transcription |
| `triage.py` | Pre-research LLM-based relevance scoring (filters Q&A/off-topic episodes before we pay to research them) |
| `letsrun_threads.py` + `dedup_forum_scrape.py` | Forum-specific scraping path (Letsrun isn't a podcast feed) |
| `notify.py` | Telegram + Resend notifications when sync finishes |
| `find_misses.py` + `retry_misses.py` | Recovery for episodes that failed mid-pipeline |
| `hot_briefing.py` | Cross-show synthesis: "what did multiple shows talk about this week" |
| `research_radar.py` | Newer-paper auto-discovery via PubMed cited-by |
| `otq_leaderboard.py` | Tracks OTQ-2028-relevant content specifically |
| `monitor_setup.py` | Configures the Parallel API monitors (long-running batch jobs) |
| `scripts/podcast-sync.sh` | Sun 6am PT cron entry point |
| `scripts/build_followups_summary.py` | Builds `findings/_followups_summary.md` from the PubMed cited-by graph |
| `IDEMPOTENCY.md` | Documents the atomic-write + `_raw_json_is_complete()` guards across pipeline stages |
## Forum mode (Letsrun)

Letsrun threads use a different extraction prompt than podcasts:
`extract_claims(mode="forum")`. The default podcast prompt drops
Q&A-style and quoted content, which is exactly the shape of forum
threads — so without forum mode, Letsrun extraction silently produces
near-empty raw.json files. See feedback memory.
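What the mode switch amounts to can be pictured like this. The prompt strings and dict return shape are invented for illustration; the real prompts and implementation live in `research.py`:

```python
# Illustrative prompts only -- the real ones live in research.py.
PODCAST_PROMPT = "Extract claims; drop Q&A-style and quoted content."
FORUM_PROMPT = "Extract claims; keep Q&A-style and quoted content."

def extract_claims(text, mode="podcast"):
    # Forum threads are mostly Q&A and quotes, so they need a prompt
    # that keeps exactly what the podcast prompt deliberately drops.
    prompt = FORUM_PROMPT if mode == "forum" else PODCAST_PROMPT
    return {"prompt": prompt, "text": text}
```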
## Cost guards

New podcast YAML → 3-episode cap on first run. Known YAML → 25-episode cap. Orchestrator timeouts sit just above the Modal container cap. See feedback memory.
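The caps can be sketched as a simple guard (the function name and the `is_new_show` flag are assumptions; the real guard lives in the orchestrator):

```python
NEW_SHOW_CAP = 3     # first run of a brand-new podcast YAML
KNOWN_SHOW_CAP = 25  # every later run

def capped_episodes(episodes, is_new_show):
    # Cost guard: bound how many episodes one run may transcribe and
    # research, so a new or misconfigured YAML can't trigger a giant
    # paid backfill.
    cap = NEW_SHOW_CAP if is_new_show else KNOWN_SHOW_CAP
    return episodes[:cap]
```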
## Idempotency

The pipeline is crash-safe: re-running after a mid-pipeline failure produces the same final state. Mechanism:

- Atomic writes (tmp → `os.replace`) at every layer
- `_raw_json_is_complete()` guard: an in-progress raw.json never overwrites a complete one
- Prompt-version sidecar: `<ep>.raw.meta.json` records which prompt version produced the raw.json; `kb/stale_prompts.py` finds claims needing re-extraction after prompt edits
21 dedicated idempotency tests cover this.
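A minimal sketch of the first two guards working together. The `"complete"` marker key is an assumption for illustration; the real check and write path are documented in `IDEMPOTENCY.md`:

```python
import json
import os
import tempfile

def _raw_json_is_complete(path):
    # Sketch of the completeness guard: treat a raw.json as complete only
    # if it parses and carries a terminal marker (key name assumed here).
    try:
        with open(path) as f:
            return json.load(f).get("complete", False)
    except (OSError, ValueError):
        return False

def write_raw_json(path, payload):
    # Atomic write: build the file under a tmp name, then os.replace()
    # so readers never observe a half-written raw.json. Skip the write
    # entirely if a complete file already exists (idempotent re-runs).
    if _raw_json_is_complete(path):
        return False
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, path)
    return True
```

`os.replace` is atomic on POSIX filesystems, which is what makes a crash mid-write leave either the old complete file or no file, never a truncated one.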
## Triage (pre-research) vs Triage (kb application queue)

Two different things both called "triage":

1. `~/data-ingestion/triage.py` — pre-research relevance scoring. Filters episodes BEFORE they cost money to research. "Is this episode worth the $0.05?"
2. `~/garmin-warehouse/kb/triage.py` — kb application-queue TUI. Walks claims surfaced by `watches.yaml` and lets Casey apply/dismiss them per finding. "Should this claim show up in `bloodwork_baseline.md`?"
The Worker UI's inline triage (commit 2 of UI plan) hits #2, not #1.
## Related pages

- `systems/garmin-warehouse.md` — sister system that consumes data-ingestion output
- `runbooks/data-ingestion-failures.md` — recovery when the Sun 6am cron breaks
- `reference/cron-schedules.md`