Runbook: kb rebuild
Use this runbook when the kb (~/garmin-warehouse/kb/kb.duckdb) is corrupted,
behind, or when you want to verify a known-good state. The deterministic
rebuild path is the recovery, not "rerun load.py and pray."
When you need this
- kb/kb.duckdb is corrupted (duckdb.IOException or similar)
- Schema migrations got applied wrong
- Embeddings table is incomplete (semantic search returns zero results)
- After a major prompt change in data-ingestion/research.py that invalidates many claims
- Corpus diff shows unexpected drift you can't explain
The reproducible rebuild
This script:
- Computes input sha256s for every *.raw.json + every YAML config
- Writes a rebuild_manifest.json recording inputs + git SHA of the warehouse repo
- Drops kb.duckdb (after confirming you want to)
- Runs migrations from scratch via kb/migrate.py up
- Runs kb/load.py (loads claims/studies/episodes from raw.json files registered in YAMLs)
- Runs kb/embed.py (Voyage embeddings, content-hash cached, so if hashes match, historical embeddings get reused; much faster than embedding from scratch)
- Writes verification report
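The first two steps (hash the inputs, then record them with the git SHA) could look roughly like the sketch below. It is an illustration, not the script's actual code: the function names, the manifest keys (`git_sha`, `inputs`), and the `*.yaml`/`*.yml` glob patterns are all assumptions.

```python
import hashlib
import json
import subprocess
from pathlib import Path
from typing import Optional


def sha256_file(path: Path) -> str:
    """Stream a file through sha256 so large raw.json files are not read whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_input_hashes(root: Path, patterns=("*.raw.json", "*.yaml", "*.yml")) -> dict:
    """Map each matching file (path relative to root) to its sha256.

    ASSUMPTION: the glob patterns are illustrative; the real script may
    discover its inputs differently.
    """
    hashes = {}
    for pattern in patterns:
        for path in root.rglob(pattern):
            if path.is_file():
                hashes[path.relative_to(root).as_posix()] = sha256_file(path)
    return dict(sorted(hashes.items()))


def git_sha(repo: Path) -> Optional[str]:
    """Return HEAD's commit SHA, or None if git is missing or repo is not a checkout."""
    try:
        out = subprocess.run(
            ["git", "-C", str(repo), "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return out.stdout.strip()


def write_manifest(root: Path, repo: Path, out_path: Path) -> dict:
    """Write a manifest with hypothetical keys `git_sha` and `inputs`."""
    manifest = {"git_sha": git_sha(repo), "inputs": collect_input_hashes(root)}
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Sorting the keys keeps the manifest byte-stable across runs, so two manifests can be compared with a plain diff to see which inputs changed.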
Verification
Checks:
- All FK constraints satisfied (no orphan claims/studies)
- Embedding count matches claim count
- Migration version matches latest migration file
- HNSW index is built
- No claims with NULL claim_prompt_version (passing means the
_raw_meta.json sidecars are present and readable)
If verify fails, the manifest tells you exactly what state was loaded.
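Two of the checks above reduce to simple count queries. The sketch below shows one way they might be expressed; it is not the verify script itself. The `claims` table and `claim_prompt_version` column appear in this runbook, but the `embeddings` table name, the one-row-per-claim layout, and the function names are assumptions.

```python
def run_checks(con) -> dict:
    """Run two verification queries; True means the check passed.

    `con` is any connection whose execute() returns an object with
    fetchone(), for example duckdb.connect(path, read_only=True).
    """
    def scalar(sql: str):
        return con.execute(sql).fetchone()[0]

    claim_count = scalar("SELECT COUNT(*) FROM claims")
    # ASSUMPTION: embeddings live in a table named `embeddings`, one row per claim.
    embedding_count = scalar("SELECT COUNT(*) FROM embeddings")
    null_versions = scalar(
        "SELECT COUNT(*) FROM claims WHERE claim_prompt_version IS NULL"
    )
    return {
        "embedding_count_matches_claims": embedding_count == claim_count,
        "no_null_claim_prompt_version": null_versions == 0,
    }


def failed_checks(results: dict) -> list:
    """Names of the checks that did not pass, sorted for stable output."""
    return sorted(name for name, ok in results.items() if not ok)
```

Returning a name-to-boolean mapping rather than raising on the first failure lets a report list every failed check in one pass.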
Faster: incremental rather than full rebuild
If only one show changed:
# Rebuilds only what's stale (uses content_hash cache for embeddings):
uv run python ~/garmin-warehouse/kb/sync.py
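The content-hash cache mentioned above can be pictured as follows: hash each claim's text, and call the embedding API only for hashes not already cached. This is a generic sketch of the technique, not the code in kb/sync.py or kb/embed.py. `embed_with_cache`, the dict-based `cache`, and the `embed_batch` callable (a stand-in for the real Voyage call) are all hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """sha256 of the UTF-8 text; identical text always maps to the same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts, cache, embed_batch):
    """Return one vector per text, calling embed_batch only for uncached hashes.

    cache:       dict mapping content hash -> vector (hypothetical storage)
    embed_batch: callable taking a list of texts and returning a list of
                 vectors in the same order (stand-in for the embedding API)
    """
    hashes = [content_hash(t) for t in texts]
    # Deduplicate misses while keeping first-seen order.
    missing = {}
    for h, t in zip(hashes, texts):
        if h not in cache and h not in missing:
            missing[h] = t
    if missing:
        vectors = embed_batch(list(missing.values()))
        for h, vec in zip(missing, vectors):
            cache[h] = vec
    return [cache[h] for h in hashes]
```

Because the key is derived from the text alone, an unchanged claim costs no API call on a rebuild, while any edit to the text produces a new hash and a fresh embedding.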
If only prompts changed and you want to re-extract some claims:
# Find stale claims:
uv run python ~/garmin-warehouse/kb/stale_prompts.py
# Re-research the affected episodes (in data-ingestion):
cd ~/data-ingestion
uv run python research.py --episode <ep-id> --force
# Then re-sync the kb:
cd ~/garmin-warehouse
uv run python kb/sync.py
Restore from R2
If local kb is hosed and rebuild also fails (e.g. raw.json files corrupted):
# Most recent backup (from yesterday's 7am sync):
rclone copy r2:garmin-warehouse-data/kb/latest/kb.duckdb ~/garmin-warehouse/kb/
# Or a specific date:
rclone copy r2:garmin-warehouse-data/kb/2026-05-04/kb.duckdb ~/garmin-warehouse/kb/
# Verify it loads:
uv run python -c "
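import os
import duckdb
# expanduser keeps the path user-independent; read_only avoids modifying the restored file
con = duckdb.connect(os.path.expanduser('~/garmin-warehouse/kb/kb.duckdb'), read_only=True)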
print(con.execute('SELECT COUNT(*) FROM claims').fetchone())
"
R2 keeps both kb/<DATE>/kb.duckdb (per-day snapshot, retained
indefinitely until a lifecycle rule is added) and kb/latest/kb.duckdb
(rolling head). See reference/r2-layout.md.
Schema history
# What migrations are applied:
uv run python ~/garmin-warehouse/kb/migrate.py status
# What migrations exist on disk:
ls ~/garmin-warehouse/kb/migrations/
If status shows fewer applied than exist on disk, run
kb/migrate.py up (idempotent — applies only what's missing).
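The comparison between applied and on-disk migrations amounts to a set difference. The sketch below assumes one file per migration, that sorted file names give the apply order, and that applied migrations are recorded by file stem. None of that is confirmed by this runbook, and `pending_migrations` is a hypothetical helper, not part of kb/migrate.py.

```python
from pathlib import Path


def pending_migrations(migrations_dir: Path, applied: set) -> list:
    """Migration stems present on disk but absent from `applied`, in sorted order.

    ASSUMPTION: each migration is a single file and `applied` holds file stems.
    """
    on_disk = sorted(p.stem for p in migrations_dir.iterdir() if p.is_file())
    return [name for name in on_disk if name not in applied]
```

An empty result corresponds to the case where status and the directory listing agree, so running up would be a no-op.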
When NOT to rebuild
- A few claims look weird → just dismiss them via triage TUI; rebuild is overkill
- Embeddings stale on a few new episodes → kb/sync.py is fine, no full rebuild needed
- Adding a new podcast → drop YAML in ~/data-ingestion/podcasts/, run ~/data-ingestion/batch.py for that show, then kb/sync.py picks it up automatically (kb/load.py uses rglob on raw.json files and a YAML-derived show registry)