Runbook: kb rebuild

Use this runbook when the kb (~/garmin-warehouse/kb/kb.duckdb) is corrupted, is out of date, or needs to be checked against a known-good state. Recovery means running the deterministic rebuild path, not "rerun load.py and pray."

When you need this

  • kb/kb.duckdb is corrupted (duckdb.IOException or similar)
  • Schema migrations got applied wrong
  • Embeddings table is incomplete (semantic search returns zero results)
  • After a major prompt change in data-ingestion/research.py that invalidates many claims
  • Corpus diff shows unexpected drift you can't explain

The reproducible rebuild

~/garmin-warehouse/scripts/rebuild_kb.sh

This script:

  1. Computes input sha256s for every *.raw.json + every YAML config
  2. Writes a rebuild_manifest.json recording inputs + git SHA of the warehouse repo
  3. Drops kb.duckdb (after confirming you want to)
  4. Runs migrations from scratch via kb/migrate.py up
  5. Runs kb/load.py (loads claims/studies/episodes from raw.json files registered in YAMLs)
  6. Runs kb/embed.py (Voyage embeddings, cached by content hash: when a hash matches, the historical embedding is reused, which is much faster than embedding from scratch)
  7. Writes verification report
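
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the contents of rebuild_kb.sh: the function names, the manifest keys (`git_sha`, `inputs`), and the assumption that all inputs live under a single root directory are hypothetical.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large raw.json files are not read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, git_sha: str) -> dict:
    """Record a sha256 for every *.raw.json and YAML file under root (assumed layout)."""
    patterns = ("*.raw.json", "*.yaml", "*.yml")
    files = sorted({p for pat in patterns for p in root.rglob(pat)})
    return {
        "git_sha": git_sha,
        "inputs": {str(p.relative_to(root)): sha256_of(p) for p in files},
    }


def current_git_sha(repo: Path) -> str:
    """Ask git for the commit currently checked out in the warehouse repo."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest with sorted keys so two runs on identical inputs diff cleanly."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Sorting the file list and the JSON keys keeps the manifest byte-stable, so two rebuilds from identical inputs can be compared with a plain diff.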

Verification

~/garmin-warehouse/scripts/verify_kb.sh

Checks:

  • All FK constraints satisfied (no orphan claims/studies)
  • Embedding count matches claim count
  • Migration version matches latest migration file
  • HNSW index is built
  • No claims with NULL claim_prompt_version (means _raw_meta.json sidecars are present + readable)

If verify fails, the manifest tells you exactly what state was loaded.
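
Two of the checks above (embedding count and NULL claim_prompt_version) could look something like the sketch below. It is not the code behind verify_kb.sh. The `claims` table and `claim_prompt_version` column come from this runbook, while the `embeddings` table name and the one-row-per-claim layout are assumptions.

```python
def verify_counts(con) -> list[str]:
    """Return human-readable failures; an empty list means both checks passed.

    `con` is any connection exposing execute(...).fetchone(), as DuckDB
    connections do. The `embeddings` table name is an assumption.
    """
    failures = []

    claims = con.execute("SELECT COUNT(*) FROM claims").fetchone()[0]
    embeddings = con.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
    if claims != embeddings:
        failures.append(f"embedding count {embeddings} != claim count {claims}")

    # A NULL prompt version suggests a _raw_meta.json sidecar was missing or unreadable.
    null_versions = con.execute(
        "SELECT COUNT(*) FROM claims WHERE claim_prompt_version IS NULL"
    ).fetchone()[0]
    if null_versions:
        failures.append(f"{null_versions} claims have NULL claim_prompt_version")

    return failures
```

Collecting failures in a list instead of raising on the first one lets a single run report everything that is wrong.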

Faster: incremental rather than full rebuild

If only one show changed:

# Rebuilds only what's stale (uses content_hash cache for embeddings):
uv run python ~/garmin-warehouse/kb/sync.py
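
The content-hash cache mentioned in the comment above can be illustrated with a small sketch. It assumes the cache maps a sha256 of the claim text to its vector. `embed_with_cache` and `embed_fn` are hypothetical names, and `embed_fn` stands in for whatever call kb/embed.py actually makes to the embedding provider.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def content_hash(text: str) -> str:
    """Identify a text by the sha256 of its UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(
    texts: List[str],
    cache: Dict[str, Vector],
    embed_fn: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Embed only texts whose hash is not cached; reuse stored vectors for the rest."""
    hashes = [content_hash(t) for t in texts]

    # Collect each uncached text once, even if it appears several times.
    missing: Dict[str, str] = {}
    for h, t in zip(hashes, texts):
        if h not in cache:
            missing.setdefault(h, t)

    if missing:
        vectors = embed_fn(list(missing.values()))
        for h, v in zip(missing, vectors):
            cache[h] = v

    return [cache[h] for h in hashes]
```

Under this scheme an unchanged claim never triggers a new embedding request, which is why a sync after a small change touches only the stale rows.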

If only prompts changed and you want to re-extract some claims:

# Find stale claims:
uv run python ~/garmin-warehouse/kb/stale_prompts.py

# Re-research the affected episodes (in data-ingestion):
cd ~/data-ingestion
uv run python research.py --episode <ep-id> --force

# Then re-sync the kb:
cd ~/garmin-warehouse
uv run python kb/sync.py

Restore from R2

If local kb is hosed and rebuild also fails (e.g. raw.json files corrupted):

# Most recent backup (from yesterday's 7am sync):
rclone copy r2:garmin-warehouse-data/kb/latest/kb.duckdb ~/garmin-warehouse/kb/

# Or a specific date:
rclone copy r2:garmin-warehouse-data/kb/2026-05-04/kb.duckdb ~/garmin-warehouse/kb/

# Verify it loads:
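uv run python -c "
import duckdb
from pathlib import Path
# Resolve the path from the home directory instead of hard-coding a username
db = Path.home() / 'garmin-warehouse' / 'kb' / 'kb.duckdb'
con = duckdb.connect(str(db), read_only=True)
print(con.execute('SELECT COUNT(*) FROM claims').fetchone())
"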

R2 keeps both kb/<DATE>/kb.duckdb (a per-day snapshot, retained indefinitely until a lifecycle rule is added) and kb/latest/kb.duckdb (the rolling head). See reference/r2-layout.md.

Schema history

# What migrations are applied:
uv run python ~/garmin-warehouse/kb/migrate.py status

# What migrations exist on disk:
ls ~/garmin-warehouse/kb/migrations/

If status shows fewer applied than exist on disk, run kb/migrate.py up (idempotent — applies only what's missing).
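
The "applies only what's missing" behaviour amounts to a set difference between the files on disk and the applied migrations. The sketch below is illustrative only: `pending_migrations` is a hypothetical helper, and it assumes that migrations are tracked by file name and that sorting the names gives the apply order (for example via a numeric prefix). Neither assumption is confirmed by kb/migrate.py.

```python
from pathlib import Path


def pending_migrations(migrations_dir: Path, applied: set[str]) -> list[str]:
    """List migration files on disk that are not yet recorded as applied.

    Assumes file names sort into apply order; this naming scheme is a guess.
    """
    on_disk = sorted(p.name for p in migrations_dir.iterdir() if p.is_file())
    return [name for name in on_disk if name not in applied]
```

Running an "up" step over this list is idempotent by construction: once every file is recorded as applied, the list is empty and nothing runs.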

When NOT to rebuild

  • A few claims look weird → just dismiss them via triage TUI; rebuild is overkill
  • Embeddings stale on a few new episodes → kb/sync.py is fine, no full rebuild needed
  • Adding a new podcast → drop YAML in ~/data-ingestion/podcasts/, run ~/data-ingestion/batch.py for that show, then kb/sync.py picks it up automatically (kb/load.py uses rglob on raw.json files and a YAML-derived show registry)