ADR 005: kb migrations are source of truth, not schema.sql

Status: accepted
Date: 2026-04-30 (formalized in this ADR 2026-05-04)
Supersedes: —

Context

Early in the kb's life, the schema lived as a single kb/schema.sql file containing all CREATE TABLE statements. kb/load.py would run DROP TABLE ... CASCADE; CREATE TABLE ... from schema.sql at the start of every load. This worked when:

  • Casey was the only user of the kb
  • Rebuilds were cheap (small corpus)
  • No data state needed preserving across loads

But three things broke that model:

  1. Triage state matters. As the corpus grew and Casey curated applied.yaml / dismissed.yaml, dropping and recreating tables meant kb/load.py would briefly produce a kb with no rows before reloading from raw.json. The triage state itself survived (the YAMLs live outside the database), but during that window the UI showed empty results.
  2. Embedding cost. Re-embedding 5,555 claims after every load is wasteful even with content-hash caching. We want the load to update the structured tables without disturbing embeddings_claim rows whose content hasn't changed.
  3. Schema evolution. Adding a column (e.g. claim_prompt_version in migration 002) with the schema.sql + DROP+CREATE pattern means any existing per-claim metadata (like ad-hoc notes) gets lost. Migrations preserve data; DROP+CREATE doesn't.

Decision

Schema is in kb/migrations/*.sql (managed by kb/migrate.py).

  • kb/schema.sql stays as a historical/seed-only reference. It's the schema as of v1, no longer authoritative.
  • kb/migrate.py reads kb/migrations/00N_*.sql files in order, applies any that aren't already recorded in the _migrations metadata table, and records each application.
  • kb/load.py uses DELETE (not DROP+CREATE) to clear rows for re-load. Schema is unchanged across loads.
  • All schema changes go through a numbered migration. v4 is current (as of 2026-05-04).
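The apply-pending loop described above can be sketched in a few lines. This is an illustration, not the real kb/migrate.py: it uses sqlite3 for portability (the kb itself uses DuckDB), and the _migrations column names are assumptions:

```python
import sqlite3
from pathlib import Path

def apply_pending(con: sqlite3.Connection, migrations_dir: Path) -> list[str]:
    """Apply 00N_*.sql files not yet recorded in _migrations, in order."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS _migrations ("
        "name TEXT PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    done = {row[0] for row in con.execute("SELECT name FROM _migrations")}
    applied = []
    for path in sorted(migrations_dir.glob("[0-9][0-9][0-9]_*.sql")):
        if path.name in done:
            continue  # already applied; skipping makes re-runs idempotent
        con.executescript(path.read_text())
        con.execute("INSERT INTO _migrations (name) VALUES (?)", (path.name,))
        applied.append(path.name)
    con.commit()
    return applied
```

Recording each file name in _migrations is what makes `migrate.py up` safe to run repeatedly: a second run finds nothing pending and does nothing.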

Migration history

#    Description                                        Why
001  Initial schema                                     Bootstrap
002  Add claim_prompt_version column to claims          Track which prompt version produced each claim; enables kb/stale_prompts.py
003  Add content_hash to embedding tables; backfill     Skip-if-unchanged embedding cache (Voyage cost)
004  Episode-id normalization (strip leading zeros)     swap-ep-007 → swap-ep-7 for cross-show consistency; 60 legacy rows preserved due to FK constraints
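Migration 004's normalization rule, expressed as a hypothetical Python helper for illustration (the actual migration is SQL; the function name and regex are not from the kb):

```python
import re

def normalize_ep_id(ep_id: str) -> str:
    """Strip leading zeros from the numeric suffix: swap-ep-007 -> swap-ep-7."""
    return re.sub(r"-0+(\d)", r"-\1", ep_id)
```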

Consequences

Good:

  • Data state preserved across loads — UI never shows a mid-load empty state
  • Embedding cache becomes possible (migration 003 enabled it)
  • Schema evolution doesn't lose data
  • Migration history is git-tracked and reproducible
  • New collaborators (or future Casey) can read the migrations in order to understand schema evolution

Tradeoffs:

  • One more concept to learn ("schema lives in the migrations dir")
  • schema.sql is now misleading if you don't know it's historical — hence this ADR and the reference/kb-schema.md warning
  • Every new schema change requires writing a migration file (small friction, but the right kind)

How to add a new migration

# 1. Write the migration file:
$EDITOR kb/migrations/005_add_my_feature.sql

# Example:
# ALTER TABLE claims ADD COLUMN reviewed BOOLEAN DEFAULT FALSE;
# CREATE INDEX claims_reviewed_idx ON claims (reviewed) WHERE reviewed;

# 2. Apply (idempotent):
uv run python kb/migrate.py up

# 3. Check status:
uv run python kb/migrate.py status

# 4. Commit:
git add kb/migrations/005_add_my_feature.sql
git commit -m "kb: migration 005 — add reviewed flag"

Future-proofing

If migrations become sufficiently complex (data-mutating, multi-step, multi-table), graduate to a real migration framework (alembic-style). For now, raw SQL files + a metadata table are sufficient.

What this implies for code

  • Don't ever DROP TABLE outside a migration. Period.
  • load.py DELETE is the right pattern for "clear and re-import."
  • New analysis code should SELECT the columns it needs, not assume schema layout by reading schema.sql.
  • When verify_kb.sh checks integrity, it compares against the applied migrations, not against schema.sql.
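The DELETE-based "clear and re-import" pattern from the second point can be sketched as follows. This is a sketch only: sqlite3 stands in for DuckDB, and the table and column names (claims, id, text) are placeholders:

```python
import sqlite3

def reload_claims(con: sqlite3.Connection, rows: list[tuple[int, str]]) -> None:
    """Clear and re-import claims without touching the schema.

    DELETE keeps the table definition intact, so columns added by later
    migrations survive; DROP+CREATE from schema.sql would discard them.
    """
    with con:  # one transaction: the empty state is never committed on its own
        con.execute("DELETE FROM claims")
        con.executemany("INSERT INTO claims (id, text) VALUES (?, ?)", rows)
```

Wrapping the DELETE and the INSERTs in a single transaction also addresses the mid-load empty window noted in the Context section.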

Tooling reminders

# Where am I?
uv run python kb/migrate.py status

# Apply pending:
uv run python kb/migrate.py up

# Inspect what's in the kb:
uv tool run harlequin --adapter duckdb ~/garmin-warehouse/kb/kb.duckdb

References