ADR 005: kb migrations are source of truth, not schema.sql

Status: accepted
Date: 2026-04-30 (formalized in this ADR 2026-05-04)
Supersedes: —

Context

Early in the kb's life, the schema lived as a single kb/schema.sql file containing all CREATE TABLE statements. kb/load.py would run DROP TABLE ... CASCADE; CREATE TABLE ... from schema.sql at the start of every load. This worked when:

  • Casey was the only user of the kb
  • Rebuilds were cheap (small corpus)
  • No data state needed preserving across loads

But three things broke that model:

  1. Triage state matters. As the corpus grew and Casey curated applied.yaml / dismissed.yaml, dropping and recreating tables meant kb/load.py would briefly produce a kb with no rows before reloading from raw.json. The triage state itself survived (the YAMLs live outside the database), but during that window the UI showed empty results.
  2. Embedding cost. Re-embedding 5,555 claims after every load is wasteful even with content-hash caching. We want the load to update the structured tables without disturbing embeddings_claim rows whose content hasn't changed.
  3. Schema evolution. Adding a column (e.g. claim_prompt_version in migration 002) with the schema.sql + DROP+CREATE pattern means any existing per-claim metadata (like ad-hoc notes) gets lost. Migrations preserve data; DROP+CREATE doesn't.

Decision

Schema is in kb/migrations/*.sql (managed by kb/migrate.py).

  • kb/schema.sql stays as a historical/seed-only reference. It's the schema as of v1, no longer authoritative.
  • kb/migrate.py reads kb/migrations/00N_*.sql files in order, applies any that aren't already recorded in the _migrations metadata table, and records each application.
  • kb/load.py uses DELETE (not DROP+CREATE) to clear rows for re-load. Schema is unchanged across loads.
  • All schema changes go through a numbered migration. v4 is current (as of 2026-05-04).
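The apply-pending loop described above can be sketched in a few lines. This is an illustration, not the real kb/migrate.py: it uses sqlite3 for portability (the kb itself uses DuckDB), and the _migrations column names are assumptions:

```python
import sqlite3
from pathlib import Path

def apply_pending(con: sqlite3.Connection, migrations_dir: Path) -> list[str]:
    """Apply 00N_*.sql files not yet recorded in _migrations, in order."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS _migrations ("
        "name TEXT PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    done = {row[0] for row in con.execute("SELECT name FROM _migrations")}
    applied = []
    for path in sorted(migrations_dir.glob("[0-9][0-9][0-9]_*.sql")):
        if path.name in done:
            continue  # already applied; skipping makes re-runs idempotent
        con.executescript(path.read_text())
        con.execute("INSERT INTO _migrations (name) VALUES (?)", (path.name,))
        applied.append(path.name)
    con.commit()
    return applied
```

Recording each file name in _migrations is what makes `migrate.py up` safe to run repeatedly: a second run finds nothing pending and does nothing.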

Migration history

#    Description                                        Why
001  Initial schema                                     Bootstrap
002  Add claim_prompt_version column to claims          Track which prompt version produced each claim; enables kb/stale_prompts.py
003  Add content_hash to embedding tables; backfill     Skip-if-unchanged embedding cache (Voyage cost)
004  Episode-id normalization (strip leading zeros)     swap-ep-007 → swap-ep-7 for cross-show consistency; 60 legacy rows preserved due to FK constraints
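Migration 004's normalization rule, expressed as a hypothetical Python helper for illustration (the actual migration is SQL; the function name and regex are not from the kb):

```python
import re

def normalize_ep_id(ep_id: str) -> str:
    """Strip leading zeros from the numeric suffix: swap-ep-007 -> swap-ep-7."""
    return re.sub(r"-0+(\d)", r"-\1", ep_id)
```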

Consequences

Good:

  • Data state preserved across loads — UI never shows a mid-load empty state
  • Embedding cache becomes possible (migration 003 enabled it)
  • Schema evolution doesn't lose data
  • Migration history is git-tracked and reproducible
  • New collaborators (or future Casey) can read the migrations in order to understand schema evolution

Tradeoffs:

  • One more concept to learn ("schema lives in the migrations dir")
  • schema.sql is now misleading if you don't know it's historical — hence this ADR and the reference/kb-schema.md warning
  • Every new schema change requires writing a migration file (small friction, but the right kind)

How to add a new migration

# 1. Write the migration file:
$EDITOR kb/migrations/005_add_my_feature.sql

# Example:
# ALTER TABLE claims ADD COLUMN reviewed BOOLEAN DEFAULT FALSE;
# CREATE INDEX claims_reviewed_idx ON claims (reviewed) WHERE reviewed;

# 2. Apply (idempotent):
uv run python kb/migrate.py up

# 3. Check status:
uv run python kb/migrate.py status

# 4. Commit:
git add kb/migrations/005_add_my_feature.sql
git commit -m "kb: migration 005 — add reviewed flag"

Future-proofing

If migrations become sufficiently complex (data-mutating, multi-step, multi-table), graduate to a real migration framework (alembic-style). For now, raw SQL files + a metadata table are sufficient.

What this implies for code

  • Don't ever DROP TABLE outside a migration. Period.
  • load.py DELETE is the right pattern for "clear and re-import."
  • New analysis code should SELECT the columns it needs, not assume schema layout by reading schema.sql.
  • When verify_kb.sh checks integrity, it compares against the applied migrations, not against schema.sql.
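The DELETE-based "clear and re-import" pattern from the second point can be sketched as follows. This is a sketch only: sqlite3 stands in for DuckDB, and the table and column names (claims, id, text) are placeholders:

```python
import sqlite3

def reload_claims(con: sqlite3.Connection, rows: list[tuple[int, str]]) -> None:
    """Clear and re-import claims without touching the schema.

    DELETE keeps the table definition intact, so columns added by later
    migrations survive; DROP+CREATE from schema.sql would discard them.
    """
    with con:  # one transaction: the empty state is never committed on its own
        con.execute("DELETE FROM claims")
        con.executemany("INSERT INTO claims (id, text) VALUES (?, ?)", rows)
```

Wrapping the DELETE and the INSERTs in a single transaction also addresses the mid-load empty window noted in the Context section.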

Tooling reminders

# Where am I?
uv run python kb/migrate.py status

# Apply pending:
uv run python kb/migrate.py up

# Inspect what's in the kb:
uv tool run harlequin --adapter duckdb ~/garmin-warehouse/kb/kb.duckdb

References