kb Schema

The kb is a DuckDB database at ~/garmin-warehouse/kb/kb.duckdb with 9 tables. The schema is managed by kb/migrate.py via numbered migrations in kb/migrations/*.sql, currently at v4 in production.

Critical: schema.sql is NOT authoritative

kb/schema.sql exists but is a historical, seed-only definition. The actual schema is whatever migrations kb/migrate.py has applied from kb/migrations/*.sql. To check the current state:

uv run python kb/migrate.py status

kb/load.py refreshes tables with DELETE rather than DROP+CREATE, so reloads keep the migrated schema intact and don't need migration tags.
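
A minimal sketch of that reload pattern (the column list and values are illustrative, not the real claims DDL):

BEGIN TRANSACTION;
-- Clear the episode's old rows; the migrated table definition stays intact.
DELETE FROM claims WHERE episode_id = 'swap-ep-174';
-- Re-insert freshly extracted rows (hypothetical values):
INSERT INTO claims (episode_id, kind, claim_prompt_version)
VALUES ('swap-ep-174', 'protocol', 4);
COMMIT;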

Tables

Core content

| Table | Rows (2026-05-04) | Purpose |
| --- | --- | --- |
| episodes | 544 | One row per podcast episode. Has feed column for multi-show disambiguation. |
| claims | 5,555 | Extracted claims, linked to episode. Has claim_prompt_version for stale-prompt detection. |
| studies | 472 | Resolved citations (PMID/DOI). Many-to-many with claims via claim_studies. |
| claim_studies | — | Junction table. |
| findings | 19 | Curated synthesis topics (mirrors findings/*.md files). |
| finding_topics | — | Topic chips per finding. |
| xref | — | Cross-references. dst_kind distinguishes finding↔finding from finding→study. |
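
To walk the claims↔studies many-to-many, join through the junction table. A sketch (claim_id, study_id, s.id, and pmid are assumed column names, not confirmed DDL):

-- Claims paired with their resolved citations:
SELECT cs.claim_id, s.pmid
FROM claim_studies cs
JOIN studies s ON s.id = cs.study_id
LIMIT 10;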

Embeddings

| Table | Rows | Purpose |
| --- | --- | --- |
| embeddings_claim | 5,555 | Voyage 3-large 1024d vectors per claim, indexed via HNSW |
| embeddings_finding | 19 | Same for findings |

The embedding cache is keyed by content_hash (migration 003): if a claim's text didn't change, re-embedding is a no-op.
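
One way to see what the cache would re-embed (assumes a content_hash column on both tables; the real cache layout may differ):

-- Claims whose current text has no cached vector yet:
SELECT c.id
FROM claims c
LEFT JOIN embeddings_claim e ON e.content_hash = c.content_hash
WHERE e.content_hash IS NULL;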

Migration history

(Each migration in kb/migrations/ is numbered and applied in order. Current: v4.)

  • 001 — initial schema
  • 002 — added claim_prompt_version column to claims
  • 003 — added content_hash-keyed embedding cache
  • 004 — episode-id normalization (strip leading zeros). The 60 legacy rows are preserved because DuckDB FK constraints prevent an in-place rename.
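
A hypothetical sketch of that normalization (illustrative only; the real 004 migration may differ):

-- Strip leading zeros from the numeric suffix of an episode id:
SELECT regexp_replace('swap-ep-0174', '-0+(\d+)$', '-\1');
-- returns 'swap-ep-174'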

To add a new migration, write kb/migrations/00N_description.sql and apply it:

uv run python kb/migrate.py up

migrate.py records applied versions in a metadata table.
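
As a hypothetical example, a migration file adding a column might look like this (file name and column invented for illustration):

-- kb/migrations/005_add_episode_duration.sql (hypothetical)
ALTER TABLE episodes ADD COLUMN duration_seconds INTEGER;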

Key columns to know

episodes

  • id — like swap-ep-174, letsrun-thread-12345 (multi-source-aware)
  • feed — the source show identifier. Use SELECT DISTINCT feed FROM episodes to populate filter dropdowns
  • digest_path — file path for "open in editor" buttons
  • pub_date — for "show me X from the last month" queries
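
A recency filter along those lines (assumes pub_date is stored as a DATE or TIMESTAMP):

-- Episodes published in the last month:
SELECT id, pub_date
FROM episodes
WHERE pub_date >= current_date - INTERVAL 30 DAY
ORDER BY pub_date DESC;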

claims

  • episode_id — FK to episodes.id
  • kind — claim taxonomy
  • host_stance — endorses / skeptical / debating
  • otq_relevance — 0–5 score (combined with host_stance in the sketch after this list)
  • claim_prompt_version — for finding stale claims after prompt edits
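
host_stance and otq_relevance combine naturally for triage pulls, e.g.:

-- High-relevance claims the host endorsed:
SELECT episode_id, kind, otq_relevance
FROM claims
WHERE host_stance = 'endorses' AND otq_relevance >= 4
ORDER BY otq_relevance DESC;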

findings

  • id — matches the markdown filename stem
  • title — first H1 of the markdown body
  • study_links — cached count of linked studies, used for graph node sizing
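
Because xref records finding→study edges, the cached count can be cross-checked against it (assuming study_links counts exactly those edges):

-- Recompute study-link counts from xref and compare to the cached column:
SELECT f.id, f.study_links, COUNT(x.dst) AS recomputed
FROM findings f
LEFT JOIN xref x ON x.src = f.id AND x.dst_kind = 'study'
GROUP BY f.id, f.study_links;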

xref

  • src — finding ID
  • dst — finding ID or study PMID
  • dst_kind — finding | study

Triage state (NOT in DuckDB)

The triage state is in YAML files at ~/garmin-warehouse/kb/, not in the kb itself:

  • watches.yaml — declarative: "for finding X, watch for claims matching pattern Y"
  • applied.yaml — append-only: "this claim got applied to this finding on this date"
  • dismissed.yaml — append-only: "this claim is not relevant to this finding"

kb/review.py builds the application queue by joining watches.yaml against claims + embeddings_claim and excluding what's in applied.yaml/dismissed.yaml.
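
In spirit, the exclusion step is an anti-join. A sketch, assuming the YAML entries are first loaded into staging tables named candidate_claims, applied, and dismissed (all three names hypothetical, not real kb tables):

-- Queue = watch matches minus anything already applied or dismissed:
SELECT cc.claim_id, cc.finding_id
FROM candidate_claims cc
WHERE NOT EXISTS (
  SELECT 1 FROM applied a
  WHERE a.claim_id = cc.claim_id AND a.finding_id = cc.finding_id)
AND NOT EXISTS (
  SELECT 1 FROM dismissed d
  WHERE d.claim_id = cc.claim_id AND d.finding_id = cc.finding_id);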

This pattern means a kb rebuild doesn't lose triage state — YAML files are the durable journal. The DuckDB is rebuildable; the YAMLs are not.

Useful queries

-- Claims per show:
SELECT e.feed, COUNT(*) FROM claims c JOIN episodes e ON c.episode_id = e.id
GROUP BY e.feed ORDER BY 2 DESC;

-- Stale-prompt claims (claims extracted with old prompts):
SELECT episode_id, COUNT(*)
FROM claims
WHERE claim_prompt_version < (SELECT MAX(claim_prompt_version) FROM claims)
GROUP BY episode_id;

-- Findings with most study links:
SELECT id, title, study_links FROM findings ORDER BY study_links DESC LIMIT 10;

-- Cross-references for a finding:
SELECT dst, dst_kind FROM xref WHERE src = 'bloodwork_baseline';

For ad-hoc exploration, use Harlequin (already installed via uv tool install):

uv tool run harlequin --adapter duckdb ~/garmin-warehouse/kb/kb.duckdb