kb Schema¶
The kb is a DuckDB database at `~/garmin-warehouse/kb/kb.duckdb` with 9 tables. The schema
is managed by `kb/migrate.py` via numbered migrations in
`kb/migrations/*.sql`, currently at v4 in production.
Critical: schema.sql is NOT authoritative¶
`kb/schema.sql` exists but is the historical, seed-only definition.
The actual schema is whatever `kb/migrations/*.sql` has applied. To
check the current state, inspect the database itself.
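One way is to ask the database directly (a sketch, assuming the `duckdb` CLI is installed):

```shell
# List tables, then one table's columns, straight from the database file
duckdb ~/garmin-warehouse/kb/kb.duckdb -c "SHOW TABLES;"
duckdb ~/garmin-warehouse/kb/kb.duckdb -c "DESCRIBE episodes;"
```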
`kb/load.py` uses DELETE rather than DROP+CREATE, so loads don't
need migration tags.
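The reload pattern looks roughly like this (episode id is illustrative):

```sql
BEGIN TRANSACTION;
-- Remove the episode's old rows; the table definition stays untouched
DELETE FROM claims WHERE episode_id = 'swap-ep-174';
-- ...then re-insert the freshly extracted rows for that episode...
COMMIT;
```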
Tables¶
Core content¶
| Table | Rows (2026-05-04) | Purpose |
|---|---|---|
| episodes | 544 | One row per podcast episode. Has feed column for multi-show disambiguation. |
| claims | 5,555 | Extracted claims, linked to episode. Has claim_prompt_version for stale-prompt detection. |
| studies | 472 | Resolved citations (PMID/DOI). Many-to-many with claims via claim_studies. |
| claim_studies | — | Junction table. |
| findings | 19 | Curated synthesis topics (mirrors findings/*.md files). |
| finding_topics | — | Topic chips per finding. |
| xref | — | Cross-references. dst_kind distinguishes finding↔finding from finding→study. |
Embeddings¶
| Table | Rows | Purpose |
|---|---|---|
| embeddings_claim | 5,555 | Voyage 3-large 1024d vectors per claim, indexed via HNSW |
| embeddings_finding | 19 | Same for findings |
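HNSW indexes in DuckDB come from the `vss` extension; a minimal sketch of how such an index is declared (index and column names are illustrative, not the real schema):

```sql
INSTALL vss;
LOAD vss;
-- Required for HNSW indexes in an on-disk (persistent) database
SET hnsw_enable_experimental_persistence = true;
-- Assumes the vector column is a fixed-size FLOAT[1024] array
CREATE INDEX idx_claim_hnsw ON embeddings_claim USING HNSW (vec);
```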
The embedding cache is keyed by `content_hash` (migration 003). If a claim's
text didn't change, re-embedding is a no-op.
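The cache logic amounts to a hash-keyed memo; a minimal sketch (the hash normalization and in-memory dict are assumptions, the real cache lives in DuckDB):

```python
import hashlib

_cache: dict[str, list[float]] = {}  # content_hash -> embedding vector

def content_hash(text: str) -> str:
    # sha256 over stripped UTF-8 is one plausible normalization choice
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

def embed_if_changed(claim_text: str, embed_fn) -> list[float]:
    h = content_hash(claim_text)
    if h in _cache:
        return _cache[h]          # unchanged text: re-embedding is a no-op
    vec = embed_fn(claim_text)    # cache miss: call the embedding API
    _cache[h] = vec
    return vec
```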
Migration history¶
(Each migration in kb/migrations/ is numbered and applied in order.
Current: v4.)
- 001 — initial schema
- 002 — added `claim_prompt_version` column to `claims`
- 003 — added `content_hash`-keyed embedding cache
- 004 — episode-id normalization (strip leading zeros). Legacy 60 rows preserved because DuckDB FK constraints prevent in-place rename.
To add a new migration: write `kb/migrations/00N_description.sql`, then
run `uv run python kb/migrate.py up`. migrate.py records applied
versions in a metadata table.
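A hypothetical next migration, just to show the shape (file name and column are invented for illustration):

```sql
-- kb/migrations/005_add_reviewed_at.sql
ALTER TABLE claims ADD COLUMN reviewed_at TIMESTAMP;
```

Applying it with `uv run python kb/migrate.py up` would record 005 as the new current version.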
Key columns to know¶
episodes¶
- `id` — like `swap-ep-174`, `letsrun-thread-12345` (multi-source-aware)
- `feed` — the source show identifier. Use `SELECT DISTINCT feed FROM episodes` to populate filter dropdowns
- `digest_path` — file path for "open in editor" buttons
- `pub_date` — for "show me X from the last month" queries
claims¶
- `episode_id` — FK to `episodes.id`
- `kind` — claim taxonomy
- `host_stance` — endorses / skeptical / debating
- `otq_relevance` — 0–5 score
- `claim_prompt_version` — for finding stale claims after prompt edits
findings¶
- `id` — matches the markdown filename stem
- `title` — first H1 of the markdown body
- `study_links` — count cached for graph node sizing
xref¶
- `src` — finding ID
- `dst` — finding ID or study PMID
- `dst_kind` — `finding` | `study`
Triage state (NOT in DuckDB)¶
Triage state lives in YAML files at `~/garmin-warehouse/kb/`, not in
the DuckDB database itself:
- `watches.yaml` — declarative: "for finding X, watch for claims matching pattern Y"
- `applied.yaml` — append-only: "this claim got applied to this finding on this date"
- `dismissed.yaml` — append-only: "this claim is not relevant to this finding"
kb/review.py builds the application queue by joining watches.yaml
against claims + embeddings_claim and excluding what's in
applied.yaml/dismissed.yaml.
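That queue build is an anti-join; a minimal sketch (the data shapes here are assumptions — the real kb/review.py reads the YAML files and joins against DuckDB tables):

```python
# watches: finding -> watch patterns; claims: (claim_id, text)
watches = {"bloodwork_baseline": ["ferritin"]}
claims = [
    (1, "Ferritin thresholds for masters runners"),
    (2, "Carbon plate stiffness and running economy"),
]
applied = set()                            # (finding, claim_id) pairs
dismissed = {("bloodwork_baseline", 2)}

queue = [
    (finding, claim_id)
    for finding, patterns in watches.items()
    for claim_id, text in claims
    if any(p in text.lower() for p in patterns)      # claim matches a watch
    and (finding, claim_id) not in applied           # not already applied
    and (finding, claim_id) not in dismissed         # not already dismissed
]
# queue -> [("bloodwork_baseline", 1)]
```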
This pattern means a kb rebuild doesn't lose triage state — YAML files are the durable journal. The DuckDB is rebuildable; the YAMLs are not.
Useful queries¶
```sql
-- Claims per show:
SELECT e.feed, COUNT(*) AS n
FROM claims c JOIN episodes e ON c.episode_id = e.id
GROUP BY e.feed ORDER BY n DESC;

-- Stale-prompt claims (claims extracted with old prompts):
SELECT episode_id, COUNT(*)
FROM claims
WHERE claim_prompt_version < (SELECT MAX(claim_prompt_version) FROM claims)
GROUP BY episode_id;

-- Findings with most study links:
SELECT id, title, study_links FROM findings ORDER BY study_links DESC LIMIT 10;

-- Cross-references for a finding:
SELECT dst, dst_kind FROM xref WHERE src = 'bloodwork_baseline';
```
For ad-hoc exploration use Harlequin (already installed via
`uv tool install`).
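Assuming Harlequin's default DuckDB adapter, launching against the kb looks like:

```shell
harlequin ~/garmin-warehouse/kb/kb.duckdb
```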
Related¶
- `systems/garmin-warehouse.md`
- `runbooks/kb-rebuild.md`
- Memory: `warehouse_hardening_2026_05.md`
- Source: `~/garmin-warehouse/ARCHITECTURE.md`