
Index strategy

Which indexes exist, what they accelerate, and why each was worth the write cost.

Every read path in SOMA has a named index backing it. Unplanned sequential scans are a bug.
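One cheap way to police that rule in a live database is Postgres's own statistics view. A sketch (standard `pg_stat_user_tables`; the ordering is just a heuristic for "worst offender first"):

```sql
-- Spot tables that are accumulating sequential scans in production.
-- A hot table with high seq_tup_read and low idx_scan is the bug
-- this page says should not exist.
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC;
```

For a single suspect query, `EXPLAIN (ANALYZE, BUFFERS)` confirms which named index (if any) the planner actually chose.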

The catalog

| Index | Shape | Accelerates |
| --- | --- | --- |
| `entities_embedding_hnsw` | HNSW on `vector(1024)`, cosine | Semantic recall via `memory_recall` |
| `entities_search_gin` | GIN on `tsvector` | Lexical fallback in `search_entities` |
| `entities_user_type_idx` | B-tree on `(user_id, type)` | Per-type listings (`/app/projects`, `/app/entities?type=book`) |
| `entities_user_status_idx` | B-tree on `(user_id, status)` | Active filter (hide archived) |
| `edges_from_idx` | B-tree on `(from_id, type)` | Outgoing-edge walk |
| `edges_to_idx` | B-tree on `(to_id, type)` | Incoming-edge walk (backlinks) |
| `events_user_type_time_idx` | B-tree on `(user_id, type, occurred_at DESC)` | Timeline views, filter-by-kind |
| `events_entities_gin` | GIN on `entity_ids uuid[]` | "Events involving X" via array containment |
| `facts_embedding_hnsw` | HNSW on `vector(1024)`, cosine | Fact recall + dedup-on-write |
| `facts_user_active_idx` | Partial B-tree on `user_id` `WHERE superseded_by IS NULL` | Current facts only (skip the history) |
| `sources_kind_external_id_uniq` | UNIQUE on `(kind, external_id)` | Idempotent ingestion: repeated imports don't double-count |
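To make the "accelerates" column concrete, here are the query shapes two of these indexes exist for. Column and table names are assumed to match the catalog above; the `@>` containment operator and `ON CONFLICT` are standard Postgres:

```sql
-- events_entities_gin: "events involving X" via array containment.
-- The GIN index on entity_ids serves the @> predicate.
SELECT *
FROM events
WHERE user_id = $1
  AND entity_ids @> ARRAY[$2]::uuid[]
ORDER BY occurred_at DESC;

-- sources_kind_external_id_uniq: idempotent ingestion.
-- Re-running an import hits the unique constraint and inserts nothing.
-- (Real inserts carry more columns; this is the minimal shape.)
INSERT INTO sources (kind, external_id)
VALUES ($1, $2)
ON CONFLICT (kind, external_id) DO NOTHING;
```

The edge walks are the simplest case: a plain equality lookup on the leading column(s) of `edges_from_idx` or `edges_to_idx`.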

HNSW tuning

Both `entities_embedding_hnsw` and `facts_embedding_hnsw` use pgvector's defaults: m=16, ef_construction=64. Voyage-3-large embeddings are 1024-dimensional; the default ef_search (40) gives good recall at our volumes (the MVP holds thousands of rows, not millions). Revisit when we cross 100k rows.
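Spelled out, the DDL looks like this (pgvector syntax; the `embedding` column name is assumed). Since m=16 and ef_construction=64 are pgvector's defaults, the `WITH` clause is redundant today, but stating it makes the future tuning pass a one-line diff:

```sql
-- HNSW index over cosine distance; parameters match pgvector defaults.
CREATE INDEX entities_embedding_hnsw
  ON entities USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time recall knob (pgvector default is 40).
-- Raise per-session when recall at >100k rows starts to slip.
SET hnsw.ef_search = 40;
```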

Why partial indexes

facts_user_active_idx is partial (WHERE superseded_by IS NULL). Facts are append-on-update — old rows stay for audit but never participate in recall. A full index would triple in size over time with zero extra use. The WHERE clause keeps it tight.
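A sketch of the index and the query shape it serves, assuming the column names above. Note the planner only uses a partial index when it can prove the query's predicate implies the index's `WHERE` clause, so recall queries must repeat the filter verbatim:

```sql
-- Indexes only current facts; superseded rows never enter the index.
CREATE INDEX facts_user_active_idx
  ON facts (user_id)
  WHERE superseded_by IS NULL;

-- Matches the partial predicate, so the index is eligible.
SELECT *
FROM facts
WHERE user_id = $1
  AND superseded_by IS NULL;
```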