Persistence patterns for in-process vector indexes

2026-05-29 · persistence · sqlite · operations

Building a vector index that lives in memory is a weekend project. Building one that survives a crash, a deploy, a schema change, and a partial write without losing data or returning stale results is most of the actual engineering. This is the part the literature usually skips.

This post is about the persistence patterns embeddable indexes tend to land on, why they look the way they do, and where the rough edges are.

The two-file problem

Almost every embeddable vector store ends up with two pieces of state:

The vectors themselves, in whatever format the index needs. For HNSW, that’s the graph plus the raw vectors. For flat, it’s just the raw vectors. USearch writes both to a single .usearch file.
The metadata — the text that produced the vector, the source URL, timestamps, the document ID, whatever the application actually wants to display. This is typically rows in SQLite, or sometimes a sidecar JSONL.

memista takes the explicit two-file approach: a SQLite database for metadata (one table per database_id, named chunks_<database_id>) and one <database_id>.usearch file per logical partition. The vector index references SQLite rows by their integer chunk_id primary key.

This is a clean design but introduces a coordination problem: a single “insert one chunk” operation is two writes to two different files. If the process crashes between them, the files disagree.

The four crash windows

Walk through insert_chunk in memista to see the windows:

1. INSERT INTO chunks_<id>  (gets a chunk_id back from SQLite)
2. index.add(chunk_id, embedding)
3. index.save("<id>.usearch")

There are four points a crash can hit:

Before step 1. Nothing happened. Safe.
Between 1 and 2. SQLite has a row; the index does not. A future search will not return this chunk, but the row is reachable by SELECT *. Disagreement is mild — the chunk is “orphaned.”
Between 2 and 3. The in-memory index has the entry; the on-disk index does not. On restart, the index loads without that entry, and the chunk is again orphaned in SQLite. Effectively the same as the previous window.
During step 3. USearch’s save is not atomic across the whole file in the general case; a partial write can corrupt the index. memista does not currently mitigate this.

For most embedded workloads, the first three windows are acceptable — periodic reconciliation cleans up orphans, and re-indexing them is cheap. The fourth is the one to take seriously.
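To make the windows concrete, here is a minimal Python sketch of the same three-step insert. It is not memista's code: ToyIndex is a hypothetical stand-in that keeps vectors in a dict and saves them as JSON, the table is named chunks_demo for illustration, and the crash_after parameter exists only to simulate a crash at a chosen point.

```python
import json
import os
import sqlite3
import tempfile


class ToyIndex:
    """Hypothetical stand-in for a vector index: chunk_id -> vector, saved as JSON."""

    def __init__(self, path):
        self.path = path
        self.vectors = {}
        if os.path.exists(path):
            with open(path) as f:
                # JSON object keys are strings, so convert them back to ints.
                self.vectors = {int(k): v for k, v in json.load(f).items()}

    def add(self, chunk_id, embedding):
        self.vectors[chunk_id] = embedding

    def save(self):
        # Plain overwrite: a crash in here is the fourth window.
        with open(self.path, "w") as f:
            json.dump(self.vectors, f)


def insert_chunk(conn, index, text, embedding, crash_after=None):
    # Step 1: write the metadata row; SQLite assigns the chunk_id.
    cur = conn.execute("INSERT INTO chunks_demo (text) VALUES (?)", (text,))
    conn.commit()
    chunk_id = cur.lastrowid
    if crash_after == 1:
        raise RuntimeError("simulated crash between steps 1 and 2")
    # Step 2: add the vector to the in-memory index.
    index.add(chunk_id, embedding)
    if crash_after == 2:
        raise RuntimeError("simulated crash between steps 2 and 3")
    # Step 3: persist the index.
    index.save()
    return chunk_id


workdir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(workdir, "meta.db"))
conn.execute("CREATE TABLE chunks_demo (chunk_id INTEGER PRIMARY KEY, text TEXT)")
index = ToyIndex(os.path.join(workdir, "demo.index.json"))

insert_chunk(conn, index, "first chunk", [0.1, 0.2])
try:
    insert_chunk(conn, index, "second chunk", [0.3, 0.4], crash_after=2)
except RuntimeError:
    pass

# "Restart": reload the index from disk and compare it with SQLite.
reloaded = ToyIndex(index.path)
row_ids = {row[0] for row in conn.execute("SELECT chunk_id FROM chunks_demo")}
orphans = row_ids - set(reloaded.vectors)
```

After the simulated crash, the second row exists in SQLite but the reloaded index only knows about the first, which is the orphan case described above.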

Patterns that help

Atomic rename. Write the new index to <id>.usearch.tmp, fsync, then rename() over the live file. POSIX rename is atomic on the same filesystem; either the old or the new file is visible, never both, never neither. USearch’s save does not do this by default, so a small wrapper is appropriate. memista could adopt this in the helper without changing the API.
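A wrapper along these lines might look like the following Python sketch. The atomic_save helper and the save_fn callback are illustrative names, not part of USearch or memista; the assumption is only that the library exposes a save call that takes a destination path. The sketch also fsyncs the parent directory after the rename, which on POSIX systems is what makes the rename itself durable.

```python
import os
import tempfile


def atomic_save(save_fn, path):
    """Call save_fn(tmp_path), then atomically move the result over path."""
    tmp_path = path + ".tmp"
    try:
        save_fn(tmp_path)
        # Flush the temp file's contents to stable storage before the rename.
        with open(tmp_path, "r+b") as f:
            os.fsync(f.fileno())
    except BaseException:
        # The save failed: drop the partial temp file, keep the live file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Atomic on the same filesystem: readers see the old file or the new one.
    os.replace(tmp_path, path)
    if os.name == "posix":
        # Persist the directory entry so the rename survives a power loss.
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def writer(payload):
    """Build a stand-in save function that writes fixed bytes."""
    def save(p):
        with open(p, "wb") as f:
            f.write(payload)
    return save


def failing_writer(p):
    with open(p, "wb") as f:
        f.write(b"partial")
    raise OSError("simulated failure mid-save")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "demo.usearch")
atomic_save(writer(b"index v1"), target)
try:
    atomic_save(failing_writer, target)
except OSError:
    pass
with open(target, "rb") as f:
    surviving = f.read()
```

A real crash would not run the cleanup branch, so a stale .tmp file can be left behind; deleting leftovers at startup is a reasonable companion step.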

Batch and checkpoint. Don’t save the index after every insert. Buffer inserts in a session, save once at the end. The window where SQLite and the index disagree is wider, but it’s bounded and you can reconcile. memista currently saves after every batched call to /v1/insert, which is a sensible default for a service but pessimistic for high-throughput embeddable use.
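One way to express the buffered pattern is a small session wrapper that saves once on exit. This is a sketch under assumptions: insert_session is a hypothetical helper, and CountingIndex is a stand-in that only counts save calls so the effect is visible.

```python
from contextlib import contextmanager


class CountingIndex:
    """Stand-in index that records how many times save() ran."""

    def __init__(self):
        self.vectors = {}
        self.saves = 0

    def add(self, key, vector):
        self.vectors[key] = vector

    def save(self):
        self.saves += 1


@contextmanager
def insert_session(index):
    """Yield the index for a batch of adds and save exactly once at the end."""
    try:
        yield index
    finally:
        # Save even if the batch raised, so vectors added so far are kept.
        index.save()


index = CountingIndex()
with insert_session(index) as session:
    for chunk_id in range(1, 101):
        session.add(chunk_id, [float(chunk_id), 0.0])
```

A hundred adds cost one save here. Saving in the finally branch is a design choice: it persists whatever was added before an error, which keeps the disagreement with SQLite as small as possible.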

WAL on SQLite. memista opens its SQLite pool with JournalMode::Wal, which is the right default. Writes are appended to a separate -wal file next to the database, readers don’t block writers, and the recovery story is well understood. The trade-off is that a file-level backup must either include the -wal file or follow a checkpoint that folds it back into the main database.
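For readers outside the Rust stack, the same setting can be sketched with Python's built-in sqlite3 module. The file and table names are illustrative. SQLite names the log by appending -wal to the database path, which is the file a backup has to account for.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "meta.db")

conn = sqlite3.connect(db_path)
# The pragma returns the journal mode that is now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

conn.execute("CREATE TABLE chunks_demo (chunk_id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("INSERT INTO chunks_demo (text) VALUES ('hello')")
conn.commit()

# While the connection is open, committed writes sit in the -wal file
# until a checkpoint copies them into the main database file.
wal_exists = os.path.exists(db_path + "-wal")
```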

Reconcile on startup. On boot, do a cheap consistency check: count rows in SQLite, count entries in the index. If they disagree, walk the delta and either re-add the missing vectors (if you kept the embeddings) or mark the orphan rows for rebuild. This is the pragmatic answer to “what if a crash happens mid-batch” — accept it, detect it, fix it.

memista does not currently do this. You’d add it in the binary’s startup path, or in your library wrapper.
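A minimal sketch of such a pass, in Python with the built-in sqlite3 module. The reconcile function is a hypothetical helper, and it assumes the index can enumerate its keys, passed in here as a plain set.

```python
import sqlite3


def reconcile(conn, table, index_keys):
    """Return (rows with no vector, vectors with no row), both sorted."""
    # table is interpolated into the SQL, so it must come from trusted config.
    row_ids = {row[0] for row in conn.execute(f"SELECT chunk_id FROM {table}")}
    orphan_rows = sorted(row_ids - index_keys)
    dangling_vectors = sorted(index_keys - row_ids)
    return orphan_rows, dangling_vectors


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks_demo (chunk_id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO chunks_demo (text) VALUES (?)", [("a",), ("b",), ("c",)]
)

# Pretend the index holds keys 1, 2 and 7: row 3 was never indexed,
# and key 7 points at a row that no longer exists.
index_keys = {1, 2, 7}
orphan_rows, dangling_vectors = reconcile(conn, "chunks_demo", index_keys)
```

Both sides hold three entries in this example, so a count comparison alone would pass. The count check is a cheap first filter; the set difference is what actually finds the delta.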

Where the embedding lives is the question

A subtle point: SQLite stores the text and metadata, but it does not store the embedding. The embedding is in the USearch index file. If you lose the USearch file, you can re-embed from the text — but only if the embedding model is stable and reachable.

This has practical consequences:

Versioning. When you change embedding models, every existing index is stale. memista has no built-in model-version field; you put it in the metadata JSON or you use a different database_id per model.
Cold-start cost. Re-embedding from text takes minutes or hours, depending on corpus size and model. The index file is therefore meaningful state, not just a cache.
Backup strategy. Both files must be backed up together, and at a point where they agree. The simplest pattern is to pause writes, checkpoint the SQLite WAL, save the index, then copy both files.
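The checkpoint, save, copy sequence can be sketched as follows. backup_pair and save_index are hypothetical names, the index file is a stand-in blob, and the sketch assumes nothing else writes to either file while the copy runs.

```python
import os
import shutil
import sqlite3
import tempfile


def backup_pair(conn, db_path, index_path, save_index, backup_dir):
    """Checkpoint, save, copy. Assumes no other writer is active meanwhile."""
    os.makedirs(backup_dir, exist_ok=True)
    conn.commit()
    # Fold the WAL back into the main database file and truncate it.
    busy, _, _ = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    if busy:
        raise RuntimeError("checkpoint was blocked by another connection")
    save_index()
    shutil.copy2(db_path, backup_dir)
    shutil.copy2(index_path, backup_dir)


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "meta.db")
index_path = os.path.join(workdir, "demo.usearch")

conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE chunks_demo (chunk_id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("INSERT INTO chunks_demo (text) VALUES ('hello')")


def save_index():
    # Stand-in for the real index save.
    with open(index_path, "wb") as f:
        f.write(b"stand-in index bytes")


backup_dir = os.path.join(workdir, "backup")
backup_pair(conn, db_path, index_path, save_index, backup_dir)

restored = sqlite3.connect(os.path.join(backup_dir, "meta.db"))
restored_rows = restored.execute("SELECT COUNT(*) FROM chunks_demo").fetchone()[0]
```

If pausing writers is not an option, Python's sqlite3 Connection.backup copies a consistent snapshot of a live database, though the index file still needs its own consistent copy.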

A useful invariant: the SQLite database is the source of truth for the content; the USearch file is a derived artifact you can rebuild from it given an embedding function. If you don’t have a way to re-embed on demand, you have lost that property and your backup strategy must account for it.
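The invariant is easy to state in code. In this sketch the index is a plain dict, rebuild_index is a hypothetical helper, the text column name is an assumption about the schema, and toy_embed is a deliberately silly deterministic stand-in for a real embedding model.

```python
import sqlite3


def rebuild_index(conn, table, embed):
    """Derive a fresh chunk_id -> vector mapping from the metadata rows."""
    fresh = {}
    # table is interpolated into the SQL, so it must come from trusted config.
    query = f"SELECT chunk_id, text FROM {table} ORDER BY chunk_id"
    for chunk_id, text in conn.execute(query):
        fresh[chunk_id] = embed(text)
    return fresh


def toy_embed(text):
    # Stand-in embedding: [character count, space count].
    return [float(len(text)), float(text.count(" "))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks_demo (chunk_id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO chunks_demo (text) VALUES (?)", [("hello world",), ("hi",)]
)

rebuilt = rebuild_index(conn, "chunks_demo", toy_embed)
```

Everything the rebuild needs is in SQLite plus the embed function. If embed cannot be called any more, the mapping cannot be derived, and the index file stops being a cache.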

Schema changes

The hardest persistence question is what to do when the schema changes — new embedding dimension, new metric, new metadata columns.

For metadata, SQLite handles this gracefully. Add a column with ALTER TABLE, default it to NULL or a sentinel, and existing rows remain valid.
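A small sketch of that migration using Python's built-in sqlite3 module; the table and the source_url column are illustrative, not memista's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks_demo (chunk_id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("INSERT INTO chunks_demo (text) VALUES ('old row')")

# Add the new column; rows written before the change read back as NULL.
conn.execute("ALTER TABLE chunks_demo ADD COLUMN source_url TEXT")
conn.execute(
    "INSERT INTO chunks_demo (text, source_url) VALUES (?, ?)",
    ("new row", "https://example.com"),
)

rows = conn.execute(
    "SELECT chunk_id, text, source_url FROM chunks_demo ORDER BY chunk_id"
).fetchall()
```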

For the index, there is no equivalent. USearch fixes dimensions, metric, and quantization at index creation. Changing any of them means building a fresh index from scratch: re-embedding every row (if the dimension changes) or re-adding every vector (if only the metric changes).

memista is honest about this in its README: “Index rebuilds on dimension changes.” The current crate also hardcodes dimensions: 2, which means the first thing any real user does is fork that constant and accept that they will rebuild every index they have.

There is no general solution. The pattern that works:

Tag every index with the (model_id, dimension, metric) triple at the time of creation, in a sidecar JSON.
On startup, compare the live config against the tag. If they disagree, refuse to start with a clear error rather than load an incompatible index.
Provide a one-shot rebuild command that drops the old index and re-runs inserts from SQLite using the new embedding function.
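The first two steps can be sketched in a few lines of Python. Every name here (IndexTag, write_tag, check_tag, the .meta.json suffix, the model ids) is hypothetical; the point is only the shape: write the tag when the index is created, compare it on startup, and fail loudly on a mismatch.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IndexTag:
    model_id: str
    dimension: int
    metric: str


class IncompatibleIndexError(RuntimeError):
    pass


def tag_path(index_path):
    return index_path + ".meta.json"


def write_tag(index_path, tag):
    """Record the tag next to the index at creation time."""
    with open(tag_path(index_path), "w") as f:
        json.dump(asdict(tag), f)


def check_tag(index_path, live):
    """Raise instead of loading an index built under a different config."""
    with open(tag_path(index_path)) as f:
        stored = IndexTag(**json.load(f))
    if stored != live:
        raise IncompatibleIndexError(
            f"{index_path} was built with {stored}, but the live config is {live}"
        )
    return stored


workdir = tempfile.mkdtemp()
index_path = os.path.join(workdir, "demo.usearch")
write_tag(index_path, IndexTag("toy-model-v1", 384, "cos"))

# Same config: the check passes and returns the stored tag.
matching = check_tag(index_path, IndexTag("toy-model-v1", 384, "cos"))

# Different model and dimension: startup is refused.
try:
    check_tag(index_path, IndexTag("toy-model-v2", 768, "cos"))
    refused = False
except IncompatibleIndexError:
    refused = True
```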

This is the kind of operational scaffolding that turns an experimental library into something you can run in anger. memista is at the “experimental” end of that spectrum today; the patterns above are the shape of the work between here and “boring.”

The good news is that none of this is mysterious. Two files, one rename trick, one reconciliation pass, one explicit rebuild path. The discipline is admitting up front that an embeddable index has persistence concerns just as serious as a database — you just get to write them yourself.

A small checklist

If you’re shipping an embeddable index in production, work through this list before the first user touches it:

Are writes to both state files atomic? Wrap the index save in write-temp + fsync + rename. SQLite is already careful; the vector file usually isn’t.
Do you have a reconciliation pass on startup? It should walk the metadata store, confirm each row has a vector, and either re-add or quarantine the orphans. Run it every boot; it’s cheap when nothing is wrong.
Is the embedding model tagged? Put (model_id, model_version, dimension, metric) in a sidecar JSON next to each index. Refuse to load if it disagrees with the live config.
Do you have a rebuild command? It should drop the index, walk the SQLite table, re-embed each row, and re-insert. This is the path you need when models change, when a corruption is detected, or when you want to change index parameters.
Are backups consistent? Either checkpoint first or back up both files atomically (filesystem snapshot, Litestream, whatever your stack offers). A backup that captures SQLite mid-flight without the matching index is a backup that lies.
Do you log enough to debug? Index size after every save, time spent in each phase of insert and search, count of reconciled orphans on startup. None of this is expensive; all of it pays off the first time something goes sideways.

memista will probably grow some of this scaffolding as it leaves experimental status. Until then, treat the crate as a building block and own the operational layer in your application. That is the cost of going embeddable. It is also, paradoxically, the appeal — you control exactly what runs, and exactly when it runs, and exactly what happens when it doesn’t.
