When you don't need a vector DB

2026-05-12 · architecture · rag · embeddable

Every few months a new vector database announces itself with a benchmark that peaks at a billion vectors and a sub-millisecond p99. The benchmarks are genuine. The implication — that you need that machine — usually isn’t.

Most retrieval workloads fit on one box. A surprising number fit in one process. This post is about how to tell, and about what removing the database actually changes.

What “embeddable” buys you

The first thing to notice about a vector database is that it is a database. That means: a network hop, a serialization format, a process to supervise, a config file, a backup strategy, a migration story, a security boundary, and a Helm chart someone has to keep current. None of those serve your business logic; they exist to make the database operable from outside.

An embeddable index — a crate you link into your binary, or a library running as part of the same process — removes all of that. The vector you computed in RAM goes into an index that lives in RAM, on a thread you already had. The metadata goes into a SQLite file that sits next to your other state. There is no wire format because there is no wire.
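To make "no wire" concrete, here is a minimal sketch of an index that lives entirely in the calling process. It is a brute-force flat scan, not HNSW, and the FlatIndex type and its methods are hypothetical names invented for illustration; they are not the API of memista, usearch, or any other crate.

```rust
/// A brute-force, in-process index: every vector lives in one Vec<f32>.
/// Hypothetical stand-in for an ANN library; real crates use a graph instead.
pub struct FlatIndex {
    dim: usize,
    ids: Vec<u64>,
    vectors: Vec<f32>, // row-major: ids.len() * dim values
}

impl FlatIndex {
    pub fn new(dim: usize) -> Self {
        assert!(dim > 0, "dimension must be positive");
        Self { dim, ids: Vec::new(), vectors: Vec::new() }
    }

    /// Append one vector. Panics if the dimension does not match.
    pub fn add(&mut self, id: u64, vector: &[f32]) {
        assert_eq!(vector.len(), self.dim, "dimension mismatch");
        self.ids.push(id);
        self.vectors.extend_from_slice(vector);
    }

    /// Return up to `k` (id, squared L2 distance) pairs, nearest first.
    pub fn search(&self, query: &[f32], k: usize) -> Vec<(u64, f32)> {
        assert_eq!(query.len(), self.dim, "dimension mismatch");
        let mut hits: Vec<(u64, f32)> = self
            .vectors
            .chunks_exact(self.dim)
            .zip(&self.ids)
            .map(|(row, &id)| {
                let dist: f32 = row.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
                (id, dist)
            })
            .collect();
        // total_cmp gives a total order even if a distance is NaN.
        hits.sort_by(|a, b| a.1.total_cmp(&b.1));
        hits.truncate(k);
        hits
    }
}
```

Both calls are ordinary function calls on memory the process already owns. A flat scan is linear in the number of vectors, so it only suits small corpora; an HNSW library keeps the same calling shape and changes what happens inside search.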

This matters most in three shapes of workload:

Desktop and CLI tools. A user runs your binary. There is no infrastructure team. There is a ~/.config/yourapp/ directory. Anything that requires a separate daemon is a non-starter; anything that requires Docker is hostile.
Edge and on-device. A model and an index ship to a device with finite storage and no reliable network. The fewer moving parts, the fewer ways the device can be wrong.
Agents and ephemeral workers. A short-lived process needs retrieval over a corpus it just built or fetched. Spinning up a database for thirty seconds of work is comedy.

memista is built for those shapes. So is usearch directly, so is hnsw_rs, so is instant-distance. The category exists.

Knowing your workload fits

The question is not really “does my data fit on one box” — it is “does my data fit in the working set of one process at the latency I need.”

A few rough heuristics, all of which you should measure rather than trust:

Vector count under ~10M. A modern HNSW index with f32 vectors of dimension 384 needs roughly vectors × dim × 4 bytes for the raw data, plus some graph overhead. Ten million 384-dim vectors is about 15.4 GB (14.3 GiB) just for the vectors. That fits easily on a developer laptop’s SSD; it does not fit comfortably in the RAM of a typical one. HNSW touches a small fraction of nodes per query, so a memory-mapped layout works well below this scale. Above it, you start designing.
QPS under a few hundred. A single Rust process running HNSW with SIMD will happily serve hundreds of queries per second on commodity hardware. If your traffic is bursty and your p99 budget is tens of milliseconds, you have headroom.
One writer, many readers. Embeddable indexes are easiest when writes are batched and infrequent — a nightly re-index, an on-demand rebuild when a corpus changes. If you need concurrent writers across machines, you have outgrown the shape.
Metadata that fits in SQLite. “Fits” here means tens of millions of rows with simple lookups by primary key. SQLite is not the bottleneck; the index usually is.

If those four roughly hold, you do not need a vector database. You need a library and a disk.
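The sizing arithmetic from the first heuristic is easy to script. The sketch below computes the raw f32 footprint and a rough graph overhead. The overhead formula is an assumption: it counts about 2 × M neighbor links of 4 bytes each per vector on the base layer and ignores upper layers, and real libraries will differ. All function names are made up for this example.

```rust
/// Raw storage for `count` f32 vectors of dimension `dim`, in bytes.
pub fn raw_vector_bytes(count: u64, dim: u64) -> u64 {
    count * dim * std::mem::size_of::<f32>() as u64
}

/// Rough HNSW neighbor-list overhead in bytes.
/// Assumption: about `2 * m` u32 links per vector on the base layer;
/// upper layers and per-node bookkeeping are ignored.
pub fn approx_graph_bytes(count: u64, m: u64) -> u64 {
    count * 2 * m * std::mem::size_of::<u32>() as u64
}

/// Convert bytes to GiB (2^30 bytes) for display.
pub fn gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}
```

For ten million 384-dim vectors, raw_vector_bytes returns 15,360,000,000 bytes, which gib reports as roughly 14.3. With M = 16 the assumed graph overhead adds about 1.28 GB on top. Treat both as a starting estimate and check the file your library actually writes.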

What you give up

Honesty section. Embeddable means:

No horizontal scale. When the working set stops fitting in one process, you migrate. That migration is real work — your queries are now RPCs, your index is now sharded, your metadata is now in a different store. Plan the exit before you commit.
No managed backups. SQLite is cp, which is fine; the USearch index file is cp, which is also fine; but the coordination between them is yours. memista writes the index file after every insert batch, so a crash mid-batch leaves the SQLite row without a corresponding vector. In practice you rebuild affected entries; in theory you wrap the two writes in something more careful.
No client SDKs in seventeen languages. Your retrieval lives in the language your binary is written in. If you need polyglot access, you expose HTTP — which memista does, on 127.0.0.1:8083 — and now you have a small service, which is most of the way to having a database.
No tuning UI. When recall drops because your data drifted, you find out by measuring it yourself and fix it by reading the rustdoc, not by clicking through a dashboard.

The trade is: complexity at the boundary of your process for complexity inside it. For small and medium workloads, inside is cheaper.
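As one example of "something more careful" for the index file, a common pattern is to write to a temporary file and rename it over the real one. This is a sketch using only the standard library, not what memista does; the function name and the ".tmp" suffix are made up for illustration. It assumes the temporary file is on the same filesystem as the target, where a rename on POSIX systems replaces the file in one step.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `final_path` with `bytes` so that a crash leaves either the old
/// file or the new one on disk, never a half-written mix.
pub fn replace_atomically(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Hypothetical naming scheme: "<final_path>.tmp" in the same directory.
    let mut tmp_name = final_path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(bytes)?;
    // Flush file contents to disk before the rename makes them visible.
    tmp.sync_all()?;
    drop(tmp);

    fs::rename(&tmp_path, final_path)
}
```

This only protects the index file itself. Keeping it in step with the SQLite rows still needs an ordering decision, for example committing the rows only after the rename succeeds and treating rows without a vector as work to redo on startup.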

Where memista fits on this map

memista is the explicit “library plus optional binary” shape. The crate exports AppState, create_app, and the request/response types; you can embed those in your own Actix app, or you can cargo install memista and run the server it ships with. The persistence is a SQLite file plus one <database_id>.usearch file per logical partition. That’s the unit of backup, the unit of move, the unit of delete.

A few honest notes, because the project is explicit about them in its own README: the current crate hardcodes embedding dimensions to two, which is a demo default; you will be editing IndexOptions::dimensions before real use. It has not been tested past about a hundred thousand vectors. It does not authenticate. These are not surprises if you read the source; they are also not things you’d accept from a database. From a library, they’re configuration you own.
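Because the persistence is just a SQLite file plus .usearch files, a backup can be a plain file copy. The sketch below assumes all of those files sit in one directory and that the process is stopped or otherwise not writing while the copy runs. The function name and the sqlite_name parameter are hypothetical; check where your deployment actually keeps its files.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Copy the SQLite file named `sqlite_name` and every `*.usearch` file from
/// `data_dir` into `backup_dir`. Returns how many files were copied.
/// Assumption: nothing is writing to these files during the copy.
pub fn backup_partition_files(
    data_dir: &Path,
    sqlite_name: &str,
    backup_dir: &Path,
) -> io::Result<usize> {
    fs::create_dir_all(backup_dir)?;
    let mut copied = 0;
    for entry in fs::read_dir(data_dir)? {
        let path = entry?.path();
        if !path.is_file() {
            continue;
        }
        let is_sqlite = path.file_name().map_or(false, |n| n == sqlite_name);
        let is_index = path.extension().map_or(false, |e| e == "usearch");
        if is_sqlite || is_index {
            // is_file() passed, so the path has a final component.
            let name = path.file_name().expect("regular files have a name");
            fs::copy(&path, backup_dir.join(name))?;
            copied += 1;
        }
    }
    Ok(copied)
}
```

Moving or deleting a deployment follows the same shape: the set of files to touch is the SQLite file and the per-partition index files, nothing else.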

When to migrate up

The signal is not size; it is contention. You’ll know it’s time when:

You add a second writer and discover that “one writer” was load-bearing.
Your nightly re-index can no longer fit in the maintenance window.
You want to A/B test two index configurations on live traffic and realize you’d need two processes to do it.
The p99 starts oscillating in a way that correlates with disk pressure on the box.

At that point the database earns its complexity. Until then, it doesn’t. Most of the projects that announce themselves with “we use a vector DB” are paying that cost for the same reason they are paying a Kubernetes cost: they did it the same way someone else did, before they measured whether they had to.

Measure first. If a single process and a 400 MiB index file can serve your traffic, ship that. You can always pull the cluster out of the closet later.
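A measurement does not need much machinery. The sketch below times a closure repeatedly and reports p50 and p99 using nearest-rank percentiles. The function names are made up for this example, and the closure stands in for whatever search call your index exposes.

```rust
use std::time::{Duration, Instant};

/// Nearest-rank percentile over an already sorted slice; `p` is 1..=100.
pub fn percentile(sorted: &[Duration], p: usize) -> Duration {
    assert!(!sorted.is_empty(), "no samples");
    assert!((1..=100).contains(&p), "percentile out of range");
    // Integer ceiling of p% of the sample count, as a 1-based rank.
    let rank = (p * sorted.len() + 99) / 100;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Run `query` `iterations` times and return (p50, p99) latency.
pub fn measure<F: FnMut()>(iterations: usize, mut query: F) -> (Duration, Duration) {
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        query();
        samples.push(start.elapsed());
    }
    samples.sort();
    (percentile(&samples, 50), percentile(&samples, 99))
}
```

Run it against a realistic corpus and realistic queries, after a warm-up pass, and with the index loaded the way production will load it. A cold memory-mapped file and a warm one can give very different p99 numbers.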

The operational dividend

Removing the database removes one more thing: the operational surface area that comes with it. Embeddable retrieval means you don’t need a separate monitoring story for the vector store. You don’t need a separate on-call rotation for it. You don’t need a separate access-control model. You don’t need to learn another query language. You don’t need to keep its client library version pinned in lockstep with your application.

For a small team — say, fewer than five engineers — these costs are not theoretical. Each new operational surface eats time that does not ship product. Removing the vector database is removing a paging target, a quarterly upgrade cycle, and a “who knows that system” bus factor.

The same shrinkage applies to debugging: when retrieval inside your binary breaks, the debugging surface is also smaller. Logs come from one process. State lives in two files you can ls. You can attach a profiler and see exactly where time is going. There is no protocol to sniff, no remote shell to open, no foreign log format to parse.

For shapes of work where the team is small, the data is bounded, and the recall budget is reasonable, this dividend is the actual reason to go embeddable. Speed is nice. Cost is nice. But the deeper win is that you have fewer systems, and the systems you have are ones you wrote.
