HNSW vs flat: choosing for embeddable workloads

2026-05-21 · ann · hnsw · indexing

There are exactly two vector indexes anyone running on a single machine needs to think about first: a flat (brute-force) index and HNSW. Everything else — IVF, ScaNN, DiskANN, product quantization — is a tuning knob on one of those two ideas. For embeddable workloads, this is the choice that matters, and the answer is rarely a clean “use HNSW.”

What flat actually is

A flat index is a list of vectors. To query, you compute the distance from your query vector to every vector in the list, sort by distance, and return the top-k. That is the entire algorithm. There is no graph, no clustering, no learned structure, no parameter to tune beyond the distance metric.

This is sometimes called brute-force or exact search. It has three properties that matter:

Recall is 100% by construction. You looked at every vector. The top-k is the actual top-k.
Latency is linear in corpus size. Twice the vectors, twice the work.
Memory is whatever the vectors take. No overhead.

In modern code a flat search is not as slow as it sounds. With SIMD distance kernels — simsimd on x86 with AVX-512 or on Apple Silicon with NEON — a 384-dim cosine over a hundred thousand vectors takes a few milliseconds on a laptop. The constant factor is small; the slope is the problem.

What HNSW actually is

HNSW (Hierarchical Navigable Small World) is a multi-layer proximity graph. Each vector is a node. Edges connect nearby vectors. Layers thin out as you go up — the top layer is sparse, the bottom is dense. To search, you start at an entry point in the top layer, greedily walk toward your query in each layer, descend, and repeat.

This is approximate. You might miss the true top-k. Recall depends on three parameters:

connectivity (often called M): how many neighbours each node keeps. Higher = better recall, more memory, slower inserts.
expansion_add (efConstruction): how hard the index works during insertion to find good neighbours. Higher = better long-term recall, slower builds.
expansion_search (efSearch): how hard each query works. Higher = better recall, slower queries.

The three together define a recall/latency/memory triangle. USearch’s defaults — which memista accepts by passing 0 for all three to the IndexOptions constructor — pick reasonable values that bias toward generality rather than any particular workload. If you have a specific recall target, you tune them; if you don’t, you accept what USearch picks.

The honest comparison

	flat	HNSW
Recall	100%	~95–99% (tunable)
Query latency	`O(N · d)`	`O(log N · d)` amortized
Memory	`N · d · 4` bytes (f32)	flat + graph overhead (~`M · N · 8` bytes typical)
Insert latency	`O(d)` (append)	`O(M · log N · d)` (graph stitching)
Update / delete	easy	possible but involved
Persistence	trivial (an array)	a graph (USearch handles this)

For a few thousand vectors, the flat index wins on every axis that matters. You don’t have memory overhead, you don’t have approximation error, you don’t have a build phase, and the query is fast enough.

For a few hundred thousand vectors, HNSW starts to dominate on latency while still costing a manageable amount of memory.

For a few million, HNSW is the only choice unless you are happy with hundreds of milliseconds per query.

The line is data-dependent

There is no universal threshold where flat stops winning. The crossover depends on:

Dimension. A 1536-dim OpenAI embedding has a much worse flat constant than a 384-dim sentence-transformer embedding. The crossover moves left.
SIMD support. A flat search on a machine with AVX-512 is 4× the speed of one on stock SSE2. The crossover moves right.
Recall budget. If you’ll accept 95% recall, HNSW gets cheaper. If you need 99.9%, the HNSW parameters need to climb and it becomes less attractive.
Query batch size. Batched flat queries can share SIMD-amortized loads; HNSW queries don’t batch as cleanly. If you query in batches of a hundred, flat stays competitive longer.

The right move on an embeddable system is to write your retrieval against an interface that admits both, measure with your actual data, and pick.

What memista picks

memista picks HNSW, because it uses USearch and USearch is an HNSW implementation. The metric is MetricKind::IP (inner product); the quantization is ScalarKind::F32; the connectivity and expansion parameters are USearch defaults. SIMD acceleration is on via the simsimd feature.

That choice is reasonable for the project’s stated audience — embed in a Rust binary, ship retrieval as part of an app — but it has implications you should think about:

You’re stuck with approximation. memista does not currently expose a flat-search fallback. If you need a recall guarantee, you’ll be validating against a separate brute-force pass during development.
Tuning is upstream. If you want to change efSearch per query, you fork the load_or_create_index helper and pass non-zero values to IndexOptions. The wrapper does not yet expose this.
Inner product is not cosine. If your embeddings are unit-normalized, IP gives you cosine. If they’re not, you’ll want to normalize on the way in. memista does not normalize for you.

A small heuristic

For a project starting today, I’d reach for flat when:

The corpus is under ~10k vectors.
The vectors are high-dimensional (1024+) and queries are bursty.
You need an exact-recall guarantee — for example, a deduplication pass where missing a near-duplicate is a real problem.

I’d reach for HNSW (and therefore something like memista or USearch directly) when:

The corpus is over ~50k vectors and growing.
Query latency matters more than perfect recall.
The data is mostly insert-once or insert-rarely; bulk re-indexes are cheap to schedule.

Between 10k and 50k, the answer is “measure.” On embeddable workloads, the constants matter more than the asymptotics, and the only constants that count are the ones on your hardware with your vectors.

The shape of the question is not “which index is best” but “which index is best for the shape of work I’m doing this quarter.” That shape changes; the right code admits both.

A practical pattern: keep both around

The cheapest way to stay honest about which index you need is to keep both implementations available behind a trait, and select at startup:

trait VectorIndex {
    fn add(&mut self, key: u64, v: &[f32]) -> Result<()>;
    fn search(&self, q: &[f32], k: usize) -> Vec<(u64, f32)>;
    fn len(&self) -> usize;
}

struct FlatIndex { /* Vec<(u64, Vec<f32>)> */ }
struct HnswIndex { /* USearch handle */ }

The cost of carrying both is a few hundred lines and one extra trait boundary. The payoff is that when you migrate, you don’t migrate — you flip a config. It also gives you a built-in recall validator: search the same query through both indexes during development, compare top-k. HNSW recall should track flat closely; if it doesn’t, something is wrong with your parameters or your embeddings.

memista does not do this today. The crate-level interface is “USearch with batteries”; if you need flat, you wrap or fork. For new code, it is worth considering whether the trait boundary lives in your application rather than in the library.

What to measure

If you take only one thing from this post: don’t pick on theory. Build the smallest possible benchmark with your actual vectors:

Take 10k vectors from your real distribution.
Run brute-force top-10 to establish ground truth.
Run HNSW top-10 with default parameters; measure recall@10 and p95 latency.
Vary efSearch (or expansion_search) until recall is acceptable; re-measure latency.
Repeat at 100k and 1M scales (extrapolating where needed).

The answer that comes out of this is hardware-specific, dataset-specific, embedding-model-specific, and query-distribution-specific. None of those are visible from outside. Trust your numbers, not someone else’s slide.

For most embeddable workloads — under a million vectors, single process, modest recall budget — the answer ends up being “HNSW with USearch defaults is fine.” Which is the bet memista is making. Whether that bet pays off for your workload is what the measurement tells you.

← all posts