
You're Not Choosing a Vector Database — You're Choosing a Recall Budget

Most vector-DB comparisons argue about brands. The decision that actually bites you in production is quieter: how much search accuracy you're silently trading for speed, and what happens to it the moment you add a filter. Here's a mental model, a worked cost example, and the failure modes the docs skip.

Dhileep Kumar · Jun 13, 2026 · 7 min read


Search 'choosing a vector database' and you'll get the same article five times: a table of Pinecone vs Qdrant vs Weaviate vs Chroma vs pgvector, a paragraph each, and a shrug that ends with 'it depends on your use case.' It does depend. But the axis those articles compare on (brand, hosting model, logo) is almost never the axis that hurts you six months in.

Here's the reframe I'd offer after watching these systems behave under real traffic: you are not choosing a database, you are choosing a recall budget. Every vector search is an approximation. The knob you're really turning is how wrong you're willing to be, how fast, and how gracefully that wrongness degrades when reality — filters, scale, updates — shows up. Once you see the decision that way, the tool matters a lot less than the marketing claims, and the two or three settings nobody talks about matter a lot more.

Exact search is a lie you can't afford

Similarity search sounds exact: find the nearest vectors to my query. And you can do it exactly — compare the query against every stored vector and sort. That's a brute-force k-NN scan, and it returns perfect results. The problem is cost: it's linear in your row count. At fifty thousand vectors nobody notices. At fifty million, every query reads fifty million arrays of floats, and your p99 latency turns into a slideshow.
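To make the linear cost concrete, here is a minimal sketch of that brute-force scan in plain Python (no libraries assumed; a production system would use vectorized math). The function names are illustrative, not any database's API. It computes cosine distance, which is 1 minus cosine similarity and the quantity pgvector's <=> operator returns, against every stored vector and keeps the k smallest.

```python
import heapq
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for identical direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_knn(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[int]:
    """Brute-force k-NN: score every stored vector, keep the k closest ids.

    Cost is O(n * d) per query: every row is touched, every time.
    """
    return heapq.nsmallest(
        k,
        range(len(vectors)),
        key=lambda i: cosine_distance(query, vectors[i]),
    )


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(exact_knn([1.0, 0.1], vectors, k=2))  # → [0, 2]
```

The result is perfect by construction, which is why it is the ground truth that every approximate index gets measured against.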

So every production system reaches for ANN — approximate nearest neighbor. HNSW, IVF, DiskANN, ScaNN: different data structures, same bargain. They build an index that lets a query touch a few thousand candidates instead of fifty million, and in exchange they occasionally miss a true nearest neighbor. The fraction of true neighbors you actually get back is your recall. Recall of 0.95 means one in twenty of the 'correct' results silently didn't make the list.
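Recall is cheap to compute once you have exact results to compare against. A minimal sketch, assuming you already hold the ANN result ids and the brute-force top-k ids for the same query; the two lists below are made up to reproduce the one-in-twenty case.

```python
from typing import Iterable


def recall_at_k(approx_ids: Iterable[int], exact_ids: Iterable[int]) -> float:
    """Fraction of the true nearest neighbors the ANN search returned.

    Order is ignored: recall only asks whether each true neighbor made the list.
    """
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical top-20 lists: the index returned a near-miss (id 99)
# in place of one true neighbor (id 19).
exact_top20 = list(range(20))
approx_top20 = list(range(19)) + [99]
print(recall_at_k(approx_top20, exact_top20))  # → 0.95
```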

The number that decides whether your RAG app feels smart or subtly broken is not written on any pricing page. It's the recall you accepted, at the speed you needed, under the filters you forgot to test.

A worked example: the same app at three scales

Say you're building an internal knowledge assistant. Each document chunk becomes a 1,536-dimension embedding. Let's walk the same app through three sizes and watch where the decision actually flips. (These are illustrative orders of magnitude to reason with, not measured benchmarks; your mileage will vary with hardware and config.)

50,000 vectors (~300 MB of floats). The whole index fits in RAM with room to spare. Brute force is often under 50 ms; any index is overkill. If you're already on Postgres, pgvector isn't a compromise here — it's the correct answer, full stop.
2,000,000 vectors (~12 GB). Now brute force hurts and you need a real index. pgvector with HNSW still handles this comfortably on a beefy instance, but you start caring about index build time, memory headroom, and the tuning knobs below. This is the band where teams needlessly panic-migrate.
80,000,000 vectors (~490 GB) + heavy concurrent traffic. Here the dedicated stores (Qdrant, Milvus, a managed Pinecone tier) start earning their bill: sharding, quantization, memory-mapped indexes, and search that stays fast under concurrency without you hand-tuning Postgres autovacuum at 2 a.m.
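The sizes in those rows are plain arithmetic: vectors × dimensions × 4 bytes per float32. A small sketch for rerunning it at your own scale; it counts the raw vector payload only, so real index memory (HNSW neighbor lists, per-row overhead) will be higher. At 80M vectors the raw payload alone comes to roughly 490 GB, which is why that tier leans on quantization and sharding.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(n_vectors: int, dims: int) -> int:
    """Raw float32 payload only. Index structures (e.g. HNSW neighbor lists)
    and per-row storage overhead come on top of this number."""
    return n_vectors * dims * BYTES_PER_FLOAT32


for n in (50_000, 2_000_000, 80_000_000):
    gb = raw_vector_bytes(n, 1536) / 1e9
    print(f"{n:>11,} vectors x 1536 dims ~ {gb:,.1f} GB")

# Doubling dimensions doubles the payload: 3,072-dim vectors at 2M rows.
print(f"{raw_vector_bytes(2_000_000, 3072) / 1e9:,.1f} GB")
```

This prints about 0.3, 12.3 and 491.5 GB for the three scales, and 24.6 GB for the 3,072-dimension case.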

Notice what changed across the three rows. Not the brand: the operational pressure. The migration signal isn't 'we hit a million vectors,' it's 'index build time, memory, or tail latency became someone's actual job.' Until one of those three is a real line item, the fancier system is buying you a problem you don't have.

The pgvector setup nobody shows you past line one

Every tutorial shows CREATE EXTENSION vector and a distance operator, then stops. That's the demo, not the deployment. The part that determines whether pgvector is fast enough is the index and its two dials. Here's the shape that actually matters:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id         bigint PRIMARY KEY,
  tenant_id  bigint NOT NULL,
  created_at timestamptz NOT NULL,
  embedding  vector(1536)
);

-- The index is where recall vs speed is decided.
-- m = graph connectivity, ef_construction = build-time effort
-- (16 and 64 are pgvector's defaults).
CREATE INDEX ON chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time knob: how hard each search tries. Higher = better
-- recall, slower query. SET lasts for the session; inside a
-- transaction, SET LOCAL scopes it to one query. Tune per workload.
SET hnsw.ef_search = 100;

-- $1 is the query embedding, passed as a bind parameter.
SELECT id
FROM chunks
WHERE tenant_id = 42
  AND created_at > now() - interval '90 days'
ORDER BY embedding <=> $1
LIMIT 10;
```

Three things in that snippet are load-bearing and absent from most explainers. First, ef_search is a runtime setting you can raise for a slow, high-stakes query and lower for a cheap one — recall is not a fixed property of the database, it's a per-query decision. Second, ef_construction and m are baked in at build time; getting them wrong means dropping and rebuilding the index, which on a large table is a long, memory-hungry operation you want to schedule, not discover. Third — and this is the one that actually breaks apps — look at that WHERE clause.

The failure mode the benchmarks hide: filtered recall collapse

Benchmark charts measure recall on unfiltered search: pure nearest-neighbor over the whole set. Real queries are never that. They're 'nearest neighbors where tenant_id = 42 and created_at is recent' — similarity plus a filter. And filtering interacts badly with ANN indexes in a way the marketing never mentions.

An HNSW graph is built over all your vectors. When you add a selective filter, the index walks its graph finding near neighbors, then throws away the ones that fail the filter. If your filter is highly selective — one tenant out of ten thousand — the graph walk can exhaust its candidate budget before it finds enough matching rows, and recall for that tenant quietly craters even though your global benchmark said 0.98. You don't get an error. You get worse answers for exactly the queries a filter implies you care about most.
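To see why the collapse is silent rather than loud, here is a deliberately idealized toy model of post-filtering in Python. Assumptions, all hypothetical: item i is the i-th nearest neighbor of the query, tenant membership is i % n_tenants, and the index hands back its first ef_search candidates in perfect order (a real HNSW walk is itself approximate, so this is the best case). The budget of 40 matches pgvector's default hnsw.ef_search.

```python
def post_filtered_search(ranked_ids, allowed, ef_search, k):
    """Idealized post-filtering: keep the index's first ef_search candidates,
    drop the ones that fail the filter, return at most k of the rest."""
    return [i for i in ranked_ids[:ef_search] if allowed(i)][:k]


def exact_filtered(ranked_ids, allowed, k):
    """Ground truth: filter first, then take the k nearest survivors."""
    return [i for i in ranked_ids if allowed(i)][:k]


def tenant_filter(n_tenants, tenant):
    """Hypothetical tenancy scheme: item i belongs to tenant i % n_tenants."""
    return lambda i: i % n_tenants == tenant


# Toy data: item i is the i-th nearest neighbor of the query.
N, K, EF_SEARCH = 10_000, 10, 40
ranked = list(range(N))

for n_tenants, tenant in ((2, 0), (100, 7), (100, 42)):
    allowed = tenant_filter(n_tenants, tenant)
    got = post_filtered_search(ranked, allowed, EF_SEARCH, K)
    want = exact_filtered(ranked, allowed, K)
    recall = len(set(got) & set(want)) / len(want)
    print(f"1 in {n_tenants}, tenant {tenant}: {len(got)} rows, recall {recall:.1f}")
```

With a 1-in-2 filter the budget is plenty (10 rows, recall 1.0). At 1-in-100, tenant 7 gets a single row (recall 0.1) and tenant 42 gets nothing at all (recall 0.0), and neither case raises an error.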

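This is where tools genuinely differ, and where you should spend your evaluation time. Qdrant builds filterable payload indexes and can integrate the filter into the graph traversal. pgvector historically applied the WHERE clause after the HNSW scan, so a query saw at most hnsw.ef_search candidates (40 by default) before filtering and could return fewer rows than its LIMIT: fine for loose filters, painful for tight ones. Version 0.8.0 added iterative index scans (hnsw.iterative_scan) that keep scanning until enough rows pass the filter, so check which version you are running. So the real test isn't 'which DB has higher recall'; it's 'run YOUR filters, at YOUR selectivity, and measure recall on the filtered result set.' If you skip one test in this whole exercise, don't skip that one.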
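A minimal harness for that test, sketched in Python. The search callable is a hypothetical adapter you would write around your own client (pgvector, Qdrant, or anything else); ground truth comes from a brute-force scan over only the rows that pass each filter. The demo plugs in a fake store that post-filters a fixed candidate budget over random data, purely to exercise the harness; the tenancy scheme is an assumption for illustration.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Filter = Callable[[int], bool]
# Hypothetical adapter signature: (query, allowed, k) -> list of row ids.
SearchFn = Callable[[Vector, Filter, int], List[int]]


def l2(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_filtered_knn(query: Vector, vectors: List[Vector],
                       allowed: Filter, k: int) -> List[int]:
    """Ground truth: brute force over only the rows that pass the filter."""
    ids = [i for i in range(len(vectors)) if allowed(i)]
    return sorted(ids, key=lambda i: l2(query, vectors[i]))[:k]


def filtered_recall(search: SearchFn, vectors: List[Vector], queries: List[Vector],
                    filters: Dict[str, Filter], k: int = 10) -> Dict[str, float]:
    """Mean recall@k per named filter, measured against brute force."""
    report = {}
    for name, allowed in filters.items():
        scores = []
        for q in queries:
            truth = exact_filtered_knn(q, vectors, allowed, k)
            if not truth:
                continue  # filter matches no rows: nothing to recall
            got = search(q, allowed, k)
            scores.append(len(set(got) & set(truth)) / len(truth))
        report[name] = sum(scores) / len(scores) if scores else float("nan")
    return report


def make_post_filter_store(vectors: List[Vector], ef_search: int) -> SearchFn:
    """Fake store for the demo: rank everything, keep ef_search, then filter."""
    def search(query: Vector, allowed: Filter, k: int) -> List[int]:
        ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
        return [i for i in ranked[:ef_search] if allowed(i)][:k]
    return search


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2_000)]
tenant_of = [i % 100 for i in range(len(vectors))]  # 100 tenants, 20 rows each
queries = [[rng.random() for _ in range(8)] for _ in range(20)]

report = filtered_recall(
    make_post_filter_store(vectors, ef_search=40),
    vectors,
    queries,
    {"no filter": lambda i: True, "tenant 42": lambda i: tenant_of[i] == 42},
)
print(report)
```

Swap make_post_filter_store for an adapter around your real store, and feed it your production filters at their real selectivity; the unfiltered number is the one vendors publish, the filtered one is the one your users feel.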

A decision table you can actually act on

Strip the branding and the choice collapses to a handful of if/then rules driven by scale, filter selectivity, and who's on call:

Already on Postgres, under ~2M vectors, loose filters → pgvector. One extension, one system to back up, SQL filters next to your search. Migrating later is a bounded, well-trodden task.
Tight, high-selectivity filters (per-tenant, per-user) at scale → test a dedicated store with real filter-aware indexing (Qdrant, Milvus) against your actual selectivity before committing. This is the strongest reason to leave Postgres.
Tens of millions of vectors, high concurrency, strict p99 → dedicated, likely with quantization and sharding. The ops burden of tuning Postgres to match usually isn't worth it here.
Small team, no ops appetite, willing to pay → managed (Pinecone, or hosted Qdrant/Weaviate). You're buying away index tuning and 2 a.m. pages, which is a legitimate thing to buy.
You need keyword + vector (hybrid) search → test it explicitly; hybrid quality and fusion (e.g. reciprocal rank fusion) vary far more between tools than plain vector search does.
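Reciprocal rank fusion itself is small enough to sketch, which makes it easy to run the same fusion offline over each candidate tool's keyword and vector results and compare them on equal terms. A minimal Python version: each document scores the sum of 1/(k + rank) over the lists it appears in, with 1-based ranks and k = 60, the constant commonly used since the original RRF paper. The two input lists are made-up stand-ins for a keyword retriever and the vector index.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from full-text search
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from the ANN index
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine distances live on unrelated scales; the catch is that it also throws away how confident each retriever was.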

The gotchas that show up after launch

A few non-obvious ones worth pre-empting. Dimension bloat: OpenAI's text-embedding-3-large produces 3,072 dimensions, and index memory scales with it. Halving dimensions (many models now support shortening via Matryoshka representation learning) can roughly halve your RAM bill with a small recall hit, and that trade is often a bargain. Update churn: HNSW handles inserts fine but deletes leave tombstones; a high-delete workload needs periodic rebuilds or your index quietly bloats. And the distance-metric trap: if you embedded with a model tuned for cosine similarity but index vectors that aren't unit-normalized with L2 (Euclidean) ops, your results are subtly wrong in a way that looks like 'the model is just mediocre.' (On unit-length vectors, L2 and cosine produce the same ranking, which is exactly why the bug can hide.) Match the operator to the model.
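The metric trap is easy to reproduce. A minimal Python sketch with two made-up documents: on raw vectors, cosine and L2 disagree about which is closer; after normalizing everything to unit length they agree, because for unit vectors the squared L2 distance equals 2 minus 2 times the cosine similarity. In pgvector terms this is the choice between vector_cosine_ops and vector_l2_ops on the index. The truncate_and_renormalize helper is a hypothetical name for the usual Matryoshka shortening step; it is only meaningful for models trained that way, which you should confirm for your model.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query: Vector, docs: Dict[str, Vector],
         dist: Callable[[Vector, Vector], float]) -> List[str]:
    """Document names ordered nearest-first under the given distance."""
    return sorted(docs, key=lambda name: dist(query, docs[name]))


def truncate_and_renormalize(v: Vector, dims: int) -> List[float]:
    """Matryoshka-style shortening: keep the leading dims, restore unit length."""
    return normalize(v[:dims])


query = [1.0, 0.0]
docs = {"same_direction_long": [3.0, 0.0], "nearby_short": [0.8, 0.6]}

print(rank(query, docs, cosine_distance))  # → ['same_direction_long', 'nearby_short']
print(rank(query, docs, l2_distance))      # → ['nearby_short', 'same_direction_long']

unit_docs = {name: normalize(v) for name, v in docs.items()}
print(rank(normalize(query), unit_docs, l2_distance))  # → ['same_direction_long', 'nearby_short']
```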

The honest summary: the vectors are the same floats wherever they live, and the index math is broadly shared across tools. What differs is how each one degrades — under filters, under scale, under churn — and how much of that degradation lands on you versus a managed service. Pick the smallest system that survives your real filters at your real scale, instrument recall so you'd notice if it dropped, and let a measured pain — not a comparison table — tell you when it's time to move. That's a recall budget, spent deliberately. Everything else is logos.
