Vector databases are easy to demo and hard to operate. This post gives you the production view: what to measure, how to trade recall vs. latency, which indexes to choose, and how to run the system without surprises.
What to measure (and why)
- Latency percentiles: p50/p95/p99 for single‑query and batch. Users feel p95; your SLOs should, too.
- Recall@k: fraction of “ground truth” neighbors found in top‑k. Raising recall usually raises latency/cost.
- Throughput: sustained QPS while holding your recall target and p95 latency. Include warm/hot cache scenarios.
- Cost per 1K queries: infra + egress + embedding compute amortized. Optimizations should lower this number.
- Footprint: RAM/disk per million vectors including metadata and index overhead. This caps your scale.
- Tail behavior: 99.9th latency spikes often come from filters or cold shards—spot them early.
Pick 2–3 targets to start, e.g., “p95 < 80 ms at recall@10 ≥ 0.9, under $0.001 per query.”
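To make the first two metrics concrete, here is a minimal sketch of recall@k and p95 latency computed from logged query results (plain Python, no specific engine assumed; the shape of `results` is an assumption):

```python
import statistics

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of ground-truth neighbors that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / min(k, len(relevant_ids))

def p95(latencies_ms):
    """95th-percentile latency from a list of per-query timings (ms)."""
    return statistics.quantiles(latencies_ms, n=100)[94]

def evaluate(results, k=10):
    """`results` is assumed to be [(retrieved_ids, relevant_ids, latency_ms), ...]."""
    recalls = [recall_at_k(r, rel, k) for r, rel, _ in results]
    lats = [lat for _, _, lat in results]
    return {"recall@k": sum(recalls) / len(recalls), "p95_ms": p95(lats)}
```

Track both numbers per deployment and per index-parameter change; either one in isolation is easy to game.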
Recall vs. latency: the core trade‑off
- Flat (exact) search: highest recall, worst latency for big collections.
- HNSW: great recall/latency balance; RAM‑heavy; common in Qdrant/Milvus.
- IVF/IVF‑PQ: fast with large data; tune nlist/nprobe; PQ compresses memory at some recall loss.
Guidance:
- Small (<1M vectors): HNSW with moderate efConstruction/efSearch often wins.
- Large (≫1M): IVF or IVF‑PQ with good training, then re‑rank the top 100–1,000 with a precise metric.
- Hybrid: keyword (BM25) pre‑filter + vector re‑rank improves precision for text documents.
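The guidance above maps directly onto index construction. Below is a sketch using FAISS (an assumption on my part; Qdrant/Milvus expose equivalent knobs through their configs) that builds an IVF‑PQ index, over‑fetches candidates, and re‑ranks them with exact distances. Dimensions, corpus size, and parameter values are placeholders:

```python
import numpy as np
import faiss  # assumed ANN library for the sketch

d = 768                                              # embedding dimension (assumed)
xb = np.random.rand(200_000, d).astype("float32")    # placeholder corpus vectors
xq = np.random.rand(100, d).astype("float32")        # placeholder query vectors

# IVF-PQ: nlist coarse cells, m sub-quantizers of 8 bits each (d must be divisible by m).
nlist, m = 1024, 64
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(xb)                                      # IVF/PQ need a training pass on representative data
index.add(xb)

index.nprobe = 32                                    # cells visited per query: the recall vs. latency dial
_, cand_ids = index.search(xq, 200)                  # over-fetch candidates from the compressed index

# Re-rank each query's candidates with exact distances over the raw vectors.
k = 10
for qi, ids in enumerate(cand_ids):
    ids = ids[ids >= 0]                              # drop -1 padding from sparse cells
    dists = np.linalg.norm(xb[ids] - xq[qi], axis=1)
    topk = ids[np.argsort(dists)[:k]]                # final top-k after exact re-ranking
```

Once the index is built, `nprobe` (or `efSearch` for HNSW) is the main knob; the over-fetch-then-re-rank step recovers most of the recall lost to PQ compression.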
Filtering and metadata
Real apps filter by tenant, tag, time, or access. Filtering reshapes performance:
- Pre‑filter: use inverted indexes or WHERE clauses to shrink candidate sets before ANN.
- Post‑filter: apply after ANN; simpler but risks empty results and wasted work.
- Compound filters: benchmark them; some engines fall back to slower paths.
Always carry minimal metadata (ids, tenant, type, timestamp) in payload. Keep large blobs outside the vector DB and join by id.
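As an illustration of pre-filtering on payload metadata, here is a sketch using the qdrant-client Python API (Qdrant is one of the engines mentioned above; the collection name, field names, and values are placeholders):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")     # assumed local instance

# Pre-filter by tenant and a time window, then run ANN only over the survivors.
hits = client.search(
    collection_name="docs",                            # placeholder collection name
    query_vector=[0.1] * 768,                          # placeholder query embedding
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="tenant_id", match=models.MatchValue(value="acme")),
            models.FieldCondition(key="created_at", range=models.Range(gte=1_700_000_000)),
        ]
    ),
    limit=10,
    with_payload=["doc_id", "type", "timestamp"],      # minimal payload; join large blobs by id elsewhere
)
```

Benchmark the same query with and without the filter: some engines keep ANN performance intact under selective filters, others degrade sharply.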
Multi‑tenancy
- Hard isolation: separate collections or databases per tenant. Easiest to reason about; more overhead.
- Soft isolation: shared collection with tenant_id filters. Efficient but requires careful SLOs and quotas.
- Noisy neighbor controls: per‑tenant rate limits and shard affinity to avoid tail spikes.
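For noisy-neighbor control, a per-tenant token bucket at the query gateway is often enough to start. A minimal in-process sketch (production setups usually back this with Redis or the engine's own quota features; the rate and burst values are assumptions):

```python
import time
from collections import defaultdict

class TenantRateLimiter:
    """Per-tenant token bucket: `rate` tokens/sec, bursting up to `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)

    def allow(self, tenant_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[tenant_id]
        self.last[tenant_id] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[tenant_id] = min(self.burst, self.tokens[tenant_id] + elapsed * self.rate)
        if self.tokens[tenant_id] >= 1:
            self.tokens[tenant_id] -= 1
            return True
        return False        # shed or queue the query instead of hitting the index

limiter = TenantRateLimiter(rate=50, burst=100)   # assumed quota: 50 QPS per tenant
if not limiter.allow("acme"):
    pass  # return 429 or enqueue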
Memory, storage, and cost
- FP32 vectors: 4 bytes × dim (e.g., 1536 → ~6 KB/vector). FP16 halves this with minor accuracy cost.
- HNSW adds index overhead on top of the base vectors (often 1–2× their size). PQ can cut storage to <1 KB/vector at some recall loss.
- Store raw text out‑of‑band (object store, Postgres) and keep embeddings lean.
- Snapshot and back up indexes; test restore times!
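A quick per-vector calculation makes these numbers concrete (a sketch; the 64×8-bit PQ configuration is an assumption, and index overhead is excluded here):

```python
# Per-vector storage at different precisions (dim = 1536, as in the example above).
dim = 1536
sizes = {
    "fp32":   dim * 4,   # 6144 B ≈ 6 KB
    "fp16":   dim * 2,   # 3072 B ≈ 3 KB
    "pq64x8": 64 * 1,    # 64 B of PQ codes; codebooks add a small constant per collection
}

for name, b in sizes.items():
    print(f"{name}: {b} B/vector -> {b * 1_000_000 / 1e9:.2f} GB per million vectors (before index overhead)")
```

Multiply by your engine's measured index overhead before sizing nodes; vendor defaults vary widely.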
Ingestion strategy
- Batch first: accumulate, deduplicate, normalize, embed, and bulk‑insert. It’s cheaper and builds better indexes.
- Streaming updates: queue → micro‑batch (e.g., every 1–5 seconds) to amortize overhead.
- Idempotency keys: avoid duplicate vectors on retries.
- Re‑embedding: version your embedding model; reindex lazily by partition to avoid cliff events.
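Here is a sketch of the queue → micro-batch pattern with content-hash idempotency keys. The `upsert` and `embed` callables are hypothetical stand-ins for your engine's bulk API and your embedding client:

```python
import hashlib
import queue
import time

def idempotency_key(doc_id: str, text: str, model_version: str) -> str:
    """Deterministic id: retries and re-sends of identical content map to the same point."""
    return hashlib.sha256(f"{doc_id}:{model_version}:{text}".encode()).hexdigest()

def micro_batcher(q: queue.Queue, upsert, embed, model_version="emb-v3",
                  max_batch=256, max_wait_s=2.0):
    """Drain the queue into micro-batches (size- or time-bounded), then bulk-upsert."""
    while True:
        batch = [q.get()]                          # block until at least one item arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(q.get(timeout=remaining))
            except queue.Empty:
                break
        ids = [idempotency_key(d["id"], d["text"], model_version) for d in batch]
        upsert(ids=ids, vectors=embed([d["text"] for d in batch]), payloads=batch)
```

Embedding the model version into the key also gives you a clean seam for lazy re-embedding: points written under an old version simply get new ids under the new one, partition by partition.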
Observability
- Emit per‑query metrics: latency, index probes, candidates scanned, hit count, and recall (when labeled).
- Log filters used and result sizes. Empty results with strict filters are a design smell.
- Sample queries for offline quality review.
- Alert on: p95 > SLO, error rate, and shard imbalance.
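A single structured log line per query covers most of that list and is easy to turn into dashboards. A sketch, with hypothetical field names and a caller-supplied `search_fn`:

```python
import json
import logging
import time

log = logging.getLogger("vector_query")

def timed_search(search_fn, query, query_filter=None, k=10, tenant_id=None):
    """Wrap the engine call and emit one structured metrics record per query."""
    start = time.perf_counter()
    hits = search_fn(query, query_filter=query_filter, k=k)
    latency_ms = (time.perf_counter() - start) * 1000
    log.info(json.dumps({
        "tenant_id": tenant_id,
        "k": k,
        "latency_ms": round(latency_ms, 2),
        "hit_count": len(hits),
        "filtered": query_filter is not None,
        "empty_result": len(hits) == 0,   # empty results under strict filters are a design smell
    }))
    return hits
```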
Capacity planning
- Start with a back‑of‑the‑envelope: vectors_per_doc × docs × bytes_per_vector × index_overhead.
- Leave 30–50% headroom for growth and reindexing.
- Plan for compaction windows (write throttling) and rolling index builds.
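The back-of-the-envelope formula plus headroom fits in a few lines; all input numbers below are assumptions to replace with your own:

```python
def plan_capacity(docs, vectors_per_doc, dim, bytes_per_component=4,
                  index_overhead=1.5, headroom=0.4):
    """vectors_per_doc x docs x bytes_per_vector x index_overhead, plus growth/reindex headroom."""
    n_vectors = docs * vectors_per_doc
    base_bytes = n_vectors * dim * bytes_per_component * index_overhead
    return base_bytes * (1 + headroom) / 1e9          # GB to provision

# e.g. 5M docs chunked into 8 vectors each, 768-dim FP32, 40% headroom (all assumed):
print(f"{plan_capacity(5_000_000, 8, 768):.0f} GB")   # roughly 258 GB
```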
Benchmarking methodology
- Fix a labeled dataset (queries → relevant ids).
- Warm the cache and run a baseline (exact or high‑recall settings).
- Sweep index params (efSearch, nprobe, PQ codebooks) and plot recall vs. p95.
- Add realistic filters (tenant, tag, time) and repeat.
- Pick the knee of the curve; validate under concurrency (N workers).
Tools to try: ANN‑Benchmarks configs, your engine’s built‑in bench, or a simple Locust/JMeter harness that records both latency and recall.
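A minimal sweep harness following those steps, assuming a FAISS IVF index (`index`) and a labeled query set (`xq` plus `ground_truth`, arrays of true neighbor ids); all names are placeholders from the earlier sketch:

```python
import time
import numpy as np

def sweep_nprobe(index, xq, ground_truth, k=10, nprobes=(1, 4, 8, 16, 32, 64, 128)):
    """Measure recall@k and p95 latency per nprobe value; pick the knee of the curve."""
    rows = []
    for nprobe in nprobes:
        index.nprobe = nprobe
        lats, recalls = [], []
        for qi in range(len(xq)):
            t0 = time.perf_counter()
            _, ids = index.search(xq[qi:qi + 1], k)
            lats.append((time.perf_counter() - t0) * 1000)
            recalls.append(len(set(ids[0]) & set(ground_truth[qi][:k])) / k)
        rows.append((nprobe, float(np.mean(recalls)), float(np.percentile(lats, 95))))
        print(f"nprobe={nprobe:4d}  recall@{k}={rows[-1][1]:.3f}  p95={rows[-1][2]:.1f} ms")
    return rows
```

Serial per-query timings understate tails, so validate the chosen setting again under N concurrent workers before locking in the parameters.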
Launch checklist
- SLOs defined (latency, recall) and dashboards live.
- Index parameters and embedding model versioned in code.
- Backups, restores, and failover tested.
- Quotas per tenant and rate limits enforced.
- On‑call runbook: what to do if recall tanks or latency spikes.
TL;DR
Pick a target recall and measure it alongside p95 latency. Choose HNSW for simplicity, IVF/IVF‑PQ for scale. Keep metadata small, filter smartly, and test restores. When your dashboards show stable recall and predictable p95 under load, you’re ready for production.