Vector databases are easy to demo and hard to operate. This post gives you the production view: what to measure, how to trade recall vs. latency, which indexes to choose, and how to run the system without surprises.
What to measure (and why)
- Latency percentiles: p50/p95/p99 for single‑query and batch. Users feel p95; your SLOs should, too.
- Recall@k: fraction of “ground truth” neighbors found in top‑k. Raising recall usually raises latency/cost.
- Throughput: sustained QPS while holding your recall target and p95 latency. Include warm/hot cache scenarios.
- Cost per 1K queries: infra + egress + embedding compute amortized. Optimizations should lower this number.
- Footprint: RAM/disk per million vectors including metadata and index overhead. This caps your scale.
- Tail behavior: 99.9th latency spikes often come from filters or cold shards—spot them early.
Pick 2–3 targets to start, e.g., “p95 < 80 ms at recall@10 ≥ 0.9, under $0.001 per query.”
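To make the first two metrics concrete, here is a minimal sketch of recall@k and p95 latency computed from logged query results (plain Python, no specific engine assumed; the shape of `results` is an assumption):

```python
import statistics

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of ground-truth neighbors that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / min(k, len(relevant_ids))

def p95(latencies_ms):
    """95th-percentile latency from a list of per-query timings (ms)."""
    return statistics.quantiles(latencies_ms, n=100)[94]

def evaluate(results, k=10):
    """`results` is assumed to be [(retrieved_ids, relevant_ids, latency_ms), ...]."""
    recalls = [recall_at_k(r, rel, k) for r, rel, _ in results]
    lats = [lat for _, _, lat in results]
    return {"recall@k": sum(recalls) / len(recalls), "p95_ms": p95(lats)}
```

Track both numbers per deployment and per index-parameter change; either one in isolation is easy to game.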
Recall vs. latency: the core trade‑off
- Flat (exact) search: highest recall, worst latency for big collections.
- HNSW: great recall/latency balance; RAM‑heavy; common in Qdrant/Milvus.
- IVF/IVF‑PQ: fast with large data; tune nlist/nprobe; PQ compresses memory at some recall loss.
Guidance:
- Small (<1M vectors): HNSW with moderate efConstruction/efSearch often wins.
- Large (≫1M): IVF or IVF‑PQ with good training, then re‑rank the top 100–1,000 with a precise metric.
- Hybrid: keyword (BM25) pre‑filter + vector re‑rank improves precision for text documents.
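The guidance above maps directly onto index construction. Below is a sketch using FAISS (an assumption on my part; Qdrant/Milvus expose equivalent knobs through their configs) that builds an IVF‑PQ index, over‑fetches candidates, and re‑ranks them with exact distances. Dimensions, corpus size, and parameter values are placeholders:

```python
import numpy as np
import faiss  # assumed ANN library for the sketch

d = 768                                              # embedding dimension (assumed)
xb = np.random.rand(200_000, d).astype("float32")    # placeholder corpus vectors
xq = np.random.rand(100, d).astype("float32")        # placeholder query vectors

# IVF-PQ: nlist coarse cells, m sub-quantizers of 8 bits each (d must be divisible by m).
nlist, m = 1024, 64
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(xb)                                      # IVF/PQ need a training pass on representative data
index.add(xb)

index.nprobe = 32                                    # cells visited per query: the recall vs. latency dial
_, cand_ids = index.search(xq, 200)                  # over-fetch candidates from the compressed index

# Re-rank each query's candidates with exact distances over the raw vectors.
k = 10
for qi, ids in enumerate(cand_ids):
    ids = ids[ids >= 0]                              # drop -1 padding from sparse cells
    dists = np.linalg.norm(xb[ids] - xq[qi], axis=1)
    topk = ids[np.argsort(dists)[:k]]                # final top-k after exact re-ranking
```

Once the index is built, `nprobe` (or `efSearch` for HNSW) is the main knob; the over-fetch-then-re-rank step recovers most of the recall lost to PQ compression.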
Filtering and metadata
Real apps filter by tenant, tag, time, or access. Filtering reshapes performance:
- Pre‑filter: use inverted indexes or WHERE clauses to shrink candidate sets before ANN.
- Post‑filter: apply after ANN; simpler but risks empty results and wasted work.
- Compound filters: benchmark them; some engines fall back to slower paths.
Always carry minimal metadata (ids, tenant, type, timestamp) in payload. Keep large blobs outside the vector DB and join by id.
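As an illustration of pre-filtering on payload metadata, here is a sketch using the qdrant-client Python API (Qdrant is one of the engines mentioned above; the collection name, field names, and values are placeholders):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")     # assumed local instance

# Pre-filter by tenant and a time window, then run ANN only over the survivors.
hits = client.search(
    collection_name="docs",                            # placeholder collection name
    query_vector=[0.1] * 768,                          # placeholder query embedding
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="tenant_id", match=models.MatchValue(value="acme")),
            models.FieldCondition(key="created_at", range=models.Range(gte=1_700_000_000)),
        ]
    ),
    limit=10,
    with_payload=["doc_id", "type", "timestamp"],      # minimal payload; join large blobs by id elsewhere
)
```

Benchmark the same query with and without the filter: some engines keep ANN performance intact under selective filters, others degrade sharply.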
Multi‑tenancy
- Hard isolation: separate collections or databases per tenant. Easiest to reason about; more overhead.
- Soft isolation: shared collection with tenant_id filters. Efficient but requires careful SLOs and quotas.
- Noisy neighbor controls: per‑tenant rate limits and shard affinity to avoid tail spikes.
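For noisy-neighbor control, a per-tenant token bucket at the query gateway is often enough to start. A minimal in-process sketch (production setups usually back this with Redis or the engine's own quota features; the rate and burst values are assumptions):

```python
import time
from collections import defaultdict

class TenantRateLimiter:
    """Per-tenant token bucket: `rate` tokens/sec, bursting up to `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)

    def allow(self, tenant_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[tenant_id]
        self.last[tenant_id] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[tenant_id] = min(self.burst, self.tokens[tenant_id] + elapsed * self.rate)
        if self.tokens[tenant_id] >= 1:
            self.tokens[tenant_id] -= 1
            return True
        return False        # shed or queue the query instead of hitting the index

limiter = TenantRateLimiter(rate=50, burst=100)   # assumed quota: 50 QPS per tenant
if not limiter.allow("acme"):
    pass  # return 429 or enqueue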
Memory, storage, and cost
- FP32 vectors: 4 bytes × dim (e.g., 1536 → ~6 KB/vector). FP16 halves this with minor accuracy cost.
- HNSW adds index overhead on top of the base vectors (often 1–2× their size). PQ can cut storage to <1 KB/vector at some recall loss.
- Store raw text out‑of‑band (object store, Postgres) and keep embeddings lean.
- Snapshot and back up indexes; test restore times!
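A quick per-vector calculation makes these numbers concrete (a sketch; the 64×8-bit PQ configuration is an assumption, and index overhead is excluded here):

```python
# Per-vector storage at different precisions (dim = 1536, as in the example above).
dim = 1536
sizes = {
    "fp32":   dim * 4,   # 6144 B ≈ 6 KB
    "fp16":   dim * 2,   # 3072 B ≈ 3 KB
    "pq64x8": 64 * 1,    # 64 B of PQ codes; codebooks add a small constant per collection
}

for name, b in sizes.items():
    print(f"{name}: {b} B/vector -> {b * 1_000_000 / 1e9:.2f} GB per million vectors (before index overhead)")
```

Multiply by your engine's measured index overhead before sizing nodes; vendor defaults vary widely.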
Ingestion strategy
- Batch first: accumulate, deduplicate, normalize, embed, and bulk‑insert. It’s cheaper and builds better indexes.
- Streaming updates: queue → micro‑batch (e.g., every 1–5 seconds) to amortize overhead.
- Idempotency keys: avoid duplicate vectors on retries.
- Re‑embedding: version your embedding model; reindex lazily by partition to avoid cliff events.
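Here is a sketch of the queue → micro-batch pattern with content-hash idempotency keys. The `upsert` and `embed` callables are hypothetical stand-ins for your engine's bulk API and your embedding client:

```python
import hashlib
import queue
import time

def idempotency_key(doc_id: str, text: str, model_version: str) -> str:
    """Deterministic id: retries and re-sends of identical content map to the same point."""
    return hashlib.sha256(f"{doc_id}:{model_version}:{text}".encode()).hexdigest()

def micro_batcher(q: queue.Queue, upsert, embed, model_version="emb-v3",
                  max_batch=256, max_wait_s=2.0):
    """Drain the queue into micro-batches (size- or time-bounded), then bulk-upsert."""
    while True:
        batch = [q.get()]                          # block until at least one item arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(q.get(timeout=remaining))
            except queue.Empty:
                break
        ids = [idempotency_key(d["id"], d["text"], model_version) for d in batch]
        upsert(ids=ids, vectors=embed([d["text"] for d in batch]), payloads=batch)
```

Embedding the model version into the key also gives you a clean seam for lazy re-embedding: points written under an old version simply get new ids under the new one, partition by partition.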
Observability
- Emit per‑query metrics: latency, index probes, candidates scanned, hit count, and recall (when labeled).
- Log filters used and result sizes. Empty results with strict filters are a design smell.
- Sample queries for offline quality review.
- Alert on: p95 > SLO, error rate, and shard imbalance.
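A single structured log line per query covers most of that list and is easy to turn into dashboards. A sketch, with hypothetical field names and a caller-supplied `search_fn`:

```python
import json
import logging
import time

log = logging.getLogger("vector_query")

def timed_search(search_fn, query, query_filter=None, k=10, tenant_id=None):
    """Wrap the engine call and emit one structured metrics record per query."""
    start = time.perf_counter()
    hits = search_fn(query, query_filter=query_filter, k=k)
    latency_ms = (time.perf_counter() - start) * 1000
    log.info(json.dumps({
        "tenant_id": tenant_id,
        "k": k,
        "latency_ms": round(latency_ms, 2),
        "hit_count": len(hits),
        "filtered": query_filter is not None,
        "empty_result": len(hits) == 0,   # empty results under strict filters are a design smell
    }))
    return hits
```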
Capacity planning
- Start with a back‑of‑the‑envelope: vectors_per_doc × docs × bytes_per_vector × index_overhead.
- Leave 30–50% headroom for growth and reindexing.
- Plan for compaction windows (write throttling) and rolling index builds.
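The back-of-the-envelope formula plus headroom fits in a few lines; all input numbers below are assumptions to replace with your own:

```python
def plan_capacity(docs, vectors_per_doc, dim, bytes_per_component=4,
                  index_overhead=1.5, headroom=0.4):
    """vectors_per_doc x docs x bytes_per_vector x index_overhead, plus growth/reindex headroom."""
    n_vectors = docs * vectors_per_doc
    base_bytes = n_vectors * dim * bytes_per_component * index_overhead
    return base_bytes * (1 + headroom) / 1e9          # GB to provision

# e.g. 5M docs chunked into 8 vectors each, 768-dim FP32, 40% headroom (all assumed):
print(f"{plan_capacity(5_000_000, 8, 768):.0f} GB")   # roughly 258 GB
```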
Benchmarking methodology
- Fix a labeled dataset (queries → relevant ids).
- Warm the cache and run a baseline (exact or high‑recall settings).
- Sweep index params (efSearch, nprobe, PQ codebooks) and plot recall vs. p95.
- Add realistic filters (tenant, tag, time) and repeat.
- Pick the knee of the curve; validate under concurrency (N workers).
Tools to try: ANN‑Benchmarks configs, your engine’s built‑in bench, or a simple Locust/JMeter harness that records both latency and recall.
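A minimal sweep harness following those steps, assuming a FAISS IVF index (`index`) and a labeled query set (`xq` plus `ground_truth`, arrays of true neighbor ids); all names are placeholders from the earlier sketch:

```python
import time
import numpy as np

def sweep_nprobe(index, xq, ground_truth, k=10, nprobes=(1, 4, 8, 16, 32, 64, 128)):
    """Measure recall@k and p95 latency per nprobe value; pick the knee of the curve."""
    rows = []
    for nprobe in nprobes:
        index.nprobe = nprobe
        lats, recalls = [], []
        for qi in range(len(xq)):
            t0 = time.perf_counter()
            _, ids = index.search(xq[qi:qi + 1], k)
            lats.append((time.perf_counter() - t0) * 1000)
            recalls.append(len(set(ids[0]) & set(ground_truth[qi][:k])) / k)
        rows.append((nprobe, float(np.mean(recalls)), float(np.percentile(lats, 95))))
        print(f"nprobe={nprobe:4d}  recall@{k}={rows[-1][1]:.3f}  p95={rows[-1][2]:.1f} ms")
    return rows
```

Serial per-query timings understate tails, so validate the chosen setting again under N concurrent workers before locking in the parameters.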
Launch checklist
- SLOs defined (latency, recall) and dashboards live.
- Index parameters and embedding model versioned in code.
- Backups, restores, and failover tested.
- Quotas per tenant and rate limits enforced.
- On‑call runbook: what to do if recall tanks or latency spikes.
TL;DR
Pick a target recall and measure it alongside p95 latency. Choose HNSW for simplicity, IVF/IVF‑PQ for scale. Keep metadata small, filter smartly, and test restores. When your dashboards show stable recall and predictable p95 under load, you’re ready for production.