Jacar mascot — reading along A laptop whose eyes follow your cursor while you read.
Inteligencia Artificial Metodologías

RAG with Postgres and pgvector in production: from PoC to SLO

RAG with Postgres and pgvector in production: from PoC to SLO

Key takeaways

  • pgvector + HNSW is the reasonable default in 2026 up to ~10 M vectors per node: the same Postgres you already run, no new stack and no second backup plan.
  • pgvectorscale (StreamingDiskANN plus statistical binary quantization) raises the ceiling to 50 M and keeps p95 under 50 ms with reranking, per Timescale’s own April 2026 benchmark (471 QPS @ 99 % recall).
  • Two-stage reranking is where the quality comes from: a wide cosine sweep over HNSW, then a fine reorder with a cross-encoder. Drop it and you feel it the moment you leave the demo.
  • DSPy + MLflow evals turn “feels better” into a number you can watch. Changing the prompt without measuring is flying blind.
  • SLOs are what keep a RAG alive in production: p95 < 700 ms end-to-end, recall@10 ≥ 0.85, and a hard cost-per-thousand-queries budget.

Why Postgres + pgvector is default in 2026

Three operational reasons outweigh any benchmark chart.

  • One database, one backup plan. You don’t design a second ops regime for a separate “vector DB”. PITR, logical replication, and pg_dump already cover the embeddings the same way they cover everything else.
  • Relational joins against the embeddings. Filtering by tenant, date, or category is an ordinary WHERE. In specialised vector DBs that usually turns into payload filtering with its own limits.
  • Maturity. ANN-Benchmarks 2025 put pgvector at the top with HNSW; pgvectorscale reached 471 QPS @ 50 M at 99 % recall in its own report, and its StreamingDiskANN narrows the gap with Milvus on large datasets.

When not? Past 100 M vectors, when you need sub-10 ms p99 over thousands of QPS, or when you already run another database for the product and prefer to isolate the AI subsystem. A dedicated Qdrant or Milvus wins there.

Schema and migrations

I always separate documents from chunks. The document is the business unit (permissions, deletion, metadata); the chunk is the retrieval unit. tenant_id rides in both tables on purpose: you will want it for filtering, for partitioning, and for Row-Level Security.

sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id   uuid NOT NULL,
    source      text NOT NULL,
    title       text,
    body        text NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE chunks (
    id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id   uuid NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    tenant_id     uuid NOT NULL,
    chunk_index   int NOT NULL,
    text          text NOT NULL,
    embedding     vector(1024) NOT NULL,
    token_count   int,
    created_at    timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX chunks_tenant_idx ON chunks (tenant_id);

A repeated tenant_id in chunks is not useless redundancy: it saves a join on the hot query and lets you add a partial index or a per-tenant partition later.

On dimensions: vector(1024) matches BGE-M3 and Cohere v3. With text-embedding-3-large you can ask for 3072 or trim to 1024/256 via the dimensions parameter (Matryoshka embeddings survive the trim without collapsing). Fewer dimensions means a smaller index and lower latency, so trim as far as your recall tolerates.

Embedding generation and ingest

  • Model. text-embedding-3-large is still the managed quality ceiling; BGE-M3 at 1024 dim is the sensible local choice; Cohere Embed v3 when the case is genuinely multilingual. Fix the model and dimension before indexing anything: changing them forces a full reindex.
  • Normalise. For cosine, normalise vectors to unit norm at ingest; then you can use inner product (<#>), which is cheaper and gives the same order.
  • Chunking. 500-800 tokens with 50-100 overlap works for technical text. For conversations, splitting per turn keeps the semantics. When the domain allows it, contextual retrieval (prepending a sentence or two that places each chunk in its document) lifts recall measurably.
  • Idempotent ingest. Batch 64-128 chunks per call, and one job with ON CONFLICT (document_id, chunk_index) DO NOTHING so re-running is safe. Store the model and chunking parameters on each row; without that versioning, any later comparison is invalid.

HNSW indexing: parameters that matter

sql
CREATE INDEX chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

The opclass has to match the query operator: vector_cosine_ops with <=>, vector_l2_ops with <->, vector_ip_ops with <#>. If they don’t match, Postgres skips the index and quietly does a seq scan.

  • m = 16: edges per node in the graph. 16 is a good default; 32 raises recall at the cost of size and build time.
  • ef_construction = 64: search depth at build. 128-256 builds a better graph; you pay for it in build time.
  • ef_search: a runtime knob. Start at 40 and raise it per session (SET LOCAL hnsw.ef_search = 80) until you hit your recall@10 target, no further. Every increase costs latency.

The HNSW index lives in memory. If it doesn’t fit in shared_buffers or the OS cache, latency spikes, so size it before you promise a p95. Building it is slow by default too, so give it memory and parallelism:

sql
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;

-- in production, without blocking writes:
CREATE INDEX CONCURRENTLY chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- and watch progress:
SELECT phase, round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct
FROM pg_stat_progress_create_index;

Storage and quantization

As the corpus grows, the in-memory index is the first bottleneck. Three levers, least to most aggressive:

  • halfvec (half precision). Index with halfvec_cosine_ops and the index is half the size, with a barely perceptible recall loss. It’s the first thing I turn on: effectively free.
  • Binary quantization. Keep a bit version of the embedding and index it with bit_hamming_ops for very cheap candidate generation, then reorder by exact distance over the full vector. pgvectorscale automates this with its statistical binary quantization (SBQ).
  • Fewer dimensions. With Matryoshka embeddings you trim the dimension at the source (3072 → 1024 → 512) and the index shrinks proportionally. Measure recall at each step; there’s a point where it drops off a cliff.

The rule of thumb I follow: halfvec almost always, binary when the dataset doesn’t fit in RAM, dimension trimming when you control the embedding model.

Filtering at scale: the recall cliff

This is the most expensive and least obvious mistake. A query like this looks innocent:

sql
SELECT id, text FROM chunks
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 10;

If tenant_id = $1 matches few rows, HNSW can return its top-k by vector and have almost none pass the filter, so you end up with three results instead of ten, or none. The index doesn’t know about the filter. The options depend on how hard you filter:

  • Partial index or partition per tenant. If the common filter is the tenant, a CREATE INDEX ... WHERE tenant_id = ... or a declarative partition gives each tenant its own graph.
  • Raise ef_search. Usually enough for low-selectivity filters.
  • Iterative scan (pgvector 0.8). SET hnsw.iterative_scan = strict_order (exact order) or relaxed_order (better recall), bounded with hnsw.max_scan_tuples. The index keeps scanning until it gathers enough candidates that pass the filter.

Without measuring recall with the filter applied, this bug never shows up in a demo and always shows up in the first meeting with a customer who can’t find their own document.

Two-stage reranking

An embedding is a lossy compression: a bi-encoder pushes query and document into separate vectors and compares them by cosine, so it loses nuance. A cross-encoder reads query and document together and scores the pair, which is far more accurate and far more expensive. Hence two stages: HNSW brings 50 cheap candidates and the cross-encoder reorders and keeps the best 5.

  • Local cross-encoder: BAAI/bge-reranker-v2-m3 reorders 50 candidates in ~80 ms on CPU, ~20 ms on GPU.
  • Managed cross-encoder: Cohere Rerank v3, ~10 ms p95 plus per-call cost.
  • Middle ground: late-interaction models like ColBERT, between the bi-encoder and the cross-encoder in both cost and quality.

Reranking doubles per-query cost and almost always earns it: on my datasets, dropping it lowers relevance by 10-15 %.

Hybrid search: BM25 + vectors

Vector search only fails at what the vector can’t see: product codes, proper names, version numbers, exact terms. That’s where BM25 (lexical search) recovers what the embedding skips, and the two together hold up best in production.

In Postgres you have two paths: native full-text search (tsvector/tsquery), or ParadeDB (pg_search) if you want real BM25. Then you fuse the two lists with Reciprocal Rank Fusion (RRF), which sums 1/(k + rank) from each list and skips calibrating between incompatible score scales. I go into it in hybrid search: BM25 + vectors.

Evals with DSPy + MLflow

Changing the prompt without measuring is flying blind. Minimal pipeline:

  • DSPy (dspy.ai): define the (query, context) → answer signature and measure with retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevance).
  • MLflow: every run logged with the corpus hash, embedding model, HNSW parameters, metrics, and latency. That lets you compare two configurations without arguing from memory.
  • Frequency: re-run the eval set on every prompt, model, or pipeline change. It’s the bar for approving a change.

Anchor the eval set on 100-200 representative queries and review it quarterly. If you use an LLM as judge, calibrate it against a human-labelled subset: if the judge gives 4.5 where the person gives 3, it’s inflating. I cover it in LLM as judge: mature evaluation.

Operating it: pooling, monitoring, and reindex

What breaks in production isn’t the vector algebra, it’s the operations.

  • Pooling. Put PgBouncer in transaction mode in front. Each retrieval query is short and highly concurrent; without a pool you hit Postgres connection limits.
  • Recall drift. Recall isn’t stable while the corpus grows and changes. Measure recall@10 against the eval set on a cron and alert when it drops below the threshold, rather than waiting for a user to complain.
  • Reindex. After many deletes or embedding updates, the HNSW graph degrades. REINDEX INDEX CONCURRENTLY rebuilds it without blocking. Watch bloat with pg_stat_user_indexes and tune autovacuum on chunks if it churns a lot.

SLOs and load testing

Three SLOs keep the system alive:

  • End-to-end latency: p95 < 700 ms for a short single-turn, p95 < 1500 ms when rerank and a large LLM are in the mix.
  • Recall@10 on the eval set: ≥ 0.85 stable; below 0.80 raises an alert.
  • Cost per thousand queries: a hard, alert-monitored budget. A RAG with an expensive LLM scales OPEX until it’s unsustainable if nobody watches the bill.

I run load tests with k6 on a separate machine from Postgres: 50 concurrent users, 10 minutes sustained, measuring p50/p95/p99. I repeat it after every product release and every pgvector version bump, and I keep a recall regression test in CI so a parameter change can’t quietly degrade quality.

For wider patterns, see RAG in production: patterns and retrieval evaluation frameworks.

Reference repos: pgvector/pgvector[1], pgvectorscale[2], DSPy[3], MLflow[4]. Every company should keep its own eval set: the trouble with vendor benchmarks is they don’t see your domain.

Was this useful?
[Total: 0 · Average: 0]
  1. pgvector/pgvector
  2. pgvectorscale
  3. DSPy
  4. MLflow

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.