How to Build a RAG System with pgvector and LangChain: The Production Architecture
Why production RAG systems fail in retrieval, not prompting — and how to build scalable hybrid retrieval pipelines with pgvector, BM25, reranking, caching, and observability.
Most production AI failures are not model failures. They are retrieval failures.
If you want to understand why your RAG system is hallucinating, stop looking at your prompt. A perfect prompt with the wrong data yields a confident hallucination. An average prompt with the correct data yields a useful answer. The distance between a database returning a conceptually similar chunk and returning a factually useful chunk is the hardest engineering problem in modern AI.
We call it the Retrieval Gap.

LangChain is a fantastic prototyping wrapper. But underneath those abstractions, you are building a distributed database system. If you treat pgvector like a magic black box, it will melt under load.
Here is how retrieval systems actually break in the real world, the architectural progression required to keep them standing, and the operational scars we collected along the way.
The RAG Maturity Curve
Understanding where your architecture currently sits is the only way to anticipate the next bottleneck.
- Stage 1: The Toy. Jupyter notebook + ChromaDB + 1 PDF. (~10% Retrieval Recall)
- Stage 2: Persistent Infrastructure. Python scripts + PostgreSQL (pgvector) + Docker.
- Stage 3: Operational Reality. FastAPI + PgBouncer + HNSW Indexing.
- Stage 4: Hybrid Retrieval. Vector search + Keyword search (BM25) + Reranking APIs. (~90%+ Retrieval Recall)
- Stage 5: The Semantic Microservice. Autonomous ingestion, Reciprocal Rank Fusion, Semantic Caching, and continuous Recall evaluation.

The Minimal Deployment Topology
Before diving into the failures, you need to understand where the pieces actually live. A Stage 4/5 production architecture completely separates the orchestration wrapper from the data persistence layer.
The Stack:
- Orchestration: FastAPI + LangGraph (stateless async workers)
- Connection Pooler: PgBouncer (protects the DB from connection exhaustion)
- System of Record: PostgreSQL + pgvector (stores metadata, BM25 text, and HNSW embeddings)
- Semantic Cache: Redis (bypasses the DB for repeated queries)
- External Compute: OpenAI (embeddings/generation) + Cohere (cross-encoder reranking)
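To make the topology concrete, here is a minimal wiring sketch. It uses the langchain_community PGVector integration; the hostnames, credentials, port, and collection name are placeholders. The structural point that matters is that the application's DSN targets PgBouncer, never Postgres directly.
```python
import os
import redis
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import PGVector

# The application never dials Postgres directly. The DSN points at PgBouncer
# (conventionally port 6432), which multiplexes hundreds of stateless FastAPI
# workers onto a small pool of real database sessions.
PGBOUNCER_DSN = os.environ.get(
    "DATABASE_URL",
    "postgresql+psycopg2://rag_app:secret@pgbouncer:6432/rag",
)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = PGVector(
    connection_string=PGBOUNCER_DSN,
    embedding_function=embeddings,
    collection_name="engineering_manuals",
)

# Redis fronts the whole pipeline as the semantic cache (covered later).
cache = redis.Redis(host="redis", port=6379, decode_responses=True)
```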
The PDF Lie and Ingestion Contamination
Most RAG failures happen before the embedding model ever sees the text.
Your PDF loader is lying to you. It does not “see” a cleanly formatted engineering manual. It sees a fragmented mess of text, hidden whitespace, and floating footers. If a document has “Confidential – Internal Use Only” on every page, a naive character splitter will attach that string to every single chunk.
Embedding models amplify repeated noise aggressively. If your ingestion layer is dirty, retrieval quality collapses long before the LLM becomes the bottleneck.
You have to clean the data. Overlap is your buffer against slicing equations in half, but regex filtering is mandatory before you even initiate chunking.
```python
import re
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Remove repeating headers/footers BEFORE chunking to prevent semantic poisoning.
# The footer may use a hyphen or an en-dash depending on the PDF, so match both.
cleaned_text = re.sub(r"Confidential\s*[-–]\s*Internal Use Only", "", raw_text)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)
chunks = text_splitter.split_text(cleaned_text)
```
Postmortem: The Night the Index Rebuild Killed Production
Everyone talks about the magic of vector search. Nobody talks about the database locks.
Six months into a deployment, we experienced a complete system outage. It was 3 AM, database CPU was pegged at 99%, and read queries were timing out.
The root cause wasn't traffic. We had hit 5 million vectors. A nightly cron job triggered a mass re-ingestion of updated manuals, which forced the HNSW graph to rebuild a large share of its connection structure. HNSW index builds are incredibly RAM and CPU intensive. The index rebuild starved the database of resources, locking out our read traffic.
The Unresolved Tension: HNSW vs. IVFFlat
This outage forced us to confront an ugly compromise regarding Approximate Nearest Neighbor (ANN) indexing. There is no perfect index.
| Index Type | The Strength | The Weakness | Best For |
|---|---|---|---|
| HNSW | Highest recall, lowest query latency. | Massive RAM footprint, slow index builds. | Read-heavy, static production systems. |
| IVFFlat | Fast build times, low RAM overhead. | Lower recall, requires table scans to train. | High-churn datasets, memory-constrained DBs. |
We knew IVFFlat reduced recall by a few percentage points, but our daily ingestion volume made HNSW operationally impossible during business hours. We traded a 3% precision drop for database stability. Real engineering is choosing which tradeoff you can survive.
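For reference, here is what the two index definitions look like behind that tradeoff. This is a sketch, not our migration script: it assumes a `documents` table with a 1536-dim `embedding` column, and the connection string, index names, and parameter values are illustrative.
```python
import psycopg

# Maintenance DDL goes straight to Postgres, not through the pooler.
# autocommit is required: CREATE INDEX CONCURRENTLY cannot run inside a transaction.
with psycopg.connect("postgresql://rag_app:secret@postgres:5432/rag", autocommit=True) as conn:
    # HNSW: best recall and query latency, but the build is RAM- and CPU-hungry.
    # CONCURRENTLY avoids locking out read traffic (the 3 AM lesson above).
    conn.execute("""
        CREATE INDEX CONCURRENTLY IF NOT EXISTS documents_embedding_hnsw
        ON documents USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
    """)

    # IVFFlat: cheap to rebuild for high-churn data. A lists value of roughly
    # rows / 1000 is a common starting point for corpora under ~1M vectors.
    # conn.execute("""
    #     CREATE INDEX documents_embedding_ivfflat
    #     ON documents USING ivfflat (embedding vector_cosine_ops)
    #     WITH (lists = 500);
    # """)
```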
The Latency Budget Breakdown
In pgvector, the <=> operator performs a Cosine Distance calculation. Without an index, this is a brute-force sequential scan.
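For orientation, the raw query underneath the retriever abstractions looks roughly like this (same hypothetical `documents` table as above; the parameter style assumes psycopg).
```python
# `<=>` is pgvector's cosine distance operator; ORDER BY ... LIMIT is what lets the ANN index kick in.
# Without an index, the distance is computed for every row in a sequential scan.
TOP_K_SQL = """
    SELECT id, content, embedding <=> %(query_vec)s::vector AS cosine_distance
    FROM documents
    ORDER BY embedding <=> %(query_vec)s::vector
    LIMIT 10;
"""

# query_embedding is the 1536-dim embedding of the user's question:
# rows = conn.execute(TOP_K_SQL, {"query_vec": str(query_embedding)}).fetchall()
```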
Once you move to a Stage 4 architecture (Hybrid Search), you are stacking API calls. Here is what a realistic P95 latency budget looks like in production:
| Pipeline Stage | P95 Latency | Operational Note |
|---|---|---|
| Embedding (OpenAI) | 120ms | Network dependent. |
| pgvector Retrieval | 45ms | Assumes warm ANN index. |
| Keyword (BM25) | 18ms | Standard Postgres FTS. |
| Cross-Encoder Rerank | 380ms | The heaviest architectural tax. |
| LLM Generation | 2.4s | Streaming helps perceived latency. |

What We Tried That Failed
Engineering is a graveyard of good ideas. Here are three things we implemented to improve retrieval that actively made the system worse:
- “Just increase the chunk overlap.” We pushed overlap to 50% hoping to preserve more context. It exploded our storage costs and degraded retrieval because the index was flooded with near-duplicate chunks competing for the top-k slots.
- “Retrieve more chunks (Top-K = 25).” We figured the LLM or the reranker would sort it out. Instead, the reranker API choked on the token payload, latency spiked to over 2 seconds, and the “Lost in the Middle” phenomenon worsened.
- “Rely strictly on semantic similarity.” Vector search fails badly at negations (“do not throttle”) and exact identifiers like error codes (ERR_CONN_RESET). Semantic math thinks ERR_CONN_TIMEOUT is practically the same thing. This is why hybrid BM25 search is non-negotiable (see the fusion sketch after this list).
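A common way to merge the vector and BM25 result lists before reranking is Reciprocal Rank Fusion. Here is a sketch under the assumption that each retriever returns an ordered list of chunk IDs; `vector_search` and `keyword_search` are hypothetical helpers standing in for your own retrievers.
```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs (e.g. vector + BM25) into one ordering.

    Each appearance contributes 1 / (k + rank); k=60 is the constant from the original RRF paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# vector_ids = [row.id for row in vector_search(query, top_k=20)]    # hypothetical helpers
# bm25_ids   = [row.id for row in keyword_search(query, top_k=20)]
# candidates = reciprocal_rank_fusion([vector_ids, bm25_ids])[:10]   # hand these to the reranker
```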
The Tenant Leakage Problem
In enterprise RAG, searching the “whole database” is a critical vulnerability.
A vector search returning another customer’s document is not a hallucination problem. It’s a security incident. Naive implementations pass a query to the DB and retrieve the closest math. You must implement strict metadata filtering before the vector distance is calculated, or rely on Postgres Row-Level Security (RLS).
```python
# The Metadata Filter is your security boundary
retriever = db.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"tenant_id": current_user.tenant_id},  # Mandatory
    }
)
```
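If you want the boundary enforced inside the database itself, Row-Level Security makes the filter impossible to forget at the application layer. A minimal sketch, assuming the app sets an `app.tenant_id` setting per transaction; table, policy, and setting names are illustrative, and note that RLS does not bind the table owner unless you also enable FORCE ROW LEVEL SECURITY.
```python
import psycopg

# One-time setup: every query against `documents` is silently restricted to the current tenant.
RLS_SETUP_SQL = """
    ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation ON documents
        USING (tenant_id = current_setting('app.tenant_id'));
"""

def scope_to_tenant(conn: psycopg.Connection, tenant_id: str) -> None:
    # set_config(..., is_local=true) scopes the setting to the current transaction,
    # so a pooled connection cannot leak one tenant's scope into another request.
    conn.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
```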
How to Measure Retrieval Quality (Evaluation Science)
You cannot fix what you do not measure. If you are judging your RAG system by reading the LLM’s final chat output, you are flying blind. You must isolate retrieval failure from generation failure.
We measure Recall@10 (Did the correct, ground-truth chunk appear in the top 10 results?).
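Scoring the metric is deliberately trivial; the real work is maintaining the labeled (question, ground-truth chunk) pairs. A minimal sketch, with `retrieve` standing in for whichever retrieval stage you are evaluating.
```python
def recall_at_k(retrieved_ids: list[str], ground_truth_id: str, k: int = 10) -> float:
    """1.0 if the ground-truth chunk appears in the top-k retrieved IDs, else 0.0."""
    return float(ground_truth_id in retrieved_ids[:k])

# Averaged over a labeled eval set of (question, ground_truth_chunk_id) pairs:
# recall_at_10 = sum(recall_at_k(retrieve(q), gt) for q, gt in eval_set) / len(eval_set)
```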
Evaluation Methodology Baseline
- Corpus: 42 engineering and troubleshooting manuals.
- Chunks: ~510,000 vectors.
- Embedding Model: text-embedding-3-small (1536-dim).
- Chunk Strategy: 1000 tokens / 200 token overlap.
- Hardware: 32GB RAM / 8 vCPU (AWS RDS).
The Benchmark Results:
- Pure Vector Search: Recall@10 = 74%.
- Hybrid (Vector + BM25) + Reranker: Recall@10 = 96%.
If your Recall@10 is 74%, that means roughly 1 out of 4 times the LLM is physically incapable of answering the user’s question because the database didn’t hand it the right text. No amount of prompt engineering will fix that.
The Query Repetition Problem (Semantic Caching)
As your system scales, you will notice that users ask the same core questions 80% of the time. Running the embedding model, querying Postgres, hitting the reranking API, and generating the LLM token stream for “How do I reset my password?” is a massive waste of compute.
By putting a caching layer (like Redis) in front of your pipeline, you can store previous responses keyed to the vector embedding of the user’s question. If a new query comes in with a 0.99 cosine similarity to a cached query, you bypass the entire RAG pipeline and return the cached string. This turns a 3-second, heavy-compute transaction into a 15ms lookup.
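Here is a minimal sketch of that idea using plain Redis and a linear scan over cached embeddings. The key prefix, TTL, and lookup strategy are illustrative; at real volume you would lean on Redis’s built-in vector search (or another ANN structure) to keep the lookup in the low-millisecond range.
```python
import json
import numpy as np
import redis

r = redis.Redis(host="redis", port=6379)
SIMILARITY_THRESHOLD = 0.99  # only near-exact paraphrases short-circuit the pipeline

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_answer(query_embedding: np.ndarray) -> str | None:
    """Return a previously generated answer if a cached query is close enough."""
    for key in r.scan_iter("semcache:*"):
        entry = json.loads(r.get(key))
        if _cosine(query_embedding, np.array(entry["embedding"])) >= SIMILARITY_THRESHOLD:
            return entry["answer"]
    return None

def store_answer(query_embedding: np.ndarray, answer: str, ttl_s: int = 86_400) -> None:
    key = f"semcache:{hash(query_embedding.tobytes())}"
    payload = {"embedding": query_embedding.tolist(), "answer": answer}
    r.set(key, json.dumps(payload), ex=ttl_s)
```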
Most teams think they are building AI systems. In practice, they are building retrieval infrastructure with a language model attached to it.
Models will change. Abstractions like LangChain will evolve. But the fundamental physics of database connections, chunk contamination, latency budgets, and index rebuilds remain exactly the same. The plumbing is what matters.