RAG Architecture in Production: How I Built a 12-Component System Achieving 96.8% Accuracy
A practitioner's deep-dive into hybrid search, temporal intelligence, pgvector, and the hard-won lessons of shipping enterprise RAG at scale.
What Is RAG Architecture and Why Does It Matter?
Retrieval Augmented Generation (RAG) is a system design pattern that grounds large-language-model responses in external, up-to-date knowledge instead of relying solely on the model's training data. Rather than fine-tuning a model every time your data changes, RAG retrieves the most relevant documents at query time, injects them into the prompt context, and lets the LLM synthesise a grounded answer.
Why does this matter? Because hallucinations are the number-one deployment blocker in enterprise AI. When I started building production RAG systems for clients, accuracy on domain-specific questions hovered around 60–65%. That is not a rounding error — it means roughly one in three answers is unreliable. No engineering leader will sign off on that in a customer-facing product.
The promise of RAG architecture is simple: give the model exactly the context it needs, nothing more, nothing less. The reality of RAG architecture is anything but simple. Getting from a proof-of-concept to a production system that serves real users at enterprise scale requires deliberate decisions at every layer — from chunking and embedding to retrieval, re-ranking, and response synthesis. Over the past year I have designed, built, and shipped a 12-component RAG system that now runs in production with a measured 96.8% accuracy on a held-out evaluation set. This article walks through every meaningful decision I made and why.
The Anatomy of a Production RAG System
Most tutorials show RAG as three boxes: embed → retrieve → generate. In production, those three boxes expand into at least a dozen interacting components. Here is the architecture I shipped, broken into its 12 core modules:
- Document Ingestion Pipeline — normalises PDFs, HTML, Markdown, and structured data into a common intermediate representation.
- Chunking Engine — applies context-aware, hierarchical chunking with configurable overlap and boundary detection.
- Embedding Service — generates dense vector representations via a dedicated microservice wrapping an embedding model (currently `text-embedding-3-large` at 3,072 dimensions, quantised to binary for storage efficiency).
- Vector Store (pgvector) — stores embeddings in PostgreSQL with the `pgvector` extension, using HNSW indexing for approximate nearest-neighbour search.
- Keyword Index — a parallel BM25 index built on the same chunks, enabling exact-match and phrase retrieval.
- Hybrid Retriever — merges results from the vector store and the keyword index using Reciprocal Rank Fusion (RRF).
- Re-Ranker — a cross-encoder model that re-scores the top-k candidates on semantic relevance to the query.
- Temporal Filter — applies time-aware weighting so recent documents outrank stale ones when temporal relevance is detected in the query.
- Context Assembler — packs selected chunks into the prompt window, respecting token budgets and inserting source citations.
- Response Synthesiser — the LLM call itself, with system instructions that enforce citation behaviour and confidence calibration.
- Citation Validator — post-processes the response to verify that every claim traces back to a retrieved chunk.
- Feedback Loop & Eval Harness — collects user feedback, runs nightly evaluation sweeps against labelled datasets, and flags accuracy regressions.
Each component has its own configuration surface, failure modes, and performance characteristics. The rest of this article dives into the ones that had the largest impact on accuracy: hybrid search, chunking strategies, temporal intelligence, and the evaluation harness that turned guesswork into measurement.
Hybrid Search: Combining Vector and Keyword Search
What is hybrid search in RAG? Hybrid search is the strategy of running both a semantic vector search and a lexical keyword search against the same corpus, then fusing the results into a single ranked list. This matters because neither approach alone is sufficient for production workloads.
Vector search (dense retrieval) excels at capturing meaning. If a user asks "How do I handle authentication failures?" and the relevant chunk uses the phrase "login error recovery flow," a good embedding model will match them. But vector search struggles with exact identifiers — model numbers, error codes, API route names. A query for ERR_CONN_REFUSED may return chunks about network connectivity in general rather than the specific error.
Keyword search (BM25) is the opposite. It is ruthlessly precise on exact terms but blind to synonyms and paraphrases. By combining both, you cover the full spectrum.
In my system, the Hybrid Retriever operates as follows:
- The user query is simultaneously dispatched to the pgvector HNSW index and the BM25 keyword index.
- Each index returns its top-30 results with scores.
- Results are merged using Reciprocal Rank Fusion (RRF) with a constant `k = 60`, which dampens the impact of outlier scores from either source.
- The merged top-20 are passed to the cross-encoder re-ranker for final scoring.
The weighting between vector and keyword scores is not static. I implemented a query classifier — a lightweight model that categorises incoming queries as `conceptual`, `factual-exact`, or `mixed`. Conceptual queries shift the RRF weight toward vector results; factual-exact queries shift toward BM25. This adaptive weighting alone improved top-5 recall by 8.3 percentage points on our evaluation set.
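The fusion step and the adaptive weighting can be sketched in a few lines of Python. The function and parameter names here are illustrative, not lifted from the production system:

```python
def rrf_fuse(vector_hits, keyword_hits, k=60, w_vec=0.5, w_kw=0.5):
    """Weighted Reciprocal Rank Fusion over two ranked lists of chunk IDs.

    Each appearance of a document contributes weight / (k + rank) to its
    fused score; the constant k = 60 dampens the influence of any single
    source's top ranks.
    """
    scores = {}
    for weight, hits in ((w_vec, vector_hits), (w_kw, keyword_hits)):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# A conceptual query shifts weight toward the vector results.
fused = rrf_fuse(["a", "b", "c"], ["c", "d"], w_vec=0.7, w_kw=0.3)
```

A document that appears in both lists (like `c` above) accumulates score from each source, which is why RRF tends to surface the results both retrievers agree on.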
A practical implementation detail: pgvector's HNSW index in PostgreSQL 16 supports both `vector_cosine_ops` and `vector_ip_ops`. I use cosine distance for normalised embeddings, with an `ef_construction` of 256 and `m` of 32. For BM25 I use a custom GIN index on a `tsvector` column, which keeps the entire retrieval stack inside Postgres — no Elasticsearch required. This dramatically simplifies deployment and operational overhead.
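One consequence of normalising embeddings is worth spelling out: for unit-length vectors, cosine similarity reduces to the raw inner product, so either operator class would rank results identically. A quick self-contained check of that identity:

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity for vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a, b = normalise([3.0, 4.0]), normalise([1.0, 2.0])
# For unit vectors, cosine similarity equals the plain inner product.
inner = sum(x * y for x, y in zip(a, b))
```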
Chunking Strategies That Actually Work
How should you chunk documents for RAG? The answer is: it depends on your data, and you should measure rigorously. That said, here are the strategies I tested and what actually moved the needle in production.
Fixed-Size Chunking (Baseline)
The simplest approach: split text every N tokens with M tokens of overlap. I started here with 512-token chunks and 64-token overlap. It works for homogeneous prose but falls apart on structured documents like API references, legal contracts, and technical manuals where section boundaries carry semantic weight.
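As a baseline, the fixed-size splitter is a few lines. This sketch operates on a pre-tokenised list; the 512/64 defaults mirror the starting configuration above:

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a pre-tokenised document into fixed windows.

    Consecutive chunks share `overlap` tokens, so each new chunk
    starts (size - overlap) tokens after the previous one.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window already covers the tail
    return chunks

tokens = list(range(1000))
chunks = fixed_size_chunks(tokens)
```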
Recursive / Hierarchical Chunking
This is the strategy that stuck. The chunking engine first splits on document structure — headings, section breaks, code fences — then recursively splits oversized sections by paragraph and sentence boundaries. Each chunk retains a parent reference so the context assembler can pull in the surrounding section header when needed. This parent-child relationship was critical: it lets the LLM understand where a chunk lives in the document hierarchy, not just what it says.
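The structure-first idea can be sketched as follows. This is a simplified illustration, not the production engine: it splits on top-level headings, falls back to paragraph splits for oversized sections, and measures length in characters where a real implementation would count tokens.

```python
import re

def hierarchical_chunks(text, max_len=384):
    """Structure-aware chunking sketch.

    Split on top-level markdown headings first; sections that fit the
    budget become single chunks, oversized sections fall back to
    paragraph splits. Every chunk carries its section heading as a
    parent reference for the context assembler.
    """
    chunks = []
    for section in re.split(r"(?m)^(?=# )", text):
        if not section.strip():
            continue
        first_line = section.splitlines()[0]
        heading = first_line.lstrip("# ").strip() if first_line.startswith("#") else None
        if len(section) <= max_len:
            chunks.append({"text": section.strip(), "parent": heading})
        else:
            for para in section.split("\n\n"):
                if para.strip():
                    chunks.append({"text": para.strip(), "parent": heading})
    return chunks

doc = ("# Intro\nShort intro.\n\n# Details\n"
       + "Sentence about setup. " * 30
       + "\n\nFinal paragraph.")
chunks = hierarchical_chunks(doc)
```

The `parent` field is the piece that matters: at assembly time it lets you re-attach the section heading to every chunk that survived retrieval.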
Semantic Chunking
I experimented with embedding-based boundary detection, where you slide a window through the text and split wherever the cosine similarity between adjacent windows drops below a threshold. In theory, this produces chunks that are each internally coherent. In practice, it is slow (you must embed every window), fragile (the threshold is hard to tune globally), and only marginally better than hierarchical chunking on my evaluation set. I dropped it for production.
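For reference, the boundary-detection idea reduces to something like the sketch below. Toy 2-D vectors stand in for real window embeddings, and the threshold is exactly the part that proved hard to tune globally:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_boundaries(window_embeddings, threshold=0.6):
    """Indices where similarity between adjacent windows drops below
    the threshold, i.e. candidate chunk boundaries."""
    return [
        i + 1
        for i in range(len(window_embeddings) - 1)
        if cosine(window_embeddings[i], window_embeddings[i + 1]) < threshold
    ]

# Two coherent runs with a topic shift between windows 1 and 2.
embs = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
```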
The Settings That Mattered
After extensive A/B testing against a labelled evaluation set of 420 question–answer pairs:
- Target chunk size: 384 tokens (not 512). Smaller chunks improved precision without hurting recall because the re-ranker compensates by scoring more candidates.
- Overlap: 48 tokens — enough to preserve sentence continuity at boundaries.
- Metadata enrichment: Every chunk carries its document title, section heading path, page number (for PDFs), and an auto-generated one-line summary used as a secondary embedding target.
- Deduplication: Near-duplicate chunks (cosine similarity > 0.97) are collapsed at ingestion time. This is surprisingly important when ingesting versioned documentation where 80% of paragraphs are unchanged between releases.
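The deduplication pass can be sketched as a greedy keep-first filter over embeddings. The helper and the keep-first policy are illustrative assumptions, not the article's exact implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def collapse_near_duplicates(chunks, embeddings, threshold=0.97):
    """Keep a chunk only if it is not more than `threshold`-similar
    to any chunk already kept (first occurrence wins)."""
    kept, kept_embs = [], []
    for chunk, emb in zip(chunks, embeddings):
        if all(cosine(emb, k) <= threshold for k in kept_embs):
            kept.append(chunk)
            kept_embs.append(emb)
    return kept

chunks = ["refund policy v1", "refund policy v2", "shipping rules"]
embs = [[1.0, 0.0], [0.999, 0.02], [0.0, 1.0]]
```

Note the quadratic comparison against kept chunks; at millions of chunks you would bucket by an approximate-nearest-neighbour lookup first.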
Achieving High Accuracy: From 60% to 96.8%
How do you measure RAG accuracy? First, you need a labelled evaluation dataset — a set of questions paired with gold-standard answers and the source chunks that should be retrieved. I built a 420-pair eval set, manually curated from real user queries and verified by domain experts.
I track three metrics:
- Retrieval Recall@5 — does the correct source chunk appear in the top 5 results?
- Answer Correctness — LLM-as-judge scoring the generated answer against the gold answer on a 1-5 scale, thresholded at ≥ 4.
- Faithfulness — does the answer only contain claims supported by retrieved context? Measured by a separate LLM judge.
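The first metric is mechanical to compute once gold labels exist. A minimal sketch, with illustrative per-query ID lists:

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of queries where at least one gold source chunk
    appears in the top-k retrieved results."""
    hits = sum(
        1 for retrieved, gold in zip(retrieved_ids, gold_ids)
        if set(retrieved[:k]) & set(gold)
    )
    return hits / len(gold_ids)

# Three queries: the gold chunk is retrieved for queries 1 and 3 only.
retrieved = [["c1", "c2", "c9"], ["c4", "c7"], ["c8"]]
gold = [["c2"], ["c5"], ["c8"]]
```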
Here is the accuracy progression as I iterated on the system:
| Iteration | Change | Retrieval Recall@5 | Answer Correctness |
|---|---|---|---|
| v0 — Baseline | Naive vector search, 512-token chunks | 61.2% | 58.4% |
| v1 | Switched to hierarchical chunking (384 tokens) | 72.5% | 66.1% |
| v2 | Added BM25 hybrid search with RRF | 81.0% | 74.8% |
| v3 | Introduced cross-encoder re-ranker | 88.3% | 82.6% |
| v4 | Query classification + adaptive RRF weighting | 91.6% | 87.9% |
| v5 | Metadata enrichment + parent-chunk context | 93.8% | 91.2% |
| v6 | Temporal filtering + chunk deduplication | 95.1% | 93.4% |
| v7 — Current | Prompt engineering + citation validation | 96.8% | 95.3% |
The single biggest jump came from adding hybrid search (v2). The single most surprising jump came from metadata enrichment and parent-chunk injection (v5) — I did not expect a 3.3-point lift in answer correctness from what felt like a bookkeeping change. But giving the LLM the section heading and document title alongside the chunk body dramatically reduced out-of-context hallucinations.
The last two points (v7) came from disciplined prompt engineering: explicitly instructing the model to say "I don't have enough information" rather than guessing, and the citation validator that catches unsupported claims before they reach the user.
Temporal Intelligence: When Your Data Has a Time Dimension
How do you handle time-sensitive data in RAG? Most RAG tutorials ignore temporality entirely. But in production, your knowledge base is not static. Policies change. Product specs get updated. Pricing shifts quarterly. If a user asks "What is the current refund policy?" and your system retrieves a chunk from 2023 alongside one from 2025, you have a problem.
The Temporal Filter component in my system works in three stages:
- Query temporal intent detection: A lightweight classifier determines whether the query has temporal intent. Queries like "current pricing", "latest release notes", or "what changed in Q4" trigger temporal mode. Queries like "explain how HNSW indexing works" do not.
- Timestamp-aware scoring: When temporal mode is active, each retrieved chunk receives a time-decay multiplier. I use an exponential decay function with a configurable half-life (currently 90 days). A chunk from 30 days ago gets a multiplier of ~0.79; a chunk from a year ago gets ~0.06. This multiplier is applied after the RRF fusion but before the re-ranker, so the re-ranker sees a time-adjusted candidate set.
- Version-chain resolution: Documents that are explicit revisions of earlier documents (detected via metadata or filename patterns like `policy-v3.md`) trigger a version-chain lookup. Only the latest version in the chain is eligible for retrieval; superseded versions are automatically deprioritised.
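The multiplier in stage 2 follows directly from the stated 90-day half-life:

```python
def time_decay(age_days, half_life_days=90.0):
    """Exponential time-decay multiplier: halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)
```

This reproduces the figures in the text: a 30-day-old chunk gets a multiplier of about 0.79, and a year-old chunk about 0.06.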
This temporal intelligence layer is what allows the system to handle evolving knowledge bases without requiring manual curation every time a document is updated. On our temporal evaluation subset (87 questions with time-sensitive answers), enabling this layer improved answer correctness from 78.2% to 94.6%.
Scaling RAG for Enterprise: Lessons from Production
What does it take to run RAG at enterprise scale? The jump from a demo to a production system serving hundreds of concurrent users surfaces problems that no tutorial warns you about. Here are the lessons that cost me the most time.
Latency Budget Management
Users expect sub-2-second time-to-first-token. My retrieval pipeline (embed query → vector search → BM25 search → RRF fusion → re-rank) takes roughly 280–400ms end-to-end. The embedding call alone is 50–80ms. To stay within budget I parallelise the vector and keyword searches, cache frequent query embeddings with a 15-minute TTL, and use a distilled cross-encoder (60M parameters instead of 400M) that re-ranks 20 candidates in under 40ms on a single GPU.
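Of those optimisations, the query-embedding cache is the easiest to illustrate. A minimal TTL-cache sketch; the class and names are hypothetical, and the injectable clock exists only so the expiry logic is testable without sleeping:

```python
import time

class TTLEmbeddingCache:
    """Cache query embeddings for a fixed TTL (15 minutes in the text)."""

    def __init__(self, ttl_seconds=900, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get_or_compute(self, query, embed_fn):
        now = self.clock()
        hit = self._store.get(query)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]  # fresh cache hit: skip the embedding call
        embedding = embed_fn(query)
        self._store[query] = (embedding, now)
        return embedding

# Demo with a fake clock: the second lookup hits the cache,
# the third (past the 900-second TTL) recomputes.
fake_now = [0.0]
calls = []
cache = TTLEmbeddingCache(ttl_seconds=900, clock=lambda: fake_now[0])
embed = lambda q: (calls.append(q) or [float(len(q))])
first = cache.get_or_compute("refund policy", embed)
second = cache.get_or_compute("refund policy", embed)
fake_now[0] = 901.0
third = cache.get_or_compute("refund policy", embed)
```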
Index Maintenance
HNSW indexes in pgvector are not free. With 2.4 million chunks, the index consumes roughly 11 GB of memory. I run a nightly vacuum and reindex job during the maintenance window. For hot-path ingestion (documents that must be searchable within minutes), I use a two-tier architecture: new chunks are written to a small, flat IVFFlat index that is cheap to update, and promoted to the main HNSW index during the nightly job. Queries search both tiers and merge results.
Multi-Tenancy and Data Isolation
Enterprise clients require strict data isolation. Each tenant's chunks live in a partitioned table with row-level security enforced at the Postgres level. The pgvector HNSW index is built per partition, which means each tenant gets its own index sized to their corpus. This trades some operational complexity for hard data boundaries — a requirement for SOC 2 compliance.
Observability
Every retrieval request logs: the raw query, the classified query type, the top-10 chunk IDs with scores, the re-ranked order, the number of tokens assembled, and the latency breakdown per component. This telemetry feeds a Grafana dashboard that lets me spot accuracy regressions within hours, not weeks. When average re-ranker scores drop below a threshold, an alert fires — it usually means a recent ingestion introduced corrupted or duplicate chunks.
Common RAG Pitfalls and How to Avoid Them
What are the most common mistakes in RAG implementations? After building, reviewing, and debugging RAG systems for multiple organisations, here are the patterns I see fail most often:
1. Treating Chunking as an Afterthought
Teams spend weeks selecting embedding models and days on chunking. It should be the other way around. Chunking determines the atomic unit of retrieval — get it wrong and no amount of re-ranking will save you. A chunk that spans two unrelated topics poisons the context window. A chunk that is too small loses critical surrounding context. Invest in hierarchical, structure-aware chunking with parent references.
2. No Evaluation Harness
The second most common mistake is having no systematic way to measure accuracy. If you cannot answer "did last week's change make the system better or worse?" you are flying blind. Build a labelled eval set early — even 50 pairs is better than nothing — and run it on every configuration change.
3. Ignoring the Retrieval Step
Many teams blame the LLM when answers are wrong, but the root cause is almost always bad retrieval. If the correct chunk is not in the context window, the best model in the world cannot produce the right answer. Debug retrieval first. Log the retrieved chunks. Read them. Ask yourself: "Given only these chunks, could a human answer the question?" If not, the problem is retrieval, not generation.
4. Over-Stuffing the Context Window
More context is not always better. Including 20 marginally relevant chunks dilutes the signal and increases cost and latency. I found the sweet spot at 5–7 chunks for most queries, with the re-ranker ensuring those are the highest-quality candidates. The context assembler enforces a hard token budget and prioritises chunks with higher re-ranker scores.
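The budget enforcement reduces to greedy selection by re-ranker score. A sketch, with scores and token counts assumed to be precomputed fields:

```python
def assemble_context(candidates, budget_tokens=2048):
    """Greedy packing: take candidates in descending re-ranker score
    until adding another chunk would exceed the hard token budget."""
    selected, used = [], 0
    for chunk in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] <= budget_tokens:
            selected.append(chunk)
            used += chunk["tokens"]
    return selected

candidates = [
    {"id": "a", "score": 0.91, "tokens": 1000},
    {"id": "b", "score": 0.84, "tokens": 900},
    {"id": "c", "score": 0.77, "tokens": 400},
]
packed = assemble_context(candidates, budget_tokens=2048)
```

A production assembler would likely re-sort the selected chunks back into document order before prompting; this sketch only shows the budget enforcement.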
5. No Feedback Loop
Production systems drift. User queries evolve. New documents introduce terminology the embedding model handles poorly. Without a feedback loop — thumbs up/down, explicit corrections, or automated eval sweeps — accuracy silently degrades. My system runs a nightly eval against the latest corpus snapshot and flags any metric that drops more than two percentage points.
RAG vs Fine-Tuning: When to Use Which
Should you use RAG or fine-tuning? This is the question I get asked most often. The answer is not either/or — they solve different problems, and in many production systems they complement each other.
Use RAG When:
- Your knowledge base changes frequently (weekly or more).
- You need source attribution and citations for every answer.
- Data access is governed by permissions — different users see different documents.
- You need to add new knowledge without retraining or waiting for a fine-tuning job.
- Accuracy on long-tail, domain-specific facts is critical (RAG retrieves the exact passage).
Use Fine-Tuning When:
- You need the model to adopt a specific tone, style, or output format consistently.
- The knowledge is stable and unlikely to change for months.
- Latency is extremely tight and you cannot afford the retrieval overhead.
- The task is more about how the model responds than what it knows.
Use Both When:
- You need domain-specific reasoning patterns and up-to-date factual knowledge.
- Fine-tune for output structure and tone; use RAG for factual grounding.
In the system I built, the base model is not fine-tuned. All domain knowledge comes through RAG retrieval, and the model's behaviour is shaped entirely through system prompts and the context assembler's formatting. This keeps the system modular: I can swap the LLM provider without retraining, and I can update the knowledge base without touching the model. For most enterprise use cases, this is the right default. Fine-tuning becomes worthwhile when you need deeply specialised reasoning that prompt engineering alone cannot achieve — for example, generating domain-specific code or classifying documents according to a proprietary taxonomy.
Cost Comparison
RAG has higher per-query cost (embedding + retrieval + larger prompt) but near-zero update cost. Fine-tuning has lower per-query cost but significant update cost every time the training data changes. For a knowledge base that updates weekly with 500+ documents, RAG is typically 3–5× cheaper over a six-month window once you factor in fine-tuning job costs and the engineering time to manage training pipelines.
Building RAG That Actually Works
Production RAG is not a single algorithm — it is a system of interacting components, each of which needs to be designed, measured, and iterated on independently. The 12-component architecture I have described here is the result of months of systematic experimentation, driven by a rigorous evaluation harness and informed by real production traffic.
If you take away three things from this article, let them be these:
- Hybrid search is non-negotiable. Vector-only retrieval leaves too much precision on the table. Adding BM25 and adaptive fusion is the single highest-leverage change you can make.
- Chunking is architecture, not plumbing. Invest in hierarchical, structure-aware chunking with parent references and metadata enrichment. This is where most of the accuracy gains hide.
- Measure everything. Without a labelled evaluation set and automated regression sweeps, you are guessing. The jump from 60% to 96.8% accuracy was not a single breakthrough — it was dozens of small, measured improvements compounding over time.
I am currently working on next-generation improvements including agentic retrieval (where the system autonomously decides how many retrieval passes to run and which indexes to query), and graph-augmented RAG that models entity relationships alongside vector similarity. RAG is also a critical building block in multi-agent AI systems, where retrieval agents provide grounded context to specialised reasoning agents — a pattern I explore in depth in my architecture guide. If you are building production RAG and want to compare notes, I am always happy to talk architecture.