Enterprise RAG: the production pattern where the LLM grounds every answer on your corpus, with citations.
The four-layer architecture regulated enterprises use to ship LLM applications that know your policies, your contracts, your knowledge base — and prove it by citing the chunks they grounded on. Sovereign-deployable end to end.
Enterprise RAG, defined.
Retrieval-Augmented Generation is the dominant production pattern for enterprise LLM use. At ingestion time, documents are chunked, embedded as vectors and stored in a vector database. At query time, the user's question is embedded, the nearest chunks are retrieved, and the LLM is prompted to answer using the retrieved context.
The architectural payoff is that the LLM's knowledge stays current (re-ingest to refresh), domain-specific (choose what to ingest) and traceable (the answer cites the chunks it grounded on). The engineering reality is that 80% of RAG quality lives in retrieval — chunking, embedding choice, hybrid search, re-ranking — not in the generation. Teams that skip straight to fine-tuning before exhausting retrieval improvements waste their first six weeks.
For the underlying terms — embedding, hybrid retrieval, re-ranking, chunking, vector database — see the RAG section of the enterprise AI glossary.
What makes RAG the production default
Grounded on your corpus, not the public internet
Answers cite the chunks they grounded on — your policies, your contracts, your knowledge base. The LLM's training corpus is irrelevant. Knowledge updates on your schedule, not the model maker's.
Auditable by design
Every retrieval is logged. Every answer carries citation provenance. A regulator can replay any answer back to the chunk-level evidence it grounded on and the model version that produced it.
Lower cost than fine-tuning
Re-ingest a document, the model now knows it. No retraining, no GPU hours, no MLOps overhead. The 80/20 rule on enterprise LLM use is that retrieval improvements beat fine-tuning at lower cost — until you're solving a format or voice problem that retrieval can't touch.
Sovereign-deployable end to end
Open-weights embedding model, open-weights LLM, open-weights re-ranker, open-source vector DB. The entire pipeline runs inside your perimeter on your hardware with network egress blocked at the namespace.
The four-layer enterprise RAG stack
Containerised, Kubernetes-native, sovereign-deployable end to end. The stack runs alongside your sovereign LLM serving layer in the same cluster.
Layer 1: ingestion and chunking
Parsers per document type (PDF, DOCX, HTML, scanned), with a chunking strategy tuned per type: semantic splitting for prose, section-aware splitting for structured documents, parent-document retrieval for legal contracts. Metadata extraction lets retrieval filter by jurisdiction, business unit, document type and effective date.
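The parent-document pattern named above for legal contracts can be sketched roughly as follows: small child chunks are what the index matches on, while the full parent section is what gets handed to the LLM. This is a minimal sketch, not a reference implementation. `Section`, `build_child_index` and `parents_for_hits` are hypothetical names, and the fixed 200-character child split is only a placeholder for a real per-type chunker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    section_id: str
    text: str
    jurisdiction: str  # example metadata field used for filtering


def build_child_index(sections, child_size=200):
    """Split each parent section into small child chunks that remember their parent."""
    children = []
    for sec in sections:
        for start in range(0, len(sec.text), child_size):
            children.append({
                "parent_id": sec.section_id,
                "text": sec.text[start:start + child_size],
                "jurisdiction": sec.jurisdiction,
            })
    return children


def parents_for_hits(hits, sections, jurisdiction=None):
    """Map matched child chunks back to unique parent sections, in hit order."""
    by_id = {s.section_id: s for s in sections}
    seen, parents = set(), []
    for hit in hits:
        if jurisdiction and hit["jurisdiction"] != jurisdiction:
            continue  # metadata filter applied before expanding to the parent
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            parents.append(by_id[hit["parent_id"]])
    return parents
```

In a real pipeline the `hits` list would come from the vector search over the child chunks; here it is simply whatever list of child dictionaries the caller passes in.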
Layer 2: embedding and storage
Open-weights embedding model running on customer GPUs (nomic-embed-text for English-primary, BGE-M3 for multilingual). Vector DB chosen by scale: pgvector under 10M chunks, Qdrant for 10-100M, Milvus beyond. Incremental ingestion so corpus updates don't trigger full re-embeds.
Layer 3: retrieval
Hybrid retrieval combines dense vector search with BM25 keyword search via Reciprocal Rank Fusion, typically a 15-25% accuracy lift over either alone. A cross-encoder re-ranker (bge-reranker-v2-m3) scores the top 30 candidates, and the top 5 to 8 are passed to the LLM. Optional HyDE or query expansion for thin queries.
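Reciprocal Rank Fusion itself is small enough to show in full. The sketch below fuses any number of ranked lists of chunk IDs by summing 1 / (k + rank) per list. The constant k=60 is the commonly used default rather than a value stated here, and the chunk IDs and the two hit lists are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c7", "c2", "c9"]   # hypothetical vector-search order
bm25_hits = ["c2", "c4", "c7"]    # hypothetical keyword-search order
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks that appear in both lists (c2, c7) rise above chunks that only one retriever found, which is the behaviour hybrid retrieval relies on. RRF needs only ranks, so the dense and BM25 scores never have to be put on a common scale.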
Layer 4: generation
Open-weights LLM served by vLLM on customer GPUs. The prompt template enforces grounding instructions: refuse to answer if context is insufficient, cite the chunk identifiers, never make up facts. Post-processing injects citations into the response, and PII guardrails inspect the output before it is returned.
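A grounding prompt and a post-generation citation check might look something like the sketch below. The wording of `GROUNDING_RULES`, the `[chunk_id]` citation format and the `INSUFFICIENT_CONTEXT` sentinel are all illustrative assumptions, not the template used in any particular deployment.

```python
import re

GROUNDING_RULES = (
    "Answer only from the context below. "
    "Cite the chunk identifiers you used in square brackets, e.g. [c12]. "
    "If the context is insufficient, reply exactly: INSUFFICIENT_CONTEXT."
)


def build_prompt(question, chunks):
    """chunks: list of (chunk_id, text) pairs chosen by the retrieval layer."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return f"{GROUNDING_RULES}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def check_citations(answer, chunks):
    """Split cited IDs into those that match retrieved chunks and those that do not."""
    allowed = {cid for cid, _ in chunks}
    found = re.findall(r"\[([A-Za-z0-9_-]+)\]", answer)
    cited = [c for c in found if c in allowed]
    unknown = [c for c in found if c not in allowed]
    return cited, unknown
```

An answer that cites an identifier outside the retrieved set (the `unknown` list) is a cheap signal that the model has drifted from the supplied context, and can be rejected or flagged before it reaches the user.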
Six failure modes — and the engineering cure for each
Every RAG implementation we've diagnosed has hit at least three of these. The fix is rarely a better LLM; it's a better retrieval pipeline.
Naive chunking
Fixed-size character splits (the LangChain default) break sentences mid-thought and ignore document structure. Cure: chunker per document type — semantic for prose, section-aware for structured docs, parent-doc retrieval for long contracts.
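As a contrast with fixed-size splitting, here is a minimal sketch of a chunker for prose that only ever breaks on sentence boundaries. The regex sentence splitter is deliberately naive and stands in for a proper semantic splitter; `sentence_chunks` and the 500-character budget are hypothetical.

```python
import re


def sentence_chunks(text, max_chars=500):
    """Pack whole sentences into chunks of at most max_chars where possible."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence  # an oversized sentence becomes its own chunk
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Structured documents and contracts would need their own splitters keyed on headings or clause numbers; this sketch covers only the prose case.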
Dense retrieval only
Pure embedding-based search misses rare entities (invoice numbers, regulation IDs, drug names) the embedding model never saw. Cure: hybrid dense + BM25 with Reciprocal Rank Fusion is the production default for any non-trivial corpus.
No re-ranking
Top-K dense retrieval surfaces relevant-but-not-best chunks; the LLM grounds on weaker context. Cure: cross-encoder re-ranker on the top 30 candidates, top 5-8 passed to generation. Typical lift: 8-15 points on long-document corpora.
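The shape of the re-ranking step can be sketched independently of the model that does the scoring. In the sketch below `score_fn` is the plug-in point where a cross-encoder such as bge-reranker-v2-m3 would sit; `overlap_score` is a toy word-overlap stand-in so the example runs without a model, and both function names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_n=30, keep=5):
    """Score the first top_n candidates against the query and keep the best few."""
    pool = candidates[:top_n]
    scored = sorted(pool, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return scored[:keep]


def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)
```

For example, `rerank("late payment penalty", chunks, overlap_score, keep=2)` keeps the two chunks that share the most words with the query. Swapping in a real cross-encoder changes only `score_fn`, not the surrounding logic.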
Context stuffing
Teams assume more retrieved chunks are better. Quality degrades past 5-8 chunks because of the lost-in-the-middle effect: the model attends most to the start and end of the context window. Cure: tighter retrieval, better re-ranking, not more chunks.
No eval suite
Quality regressions ship to production undetected because no one is measuring on every change. Cure: RAGAS-style evals (context precision, context recall, answer faithfulness, answer relevance) running on every prompt or model change, deployment gated on regression thresholds.
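A deployment gate on regression thresholds can be as simple as comparing the candidate's scores with the last accepted baseline. The sketch below assumes the four metric scores have already been computed elsewhere; `gate` and the 0.02 tolerance are hypothetical, and the thresholds in practice are whatever the team agrees on.

```python
def gate(baseline, candidate, max_drop=0.02):
    """Return the metrics that regressed by more than max_drop; empty means pass."""
    failures = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric, 0.0)  # a missing metric counts as a failure
        if base_score - new_score > max_drop:
            failures[metric] = (base_score, new_score)
    return failures
```

In a CI job, a non-empty result from `gate` would fail the build, so a prompt or model change that drops context recall cannot be merged unnoticed.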
No incremental ingestion
Every corpus update triggers a full re-embed of the entire knowledge base. Cure: content-hash indexing so only changed documents re-embed, with a background worker for the long tail of metadata-only updates.
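Content-hash indexing comes down to storing one hash per document and comparing on each run. A minimal sketch, assuming documents arrive as a `{doc_id: text}` mapping and the previous run's hashes are available as `{doc_id: sha256_hex}`; `plan_ingestion` is a hypothetical name.

```python
import hashlib


def content_hash(text):
    """SHA-256 of the document text, used as a change detector."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(documents, indexed_hashes):
    """Compare current documents with the hashes stored at the last run.

    Returns (to_embed, to_delete, new_hashes): new or changed documents,
    documents that disappeared from the corpus, and the hashes to persist.
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    to_embed = sorted(d for d, h in new_hashes.items() if indexed_hashes.get(d) != h)
    to_delete = sorted(d for d in indexed_hashes if d not in documents)
    return to_embed, to_delete, new_hashes
```

Only the documents in `to_embed` go back through chunking and embedding; unchanged documents keep their existing vectors. Hashing per chunk instead of per document would narrow the re-embed further, at the cost of more bookkeeping.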
Six RAG workloads we ship in production
These are first-pilot patterns — narrow enough to deploy in 6–9 weeks, valuable enough to justify the investment, and structurally similar enough that the second corpus deploys in two to three weeks.
Compliance + policy Q&A
Internal compliance Q&A grounded on the policy corpus, regulatory updates, internal procedure documents. Answers cite the policy clause they grounded on. Common first-pilot use case for BFSI sovereign RAG.
Branch-staff knowledge assistant
Frontline staff get instant answers to product, procedure and customer-process questions. Grounded on the internal knowledge base, never on the public internet. Deployable in multiple languages from the same corpus.
Contract + clause discovery
Legal and procurement teams query the contract corpus by clause type, counterparty, jurisdiction or risk indicator. Hybrid retrieval handles both semantic queries and exact-clause matches.
Customer-support deflection
Customer chat or voice deflection grounded on the support knowledge base. Tier-1 queries resolved without human handoff; escalation paths preserved for edge cases. Sovereign-deployable for regulated industries.
Clinical-guidance assist
Clinicians query the internal clinical guidelines, drug interaction database and procedure documents. Grounded answers with citations to the source guideline section. HIPAA-compliant sovereign deployment.
Internal research + insight discovery
Analysts query the research corpus, prior reports, and structured data assets through a single natural-language interface. Multi-document synthesis with per-claim citations.
From corpus inventory to production RAG in 6–9 weeks
This is the timeline for the first corpus on a sovereign platform. Subsequent corpora ride the same embedding and retrieval layers and drop to two to three weeks each.
Corpus inventory + chunking design
We map the customer's document types, volumes, refresh cadence and quality variance. Chunking strategy is designed per type, not a one-size-fits-all default. One week.
Stack deployment
Embedding model, vector DB, retrieval API, LLM serving via vLLM, Langfuse observability — all deployed into the customer's sovereign Kubernetes cluster. Network egress blocked at namespace. One to two weeks.
Corpus ingestion + eval build
Parsing, chunking, embedding the first corpus. SMEs build the eval set (typically 50-200 representative queries with expected behaviour). RAGAS framework deployed inside the perimeter. One to two weeks.
Retrieval tuning + hypercare
Iterate on hybrid retrieval weights, re-ranking, prompt template. Eval suite gates every change. Phased rollout — 5%, 20%, full — with delivery team embedded for 45-day hypercare. Two to three weeks.
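One common way to implement a 5%, 20%, full rollout is deterministic bucketing on a user identifier, so each user stays in the same cohort as the percentage grows. This is a sketch under that assumption; `in_rollout` is a hypothetical helper, and routing could equally be handled at a gateway or feature-flag service.

```python
import hashlib


def in_rollout(user_id, percent):
    """Place a user in a stable 0-99 bucket and admit buckets below percent."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the user ID, everyone admitted at 5% is still admitted at 20%, which keeps the pilot cohort's experience consistent across phases.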
Subsequent corpora
Once the platform is in place, additional corpora ride the same embedding, vector DB and retrieval layers. Typical second-corpus timeline drops to two to three weeks.
The RAG eval suite is the difference between a working demo and a production system.
Production RAG runs four metrics on every change. Context precision — what fraction of retrieved chunks are relevant. Context recall — what fraction of the relevant chunks were retrieved. Answer faithfulness — does the answer ground on the context or hallucinate. Answer relevance — does the answer address the user's question.
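The two retrieval metrics follow directly from the definitions above once an eval query has a labelled set of relevant chunks. The sketch below is a plain set-based reading of those definitions. RAGAS computes its metrics differently in detail, so treat this as an illustration of what is being measured rather than the library's implementation.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved_ids if cid in relevant) / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)
```

With four chunks retrieved, of which two are among three labelled relevant chunks, precision is 2/4 and recall is 2/3. Faithfulness and answer relevance need a judgment about the generated text and cannot be reduced to set arithmetic in the same way.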
MindMap Digital's sovereign RAG deployments ship with RAGAS inside the perimeter, gating every prompt, model, retriever or chunking change on regression against the customer-built eval set. Teams without this discipline ship quality regressions to production and discover them through customer complaints. Teams with this discipline catch them before merge.
Enterprise RAG across the portfolio
Sovereign AI pillar →
The architectural pattern that enterprise RAG ships inside for regulated buyers — data, weights and audit under your control.
Agentic AI pillar →
Agents call RAG as a tool. Retrieval becomes one of the actions in the agentic loop.
Generative AI service →
End-to-end LLM serving on customer infrastructure — the substrate that RAG runs on.
ChatNext →
Conversational AI platform with RAG-grounded answers, multilingual, sovereign-deployable for BFSI and healthcare.
Enterprise AI glossary →
Forty plain-language definitions including RAG, embedding, hybrid retrieval, re-ranking, chunking and semantic search.
117 accelerator library →
Pre-built RAG components — every one air-gap capable. Browse the catalogue.
Enterprise RAG — the questions buyers ask
What is enterprise RAG?
Retrieval-Augmented Generation is the dominant production pattern for enterprise LLM use. At ingestion time, documents are chunked, embedded as vectors and stored in a vector database. At query time, the user's question is embedded, the nearest chunks are retrieved, and the LLM is prompted to answer using the retrieved context. The architectural payoff: the LLM's knowledge stays current (re-ingest to refresh), domain-specific (choose what to ingest) and traceable (the answer cites the chunks it grounded on).
Why is RAG preferable to fine-tuning for most enterprise use cases?
Three reasons. (1) Freshness — RAG knowledge updates with a re-ingestion job; fine-tuning requires a new training run. (2) Auditability — every RAG answer cites the chunks it grounded on, so the regulator can replay the answer chain; fine-tuned knowledge is opaque inside the weights. (3) Cost — building and operating a RAG pipeline is cheaper than running periodic fine-tuning at quality parity, particularly for the common case of "the model needs to know our policies and documents" rather than "the model needs to adopt a specific output format or voice." Fine-tuning earns its keep on the format/voice problem; RAG wins on the knowledge problem.
What are the four layers of an enterprise RAG stack?
Layer 1 — ingestion and chunking: parsers for the customer's document types, chunking strategies tuned per type, metadata extraction. Layer 2 — embedding and storage: a sovereign-deployable embedding model (nomic-embed-text or BGE-M3), a vector database (pgvector under 10M chunks, Qdrant for 10–100M, Milvus beyond). Layer 3 — retrieval: hybrid dense plus BM25 search, a cross-encoder re-ranker, optional query expansion or HyDE. Layer 4 — generation: the LLM, the prompt template, the response post-processing including citation insertion and PII guardrails.
Where does enterprise RAG go wrong in production?
Six common failure modes. (1) Naive chunking: fixed-size character splits break sentences and drop semantic boundaries. (2) No hybrid retrieval: pure dense retrieval misses rare entities and exact-match queries. (3) No re-ranking: top-K retrieval surfaces relevant-but-not-best chunks, so the LLM grounds on weak context. (4) Stuffed context: teams assume more chunks are better, but quality typically degrades past 5–8 chunks due to the lost-in-the-middle effect. (5) No eval suite: quality regressions ship to production undetected. (6) No incremental ingestion: every corpus update is a full re-embed.
Can enterprise RAG run on sovereign on-premise infrastructure?
Yes — and for regulated industries this is the default deployment model. MindMap Digital's sovereign RAG stack runs entirely inside the customer's perimeter: embedding model on customer GPUs, vector database in customer Kubernetes, LLM serving via vLLM on customer GPUs, retrieval API as a customer-namespace service. Network egress is blocked at the namespace level. Every retrieval and every model call streams into the customer's SIEM. The compliance posture matches the customer's sovereign LLM serving layer because they share the same cluster.
How do you measure RAG quality?
Four metrics, scored against a customer-built eval set on every change. Context precision — what fraction of retrieved chunks are relevant. Context recall — what fraction of the relevant chunks in the corpus were retrieved. Answer faithfulness — does the answer actually ground on the retrieved context or hallucinate. Answer relevance — does the answer actually address the user's question. The open-source RAGAS framework computes all four; MindMap deploys it inside the perimeter alongside Langfuse for production observability. The gating rule: no merge if any metric regresses below the customer's acceptance threshold.
How long does it take to deploy enterprise RAG?
MindMap Digital's standard sovereign RAG deployment is 6–9 weeks from contract to production for a first corpus. Week one: corpus inventory and chunking-strategy design. Weeks two to three: stack deployment (embedding model, vector DB, retrieval API, eval harness). Weeks four to five: corpus ingestion, eval build, retrieval tuning. Weeks six to seven: end-to-end pilot with hypercare. Weeks eight to nine: phased rollout. Subsequent corpora on the same platform deploy in two to three weeks because the infrastructure, identity integration and eval framework are already in place.
What models does MindMap deploy in the enterprise RAG stack?
For embedding: nomic-embed-text for English-primary corpora, BGE-M3 for multilingual workloads; both are open-weights and locally deployable. For re-ranking: bge-reranker-v2-m3, which runs on CPU for low-traffic deployments and on a small GPU at production scale. For generation: Llama 3.3 70B or Qwen 2.5 72B for high-quality general workloads, Llama 3.1 8B or Mistral 7B for high-throughput specialised workloads, all served via vLLM. We deliberately choose open-weights at every layer: closed-source frontier APIs cannot be deployed on-prem and are therefore ruled out for sovereign-mandated buyers.