Pillar · Enterprise RAG

Enterprise RAG: the production pattern where the LLM grounds every answer on your corpus, with citations.

The four-layer architecture regulated enterprises use to ship LLM applications that know your policies, your contracts, your knowledge base — and prove it by citing the chunks they grounded on. Sovereign-deployable end to end.

4 stack layers · 6–9 weeks contract to production · 15–25% hybrid retrieval lift · Sovereign, end-to-end deployable
Definition

Enterprise RAG, defined.

Retrieval-Augmented Generation is the dominant production pattern for enterprise LLM use. At ingestion time, documents are chunked, embedded as vectors and stored in a vector database. At query time, the user's question is embedded, the nearest chunks are retrieved, and the LLM is prompted to answer using the retrieved context.
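The ingestion and query flow described above can be sketched in a few lines. This is a toy illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `index` dictionary stands in for a vector database, and the chunk ids and policy sentences are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (assumption: stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion time: documents are already chunked here; embed and store each chunk.
chunks = {
    "policy-1": "Employees may carry over up to five days of annual leave.",
    "policy-2": "Expense claims must be filed within thirty days of purchase.",
    "policy-3": "Remote work requires written approval from a line manager.",
}
index = {chunk_id: embed(text) for chunk_id, text in chunks.items()}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Query time: embed the question and return the ids of the k nearest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda cid: cosine(q, index[cid]), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunk_ids: list[str]) -> str:
    """Assemble the prompt the LLM would receive, with chunk ids for citation."""
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in chunk_ids)
    return (
        "Answer using only the context below and cite chunk ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "How many days of annual leave can I carry over?"
prompt = build_prompt(question, retrieve(question, k=1))
```

In a real deployment the final `prompt` would be sent to the LLM; the generation call is left out so the sketch stays self-contained.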

The architectural payoff is that the LLM's knowledge stays current (re-ingest to refresh), domain-specific (choose what to ingest) and traceable (the answer cites the chunks it grounded on). The engineering reality is that, as a rule of thumb, about 80% of RAG quality lives in retrieval (chunking, embedding choice, hybrid search, re-ranking), not in generation. Teams that skip straight to fine-tuning before exhausting retrieval improvements tend to waste their first six weeks.

For the underlying terms — embedding, hybrid retrieval, re-ranking, chunking, vector database — see the RAG section of the enterprise AI glossary.

Why RAG beats the alternatives

What makes RAG the production default

Grounded on your corpus, not the public internet

Answers cite the chunks they grounded on — your policies, your contracts, your knowledge base. The LLM's training corpus is irrelevant. Knowledge updates on your schedule, not the model maker's.

Auditable by design

Every retrieval is logged. Every answer carries citation provenance. A regulator can replay any answer back to the chunk-level evidence it grounded on and the model version that produced it.

Lower cost than fine-tuning

Re-ingest a document and the model can answer from it. No retraining, no GPU hours, no MLOps overhead. The 80/20 rule of enterprise LLM use is that retrieval improvements beat fine-tuning at lower cost, until you're solving a format or voice problem that retrieval can't touch.

Sovereign-deployable end to end

Open-weights embedding model, open-weights LLM, open-weights re-ranker, open-source vector DB. The entire pipeline runs inside your perimeter on your hardware with network egress blocked at the namespace.

Reference architecture

The four-layer enterprise RAG stack

Containerised, Kubernetes-native, sovereign-deployable end to end. The stack runs alongside your sovereign LLM serving layer in the same cluster.

L1
Ingestion + Chunking
Parsers · chunking strategy per document type · metadata extraction
● ON YOUR INFRA

Parsers per document type (PDF, DOCX, HTML, scanned), chunking strategy tuned per type — semantic splitting for prose, section-aware for structured documents, parent-document retrieval for legal contracts. Metadata extraction so retrieval can filter by jurisdiction, business unit, document type, effective date.
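As a rough illustration of why metadata is extracted at ingestion, the sketch below filters chunks by jurisdiction and effective date before any similarity search runs. The `Chunk` dataclass, its field names and the sample records are assumptions made for this example, not a schema taken from the stack.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record carrying the metadata extracted at ingestion."""
    chunk_id: str
    text: str
    jurisdiction: str
    doc_type: str
    effective_date: date

def filter_chunks(chunks: list[Chunk], jurisdiction: str, as_of: date) -> list[Chunk]:
    """Keep chunks for one jurisdiction that were already in effect on `as_of`."""
    return [
        c for c in chunks
        if c.jurisdiction == jurisdiction and c.effective_date <= as_of
    ]

corpus = [
    Chunk("c1", "Retention period is five years.", "EU", "policy", date(2023, 1, 1)),
    Chunk("c2", "Retention period is seven years.", "EU", "policy", date(2025, 1, 1)),
    Chunk("c3", "Retention period is six years.", "US", "policy", date(2022, 6, 1)),
]

# Only the EU chunk that was in force on the query date survives the filter.
eligible = filter_chunks(corpus, jurisdiction="EU", as_of=date(2024, 6, 1))
```

In a vector database the same idea is usually expressed as a metadata filter attached to the similarity query, rather than a Python list comprehension.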

L2
Embedding + Storage
BGE-M3 or nomic-embed-text · pgvector / Qdrant / Milvus
● ON YOUR INFRA

Open-weights embedding model running on customer GPUs (nomic-embed-text for English-primary, BGE-M3 for multilingual). Vector DB chosen by scale — pgvector under 10M chunks, Qdrant for 10-100M, Milvus beyond. Incremental ingestion so corpus updates don't trigger full re-embeds.
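The sizing guidance above reduces to a small decision rule. The function below is only a sketch of that rule: the name `choose_vector_db` is hypothetical, and the handling of the exact boundary values (10M and 100M) is an assumption, since the text gives ranges rather than strict cut-offs.

```python
def choose_vector_db(chunk_count: int) -> str:
    """Map corpus size (number of chunks) to the vector DB suggested in the text."""
    if chunk_count < 0:
        raise ValueError("chunk_count must be non-negative")
    if chunk_count < 10_000_000:
        return "pgvector"   # under 10M chunks
    if chunk_count <= 100_000_000:
        return "qdrant"     # 10M to 100M chunks
    return "milvus"         # beyond 100M chunks
```

Chunk count is only one input; operational familiarity and existing Postgres estates often weigh in as well.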

L3
Retrieval
Hybrid dense + BM25 · cross-encoder re-ranking · query expansion
● ON YOUR INFRA

Hybrid retrieval combines dense vector search with BM25 keyword search via Reciprocal Rank Fusion — 15-25% accuracy lift over either alone. Cross-encoder re-ranker (bge-reranker-v2-m3) on the top-30 candidates, top-5 to top-8 passed to the LLM. Optional HyDE or query expansion for thin queries.
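Reciprocal Rank Fusion itself is short enough to show in full. The sketch below fuses a dense ranking and a BM25 ranking by summing 1 / (k + rank) for each chunk across the lists. The constant k = 60 is the value commonly used in the literature and is an assumption here, as are the function name and the sample chunk ids.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Ties are broken by chunk id for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

dense_ranking = ["c3", "c1", "c7"]   # hypothetical dense vector search results
bm25_ranking = ["c9", "c3", "c1"]    # hypothetical BM25 keyword search results
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
# fused == ["c3", "c1", "c9", "c7"]: chunks found by both retrievers rise to the top
```

Because RRF works on ranks rather than raw scores, the dense and BM25 scores never need to be normalised against each other.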

L4
Generation + Citation
Llama 3.3 / Qwen 2.5 via vLLM · prompt template · citation injection · PII guardrails
● ON YOUR INFRA

Open-weights LLM served by vLLM on customer GPUs. Prompt template enforces grounding instructions: refuse to answer if context is insufficient, cite the chunk identifiers, never make up facts. Post-processing injects citations into the response, and PII guardrails inspect the output before it is returned.
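One way to picture the grounding template and the post-processing check is the sketch below. The prompt wording, the refusal sentence, the bracketed citation format and all function names are assumptions for illustration; the actual template used in a deployment may differ.

```python
import re

# Hypothetical fixed refusal string the template asks the model to use.
REFUSAL = "I cannot answer from the provided context."

def build_grounded_prompt(question: str, chunks: dict[str, str]) -> str:
    """Render retrieved chunks with their ids plus the grounding instructions."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer using only the context. Cite chunk identifiers in square brackets. "
        "Never state facts that are not in the context.\n"
        f"If the context is insufficient, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def cited_ids(answer: str) -> set[str]:
    """Extract every bracketed chunk id from the model's answer."""
    return set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))

def is_acceptable(answer: str, chunks: dict[str, str]) -> bool:
    """Accept a refusal, or an answer whose citations all point at retrieved chunks."""
    if answer.strip() == REFUSAL:
        return True
    ids = cited_ids(answer)
    return bool(ids) and ids <= set(chunks)

retrieved = {"policy-1": "Employees may carry over up to five days of annual leave."}
prompt = build_grounded_prompt("How much leave carries over?", retrieved)
```

A check like `is_acceptable` only verifies that citations exist and resolve to retrieved chunks; whether the cited chunk actually supports the claim is a faithfulness question for the eval suite.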

Where it goes wrong

Six failure modes — and the engineering cure for each

Every RAG implementation we've diagnosed has hit at least three of these. The fix is rarely a better LLM; it's a better retrieval pipeline.

Naive chunking

Fixed-size character splits (a common out-of-the-box setting in frameworks such as LangChain) break sentences mid-thought and ignore document structure. Cure: one chunker per document type (semantic for prose, section-aware for structured docs, parent-doc retrieval for long contracts).
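To make the contrast with fixed-size splitting concrete, here is a minimal sentence-aware chunker. It is a simplification of the semantic splitting described above: it only respects sentence boundaries, uses a naive punctuation-based regex, and the `max_chars` budget is an arbitrary assumption.

```python
import re

def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own
    oversized chunk rather than being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

sample = "First sentence here. Second sentence here. Third one."
pieces = chunk_sentences(sample, max_chars=45)
# pieces == ["First sentence here. Second sentence here.", "Third one."]
```

A fixed 45-character split of the same text would cut the second sentence partway through, which is exactly the failure mode described above.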

Dense retrieval only

Pure embedding-based search misses rare entities (invoice numbers, regulation IDs, drug names) the embedding model never saw. Cure: hybrid dense + BM25 with Reciprocal Rank Fusion, the production default for any non-trivial corpus.

No re-ranking

Top-K dense retrieval surfaces relevant-but-not-best chunks; the LLM grounds on weaker context. Cure: cross-encoder re-ranker on the top 30 candidates, top 5-8 passed to generation. Typical lift: 8-15 points on long-document corpora.
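The shape of the re-ranking step can be sketched as follows. The scoring function is pluggable: in a real deployment it would wrap a cross-encoder such as the one named earlier, whereas the `overlap_score` function here is a deliberately crude token-overlap stand-in so the example runs without any model. The function names and sample passages are assumptions.

```python
from typing import Callable

def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: share of query tokens that also appear in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0

def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float] = overlap_score,
    pool: int = 30,
    top_n: int = 5,
) -> list[str]:
    """Score the first `pool` candidates against the query and keep the best `top_n`."""
    shortlist = candidates[:pool]
    ranked = sorted(shortlist, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_n]

candidates = [
    "the cat sat",
    "refund policy for late returns",
    "late refund",
]
best = rerank("late refund policy", candidates, top_n=2)
# best == ["refund policy for late returns", "late refund"]
```

The `pool` and `top_n` defaults mirror the top-30 in, top-5 out figures given above; both are tuning knobs rather than fixed constants.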

Context stuffing

Teams assume more retrieved chunks is better. Quality degrades past 5-8 chunks due to lost-in-the-middle — the model attends most to the start and end of the context window. Cure: tighter retrieval, better re-ranking, not more chunks.

No eval suite

Quality regressions ship to production undetected because no one is measuring on every change. Cure: RAGAS-style evals (context precision, context recall, answer faithfulness, answer relevance) running on every prompt or model change, deployment gated on regression thresholds.
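A deployment gate of this kind can be as simple as comparing a candidate's scores with the current baseline. In the sketch below the metric names, the `max_drop` tolerance of 0.02 and the sample scores are all assumptions; the scores themselves would come from whatever eval framework is in use.

```python
METRICS = ("context_precision", "context_recall", "faithfulness", "answer_relevance")

def failing_metrics(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,
) -> list[str]:
    """Return the metrics where the candidate regressed by more than max_drop."""
    return [m for m in METRICS if candidate[m] < baseline[m] - max_drop]

baseline = {m: 0.80 for m in METRICS}
candidate = {
    "context_precision": 0.81,
    "context_recall": 0.75,      # regression larger than the tolerance
    "faithfulness": 0.80,
    "answer_relevance": 0.80,
}

failures = failing_metrics(baseline, candidate)
deploy_allowed = not failures
# failures == ["context_recall"], so this change would be blocked
```

In CI, a non-empty `failures` list would typically translate into a non-zero exit code that blocks the merge.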

No incremental ingestion

Every corpus update triggers a full re-embed of the entire knowledge base. Cure: content-hash indexing so only changed documents re-embed, with a background worker for the long tail of metadata-only updates.
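Content-hash indexing can be sketched with the standard library alone. The function names and the shape of the `seen` mapping (document id to previously stored hash) are assumptions for illustration.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_ingestion(
    docs: dict[str, str], seen: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current documents with stored hashes.

    Returns (ids to embed or re-embed, ids to delete from the index).
    Unchanged documents appear in neither list.
    """
    to_embed = [
        doc_id for doc_id, text in docs.items()
        if seen.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in seen if doc_id not in docs]
    return to_embed, to_delete

# Hashes recorded during the previous ingestion run.
seen = {
    "a": content_hash("v1"),
    "b": content_hash("old text"),
    "gone": content_hash("removed"),
}
# Current state of the corpus: "a" unchanged, "b" edited, "c" new, "gone" deleted.
docs = {"a": "v1", "b": "new text", "c": "fresh"}

to_embed, to_delete = plan_ingestion(docs, seen)
# to_embed == ["b", "c"], to_delete == ["gone"]
```

Hashing at the chunk level instead of the document level would shrink the re-embed set further, at the cost of a larger hash table.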

Reference use cases

Six RAG workloads we ship in production

These are first-pilot patterns — narrow enough to deploy in 6–9 weeks, valuable enough to justify the investment, and structurally similar enough that the second corpus deploys in two to three weeks.

Compliance + policy Q&A

Internal compliance Q&A grounded on the policy corpus, regulatory updates, internal procedure documents. Answers cite the policy clause they grounded on. Common first-pilot use case for BFSI sovereign RAG.

Branch-staff knowledge assistant

Frontline staff get instant answers to product, procedure and customer-process questions. Grounded on the internal knowledge base, never on the public internet. Deployable in multiple languages from the same corpus.

Contract + clause discovery

Legal and procurement teams query the contract corpus by clause type, counterparty, jurisdiction or risk indicator. Hybrid retrieval handles both semantic queries and exact-clause matches.

Customer-support deflection

Customer chat or voice deflection grounded on the support knowledge base. Tier-1 queries resolved without human handoff; escalation paths preserved for edge cases. Sovereign-deployable for regulated industries.

Clinical-guidance assist

Clinicians query the internal clinical guidelines, drug interaction database and procedure documents. Grounded answers with citations to the source guideline section. HIPAA-compliant sovereign deployment.

Internal research + insight discovery

Analysts query the research corpus, prior reports, and structured data assets through a single natural-language interface. Multi-document synthesis with per-claim citations.

How we deploy it

From corpus inventory to production RAG in 6–9 weeks

This is the timeline for the first corpus on a sovereign platform. Subsequent corpora ride the same embedding and retrieval layers and drop to two to three weeks each.

01

Corpus inventory + chunking design

We map the customer's document types, volumes, refresh cadence and quality variance. Chunking strategy is designed per type, not a one-size-fits-all default. One week.

02

Stack deployment

Embedding model, vector DB, retrieval API, LLM serving via vLLM, Langfuse observability — all deployed into the customer's sovereign Kubernetes cluster. Network egress blocked at namespace. One to two weeks.

03

Corpus ingestion + eval build

Parsing, chunking, embedding the first corpus. SMEs build the eval set (typically 50-200 representative queries with expected behaviour). RAGAS framework deployed inside the perimeter. One to two weeks.

04

Retrieval tuning + hypercare

Iterate on hybrid retrieval weights, re-ranking, prompt template. Eval suite gates every change. Phased rollout — 5%, 20%, full — with delivery team embedded for 45-day hypercare. Two to three weeks.
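One common way to implement a phased rollout like this is deterministic bucketing on a user identifier, sketched below. The hashing scheme and the function name are assumptions; the source does not say how the 5%, 20% and full phases are assigned.

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and compare to percent.

    The same user always lands in the same bucket, so anyone included at 5%
    stays included when the rollout widens to 20% and then to 100%.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
phase_one = [u for u in users if in_rollout(u, 5)]
phase_two = [u for u in users if in_rollout(u, 20)]
```

Because bucketing is a pure function of the user id, no rollout state needs to be stored, and rolling back is a matter of lowering the percentage.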

05

Subsequent corpora

Once the platform is in place, additional corpora ride the same embedding, vector DB and retrieval layers. Typical second-corpus timeline drops to two to three weeks.

The discipline that separates demo from production

The RAG eval suite is the difference between a working demo and a production system.

Production RAG runs four metrics on every change. Context precision — what fraction of retrieved chunks are relevant. Context recall — what fraction of the relevant chunks were retrieved. Answer faithfulness — does the answer ground on the context or hallucinate. Answer relevance — does the answer address the user's question.
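The two retrieval metrics have simple set-based forms when the eval set lists the relevant chunk ids for each query. The sketch below assumes exactly that kind of hand-labelled ground truth; eval frameworks often estimate these quantities with an LLM judge instead, so treat this as the simplified definition rather than any framework's implementation.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]   # hypothetical retriever output
relevant = {"c1", "c3", "c8"}          # hypothetical SME-labelled ground truth

precision = context_precision(retrieved, relevant)   # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)         # 2 of 3 relevant were retrieved
```

Answer faithfulness and answer relevance judge generated text rather than chunk ids, so they do not reduce to set arithmetic in the same way.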

MindMap Digital's sovereign RAG deployments ship with RAGAS inside the perimeter, gating every prompt, model, retriever or chunking change on regression against the customer-built eval set. Teams without this discipline ship quality regressions to production and discover them through customer complaints. Teams with this discipline catch them before merge.

FAQ

Enterprise RAG — the questions buyers ask

What is enterprise RAG?

Retrieval-Augmented Generation is the dominant production pattern for enterprise LLM use. At ingestion time, documents are chunked, embedded as vectors and stored in a vector database. At query time, the user's question is embedded, the nearest chunks are retrieved, and the LLM is prompted to answer using the retrieved context. The architectural payoff: the LLM's knowledge stays current (re-ingest to refresh), domain-specific (choose what to ingest) and traceable (the answer cites the chunks it grounded on).

Why is RAG preferable to fine-tuning for most enterprise use cases?

Three reasons. (1) Freshness — RAG knowledge updates with a re-ingestion job; fine-tuning requires a new training run. (2) Auditability — every RAG answer cites the chunks it grounded on, so the regulator can replay the answer chain; fine-tuned knowledge is opaque inside the weights. (3) Cost — building and operating a RAG pipeline is cheaper than running periodic fine-tuning at quality parity, particularly for the common case of "the model needs to know our policies and documents" rather than "the model needs to adopt a specific output format or voice." Fine-tuning earns its keep on the format/voice problem; RAG wins on the knowledge problem.

What are the four layers of an enterprise RAG stack?

Layer 1 — ingestion and chunking: parsers for the customer's document types, chunking strategies tuned per type, metadata extraction. Layer 2 — embedding and storage: a sovereign-deployable embedding model (nomic-embed-text or BGE-M3), a vector database (pgvector under 10M chunks, Qdrant for 10–100M, Milvus beyond). Layer 3 — retrieval: hybrid dense plus BM25 search, a cross-encoder re-ranker, optional query expansion or HyDE. Layer 4 — generation: the LLM, the prompt template, the response post-processing including citation insertion and PII guardrails.

Where does enterprise RAG go wrong in production?

Six common failure modes. (1) Naive chunking — fixed-size character splits break sentences and drop semantic boundaries. (2) No hybrid retrieval — pure dense retrieval misses rare entities and exact-match queries. (3) No re-ranking — top-K retrieval surfaces relevant-but-not-best chunks, the LLM grounds on weak context. (4) Stuffed context — teams assume more chunks is better; quality typically degrades past 5–8 chunks due to the lost-in-the-middle effect. (5) No eval suite — quality regressions ship to production undetected. (6) No incremental ingestion — every corpus update is a full re-embed.

Can enterprise RAG run on sovereign on-premise infrastructure?

Yes — and for regulated industries this is the default deployment model. MindMap Digital's sovereign RAG stack runs entirely inside the customer's perimeter: embedding model on customer GPUs, vector database in customer Kubernetes, LLM serving via vLLM on customer GPUs, retrieval API as a customer-namespace service. Network egress is blocked at the namespace level. Every retrieval and every model call streams into the customer's SIEM. The compliance posture matches the customer's sovereign LLM serving layer because they share the same cluster.

How do you measure RAG quality?

Four metrics, scored against a customer-built eval set on every change. Context precision — what fraction of retrieved chunks are relevant. Context recall — what fraction of the relevant chunks in the corpus were retrieved. Answer faithfulness — does the answer actually ground on the retrieved context or hallucinate. Answer relevance — does the answer actually address the user's question. The open-source RAGAS framework computes all four; MindMap deploys it inside the perimeter alongside Langfuse for production observability. The gating rule: no merge if any metric regresses below the customer's acceptance threshold.

How long does it take to deploy enterprise RAG?

MindMap Digital's standard sovereign RAG deployment is 6–9 weeks from contract to production for a first corpus. Week one: corpus inventory and chunking-strategy design. Weeks two to three: stack deployment (embedding model, vector DB, retrieval API, eval harness). Weeks four to five: corpus ingestion, eval build, retrieval tuning. Weeks six to seven: end-to-end pilot with hypercare. Weeks eight to nine: phased rollout. Subsequent corpora on the same platform deploy in two to three weeks because the infrastructure, identity integration and eval framework are already in place.

What models does MindMap deploy in the enterprise RAG stack?

For embedding: nomic-embed-text for English-primary corpora, BGE-M3 for multilingual workloads; both are open-weights and locally deployable. For re-ranking: bge-reranker-v2-m3, which runs on CPU for low-traffic deployments and on a small GPU at production scale. For generation: Llama 3.3 70B or Qwen 2.5 72B for high-quality general workloads, Llama 3.1 8B or Mistral 7B for high-throughput specialised workloads, all served via vLLM. We deliberately choose open-weights at every layer: closed-source frontier APIs cannot be deployed on-prem and are therefore ruled out for sovereign-mandated buyers.
