How to Evaluate a RAG System: The Four Numbers That Actually Matter

RAG evaluation is the single highest-leverage engineering discipline in enterprise LLM work, and the one most teams skip. The four metrics that matter (context precision, context recall, answer faithfulness, answer relevance) give you a regression harness for every change to your retrieval pipeline. Here's how to run them in practice.

MindMap Engineering

MindMap Digital

Across the RAG remediations we've been called into over the past 24 months, the same pattern appears in about three quarters of them: the system worked on day one, it doesn't work now, and nobody knows when or why it stopped working. In those cases the cause is the same: no eval harness was ever stood up, so quality drift was invisible until a customer or supervisor noticed.

The most recent case we worked on was an insurer in the UK that had deployed a policy-document RAG nine months earlier. Their answer faithfulness had drifted from an estimated 94% (their best guess from launch acceptance testing) to a level nobody could measure, because there was no eval set. We built a 180-query labelled set in four working days, ran it, and found faithfulness at 71%. We fixed the prompt and retrieval issues over two weeks, brought faithfulness back to 93%, and embedded the eval suite in their CI so it could not regress invisibly again.

That entire experience is preventable with two weeks of work at the start of any RAG programme. The fix is straightforward, well supported by open-source tooling (RAGAS, DeepEval, the Langfuse evaluation harness), and almost free to implement compared to its operational value. Teams skip it because the labelled eval set takes 3–5 days to build correctly, and the engineering team is being asked to ship features.

The four metrics that matter

Context precision: of the chunks retrieved for a query, what fraction are actually relevant? This measures whether your retrieval picks the right chunks and filters out the irrelevant ones.

Context recall: of the relevant chunks that exist in the corpus, what fraction did retrieval find? This measures whether your retrieval is comprehensive enough.

Answer faithfulness: does the generated answer stay grounded in the retrieved context, or does it state facts that are not present in the context? This measures whether your generation is honest about what retrieval supplied.

Answer relevance: does the generated answer actually address the user's question? This measures whether your generation is on-topic.

The four metrics decompose the RAG quality problem into largely separable dimensions: two for retrieval, two for generation. Each measures a different failure mode and each has a different fix.
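The two retrieval metrics can be made concrete with a minimal sketch. It assumes each chunk has a stable ID and that the relevant chunk IDs have been labelled by hand; the function names and the example IDs are hypothetical. This is the plain set-based reading of the definitions above. Evaluation frameworks typically compute judged variants of these metrics, so their numbers will not match this sketch exactly.

```python
from typing import Iterable


def context_precision(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of retrieved chunks that were labelled as relevant."""
    retrieved_ids = list(dict.fromkeys(retrieved))  # de-duplicate, keep order
    relevant_ids = set(relevant)
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of the labelled relevant chunks that retrieval found."""
    relevant_ids = set(relevant)
    if not relevant_ids:
        return 1.0  # assumption: nothing to find counts as full recall
    found = relevant_ids & set(retrieved)
    return len(found) / len(relevant_ids)


# Hypothetical query: four chunks retrieved, three labelled relevant.
retrieved = ["c1", "c7", "c9", "c4"]
relevant = ["c1", "c4", "c5"]

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were found
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The example shows why the two numbers need separate fixes: dropping the irrelevant chunks c7 and c9 would raise precision, but only finding c5 would raise recall.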

How to build the eval set

Build it from real production traces, with the customer's subject-matter experts (SMEs) labelling. Pick 50–200 queries that represent the production distribution. Include the easy queries (where the answer is in a single chunk), the hard queries (where the answer requires synthesis across multiple chunks), the impossible queries (where the corpus genuinely doesn't contain the answer and the system should refuse), and the adversarial queries (where the user is testing the system's limits).

Per query, the SME labels three things: which chunks in the corpus contain the relevant information (ground truth for retrieval), what the correct answer would say (ground truth for generation), and what the answer must NOT say (negative ground truth for hallucination tests).

Building this takes 3–5 working days for two SMEs and an engineer: substantial but not prohibitive.
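One way to hold these labels is a small record per query with a consistency check that runs before the set is used. The sketch below is an assumption about layout, not a required schema: the `EvalCase` class, its field names, the chunk ID format, and the example queries are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

CATEGORIES = {"easy", "hard", "impossible", "adversarial"}


@dataclass
class EvalCase:
    query: str
    category: str                                                # one of CATEGORIES
    relevant_chunk_ids: list[str] = field(default_factory=list)  # retrieval ground truth
    reference_answer: str = ""                                   # generation ground truth
    must_not_say: list[str] = field(default_factory=list)        # negative ground truth

    def problems(self) -> list[str]:
        """Return labelling inconsistencies for this case (empty list if clean)."""
        issues = []
        if self.category not in CATEGORIES:
            issues.append(f"unknown category: {self.category}")
        if self.category == "impossible" and self.relevant_chunk_ids:
            issues.append("impossible query should have no relevant chunks")
        if self.category in {"easy", "hard"} and not self.relevant_chunk_ids:
            issues.append("answerable query needs at least one relevant chunk")
        if not self.reference_answer:
            issues.append("missing reference answer")
        return issues


# Hypothetical cases; the third is deliberately mislabelled.
cases = [
    EvalCase("What is the excess on a standard home policy?", "easy",
             ["doc12#3"], "The excess is stated in the policy schedule."),
    EvalCase("Does the policy cover travel to the Moon?", "impossible",
             [], "The documents do not say; the system should refuse."),
    EvalCase("Compare excess across home and motor policies.", "hard",
             [], "Summarises both excess clauses."),
]

mix = Counter(case.category for case in cases)
problems = {case.query: case.problems() for case in cases if case.problems()}
print(dict(mix))
print(problems)
```

Checking the category mix alongside the labels makes it easy to see when a set has drifted toward easy queries only.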

How to run the metrics

RAGAS is the open-source framework we deploy at customer sites. It runs the four metrics automatically using an LLM-as-judge approach: a strong model is prompted with the question, the retrieved context, the generated answer, and the ground truth, and asked to score each metric. For sovereign deployments we use the customer's own deployed model as the judge, not an external API. The scoring prompts in RAGAS are well engineered, and the scores correlate strongly with SME judgement. The full eval suite for a 200-query set runs in 8–15 minutes depending on the judge model, which works out to roughly 0.6 to 1.1 seconds per metric score (200 queries × 4 metrics = 800 scores). We embed it in the CI pipeline so that every prompt change, retrieval-parameter change, or model upgrade triggers a re-run.
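To show the shape of an LLM-as-judge metric without depending on any framework's API, here is a minimal sketch of judged faithfulness under a common formulation: split the answer into claims, ask a judge whether the context supports each claim, and report the supported fraction. Everything here is an assumption for illustration: the `Judge` type, the naive sentence splitter, and the `keyword_judge` stand-in, which would be replaced by a call to the deployed judge model in a real harness. It is not how RAGAS is implemented.

```python
import re
from typing import Callable

# A judge takes (claim, context) and returns True if the context supports the claim.
Judge = Callable[[str, str], bool]


def split_claims(answer: str) -> list[str]:
    """Naive splitter: one claim per sentence. It breaks on decimals and abbreviations."""
    return [part.strip() for part in answer.split(".") if part.strip()]


def faithfulness(answer: str, context: str, judge: Judge) -> float:
    """Fraction of claims in the answer that the judge finds supported by the context."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def keyword_judge(claim: str, context: str) -> bool:
    """Stand-in judge for local testing: every word of the claim must appear in the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    return all(word in context_words for word in re.findall(r"\w+", claim.lower()))


# Hypothetical example: the second sentence contradicts the context (60 vs 30 days).
context = "The excess is 250 pounds. Claims must be filed within 30 days."
answer = "The excess is 250 pounds. Claims must be filed within 60 days."

score = faithfulness(answer, context, keyword_judge)
print(f"faithfulness={score:.2f}")
```

Keeping the judge as a plain callable means the same harness can run against a cheap stand-in in unit tests and against the real judge model in CI.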

Gating deployment on regression thresholds

Eval scores alone aren't useful; gating is. The customer's SMEs and the engineering team agree on minimum acceptable thresholds per metric (e.g. context precision ≥ 0.85, context recall ≥ 0.80, faithfulness ≥ 0.95, relevance ≥ 0.90). Every PR that touches anything affecting RAG quality (chunking, embedding model, retrieval prompts, generation prompts, model version) runs the eval suite, and the PR cannot merge if any metric falls below its threshold. This is the discipline that distinguishes a production-quality RAG system from a perpetual-incident one. Setup time: half a day. Operational benefit: indefinite.
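A gate of this kind can be a few lines in the CI job. The sketch below uses the example thresholds quoted above; the metric key names, the `gate` and `main` functions, and the sample scores are hypothetical, and how the scores reach the script (a JSON file, the eval tool's output) is left open.

```python
# Example thresholds from the text; real values are agreed with the SMEs.
THRESHOLDS = {
    "context_precision": 0.85,
    "context_recall": 0.80,
    "faithfulness": 0.95,
    "answer_relevance": 0.90,
}


def gate(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum:.2f}")
    return failures


def main(scores: dict[str, float]) -> int:
    """Print failures and return a process exit code (0 = merge allowed)."""
    failures = gate(scores)
    for message in failures:
        print("FAIL", message)
    return 1 if failures else 0


# Hypothetical eval run in which faithfulness has slipped below its threshold.
example_scores = {
    "context_precision": 0.88,
    "context_recall": 0.83,
    "faithfulness": 0.93,
    "answer_relevance": 0.91,
}
exit_code = main(example_scores)  # a CI wrapper would pass this to sys.exit()
```

Treating a missing score as a failure is a deliberate choice in this sketch: a metric that silently stops being reported should block the merge just as a low score does.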

What to do when a metric regresses

Each metric maps to a specific subsystem.

Context precision regression: fix retrieval. Tighten the re-ranker, raise the minimum-score threshold, narrow the chunking.

Context recall regression: broaden retrieval. Relax the score threshold, expand top-K, add query expansion or HyDE.

Faithfulness regression: fix the generation prompt. Use stronger grounding instructions, explicit refusal-when-uncertain language, and citation-injection enforcement.

Relevance regression: fix the generation prompt. Include the user's question more prominently in the prompt and restructure the answer template.

The decomposition is what makes the eval system useful operationally: you know which subsystem to look at when a regression appears, rather than having to debug the whole pipeline.
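The mapping from metric to subsystem is simple enough to encode next to the eval suite, so a failed run points at a starting place. The sketch below is one possible encoding: the `PLAYBOOK` table restates the fixes listed above, while the `triage` function, the 0.02 tolerance, and the baseline and current scores are hypothetical.

```python
# Metric -> (subsystem to inspect, candidate fixes), restating the text above.
PLAYBOOK = {
    "context_precision": ("retrieval", ["tighten the re-ranker",
                                        "raise the minimum-score threshold",
                                        "narrow the chunking"]),
    "context_recall": ("retrieval", ["relax the score threshold",
                                     "expand top-K",
                                     "add query expansion or HyDE"]),
    "faithfulness": ("generation prompt", ["stronger grounding instructions",
                                           "explicit refusal-when-uncertain language",
                                           "citation-injection enforcement"]),
    "answer_relevance": ("generation prompt", ["make the user's question more prominent",
                                               "restructure the answer template"]),
}


def triage(baseline: dict[str, float], current: dict[str, float],
           tolerance: float = 0.02) -> list[tuple[str, str, float, list[str]]]:
    """List metrics that dropped by more than `tolerance`, largest drop first."""
    report = []
    for metric, (subsystem, actions) in PLAYBOOK.items():
        drop = baseline[metric] - current[metric]
        if drop > tolerance:
            report.append((metric, subsystem, round(drop, 4), actions))
    return sorted(report, key=lambda row: row[2], reverse=True)


# Hypothetical scores before and after a change.
baseline = {"context_precision": 0.88, "context_recall": 0.84,
            "faithfulness": 0.96, "answer_relevance": 0.92}
current = {"context_precision": 0.89, "context_recall": 0.79,
           "faithfulness": 0.88, "answer_relevance": 0.915}

report = triage(baseline, current)
for metric, subsystem, drop, actions in report:
    print(f"{metric} dropped {drop:.2f}: look at {subsystem}; try {actions[0]}")
```

The tolerance exists because judged scores vary slightly between runs; its value here is a placeholder that would need to be set from observed run-to-run noise.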

How often to refresh the eval set

The corpus changes (new documents, retired documents, updated policies) and user behaviour changes (new queries, evolving expectations). The eval set needs to track both. We refresh the eval set quarterly at most customers: pull the most recent 90 days of production traces, identify queries that aren't represented in the existing eval set, and have the SMEs label a fresh batch. The refresh keeps the eval set's signal aligned with the current production distribution rather than the distribution at the time the system was first deployed. See /enterprise-rag for the broader RAG architecture and for more on the RAGAS framework.
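The "identify queries that aren't represented" step can start as a crude similarity check before SMEs review the candidates. The sketch below uses token-set Jaccard similarity purely because it needs no dependencies; embedding similarity or clustering would be a more realistic choice. The function names, the 0.5 threshold, and the example queries are all hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-cased word set of a query."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def uncovered_queries(production_queries: list[str], eval_queries: list[str],
                      threshold: float = 0.5) -> list[str]:
    """Production queries whose best match in the eval set is below `threshold`."""
    eval_token_sets = [tokens(query) for query in eval_queries]
    uncovered = []
    for query in production_queries:
        query_tokens = tokens(query)
        best = max((jaccard(query_tokens, e) for e in eval_token_sets), default=0.0)
        if best < threshold:
            uncovered.append(query)
    return uncovered


# Hypothetical eval set and recent production traces.
eval_queries = ["What is the excess on my home policy?", "How do I file a claim?"]
production_queries = [
    "what is the excess on my home policy",
    "Does my policy cover flood damage?",
    "how do i file a claim",
]

candidates = uncovered_queries(production_queries, eval_queries)
print(candidates)
```

The output is a candidate list for SME labelling, not a final selection; near-duplicates among the candidates would still need to be grouped before the labelling session.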

About the author

MindMap Engineering

MindMap Digital Engineering Practice

MindMap Engineering is the collective practice behind 117 production-deployed AI accelerators across BFSI, healthcare, government, retail and telecom. The pieces published here are written by the engineering leads who shipped the systems they describe — sovereign LLM platforms, RAG pipelines, agentic workflows, IDP systems — at customer sites across three continents. We don't write about architectures we haven't deployed.

