RAG on Your Own Servers: Architecture Patterns for Air-Gapped Enterprises
Building a RAG system inside a regulated bank or hospital is a different sport. The cloud tutorials don't translate, and the failure modes are subtle enough that smart teams ship broken systems and don't notice. Here are the patterns we have refined across more than 20 air-gapped deployments, covering vector databases, embedding models, chunking and evaluation.
Building a Retrieval-Augmented Generation system is conceptually simple: store documents in a vector database, embed queries, retrieve relevant chunks, pass to an LLM. In an air-gapped enterprise, every step requires a local alternative to the cloud services most RAG tutorials assume, and the failure modes are subtle enough that we've watched four sophisticated engineering teams ship RAG systems that retrieved the wrong document 30% of the time and didn't know it. The patterns below come from twenty-plus production deployments in regulated environments where the LLM cannot phone home, the documents are confidential, and the answer 'we'll figure out evaluation later' isn't acceptable to the audit committee. None of this is research. It's what survives contact with a 14,000-document policy corpus and 200 daily users.
Pattern 1: Embedding model selection is not a vibe check
Cloud RAG tutorials default to OpenAI's text-embedding-3-small or ada-002. In an air-gapped environment you need a locally deployable alternative, and 'pick the highest-ranked open model on MTEB' is wrong advice. MTEB rankings are dominated by English-centric benchmarks that don't predict performance on domain-specific corpora in the languages your customer actually uses. We've standardised on nomic-embed-text (768 dims, 8K context) for English-primary deployments where the corpus is general business prose; BGE-M3 for multilingual deployments including Arabic and Swahili, because its training data coverage in those languages is genuinely better than the alternatives; and multilingual-e5-large for our Indian-language deployments, where embedding quality on Hindi, Tamil, and Bengali noticeably exceeds the more generic models. Run them via Ollama for simplicity at low traffic, or via sentence-transformers under FastAPI with a small GPU when throughput matters. Critically, never mix embedding models between indexing and query time. Vectors from different models live in different spaces, so similarity scores between them are meaningless. If the dimensions happen to match, nothing errors out; retrieval simply returns the wrong chunks, and the failure looks like 'the LLM hallucinates'. We've seen this in two customer deployments where an upgrade to a 'better' embedding model was applied at query time without re-indexing the corpus, and retrieval quality collapsed for two weeks before anyone diagnosed the root cause.
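One cheap defence against the indexing/query mismatch described above is to store the embedding model's name alongside the index and refuse any vector that claims a different origin. The following is an illustrative sketch, not production code: `VectorIndex`, `add` and `check_query` are hypothetical names, and a real deployment would persist the model name in the vector store's own metadata.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which embedding model built it."""

    model_name: str
    dims: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def add(self, chunk_id: str, vector: list[float], model_name: str) -> None:
        """Store a chunk vector, but only if it came from the index's model."""
        self._check(vector, model_name)
        self.vectors[chunk_id] = vector

    def check_query(self, vector: list[float], model_name: str) -> None:
        """Raise before searching if the query was embedded by another model."""
        self._check(vector, model_name)

    def _check(self, vector: list[float], model_name: str) -> None:
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r} but the vector "
                f"came from {model_name!r}; re-index before switching models"
            )
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")
```

The name check matters more than the dimension check: two different models with the same output size would pass a dimensions-only test and still return meaningless neighbours.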
Pattern 2: Vector database selection by scale
For air-gapped deployments, we've evaluated Milvus, Weaviate, Qdrant, and pgvector at production scale. Our recommendation: pgvector for deployments under 10 million chunks (having one less system to back up, monitor, and integrate with existing Postgres tooling is a massive operational simplification, and pgvector's HNSW indexing on Postgres 16 and 17 is genuinely good); Qdrant for 10M to 100M chunks where you need payload filtering and fast snapshot/restore for compliance backup workflows; Milvus when you're beyond 100M chunks or need GPU-accelerated indexing for very high write throughput. Avoid Pinecone, Weaviate Cloud, Chroma Cloud, and any other hosted-only option. Your air-gap requirement makes them non-starters, and even for non-regulated workloads the lock-in risk is meaningful. Weaviate self-hosted is fine, but its operational story is heavier than Qdrant's for the same workload, and the multi-tenant abstractions are over-engineered for a single-customer deployment. The performance differences between these databases at typical enterprise scale are smaller than the operational differences; pick on operability, not benchmarks.
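The sizing rule above is simple enough to write down as a function, which is handy in capacity-planning scripts. This is only a sketch of the thresholds stated in the text; `pick_vector_store` is a hypothetical helper, and the exact boundary handling (100M itself mapping to Qdrant) is an assumption.

```python
def pick_vector_store(chunk_count: int, needs_gpu_indexing: bool = False) -> str:
    """Map corpus size (in chunks) to the vector store recommended above."""
    if chunk_count < 0:
        raise ValueError("chunk_count must be non-negative")
    # GPU-accelerated indexing or very large corpora point to Milvus.
    if needs_gpu_indexing or chunk_count > 100_000_000:
        return "milvus"
    # Mid-range corpora: payload filtering and snapshot/restore favour Qdrant.
    if chunk_count >= 10_000_000:
        return "qdrant"
    # Below 10M chunks, staying inside Postgres is the simplest option.
    return "pgvector"
```

For example, `pick_vector_store(2_000_000)` returns `"pgvector"`, while the same corpus with `needs_gpu_indexing=True` returns `"milvus"`.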
Pattern 3: Chunking strategy matters more than the model
The single largest delta in RAG quality across our deployments comes from chunking strategy, not embedding or LLM choice. We've moved from fixed-size chunking (the 1000 characters with 200 overlap common in LangChain tutorials, which is fine for a demo and terrible for production) to semantic chunking. Split on sentence boundaries using a small spaCy model, merge short chunks until you hit a target token count, and use parent-document retrieval so the LLM sees the surrounding context that the chunk was indexed within. This single change improved retrieval precision by 23% on our financial-services policy QA benchmark and 31% on a clinical-coding benchmark for a US insurer. For structured documents (regulations, policies, technical manuals), a section-aware chunker that respects document headings does better still, typically a further 8-15% lift on questions whose answer lives in a specific section of a hierarchical document. Match the chunker to the document type; one chunking strategy across a heterogeneous corpus underperforms a small number of type-specific strategies dispatched by a document classifier.
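To make the sentence-merge idea concrete, here is a minimal sketch under stated assumptions. It swaps the spaCy sentence splitter for a naive regex so the example has no dependencies, approximates token counts with whitespace-separated words, and uses hypothetical names (`split_sentences`, `chunk_document`, the `parent_id` field) throughout.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive stand-in for a spaCy sentencizer: split after ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_document(doc_id: str, text: str, target_tokens: int = 200) -> list[dict]:
    """Merge consecutive sentences until a chunk would exceed target_tokens.

    Each chunk records its parent document so that, at query time, the
    caller can fetch the surrounding context (parent-document retrieval).
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())  # crude token estimate: word count
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return [
        {"chunk_id": f"{doc_id}#{i}", "parent_id": doc_id, "text": chunk}
        for i, chunk in enumerate(chunks)
    ]
```

Sentences are never cut in half, so a single sentence longer than the target becomes its own oversized chunk. Swapping in a real tokenizer and sentence splitter changes only the two helper lines marked as approximations.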
Pattern 4: Hybrid search is not optional
Pure dense retrieval fails on a class of queries it can't see coming: queries containing rare entities (drug names, regulation IDs, ticket numbers, customer codes) where the embedding model has never seen the token and produces a useless vector. The fix is hybrid search. Run BM25 and dense retrieval in parallel, fuse the results with Reciprocal Rank Fusion (k=60 is a good default), then re-rank the top 50 with a cross-encoder like bge-reranker-v2-m3. On our internal evals, hybrid + rerank lifted answer-correct rate from 71% to 89% on a corpus that included a lot of legal citations. Postgres gets you most of the way in one query: pgvector for the dense side and built-in full-text search (tsvector with ts_rank) for the lexical side. Note that ts_rank is not BM25; true BM25 scoring in Postgres requires an extension.
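Reciprocal Rank Fusion is short enough to show in full. Each document scores the sum of 1/(k + rank) over every ranking it appears in, so it rewards documents that rank well in either list without needing the raw BM25 and cosine scores to be comparable. The function name and the tie-break on document ID are illustrative choices in this sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 50
) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep output deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


bm25_hits = ["reg-17", "a", "b"]
dense_hits = ["a", "c", "reg-17"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["a", "reg-17", "c", "b"]
```

Here "reg-17" stands in for a rare identifier that the lexical side ranks first and the dense side ranks last; fusion keeps it near the top, where the cross-encoder can then judge it properly.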
Pattern 5: Inference serving and GPU economics
For the LLM itself, vLLM gives the best throughput for batch workloads thanks to PagedAttention and continuous batching, typically 3-5x what you get from a naive transformers.generate loop. Ollama is simpler for lower-traffic internal tools and developer-experience scenarios. We run Llama 3.1 8B Instruct on a single A100 80GB for most enterprise workloads, achieving 60+ tokens/sec/request at batch size 16. The 70B models require 2x H100s and are worth it only for complex multi-step reasoning, agentic workflows, or when you need to fit a long-context evaluation in one pass. For most enterprise Q&A, 8B + good retrieval beats 70B + mediocre retrieval, every time.
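The GPU counts above follow from back-of-envelope arithmetic on weight memory. The sketch below assumes FP16/BF16 weights (2 bytes per parameter), 80 GB cards, and a 10% reserve per GPU; all three are illustrative assumptions, and the estimate ignores KV cache and activations, so treat the result as a lower bound.

```python
import math


def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, in GB (1 GB taken as 1e9 bytes)."""
    return params_billions * bytes_per_param


def min_gpus(
    params_billions: float,
    gpu_memory_gb: float = 80.0,
    bytes_per_param: int = 2,
    reserve_fraction: float = 0.1,
) -> int:
    """Smallest GPU count whose usable memory holds the weights."""
    usable_per_gpu = gpu_memory_gb * (1.0 - reserve_fraction)
    needed = weight_memory_gb(params_billions, bytes_per_param)
    return max(1, math.ceil(needed / usable_per_gpu))
```

Under these assumptions an 8B model needs about 16 GB of weights and fits on one 80 GB card with plenty of room left for KV cache, while a 70B model needs about 140 GB and therefore at least two.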
Pattern 6: Evaluation, and the failures it catches
RAG systems degrade silently. The retriever returns plausible-looking but wrong chunks; the LLM confidently summarises them; the user gets a wrong answer with confident citations. The only defence is evaluation infrastructure, and the discipline to run it on every change. We use RAGAS for the basics (context precision, context recall, faithfulness, answer relevance), but the bigger win is a custom eval set of 200-500 questions written by SMEs from each customer and scored by a stronger LLM as judge (Claude Sonnet 4.5 via a local proxy with PII redaction for non-regulated portions of the eval or, for the strictest deployments, a self-hosted Llama 3.3 70B). Run the eval on every model update, every chunking change, every prompt change. Each run produces a scorecard the SMEs can review; regressions block deployment. We've caught regressions that would have shipped silently in 11 of our last 14 deployments. Sometimes the regression is the model upgrade itself (a 'better' model that performs worse on the customer's specific corpus); sometimes it's a chunking experiment that helped on benchmark queries but hurt on the long tail.
The eval suite is also where our biggest production lessons live. Three we paid for. First: we initially trusted PDF parsers and shipped systems where the retrieval was correct but the chunk text was garbled because PyPDF2 lost table structure. The fix is to switch to Docling, Unstructured's hi_res partitioning strategy, or pdfplumber with explicit table extraction. Second: we under-invested in metadata filtering for the first year, putting everything in one big collection and relying on the retriever to figure it out. The fix is to filter by document type, jurisdiction, recency, and access permissions before vector search. Third: we let users ask multi-hop questions that no single retrieval can answer, and watched faithfulness scores collapse. The fix is query decomposition: use the LLM to break a complex question into 2-4 sub-queries, retrieve for each, then synthesise.
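The "regressions block deployment" rule can be enforced mechanically by comparing each new scorecard with the last accepted one. This is a minimal sketch: the function names, the metric keys, and the 0.01 tolerance are hypothetical, and the scores are assumed to be fractions between 0 and 1 where higher is better.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return each metric that dropped by more than tolerance, with the drop."""
    regressions: dict[str, float] = {}
    for metric, base_score in baseline.items():
        # A metric missing from the candidate run is scored as 0.0, so a
        # silently skipped evaluation fails the gate instead of passing it.
        drop = base_score - candidate.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


def deploy_allowed(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """True only when no tracked metric regressed beyond the tolerance."""
    return not find_regressions(baseline, candidate, tolerance)
```

Wired into CI, a call such as `deploy_allowed(last_accepted, new_run)` becomes the gate, and the dictionary from `find_regressions` is what goes on the scorecard for SME review.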
Pattern 7: Re-ranking and caching as the production-readiness layer
Two patterns separate prototypes from production. First, re-ranking: after hybrid retrieval surfaces the top 50 candidates, run them through a cross-encoder re-ranker before sending the top 5 to the LLM. We default to bge-reranker-v2-m3 (568M params, runs acceptably on CPU for low-traffic deployments, on a small GPU for production). The latency cost is 80-150ms; the accuracy lift on long documents is enormous because the re-ranker actually attends to query-document interaction in a way bi-encoders cannot. Skip this step and you've left 8-15 points of answer accuracy on the floor. The argument for skipping it is 'we want lower latency', but the user's tolerance is typically 3-5 seconds for a substantive answer, and 150ms is 3-5% of that. Second, a caching layer that actually understands the workload: most enterprise RAG queries cluster around a small number of frequently asked topics, and a smart cache works at three levels. The first is exact match on (query, retrieval policy) tuples. The second is semantic: a query within 0.05 cosine distance of a cached one reuses the prior answer if it's recent. The third is retrieved context: reuse applies where the same document set has been retrieved for similar queries. Across our deployments, well-tuned caching reduces GPU load by 35-60% without measurably affecting answer quality.
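The first two cache levels can be sketched in a few lines. This is an illustration under assumptions, not a production design: `AnswerCache` and its methods are hypothetical names, the one-hour TTL stands in for "recent", the semantic lookup is a linear scan rather than a vector index, and the third (retrieved-context) level is omitted.

```python
import math
import time
from typing import Optional


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


class AnswerCache:
    def __init__(self, max_distance: float = 0.05, ttl_seconds: float = 3600.0):
        self.max_distance = max_distance
        self.ttl_seconds = ttl_seconds
        self._exact: dict[tuple[str, str], str] = {}
        self._semantic: list[tuple[str, list[float], str, float]] = []

    def put(self, query: str, policy: str, embedding: list[float],
            answer: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._exact[(query, policy)] = answer
        self._semantic.append((policy, embedding, answer, now))

    def get(self, query: str, policy: str, embedding: list[float],
            now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Level 1: exact match on the (query, retrieval policy) tuple.
        if (query, policy) in self._exact:
            return self._exact[(query, policy)]
        # Level 2: nearest recent entry under the same policy, if close enough.
        best_answer, best_distance = None, self.max_distance
        for cached_policy, cached_emb, answer, stored_at in self._semantic:
            if cached_policy != policy or now - stored_at > self.ttl_seconds:
                continue
            distance = cosine_distance(embedding, cached_emb)
            if distance <= best_distance:
                best_answer, best_distance = answer, distance
        return best_answer
```

Keying both levels on the retrieval policy matters: an answer computed under one set of access filters must never be served to a query running under another.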
What to do on Monday
If you're starting a RAG project, the order of operations matters. Week 1: build your eval set with SMEs, not after the fact but before you've trained your intuition to defend whatever you've built. Week 2: stand up the simplest possible pipeline (pgvector, nomic-embed-text, Llama 3.1 8B via Ollama, basic chunking). Week 3: measure against your eval. Weeks 4-6: iterate on chunking, hybrid search, and re-ranking, in that order. Don't fine-tune the LLM until you've exhausted retrieval improvements. Eighty percent of RAG quality lives in retrieval, not generation. The teams that go straight to fine-tuning the LLM as the first quality lever almost always discover, six weeks and a lot of GPU hours later, that they could have got the same lift from better chunking and a re-ranker. Order matters because the cheap interventions are also the most impactful.
MindMap Digital helps enterprises across Africa, the Middle East, and UK deploy AI, automation, and analytics at scale.