Cloud LLM vs On-Prem AI — TCO Calculator

Drag the slider to your monthly token volume. Pick a cloud baseline. Pick an on-prem reference architecture. See where you sit on the curve. The cost model is what we use in scoping calls with regulated-industry customers — built on 50+ production deployment cost runs in Europe, the Gulf and South Asia.

Your workload

Monthly tokens served

5.0B

100M · 1B · 10B · 50B

Cloud LLM baseline

Average across the three above

On-prem reference architecture

2× H100 80GB · pgvector · vLLM · Llama 3.3 8B / 70B (quantised) · single-tenant

Capex: €95k amortised over 36 months · Capacity: up to 12.0B/month

The economics

Cloud — monthly

€11k

€132k / year

€2.20 / million tokens · linear with volume

On-prem — monthly

€5,039

€60k / year

€1.01 / million tokens · amortised

On-prem saves

€5,961 / month

Cloud costs 2.2× the on-prem figure at this volume: €72k saved annually, or €215k over 3 years.
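For readers who want to check the arithmetic, here is a minimal Python sketch of the figures above. The function and key names are illustrative and not part of the calculator. The inputs are the values shown on this page, and the on-prem monthly cost is taken as given rather than derived.

```python
# Inputs read off the page for the 5.0B tokens/month example.
MONTHLY_TOKENS_M = 5_000       # 5.0B tokens, expressed in millions of tokens
CLOUD_RATE_EUR_PER_M = 2.20    # averaged cloud baseline, EUR per million tokens
ONPREM_MONTHLY_EUR = 5_039     # amortised on-prem cost per month (taken as given)


def monthly_comparison(tokens_m, cloud_rate, onprem_monthly):
    """Return the headline cloud vs on-prem figures for one monthly volume."""
    cloud = tokens_m * cloud_rate          # cloud cost is linear in volume
    saving = cloud - onprem_monthly
    return {
        "cloud_monthly": cloud,
        "onprem_per_m": onprem_monthly / tokens_m,
        "saving_monthly": saving,
        "saving_annual": saving * 12,
        "saving_3y": saving * 36,
        "cloud_to_onprem_ratio": cloud / onprem_monthly,
    }


result = monthly_comparison(MONTHLY_TOKENS_M, CLOUD_RATE_EUR_PER_M, ONPREM_MONTHLY_EUR)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

The 2.2× figure is the ratio of cloud cost to on-prem cost. The saving itself is about 1.2× the on-prem cost.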

Hardware capex is amortised straight-line over 36 months. Opex includes power (0.18 €/kWh), cooling, rack space (€450/U/month) and a 0.15 FTE SRE allocation at €120k loaded. Frontier API prices reflect publicly listed rates as of June 2026; high-volume enterprise customers may negotiate lower rates.
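As a rough cross-check, the two components that can be computed directly from the stated assumptions (capex amortisation and the SRE allocation) can be subtracted from the €5,039 monthly figure. The sketch below assumes that figure is simply the sum of the listed components. The page does not state power draw or rack units, so the remainder is reported as a single residual rather than split further. Variable names are illustrative.

```python
# Stated assumptions from the page.
CAPEX_EUR = 95_000
AMORTISATION_MONTHS = 36
SRE_FTE = 0.15
SRE_LOADED_EUR_PER_YEAR = 120_000
ONPREM_MONTHLY_EUR = 5_039   # calculator output for the lean single-rack option

# Straight-line amortisation: equal capex share every month.
capex_monthly = CAPEX_EUR / AMORTISATION_MONTHS

# SRE allocation: fraction of a fully loaded annual salary, per month.
sre_monthly = SRE_FTE * SRE_LOADED_EUR_PER_YEAR / 12

# Whatever is left covers power, cooling and rack space combined
# (an inference under the sum-of-components assumption, not a published figure).
facilities_monthly = ONPREM_MONTHLY_EUR - capex_monthly - sre_monthly

print(f"capex:      {capex_monthly:,.2f} EUR/month")
print(f"SRE:        {sre_monthly:,.2f} EUR/month")
print(f"facilities: {facilities_monthly:,.2f} EUR/month")
```

On these inputs, amortised capex is a little over half of the monthly on-prem cost.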

How to read the result

Under ~1.5B tokens/month: cloud is usually cheaper on pure cost. The lean single-rack architecture carries roughly €5k/month in amortised capex and opex before it serves a single token, and the operational simplicity of a cloud API outweighs the per-token premium at this volume.
~1.5B to 4B tokens/month: the cross-over zone for a lean single-rack deployment. Exactly where it lands depends on your model mix: around 1.6B against Claude-class pricing, nearer 4B against Gemini-class. Within this zone, the regulatory and resilience arguments usually tip the decision.
Above ~4B tokens/month: on-prem wins on every cloud baseline, and the gap widens with volume because cloud is linear while on-prem amortises. The multi-tenant enterprise architecture crosses over later (roughly 5–13B depending on baseline) but scales to 55B tokens/month.
When sovereignty is mandated: the cost calculus is not the decision driver — the regulatory posture is. For SAMA, central-bank and air-gap environments, sovereign architecture is the only viable path at any volume; this calculator just tells you what it costs.
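The cross-over points above follow from a simple break-even calculation, sketched here in Python. It assumes the on-prem monthly cost stays flat at roughly €5,039 up to the capacity ceiling (a simplification, since power would vary somewhat with load) and that cloud cost is strictly linear in volume. The per-million-token rates for the Claude-class and Gemini-class baselines are back-calculated from the cross-over points quoted on this page. They are not vendor-quoted prices, and the function names are illustrative.

```python
ONPREM_FIXED_EUR = 5_039   # assumed flat monthly on-prem cost (lean single-rack)


def breakeven_tokens_m(fixed_monthly_eur, cloud_rate_eur_per_m):
    """Monthly volume (millions of tokens) at which cloud cost equals on-prem cost."""
    return fixed_monthly_eur / cloud_rate_eur_per_m


def implied_rate_eur_per_m(fixed_monthly_eur, breakeven_m):
    """Cloud rate implied by a quoted cross-over volume (inverse of the above)."""
    return fixed_monthly_eur / breakeven_m


# Cross-over against the averaged baseline of 2.20 EUR per million tokens.
avg_breakeven = breakeven_tokens_m(ONPREM_FIXED_EUR, 2.20)
print(f"average baseline: {avg_breakeven / 1000:.2f}B tokens/month")

# Rates implied by the quoted cross-overs (back-calculated, not list prices).
for label, crossover_m in {"Claude-class": 1_600, "Gemini-class": 4_000}.items():
    rate = implied_rate_eur_per_m(ONPREM_FIXED_EUR, crossover_m)
    print(f"{label}: ~{rate:.2f} EUR per million tokens")
```

Against the averaged baseline the break-even lands at about 2.3B tokens/month, inside the 1.5B to 4B zone described above.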

Frequently asked questions

When is on-prem AI actually cheaper than cloud APIs?

For the lean single-rack architecture the cross-over sits between roughly 1.5B and 4B tokens per month, depending on which cloud baseline you price against: about 1.6B versus Claude-class pricing, nearer 4B versus Gemini-class. The gap widens fast above that: at the architecture's 12B tokens/month capacity ceiling, cloud APIs cost roughly 3–8× as much as the amortised on-prem stack. The economics shifted in 2024–25 as open-weights model quality improved and GPU costs continued to fall.
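The multiple at the capacity ceiling can be sanity-checked without knowing any vendor price. If cloud cost is p·V and on-prem cost is a flat F up to capacity, the cross-over is V* = F/p, so the cloud-to-on-prem ratio at any volume V is p·V/F = V/V*. The Python sketch below applies that to the two cross-overs quoted in this answer. The flat-cost assumption and the names used are ours, for illustration only.

```python
CAPACITY_B = 12.0   # lean single-rack capacity ceiling, billions of tokens/month

# Cross-over volumes quoted above, in billions of tokens/month.
CROSSOVERS_B = {"Claude-class": 1.6, "Gemini-class": 4.0}


def cost_ratio(volume_b, crossover_b):
    """Cloud cost divided by on-prem cost, assuming flat on-prem cost up to capacity."""
    return volume_b / crossover_b


ratios = {name: cost_ratio(CAPACITY_B, x) for name, x in CROSSOVERS_B.items()}
for name, ratio in ratios.items():
    print(f"{name}: cloud is ~{ratio:.1f}x on-prem at {CAPACITY_B:.0f}B tokens/month")
```

Under these assumptions the ratio runs from about 3× (Gemini-class) to about 7.5× (Claude-class), consistent with the range quoted.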

What assumptions go into the on-prem cost model?

Hardware capex amortised straight-line over 36 months. Operational expenses include power at 0.18 €/kWh, cooling, rack space at €450 per U per month, and a 0.15 FTE SRE allocation at €120k fully loaded. These reflect MindMap's median deployment cost across 50+ sovereign deployments in Europe, the Gulf and South Asia.

Why isn't cloud cheaper if it has the scale economics?

Cloud LLM vendors charge per token at rates that recover frontier-model training cost across the customer base. Their scale economics show up in development capacity, not in marginal serving cost. Open-weights inference on commodity GPUs has fundamentally different unit economics: the per-token cost is electricity plus amortised hardware, both of which are an order of magnitude lower than vendor-listed rates at enterprise volumes.

Does this include the cost of an integration partner like MindMap?

No. The numbers cover serving infrastructure only. A typical MindMap engagement adds 6–12 weeks of engineering at our standard rates plus the platform-licensing component, which is structured as a one-time fee plus maintenance rather than per-token pricing. We share the full engagement economics on a scoping call, including the SRE pattern, the eval-harness build, the audit store, and the customer-side training programme.

What about cloud egress and other hidden costs?

We don't model them here. In practice they push cloud TCO meaningfully higher for high-volume deployments: egress to fetch RAG context, ingress for embeddings, observability bandwidth, audit-log storage. The headline cloud numbers in this calculator therefore sit at the low end of what most enterprises actually experience in production.

Want the full cost model for your situation?

We'll walk through a full TCO + risk model on a 20-minute scoping call. No deck. Just the numbers.

Book a 20-min walkthrough →