Free Tool · No signup required
Cloud LLM vs On-Prem AI — TCO Calculator
Drag the slider to your monthly token volume. Pick a cloud baseline. Pick an on-prem reference architecture. See where you sit on the curve. The cost model is what we use in scoping calls with regulated-industry customers — built on 50+ production deployment cost runs in Europe, the Gulf and South Asia.
Your workload
Monthly token volume: 5.0B (slider scale: 100M, 1B, 10B, 50B)
Cloud baseline: average across the three cloud options
2× H100 80GB · pgvector · vLLM · Llama 3.3 8B / 70B (quantised) · single-tenant
Capex: €95k amortised over 36 months · Capacity: up to 12.0B/month
The economics
Cloud — monthly
€11k
€132k / year
€2.20 / million tokens · linear with volume
On-prem — monthly
€5039
€60k / year
€1.01 / million tokens · amortised
On-prem saves
€5961 / month
Cloud costs 2.2× the on-prem figure: €72k saved annually, or €215k over 3 years.
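The arithmetic behind these figures can be reproduced in a few lines. The following is a minimal sketch that uses only the values shown above (5.0B tokens/month, €2.20 per million tokens on cloud, €5039 per month on-prem). The variable names are illustrative and are not taken from the calculator itself.

```python
# Inputs copied from the calculator output; variable names are illustrative.
tokens_per_month_m = 5_000      # 5.0B tokens, expressed in millions
cloud_rate_eur_per_m = 2.20     # EUR per million tokens (cloud)
onprem_monthly_eur = 5_039      # EUR per month (on-prem, amortised)

# Cloud spend is linear in volume.
cloud_monthly_eur = tokens_per_month_m * cloud_rate_eur_per_m

# Effective on-prem rate at this particular volume.
onprem_rate_eur_per_m = onprem_monthly_eur / tokens_per_month_m

# Savings and the cloud-to-on-prem cost ratio.
monthly_saving_eur = cloud_monthly_eur - onprem_monthly_eur
annual_saving_eur = monthly_saving_eur * 12
three_year_saving_eur = annual_saving_eur * 3
cost_ratio = cloud_monthly_eur / onprem_monthly_eur

print(f"Cloud monthly:     EUR {cloud_monthly_eur:,.0f}")
print(f"On-prem rate:      EUR {onprem_rate_eur_per_m:.2f} per million tokens")
print(f"Monthly saving:    EUR {monthly_saving_eur:,.0f}")
print(f"Annual saving:     EUR {annual_saving_eur / 1000:,.0f}k")
print(f"Three-year saving: EUR {three_year_saving_eur / 1000:,.0f}k")
print(f"Cloud / on-prem:   {cost_ratio:.1f}x")
```

Rounded the same way as the page, these reproduce the €11k cloud figure, the €1.01 on-prem rate, the €5961 monthly saving and the 2.2× ratio.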
Hardware capex is amortised straight-line over 36 months. Opex includes power (€0.18/kWh), cooling, rack space (€450/U/month) and a 0.15 FTE SRE allocation at €120k loaded. Frontier API prices reflect publicly listed rates as of June 2026; high-volume enterprise customers may negotiate lower rates.
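Two of these components can be worked out directly from the stated inputs: the amortised capex and the SRE allocation. The sketch below does that and treats whatever remains of the €5039 monthly figure as the combined power, cooling and rack-space share. That remainder is an inference, not a number the page states, and the split within it (server power draw, rack units occupied) is not given, so it is left as a single figure.

```python
# Stated inputs from the cost model.
capex_eur = 95_000
amortisation_months = 36
sre_fte = 0.15
sre_loaded_annual_eur = 120_000
onprem_monthly_eur = 5_039      # total monthly figure shown by the calculator

# Straight-line amortisation: equal capex share every month.
capex_monthly_eur = capex_eur / amortisation_months

# SRE allocation, assuming the loaded annual cost is spread evenly over 12 months.
sre_monthly_eur = sre_fte * sre_loaded_annual_eur / 12

# Inferred remainder: power + cooling + rack space (breakdown not stated).
facilities_monthly_eur = onprem_monthly_eur - capex_monthly_eur - sre_monthly_eur

print(f"Capex (amortised):       EUR {capex_monthly_eur:,.2f} / month")
print(f"SRE allocation:          EUR {sre_monthly_eur:,.2f} / month")
print(f"Power, cooling and rack: EUR {facilities_monthly_eur:,.2f} / month (inferred)")
```

On these inputs the amortised hardware is a little over half of the monthly total, and the SRE allocation is the next-largest share.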
How to read the result
- Under 200M tokens/month: cloud is usually still cheaper. The on-prem capex hasn't paid back, and the operational simplicity of the cloud API outweighs the per-token cost.
- 200M to 1B tokens/month: the cross-over zone. On-prem starts to win on pure economics; the regulatory and resilience arguments tip the rest.
- Above 1B tokens/month: on-prem is materially cheaper, and the gap widens with volume because cloud spend grows linearly with tokens while the largely fixed on-prem cost is spread over more of them.
- Above 5B tokens/month: the cost calculus is no longer the decision driver — the regulatory posture is. Sovereign architecture is the only viable path for most regulated-industry customers above this volume.
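Where the cross-over falls for a particular configuration depends on its monthly cost. One simple way to estimate it is to divide the on-prem monthly cost by the cloud per-million-token rate. The sketch below assumes the on-prem cost is flat regardless of volume, which is a simplification since power draw rises with load. The function names are hypothetical, and the example inputs are the €5039 and €2.20 figures from the 2× H100 example above.

```python
def breakeven_volume_m(onprem_monthly_eur: float, cloud_rate_eur_per_m: float) -> float:
    """Monthly volume (millions of tokens) at which cloud spend equals a flat on-prem cost."""
    if cloud_rate_eur_per_m <= 0:
        raise ValueError("cloud rate must be positive")
    return onprem_monthly_eur / cloud_rate_eur_per_m


def cheaper_option(volume_m: float, onprem_monthly_eur: float, cloud_rate_eur_per_m: float) -> str:
    """Return which side is cheaper at the given volume, on pure serving cost."""
    cloud_monthly_eur = volume_m * cloud_rate_eur_per_m
    return "cloud" if cloud_monthly_eur < onprem_monthly_eur else "on-prem"


# Example inputs: the 2x H100 configuration and the averaged cloud rate.
ONPREM_MONTHLY_EUR = 5_039.0
CLOUD_RATE_EUR_PER_M = 2.20

breakeven = breakeven_volume_m(ONPREM_MONTHLY_EUR, CLOUD_RATE_EUR_PER_M)
print(f"Break-even: about {breakeven:,.0f}M tokens/month")

for volume_m in (200, 1_000, 5_000):
    winner = cheaper_option(volume_m, ONPREM_MONTHLY_EUR, CLOUD_RATE_EUR_PER_M)
    print(f"{volume_m:>6,}M tokens/month -> {winner}")
```

With these inputs the break-even lands a little above 2B tokens/month. A leaner configuration with a lower monthly cost crosses over at a proportionally lower volume.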
Frequently asked questions
When is on-prem AI actually cheaper than cloud APIs?
On-prem becomes cheaper around 200M tokens per month for the lean single-rack architecture, and stays meaningfully cheaper above that point. The gap widens fast: at 5B tokens/month enterprises typically pay 5–8× more on cloud APIs than on equivalent on-prem capacity. The economics inverted in 2024–25 as cloud-API pricing flattened and GPU costs continued to fall.
What assumptions go into the on-prem cost model?
Hardware capex is amortised straight-line over 36 months. Operational expenses include power at €0.18/kWh, cooling, rack space at €450 per U per month, and a 0.15 FTE SRE allocation at €120k fully loaded. These reflect MindMap's median deployment cost across 50+ sovereign deployments in Europe, the Gulf and South Asia.
Why isn't cloud cheaper if it has the scale economics?
Cloud LLM vendors charge per-token at rates that recover frontier-model training cost across the customer base. Their scale economics show up in development capacity, not in marginal serving cost. Open-weights inference on commodity GPUs has fundamentally different unit economics: the per-token cost is electricity and amortised hardware, both of which are an order of magnitude lower than vendor-listed rates at enterprise volumes.
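To illustrate how a largely fixed cost turns into a falling per-token rate, the sketch below divides a flat monthly on-prem cost by the volume served. It uses the €5039 monthly figure and 12.0B/month capacity from the reference configuration earlier on this page, and assumes the monthly cost does not change with load (a simplification, since power use rises with utilisation). The function name is hypothetical.

```python
def onprem_rate_eur_per_m(volume_m: float, monthly_cost_eur: float, capacity_m: float) -> float:
    """Effective EUR per million tokens for a flat monthly cost, within capacity."""
    if not 0 < volume_m <= capacity_m:
        raise ValueError("volume must be positive and within capacity")
    return monthly_cost_eur / volume_m


# Figures from the reference configuration; flat monthly cost is an assumption.
MONTHLY_COST_EUR = 5_039.0
CAPACITY_M = 12_000.0           # 12.0B tokens/month, in millions
CLOUD_RATE_EUR_PER_M = 2.20     # averaged cloud rate, for comparison

for volume_m in (1_000.0, 5_000.0, 12_000.0):
    rate = onprem_rate_eur_per_m(volume_m, MONTHLY_COST_EUR, CAPACITY_M)
    print(f"{volume_m:>8,.0f}M tokens/month: EUR {rate:.2f} per million "
          f"(cloud: EUR {CLOUD_RATE_EUR_PER_M:.2f})")
```

The cloud rate stays constant across the rows, while the on-prem rate drops as the same hardware serves more tokens, down to its capacity limit.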
Does this include the cost of an integration partner like MindMap?
No. The numbers cover serving infrastructure only. A typical MindMap engagement adds 6–12 weeks of engineering at our standard rates plus the platform-licensing component, which is structured as a one-time charge plus maintenance rather than per-token. We share the full engagement economics on a scoping call, including the SRE pattern, the eval-harness build, the audit store, and the customer-side training programme.
What about cloud egress and other hidden costs?
We don't model them here. In practice they push cloud TCO meaningfully higher for high-volume deployments: egress to fetch RAG context, ingress for embeddings, observability bandwidth and audit-log storage. The headline numbers in this calculator therefore already favour cloud at the low end relative to what most enterprises actually experience in production.
Want the full cost model for your situation?
We'll walk through a full TCO + risk model on a 20-minute scoping call. No deck. Just the numbers.
Book a 20-min walkthrough →