What It Actually Costs to Run Llama 3.3 70B On-Prem

Cloud LLM pricing has dropped twice in 2026. On-prem costs dropped further. Below ~200M tokens per month the cloud is still cheaper; above 5B per month, where every regulated-industry customer we deploy sits, on-prem is genuinely an order of magnitude below the API. Here's the per-token model with the numbers behind it.

MindMap Engineering

MindMap Digital

In Frankfurt three weeks ago, the CFO of a large EU insurer asked me a question I now hear at least twice a quarter: "At what monthly token volume does it actually become cheaper to run our own infrastructure, given the cloud pricing dropped 30% in February?" Her team had built a model that suggested cloud was always cheaper; mine had built one that suggested on-prem won above 1B tokens/month. We spent two hours reconciling the assumptions. The disagreement was almost entirely in what each model counted as "cost": hers excluded a number of items mine included, and vice versa. By the time we converged on a shared framework, the answer was uncontroversial: at her projected 18-month volume of 3.2B tokens/month, on-prem at her existing data centre rates was roughly 3.5× cheaper than the cloud baseline she'd been pricing against.

Cloud LLM pricing has dropped twice in 2026: first in February, when Anthropic and OpenAI cut frontier-tier per-token pricing by roughly 30%, then in May, when Google followed. The on-prem economics moved in the same direction: open-weights model quality improved, GPU prices on the secondary market softened as enterprise data centres absorbed the H100/H200 capacity that hyperscalers had over-ordered, and the inference servers (vLLM, SGLang, TensorRT-LLM) got materially more throughput-efficient.

The crossover where on-prem becomes cheaper than the cloud API, the point our customers ask us to model on every engagement, has moved. The headline: below 200M tokens per month the cloud is cheaper for most reasonable scenarios; above 5B tokens per month, on-prem is genuinely an order of magnitude cheaper, and that's the volume bracket every regulated-industry customer we deploy sits in.

The model: what "on-prem cost per million tokens" actually includes

Most published cost comparisons for on-prem LLM serving are either underestimates (counting only GPU amortisation) or overestimates (counting full-burden data centre construction). The accurate model for a customer running inference inside an existing data centre includes: GPU hardware amortised over 36 months at a 12% blended cost of capital; power at the customer's actual data-centre rate ($0.08–$0.15 per kWh in Europe and the Gulf, higher in India, lower in some US locations); cooling at the customer's PUE (typically 1.3–1.6); rack space at the customer's internal cost ($200–$500/U/month); and 0.15 of an SRE FTE per cluster (an allocation of $25K–$45K/year, fully loaded). The model deliberately excludes data centre construction (already paid), network egress (zero on-prem), licensing (free for open-weights), and integration cost (one-off, paid by the engagement). The numbers below assume European-tier electricity, a 1.4 PUE, $300/U/month rack space, and a $35K/year SRE allocation.
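The components above can be sketched as a small function. This is a minimal sketch rather than the downloadable model: it reads "amortised over 36 months at a 12% cost of capital" as a level annuity payment, which is our assumption, and the hardware price, server power draw and rack units are hypothetical inputs the reader has to supply, since the text does not state them.

```python
def monthly_capital_charge(price_usd, annual_rate, months):
    """Level monthly payment that recovers price_usd over `months` at `annual_rate`.

    Standard annuity formula; reading "amortised at a cost of capital" this way
    is an assumption, not something the article specifies.
    """
    monthly_rate = annual_rate / 12.0
    if monthly_rate == 0:
        return price_usd / months
    return price_usd * monthly_rate / (1.0 - (1.0 + monthly_rate) ** -months)


def on_prem_cost_per_million(
    tokens_per_month,
    hardware_price_usd,      # hypothetical input: your own purchase quote
    it_power_kw,             # hypothetical input: measured server draw in kW
    rack_units,              # hypothetical input: U occupied by the cluster
    power_usd_per_kwh,       # the article quotes $0.08-$0.15 for Europe and the Gulf
    pue=1.4,
    rack_usd_per_u_month=300.0,
    sre_usd_per_year=35_000.0,
    cost_of_capital=0.12,
    amortisation_months=36,
    hours_per_month=730.0,   # assumption: average hours in a month
):
    """Return USD per million tokens for one cluster in an existing data centre."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    hardware = monthly_capital_charge(
        hardware_price_usd, cost_of_capital, amortisation_months
    )
    # PUE scales IT power up to total facility power, which covers cooling.
    energy = it_power_kw * pue * hours_per_month * power_usd_per_kwh
    rack = rack_units * rack_usd_per_u_month
    sre = sre_usd_per_year / 12.0
    return (hardware + energy + rack + sre) / (tokens_per_month / 1e6)
```

Every term except the token volume is a fixed monthly charge, which is why the per-token figure is so sensitive to utilisation. Construction, egress, licensing and integration are left out on purpose, matching the exclusions listed above.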

Three model tiers, three usage volumes

Model tiers: small (Llama 3.1 8B or Mistral 7B, single A100 80GB), medium (Llama 3.3 70B-Instruct-Quantized, 2× H100 80GB), large (Qwen 2.5 72B or DeepSeek V3 fp8, 4× H100 80GB).

Volume tiers: low (50M tokens/month, a few teams using a chatbot internally), medium (1B tokens/month, a production application serving dozens of concurrent users), high (10B+ tokens/month, a customer-facing application or a multi-app platform).

Cloud comparison: GPT-4.5-class pricing at $1.50/$6.00 per million input/output tokens (blended ~$3.20), Claude 3.7 Sonnet at $3/$15 (blended ~$8), Gemini 2.5 Pro at $1.25/$5 (blended ~$2.80). Cloud-hosted open-weights models run $0.70–$1.80 per million tokens depending on model size. That is the most realistic cloud comparison for regulated workloads where the customer has decided not to send data to OpenAI or Anthropic.
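A blended price is a weighted average of the input and output prices. The sketch below shows the arithmetic; the 40% output share is our assumption (the text does not state the mix), chosen because it lands within roughly $0.20 of each blended figure quoted above.

```python
def blended_price(input_usd_per_m, output_usd_per_m, output_share):
    """Weighted average price per million tokens for a given output-token share."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_usd_per_m * (1.0 - output_share) + output_usd_per_m * output_share


OUTPUT_SHARE = 0.40  # assumption: 40% of billed tokens are output tokens

# (input, output) list prices per million tokens, as quoted in the text
quotes = {
    "GPT-4.5-class": (1.50, 6.00),
    "Claude 3.7 Sonnet": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 5.00),
}

for name, (price_in, price_out) in quotes.items():
    price = blended_price(price_in, price_out, OUTPUT_SHARE)
    print(f"{name}: ${price:.2f} per million tokens (blended)")
```

A workload that is mostly long prompts with short answers will blend lower than this, and a generation-heavy workload will blend higher, so it is worth measuring the real mix before pricing against any API.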

Low volume — 50M tokens/month — cloud wins

A single A100 80GB serving the small-tier 8B model can handle roughly 250M tokens/month at the latency the customer is likely to accept (under 600ms time-to-first-token, sustained throughput of 60+ tokens/second). At 50M tokens/month utilisation is 20%, and the per-token cost from the model above is roughly $0.65/million tokens, because the fixed costs dominate. Cloud-hosted open-weights at the same model class is around $0.70/million tokens at this volume. The two are within a few cents of each other, and the operational simplicity of not running your own GPU is usually decisive, so cloud wins. The exception: regulated workloads where the cloud option is closed regardless.
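Why fixed costs dominate at low volume can be shown with two lines of arithmetic. The sketch below assumes the monthly cost of the GPU is fixed, so cost per million tokens scales inversely with utilisation; the function names are ours, for illustration only.

```python
def utilisation(tokens_per_month, capacity_tokens_per_month):
    """Fraction of the GPU's monthly token capacity actually used."""
    return tokens_per_month / capacity_tokens_per_month


def cost_at_utilisation(ref_cost_per_m, ref_utilisation, new_utilisation):
    """Rescale a per-million-token cost to another utilisation level.

    Assumes the monthly cost is fixed, so cost per token is inversely
    proportional to how many tokens the hardware serves.
    """
    return ref_cost_per_m * ref_utilisation / new_utilisation


low = utilisation(50e6, 250e6)  # 50M tokens on ~250M tokens/month of capacity
print(f"utilisation at 50M tokens/month: {low:.0%}")

for target in (0.2, 0.4, 0.8):
    cost = cost_at_utilisation(0.65, low, target)
    print(f"{target:.0%} utilisation: ${cost:.4f} per million tokens")
```

Under that assumption the $0.65 figure at 20% utilisation falls to about $0.16 at 80%, which is the figure used for the medium-volume case.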

Medium volume — 1B tokens/month — on-prem becomes competitive

At 1B tokens/month the same A100-and-8B configuration has to scale out: at roughly 250M tokens/month per GPU, that means five A100s running at 80% utilisation, the operational sweet spot. Per-token cost drops to roughly $0.16/million tokens. Cloud-hosted open-weights at this volume is around $0.55/million tokens after standard enterprise discounts, so on-prem is roughly 3× cheaper. If the customer needs the larger 70B-class model (which is the case more often than the 8B-class for the long-tail complex queries our customers care about), the on-prem cost on 2× H100 quantised is roughly $0.30/million tokens at 1B tokens/month, versus cloud-hosted Llama 3.3 70B at around $0.95/million tokens: still roughly 3× cheaper. The crossover point, where on-prem beats cloud at this model class, is roughly 200–300M tokens/month.
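The crossover can be approximated from the 70B-class figures alone. This is a rough sketch under two assumptions of ours: the same 2× H100 cluster serves every volume in the range, so its on-prem cost per token scales inversely with volume, and the cloud price is flat per token. The $1.20 and $1.50 cloud prices are hypothetical undiscounted rates inside the $0.70–$1.80 hosted range quoted earlier.

```python
def break_even_volume(ref_volume, ref_on_prem_cost_per_m, cloud_price_per_m):
    """Monthly token volume at which on-prem and cloud cost the same per token.

    On-prem cost per token is modelled as ref_cost * ref_volume / volume
    (fixed monthly cost), and set equal to the flat cloud price.
    """
    return ref_volume * ref_on_prem_cost_per_m / cloud_price_per_m


REF_VOLUME = 1e9        # 1B tokens/month, from the text
ON_PREM_AT_REF = 0.30   # USD per million tokens on 2x H100, from the text

# 0.95 is the discounted price from the text; 1.20 and 1.50 are hypothetical
for cloud_price in (0.95, 1.20, 1.50):
    volume = break_even_volume(REF_VOLUME, ON_PREM_AT_REF, cloud_price)
    print(f"cloud at ${cloud_price:.2f}/M: break-even near {volume / 1e6:.0f}M tokens/month")
```

At the discounted $0.95 rate the break-even comes out a little above 300M tokens/month; customers at lower volume usually pay closer to list, which pulls the break-even down towards the 200–300M range.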

High volume — 10B+ tokens/month — on-prem is an order of magnitude cheaper

At 10B tokens/month, the on-prem model approaches its fully-amortised floor. A 4× H100 cluster serving Qwen 2.5 72B at high utilisation yields a per-token cost in the $0.08–$0.12/million range. Cloud-hosted equivalents at this volume are $0.40–$0.70/million tokens depending on contract terms. The frontier closed-source APIs at this volume are typically $2.50–$6.00/million tokens blended. On-prem is roughly 5× cheaper than hosted open-weights, and 25–50× cheaper than frontier closed-source. At enterprise volume, which for our regulated-industry customers is typically 5–50B tokens/month per major application, the cost difference funds the engineering team running the platform several times over.
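Because every figure here is a range, the multiples are ranges too. A short sketch of the envelope, using only the prices quoted above (the helper name is ours):

```python
def ratio_envelope(on_prem_range, cloud_range):
    """Smallest and largest cloud/on-prem price ratio implied by two ranges."""
    on_prem_low, on_prem_high = on_prem_range
    cloud_low, cloud_high = cloud_range
    # Worst case for on-prem: cheapest cloud against most expensive on-prem.
    smallest = cloud_low / on_prem_high
    # Best case for on-prem: most expensive cloud against cheapest on-prem.
    largest = cloud_high / on_prem_low
    return smallest, largest


ON_PREM = (0.08, 0.12)   # USD per million tokens, 4x H100 at high utilisation
HOSTED = (0.40, 0.70)    # cloud-hosted open-weights
FRONTIER = (2.50, 6.00)  # frontier closed-source, blended

for label, cloud in (("hosted open-weights", HOSTED), ("frontier closed-source", FRONTIER)):
    lo, hi = ratio_envelope(ON_PREM, cloud)
    print(f"{label}: on-prem is {lo:.1f}x to {hi:.1f}x cheaper")
```

The "roughly 5×" and "25–50×" figures sit inside these envelopes; where a given customer lands depends on contract terms and on how close to the utilisation floor the cluster runs.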

Where on-prem economics quietly improved in 2026

Three under-reported shifts moved the math in the last 12 months. First, the secondary market for H100/H200 GPUs softened materially in Q1 2026 as hyperscalers overshot demand. Enterprise list prices held, but the deal-spot price for clean fleet H100 80GB units dropped roughly 18% from the early-2025 peak. Second, vLLM 0.7 and SGLang shipped continuous batching and KV-cache optimisations that increased the practical throughput of H100-served 70B-class models by 30–45% at the latency budgets our customers care about. Third, the quality gap between the best open-weights and the best closed-source models narrowed further. Qwen 2.5 72B and DeepSeek V3 are now within 2 percentage points of GPT-4.5 on most enterprise eval suites we run.
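A throughput gain does not translate one-for-one into a cost reduction. As a rough illustration, assuming the cluster's monthly cost is unchanged and the extra capacity is actually used, a gain of g lowers cost per token by 1 - 1/(1 + g):

```python
def cost_reduction_from_throughput_gain(gain):
    """Fractional drop in cost per token when throughput rises by `gain`.

    Assumes a fixed monthly cost and that the added capacity is fully used.
    """
    if gain <= -1.0:
        raise ValueError("gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + gain)


for gain in (0.30, 0.45):
    drop = cost_reduction_from_throughput_gain(gain)
    print(f"+{gain:.0%} throughput: about {drop:.0%} lower cost per token")
```

So the 30–45% throughput improvement is worth roughly a 23–31% lower per-token cost on the same hardware, before counting the softer GPU prices.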

What this means for the CFO conversation

This has two implications for the CFO conversation about LLM strategy. First, at any meaningful enterprise volume, on-prem is now both the cheaper option and the easier one for compliance. The 2024 argument that "the cloud is so much cheaper that compliance can be solved separately" has inverted. Second, the operational cost of running on-prem inference is no longer a serious objection at this scale. The SRE allocation needed to keep a 4× H100 cluster running well is 0.15 FTE, comfortably less than the procurement, security review, BAA negotiation, and vendor-management overhead of running the same workload on a hosted cloud API.

The model in two sentences

Below 200M tokens/month: cloud is cheaper. Above 1B tokens/month: on-prem is materially cheaper, and the gap widens as volume grows. Every regulated-industry customer we deploy sits well above the 1B threshold, which is why the architectural choice between sovereign and cloud is no longer a cost trade-off: it's a compliance and capability conversation in which on-prem also happens to be the cheaper option. The full cost model is downloadable from the Sovereign AI Playbook (the 12-minute free PDF). Customers building a board paper on the architectural choice typically run it twice, once for their current cloud pattern at projected 18-month volume and once for their potential sovereign pattern at the same volume, to produce the cost comparison their CFO wants.

About the author

MindMap Engineering

MindMap Digital Engineering Practice

MindMap Engineering is the collective practice behind 117 production-deployed AI accelerators across BFSI, healthcare, government, retail and telecom. The pieces published here are written by the engineering leads who shipped the systems they describe — sovereign LLM platforms, RAG pipelines, agentic workflows, IDP systems — at customer sites across three continents. We don't write about architectures we haven't deployed.

