Guide

Self-hosted LLM Cost Breakdown — Does It Actually Beat a Managed API?

A worked example comparing self-hosted open-weight inference against managed LLM APIs, including hardware amortization, power, ops time and the all-important utilization break-even.

By Francesco Zinghinì · Updated 2026-06-16

"Self-hosting is cheaper" is half true. It's cheaper at high utilization and unbeatable for data residency — but a dedicated GPU that sits idle is the most expensive inference money can buy. This guide gives you the framework to decide with numbers.

The two cost models

A managed API is pure variable cost: you pay per token, scaling linearly with usage, with zero fixed cost and zero ops burden. Estimate it directly in the LLM API cost calculator.

Self-hosting is mostly fixed cost: hardware (or rented GPU hours) plus power plus a share of engineering time, largely independent of how many tokens you actually push through. The per-token cost is whatever that fixed cost divided by your real throughput works out to.
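As a rough illustration of the two shapes, the sketch below prices the same volumes under both models. The prices, the 75/25 input/output split and the fixed monthly figure are hypothetical placeholders, not quotes from any provider.

```python
def managed_api_monthly_cost(input_tokens, output_tokens,
                             usd_per_m_input, usd_per_m_output):
    """Pure variable cost: linear in tokens, zero at zero usage."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def self_hosted_monthly_cost(tokens, fixed_monthly_usd):
    """Mostly fixed cost: the bill ignores how many tokens were served."""
    return fixed_monthly_usd


# Hypothetical prices (USD per million tokens) and fixed cost, for illustration only.
for volume in (0, 100_000_000, 1_000_000_000):
    api = managed_api_monthly_cost(volume * 0.75, volume * 0.25, 1.0, 4.0)
    hosted = self_hosted_monthly_cost(volume, 2_000.0)
    print(f"{volume:>13,} tokens  API ${api:>9,.2f}  self-hosted ${hosted:>9,.2f}")
```

The API column grows with volume while the self-hosted column stays flat, which is the whole comparison in miniature.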

The components of a self-hosted bill

Compute. Either amortized hardware (purchase price ÷ useful life) or rented GPU hours. Rental is simpler to reason about and avoids capital risk.
Power. GPUs draw real wattage under load; multiply by your electricity rate and hours of operation.
Ops time. The cost everyone forgets. Setup, monitoring, scaling, quantization tuning and incident response are recurring engineering hours. Put a number on them.
Redundancy. Production usually needs more than one GPU for availability, which raises fixed cost and lowers per-GPU utilization.
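The four components above can be added up in a few lines. This is a minimal sketch under stated assumptions: every input figure is hypothetical, a month is taken as 730 hours, and ops time is counted once rather than per GPU (in practice it may grow with fleet size).

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours / 12


def amortized_hardware_usd(purchase_usd, useful_life_months):
    """Purchase price spread evenly over the useful life."""
    return purchase_usd / useful_life_months


def power_usd(watts, usd_per_kwh, hours=HOURS_PER_MONTH):
    """Average draw in watts -> kWh over the month -> cost."""
    return watts / 1000 * hours * usd_per_kwh


def monthly_fixed_usd(gpu_count, compute_usd_per_gpu, power_usd_per_gpu,
                      ops_hours, ops_usd_per_hour):
    """Per-GPU costs scale with redundancy; ops time is counted once."""
    return (gpu_count * (compute_usd_per_gpu + power_usd_per_gpu)
            + ops_hours * ops_usd_per_hour)


# Hypothetical inputs, for illustration only.
# For rented GPUs, use hourly_rate * HOURS_PER_MONTH instead of amortization.
compute = amortized_hardware_usd(36_000, 36)
power = power_usd(watts=500, usd_per_kwh=0.20)
total = monthly_fixed_usd(gpu_count=2, compute_usd_per_gpu=compute,
                          power_usd_per_gpu=power,
                          ops_hours=20, ops_usd_per_hour=100)
print(f"compute/GPU ${compute:,.2f}  power/GPU ${power:,.2f}  total ${total:,.2f}")
```

With these made-up inputs, ops time is roughly half of the total, which is why it is worth putting a number on.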

The break-even is about utilization

Here's the mechanism that decides everything. Your fixed monthly cost is the same whether the GPU is busy or idle. So:

self_hosted_cost_per_token = monthly_fixed_cost / tokens_served_this_month

Double your throughput and the per-token cost halves. Halve it and the per-token cost doubles. That's why utilization, not hardware price, is the real variable.
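To see the effect in numbers, the sketch below converts a utilization level into tokens served and then into a per-million-token cost. The peak throughput (tokens per second at full load) and the fixed cost are hypothetical; real throughput depends on the model, hardware and batching.

```python
SECONDS_PER_MONTH = 730 * 3600  # average month


def usd_per_million_tokens(monthly_fixed_usd, peak_tokens_per_second, utilization):
    """Per-million-token cost when the GPU is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens = peak_tokens_per_second * SECONDS_PER_MONTH * utilization
    return monthly_fixed_usd / tokens * 1_000_000


# Hypothetical: $2,628/month fixed, 1,000 tokens/s at full load.
for utilization in (1.0, 0.5, 0.25, 0.1):
    price = usd_per_million_tokens(2_628.0, 1_000, utilization)
    print(f"utilization {utilization:>4.0%}  ${price:,.2f} per million tokens")
```

The same hardware at a tenth of the utilization costs ten times as much per token, with no change to the monthly bill.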

A worked comparison

To decide, do this:

1. Estimate your monthly token volume (input + output).
2. Get the managed-API monthly cost for that volume from the calculator, trying a frontier and a fast model.
3. Estimate your self-hosted fixed monthly cost: GPU rental (or amortized hardware) + power + a realistic share of ops time.
4. Divide that fixed cost by your expected monthly throughput to get a per-token figure, and compare it with the API's effective per-token price at the same volume.
5. Now redo step 4 at half your expected utilization. If self-hosting only wins at near-full utilization, treat it as fragile.

Most teams find a crossover volume below which managed APIs are clearly cheaper (and far less work), and above which self-hosting pulls ahead — if utilization stays high.
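The procedure and the crossover can be sketched as a small decision helper. It assumes a single blended API price per million tokens and a self-hosted setup with enough capacity for the volume in question; the function names and all figures are hypothetical.

```python
def crossover_tokens(api_usd_per_million, fixed_monthly_usd):
    """Monthly volume at which the API bill equals the fixed self-hosted bill."""
    return fixed_monthly_usd / api_usd_per_million * 1_000_000


def verdict(monthly_tokens, api_usd_per_million, fixed_monthly_usd):
    """Compare both options at the expected volume and at half of it."""
    def api_cost(tokens):
        return tokens / 1_000_000 * api_usd_per_million

    wins_expected = fixed_monthly_usd < api_cost(monthly_tokens)
    wins_at_half = fixed_monthly_usd < api_cost(monthly_tokens / 2)
    if wins_at_half:
        return "self-hosting wins, even at half volume"
    if wins_expected:
        return "self-hosting wins only near expected volume: fragile"
    return "managed API wins"


# Hypothetical: $2.00 per million tokens blended, $3,000/month fixed.
print(f"crossover: {crossover_tokens(2.0, 3_000):,.0f} tokens/month")
for volume in (1_000_000_000, 2_000_000_000, 4_000_000_000):
    print(f"{volume:>13,} tokens: {verdict(volume, 2.0, 3_000)}")
```

Running the half-volume check alongside the expected volume makes the fragile middle band visible instead of hiding it behind a single optimistic estimate.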

Don't forget the non-cost reasons

Sometimes the decision isn't about money at all. If your data can't leave your infrastructure, self-hosted inference may be the only compliant option regardless of cost. Conversely, if you need the very top of frontier reasoning, a managed API like Claude may simply do the job better per task.

Bottom line

Run both numbers at your real volume and at half of it. Self-hosting wins on steady high-volume workloads and on data residency; managed APIs win on low/bursty volume, simplicity and access to top-tier models. Start in the cost calculator and bring a realistic utilization assumption — that single number decides the outcome.

