Self-hosted LLM Cost Breakdown — Does It Actually Beat a Managed API?
A worked example comparing self-hosted open-weight inference against managed LLM APIs, including hardware amortization, power, ops time and the all-important utilization break-even.
"Self-hosting is cheaper" is half true. It's cheaper at high utilization and unbeatable for data residency — but a dedicated GPU that sits idle is the most expensive inference money can buy. This guide gives you the framework to decide with numbers.
The two cost models
A managed API is pure variable cost: you pay per token, scaling linearly with usage, with zero fixed cost and zero ops burden. Estimate it directly in the LLM API cost calculator.
Self-hosting is mostly fixed cost: hardware (or rented GPU hours) plus power plus a share of engineering time, largely independent of how many tokens you actually push through. The per-token cost is simply that fixed cost divided by the number of tokens you actually serve.
The components of a self-hosted bill
- Compute. Either amortized hardware (purchase price ÷ useful life) or rented GPU hours. Rental is simpler to reason about and avoids capital risk.
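- Power. GPUs draw substantial wattage under load; multiply the draw in kilowatts by your hours of operation and your electricity rate. This matters mainly for hardware you own, since cloud rental prices typically already include power.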
- Ops time. The cost everyone forgets. Setup, monitoring, scaling, quantization tuning and incident response are recurring engineering hours. Put a number on them.
- Redundancy. Production usually needs more than one GPU for availability, which raises fixed cost and lowers per-GPU utilization.
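The components above can be rolled into a single monthly figure. The following is a minimal sketch, not a definitive model: the field names, the 730-hour month and every number in `example` are illustrative assumptions, so substitute your own quotes and rates.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average month, 8760 hours / 12


@dataclass
class SelfHostedSetup:
    """Hypothetical inputs; none of these values come from a real price list."""
    gpu_count: int               # include redundant GPUs kept for availability
    gpu_hourly_rate: float       # rental price per GPU-hour (0 if you own the hardware)
    hardware_price: float        # purchase price of owned hardware (0 if renting)
    useful_life_months: int      # amortization period for owned hardware
    watts_per_gpu: float         # average draw under load (0 if power is bundled)
    electricity_per_kwh: float   # your electricity rate
    ops_hours_per_month: float   # recurring engineering time
    engineer_hourly_cost: float  # loaded cost of that time


def monthly_fixed_cost(s: SelfHostedSetup) -> float:
    """Sum compute, power and ops time into one monthly fixed cost."""
    rental = s.gpu_count * s.gpu_hourly_rate * HOURS_PER_MONTH
    amortized = s.hardware_price / s.useful_life_months if s.hardware_price else 0.0
    power_kwh = s.gpu_count * s.watts_per_gpu / 1000 * HOURS_PER_MONTH
    power = power_kwh * s.electricity_per_kwh
    ops = s.ops_hours_per_month * s.engineer_hourly_cost
    return rental + amortized + power + ops


# Illustrative only: two rented GPUs (one for redundancy), power bundled into rent.
example = SelfHostedSetup(
    gpu_count=2, gpu_hourly_rate=2.0, hardware_price=0.0, useful_life_months=36,
    watts_per_gpu=0.0, electricity_per_kwh=0.0,
    ops_hours_per_month=20, engineer_hourly_cost=100.0,
)
print(f"Monthly fixed cost: ${monthly_fixed_cost(example):,.2f}")
```

Note that in this made-up example the ops line is about 40% of the total, which is why it deserves an explicit number.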
The break-even is about utilization
Here's the mechanism that decides everything. Your fixed monthly cost is the same whether the GPU is busy or idle. So:
self_hosted_cost_per_token = monthly_fixed_cost / tokens_served_this_month
Double your throughput and the per-token cost halves. Halve it and the per-token cost doubles. That's why utilization, not hardware price, is the real variable.
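To make the inverse relationship concrete, here is a small sketch that applies the formula above at three throughput levels. The $5,000 fixed cost and the token volumes are hypothetical placeholders, and the function name is ours.

```python
def cost_per_million_tokens(monthly_fixed_cost: float, tokens_served: float) -> float:
    """Self-hosted cost per 1M tokens: fixed cost spread over tokens actually served."""
    if tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    return monthly_fixed_cost / tokens_served * 1_000_000


fixed = 5_000.0  # hypothetical monthly fixed cost in dollars

# Same hardware, same bill; only the number of tokens served changes.
for tokens in (500_000_000, 1_000_000_000, 2_000_000_000):
    per_million = cost_per_million_tokens(fixed, tokens)
    print(f"{tokens:>13,} tokens -> ${per_million:.2f} per 1M tokens")
```

Each doubling of tokens served halves the per-million figure, while the monthly bill itself never moves.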
A worked comparison
To decide, do this:
1. Estimate your monthly token volume (input + output).
2. Get the managed-API monthly cost for that volume from the calculator, trying a frontier and a fast model.
3. Estimate your self-hosted fixed monthly cost: GPU rental (or amortized hardware) + power + a realistic share of ops time.
4. Divide that fixed cost by your expected monthly token volume to get a self-hosted per-token figure, and compare it with the API's per-token price. At a given volume the self-hosted monthly bill is simply the fixed cost, provided your hardware can actually serve that volume.
5. Now redo step 4 at half your expected utilization (half the token volume on the same hardware). If self-hosting only wins at near-full utilization, treat it as fragile.
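The steps above can be sketched as a short script. It assumes a single blended API price per million tokens (input and output combined), a fixed cost that does not change with volume, and hardware with enough capacity for the expected volume. All figures are invented for illustration.

```python
def compare(monthly_tokens: float, api_price_per_million: float,
            self_hosted_fixed_cost: float) -> dict:
    """Compare managed-API and self-hosted monthly cost at one token volume."""
    api_cost = monthly_tokens / 1_000_000 * api_price_per_million
    self_hosted_per_million = self_hosted_fixed_cost / monthly_tokens * 1_000_000
    return {
        "api_monthly_cost": api_cost,
        "self_hosted_monthly_cost": self_hosted_fixed_cost,  # fixed, whatever the volume
        "self_hosted_per_million": self_hosted_per_million,
        "self_hosting_cheaper": self_hosted_fixed_cost < api_cost,
    }


# Hypothetical inputs: replace with your own estimates.
expected_tokens = 2_000_000_000   # monthly volume, input + output
api_price = 3.0                   # blended $ per 1M tokens (assumed)
fixed_cost = 5_000.0              # self-hosted monthly fixed cost (assumed)

at_expected = compare(expected_tokens, api_price, fixed_cost)
at_half = compare(expected_tokens / 2, api_price, fixed_cost)

for label, result in (("expected volume", at_expected), ("half volume", at_half)):
    print(f"{label}: API ${result['api_monthly_cost']:,.0f} vs "
          f"self-hosted ${result['self_hosted_monthly_cost']:,.0f} "
          f"(${result['self_hosted_per_million']:.2f} per 1M)")

if at_expected["self_hosting_cheaper"] and not at_half["self_hosting_cheaper"]:
    print("Self-hosting only wins near full utilization: treat it as fragile.")
```

With these made-up numbers, self-hosting wins at the expected volume and loses at half of it, which is exactly the fragile case the last step is meant to catch.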
Most teams find a crossover volume below which managed APIs are clearly cheaper (and far less work), and above which self-hosting pulls ahead — if utilization stays high.
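The crossover volume falls out of setting the two costs equal: the fixed cost divided by the API's per-token price. The sketch below is a rough estimate that assumes the fixed cost stays constant and capacity is sufficient at the crossover point. The two API prices are hypothetical stand-ins for a frontier and a fast model, not quoted rates.

```python
def crossover_tokens(self_hosted_fixed_cost: float, api_price_per_million: float) -> float:
    """Monthly token volume at which the API bill equals the self-hosted fixed cost."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return self_hosted_fixed_cost / api_price_per_million * 1_000_000


fixed_cost = 5_000.0  # hypothetical self-hosted monthly fixed cost

# Hypothetical blended prices in $ per 1M tokens.
for name, price in (("frontier model", 10.0), ("fast model", 0.5)):
    volume = crossover_tokens(fixed_cost, price)
    print(f"vs {name} at ${price:.2f}/1M: break-even near {volume:,.0f} tokens/month")
```

The cheaper the API model you are comparing against, the higher the volume you need before self-hosting pays off, so the crossover depends heavily on which managed model is the realistic alternative.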
Don't forget the non-cost reasons
Sometimes the decision isn't about money at all. If your data can't leave your infrastructure, self-hosted inference may be the only compliant option regardless of cost. Conversely, if you need the very top of frontier reasoning, a managed API like Claude may simply do the job better per task.
Bottom line
Run both numbers at your real volume and at half of it. Self-hosting wins on steady high-volume workloads and on data residency; managed APIs win on low/bursty volume, simplicity and access to top-tier models. Start in the cost calculator and bring a realistic utilization assumption — that single number decides the outcome.