Inference Savings Calculator

~3 minutes · result on the next page · no follow-up if you don't want one

About you

Where should we send a copy?

The result appears on the next page; we'll also email a copy you can forward. No drip sequence.

Name

Work email

Company (optional)

Step 1

Your inference spend today

A rough number is fine. The projection scales linearly with spend, so an order-of-magnitude estimate gives an order-of-magnitude answer.
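To illustrate why a rough figure is enough, here is a minimal sketch that assumes the projection is simply spend multiplied by a fixed savings rate. The function name `projected_savings` and the 20% rate are placeholders chosen for illustration, not the calculator's actual formula or figures.

```python
def projected_savings(monthly_spend_usd: float, savings_rate: float = 0.20) -> float:
    """Linear model: the projection is spend times a fixed rate.

    The 0.20 default is a hypothetical placeholder, not a real estimate.
    """
    if monthly_spend_usd < 0:
        raise ValueError("spend must be non-negative")
    return monthly_spend_usd * savings_rate


# A 10x error in the spend estimate produces a 10x error in the answer,
# and nothing worse, because the model has no non-linear terms.
low = projected_savings(10_000)
high = projected_savings(100_000)
ratio = high / low
```

Under this assumption, being off by a factor of ten on spend moves the result by the same factor of ten.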

Monthly inference spend (USD)

Sum across all providers: Anthropic, OpenAI, Bedrock, etc.

Number of providers in production

Count of distinct inference vendors you call.

Cost visibility today

Dynamic routing layer already in place

Step 2

Your workload mix

Roughly what percentage of your inference falls into each shape? Enter whole numbers that sum to about 100.

Real-time, user-facing

Chat, in-product AI features.

Real-time, background

Notifications, lower-priority enrichment.

Batch / async-eligible

Evals, embeddings, bulk generation.

Mix sums to 100%
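The "about 100" check on the three percentages above could be sketched as follows. This is an illustration only: the function name `mix_is_valid` and the 2-point tolerance are assumptions, not the form's actual validation rule.

```python
def mix_is_valid(user_facing: int, background: int, batch: int,
                 tolerance: int = 2) -> bool:
    """Return True when the three whole-number percentages sum to roughly 100.

    The tolerance (in percentage points) is a hypothetical value.
    """
    shares = (user_facing, background, batch)
    # Each share must be a plausible percentage on its own.
    if any(share < 0 or share > 100 for share in shares):
        return False
    return abs(sum(shares) - 100) <= tolerance


ok = mix_is_valid(60, 25, 15)       # sums to 100
close = mix_is_valid(60, 25, 14)    # sums to 99, within tolerance
short = mix_is_valid(50, 20, 10)    # sums to 80, rejected
```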

Step 3

Where the savings might be

Rough fractions. Leave the defaults if unsure: these inputs mainly move the headline number, not the ranking of the individual levers.

% of calls with cacheable prefix

Long system prompts, RAG context, few-shot examples.

% of calls over-provisioned

Could run on a smaller tier (Sonnet → Haiku, 4o → 4o-mini).

% open-weight-eligible

Workloads where Llama / Qwen / DeepSeek would meet your quality bar.

% of workload stable enough for dedicated capacity

Predictable, high-QPS traffic that would benefit from provisioned throughput.

Step 4

What's already in place

Levers you've already pulled drop out of the projection automatically.

Prompt caching is enabled across cacheable workloads

Open-weight models (Llama / Qwen / DeepSeek) are in production

Already on dedicated / provisioned throughput somewhere
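One way to picture levers "dropping out" is a simple filter over the list of candidate levers. The sketch below is hypothetical: the lever names, the rates in `ILLUSTRATIVE_RATES`, and the function `remaining_levers` are invented for illustration and do not reflect the calculator's internals.

```python
# Hypothetical levers and savings rates, for illustration only.
ILLUSTRATIVE_RATES = {
    "prompt_caching": 0.10,
    "right_sizing": 0.08,
    "open_weight": 0.15,
    "dedicated_capacity": 0.05,
}


def remaining_levers(already_in_place: set) -> dict:
    """Keep only the levers that have not been pulled yet."""
    return {
        name: rate
        for name, rate in ILLUSTRATIVE_RATES.items()
        if name not in already_in_place
    }


# Ticking "prompt caching" and "dedicated capacity" leaves two levers.
levers = remaining_levers({"prompt_caching", "dedicated_capacity"})
```

With nothing ticked, every lever stays in; with every box ticked, the result is empty and only the unlisted levers (if any) would contribute.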

Step 5 · optional

Self-hosting

Self-hosting is only economically meaningful above ~$50k/mo of spend on a stable workload. Leave at 0 if it is not on the table.

% of workload eligible for self-hosted vLLM

Optional

Anything else?

Constraints we should know about: compliance, latency SLAs, lock-in to a specific provider, or anything else you'd want flagged in the result.

Result appears on the next page. We'll email a copy you can forward.