LLM Cost Optimizer — Compare Real Prompt Cost Across Models
Paste your actual system prompt and user prompt. We tokenize them, project monthly cost across GPT-4o, Claude, Gemini, and Llama at your call volume, and recommend the cheapest model that meets your capability bar and latency budget.
How the optimizer works
We tokenize your actual system prompt and user prompt with a BPE-style estimator (within ±3% of true OpenAI/Anthropic billing), then project monthly cost across 10+ frontier models using your call volume. Each model gets a capability tier and a latency band; we recommend the cheapest model that meets both your capability bar and your latency budget.
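For the curious, here is a minimal sketch of that selection loop in Python. Everything in it is illustrative: the prices, tiers, and latency bands are placeholder values rather than our live pricing table, and tiktoken's public BPE stands in for our estimator.

```python
# A minimal sketch of the selection loop. All prices, tiers, and latency
# bands are illustrative placeholders, not live pricing data; tiktoken's
# o200k_base BPE stands in for the estimator described above.
import tiktoken

# model -> (input $/Mtok, output $/Mtok, capability tier, latency band)
MODELS = {
    "gpt-4o":        (2.50, 10.00, "mid",      "fast"),
    "gpt-4o-mini":   (0.15,  0.60, "light",    "fast"),
    "claude-sonnet": (3.00, 15.00, "mid",      "fast"),
    "claude-opus":   (15.00, 75.00, "flagship", "slow"),
    "gemini-flash":  (0.10,  0.40, "light",    "fast"),
}

TIER_RANK = {"light": 0, "mid": 1, "flagship": 2}

def recommend(system_prompt, user_prompt, calls_per_month,
              min_tier="mid", allowed_latency=("fast",),
              expected_output_tokens=500):
    """Return (monthly_cost, model) for the cheapest model clearing both bars."""
    enc = tiktoken.get_encoding("o200k_base")
    prompt_tokens = len(enc.encode(system_prompt + user_prompt))
    candidates = []
    for name, (in_price, out_price, tier, latency) in MODELS.items():
        if TIER_RANK[tier] < TIER_RANK[min_tier] or latency not in allowed_latency:
            continue  # fails the capability bar or the latency budget
        per_call = (prompt_tokens * in_price
                    + expected_output_tokens * out_price) / 1e6
        candidates.append((per_call * calls_per_month, name))
    return min(candidates)  # cheapest survivor
```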
Cache-aware pricing
Mark portions of your system prompt as cacheable (e.g. tool schemas, persistent context, few-shot examples). We project savings at provider-specific cache-read discounts: 90% off for Anthropic prompt caching, 50% off for OpenAI, 75% off for Gemini implicit caching.
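In formula terms: on a cache hit, the cacheable prefix bills at (1 − discount) of the normal input rate while the dynamic remainder bills at full price. The sketch below shows that adjustment; the hit-rate parameter is an assumption added for illustration, and provider cache-write surcharges are ignored since they amortize away at volume.

```python
# Cache-aware input cost per call. The hit_rate parameter is an assumption
# for illustration; provider cache-write surcharges are ignored here because
# they amortize away at high call volumes.
CACHE_READ_DISCOUNT = {"anthropic": 0.90, "openai": 0.50, "gemini": 0.75}

def cached_input_cost(provider, cacheable_tokens, dynamic_tokens,
                      input_price_per_mtok, hit_rate=0.95):
    discount = CACHE_READ_DISCOUNT[provider]
    # On a hit, the cacheable prefix bills at (1 - discount) of the normal rate.
    hit  = (cacheable_tokens * (1 - discount) + dynamic_tokens) / 1e6 * input_price_per_mtok
    miss = (cacheable_tokens + dynamic_tokens) / 1e6 * input_price_per_mtok
    return hit_rate * hit + (1 - hit_rate) * miss

# A 4,000-token tool schema plus 300 dynamic tokens at $3/Mtok input pricing:
# cached_input_cost("anthropic", 4000, 300, 3.00) -> ~$0.0026/call vs ~$0.0129 uncached
```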
Capability bar
We assign each model a tier based on standard benchmarks: flagship (Opus, GPT-5, Gemini 2.5 Pro) for hard reasoning; mid (Sonnet, GPT-4o, Flash) for general tasks; light (Haiku, GPT-5 mini) for classification and routing. Your selected bar filters the recommendation set.
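Continuing the sketch above, here is how the bar changes the outcome (same illustrative models and prices; the prompts are stand-ins):

```python
# SYSTEM and USER are stand-in prompts for the sketch above.
SYSTEM = "You are a support-ticket triage assistant. " * 40
USER = "Customer reports login failures after the 2FA rollout."

# Routing workload: light models clear the bar, so the cheapest one wins.
cost, model = recommend(SYSTEM, USER, calls_per_month=2_000_000,
                        min_tier="light")
print(f"{model}: ~${cost:,.0f}/month")

# Hard-reasoning workload with a relaxed latency budget: only the
# flagship tier qualifies, so the price floor rises accordingly.
cost, model = recommend(SYSTEM, USER, calls_per_month=50_000,
                        min_tier="flagship", allowed_latency=("fast", "slow"))
print(f"{model}: ~${cost:,.0f}/month")
```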
Frequently asked
- Which models are covered? GPT-4o, GPT-4o mini, GPT-4.1, Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, Gemini 2.5 Pro/Flash, Llama 3.1 405B/70B, and Mistral Large.
- How accurate are the token counts? We use BPE tokenization for OpenAI/Anthropic estimates, accurate to within ±3% of real billing. Thinking tokens for reasoning models are estimated separately.
- Can I model prompt caching? Yes. Pro accounts can mark portions of the system prompt as cacheable; we project cache-hit savings at each provider's read discount (90% Anthropic, 50% OpenAI, 75% Gemini).