LLM Cost Optimization in Production: Semantic Caching, Batching, and Smart Model Routing
Running GPT-4o for every agent call at enterprise scale is a budget problem. Smart model routing, semantic caching, and request batching can reduce inference costs by 60-80% without degrading output quality.
At November 2025 pricing, GPT-4o costs approximately $5/M input tokens while GPT-4o-mini is $0.15/M — a 33x cost difference for roughly equivalent quality on simple tasks. A model routing system that correctly classifies 60-70% of queries as simple (routing them to mini or Haiku) reduces inference costs by 50-65% with under 3% quality degradation on those tasks. Combined with semantic caching (targeting 30-40% cache hit rates on enterprise knowledge bases) and output length control, 70-80% total cost reduction is achievable.
The Enterprise LLM Cost Problem
An enterprise agentic platform processing 50,000 requests per day at an average of 2,000 input tokens and 500 output tokens per request, using GPT-4o throughout, costs approximately $875/day — roughly $320,000 per year — on inference alone. For most enterprises, this is an unbudgeted cost that arrives as a shock when the pilot scales to production.
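The arithmetic behind that figure is worth making explicit. A minimal cost model, with per-token prices hardcoded from the November 2025 table below:

```python
# Back-of-envelope inference cost model for the scenario above.
# GPT-4o, November 2025: $5/M input tokens, $15/M output tokens.
GPT4O_INPUT_PER_TOKEN = 5.00 / 1_000_000
GPT4O_OUTPUT_PER_TOKEN = 15.00 / 1_000_000

def daily_cost(requests_per_day: int, input_tokens: int, output_tokens: int) -> float:
    """Daily inference spend if every request runs on GPT-4o."""
    per_request = (input_tokens * GPT4O_INPUT_PER_TOKEN
                   + output_tokens * GPT4O_OUTPUT_PER_TOKEN)
    return requests_per_day * per_request

cost = daily_cost(50_000, input_tokens=2_000, output_tokens=500)
print(f"${cost:,.0f}/day (~${cost * 365:,.0f}/year)")  # $875/day (~$319,375/year)
```

Re-running the same function with the GPT-4o-mini prices from the table shows why routing matters: the identical workload on mini costs about $26/day.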
The instinct to control costs by restricting AI usage is wrong. The correct response is to optimize the inference stack so that you are paying GPT-4o prices only for GPT-4o-quality tasks. The cost landscape as of November 2025: GPT-4o at ~$5/M input tokens (OpenAI), GPT-4o-mini at ~$0.15/M (OpenAI), Claude 3.5 Haiku at ~$0.80/M (Anthropic), Llama 3.1 70B self-hosted at ~$0.10/M (compute cost, AWS). The gap between the top-tier and budget tier is 33-50x on a per-token basis.
The optimization levers are well-understood: model routing (right model for right complexity), semantic caching (avoid re-running similar queries), request batching (improve throughput, reduce overhead), prompt compression (remove redundant tokens before sending), and output length control (stop generating when the task is done). Applying all five levers in a production system requires engineering investment, but the ROI is clear.
LLM Cost Reference — November 2025
| Model | Input ($/M tokens) | Output ($/M tokens) | Best For | Relative Quality |
|---|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | Complex reasoning, multi-step agentic tasks | Highest |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long-context reasoning, coding, analysis | Highest |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast classification, summarisation, extraction | High |
| GPT-4o-mini | $0.15 | $0.60 | Simple classification, formatting, routing | Good |
| Llama 3.1 70B (self-hosted) | ~$0.10 | ~$0.10 | High-volume simple tasks, data extraction | Good |
| Llama 3.1 8B (self-hosted) | ~$0.01 | ~$0.01 | Bulk classification, structured extraction | Moderate |
The Five Cost Optimization Levers
Lever 1: Model Routing by Complexity
Route queries to the appropriate model tier based on estimated task complexity. Simple tasks — classification, keyword extraction, format conversion, FAQ-style lookups — route to GPT-4o-mini or Llama 3.1 70B. Complex tasks — multi-step reasoning, code generation, nuanced analysis, agent planning — route to GPT-4o or Claude 3.5 Sonnet. The router itself should be a lightweight classifier (fine-tuned BERT or GPT-4o-mini), not the expensive model. Typical outcome: 60-70% of requests classified as simple at one-tenth the cost.
Lever 2: Semantic Caching
Cache LLM responses indexed by semantic similarity of the input query, not exact string match. When a new query arrives, check if a semantically similar query has been answered before (cosine similarity > 0.95 against cached query embeddings). If a cache hit is found, return the cached response without an LLM call. Enterprise knowledge base queries (FAQ, policy lookup, documentation search) achieve 30-40% cache hit rates. Customer-facing queries typically achieve 15-25%. Use Redis with vector extensions or a dedicated semantic cache library (GPTCache).
Lever 3: Async Request Batching
Batch independent requests together and process them in parallel. Rather than making sequential LLM calls for 20 items to process, assemble them into a single batched request (OpenAI Batch API at 50% discount) or process concurrently with asyncio. The Batch API is specifically valuable for document processing pipelines where 50%+ cost savings come with 24-hour latency, which is acceptable for non-interactive workflows.
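The concurrent half of this lever needs nothing beyond asyncio. In the sketch below, `call_llm` is a placeholder for whatever coroutine wraps your real API call (in production, `AsyncOpenAI`'s `chat.completions.create`), and `fake_llm` is a stub standing in for it:

```python
import asyncio

async def process_batch(items, call_llm, max_concurrency: int = 10):
    """Fan out independent LLM calls concurrently instead of sequentially.

    `call_llm` is any coroutine taking one item and returning a response.
    The semaphore caps in-flight requests to stay under provider rate limits.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def _one(item):
        async with sem:
            return await call_llm(item)

    # gather() preserves input order, so results line up with items.
    return await asyncio.gather(*(_one(i) for i in items))

# Demo with a stub in place of a real API call.
async def fake_llm(item: str) -> str:
    await asyncio.sleep(0.01)  # Simulate network latency
    return item.upper()

results = asyncio.run(process_batch([f"doc-{i}" for i in range(20)], fake_llm))
print(results[:3])  # ['DOC-0', 'DOC-1', 'DOC-2']
```

With 20 sequential calls at ~1s each this loop takes ~20s; with a concurrency of 10 it takes ~2s, and the token cost is unchanged. The 50% discount only comes from the Batch API path.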
Lever 4: Prompt Compression
Long system prompts sent on every request are a significant hidden cost. A 2,000-token system prompt sent 50,000 times per day costs $500/day at GPT-4o pricing before a single word of user input is counted. Techniques: compress prompts using LLMLingua (up to 4x compression with <5% quality loss), cache the KV representation of static system prompts using OpenAI's prompt caching feature (50% discount on cached tokens), and audit prompts quarterly to remove accumulated instructions that no longer apply.
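As a rough comparison of the first two techniques on the running example, assuming the compression ratio and caching discount quoted above (illustrative figures, not benchmarks):

```python
# Estimated daily spend on the system prompt alone, for the running example:
# 2,000-token system prompt, 50,000 calls/day, GPT-4o input at $5/M.
INPUT_PRICE = 5.00 / 1_000_000
CALLS_PER_DAY = 50_000
SYSTEM_PROMPT_TOKENS = 2_000

baseline = SYSTEM_PROMPT_TOKENS * INPUT_PRICE * CALLS_PER_DAY

# LLMLingua-style 4x compression: pay for a quarter of the tokens.
compressed = baseline / 4
# Prompt caching: ~50% discount on the cached (static) prefix tokens.
cached = baseline * 0.5

print(f"baseline ${baseline:.0f}/day, compressed ${compressed:.0f}/day, "
      f"cached ${cached:.0f}/day")
```

The two techniques also stack differently: compression changes the prompt (so validate quality), while prompt caching is lossless but only discounts the static prefix.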
Lever 5: Output Length Control
LLMs have a tendency to over-generate — providing more explanation than the task requires. For structured extraction and classification tasks, explicit max_tokens constraints and JSON response format forcing (response_format: {type: json_object}) reduce average output length by 40-60% without quality loss. For agentic tasks, instruct the agent to stop when the task is complete rather than summarising every step taken.
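A minimal sketch of a constrained extraction request. The field names and prompt are hypothetical; the parameter shapes follow the OpenAI Chat Completions API, and in production you would pass them to `client.chat.completions.create(**params)`:

```python
def extraction_params(invoice_text: str) -> dict:
    """Build a length-constrained, JSON-only extraction request."""
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system",
             # json_object mode requires the word "JSON" to appear in a message.
             "content": "Extract invoice_number, total, and currency. "
                        "Respond with JSON only, no explanation."},
            {"role": "user", "content": invoice_text},
        ],
        # Forces a JSON object and suppresses verbose prose around it
        "response_format": {"type": "json_object"},
        # Hard cap: structured extraction should never need more
        "max_tokens": 200,
        "temperature": 0,
    }

params = extraction_params("Invoice #4417, total EUR 1,250.00")
# In production: from openai import OpenAI
#                resp = OpenAI().chat.completions.create(**params)
print(params["max_tokens"], params["response_format"])
```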
Model Router with Complexity Classification and Semantic Cache
```python
import asyncio
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import numpy as np
from openai import AsyncOpenAI
from redis.asyncio import Redis, from_url


class ModelTier(str, Enum):
    BUDGET = "gpt-4o-mini"   # $0.15/M input
    STANDARD = "gpt-4o"      # $5.00/M input
    REASONING = "o1-mini"    # Reserved for explicit reasoning tasks


@dataclass
class RoutedRequest:
    prompt: str
    system: str
    model: ModelTier
    cache_hit: bool
    estimated_cost_usd: float


COMPLEXITY_CLASSIFIER_PROMPT = """Classify the complexity of this task.
Simple: classification, extraction, lookup, format conversion, yes/no questions.
Complex: multi-step reasoning, code generation, detailed analysis, creative generation.
Respond with exactly one word: simple or complex."""

# Per-token pricing (November 2025, input tokens)
PRICING = {
    ModelTier.BUDGET: 0.00000015,   # $0.15/M
    ModelTier.STANDARD: 0.000005,   # $5.00/M
    ModelTier.REASONING: 0.000003,  # $3.00/M (o1-mini)
}


class IntelligentModelRouter:
    """
    Routes LLM requests to the appropriate model tier based on complexity,
    with semantic caching to avoid redundant API calls.
    """

    CACHE_SIMILARITY_THRESHOLD = 0.95
    CACHE_TTL_SECONDS = 3600  # 1 hour for most enterprise queries

    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.openai = AsyncOpenAI()
        self._redis_url = redis_url
        self._redis: Optional[Redis] = None
        self._cache_index: list[tuple[list[float], str]] = []  # (embedding, cache_key)
        self._stats = {"total": 0, "cache_hits": 0,
                       "budget_routes": 0, "standard_routes": 0}

    async def connect(self) -> None:
        # from_url is synchronous: it builds the client; connections open lazily.
        self._redis = from_url(self._redis_url, decode_responses=True)

    async def _embed(self, text: str) -> list[float]:
        resp = await self.openai.embeddings.create(
            input=text[:2000],  # Truncate for cache key generation
            model="text-embedding-3-small",
            dimensions=768,
        )
        return resp.data[0].embedding

    async def _check_semantic_cache(
        self, query_embedding: list[float]
    ) -> Optional[str]:
        if not self._cache_index:
            return None
        query_vec = np.array(query_embedding)
        best_score = 0.0
        best_key = None
        for cached_emb, cache_key in self._cache_index[-1000:]:  # Check last 1000
            cached_vec = np.array(cached_emb)
            score = float(np.dot(query_vec, cached_vec) /
                          (np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)))
            if score > best_score:
                best_score = score
                best_key = cache_key
        if best_score >= self.CACHE_SIMILARITY_THRESHOLD and best_key:
            return best_key
        return None

    async def _store_cache(self, query_embedding: list[float],
                           response: str, ttl: int) -> None:
        cache_key = "llm:" + hashlib.sha256(
            str(query_embedding[:10]).encode()
        ).hexdigest()[:16]
        await self._redis.setex(cache_key, ttl, response)
        self._cache_index.append((query_embedding, cache_key))

    async def _classify_complexity(self, prompt: str) -> ModelTier:
        resp = await self.openai.chat.completions.create(
            model="gpt-4o-mini",  # Always use the budget model for classification
            messages=[
                {"role": "system", "content": COMPLEXITY_CLASSIFIER_PROMPT},
                {"role": "user", "content": prompt[:500]},  # Truncate for speed
            ],
            max_tokens=5,
            temperature=0,
        )
        classification = resp.choices[0].message.content.strip().lower()
        return ModelTier.STANDARD if classification == "complex" else ModelTier.BUDGET

    def _estimate_cost(self, prompt: str, model: ModelTier) -> float:
        # Input-token cost only, using a rough ~1.3 tokens-per-word heuristic.
        token_estimate = len(prompt.split()) * 1.3
        return token_estimate * PRICING[model]

    async def complete(
        self,
        prompt: str,
        system: str = "",
        force_model: Optional[ModelTier] = None,
        cache_ttl: int = CACHE_TTL_SECONDS,
    ) -> RoutedRequest:
        if self._redis is None:
            raise RuntimeError("call connect() before complete()")
        self._stats["total"] += 1
        full_prompt = f"{system}\n\n{prompt}" if system else prompt

        # Check semantic cache first
        query_embedding = await self._embed(full_prompt)
        cache_key = await self._check_semantic_cache(query_embedding)
        if cache_key:
            cached = await self._redis.get(cache_key)
            if cached:
                self._stats["cache_hits"] += 1
                return RoutedRequest(prompt=prompt, system=system,
                                     model=ModelTier.BUDGET, cache_hit=True,
                                     estimated_cost_usd=0.0)

        # Route to the appropriate model
        model = force_model or await self._classify_complexity(prompt)
        if model == ModelTier.BUDGET:
            self._stats["budget_routes"] += 1
        else:
            self._stats["standard_routes"] += 1

        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        resp = await self.openai.chat.completions.create(
            model=model.value, messages=messages, temperature=0.2
        )
        response_text = resp.choices[0].message.content
        await self._store_cache(query_embedding, response_text, cache_ttl)
        return RoutedRequest(
            prompt=prompt, system=system, model=model, cache_hit=False,
            estimated_cost_usd=self._estimate_cost(full_prompt, model),
        )

    def print_stats(self) -> None:
        n = self._stats["total"] or 1
        print(f"Total: {n} | Cache hits: {self._stats['cache_hits']/n:.1%} | "
              f"Budget routes: {self._stats['budget_routes']/n:.1%} | "
              f"Standard routes: {self._stats['standard_routes']/n:.1%}")


async def main():
    router = IntelligentModelRouter()
    await router.connect()
    result = await router.complete(
        prompt="What is the refund policy for enterprise customers?",
        system="You are a helpful assistant for Acme Corp.",
    )
    print(f"Model: {result.model.value}, Cache hit: {result.cache_hit}, "
          f"Est. cost: ${result.estimated_cost_usd:.6f}")
    router.print_stats()


if __name__ == "__main__":
    asyncio.run(main())
```

Model router with semantic cache and complexity classifier. The complexity classifier itself always runs on GPT-4o-mini to avoid the paradox of using an expensive model to decide when to use an expensive model. In production, the cache index should be backed by Redis with vector extensions (Redis Stack) rather than the in-memory list shown here for simplicity.
OpenAI's Batch API processes requests asynchronously within 24 hours at a 50% cost reduction. For any workflow that processes documents, generates reports, or runs batch analyses — where a real-time response is not required — the Batch API pays for itself immediately. A 10,000-document processing job that would cost $500 at standard API rates costs $250 via the Batch API. Route all non-interactive workloads through the Batch API by default.
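A sketch of the Batch API workflow: build the JSONL input locally, then upload it and submit the batch job. `build_batch_lines` and `submit_batch` are illustrative names; the upload and `batches.create` calls follow the OpenAI Python SDK:

```python
import json

def build_batch_lines(documents: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Serialise one /v1/chat/completions request per document in the
    JSONL shape the Batch API expects: custom_id, method, url, body."""
    lines = []
    for i, doc in enumerate(documents):
        request = {
            "custom_id": f"doc-{i}",  # Used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Summarise: {doc}"}],
                "max_tokens": 300,
            },
        }
        lines.append(json.dumps(request))
    return lines

def submit_batch(client, path: str = "batch_input.jsonl"):
    """Upload the JSONL file and create the batch job (24h window, ~50% discount).
    `client` is an openai.OpenAI instance."""
    batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )

lines = build_batch_lines(["Q3 revenue report", "Vendor contract"])
print(lines[0])
```

Results arrive as a downloadable JSONL file keyed by `custom_id`, so a polling loop on the batch status plus a join on `custom_id` completes the pipeline.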
Cost Optimization Quick Wins (Implement This Week)
- Audit your top-5 highest-volume agent prompts. Are any of them sending 2,000+ token system prompts on every call? Compress them with LLMLingua or restructure to use prompt caching.
- Identify all classification and routing calls in your agent stack. These are prime candidates for downgrade to GPT-4o-mini — 33x cheaper with comparable accuracy on binary and categorical classification.
- Implement response_format: json_object on all structured extraction calls. This alone typically reduces output token usage by 30-40% by preventing verbose explanatory text.
- Add max_tokens limits to all non-creative LLM calls. Most factual Q&A, classification, and extraction tasks should complete within 200-500 output tokens. Unconstrained generation is a silent cost leak.
- Set up cost monitoring with per-endpoint token tracking before implementing any other optimization. You cannot optimise what you do not measure — and most teams are surprised by where their token budget actually goes.
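The last bullet, per-endpoint token tracking, can start as small as an in-process ledger fed from each response's `usage` field. A minimal sketch, with illustrative prices and hypothetical endpoint names:

```python
from collections import defaultdict

# $/M input, $/M output (November 2025 list prices)
PRICES = {"gpt-4o": (5.00, 15.00), "gpt-4o-mini": (0.15, 0.60)}

class TokenLedger:
    """Accumulates token usage per (endpoint, model) pair. In production,
    call record() after every LLM response and export to your metrics backend."""

    def __init__(self):
        self._usage = defaultdict(lambda: [0, 0])  # (endpoint, model) -> [in, out]

    def record(self, endpoint: str, model: str,
               input_tokens: int, output_tokens: int) -> None:
        self._usage[(endpoint, model)][0] += input_tokens
        self._usage[(endpoint, model)][1] += output_tokens

    def cost_by_endpoint(self) -> dict[str, float]:
        totals: dict[str, float] = {}
        for (endpoint, model), (inp, outp) in self._usage.items():
            in_price, out_price = PRICES[model]
            totals[endpoint] = totals.get(endpoint, 0.0) + \
                (inp * in_price + outp * out_price) / 1_000_000
        return totals

ledger = TokenLedger()
ledger.record("/summarise", "gpt-4o", 2_000, 500)
ledger.record("/classify", "gpt-4o-mini", 300, 10)
print(ledger.cost_by_endpoint())
```

One week of this data, grouped by endpoint, is usually enough to rank the five levers by expected payoff for your workload.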
Inductivee's Cost Optimization Framework
When we audit a client's LLM spend, the two findings that appear almost universally are: unconstrained system prompt length and GPT-4o used for classification tasks. These two issues alone typically account for 50-60% of preventable cost. The fix for both is fast, non-destructive, and measurable within a week.
The more sophisticated optimization — semantic caching — requires up-front investment in infrastructure but pays out over months. For enterprise RAG systems where the same knowledge base is queried repeatedly by different users, cache hit rates of 35-40% are achievable. At GPT-4o prices, a 35% cache hit rate on a 50,000 request/day platform saves approximately $300/day — roughly $110,000/year — after accounting for the cache infrastructure cost of roughly $500/month.
For teams just starting down the cost optimization path: measure first, optimise second. A one-week instrumentation sprint adding per-call token logging to every LLM call will reveal the top five cost drivers with enough precision to prioritise which levers to pull first. Building a model router before understanding your cost distribution is premature optimisation.
Written By
Inductivee Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.