
LLM Cost Optimization in Production: Semantic Caching, Batching, and Smart Model Routing

Running GPT-4o for every agent call at enterprise scale is a budget problem. Smart model routing, semantic caching, and request batching can reduce inference costs by 60-80% without degrading output quality.

Inductivee Team · AI Engineering · November 5, 2025 (updated April 15, 2026) · 12 min read
TL;DR

At November 2025 pricing, GPT-4o costs approximately $5/M input tokens while GPT-4o-mini is $0.15/M — a 33x cost difference for roughly equivalent quality on simple tasks. A model routing system that correctly classifies 60-70% of queries as simple (routing them to mini or Haiku) reduces inference costs by 50-65% with under 3% quality degradation on those tasks. Combined with semantic caching (targeting 30-40% cache hit rates on enterprise knowledge bases) and output length control, 70-80% total cost reduction is achievable.

The Enterprise LLM Cost Problem

An enterprise agentic platform processing 50,000 requests per day at an average of 2,000 input tokens and 500 output tokens per request, using GPT-4o throughout, costs approximately $875/day (roughly $320,000 per year) on inference alone: 100M input tokens at $5/M plus 25M output tokens at $15/M. For most enterprises, this is an unbudgeted cost that shows up as a shock when the pilot scales to production.

The instinct to control costs by restricting AI usage is wrong. The correct response is to optimize the inference stack so that you are paying GPT-4o prices only for GPT-4o-quality tasks. The cost landscape as of November 2025: GPT-4o at ~$5/M input tokens (OpenAI), GPT-4o-mini at ~$0.15/M (OpenAI), Claude 3.5 Haiku at ~$0.80/M (Anthropic), Llama 3.1 70B self-hosted at ~$0.10/M (compute cost, AWS). The gap between the top-tier and budget tier is 33-50x on a per-token basis.

The optimization levers are well-understood: model routing (right model for right complexity), semantic caching (avoid re-running similar queries), request batching (improve throughput, reduce overhead), prompt compression (remove redundant tokens before sending), and output length control (stop generating when the task is done). Applying all five levers in a production system requires engineering investment, but the ROI is clear.

LLM Cost Reference — November 2025

| Model | Input ($/M tokens) | Output ($/M tokens) | Best For | Relative Quality |
|---|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | Complex reasoning, multi-step agentic tasks | Highest |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long-context reasoning, coding, analysis | Highest |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast classification, summarisation, extraction | High |
| GPT-4o-mini | $0.15 | $0.60 | Simple classification, formatting, routing | Good |
| Llama 3.1 70B (self-hosted) | ~$0.10 | ~$0.10 | High-volume simple tasks, data extraction | Good |
| Llama 3.1 8B (self-hosted) | ~$0.01 | ~$0.01 | Bulk classification, structured extraction | Moderate |

The Five Cost Optimization Levers

Lever 1: Model Routing by Complexity

Route queries to the appropriate model tier based on estimated task complexity. Simple tasks — classification, keyword extraction, format conversion, FAQ-style lookups — route to GPT-4o-mini or Llama 3.1 70B. Complex tasks — multi-step reasoning, code generation, nuanced analysis, agent planning — route to GPT-4o or Claude 3.5 Sonnet. The router itself should be a lightweight classifier (fine-tuned BERT or GPT-4o-mini), not the expensive model. Typical outcome: 60-70% of requests classified as simple at one-tenth the cost.

Lever 2: Semantic Caching

Cache LLM responses indexed by semantic similarity of the input query, not exact string match. When a new query arrives, check if a semantically similar query has been answered before (cosine similarity > 0.95 against cached query embeddings). If a cache hit is found, return the cached response without an LLM call. Enterprise knowledge base queries (FAQ, policy lookup, documentation search) achieve 30-40% cache hit rates. Customer-facing queries typically achieve 15-25%. Use Redis with vector extensions or a dedicated semantic cache library (GPTCache).

Lever 3: Async Request Batching

Batch independent requests together and process them in parallel. Rather than making 20 sequential LLM calls to process 20 items, assemble them into a single batched job (OpenAI Batch API at a 50% discount) or process them concurrently with asyncio. The Batch API is especially valuable for document processing pipelines: the 50%+ cost savings come with up to 24-hour latency, which is acceptable for non-interactive workflows.
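For the concurrent-processing half of this lever, the core pattern is a semaphore-bounded `asyncio.gather`. A minimal sketch; `fake_llm` is a stand-in for what would be an async LLM call in production:

```python
import asyncio

async def process_batch(items, worker, max_concurrency: int = 10):
    """Run `worker` over all items concurrently, capped by a semaphore
    so the burst stays under provider rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with sem:
            return await worker(item)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(i) for i in items))

# Stand-in worker; in production this would be an AsyncOpenAI chat call
async def fake_llm(item: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return item.upper()

results = asyncio.run(process_batch(["a", "b", "c"], fake_llm, max_concurrency=2))
print(results)  # ['A', 'B', 'C']
```

With 20 items and a concurrency cap of 10, wall-clock time drops to roughly two round-trips instead of twenty, with no change in per-token cost.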

Lever 4: Prompt Compression

Long system prompts sent on every request are a significant hidden cost. A 2,000-token system prompt sent 50,000 times per day costs $500/day at GPT-4o pricing before a single word of user input is counted. Techniques: compress prompts using LLMLingua (up to 4x compression with <5% quality loss), cache the KV representation of static system prompts using OpenAI's prompt caching feature (50% discount on cached tokens), and audit prompts quarterly to remove accumulated instructions that no longer apply.
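A quick back-of-envelope helper makes the overhead concrete, using the figures above (2,000-token prompt, 50,000 calls/day, GPT-4o at $5/M input, 50% prompt-caching discount):

```python
def system_prompt_overhead(
    prompt_tokens: int,
    calls_per_day: int,
    input_price_per_m: float,
    cached_discount: float = 0.5,  # OpenAI prompt caching: 50% off cached tokens
) -> tuple[float, float]:
    """Return (daily cost of the system prompt alone, daily savings
    if the prompt prefix is served from the provider's prompt cache)."""
    daily = prompt_tokens * calls_per_day / 1_000_000 * input_price_per_m
    return daily, daily * cached_discount

daily_cost, cached_savings = system_prompt_overhead(2_000, 50_000, 5.00)
print(daily_cost, cached_savings)  # 500.0 250.0
```

Compression stacks with this: a 4x LLMLingua compression on the same prompt would cut the uncached figure to roughly $125/day before the caching discount is even applied.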

Lever 5: Output Length Control

LLMs have a tendency to over-generate — providing more explanation than the task requires. For structured extraction and classification tasks, explicit max_tokens constraints and JSON response format forcing (response_format: {type: json_object}) reduce average output length by 40-60% without quality loss. For agentic tasks, instruct the agent to stop when the task is complete rather than summarising every step taken.
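The two request parameters involved (`max_tokens` and `response_format`) are standard OpenAI chat-completion parameters; the helper below is an illustrative sketch of combining them for extraction tasks (the function name and defaults are ours, and note that `json_object` mode requires the word "JSON" to appear in the messages):

```python
def constrained_extraction_params(model: str, system: str, user: str,
                                  max_output_tokens: int = 300) -> dict:
    """Build chat-completion kwargs that cap output length and force JSON,
    the two cheapest levers against over-generation."""
    return {
        "model": model,
        "messages": [
            # json_object mode requires "JSON" to be mentioned in the prompt
            {"role": "system", "content": system + " Respond with JSON only."},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_output_tokens,             # hard cap on output spend
        "response_format": {"type": "json_object"},  # suppresses prose preamble
        "temperature": 0,
    }

params = constrained_extraction_params(
    "gpt-4o-mini", "Extract invoice fields.", "Invoice #123, total $42.50")
# client.chat.completions.create(**params)
```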

Model Router with Complexity Classification and Semantic Cache

```python
import asyncio
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import numpy as np
from openai import AsyncOpenAI
from redis.asyncio import Redis, from_url


class ModelTier(str, Enum):
    BUDGET = "gpt-4o-mini"          # $0.15/M input
    STANDARD = "gpt-4o"             # $5.00/M input
    REASONING = "o1-mini"           # Reserved for explicit reasoning tasks


@dataclass
class RoutedRequest:
    prompt: str
    system: str
    model: ModelTier
    cache_hit: bool
    estimated_cost_usd: float
    response: str


COMPLEXITY_CLASSIFIER_PROMPT = """Classify the complexity of this task.
Simple: classification, extraction, lookup, format conversion, yes/no questions.
Complex: multi-step reasoning, code generation, detailed analysis, creative generation.
Respond with exactly one word: simple or complex."""

# Per-token pricing (November 2025, input tokens)
PRICING = {
    ModelTier.BUDGET: 0.00000015,    # $0.15/M
    ModelTier.STANDARD: 0.000005,    # $5.00/M
    ModelTier.REASONING: 0.000003,   # $3.00/M (o1-mini)
}


class IntelligentModelRouter:
    """
    Routes LLM requests to the appropriate model tier based on complexity,
    with semantic caching to avoid redundant API calls.
    """

    CACHE_SIMILARITY_THRESHOLD = 0.95
    CACHE_TTL_SECONDS = 3600  # 1 hour for most enterprise queries

    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.openai = AsyncOpenAI()
        self._redis_url = redis_url
        self._redis: Optional[Redis] = None
        self._cache_index: list[tuple[list[float], str]] = []  # (embedding, cache_key)
        self._stats = {"total": 0, "cache_hits": 0, "budget_routes": 0, "standard_routes": 0}

    async def connect(self):
        # from_url is synchronous: it constructs the client; the actual
        # connection is established lazily on first command.
        self._redis = from_url(self._redis_url, decode_responses=True)

    async def _embed(self, text: str) -> list[float]:
        resp = await self.openai.embeddings.create(
            input=text[:2000],  # Truncate for cache key generation
            model="text-embedding-3-small",
            dimensions=768,
        )
        return resp.data[0].embedding

    async def _check_semantic_cache(
        self, query_embedding: list[float]
    ) -> Optional[str]:
        if not self._cache_index:
            return None
        query_vec = np.array(query_embedding)
        best_score = 0.0
        best_key = None
        for cached_emb, cache_key in self._cache_index[-1000:]:  # Check last 1000
            cached_vec = np.array(cached_emb)
            score = float(np.dot(query_vec, cached_vec) /
                          (np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)))
            if score > best_score:
                best_score = score
                best_key = cache_key
        if best_score >= self.CACHE_SIMILARITY_THRESHOLD and best_key:
            return best_key
        return None

    async def _store_cache(self, query_embedding: list[float],
                           response: str, ttl: int) -> None:
        cache_key = "llm:" + hashlib.sha256(
            str(query_embedding[:10]).encode()
        ).hexdigest()[:16]
        await self._redis.setex(cache_key, ttl, response)
        self._cache_index.append((query_embedding, cache_key))

    async def _classify_complexity(self, prompt: str) -> ModelTier:
        resp = await self.openai.chat.completions.create(
            model="gpt-4o-mini",  # Always use the budget model for classification
            messages=[
                {"role": "system", "content": COMPLEXITY_CLASSIFIER_PROMPT},
                {"role": "user", "content": prompt[:500]},  # Truncate for speed
            ],
            max_tokens=5,
            temperature=0,
        )
        classification = resp.choices[0].message.content.strip().lower()
        return ModelTier.STANDARD if classification == "complex" else ModelTier.BUDGET

    def _estimate_cost(self, prompt: str, model: ModelTier) -> float:
        token_estimate = len(prompt.split()) * 1.3  # Rough token estimate (input only)
        return token_estimate * PRICING[model]

    async def complete(
        self,
        prompt: str,
        system: str = "",
        force_model: Optional[ModelTier] = None,
        cache_ttl: int = CACHE_TTL_SECONDS,
    ) -> RoutedRequest:
        if self._redis is None:
            raise RuntimeError("Call connect() before complete()")
        self._stats["total"] += 1
        full_prompt = f"{system}\n\n{prompt}" if system else prompt

        # Check semantic cache first
        query_embedding = await self._embed(full_prompt)
        cache_key = await self._check_semantic_cache(query_embedding)
        if cache_key:
            cached = await self._redis.get(cache_key)
            if cached:
                self._stats["cache_hits"] += 1
                return RoutedRequest(prompt=prompt, system=system,
                                     model=ModelTier.BUDGET, cache_hit=True,
                                     estimated_cost_usd=0.0,  # Embedding cost ignored
                                     response=cached)

        # Route to appropriate model
        model = force_model or await self._classify_complexity(prompt)
        if model == ModelTier.BUDGET:
            self._stats["budget_routes"] += 1
        else:
            self._stats["standard_routes"] += 1

        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        resp = await self.openai.chat.completions.create(
            model=model.value, messages=messages, temperature=0.2
        )
        response_text = resp.choices[0].message.content or ""
        await self._store_cache(query_embedding, response_text, cache_ttl)

        return RoutedRequest(
            prompt=prompt, system=system, model=model, cache_hit=False,
            estimated_cost_usd=self._estimate_cost(full_prompt, model),
            response=response_text,
        )

    def print_stats(self):
        n = self._stats["total"] or 1
        print(f"Total: {n} | Cache hits: {self._stats['cache_hits']/n:.1%} | "
              f"Budget routes: {self._stats['budget_routes']/n:.1%} | "
              f"Standard routes: {self._stats['standard_routes']/n:.1%}")


async def main():
    router = IntelligentModelRouter()
    await router.connect()
    result = await router.complete(
        prompt="What is the refund policy for enterprise customers?",
        system="You are a helpful assistant for Acme Corp."
    )
    print(f"Model: {result.model.value}, Cache hit: {result.cache_hit}, "
          f"Est. cost: ${result.estimated_cost_usd:.6f}")
    print(f"Response: {result.response[:120]}")
    router.print_stats()


if __name__ == "__main__":
    asyncio.run(main())
```

Model router with semantic cache and complexity classifier. The complexity classifier itself always runs on GPT-4o-mini to avoid the paradox of using an expensive model to decide when to use an expensive model. In production, the cache_index should be backed by Redis with vector extensions (Redis Stack) rather than the in-memory list shown here for simplicity.

Tip

OpenAI's Batch API processes requests asynchronously within 24 hours at a 50% cost reduction. For any workflow that processes documents, generates reports, or runs batch analyses where real-time response is not required, the Batch API pays for itself immediately. A 10,000-document processing job that would cost $500 at standard API rates costs $250 via the Batch API. Route all non-interactive workloads through the Batch API as a default.
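Submitting a batch starts with assembling a JSONL input file. A minimal sketch; the per-line shape (`custom_id`, `method`, `url`, `body`) follows the Batch API request format, while the file name and helper are ours:

```python
import json

def build_batch_file(prompts: list[str], path: str, model: str = "gpt-4o-mini") -> int:
    """Write one JSONL request line per prompt in the Batch API input
    format. Returns the number of requests written."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"req-{i}",  # used to match results back to inputs
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 300,
                },
            }
            f.write(json.dumps(request) + "\n")
    return len(prompts)

n = build_batch_file(["Summarise doc 1", "Summarise doc 2"], "batch_input.jsonl")
# Then upload and submit (network calls, shown for shape only):
# file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(input_file_id=file.id,
#                               endpoint="/v1/chat/completions",
#                               completion_window="24h")
```

Results arrive as an output JSONL file keyed by `custom_id`, so the pipeline can rejoin responses to source documents without relying on ordering.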

Cost Optimization Quick Wins (Implement This Week)

  • Audit your top-5 highest-volume agent prompts. Are any of them sending 2,000+ token system prompts on every call? Compress them with LLMLingua or restructure to use prompt caching.
  • Identify all classification and routing calls in your agent stack. These are prime candidates for downgrade to GPT-4o-mini — 33x cheaper with comparable accuracy on binary and categorical classification.
  • Implement response_format: json_object on all structured extraction calls. This alone typically reduces output token usage by 30-40% by preventing verbose explanatory text.
  • Add max_tokens limits to all non-creative LLM calls. Most factual Q&A, classification, and extraction tasks should complete within 200-500 output tokens. Unconstrained generation is a silent cost leak.
  • Set up cost monitoring with per-endpoint token tracking before implementing any other optimization. You cannot optimise what you do not measure — and most teams are surprised by where their token budget actually goes.
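The last item, per-endpoint token tracking, needs very little code to get started. A minimal in-process sketch (class and method names are ours; a production system would export these counters to a metrics backend rather than keep them in memory):

```python
from collections import defaultdict

class TokenCostTracker:
    """Accumulates token usage and cost per endpoint, so you can see
    where the budget actually goes before optimising anything."""

    def __init__(self, prices_per_m: dict[str, tuple[float, float]]):
        self.prices = prices_per_m  # model -> (input $/M, output $/M)
        self.totals = defaultdict(lambda: {"input": 0, "output": 0, "cost": 0.0})

    def record(self, endpoint: str, model: str,
               input_tokens: int, output_tokens: int) -> None:
        pin, pout = self.prices[model]
        entry = self.totals[endpoint]
        entry["input"] += input_tokens
        entry["output"] += output_tokens
        entry["cost"] += input_tokens / 1e6 * pin + output_tokens / 1e6 * pout

    def top_endpoints(self, n: int = 5):
        """The n endpoints with the highest accumulated cost."""
        return sorted(self.totals.items(), key=lambda kv: -kv[1]["cost"])[:n]

tracker = TokenCostTracker({"gpt-4o": (5.00, 15.00), "gpt-4o-mini": (0.15, 0.60)})
tracker.record("support-agent", "gpt-4o", 2_000, 500)
tracker.record("classifier", "gpt-4o-mini", 300, 5)
print(tracker.top_endpoints())
```

Calling `record` from a thin wrapper around every LLM client call is usually a one-day change, and the `top_endpoints` ranking is exactly the prioritisation input the audit items above require.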

Inductivee's Cost Optimization Framework

When we audit a client's LLM spend, the two findings that appear almost universally are: unconstrained system prompt length and GPT-4o used for classification tasks. These two issues alone typically account for 50-60% of preventable cost. The fix for both is fast, non-destructive, and measurable within a week.

The more sophisticated optimization, semantic caching, requires up-front investment in infrastructure but pays out over months. For enterprise RAG systems where the same knowledge base is queried repeatedly by different users, cache hit rates of 35-40% are achievable. At GPT-4o prices, a 35% cache hit rate on a 50,000 request/day platform saves roughly $300/day (over $100,000/year), comfortably clearing a cache infrastructure cost of roughly $500/month.

For teams just starting down the cost optimization path: measure first, optimise second. A one-week instrumentation sprint adding per-call token logging to every LLM call will reveal the top five cost drivers with enough precision to prioritise which levers to pull first. Building a model router before understanding your cost distribution is premature optimisation.

Frequently Asked Questions

How much can LLM inference costs be reduced in production?

Combining model routing (budget models for simple tasks), semantic caching (30-40% cache hit rates), prompt compression, and output length control typically achieves 60-80% reduction in LLM inference costs versus a naive GPT-4o-for-everything approach. The exact reduction depends on workload mix — knowledge base query workloads with high cache hit rates achieve the upper range; diverse creative generation workloads achieve the lower range.

What is model routing in LLM applications?

Model routing classifies incoming requests by task complexity and directs them to the appropriate model tier — budget models (GPT-4o-mini, Llama 3.1 70B) for simple classification, extraction, and lookup tasks, and premium models (GPT-4o, Claude 3.5 Sonnet) for complex reasoning and generation. The router itself should be a lightweight classifier running on the budget tier. A correctly implemented router achieves 60-70% of requests on budget-tier models with under 3% quality degradation.

What is semantic caching for LLMs?

Semantic caching stores LLM responses indexed by the semantic content of the input query (via embedding similarity), not exact string match. When a new query arrives with cosine similarity above a threshold (typically 0.95) against a cached query, the cached response is returned without an LLM API call. Enterprise knowledge base systems achieve 30-40% cache hit rates, making semantic caching one of the highest-ROI cost reduction techniques.

How much does GPT-4o cost at enterprise scale?

At November 2025 pricing, GPT-4o costs approximately $5/M input tokens and $15/M output tokens. An enterprise platform processing 50,000 requests per day at 2,000 average input tokens and 500 output tokens would spend approximately $875/day (roughly $320,000/year) at standard pricing before any optimization. Model routing and caching can reduce this to roughly $175-260/day for most enterprise workload profiles.

What is prompt caching and how does it reduce LLM costs?

Prompt caching (available via OpenAI and Anthropic) stores the KV (key-value) representation of a prompt prefix at the API level, so repeat requests with the same system prompt prefix receive a 50% discount on cached input tokens. It is most effective for high-volume agents with long, static system prompts. A 2,000-token system prompt sent 50,000 times per day saves $250/day at GPT-4o pricing with prompt caching enabled.

Written By

Inductivee Team — AI Engineering at Inductivee


The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.

Tags: Agentic AI Architecture, Multi-Agent Orchestration, LangChain, LangGraph, CrewAI, Microsoft AutoGen

Inductivee's engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
