Architecture

vLLM in Production: Enterprise Inference Architecture for Self-Hosted LLMs

vLLM has become the default open-source serving engine for self-hosted LLMs. Here is how PagedAttention and continuous batching actually work, the deployment patterns we recommend for enterprise teams running vLLM in production, and where it fits — and does not fit — compared with TGI, TensorRT-LLM, and managed inference APIs.

Inductivee Team· AI EngineeringMay 21, 202614 min read

TL;DR

TL;DR — vLLM is an open-source LLM inference and serving engine, originally from the UC Berkeley Sky Computing Lab, built around two ideas: PagedAttention (a KV-cache memory manager modelled on virtual memory paging) and continuous batching (iteration-level scheduling that lets new requests join an in-flight batch instead of waiting for it to finish). Together they let a single GPU serve many concurrent requests at much higher utilisation than naive static batching, which is why vLLM has become the default open-source serving layer for self-hosted models. This guide covers what vLLM actually is, when it is the right choice versus TGI, TensorRT-LLM, or a managed inference API, and the production patterns — tensor parallelism, prefix caching, speculative decoding, quantisation, and observability — that we recommend for enterprise teams running it for real. The headline rule: use vLLM when you need self-hosted inference with high throughput on a standard CUDA stack; reach for TensorRT-LLM when you need maximum single-request latency on NVIDIA hardware and are willing to pay the engineering cost; stay on a managed API when your traffic is bursty enough that paying per token beats running the GPU 24/7.

What vLLM Actually Is

vLLM is an open-source library for fast LLM inference and serving, originally developed at the UC Berkeley Sky Computing Lab and now maintained by a broad community of contributors. It is released under the Apache 2.0 licence. The headline result from the original PagedAttention paper (Kwon et al., SOSP 2023, *Efficient Memory Management for Large Language Model Serving with PagedAttention*) is that careful KV-cache management plus iteration-level scheduling lets a single GPU serve substantially more concurrent requests than the naïve static-batching baselines that earlier inference servers shipped with. The exact speed-up depends on model size, sequence length, and hardware — the paper has the reference figures and the project's benchmark suite is the right source of truth for what to expect on your specific stack.

Underneath the hood, vLLM is a Python library with a C++/CUDA core that exposes two main surfaces. The first is an LLM Python class for offline batched inference — useful for evaluation runs, dataset generation, and one-off jobs. The second is an OpenAI-compatible HTTP server (vllm.entrypoints.openai.api_server) that speaks the same /v1/chat/completions and /v1/completions endpoints as the OpenAI API, which means existing client SDKs, LangChain, LangGraph, and most agent frameworks point at it with a one-line base-URL change. For most enterprise deployments, that compatibility is the practical reason vLLM wins over more bespoke serving stacks — your application code does not have to know it is talking to a self-hosted model.

vLLM supports a broad and growing list of model architectures — Llama, Mistral, Mixtral, Qwen, DeepSeek, Gemma, Phi, and many others — through Hugging Face Transformers-compatible model definitions. The supported-models list on the vLLM documentation is the canonical reference and is worth checking before committing to a model choice.

Why PagedAttention and Continuous Batching Matter

To understand what vLLM is doing differently, it helps to be concrete about where the bottleneck sits in LLM inference. Each token a model generates needs the attention mechanism to look back at every previous token, and the keys and values for those previous tokens are kept in GPU memory in the KV cache. A request that has generated 2,000 tokens is carrying ~2,000 tokens worth of KV cache around with it for every subsequent token it generates. KV-cache memory is, in practice, the constraint that determines how many concurrent requests a GPU can serve.

The pre-vLLM convention was to allocate a contiguous KV-cache buffer per request, sized to the request's maximum possible length. That is wasteful: most requests never reach maximum length, so a large fraction of allocated memory sits unused. Worse, the system has to either preallocate for worst-case (severely limiting concurrency) or dynamically reallocate (causing fragmentation). PagedAttention treats KV-cache memory like OS virtual memory — it is divided into fixed-size blocks, each request holds a list of block pointers, and blocks are allocated only when actually needed. The result is much higher KV-cache utilisation, which translates directly into more concurrent requests on the same GPU.

Continuous batching is the scheduling counterpart. Static batching waits for the slowest request in a batch to finish before starting the next batch, which means the GPU spends a lot of time finishing one long generation for a single request while shorter requests sit in the queue. Continuous batching schedules at the iteration level: as soon as one request in the batch finishes a step, a new request can take its slot. The GPU stays utilised; queue latency drops. Together with PagedAttention, this is what gives vLLM its throughput advantage.

How vLLM Compares to the Other Serious Choices

vLLM vs Hugging Face TGI (Text Generation Inference)

TGI is Hugging Face's production inference server, written in Rust with a Python model loader. It has its own continuous-batching scheduler, its own quantisation support, and a similar OpenAI-compatible API surface. The practical difference today is community velocity: vLLM has historically shipped new model architectures and new optimisation features (speculative decoding, chunked prefill, structured output, multi-LoRA) faster than TGI, and the vLLM community is larger. TGI's strengths are tight integration with the Hugging Face ecosystem and a slightly more conservative release cadence — which is sometimes what an enterprise stability team wants. Either is a defensible choice; in 2026, vLLM is the more common default for new builds.

vLLM vs NVIDIA TensorRT-LLM and Triton

TensorRT-LLM compiles the model graph to NVIDIA's optimised inference runtime, with kernel fusion, FP8 / INT8 quantisation tuned for NVIDIA hardware, and tight integration with the Triton Inference Server for production deployment. On NVIDIA GPUs, TensorRT-LLM typically reaches lower single-request latency than vLLM and is the right choice when you have a fixed model, fixed shape regime, and a strong latency SLA — interactive chat with strict p95 targets, voice-agent inner loops, real-time recommendation. The cost is engineering: every model needs to be compiled and re-compiled for new shapes or new hardware, the toolchain is more complex, and you are firmly inside NVIDIA's stack. vLLM is the better choice when you want a more dynamic workload, frequent model rotation, or a stack you can move off NVIDIA on short notice.

vLLM vs SGLang and LMDeploy

SGLang (from the LMSYS team) and LMDeploy (from the InternLM team) are two of the strongest newer entrants. SGLang in particular has invested heavily in structured generation and RadixAttention for prefix-sharing across requests, and benchmarks well against vLLM on workloads with significant shared-prefix structure (multi-turn chat, RAG with shared system prompts). LMDeploy is strong on quantisation and turbo-mind kernels. For an enterprise team standardising today, vLLM is the safer default because of community size and ecosystem support; for teams with workloads that strongly benefit from prefix sharing, SGLang is worth benchmarking head-to-head.

vLLM vs Ollama, llama.cpp, and Desktop Runtimes

Ollama and llama.cpp are excellent for local development, on-device inference, and CPU/Metal workloads. They are not the right tool for production server-side inference at scale — single-GPU throughput is lower, multi-GPU support is weaker, and the concurrency model is not built around server workloads. The right pattern is to develop against Ollama on a workstation and deploy against vLLM in production, using the same OpenAI-compatible API surface for both.

vLLM vs Managed Inference APIs (Together, Fireworks, Anyscale, Bedrock, Vertex)

The most important comparison is the one most teams skip. A managed inference API charges per token and absorbs all the operational complexity of GPU serving. Self-hosting on vLLM means paying for GPU instances around the clock, which is only economical if utilisation stays high. Bursty workloads — early-stage applications, internal tools with weekday-business-hours traffic, anything with daily peak-to-trough ratios above ~5× — usually come out cheaper on a managed API. Steady high-throughput workloads — production RAG over enterprise corpora, agent backends handling continuous traffic, batch document processing — usually come out cheaper self-hosted. The honest answer for most enterprises is a mix: managed APIs for spiky and exploratory workloads, vLLM for the high-throughput core.

Deployment Topology: Single GPU to Multi-Node

How you deploy vLLM depends on what fits on what hardware. For small and medium models — Llama 3 8B, Mistral 7B, Qwen 7B, anything in the ~7-13B parameter range at FP16 — a single modern GPU with 40-80 GB of memory is usually sufficient, and the deployment is a single container with the OpenAI-compatible server, fronted by your load balancer of choice. This is by far the most common shape and is what most teams should start with.

For larger models — 70B-class and above at FP16 — the model does not fit on a single GPU and you need tensor parallelism across multiple GPUs on a single node. vLLM supports tensor parallelism out of the box via the tensor_parallel_size argument. The constraint is fast inter-GPU interconnect: tensor parallelism communicates intermediate activations on every forward pass, so NVLink-class bandwidth is essentially required. Tensor-parallelising across PCIe-only GPUs works but is noticeably slower than the NVLink path; benchmark on your specific hardware before committing.

For models that exceed a single node — typically 400B-class and above at FP16 — you need pipeline parallelism across nodes in addition to tensor parallelism within a node. vLLM supports this through its distributed runtime, with Ray as the coordination layer. Cross-node deployments are operationally harder: you need reliable RDMA networking between nodes, careful failure-domain design (one node failing should not take down the cluster), and an autoscaler that understands the multi-node unit. Most enterprises should not run multi-node deployments until they have exhausted single-node options through quantisation, smaller model variants, or KV-cache offloading.

A practical pattern across all topologies is to run multiple replicas of the deployment behind a load balancer, scaled horizontally on request rate or GPU utilisation. Each replica is independently capable of serving the full model, which gives you both throughput scaling and zone-level fault tolerance. The Kubernetes-native pattern — one Deployment per model variant, each replica a vLLM pod with GPU resource requests, fronted by a Service and an Ingress — works well and is what most enterprise platform teams converge on.

Common Deployment Topologies

Topology	Model Size Range	GPU Requirement	Right For	Watch-Outs
Single GPU, single replica	≤13B at FP16, larger with quantisation	One A100 / H100 / L40S-class GPU	Most starting deployments; staging; low-traffic production	No fault tolerance; capacity ceiling is one GPU
Single GPU, multiple replicas	≤13B at FP16	N × A100 / H100 / L40S-class GPUs	Horizontal scaling; production with HA requirements	KV-cache is per-replica — no cross-replica prefix sharing
Tensor-parallel single node	30B-70B at FP16	2-8 GPUs on one node with NVLink	Larger models on existing GPU nodes	Requires NVLink; PCIe-only is much slower
Tensor + pipeline parallel multi-node	100B+ at FP16	Multiple GPU nodes with RDMA fabric	Frontier-scale open-source models	Operationally complex; consider quantisation first
Quantised single GPU	70B at INT8 / FP8 / AWQ / GPTQ	One A100 80GB or H100	Running larger models on smaller GPU footprint	Quality regression on some tasks — benchmark before shipping

A Production-Shaped vLLM Server Configuration

bash

# ─── PRODUCTION VLLM SERVE COMMAND ──────────────────────────────────
# Illustrative configuration for a Llama-3.1-70B deployment on a
# single H100 node with 8 GPUs and NVLink. Tune values against your
# own workload — the right numbers depend on hardware, model, and
# request mix.

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --quantization fp8 \
  --served-model-name llama-3.1-70b-prod \
  --api-key "$VLLM_API_KEY" \
  --host 0.0.0.0 --port 8000

# Key flags and what they actually do:
#
#   --tensor-parallel-size 8
#     Shard the model across 8 GPUs on this node. Requires NVLink to
#     perform well. Communication happens on every forward pass.
#
#   --gpu-memory-utilization 0.92
#     Fraction of each GPU's memory vLLM is allowed to claim for
#     model weights + KV cache. The remainder is headroom for CUDA
#     activations, NCCL buffers, and graphics-related state. 0.85-0.92
#     is the safe band; pushing higher risks OOM under load spikes.
#
#   --enable-prefix-caching
#     Reuses KV-cache blocks across requests that share a prefix
#     (system prompts, few-shot examples, common RAG context). Huge
#     win for chat-style and RAG workloads; effectively free.
#
#   --enable-chunked-prefill
#     Splits long prefill steps into chunks that interleave with
#     decode steps from other requests. Smooths out latency spikes
#     when long-context requests arrive in a hot batch.
#
#   --max-num-seqs 256
#     Concurrency ceiling per replica. Higher = more throughput but
#     more KV-cache pressure and higher per-token latency. Tune on
#     your real traffic shape rather than guessing.
#
#   --quantization fp8
#     Activate FP8 quantisation on H100-class hardware (supported by
#     vLLM via Marlin / FBGEMM-style kernels for FP8). Cuts memory
#     and improves throughput; verify quality on your evals first.

What is not shown: TLS termination (do it at the load balancer or sidecar, not in vLLM), authentication beyond the static API key (use a gateway), per-tenant rate limiting (also at the gateway), and request logging (mirror the request path through your observability stack).

Quantisation: How Far You Can Push the Memory Footprint

Quantisation is the single highest-leverage way to fit a larger model on smaller hardware. vLLM supports several quantisation formats — FP8 (on H100-class and newer hardware), INT8, AWQ, GPTQ, and GGUF-via-conversion. Each trades off in a different way between quality, throughput, and supported hardware.

FP8 is the modern default on H100, H200, B100/B200, and equivalent hardware. It is roughly transparent on quality for most workloads, halves the memory footprint versus BF16, and is well-supported in vLLM's kernel path. If you are on H100-class hardware, FP8 is usually the first thing to try.

AWQ and GPTQ are weight-only quantisation formats that compress to INT4 or INT3 weights with FP16 activations. They are common for running larger models on smaller GPUs — a 70B AWQ-int4 model fits on a single A100 80GB. The quality cost is workload-dependent; on commonsense knowledge tasks the regression is usually small, on tasks that stress long-tail factual recall it can be larger. The discipline is to benchmark the quantised model on your own evals — never assume the published numbers transfer.

INT8 occupies an awkward middle on modern hardware: less aggressive than INT4 weight-only formats, less throughput-optimised than FP8 on H100-class GPUs. It remains useful on A100 and earlier hardware where FP8 is unavailable.

The rule we apply: try the unquantised model first if it fits, fall back to FP8 on H100-class hardware before reaching for weight-only quantisation, and reserve INT4 (AWQ / GPTQ) for the cases where the smaller footprint is genuinely necessary. Quantisation introduces a quality variable that is easy to forget when debugging downstream issues months later.

Performance Levers Worth Knowing

Prefix Caching

Prefix caching (--enable-prefix-caching) reuses KV-cache blocks across requests that share an identical token prefix. The benefit is dramatic for chat workloads (shared system prompt) and RAG workloads (shared retrieval context across follow-up questions in the same conversation). It is effectively free in compute and memory — vLLM hashes block contents and reuses identical blocks. Turn it on unless you have a specific reason not to.

Chunked Prefill

Without chunked prefill, a long-prompt request entering a batch holds the GPU for the entire duration of its prefill phase, starving other requests. Chunked prefill breaks the prefill into smaller chunks that can be interleaved with decode steps from other requests in the batch. The result is smoother latency under mixed-length workloads — particularly relevant for RAG pipelines where prompts can vary from a few hundred to many thousand tokens depending on retrieval results.

Speculative Decoding

Speculative decoding pairs a small fast draft model with a larger target model: the draft model proposes several tokens at once, the target model verifies them in a single forward pass, and the rejected tokens are regenerated. The result on workloads where the draft model agrees often with the target is a meaningful latency reduction. vLLM supports several speculative-decoding variants (draft-model, n-gram, MLP-based). The trade-off is added complexity and a quality verification step — measure on your workload before deploying.

Continuous Batching Knobs

--max-num-seqs controls the maximum number of concurrent sequences a replica will hold, and --max-num-batched-tokens controls the token-budget per scheduling step. Higher values increase throughput but at the cost of per-token latency and KV-cache pressure. There is no universal right answer — the right values are workload-specific and worth tuning empirically against representative traffic. A reasonable starting point is whatever fits comfortably in 80-90% of available KV-cache memory at peak load.

Structured Output and Guided Decoding

vLLM supports constrained / guided decoding via grammars (JSON schema, regex, EBNF) through outlines and xgrammar integrations. For workloads where you need reliable structured output — function calling, tool-use loops, downstream parsers — using guided decoding at the inference layer is more reliable than post-hoc parsing and retries. The cost is a small per-step overhead at decode time; usually well worth paying.

Warning

A common production failure mode is overcommitting GPU memory. vLLM preallocates KV-cache memory based on --gpu-memory-utilization, and if the value is set too aggressively (above ~0.95), the GPU has insufficient headroom for transient allocations during inference — NCCL buffers during tensor-parallel collective ops, CUDA workspace allocations, transient activation tensors. The first symptom is sporadic CUDA out-of-memory errors under load that do not reproduce in isolation. Stay in the 0.85-0.92 band unless you have measured at peak load and have headroom to spare.

Observability and Operability for vLLM in Production

Once vLLM is serving real traffic, the metrics you need are the same shape as any other inference platform but with model-specific specifics. vLLM exposes a Prometheus metrics endpoint that includes time-to-first-token (TTFT), inter-token latency, requests in queue, KV-cache utilisation, prompt and generation token counts, and preemption counts. Scrape it, dashboard it, alert on it.

The metrics most worth alerting on are: KV-cache utilisation sustained above ~95% (you are running close to the concurrency ceiling — either add replicas or drop max-num-seqs), preemption rate above zero (vLLM is having to evict in-flight requests to make room — same remediation), and TTFT p95 above the SLA you committed to (something has shifted upstream: longer prompts, more concurrency, hardware degradation).

For request-level visibility, the OpenAI-compatible endpoints emit standard token-count fields in the response, which slot neatly into LLM cost-tracking and evaluation pipelines. For request-level tracing across the agent stack, wrap vLLM client calls in OpenTelemetry spans the same way you would any other downstream service.

Model rollout discipline is its own topic. Self-hosted models do not have to track vendor release schedules, which is a feature — but it is also a feature you have to use deliberately. Stand up a canary deployment alongside the production deployment, shift a small fraction of traffic to it, compare metrics and eval scores against the production baseline for a few days, and only then promote. The model you are running is now part of your infrastructure; treat upgrades the way you would treat a database engine upgrade.

Where vLLM Tends To Be the Right Choice

High-Volume RAG and Search Backends

RAG pipelines that serve a large user base, run inside an enterprise corpus, and have predictable steady traffic are an excellent fit for self-hosted vLLM. The economics work out (GPU is utilised continuously), data residency is fully under your control, and prefix caching pays meaningfully when the system prompt and retrieved context patterns repeat across requests.

Agent Backends with High Concurrency

Multi-step agent workflows generate many LLM calls per task. The cost of each call adds up quickly on per-token APIs, and the steady traffic pattern fits self-hosting. vLLM's continuous batching is particularly valuable here because agent traffic is mostly decode-heavy short steps where iteration-level scheduling shows its full advantage.

Regulated Environments and Air-Gapped Deployments

Some industries (healthcare, defence, parts of financial services) cannot send data to a managed API regardless of the contractual data protections in place. Self-hosting on vLLM inside the controlled environment is sometimes the only option. The operational complexity is real but is the cost of doing business in those sectors.

Cost-Sensitive High-Token Workloads

Document processing pipelines, large-scale synthetic data generation, classification at scale, and batch summarisation are workloads where per-token API costs accumulate quickly and self-hosting on amortised GPU capacity tends to win on unit economics — provided you can keep the GPU busy. The break-even point depends on your exact workload, model choice, and cloud pricing; do the calculation rather than relying on rules of thumb.

Custom or Fine-Tuned Open-Source Models

If your moat involves a fine-tuned open-source model (LoRA on Llama, full fine-tune on Mistral, domain adaptation on Qwen), vLLM is where that fine-tune runs in production. Managed APIs that host third-party fine-tunes exist but tend to charge a meaningful premium; self-hosting on vLLM keeps the fine-tune fully under your control.

Seven Decisions Worth Making Deliberately Before Your First vLLM Deployment

These are the choices that compound over the lifetime of the deployment. Made consciously at the start, the deployment scales cleanly. Made by default, they tend to require unpicking later:

Model choice and quantisation tier. Pick the smallest model that meets your quality bar — measured on your own evals, not on public leaderboards — and quantise only if the unquantised version does not fit. Quantisation is recoverable; over-committing to too large a model is expensive.
GPU class and topology. H100 / H200-class hardware with NVLink is the path of least resistance for anything above 30B parameters. A100 / L40S-class works for smaller models and is cheaper. PCIe-only multi-GPU is a trap for tensor-parallel deployments — the interconnect is the bottleneck, not the compute.
Single replica vs horizontally scaled. Production deployments need at least two replicas behind a load balancer for fault tolerance. Single-replica deployments are for staging and internal tools where outages are acceptable.
Memory utilisation ceiling. --gpu-memory-utilization in the 0.85-0.92 band is the safe default. Higher values risk transient OOM under load; lower values waste capacity. Measure under peak load before committing.
Prefix caching and chunked prefill defaults. Turn both on for any chat-style or RAG-style workload. The cases where you would turn them off are narrow and specific (workloads with no prefix overlap; latency-critical decode-only loops); default-on is the right starting point.
Observability scope from day one. Wire the Prometheus metrics endpoint into your existing dashboards, alert on KV-cache utilisation and preemption rate, and instrument client-side traces through OpenTelemetry. Retrofitting observability after the first incident is much more expensive than building it in.
Model rollout process. Decide how new model versions get promoted (canary deployment, traffic shift, eval gate) before you have to do your first upgrade. The model is now part of your infrastructure — give it the same change-management discipline you give every other dependency.

How To Approach a vLLM Pilot

A sensible first vLLM deployment is a non-critical workload with predictable traffic — internal knowledge-base search, a developer-facing assistant, a batch document-processing pipeline — running a small or medium open-source model that comfortably fits on a single GPU. Stand it up alongside whatever managed API the workload currently uses, mirror a fraction of traffic to it, compare quality and latency for a couple of weeks, and use that data to decide whether the economics justify expanding the footprint.

The assessment criteria worth measuring before promoting to the production path: quality on your own evals (not generic benchmarks), p50 / p95 / p99 latency under representative load, throughput per GPU dollar, operational incident rate during the pilot window, and the security and compliance review of the data path. If any of these is unsatisfactory, the right answer is usually to keep the workload on the managed API for now — self-hosting only pays back when you actually run the GPU hot.

If you are weighing self-hosted inference for a real workload — particularly one where managed-API costs are climbing, data residency is becoming a constraint, or you are running a fine-tuned model in production — our team can help scope it. The capability gap between managed APIs and vLLM has narrowed dramatically; the open question is usually unit economics and operational maturity, and that is a calculation specific to your traffic shape rather than a generic answer.

Frequently Asked Questions

What is vLLM?

vLLM is an open-source LLM inference and serving engine originally developed at the UC Berkeley Sky Computing Lab, released under the Apache 2.0 licence. It is built around PagedAttention, a KV-cache memory manager modelled on virtual memory paging, and continuous batching, an iteration-level scheduler that lets new requests join an in-flight batch. Together they let a single GPU serve many more concurrent requests at much higher utilisation than naive static-batching baselines. vLLM exposes both an offline Python API and an OpenAI-compatible HTTP server, which means existing client SDKs and agent frameworks can point at a self-hosted vLLM endpoint with a single base-URL change.

How does vLLM compare to Hugging Face TGI?

Both are production inference servers with continuous batching, quantisation support, and OpenAI-compatible APIs. The practical difference today is community velocity: vLLM has shipped new model architectures and new optimisation features (speculative decoding, chunked prefill, multi-LoRA serving) faster than TGI, and the vLLM contributor community is larger. TGI's strengths are tight integration with the Hugging Face ecosystem and a more conservative release cadence. For new deployments in 2026, vLLM is the more common default; for teams already on TGI with workloads that are stable, there is no urgent reason to migrate.

When should I use vLLM instead of a managed API like Bedrock or Vertex?

Self-hosting on vLLM is most economical when your traffic is steady and high enough that the GPU stays busy — production RAG over enterprise corpora, agent backends with continuous traffic, batch document processing, fine-tuned models that managed APIs do not host. Managed APIs are typically cheaper for bursty workloads, exploratory projects, and anything with daily peak-to-trough ratios above roughly 5×. Data residency and regulatory constraints also push toward self-hosting regardless of unit economics. Most enterprises end up with a mix: managed APIs for spiky and exploratory workloads, vLLM for the high-throughput core.

Does vLLM support tensor parallelism and multi-GPU deployment?

Yes. vLLM supports tensor parallelism within a single node via the --tensor-parallel-size flag, and tensor + pipeline parallelism across nodes via Ray as the distributed coordinator. Tensor parallelism requires NVLink-class GPU interconnect to perform well — communication happens on every forward pass, and PCIe-only configurations are noticeably slower. Multi-node deployments add real operational complexity (RDMA networking, failure-domain design, multi-node autoscaling); most enterprises should exhaust single-node options through quantisation and smaller model variants before reaching for multi-node.

What quantisation formats does vLLM support?

vLLM supports several quantisation formats including FP8 (on H100-class and newer hardware), INT8, AWQ, GPTQ, and others through community-contributed kernels. The right format depends on your hardware and workload: FP8 is roughly transparent on quality and a strong default on H100-class hardware; AWQ and GPTQ (weight-only INT4) trade quality for a much smaller memory footprint and let you run larger models on smaller GPUs; INT8 is most useful on A100 and earlier hardware where FP8 is unavailable. Always benchmark the quantised model on your own evals before committing — quality regression is workload-specific and never assume published numbers transfer.

Should I enable prefix caching in vLLM?

For chat-style and RAG-style workloads, almost always yes. Prefix caching reuses KV-cache blocks across requests that share an identical token prefix — system prompts, few-shot examples, common RAG retrieval context, repeated agent scaffolding. The benefit is meaningful and the cost is effectively zero. The cases where you might turn it off are narrow: workloads with no prefix overlap across requests, or very strict memory budgets where every KV-cache block is needed for active sequences. Default-on is the right starting point.

Can I run vLLM inside a regulated or air-gapped environment?

Yes — vLLM is one of the main reasons running modern LLMs inside regulated or air-gapped environments is practical. The container ships with all model weights and dependencies pulled in at build time, runs on standard CUDA infrastructure without any external API calls at inference time, and exposes its API only on the network surface you choose. The operational complexity of running GPU infrastructure in air-gapped environments is real, but the inference layer itself is well-suited to that deployment model. Healthcare, defence, and parts of financial services are common adopters for exactly this reason.

Written By

Inductivee Team

Author

Agentic AI Engineering Team

The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.

Agentic AI ArchitectureMulti-Agent OrchestrationLangChainLangGraphCrewAIMicrosoft AutoGen

LinkedIn profile

Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.

Engineer This With Inductivee

The engineering patterns in this article are what our team builds into production every day. Explore the related service to see how we deliver this capability at enterprise scale.

Service

Ready to Build This Into Your Enterprise?

Inductivee engineers agentic systems, RAG pipelines, and enterprise data liquidity solutions. Let's scope your project.

Start a Project