Your Load Balancer Has No Idea What an LLM Is (Google Fixed That)

Load balancers do one thing well: spread requests evenly across servers. Round-robin, least-connections, pick your flavor. For stateless web traffic, that works fine. For large language model inference, it causes real problems.

The Problem with Treating LLM Traffic Like HTTP Traffic

LLM serving is stateful in a way most infrastructure never anticipated. When a model processes a prompt, it generates a KV (key-value) cache that represents its “memory” of that conversation. Route the next request to a different GPU replica and that cache is gone. The replica recomputes it from scratch, burning latency and compute for no reason. Standard load balancers have no awareness of KV caches. They route blind.

GKE Inference Gateway is Google’s fix for this. It is a Kubernetes-native load balancer built specifically for LLM traffic patterns. Instead of routing by connection count or CPU, it routes based on real-time KV cache utilization across replicas. Requests with shared prompt prefixes reach the replica that already holds that context. The result: a 96% reduction in time-to-first-token for prefix-heavy workloads like coding assistants, and 40% higher throughput from the same GPU fleet. Google runs Vertex AI on it in production, so the numbers reflect real infrastructure, not a benchmark lab.
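The difference between blind and cache-aware routing can be made concrete with a toy simulation. Everything below is illustrative: the replica count, the word-count stand-in for tokens, and the hash-based fallback are assumptions made for the sketch, not the gateway’s actual algorithm.

```python
import hashlib

# Toy model of prefix-affinity routing. Each replica tracks which prompt
# prefixes it already holds a KV cache for; a request routed to a replica
# that holds the prefix skips the (simulated) prefill recomputation.

class Replica:
    def __init__(self, name):
        self.name = name
        self.cached_prefixes = set()   # prefixes with a live KV cache
        self.prefill_tokens = 0        # tokens recomputed from scratch

    def serve(self, prefix, suffix_tokens):
        if prefix not in self.cached_prefixes:
            # Cache miss: this replica must prefill the whole shared prefix.
            self.prefill_tokens += len(prefix.split())
            self.cached_prefixes.add(prefix)
        self.prefill_tokens += suffix_tokens  # new tokens are always prefilled

def route_round_robin(replicas, i):
    return replicas[i % len(replicas)]

def route_prefix_affinity(replicas, prefix):
    # Prefer a replica that already holds the prefix; otherwise hash to one.
    for r in replicas:
        if prefix in r.cached_prefixes:
            return r
    idx = int(hashlib.sha256(prefix.encode()).hexdigest(), 16) % len(replicas)
    return replicas[idx]

def simulate(cache_aware, n_requests=8):
    replicas = [Replica(f"gpu-{i}") for i in range(4)]
    shared = "system prompt with many repeated instruction tokens " * 20
    for i in range(n_requests):
        r = (route_prefix_affinity(replicas, shared) if cache_aware
             else route_round_robin(replicas, i))
        r.serve(shared, suffix_tokens=5)
    return sum(r.prefill_tokens for r in replicas)

blind = simulate(cache_aware=False)  # every replica re-prefills the prefix
aware = simulate(cache_aware=True)   # the prefix is prefilled exactly once
```

Under round-robin, every replica eventually pays the full prefill cost for the shared prefix; under prefix affinity, the fleet pays it once. That gap is what cache-aware routing is designed to close, and it widens as the shared prefix grows relative to the per-request suffix.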

It also handles disaggregated serving, splitting the compute-bound “prefill” phase (processing the prompt) from the memory-bandwidth-bound, token-by-token “decode” phase onto separately optimized hardware pools. Request prioritization lets you mark fraud detection calls as Critical and background summarization jobs as Sheddable, so high-value traffic always gets through under load.
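The shedding idea is simple enough to sketch. The Critical and Sheddable labels below follow the article; the queue itself is a hypothetical illustration of priority-based admission, not the gateway’s implementation.

```python
from dataclasses import dataclass, field

# Lower value = higher priority. Labels follow the article; the numeric
# encoding and queue logic are assumptions made for this sketch.
CRITICAL, STANDARD, SHEDDABLE = 0, 1, 2

@dataclass(order=True)
class Request:
    priority: int
    name: str = field(compare=False)

class SheddingQueue:
    """Admit up to `capacity` requests; under pressure, drop the lowest-priority one."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []

    def admit(self, req):
        self.pending.append(req)
        self.pending.sort()  # Critical requests sort to the front
        if len(self.pending) > self.capacity:
            return self.pending.pop()  # shed the lowest-priority request
        return None

# A fleet at capacity=2 receives three requests.
q = SheddingQueue(capacity=2)
shed = []
for r in [Request(SHEDDABLE, "summarize-batch"),
          Request(CRITICAL, "fraud-check-1"),
          Request(CRITICAL, "fraud-check-2")]:
    dropped = q.admit(r)
    if dropped is not None:
        shed.append(dropped.name)
# The Sheddable job is dropped; both Critical requests remain queued.
```

Note that the background job is shed even though it arrived first: priority, not arrival order, decides who survives when the queue is full.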

Who Should Care

VPs of Engineering running LLM workloads on Kubernetes and watching GPU bills climb deserve a better answer than “add more replicas.” CAIOs who want to deploy larger models without tripling the hardware budget will find the throughput gains directly relevant. Platform teams that have already wired up vLLM or a similar inference server should ask whether the routing layer above it is doing any of the work it could be doing.

On the competitive side, AWS SageMaker offers inference components and speculative decoding, but its load balancing operates at the connection level rather than the LLM layer. Azure focuses LLM serving optimizations at the model server configuration layer, not the infrastructure routing layer. Neither vendor ships a managed Kubernetes-native inference gateway that understands KV cache state. That gap is architectural, not incidental.

The question worth sitting with is not whether this technology matters. It clearly does. The real question is whether your infrastructure team knows to ask for it, or whether they are still wiring up generic ingress controllers and wondering why P95 latency looks like a bar graph of bad decisions. If your AI roadmap involves serving models at any meaningful scale, the routing layer deserves more than an afterthought.
