If you have tried to add GPU-accelerated AI inference to a product, you probably know the routine. Provision a cluster. Configure node pools. Install GPU drivers. Set up autoscaling. Reserve capacity. Pay for idle time. Repeat indefinitely as your requirements change.
Cloud Run GPUs, which became generally available in June 2025, skip most of that. Deploy a container. Add a GPU flag. Done. When no requests are running, the instance scales to zero. When traffic arrives, a new GPU instance is ready in under five seconds. Billing is per second of actual GPU usage.
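The "add a GPU flag" step looks roughly like this. A sketch only: the service name, image, and region are placeholders, and exact flag names and resource minimums can differ by gcloud version, so check `gcloud run deploy --help` before relying on it.

```shell
# Deploy a container with one NVIDIA L4 attached. GPU services need more
# CPU and memory than the defaults, and CPU must not be throttled between
# requests, hence the extra flags.
gcloud run deploy my-inference-service \
  --image=us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=4 \
  --memory=16Gi \
  --no-cpu-throttling \
  --region=us-central1
```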
Why This Is a Different Model
Every other major GPU inference option on a hyperscaler charges for reserved capacity. You tell the platform how many GPUs you need, it reserves them for you, and you pay for that reservation whether or not you are using it. For workloads with consistent, high-volume traffic, this is fine. For workloads where a user triggers inference on demand and then nothing happens for the next several minutes, you are paying for a lot of idle time.
Most AI features at the product level look like the second scenario. A user clicks a button. The model runs. The user reads the result. The GPU sits idle until the next user comes along. At scale this evens out, but at the feature level, especially for new or experimental features, reserved GPU capacity is a budget problem waiting to happen.
Scale-to-zero serverless matches the billing model to how inference actually works. Pay for the seconds the GPU is running inference. Pay nothing when it is not. For an NVIDIA L4 GPU, that is roughly $0.67 per hour of actual usage. An instance that is active 10% of the time costs roughly a tenth of what an always-on reservation would.
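The arithmetic behind that claim is easy to check. A minimal sketch using the article's approximate L4 rate; a real bill also includes per-second CPU and memory charges, so treat these as GPU-only ballpark figures:

```python
GPU_HOURLY_USD = 0.67   # approximate L4 rate cited above; excludes CPU/RAM charges
HOURS_PER_MONTH = 730   # average hours in a month

def monthly_gpu_cost(active_fraction: float) -> float:
    """Scale-to-zero billing: pay only for the fraction of time the GPU is active."""
    return HOURS_PER_MONTH * active_fraction * GPU_HOURLY_USD

serverless = monthly_gpu_cost(0.10)  # GPU busy 10% of the time
reserved = monthly_gpu_cost(1.0)     # always-on reservation equivalent
print(f"serverless: ${serverless:.2f}/mo vs reserved: ${reserved:.2f}/mo")
```

At 10% utilization that is roughly $49 per month against roughly $489 for the same GPU reserved around the clock.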
What Is Actually Running
The generally available option is the NVIDIA L4 Tensor Core GPU with 24 GB of video random-access memory (VRAM). This covers a wide range of models: most 7B to 13B parameter LLMs run comfortably, and with quantization, larger models are feasible. For workloads requiring more headroom, the NVIDIA RTX PRO 6000 Blackwell GPU with 96 GB VRAM is in preview as of April 2026.
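A rough way to sanity-check what fits in 24 GB: model weights take about parameter count times bits per weight, divided by 8, in bytes, and inference needs several more GB on top for KV cache and activations. A back-of-the-envelope sketch:

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB: params x bits / 8.
    Ignores KV cache and activations, which add several more GB."""
    return params_billions * bits_per_weight / 8

L4_VRAM_GB = 24
print(weight_gb(7, 16))   # 14.0 GB: a 7B model in fp16 fits with headroom
print(weight_gb(13, 16))  # 26.0 GB: a 13B model in fp16 does not fit
print(weight_gb(13, 4))   # 6.5 GB: a 4-bit quantized 13B fits easily
```

This is why 7B to 13B models are the sweet spot for the L4, and why quantization opens the door to larger ones.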
Cloud Run integrates with NVIDIA NIM microservices, which are pre-optimized containers for running specific AI models. This means you can deploy a production-grade LLM inference endpoint without manual optimization work: point Cloud Run at a NIM container, configure the GPU, and you get a streaming HTTP endpoint with token-by-token response delivery built in.
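On the client side, consuming a token-streaming endpoint mostly means parsing server-sent events. A minimal sketch; the `data:`/`[DONE]` wire format is an assumption based on the OpenAI-style streaming protocol that NIM containers commonly expose, so check your container's API docs:

```python
def tokens_from_sse(lines):
    """Yield token chunks from an iterable of server-sent-event lines.

    Each event arrives as "data: <chunk>"; a "data: [DONE]" sentinel
    (OpenAI-style convention, assumed here) marks end of stream.
    """
    for line in lines:
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break
            yield payload

# With a real HTTP response this would iterate resp.iter_lines();
# a literal stream stands in for it here:
stream = ["data: Hello", "data: ,", "data:  world", "data: [DONE]"]
print("".join(tokens_from_sse(stream)))
```

In production the payloads are usually JSON chunks rather than raw text, but the framing is the same.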
The GPU support also extends to Cloud Run Jobs, which handles batch workloads: embedding generation across large document sets, batch inference pipelines, model evaluation runs. Same scale-to-zero billing model, same per-second pricing.
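For the batch side, a Cloud Run Job fans work out across parallel tasks, and each task can find its position via the CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT environment variables Cloud Run sets. A sketch of sharding an embedding workload that way; the embedding call itself is omitted:

```python
import os

def shard(items, task_index: int, task_count: int):
    """Return the slice of `items` this task should process.
    Striding by task_count gives each parallel task a disjoint subset."""
    return items[task_index::task_count]

# Cloud Run Jobs injects these per task; defaults let the script run locally.
idx = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
cnt = int(os.environ.get("CLOUD_RUN_TASK_COUNT", 1))

docs = [f"doc-{i}" for i in range(10)]
for doc in shard(docs, idx, cnt):
    print(f"task {idx}: embedding {doc}")  # real job would call the model here
```

Run the same container image as a job with, say, 4 tasks, and each GPU-backed task processes a quarter of the corpus, then scales to zero.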
The Part That Matters for Enterprise Products
Specialized serverless GPU platforms like Modal and Replicate offer similar economics, and they are worth knowing about. The gap shows up on the enterprise governance side. Cloud Run GPU workloads inherit all existing Cloud Run security controls automatically: VPC Service Controls, IAM, binary authorization, Cloud Logging on every request. For ISVs selling into regulated industries, that is not a nice-to-have; it is a requirement, and it is the part that most specialized platforms cannot match.
A few things worth thinking through: What percentage of your current GPU reservation is actually utilized in a typical week? Which AI features are off your product roadmap because the infrastructure economics do not work at the usage level you expect? And if deploying GPU inference looked the same as deploying any other Cloud Run service, how would that change your team’s willingness to experiment with new model-backed features?
Want to go deeper? Here are a few links worth your time:
- Cloud Run GPUs GA announcement: supported GPU types, region availability, pricing, and configuration requirements.
- Cloud Run GPU configuration docs: technical reference for enabling GPUs, minimum resource requirements, and streaming support.
- NVIDIA on Cloud Run + NIM integration: how NVIDIA NIM microservices simplify optimized model deployment on Cloud Run GPU.
