For most of the history of AI infrastructure, the conversation was about training. Getting a model to a useful checkpoint faster, cheaper, at scale. Training is the glamorous part: the research, the capability improvements, the benchmark announcements.
Inference is where the money goes. Serving a model to real users continuously is the cost that actually compounds as AI products mature. Google built Ironwood for that reality, and it shows in the design choices.
Built for Serving, Not Just Training
Ironwood is Google’s 7th-generation custom AI chip, announced in April 2025. Every previous TPU generation was primarily a training accelerator. Ironwood is the first one Google explicitly designed around inference: high volume, low latency, efficient at scale.
The headline numbers are large enough to be meaningless without context, so here is the one that actually matters for inference workloads: 192 GB of memory per chip, with 7.37 terabytes per second of memory bandwidth. That is six times the memory capacity of the previous generation. The reason this matters is simple: large models need to fit in memory to run efficiently. Models that previously required splitting across multiple chips, with all the coordination overhead that creates, now fit on one. Fewer chips per request means lower latency and lower cost per response.
Ironwood also introduces native support for FP8, a reduced-precision number format widely used for quantized inference. Running models at FP8 instead of higher precision halves memory requirements and improves throughput without significantly degrading output quality. Previous TPU generations required software workarounds to get there. Ironwood handles it in hardware.
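The interaction between the 192 GB capacity and FP8 is worth making concrete. A minimal back-of-envelope sketch, using an illustrative 70B-parameter model (not a published Ironwood sizing guide, and ignoring KV cache and activations, which also need headroom):

```python
# Rough sketch: does a model's weights fit on a single 192 GB chip?
# The model size and precisions below are illustrative assumptions.

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just for the model weights, in GB (10^9 bytes)."""
    return params_billion * bytes_per_param

HBM_GB = 192  # Ironwood's stated per-chip memory capacity

# A hypothetical 70B-parameter model at two precisions:
for name, bytes_pp in [("BF16", 2.0), ("FP8", 1.0)]:
    weights = weight_memory_gb(70, bytes_pp)
    # Real deployments must also budget for KV cache and activations,
    # so "weights < capacity" is a necessary, not sufficient, condition.
    print(f"{name}: {weights:.0f} GB of weights, raw fit: {weights < HBM_GB}")
```

The point of the arithmetic: at FP8, the weights of a model in this class use roughly a third of the chip's memory, leaving real room for KV cache at long context lengths, where the same model at BF16 would leave much less.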
The Economics Are the Point
Ironwood delivers twice the performance per watt of its predecessor, Trillium. At the scale where inference costs actually matter, millions of requests per day, a 2x improvement in efficiency is a meaningful reduction in infrastructure spend. It is the kind of improvement that changes whether an AI-powered feature is profitable at a given price point.
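To see why a 2x performance-per-watt gain shows up directly in the bill, here is a hedged back-of-envelope calculation. Every number in it (request volume, energy per request, electricity price) is an illustrative assumption, not Google data; only the halving of energy per request comes from the claim above:

```python
# Back-of-envelope: how 2x performance per watt changes daily energy cost.
# All inputs are illustrative assumptions, not measured or published figures.

def energy_cost_per_day(requests_per_day: float,
                        joules_per_request: float,
                        usd_per_kwh: float) -> float:
    """Daily electricity cost in USD for a given per-request energy budget."""
    kwh = requests_per_day * joules_per_request / 3.6e6  # 1 kWh = 3.6e6 J
    return kwh * usd_per_kwh

# Hypothetical service: 10M requests/day at $0.10/kWh.
baseline = energy_cost_per_day(10_000_000, 500, 0.10)  # prior-generation chip
improved = energy_cost_per_day(10_000_000, 250, 0.10)  # 2x perf/watt => half the joules

print(f"baseline: ${baseline:,.2f}/day, improved: ${improved:,.2f}/day")
```

The absolute dollars depend entirely on the assumed inputs; the structural point does not. Energy cost scales linearly with joules per request, so doubling efficiency halves that line item at any volume, and the savings grow with the request count.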
There is also a supply angle worth mentioning. NVIDIA GPU procurement at volume has involved allocation queues stretching 6 to 18 months for recent generations. Ironwood is Google-designed and Google-owned. It is available through Vertex AI without navigating third-party allocation constraints. For teams planning infrastructure scaling, that certainty has real value.
Who This Actually Affects
Ironwood matters most to teams running high-volume AI inference in production, enough volume that cost per token is a line item someone is watching. At prototype scale, compute choice barely matters. At production scale, it determines product margin.
It is also relevant for anyone working with very large models or long-context processing. The memory headroom on a single Ironwood chip changes what is feasible without multi-chip distribution complexity. And the dedicated SparseCore accelerator handles embedding-heavy workloads (recommendation systems, search ranking, retrieval) more efficiently than general-purpose compute.
Ironwood powers Google’s own AI workloads, including Gemini serving. When you access it through Vertex AI, you are running on the same infrastructure class Google runs on internally, not a separate tier.
A few things worth thinking through: What proportion of your AI infrastructure spend today is inference versus training? If your cost per token dropped significantly, which features currently off your roadmap would become economically viable? And if GPU procurement delays have affected your scaling plans, is on-demand access to Google-owned inference hardware worth a closer look?
Want to go deeper? Here are a few links worth your time:
- Ironwood: The Age of Inference (Google Blog), the full announcement with architecture details and the inference-first design rationale.
- Ironwood TPUs on Google Cloud, covering cloud availability, pod configurations, and Vertex AI integration details.
- The Register: Google Ironwood TPUs, third-party technical analysis and competitive positioning.
