<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Compute</title><link>https://cloud.google.com/blog/products/compute/</link><description>Compute</description><atom:link href="https://cloudblog.withgoogle.com/blog/products/compute/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Wed, 22 Apr 2026 12:00:16 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/products/compute/static/blog/images/google.a51985becaa6.png</url><title>Compute</title><link>https://cloud.google.com/blog/products/compute/</link></image><item><title>What’s new with compute: Scaling core and agentic workloads</title><link>https://cloud.google.com/blog/products/compute/whats-new-in-compute-at-next26/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud Next, we’re announcing a range of compute capabilities to enable your core general purpose and AI workloads for the agentic world with higher performance and lower costs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why it matters:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; IT leaders and builders must balance compute investments and resources between agentic AI and general-purpose use cases, including the web servers, databases, and enterprise applications that drive everyday customer experiences.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On one side, agents can place unpredictable demand on compute infrastructure, often scaling exponentially. A single user interaction can instantaneously kick off hundreds of concurrent, high-throughput, and low-latency tasks. On the other side, general-purpose workloads generate and hold the data required to fuel the agentic world. Relying on static and siloed infrastructure to run them can risk performance bottlenecks and spiraling costs, leaving your organization unable to respond to surges in demand. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Consider a global travel application where a simple vacation search instantly triggers a massive orchestration of agentic inventory checks, dynamic pricing models, and AI-driven personalized itineraries. Without a modern architecture, this sudden surge in demand can overwhelm the core booking database and bring business to a halt. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We address this with fluid compute, Google Cloud infrastructure that adapts to your general-purpose and agentic workflows, enabling both to win by flexing performance, capacity, and scale in real time. This dynamic flexibility relies directly on the automated orchestration of Google Kubernetes Engine (GKE) and our new GKE Agent Sandbox to instantly provision secure, isolated execution environments at machine speed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a deeper look at the new compute capabilities announced at Next ‘26.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Run AI and general-purpose workloads together&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic planning and reinforcement learning depend on highly fluid compute to process unpredictable bursts of autonomous tasks. Relying on static infrastructure to isolate agent-generated code can create severe provisioning delays and inflate your cloud budget. You can remove these bottlenecks by adopting an adaptive cloud foundation: GKE Agent Sandbox empowers your teams to securely launch thousands of execution environments, and pairing these scalable sandboxes with efficient Google Axion processors helps your organization optimize total cost of ownership while fueling AI innovation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here’s what’s new in Google Cloud compute launches and announcements:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Axion N4A is GA:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Harness the agility of Google’s custom Arm-based Axion CPUs and achieve up to &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;2x better price-performance&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; than comparable current-generation x86-based VMs for cost-sensitive workloads such as Java applications, scale-out web servers, and SaaS built by startups, enterprises and partners. Learn more &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines#n4a_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Agent Sandbox, with Axion N4A for price performance, is GA.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; As the industry’s only native sandbox service among hyperscalers, GKE Agent Sandbox offers scalable and low-latency infrastructure designed for agents to safely execute untrusted code and tool calls without sacrificing performance. With Google Axion, you can build agents on leading infrastructure without compromising on cost or choice.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;GKE Agent Sandbox running on Google Axion N4A instances provides up to 30% better price-performance than the next leading hyperscale cloud provider. Try GKE Agent Sandbox &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/machine-learning/agent-sandbox"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Axion C4A.metal, our first Axion bare metal instance, is in preview:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; C4A.metal instances power Android development, automotive simulation, CI/CD pipelines, security workloads, and custom hypervisors, without the performance overhead and complexity of nested virtualization. C4A.metal will be GA this summer; learn more &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm?e=48754805%E2%80%9D%20with%20%E2%80%9Chttps://docs.cloud.google.com/compute/docs/instances/bare-metal-instances#c4a-metal"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;C4 instances offer expanded support for Intel Xeon 6 (Granite Rapids) across all shapes: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Achieve high-performance for AI workloads like LLM inference and vector search by using Intel AMX with native FP16 support to increase throughput and reduce latency, offering 13% better price-performance versus comparable Intel Xeon 6-based VMs from another leading hyperscaler. C4 VMs are available with Intel Xeon 6 processors across all shapes. Learn more &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines#c4_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible CUDs&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;expanded support is GA:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Shift spend across regions and VM families while optimizing for TCO, with flexible committed use discounts, now with support for a wider range of VM families and services, including memory-optimized (M1-M4) and HPC-optimized (H3, H4D) VM families, as well as Cloud Run. Learn more &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/committed-use-discounts-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
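&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the N4A launch concrete, here is a minimal sketch that provisions an Axion N4A VM with the google-cloud-compute Python client. The project, zone, n4a-standard-4 machine type, and boot image are illustrative assumptions, not prescriptive choices.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: create an Axion N4A VM with the google-cloud-compute client.
# Assumes: pip install google-cloud-compute; the project, zone, and
# n4a-standard-4 machine type are illustrative placeholders.
from google.cloud import compute_v1

def create_n4a_vm(project: str, zone: str, name: str) -&gt; None:
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/n4a-standard-4",
        disks=[
            compute_v1.AttachedDisk(
                boot=True,
                auto_delete=True,
                initialize_params=compute_v1.AttachedDiskInitializeParams(
                    # Arm image to match Axion's aarch64 architecture.
                    source_image="projects/debian-cloud/global/images/family/debian-12-arm64",
                ),
            )
        ],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )
    op = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )
    op.result()  # Block until the operation completes.

create_n4a_vm("my-project", "us-central1-a", "n4a-demo")&lt;/code&gt;&lt;/pre&gt;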
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here’s what customers are saying:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Unity: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Unity is redefining the economics of real-time AI with Unity Vector. By migrating its on-demand feature processor workloads to Google Axion N4A instances, Unity achieved a 20% improvement in cost efficiency without sacrificing latency. As Unity Vector scales to meet increasing demand, the move to N4A instances ensures that Unity continues to deliver industry-leading performance at a sustainable cost.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Deutsche Börse: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;A leading German market infrastructure provider, Deutsche Börse migrated and modernized dozens of core financial applications onto Google Compute Engine VMs, including latest-generation C4 and C4D instances, supporting latency-sensitive Oracle databases and post-trade processing at scale, and boosting release speed, operational agility, and resilience. This delivered the consistent performance needed to process millions of financial transactions every day, and helped Deutsche Börse achieve &lt;/span&gt;&lt;a href="https://cloud.google.com/customers/deutsche-boerse?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;58% faster time to market and 33% lower TCO&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;WP Engine&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;WP Engine powers millions of digital experiences where every millisecond matters. By running GKE clusters on C4D and N4D instances, WP Engine has seen up to a 60% reduction in latency for mobile-optimized REST APIs and up to 51% faster processing for data-rich application requests.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;eDreams ODIGEO:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Operating a high-volume, AI-powered travel platform where every millisecond dictates the customer experience, eDreams ODIGEO migrated its foundational Java-based ecommerce modules on GKE to Axion virtual machines. This immediately eliminated weeks of manual code optimization, delivered a massive 75% improvement in P95 latency with zero code changes, and unlocked price-performance to scale their global services far more cost-effectively than their legacy x86 infrastructure could.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Chainguard:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Prioritizing absolute isolation for their foundational software build system, Chainguard deployed the new Axion C4A bare metal instances. This allowed them to establish a strong hypervisor security boundary for package builds, secure their development pipeline with architectural parity, and ensure robust protection, all without compromising build performance.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Run I/O- and latency-sensitive workloads together&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Both AI and core workloads depend on the ability to store, read, and move data as a single, high-performance operation. Traditionally, these stages are slowed by network and storage limits tethered to vCPU counts, which can starve AI models of the data they need to function. You can remove these constraints by leveraging accelerated Hyperdisk performance for rapid data access and high-performance networking for consistent transit. By allowing your data pipelines to scale independently of compute, your AI training and I/O-sensitive workloads have the dedicated bandwidth they need to remain stable under peak demand.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;C4N is in preview:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Running high-volume network applications such as concurrent mobile app requests and real-time inventory updates can risk bottlenecks during peak traffic. Maximize your throughput with C4N, featuring Titanium adapters that offload complex packet processing to deliver a market-leading 95 million packets per second — a 40% performance advantage for high-traffic network applications compared to other leading hyperscalers. Designed to rapidly transfer large datasets, C4N provides nearly 400 Gbps of VM-to-VM bandwidth, a 4x improvement in bandwidth-per-vCPU, and achieves an 8x increase in egress network bandwidth through internet gateways compared to C4 VMs. C4N with Hyperdisk Extreme also provides the low-latency, high-speed data access that modern databases and enterprise AI applications need, with 25 GiB/s of block storage throughput and nearly 1M IOPS. Sign-up &lt;/span&gt;&lt;a href="https://forms.gle/tx1XV2yDrbMrcWgo8" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for C4N preview access.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;M4N is in preview: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Running memory-intensive databases can push organizations to overprovision compute cores (vCPU) to meet memory speeds, driving up software licensing fees. We introduced the new M4N series to solve this exact problem. Running Oracle workloads on M4N with Hyperdisk Extreme can reduce TCO by over 20%, enabling you to run Oracle more efficiently, with 26.57 GiB of RAM per vCPU for scale and on far fewer cores. Paired together, M4N with Hyperdisk Extreme delivers the highest per-core IOPS and throughput for high-memory instances across leading hyperscalers. Sign-up for the preview &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSeTBNw_Z5SkaeVlDMgbeFPnHS_wGsrTomEDO2cI6RIQlx93qA/viewform?usp=sharing&amp;amp;ouid=101252396062406318722" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Announcing Z4D: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Optimize I/O-intensive workloads and remove network-based storage bottlenecks with new Z4D instances. By securing up to 84 TiB of high-performance local SSD directly on the node, organizations can process massive datasets for SQL, NoSQL, and vector databases. Z4D provides up to 400 Gbps of VM-to-VM bandwidth, matching both C4N and M4N. Z4D virtual machines and bare metal instances will be in preview soon.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
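&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As referenced above, here is a minimal sketch of provisioning a Hyperdisk Extreme volume for a database host such as an M4N VM, using the google-cloud-compute Python client. The capacity, IOPS value, and names are illustrative assumptions, not sizing guidance.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: create a Hyperdisk Extreme volume with provisioned IOPS.
# Unlike fixed-ratio disks, Hyperdisk performance is dialed in per volume.
from google.cloud import compute_v1

def create_hyperdisk_extreme(project: str, zone: str, name: str) -&gt; None:
    disk = compute_v1.Disk(
        name=name,
        size_gb=4096,
        type_=f"zones/{zone}/diskTypes/hyperdisk-extreme",
        provisioned_iops=100_000,  # Provisioned independently of capacity.
    )
    op = compute_v1.DisksClient().insert(project=project, zone=zone, disk_resource=disk)
    op.result()  # Block until the disk is ready to attach.

create_hyperdisk_extreme("my-project", "us-central1-a", "oracle-data-disk")&lt;/code&gt;&lt;/pre&gt;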
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is what customers are saying:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Ericsson: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;5G Core workloads are inherently network-heavy, demanding high-throughput packet processing and deterministic latency that standard public cloud instances often struggle to maintain at scale. Ericsson found Google Cloud’s C4N to be the ideal choice for network performance to power Ericsson On-Demand. C4N’s architectural focus on network-optimized compute allows Ericsson’s 5G Core-as-a-Service to reach unprecedented throughput levels, like its recent 1 Tbps milestone, while maintaining the carrier-grade reliability its customers expect.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Teradata: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Teradata’s Autonomous Knowledge Cloud enables the world’s largest enterprises to activate enterprise intelligence and turn trusted data into measurable business outcomes. Customers rely on Teradata to run mission‑critical, highly I/O‑intensive analytics at scale where performance and efficiency directly determine value. C4N instances are well suited for these demanding workloads, delivering strong price‑performance and supporting more efficient, optimized deployments. With C4N, Teradata can help customers accelerate insights, scale with confidence, and drive greater impact from their data and AI investments. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Handle demanding storage requirements &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Foundational workloads such as web servers, applications, and databases hold the data required to fuel the agentic world. Siloing this critical information on rigid hardware creates bottlenecks that can completely stall enterprise modernization. Imagine a global retail brand running a holiday promotion, but the inventory database times out and drops customer requests because the legacy hardware couldn’t process the sudden flood of agentic queries. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Organizations require the highest-performing database hosts, backed by high IOPS and throughput per vCPU, to ensure non-blocking data delivery. Moving these applications to modern cloud infrastructure dramatically improves total cost of ownership and operational throughput. Through strategic cloud migrations, customers can eliminate the architectural walls that stall modernization and unlock their data for AI. Here is what is new in fluid compute for throughput- and capacity-sensitive workloads:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Balanced improvements. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Hyperdisk Balanced enables fast and efficient block storage for general purpose workloads, including applications and relational databases. With Hyperdisk Balanced you can drive up to 2.4 GiB/s and 160K IOPS per volume, higher than general-purpose block storage offerings from other hyperscalers, all while achieving lower mean latency than alternatives. With Hyperdisk Balanced High Availability you can now achieve a 4x performance improvement for high availability databases like SQL Server or PostgreSQL by dynamically routing full disk performance to the active VM, removing the need to overprovision storage. Leverage zero-downtime encryption key rotation and consistency groups for instant snapshots, making it easier to stay more secure. With these capabilities, you can deliver lower TCO, higher performance, and workload resilience for your general-purpose workloads. Learn more &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk ML performance improvements and Hyperdisk Exapools are GA:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With 2 TiB/s of aggregate throughput (up from 1.2 TiB/s), Hyperdisk ML helps eliminate AI storage bottlenecks, offering more than 200x higher throughput per disk than competitive offerings, so your valuable accelerator clusters never sit idle. This allows you to maximize AI compute ROI while powering the next generation of intelligent agents. In addition, for large-scale training needs, Hyperdisk Exapools offer the highest aggregate block storage performance and capacity, per AI cluster, of any hyperscaler. Learn more about Hyperdisk ML and Exapools &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/disks/hd-types/hyperdisk-ml"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/disks/hyperdisk-exapools"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Announcing Z4M: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Access up to 168 TiB of local SSD coupled with up to 400 Gbps of network bandwidth, support for RDMA, and bare-metal shapes to run distributed parallel file systems and large-scale AI/ML workloads. Z4M will be integrated with Cluster Director with the option to be colocated with accelerators to provide fast and low-latency access to data. Z4M VMs and bare metal instances are expected to be in preview in Q3 2026.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
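&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a concrete illustration of the Hyperdisk ML pattern above, here is a minimal sketch that creates a volume many inference VMs can attach read-only, so model weights are written once and served to many readers. The throughput value and names are illustrative assumptions.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: a Hyperdisk ML volume shared read-only across many VMs.
from google.cloud import compute_v1

def create_hyperdisk_ml(project: str, zone: str, name: str) -&gt; None:
    disk = compute_v1.Disk(
        name=name,
        size_gb=1024,
        type_=f"zones/{zone}/diskTypes/hyperdisk-ml",
        provisioned_throughput=10_000,  # MiB/s, shared across all readers.
        access_mode="READ_ONLY_MANY",   # Attach to many VMs simultaneously.
    )
    op = compute_v1.DisksClient().insert(project=project, zone=zone, disk_resource=disk)
    op.result()

create_hyperdisk_ml("my-project", "us-central1-a", "model-weights-disk")&lt;/code&gt;&lt;/pre&gt;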
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is what customers are saying:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Shopify&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: During Black Friday weekend sales, &lt;/span&gt;&lt;a href="https://cloud.google.com/customers/shopify-compute?e=4875480&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Shopify processed over $14.6B and tracked 136 million packages&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for 81 million buyers using its Shop App built on Compute Engine’s Z-series backed storage — without compromising speed or reliability.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;HubX&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Operating a massive portfolio of AI-powered mobile applications where rapid model loading dictates the user experience, &lt;/span&gt;&lt;a href="https://cloud.google.com/customers/hubx?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;HubX deployed Hyperdisk ML&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on GKE to eliminate severe I/O bottlenecks. Leveraging this specialized storage layer allowed HubX to support hundreds of concurrent readers and accelerate pod initialization times by 30x during peak traffic surges, drastically reducing idle accelerator costs and helping ensure their complex inference workloads scaled as expected.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Fluid infrastructure for the agentic era&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, your foundational workloads and agents no longer need to compete for capacity or performance. With Google Cloud’s fluid compute, you get adaptive cloud infrastructure that prevents bottlenecks and enables both your foundational and AI workloads to collaborate and thrive. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Ready to get started?&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Head straight to the&lt;/span&gt;&lt;a href="https://console.cloud.google.com"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud console&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to spin up a VM for your next big project. Or start planning your migration by checking out &lt;/span&gt;&lt;a href="https://cloud.google.com/migration-center"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Migration Center's&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; AI-powered toolsets to perform cost estimates, create a business case, and evaluate your modernization options.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/whats-new-in-compute-at-next26/</guid><category>Google Cloud Next</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_12_Dark.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>What’s new with compute: Scaling core and agentic workloads</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_12_Dark.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/whats-new-in-compute-at-next26/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nirav Mehta</name><title>VP, Product Management, Compute Platforms</title><department></department><company></company></author></item><item><title>Cross-cloud infrastructure innovation for the agentic enterprise</title><link>https://cloud.google.com/blog/products/compute/cross-cloud-infrastructure-at-next26/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The era of agentic AI is accelerating from human- to machine-speed operations, while also creating profound stress on legacy technology infrastructure. This new reality pushes foundational systems to their limits: agents generate thousands of internal messages and complex queries, spawning more agents, all of which can rapidly overwhelm traditional networks and databases, and expose new security vulnerabilities.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlocking AI's full potential in the era of agents requires a secure, adaptive foundation. We call it &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;cross-cloud infrastructure for the agentic enterprise&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; – and at Google Cloud Next ‘26, we’re launching a powerful set of new innovations across four areas:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What’s new:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fluid compute: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Google Compute Engine and Kubernetes services work together to enable cost-effective, high-speed AI agents and enterprise workloads with new compute and orchestration capabilities. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Secure cross-cloud connectivity: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Gateway, Cloud Armor, and other tools deliver a secure, governed, and simplified networking foundation for AI agents, including observability of agentic traffic across clouds.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unified data layer: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Smart Storage, Knowledge Catalog, and other innovations transform passive data archives into dynamic reasoning engines, giving AI agents the context they need to execute.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Digital sovereignty: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Confidential External Key Management and new features in Google Distributed Cloud bring Google’s leading models and AI enablers wherever your data lives.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a closer look at all the news for each of these four areas.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Fluid compute&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic workloads are dynamic and unpredictable, impacting both traditional enterprise applications and the AI agents themselves. Fluid compute is enabled by Google Compute Engine and Google Kubernetes services working together to dynamically adapt and shift weight in real time, enabling cost-effective, high-speed AI agents and operational enterprise workloads for all customers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/ai-infrastructure-at-next26"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer delivers raw power for large-scale AI model training&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, fluid compute addresses the needs of operational workloads and agents. As agents move toward reasoning and reinforcement learning, CPUs are reclaiming a central role, excelling at the "branchy" logic, complex control flows, and secure execution sandboxes (like those for agentic orchestration, RL, SLM inference, and RAG) that agent workflows demand. CPUs also provide the critical isolation needed for secure agent execution, complementing the parallel processing strength of GPUs and TPUs used in training.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are introducing new CPU families, GKE capabilities, and Hyperdisk block storage capabilities to run traditional workloads and AI agents securely at scale, including:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google C4N Series&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: These VMs help ensure your enterprise workloads don't slow down under the demands of agentic AI by processing up to 95 million packets per second, up to 40% faster than other leading hyperscalers.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This eliminates I/O bottlenecks for demanding workloads like security appliances, streaming media, and open source databases, even when utilizing smaller instance sizes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google M4N Series with Hyperdisk Extreme&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: M4N removes data pipeline bottlenecks and eliminates overprovisioning to deliver industry leading per-core IOPS and throughput required to handle massive data I/O from agents, analytics, and mission-critical databases. M4N provides 26.57 GB of RAM per vCPU, allowing you to scale mission-critical workloads cost-effectively on fewer cores. For example, M4N with Hyperdisk Extreme reduces Oracle workload total cost of ownership by over 20% compared to leading hyperscale clouds.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Agent Sandbox: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This solution secures agents with trusted gVisor isolation and handles demand spikes, launching up to 300 sandboxes per second, per cluster. Backed by the only managed sandbox technology available among leading hyperscale clouds, it achieves up to 30% better price-performance than competitors when running AI agents on GKE Agent Sandbox with Google Axion N4A. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
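&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a sense of how sandboxed execution looks in practice, here is a minimal sketch that requests gVisor isolation for an agent’s code-execution pod using the official Kubernetes Python client. The image, names, and namespace are placeholders; GKE Agent Sandbox builds on this gVisor RuntimeClass mechanism, and its own managed surface may differ.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: run untrusted agent code in a gVisor-isolated pod on GKE.
# Assumes: pip install kubernetes; a GKE cluster with GKE Sandbox enabled.
from kubernetes import client, config

config.load_kube_config()  # Or load_incluster_config() inside the cluster.

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="agent-task-1"),
    spec=client.V1PodSpec(
        runtime_class_name="gvisor",  # GKE Sandbox: user-space kernel isolation.
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="untrusted-code",
                image="python:3.12-slim",  # Placeholder agent runtime image.
                command=["python", "-c", "print('hello from the sandbox')"],
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)&lt;/code&gt;&lt;/pre&gt;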
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Wayfair's AI strategy is built on years of systematic infrastructure modernization on Google Cloud — migrating our core eCommerce engine and databases off legacy systems, decomposing monolithic services into cloud-native architecture, and unifying our data and analytics platform. That foundation is what makes everything else possible. Today, Gemini Enterprise Agent Platform is powering everything from catalog enrichment to generative shopping experiences that help customers create a home that's just right for them — and it's the same foundation preparing us for the agentic era, where AI doesn't just assist but actively drives discovery, personalization, and commerce across every customer touchpoint and across our business.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Fiona Tan, Chief Technology Officer, Wayfair&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Explore all our latest compute innovations in &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/whats-new-in-compute-at-next26"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this blog&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Secure cross-cloud connectivity &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI replaces predictable human requests with autonomous “reasoning loops,” in which agents call other agents that, in turn, call LLMs, triggering massive, sudden surges in compute and machine-to-machine traffic. This shift creates unique challenges for network predictability and security of non-human identities. Optimized for agentic AI, our Cross-Cloud Network moves data across diverse environments, connecting employees, customers, and agents with visibility and security. New in Cross-Cloud Network are:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Gateway:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Governs and orchestrates your enterprise agentic traffic as the “air traffic controller” in &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise Agent Platform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. It natively understands agent protocols like MCP and A2A to inspect and govern every agent interaction. By integrating with Google and third-party identity and AI safety services, it enables deep inspection to verify access, block attacks, and protect sensitive data, maintaining compliance across your core business.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Network Insights&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Delivers broad visibility across your hybrid and multi-cloud infrastructure to drive faster troubleshooting and network resolutions. Continuously monitor your end-to-end agent, network and web performance across Google Cloud, AWS, Azure, data centers, internet applications, and agentic workloads. Using synthetic traffic analytics, Cloud Network Insights provides hop-by-hop network path visibility to help you pinpoint the source of degradations and is coupled with AI-powered insights from Gemini Cloud Assist to deliver more autonomous operations.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced Cloud Next Generation Firewall (NGFW) and Cloud Armor&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Provides machine-speed, AI-powered protection to combat the rapid explosion of AI-generated polymorphic malware and zero-day exploits. Cloud NGFW advanced malware sandbox delivers real-time inline prevention of AI-generated threats, while Cloud Armor managed rules provides automated protection against both known and unknown Common Vulnerabilities and Exposures (CVEs). Together with Model Armor, these services analyze the intent and content of AI agent communications.  &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Discover more about how we &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/whats-new-in-cloud-networking-at-next26"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;optimized networking for AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in and outside of the data center. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Unified data layer&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI agents are only as powerful as the data they can access and the context they’re given. More applications and platforms are using structured and unstructured data, but it can be difficult to catalog, find, and act on that data at scale, leading to less effective agent interactions. To close the gap, your agents need all of your data brought together into a cohesive, queryable knowledge engine, or unified data layer. This way, your agents can identify and access accurate sources. At Next ‘26, we’re enhancing the unified data layer with:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Smart Storage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This solution transforms dark data into a powerful knowledge asset for AI agents and training by embedding new semantic intelligence directly into your data objects. With new Google Cloud Storage capabilities like automated annotation, entity extraction, and semantic search, your agents can instantly find and use the specific data they need — whether it's hidden in spreadsheets, PDFs, or other unstructured formats across your entire organization. This significantly speeds up the development and deployment of your AI solutions. Learn more about &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/next26-storage-announcements"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;storage innovations to accelerate your AI workloads&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Knowledge Catalog&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Knowledge Catalog maps business meaning across your entire data estate, providing a grounded source of truth so agents can deliver the most accurate results. This foundation enables AI training and inferencing and doesn’t require you to migrate your data; your agents interact with it directly, wherever it lives, with full context and governance, making modernization easier.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Part of our &lt;/span&gt;&lt;a href="https://cloud.google.com/transform/shift-system-of-action-architecting-the-agentic-data-cloud-AI"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agentic Data Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Smart Storage and Knowledge Catalog can turn your data from a passive archive into a dynamic reasoning engine.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“AI is critical to making our customers’ smart home and security solutions more intelligent and convenient. By leveraging Google Cloud’s Smart Storage, we auto-annotate rich metadata delivered in BigQuery. We’ve scaled and accelerated our data discovery and curation efforts, speeding up our AI development process from months to weeks, continuously delivering innovations that build trust and enhance the overall home experience.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Brandon Bunker, VP of Product, AI, Vivint&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Digital sovereignty&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the agentic era, digital sovereignty is a fundamental requirement for public sector and enterprise customers looking to accelerate innovation — without sacrificing control. There’s no one-size-fits-all solution, which is why we’ve designed a comprehensive set of offerings to meet different sovereign AI needs anywhere: public cloud, on-premises, or hybrid. New capabilities in our sovereign AI portfolio include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Confidential External Key Management:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Organizations can use Confidential External Key Management to maintain complete possession, custody, and control of your encryption keys and the policies that govern them. Confidential External Key Management leverages &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/confidential-computing"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Confidential Compute&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to host the key management endpoint in a tamper-proof environment within Google Cloud. You are in control and determine where your keys are stored, who can access them, and under what circumstances. Even highly privileged Google administrators cannot access your keys without authorization, which you can revoke at any time. Your data, your control.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Gemini on Google Distributed Cloud: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;With Gemini on GDC, companies can securely deploy Gemini in sensitive environments, while meeting data sovereignty needs. Your choice of deployment models includes managed software on your connected hardware or a fully disconnected, air-gapped solution. You can now scale with Google's leading AI capabilities even in the most restricted, high-security environments — from powerful Gemini models to advanced coding, search, and other agentic capabilities.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
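&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As referenced above, here is a minimal sketch of the externally held key pattern using the existing Cloud KMS Python client and its EXTERNAL protection level. The resource names are placeholders, and treating this as the surface Confidential External Key Management builds on is an assumption on our part, not a documented API.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: a Cloud KMS key whose material stays in an external
# key manager (ProtectionLevel.EXTERNAL). pip install google-cloud-kms.
from google.cloud import kms_v1

client = kms_v1.KeyManagementServiceClient()
key_ring = client.key_ring_path("my-project", "us-central1", "my-ring")

crypto_key = kms_v1.CryptoKey(
    purpose=kms_v1.CryptoKey.CryptoKeyPurpose.ENCRYPT_DECRYPT,
    version_template=kms_v1.CryptoKeyVersionTemplate(
        algorithm=kms_v1.CryptoKeyVersion.CryptoKeyVersionAlgorithm.EXTERNAL_SYMMETRIC_ENCRYPTION,
        protection_level=kms_v1.ProtectionLevel.EXTERNAL,  # Key material stays outside Google.
    ),
)
client.create_crypto_key(
    request={
        "parent": key_ring,
        "crypto_key_id": "sovereign-key",
        "crypto_key": crypto_key,
        # Key versions are added later, pointing at the external key URI
        # held in your own key manager.
        "skip_initial_version_creation": True,
    }
)&lt;/code&gt;&lt;/pre&gt;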
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, Google Distributed Cloud supports an end-to-end AI stack, combining our latest-generation AI infrastructure with Gemini models to accelerate and enhance all your sovereign AI workloads. This stack includes:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA Blackwell GPUs:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NVIDIA Blackwell (NVIDIA HGX B200) and NVIDIA Blackwell Ultra platforms (NVIDIA HGX B300) GPUs accelerate AI performance, leveraging fifth-gen NVIDIA NVLink to deliver data-center scale bandwidth directly to your environment.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;New VM families:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; New A4 family offerings provide the ability to handle the most demanding inference tasks, delivering a 2.25x increase in peak compute. Memory-Optimized M2 and M3&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;brings the&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;high memory-to-vCPU ratios needed for massive ERP and data analytics workloads on-premises.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced storage: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Eliminate storage bottlenecks with 6x storage capacity per zone and a 10x performance boost, giving you the ability to do AI reasoning on-premises. Now, your data infrastructure moves at the speed of AI reasoning.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Our customers demand high-performance, private AI inference without the risks of multi-tenancy. Google Distributed Cloud allows us to provide dedicated, low-latency environments that meet strict sensitive data requirements. With the ability to run Gemini on B200s and B300s, we can significantly increase inference speeds and provide the token throughput our clients need to scale."&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Dave Driggers, CEO &amp;amp; Co-founder, Cirrascale Cloud Services&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Transforming vision into reality &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When these product areas converge, your infrastructure evolves into a high-performing, secure, adaptive foundation for the agentic era. We're not just offering tools; we're providing the architectural blueprint to enable enterprises and the public sector to rapidly embrace the full power of AI and agents with confidence.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about key industry trends for AI Infrastructure, read our &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/content/state-of-infrastructure-in-the-agentic-ai-era?utm_source=cgc-blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=FY26-Q1-GLOBAL-STO121-website-dl-State-AI-Infra-172614&amp;amp;utm_content=state-of-infra-agentic-ai-era-report&amp;amp;utm_term=state-of-infra-agentic-ai-era-report"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;State of Infrastructure in the Agentic AI Era report&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/cross-cloud-infrastructure-at-next26/</guid><category>Networking</category><category>Storage &amp; Data Transfer</category><category>Infrastructure</category><category>Google Cloud Next</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_4_Light.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Cross-cloud infrastructure innovation for the agentic enterprise</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_4_Light.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/cross-cloud-infrastructure-at-next26/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nirav Mehta</name><title>VP, Product Management, Compute Platforms</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Muninder Sambi</name><title>VP, Google Distributed Cloud</title><department></department><company></company></author></item><item><title>What’s next in Google AI infrastructure: Scaling for the agentic era</title><link>https://cloud.google.com/blog/products/compute/ai-infrastructure-at-next26/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI is evolving from answering questions to reasoning and taking action. Companies who want to lead in today’s agentic era require computing infrastructure designed and optimized for these new requirements. Today at Google Cloud Next, we are introducing new AI infrastructure capabilities that help you innovate faster, deliver compelling user and customer experiences, and optimize for cost and energy efficiency — all at massive scale. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The shift to agentic intelligence&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the agentic era, a single intent triggers a chain reaction. Unlike a chat interaction, a primary AI agent decomposes goals into specific tasks for a fleet of specialized agents that then collaborate, preserve state, and use reinforcement learning to deliver outcomes in real time.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This process scales intelligence per interaction, but also creates complexity that yesterday’s architectures cannot support without spiraling costs or performance bottlenecks. To scale efficiently and effectively, you must move beyond manually integrating fragmented components and technologies. To deliver agentic experiences that are smart, fast, scalable, and cost-effective, you need a unified infrastructure stack that spans purpose-built hardware, open software, and flexible consumption models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google’s &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is AI-optimized infrastructure built for the agentic era, engineered to deliver on these new requirements. This is the same foundation that powers Google’s flagship Gemini models, consumer AI services, and enterprise AI offerings. Today, we are announcing a significant expansion of our AI infrastructure portfolio, including:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU 8t and TPU 8i, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our eighth generation TPUs &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A5X bare metal instances,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by NVIDIA Vera Rubin NVL72&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Axion N4A VMs,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by our custom Axion Arm-based CPUs&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Compute Engine 4th generation VMs,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by Intel and AMD x86-based CPUs&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Virgo Network, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our breakthrough data center fabric for AI workloads&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Managed Lustre,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; a high-performance parallel file system&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Z4M VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with high-capacity local SSD storage and RDMA for open parallel file systems&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dedicated KV Cache&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; scalable storage subsystem&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Native PyTorch&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; support for TPUs&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;New Google Kubernetes Engine (GKE) capabilities&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for agent-native workload orchestration&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_ai_hypercomputer.max-1000x1000.png"
        
          alt="1 ai hypercomputer"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Taken together, these capabilities will help you accelerate the development of models and complex agentic workflows to accelerate innovation, and deliver useful, responsive services to customers, all while reducing costs and using energy responsibly at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a closer look.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Introducing our eighth-generation TPU systems, purpose-built for agentic AI&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re pleased to announce the &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/eighth-generation-tpu-agentic-era" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;eighth generation of our Tensor Processing Units (TPUs)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which for the first time includes two distinct chips and specialized systems, engineered specifically for the agentic era. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU 8t &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;is our training powerhouse, specifically designed for high-throughput AI workloads. It redefines the scale of AI development, delivering nearly 3x higher compute performance than previous generations to shrink training timelines for massive models. It packs 9,600 chips in a single superpod to provide 121 exaflops of compute and two petabytes of shared memory connected through high-speed inter-chip interconnects (ICI). This massive pool of compute, unified memory, and doubled ICI bandwidth helps ensure that even the most complex models achieve near-linear scaling and maximum system utilization. We can now turn months of training into weeks with the power of 1 million+ TPU chips in a single cluster, orchestrated by Pathways and JAX. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU 8i&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is our breakthrough reasoning system for inference and reinforcement learning (RL), engineered to deliver the ultra-low latency required by agentic workflows and Mixture of Experts (MoE) models. By tripling on-chip SRAM to 384 MB and increasing high-bandwidth memory (HBM) to 288 GB, it breaks the memory wall, hosting massive KV Caches entirely on silicon. Additionally, it doubles ICI bandwidth to 19.2 Tb/s, reduces the ICI network diameter by over 50%, and introduces a dedicated Collectives Acceleration Engine (CAE), which reduces on-chip latency by up to 5x to minimize lag during high-concurrency requests. With this design, TPU 8i delivers 80% better performance per dollar for inference than the prior generation, enabling fast, interactive user experiences, cost-effectively.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;TPU 8t and TPU 8i will be available to Cloud customers soon. To learn more, check out this &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/tpu-8t-and-tpu-8i-technical-deep-dive"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;deep dive on the architecture&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A5X with NVIDIA Vera Rubin platform&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We know that one size doesn't fit all. Different customers have different workloads, different requirements, and different use cases. So, we also partner deeply with NVIDIA to deliver the latest GPU platforms as highly reliable and scalable services in Google Cloud. We will be among the first to deliver instances based on the next-generation Vera Rubin platform when it becomes available later this year. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also co-engineering the open-source Falcon networking protocol with NVIDIA via the Open Compute Project, pushing the frontiers of reliable transport protocols. A5X will implement a variety of innovative concepts from Falcon.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Thinking Machine Labs, for example, uses our NVIDIA-based infrastructure to power Tinker, an open platform for reinforcement learning and fine-tuning of frontier models for specialized use cases, achieving over 2x faster training and serving with Google’s AI Hypercomputer.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Fueling agentic logic and reinforcement learning with Axion, Intel, and AMD&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While GPUs and TPUs are great for training and serving AI models, they need to be complemented with high-performance CPU-based services to handle the complex logic, tool-calls, and feedback loops that surround the core AI model. Our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/whats-new-in-compute-at-next26"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;new Axion-powered N4A CPU instances&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; deliver outstanding price-performance for these agent runtimes. In fact, GKE Agent Sandbox with Google Axion N4A offers up to 30% better price-performance than agent workloads on other hyperscalers. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;This efficiency extends across our entire portfolio, including our 4th generation Compute Engine VM families, powered by the latest x86 instances from Intel and AMD. These are specifically optimized for the broadest range of RL tasks, such as RL reward calculation, agent orchestration, and nested visualization, providing the optimal capabilities for every AI workload. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Virgo Network for data center scale-out fabric&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As part of AI Hypercomputer, the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/introducing-virgo-megascale-data-center-fabric"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Virgo Network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is designed to meet the demanding requirements of modern large-scale AI workloads. Its collapsed fabric architecture with 4x the bandwidth of previous generations eliminates the "scaling tax" to deliver staggering peak computing power. This capacity helps the most ambitious AI workloads scale with near-linear efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With Virgo Network and TPU 8t, we can connect 134,000 TPUs into a single fabric in a single data center, and connect more than one million TPUs across multiple data center sites into a training cluster — essentially transforming globally distributed infrastructure into one seamless supercomputer. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also making Virgo Network available for A5X (powered by NVIDIA Vera Rubin NVL72), supporting up to 80,000 GPUs in a single data center, and up to 960,000 GPUs across multiple sites. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Storage: Minimizing data bottlenecks&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A massive compute cluster is only as effective as the storage system feeding it data. To ensure storage is not a bottleneck while making compute faster, we are delivering four key storage advancements that let you: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerate training and inference: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud Managed Lustre now delivers 10 TB/s of bandwidth — a 10x improvement over last year and up to 20x faster than other hyperscalers. We’ve also increased its capacity to 80 petabytes. These advancements are powered by our new C4NX instances and Hyperdisk Exapools. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Minimize latency:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Managed Lustre can leverage new TPUDirect and RDMA to allow data to bypass the host, moving directly to the accelerators. By removing this processing overhead, your AI agents can respond with the near-instant speed users need. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maintain peak utilization for training:&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Rapid Buckets on Google Cloud Storage transforms object storage with sub-millisecond latency and 20 million operations per second. This helps ensure large-scale training checkpoints and recoveries happen near-instantly, allowing your accelerators to maintain 95% utilization or higher, accelerating training cycles, while also providing cost-effective utilization of valuable TPUs and GPUs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build custom solutions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; For ISVs and organizations that want to build storage solutions, we are launching the&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Z4M instance, specifically engineered for customers who want to integrate trusted parallel file systems like Vast Data or Sycomp. Each Z4M instance scales to a massive 168 TiB of local SSD capacity and can be deployed in RDMA clusters of thousands of machines. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
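&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As referenced above, here is a minimal checkpoint-upload sketch using the public google-cloud-storage Python client; the bucket and object names are placeholders, and Rapid Buckets needs no special client code since it is a bucket-level capability.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal checkpoint upload with the public google-cloud-storage client.
# Bucket and object names are placeholders.
from google.cloud import storage

def save_checkpoint(local_path, bucket_name, object_name):
    client = storage.Client()                 # uses application default credentials
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)     # single call per checkpoint

save_checkpoint("step_120000.ckpt", "my-training-bucket", "ckpts/step_120000.ckpt")
&lt;/pre&gt;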
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These new storage options provide a comprehensive storage portfolio, giving you the raw power of the AI Hypercomputer stack with optimal storage services for each use-case.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;GKE: Orchestration for agent-native workloads&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the agentic era, intelligence is only as effective as the speed at which it can be scaled. So, we’ve transformed GKE to serve as the premier orchestration engine for agent-native workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Reducing latency across the stack&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;To support responsive agentic responses, we optimize every millisecond of the start-up and scale-out process. By streamlining how infrastructure responds to surges in demand, GKE ensures that your agents are ready the moment a user engages with the system. New in GKE are:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerated node and pod startup:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; GKE nodes now start up to 4x faster, while pod startup times have been slashed by up to 80%.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Rapid model loading:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Leveraging the run:AI Model Streamer and Rapid Cache in Google Cloud Storage, models now load 5x faster, removing a traditional storage bottleneck.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Intelligent routing with AI-powered Inference Gateway&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Building on last year's introduction of GKE Inference Gateway, we are using "AI for AI" to solve the complexities of serving at scale. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Inference Gateway’s new predictive latency boost replaces heuristic guesswork with machine learning-driven, real-time capacity-aware routing. This intelligent orchestration cuts time-to-first-token (TTFT) latency by more than 70% without manual tuning. For businesses, this translates directly into more natural voice conversations and smooth, real-time interactions across a range of use cases. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Inference Gateway can be deployed alongside llm-d, a Kubernetes-native high-performance distributed LLM inference framework, which was &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/llm-d-officially-a-cncf-sandbox-project"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;recently accepted&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as a Cloud Native Computing Foundation (CNCF) Sandbox project. Google Cloud is proud to be a founding contributor to llm-d alongside Red Hat, IBM Research, CoreWeave, and NVIDIA, uniting around a clear, industry-defining vision: any model, any accelerator, any cloud. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_loveable_quote.max-1000x1000.png"
        
          alt="3 loveable quote"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Open software ecosystem for the full AI lifecycle &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Hardware reaches its full potential through co-designed software. AI Hypercomputer enables engineers to move faster by providing native, optimized support for the industry’s most popular frameworks, including JAX, PyTorch, and vLLM. This open software layer reduces friction between development and deployment, translating to faster time-to-market and better resource efficiency.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are now in preview with select customers with native PyTorch support for TPU, which we call TorchTPU. With TorchTPU, you can run models on TPUs as they are, with full support for native PyTorch features like Eager Mode. When you combine this with our robust support of vLLM on TPU, our message is clear: we always focus on building for openness and customer choice.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Your foundation for agentic growth&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To innovate quickly and cost-effectively in the agentic era, you need a unified system that doesn’t compromise on performance or choice. That is exactly what AI Hypercomputer delivers. By co-designing every layer — from the silicon to the software — we remove the integration burden so your teams can focus on driving your business forward. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI Hypercomputer also serves as the powerful foundation for Google’s entire ecosystem of high-level services. This integrated stack powers everything from Gemini Enterprise to the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise Agent Platform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, ensuring that all these infrastructure innovations translate directly into business value. By leveraging our fully managed services, such as our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/docs/training/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;serverless training service&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and our new Managed RL API, you can apply AI Hypercomputer’s massive performance gains to customize Gemini with your own business logic, delivering sophisticated, agent-based solutions. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re looking forward to seeing what you build next with this updated and expanded AI platform.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/ai-infrastructure-at-next26/</guid><category>AI &amp; Machine Learning</category><category>Google Cloud Next</category><category>TPUs</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_18_Light.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>What’s next in Google AI infrastructure: Scaling for the agentic era</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_18_Light.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/ai-infrastructure-at-next26/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>SVP and Chief Technologist, AI and Infrastructure</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>New innovations in Google Distributed Cloud</title><link>https://cloud.google.com/blog/topics/hybrid-cloud/google-distributed-cloud-at-next26/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Today at &lt;/span&gt;&lt;a href="https://www.googlecloudevents.com/next-vegas" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Next&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we’re announcing new capabilities in &lt;/span&gt;&lt;a href="https://cloud.google.com/distributed-cloud"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (GDC) that bring Gemini and our advanced AI stack to wherever your data is, so you don’t need to compromise between AI innovation and sovereignty. This will serve as a catalyst for a sovereign neocloud architecture. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GDC brings Google Cloud to wherever you need it — in your own data center or at the edge. It is offered as two distinct models to meet your specific security and hardware requirements: &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;GDC air-gapped&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a fully disconnected deployment that runs on purpose-built, Google-supplied hardware designed for maximum security and compliance; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;GDC connected&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, where you benefit from an integrated, Google-managed software lifecycle on your own hardware.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Traditionally, enterprises and governments with strict data regulatory and sovereignty requirements, were locked out of the latest AI capabilities. Their only choice was to build their own systems, which is slow, complicated, and expensive. GDC ends that struggle. You get world-class AI innovation in your own premises without the toil.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;GDC delivers a complete, on-premises AI solution: managed infrastructure optimized for AI workloads, a choice of Gemini or open models for flexibility and efficient Inference services that are cost effective. This foundation allows you to build and run secure AI agents and applications while maintaining total control over your data.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Google_Distributed_Cloud.max-1000x1000.png"
        
          alt="1 Google Distributed Cloud"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take you through how the new innovations in GDC come together to support your sovereign AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Managed AI infrastructure&lt;/strong&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;To support sovereign AI needs on-premises, organizations require managed infrastructure that can handle the massive performance demands of compute, storage, and networking. Because on-premises AI workloads are dynamic and unpredictable, we are introducing new infrastructure innovations that deliver &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;peak performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; across a variety of requirements:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA Blackwell GPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Accelerate AI performance with &lt;/span&gt;&lt;a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA Blackwell&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (NVIDIA HGX B200) and &lt;/span&gt;&lt;a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA Blackwell Ultra platforms &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;(NVIDIA HGX B300) GPUs, leveraging 5th-gen NVIDIA NVLink to deliver data-center scale bandwidth directly to your environment&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud machine families&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; GDC &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;already supports the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;N2 and N3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; machine families&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; for general-purpose workloads, and now it supports the new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; machine family delivering a 2.25x increase in peak compute to handle demanding inference tasks. We’re also bringing the memory-optimized &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/memory-optimized-machines"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;M2 and M3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; machine families&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to GDC for workloads like ERP and data analytics that require higher memory-to-vCPU ratios.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced storage scale and performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: GDC now supports 6PB object storage per zone (as compared to 1PB earlier) — 6x the previous storage capacity. In addition, it now offers 30 IOPS/GB (as compared to 3 IOPS/GB earlier) per zone, a 10x performance boost, minimizing storage bottlenecks. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Foundational models in your data center&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With GDC, you can bring the power of Google’s flagship Gemini models directly into your environment, bridging the gap between world-class generative AI and strict data sovereignty by enabling native deployment within your own perimeter, now powered by the latest generation NVIDIA Blackwell GPUs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are excited to announce that the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;latest Gemini Flash models&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; are now available (in preview) on the NVIDIA Blackwell and Blackwell Ultra Platforms for GDC connected customers, joining our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/hybrid-cloud/gemini-is-now-available-anywhere?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;existing support&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for GDC air-gapped customers.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Google_Distributed_Cloud.max-1000x1000.png"
        
          alt="2 Google Distributed Cloud"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify; padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Deploying Gemini on Google Distributed Cloud has significantly improved our global manufacturing. Running frontier AI locally allows us to analyze IoT data for real-time predictive maintenance and quality control, avoiding cloud latency. We maintain strict data sovereignty over our IP while retaining cloud-like agility." - &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Junhee Lee, CEO&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, Samsung SDS&lt;/span&gt;&lt;/p&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;AI Inferencing services: Introducing Google Distributed Cloud AI gateway&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;To optimize performance and abstract infrastructure complexity, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;we are introducing the AI gateway for sovereign environments&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This intelligent middleware acts as the control plane for your models. This provides:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic request routing: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Automatically routes inference requests to the right AI model based on cost, latency, and accuracy, rather than on hard-coded logic. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Intelligent load balancing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Routes requests for optimized inference efficiency, picking GPUs based on their utilization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Quota management: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Prioritizes requests to ensure high-priority applications receive required throughput, and meet quota management goals.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Observability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Built-in tracing and logging for every inference call, helping ensure auditability for compliance-heavy environments.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
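&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the quota-management idea concrete, here is a toy priority-aware admission queue. This is our sketch, not the AI gateway’s API: requests drain in priority order against a shared per-tick token budget.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Toy priority-aware admission queue, not the AI gateway's API.
import heapq

class AdmissionQueue:
    def __init__(self, tokens_per_tick):
        self.budget = tokens_per_tick   # shared throughput quota per tick
        self.heap = []                  # (priority, seq, request, tokens)
        self.seq = 0

    def submit(self, priority, request, tokens):
        heapq.heappush(self.heap, (priority, self.seq, request, tokens))
        self.seq += 1

    def drain(self):
        admitted = []
        while self.heap:
            priority, _, request, tokens = self.heap[0]
            if tokens &gt; self.budget:
                break                   # head request must wait for the next tick
            heapq.heappop(self.heap)
            self.budget -= tokens
            admitted.append(request)
        return admitted

q = AdmissionQueue(tokens_per_tick=1000)
q.submit(priority=0, request="fraud-check", tokens=600)      # 0 = highest
q.submit(priority=5, request="batch-summarize", tokens=600)
print(q.drain())   # only the fraud check fits this tick
&lt;/pre&gt;&lt;/div&gt;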
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Google_Distributed_Cloud.max-1000x1000.png"
        
          alt="3 Google Distributed Cloud"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI applications and agents&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To truly operationalize AI at the edge, organizations need more than just foundational models. They need autonomous, secure agents built on an agentic AI architecture that can take action. We are thrilled to announce a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;new sovereign agentic AI architecture for Google Distributed Cloud&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Built with 3rd party providers on Kubernetes, this architecture helps to ensure that your agentic workflows execute entirely within your secure Customer Organization boundary. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Google_Distributed_Cloud.max-1000x1000.png"
        
          alt="4 Google Distributed Cloud"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Using this agentic architecture, you can build and deploy powerful AI agents for agentic tasks like development, coding or data analysis all within your secure perimeter.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;AI anywhere with Google Distributed Cloud&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We believe GDC is the best platform to serve Google and other models on-prem, connected and air-gapped, enabling all customers to leverage AI and agentic solutions, without compromising on sovereignty. To learn more about these product offerings, visit our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/distributed-cloud/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;website&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The innovations we discussed here deliver the flexibility and security required for the sovereign AI era. To see them in action, join our &lt;/span&gt;&lt;a href="https://cloud.withgoogle.com/next/25/session-library?filters=session-type-breakouts,interest-networking#all" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GDC breakout sessions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or the &lt;/span&gt;&lt;a href="https://www.googlecloudevents.com/next-vegas" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Showcase at Next ’26&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hybrid-cloud/google-distributed-cloud-at-next26/</guid><category>Compute</category><category>Google Cloud Next</category><category>Hybrid &amp; Multicloud</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_9_Light.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>New innovations in Google Distributed Cloud</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_9_Light.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hybrid-cloud/google-distributed-cloud-at-next26/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Muninder Sambi</name><title>VP, Google Distributed Cloud</title><department></department><company></company></author></item><item><title>Inside the eighth-generation TPU: An architecture deep dive</title><link>https://cloud.google.com/blog/products/compute/tpu-8t-and-tpu-8i-technical-deep-dive/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, our TPU design philosophy has always been centered on three pillars: scalability, reliability, and efficiency. As AI models evolve from dense large language models (LLMs) to massive Mixture-of-Experts (MoEs) and reasoning-heavy architectures, the hardware must do more than just add floating point operations per second (FLOPS); it must evolve to meet the specific operational intensities of the latest workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The rise of agentic AI requires infrastructure that can handle long context windows and complex sequential logic. At the same time, world models have emerged as a necessary evolution from current next-sequence-of-data architectures, which means newer agents are simulating future scenarios, anticipating consequences, and learning through "imagination" rather than risky trial-and-error. The eighth-generation TPUs (TPU 8t and TPU 8i) are our answer to these challenges, ensuring that every workload, from the first token of training to the final step of a multi-turn reasoning chain, is running on the most efficient path possible. They are built to efficiently train and serve world models like Google DeepMind’s Genie 3, enabling millions of agents to practice and refine their reasoning in diverse simulated environments.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU 8: Specialized by design&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Recognizing that the infrastructure requirements for pre-training, post-training, and real-time serving have diverged, our eighth-generation TPUs introduce two distinct systems: TPU 8t and TPU 8i. These new systems are key components of Google Cloud's AI Hypercomputer, an integrated supercomputing architecture that combines hardware, software, and networking to power the full AI lifecycle. While both systems share the core DNA of Google’s AI stack and support the full AI lifecycle, each is built to address distinct bottlenecks and optimize efficiency for critical stages of development. Additionally, by integrating Arm-based Axion CPU headers across our eighth-generation TPU system, we’ve removed the host bottleneck caused by data preparation latency. Axion provides the compute headroom to handle complex data preprocessing and orchestration, so that TPUs stay fed and don’t stall.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;TPU 8t: The pre-training powerhouse&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Optimized for massive-scale pre-training and embedding-heavy workloads, TPU 8t utilizes our proven 3D torus network topology at an even larger scale of 9,600 chips in a single superpod. TPU 8t is designed for maximum throughput across hundreds of superpods, ensuring that training runs stay on schedule.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here are some key advancements of TPU 8t over prior-generation TPUs:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The SparseCore advantage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Central to TPU 8t is the SparseCore, a specialized accelerator designed to handle the irregular memory access patterns of embedding lookups. While the Matrix Multiply Unit (MXU) handles matrix math, the SparseCore offloads data-dependent all-gather operations, amongst other collectives, preventing the zero-op bottlenecks that often plague general-purpose chips.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;VPU/MXU overlap and balanced scaling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: TPU 8t is designed to maximize provisioned FLOPs utilization. By implementing more balanced Vector Processing Unit (VPU) scaling, the architecture minimizes exposed vector operation time. This allows for better overlapping of quantization, softmax, and layernorms with the matrix multiplications in the MXU, helping the chip stay busy rather than waiting on sequential vector tasks.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Native FP4&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;TPU 8t introduces native 4-bit floating point (FP4) to overcome memory bandwidth bottlenecks, doubling MXU throughput while maintaining accuracy for large models even at lower-precision quantization. By reducing the bits per parameter, the platform minimizes energy-intensive data movement and allows larger model layers to fit within local hardware buffers for peak compute utilization.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
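&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As referenced in the FP4 bullet above, the bandwidth win is just arithmetic on bits per parameter; the 700B-parameter model size in this sketch is our assumption for illustration.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Bits per parameter, turned into bytes that must move and be stored.
# The 700B-parameter model size is an assumption for illustration.
params = 700e9
for name, bits in [("bf16", 16), ("fp8", 8), ("fp4", 4)]:
    weight_tb = params * bits / 8 / 1e12
    print(f"{name}: {weight_tb:.2f} TB of weights to move and store")
&lt;/pre&gt;&lt;/div&gt;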
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/TPU_diagrams.max-1000x1000.jpg"
        
          alt="TPU diagrams"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="c3frb"&gt;Figure 1: TPU 8t ASIC block diagram&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Virgo Network topology and up to 4x data center network increase&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: To support the massive data requirements of TPU 8t, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/introducing-virgo-megascale-data-center-fabric"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;we introduced Virgo Network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This new networking architecture enables up to 4x increased data center network (DCN) bandwidth on TPU 8t training over DCN. Virgo Network is a scale-out fabric designed for the extreme requirements of modern AI workloads. Built on high-radix switches that reduce network layers by allowing more ports per switch, it employs a flat, two-layer non-blocking topology. Compared with traditional datacenter networks, this significantly reduces latency by minimizing network tiers. It features a multi-planar design with independent control domains to connect TPU 8t chips. The TPU 8t racks also connect with the Jupiter north-south fabric to access compute and storage services. Together, this streamlined architecture delivers the massive bisection bandwidth and deterministic low latency necessary for enabling the world's largest training clusters with high availability. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With 2x scale-up bandwidth on the inter-chip interconnect (ICI) and up to 4x raw scale-out DCN bandwidth compared to the previous generation, TPU 8t drastically reduces data bottlenecks. Then, to further accelerate the development of frontier models, we scale distributed training beyond a single cluster. With &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/index.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JAX&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pathways&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;we can now &lt;/strong&gt;&lt;a href="https://jax-ml.github.io/scaling-book/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;scale&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; to more than 1 million TPU chips in a single training cluster&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Virgo Network can link over 134,000 TPU 8t chips with up to 47 petabits/sec of non-blocking bi-sectional bandwidth in a single fabric. This fabric delivers over 1.6 million ExaFlops with near-linear scaling performance.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_TPU_8t_rack_level_connectivity_to_Virgo_.max-1000x1000.png"
        
          alt="2 TPU 8t rack level connectivity to Virgo fabric"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="c3frb"&gt;Figure 2: TPU 8t rack level connectivity to Virgo fabric&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; font-style: italic; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Faster storage access: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We are introducing &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;TPUDirect RDMA &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; TPU Direct Storage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;TPU 8t. TPU Direct RDMA enables direct data transfers between the TPU's memory (HBM) and the Network Interface Cards (NICs), bypassing the host CPU and DRAM. This reduces latency and host system bottlenecks, increasing the effective bandwidth for TPU-to-TPU communication. Similarly, TPUDirect Storage bypasses CPU host bottlenecks by enabling direct memory access between the TPU and high-speed managed storage like 10T Lustre, effectively doubling the bandwidth for massive data transfers. This architecture allows the silicon to ingest training data at line rate, ensuring that the MXUs stay fully saturated even when processing large multimodal datasets. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By combining &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/next26-storage-announcements"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Managed Lustre 10T&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and TPUDirect Storage to route hundred-petabyte datasets directly to the silicon, TPU 8t prevents training delays caused by data ingestion bottlenecks. This delivers 10x faster storage access compared to training on seventh-generation Ironwood TPUs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_rq0yjyX.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="c3frb"&gt;Figure 3: The top diagram shows the data transfer path without TPUDirect Storage. The bottom diagram shows TPU 8t data transfer with TPUDirect Storage between 2 TPU 8t chips and TPUDirect Storage with Managed 10T Lustre storage.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;TPU 8i: The sampling and serving specialist&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Optimized for post-training and high-concurrency reasoning, we designed TPU 8i with our highest on-chip SRAM, a new Collectives Acceleration Engine (CAE), and a new serving-optimized network topology called Boardfly. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Large on-chip SRAM:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With 3x more on-chip SRAM over the previous generation, TPU 8i can host a larger KV Cache entirely on silicon, significantly reducing the idle time of the cores during long-context decoding. &lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/Figure_4-_TPU_8i_ASIC_block_diagram.jpg"
        
          alt="4 TPU 8i ASIC block diagram"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="c3frb"&gt;Figure 4: TPU 8i ASIC block diagram&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The Collectives Acceleration Engine (CAE)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: To solve the sampling bottleneck, TPU 8i uses the CAE, which aggregates results across cores with near-zero latency, specifically accelerating the reduction and synchronization steps required during auto-regressive decoding and "chain-of-thought" processing. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;For each TPU 8i chip, there are two Tensor Cores (TC) on-core dies and one CAE on the chiplet die, replacing four SparseCores (SCs) on core dies in previous-generation Ironwood TPU. By integrating a specialized CAE, TPU 8i further reduces the on-chip latency of collectives by 5x. Lower latency per collective operation means less time spent waiting, directly contributing to higher throughput required to run millions of agents concurrently.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Boardfly ICI topology&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: While the 3D torus allows connecting thousands of chips to be used in cohesion, a large mesh does have more hops between chips and higher all-to-all latencies. For 8i, we changed how the chips connect together in fully connected boards that are then aggregated into groups. Utilizing a high-radix design, we connect up-to 1,152 of these chips together, reducing the network diameter and the number of hops a data packet must take to cross the system. By slashing the hops required for all-to-all communication (the heart of MoE and reasoning models), Boardfly achieves up to a 50% improvement in latency for communication-intensive workloads.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_I1mUzjb.max-1000x1000.png"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="c3frb"&gt;Figure 5: TPU 8i hierarchical Boardfly topology building up from a building block of four fully connected chips into a fully connected group of eight boards, with 36 of such groups fully connected into a TPU 8i pod&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Boardfly consists of the following elements, and its topology is hierarchical by nature:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Building Block (BB):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Each tray forms a four-chip ring using internal ICI links, providing 16 external connections for broader networking.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Group (G):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Eight boards are fully connected via copper cabling to create a localized group, utilizing 11 of the available external links for intra-group communication.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Pod structure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The final architecture scales to 36 groups (up to 1,024 active chips) linked through Optical Circuit Switches (OCS), ensuring a maximum latency of seven hops for any chip-to-chip communication.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deep dive: The Boardfly vs. torus math&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Why move away from the torus for TPU&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;8i? It comes down to&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;network diameter.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a 3D torus, nodes are arranged in a grid where each dimension wraps around like a ring. To reach the furthest possible chip in a 8 x 8 x 16 (1024-chip) configuration, a packet must traverse half the distance of each ring:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;3D torus = 8/2(X) + 8/2(Y) + 16/2(Z) = 16 hops&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While the torus is highly efficient for the neighbor-to-neighbor communication typical of dense training, it creates a latency tax for all-to-all communication patterns. In the era of reasoning models and MoE, where any chip may need to talk to any other chip to route a token, this hop count matters.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Boardfly’s high-radix topology is inspired by &lt;/span&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34926.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dragonfly&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; topology principles. By increasing the number of direct optical long-haul links between groups of boards, we flatten the network. For that same 1024-chip pod, Boardfly reduces the network diameter from 16 hops down to just seven.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This 56% reduction in network diameter translates directly to lower tail latency, so that the TPU 8i CAE isn't left waiting for data to arrive from across the pod.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/6_Qu7H2lI.max-1000x1000.png"
        
          alt="6"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="c3frb"&gt;Figure 6: A visual representation of the maximum seven-hop ICI network diameter via optical circuit switch on TPU 8i pod&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;TPU 8t and TPU 8i at a glance&lt;/span&gt;&lt;/h3&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Feature&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU 8t&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU 8i&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Primary Workload&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Large-scale pre-training&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Sampling, serving, and reasoning&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Network Topology&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;3D torus&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Boardfly &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Specialized Chip Features&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SparseCore (Embeddings) &amp;amp; LLM Decoder Engine&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;CAE (Collectives Acceleration Engine)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;HBM Capacity&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;216 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;288 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;On-Chip SRAM (Vmem)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;128 MB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;384 MB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Peak FP4 PFLOPs&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;12.6&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;10.1&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;HBM Bandwidth&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;6,528 GB/s&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;8,601 GB/s (~1.3x of TPU 8t)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Host CPU&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Arm Axion&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Arm Axion&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Software enablement: A performance-first AI stack&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Hardware is only as powerful as the software that drives it. The eighth generation of TPUs are built on the same performance-first stack we pioneered with the seventh-generation Ironwood TPUs, designed to make custom kernel development accessible without sacrificing the abstraction of high-level frameworks. This stack includes:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Pallas and Mosaic&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We provide first-class support for &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/pallas/tpu/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pallas&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, our custom kernel language that lets you write hardware-aware kernels in Python. This enables you to squeeze every drop of performance out of the TPU 8i CAE and the TPU 8t SparseCore.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Native PyTorch experience: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We're thrilled to share that &lt;/span&gt;&lt;a href="https://developers.googleblog.com/torchtpu-running-pytorch-natively-on-tpus-at-google-scale/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;native PyTorch support for TPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is now in preview. If you're currently building and serving models on PyTorch, we've made it easier than ever to start using TPUs. You can bring your existing models to our TPUs just as they are, complete with full support for the native features you rely on, such as Eager Mode.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Portability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The same JAX, PyTorch, or Keras code that runs on Ironwood scales to this generation. Accelerated Linear Algebra (XLA) handles the complex translation of the Broadly topology and CAE synchronization behind the scenes, allowing you to focus on your model, not the interconnect.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
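&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a taste of the Pallas bullet above, here is a minimal elementwise kernel. This is a sketch only: real kernels targeting the CAE or SparseCore involve blocking, memory-space, and tuning choices well beyond this example.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal Pallas sketch: an elementwise add kernel written in Python.
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs point at on-chip buffers; [...] reads/writes the whole block.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)  # compiles for TPU; pass interpret=True to test off-TPU

x = jnp.arange(8, dtype=jnp.float32)
print(add(x, x))  # [ 0.  2.  4.  6.  8. 10. 12. 14.]
&lt;/pre&gt;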
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Generation over generation: The performance leap&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our commitment to co-designing hardware and software continues to pay dividends. When compared to seventh-generation Ironwood TPU, the eighthgeneration TPUs deliver massive gains:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Training price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: TPU 8t delivers up to 2.7x performance-per-dollar improvement over Ironwood TPU for large-scale training.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Inference price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: TPU 8i delivers up to 80% performance-per-dollar improvement over Ironwood TPU, particularly at low-latency targets for large MoE models.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Energy efficiency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Both chips deliver up to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;2x better performance-per-watt&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, critical for scaling the next generation of AI sustainably.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Looking ahead&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To empower Google Cloud customers pioneering the next wave of innovation, we designed TPU 8t and TPU 8i as two distinct, specialized systems tailored to the multifaceted future demands of the AI lifecycle. TPU 8t and 8i are both purpose-built for the most demanding serving and training workloads, fully integrating with the AI Hypercomputer software stack: JAX, PyTorch, vLLM, XLA, and Pathways. This specialization and ground-up redesign, all in deep collaboration with Google Deepmind, delivers exceptional price-performance and power efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The modularity of our eighth-generation architecture provides a clear and unique roadmap for the future. Just as every major shift in the computing landscape has required infrastructure breakthroughs, so does the agentic era. Reasoning agents that plan, execute, and learn within continuous feedback loops cannot operate at peak efficiency on hardware that was originally optimized for traditional training or transactional inference; their operational intensity are fundamentally distinct. Our eighth-generation TPU infrastructure has evolved to meet these specific requirements head-on.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about the eighth-generation TPU family:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/resources/tpu-interest?e=48754805"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Submit an interest form for eighth-generation TPUs&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/c/google-cloud/cloud-ai-infrastructure/ai-infrastructure-tpus/247" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Get involved in the community forums&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://youtu.be/wOVtSeP4aAM" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Check out the eighth-generation TPU announcement video&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/tpu"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Visit our TPU website&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt; &lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/tpu-8t-and-tpu-8i-technical-deep-dive/</guid><category>AI &amp; Machine Learning</category><category>Google Cloud Next</category><category>TPUs</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/eighth-generation_TPU.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Inside the eighth-generation TPU: An architecture deep dive</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/eighth-generation_TPU.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/tpu-8t-and-tpu-8i-technical-deep-dive/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Diwakar Gupta</name><title>Distinguished Engineer, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sabastian Mugazambi</name><title>Group Product Manager, Google Cloud</title><department></department><company></company></author></item><item><title>A developer’s guide to architecting reliable GPU infrastructure at scale</title><link>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Editor’s note&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;This blog post outlines Google Cloud’s GPU AI/ML infrastructure reliability strategy, and will be updated with links to new community articles as they appear.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As we enter the era of multi-trillion parameter models, computational power has transitioned from a utility to a mission-critical strategic asset. To meet relentless training demand, organizations are no longer just building clusters — they are engineering massive, integrated compute ecosystems comprising hundreds of thousands of high-performance accelerators that are interconnected with an ultra-high-bandwidth networking backplane. At this unprecedented scale, raw performance thrives when it is built upon a foundation of systemic resilience.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In "always-on" mission-critical environments, the statistical probability of hardware variance becomes a primary constraint for reliability. When thousands of GPUs are operating at peak utilization for months at a time, a 0.01% performance fluctuation can trigger a systemic failure. The cost of training interruptions now measured in millions of dollars and weeks of lost progress, the industry's focus has shifted. The true frontier of training isn't just about the size of the cluster — it’s about the resilient system architecture that is able to power the next generation of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core challenge for the industry goes beyond simple hardware fixes; it requires the creation of holistic software and infrastructure frameworks designed to withstand the inevitable disruptions of massive-scale computing. In an environment where AI/ML infrastructure represents a major capital expenditure on a company's balance sheet, partnering with a cloud provider that places a premium on infrastructure reliability is paramount.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Operational realities of AI at scale&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The construction of a supercomputer utilizing hundreds of thousands of advanced GPUs involves significant operational complexity. Maintaining optimal utilization over several months to train a single large language Model (LLM) subjects the hardware to high levels of sustained performance that exceed the design parameters of conventional data center equipment. The advent of rackscale GPU architectures, such as the NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72, has shifted the landscape. Considerations now extend beyond individual machines to encompass entire domains, impacting multiple interconnected trays with the potential to require coordinated management for AI/ML workloads to avoid disruptions.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The business implications of infrastructure instability&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For organizations at the forefront of AI innovation, infrastructure reliability poses a significant commercial risk with substantial economic consequences.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;High cost of failure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A single failure in a massive training job requires restarting from the last checkpoint, wiping out days or even weeks of progress. When infrastructure spend is a big capex, every failure counts. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Delayed time-to-market:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In the fast-moving AI space, being first matters. Every day spent debugging hardware failures is a day delaying releasing new models while competitors are getting ahead. Reliability issues can directly slow down model iteration cycles, delaying product launches and feature updates.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Operational complexities:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Manually managing a large GPU cluster is a resource-intensive task. Companies come to the cloud to reduce the cost of managing the infrastructure. Without systemic reliability investments, operations teams can get overwhelmed by a constant stream of alerts, forced to play "whack-a-mole" to identify, isolate, and replace faulty nodes thus affecting their time spent on planning for the future capacity and model demands. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Expensive workarounds to mitigate failure impact:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To achieve a certain level of performance and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity?e=48754805&amp;amp;_gl=1*9b6bxc*_ga*MjA0OTQyOTQyNi4xNzcyNzc2OTEw*_ga_WH2QY8WWF5*czE3NzI3NzY5MDkkbzEkZzEkdDE3NzI3NzczNzUkajU4JGwwJGgw"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Goodput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, companies can end up needing to buy 10-20% more hardware than they actually need as a buffer.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Quantitative assessment: Key reliability metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond traditional uptime measurements, the primary metrics Google Cloud uses to measure AI infrastructure health and stability are MTBI and Goodput. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mean Time Between Interruption (MTBI):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The average time a system runs before encountering an interruption. Includes instance terminations as well as every customer workload interruption that our systems can observe (example GPU XIDs).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Goodput:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The amount of useful computational work completed per unit time.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
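&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the two definitions concrete, here is a toy calculation. All numbers are hypothetical placeholders, not fleet data.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Toy MTBI and Goodput calculation with made-up numbers.

run_hours = 720.0      # hypothetical month-long training run
interruptions = 6      # hypothetical observed workload interruptions
print(run_hours / interruptions)   # MTBI: 120.0 hours between interruptions

# Goodput: fraction of wall-clock time doing useful training work,
# after restart, re-queue, and checkpoint-replay overhead.
lost_hours = interruptions * 4.0   # hypothetical 4 hours lost per event
print((run_hours - lost_hours) / run_hours)   # about 0.967 Goodput
&lt;/pre&gt;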
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud’s methodology: Engineering systemic resilience&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The objective has shifted from expecting total hardware perfection to engineering systems that demonstrate inherent resilience. We understand that trust in our infrastructure begins with reliability. Our approach is based on four principles:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Proactive prevention:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’ve integrated hardware validation, real-time telemetry, and automated remediation throughout the infrastructure lifecycle. This &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;systemic approach to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;shift from reactive troubleshooting to proactive management optimizes the reliability of mission-critical GPUs systems at scale.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Continuous monitoring and intelligent detection:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;We have transformed raw data into actionable insights by synthesizing multi-layered telemetry through automated analysis, to proactively identify and resolve anomalies. This data-driven approach shifts our infrastructure from reactive maintenance to an intelligent, self-healing system that helps ensure continuous workload stability.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Transparency and control:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;We empower users with full visibility and control over GPU infrastructure health. We provide a comprehensive suite of observability metrics and direct tools, allowing customers to correlate hardware status with their workload Goodput and report faults. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Minimizing disruptions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Our control plane integrates smart scheduling with predictive health signals to enable improved workload migration via maintenance notifications. If unexpected issues arise, customers can enable automated remediations and fast recovery mechanisms to initiate rapid restoration of service. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have covered an in-depth journey into these principles in our technical deep-dive post linked below. We are launching a comprehensive &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;technical deep dive series&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to explore Google’s approach towards AI/ML infrastructure reliability for Google Cloud GPUs further. Check back here as we add links to learn about:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/proactive-prevention-inside-google-clouds-multi-layered-gpu-qualification-process/337742" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Proactive prevention: Inside Google Cloud's multi-layered GPU qualification process&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; font-style: italic; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Transparency and Control : Providing Operational Transparency and Management tools to Mitigate GPU Workload Impact (Coming Soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Continuous monitoring and intelligent detection: Using ML to predict and prevent GPU downtime (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Minimizing disruptions: Smart scheduling and fast recovery systems for mission-critical GPU clusters (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Thu, 09 Apr 2026 22:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>A developer’s guide to architecting reliable GPU infrastructure at scale</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abhijith Prabhudev</name><title>Product Manager, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abhay Ketkar</name><title>Senior Staff Software Engineer, Google</title><department></department><company></company></author></item><item><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><link>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we are committed to being &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;transparent about the environmental impact of our AI infrastructure&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, publishing metrics on the lifetime emissions of our chips — from manufacturing to powering these chips in the data center. Today, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;we are updating these metrics for our seventh-generation TPU, Ironwood, which demonstrates an approximately 3.7x improvement in Compute Carbon Intensity (CCI) compared to TPU v5p&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;the previous generation of performance-optimized TPUs&lt;/span&gt;.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In other words, despite the fact that AI is driving demand for additional compute resources, our ongoing work to optimize AI hardware is helping to improve the energy consumption and emissions of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Measuring AI accelerator efficiency: Compute Carbon Intensity (CCI)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To help manage the environmental impact of AI workloads, we monitor the Compute Carbon Intensity (CCI) of our AI accelerator hardware. CCI is defined in &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;An Introduction to Life-Cycle Emissions of Artificial Intelligence Hardware&lt;/span&gt;&lt;/a&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;sup&gt; &lt;/sup&gt;as the estimated amount of CO2 equivalent emitted for every utilized floating-point operation (CO2e/FLOP). This metric provides a holistic, chip-level view by including both the embodied emissions associated with manufacturing, transportation, and data center construction (Scope 3), as well as the operational emissions associated with running these chips in data centers (Scope 1 and 2).&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood advantage: high performance, low footprint&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google’s TPU CCI continues to improve with each chip generation. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Drawing from empirical data measured in January 2026, Ironwood demonstrates a remarkable 3.7x &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;improvement&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in CCI relative to TPU v5p. This accelerates efficiency gains from the 1.2x CCI improvement of TPU v5p relative to TPU v4, and demonstrates continued carbon efficiency optimization of Google’s performance-optimized TPU architecture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These efficiency gains are driven by outsized compute performance increases between TPU generations relative to growth in machine energy consumption and manufacturing emissions.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; In fact, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;fleetwide measurements demonstrate a 5x improvement in utilized FLOPs across generations, from TPU v5p to Ironwood.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; Because the performance denominator in our CCI equation (CO2e/FLOP) is scaling faster than emissions, the net carbon cost per operation drops significantly with every new chip.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Oan2vLj.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Figure 1: Ironwood’s accelerating CCI improvement measured on Google’s performance-optimized TPU cohort, considering January 2026 workloads.&lt;/span&gt;&lt;/span&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Operating Google’s TPU fleet more efficiently&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Updated TPU CCI metrics also offer a direct comparison to the measurement we published in 2025. Specifically, from October 2024 to January 2026, Google’s versatile TPU cohort ran more efficiently than what we reported previously:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;TPU v5e achieved a 43% reduction in total CCI over 15 months, dropping to 228 gCO2e/EFLOP. This was driven by a 72% increase in average utilization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium, the sixth-generation TPU, saw a 20% reduction in total CCI over the same time period, bringing its emissions intensity down to 125 gCO2e/EFLOP.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_HRjRsFh.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Figure 2: Google’s versatile TPU cohort demonstrates deployment efficiency gains for the same TPU generations between October 2024 and January 2026.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;5&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span&gt;&lt;span style="vertical-align: baseline;"&gt;These results demonstrate that Google continues to improve the carbon-efficiency of our AI infrastructure. While the massive scale of AI demand requires a significant and growing amount of power, our innovations allow us to deliver substantially more compute performance for every unit of energy consumed.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Decoupling energy and emissions from performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To what can we attribute these improvements? Beyond Ironwood’s raw hardware capabilities, these CCI gains are further enabled by deep software and system-level optimizations across our infrastructure:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software efficiency (MoE):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The widespread adoption of sparse architectures, such as Mixture of Experts (MoE), routes computation only to necessary parameters. This drastically reduces the active FLOPs required per inference or training step without sacrificing model capacity or quality.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Lower precision math (FP8):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By heavily leveraging 8-bit floating-point (FP8) formats, we effectively double compute throughput and halve memory bandwidth requirements compared to 16-bit formats. This shows that we can maintain output quality while exponentially decreasing the energy cost per mathematical operation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Workload mix and intelligent scheduling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Advanced fleet orchestration continuously balances the workload mix across our infrastructure. By intelligently scheduling tasks, we ensure high continuous utilization rates, optimize duty cycles, and minimize the carbon penalty of idle power draw.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
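&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The arithmetic behind the first two points is simple enough to sketch. The shapes below are illustrative placeholders, not any specific Google model.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative MoE and FP8 arithmetic (made-up shapes).

# MoE: only the routed experts are active for a given token.
total_experts, active_experts = 64, 4
print(active_experts / total_experts)  # 0.0625 of expert FLOPs active

# FP8 vs. BF16: one byte per value instead of two.
params = 1.0e9
print(params * 2 / 1e9)  # 2.0 GB of weights in bf16
print(params * 1 / 1e9)  # 1.0 GB in fp8: half the memory traffic and,
                         # on hardware with native FP8 MXUs, double the
                         # peak throughput
&lt;/pre&gt;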
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scale sustainably with Google Cloud&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI’s trajectory requires infrastructure that can scale exponentially without an equivalent surge in carbon emissions. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The 3.7x carbon efficiency improvement from TPU v5p to Ironwood demonstrates that we can achieve greater compute density while minimizing the growth of our energy and environmental footprint through deliberate hardware and software codesign.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; To learn more and get started with Ironwood, register your interest with &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/ironwood-tpu-interest?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;1. Following the methodology published in an &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, we quantified the full lifecycle emissions of TPU hardware as a point-in-time snapshot across Google’s generations of TPUs as of January 2026. The functional unit for this study is one AI computer deployed in the data center, which includes one or more accelerator trays (containing TPUs) connected to one host tray (i.e., a computing server). Peripheral components beyond the tray (e.g., rack, shelf, and network equipment) and auxiliary computing and storage resources are excluded from the calculation of embodied and operational emissions. We include the electricity used in data center cooling in operational emissions. To estimate operational emissions from electricity consumption of running workloads, we used a one month sample of observed machine power data from our entire TPU fleet, applying Google’s 2024 average fleetwide carbon intensity. To estimate embodied emissions from manufacturing, transportation, and retirement, we performed a life-cycle assessment of the hardware. Data center construction emissions were estimated based on Google’s disclosed 2024 carbon footprint. These findings do not represent model-level emissions, nor are they a complete quantification of Google’s AI emissions. Based on the TPU location of a specific workload, CCI results of specific workloads may vary.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;2. The authors would like to thank and acknowledge the co-authors of this paper for their important contributions to enable these results: Ian Schneider, Hui Xu, Stephan Benecke, Parthasarathy Ranganathan, and Cooper Elsworth.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;3. This comparison considers the utilized FLOPS (BF16) between deployed TPU v5p and Ironwood chips in Google’s fleet in January 2026. This trend is consistent with the improvement in peak FLOPS (BF16) between v5p (459 FLOPS) and Ironwood (2,307 FLOPS).&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;4.The GHG protocol offers two accounting standards for operational emissions. Results presented here consider market-based emissions, which includes the impact of carbon-free energy purchases. Location-based accounting, which excludes carbon-free energy purchases, would raise operational CCI to 793, 712, and 195 gCO2e/EFLOP, respectively. The ratio of CCI improvements would be at a similar level, and Ironwood’s embodied CCI would drop from 23% to 8% of its total CCI.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;5. 
To ensure a fair comparison across varying TPU utilizations, this analysis replicates the propensity score weighting methodology from the &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/iel8/40/11236092/11097303.pdf" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; and compares January 2026 results to the results published in 2025. This statistical technique adjusts for duty cycle variations to balance the comparison of TPUs during a given time period. This empirical methodology results in small variations in calculated CCI between temporal periods, reflecting fluctuations in real-world energy consumption and hardware utilization across the global infrastructure. &lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 06 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</guid><category>Compute</category><category>Sustainability</category><category>TPUs</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Keguo (Tim) Huang</name><title>Senior Data Scientist, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>David Patterson</name><title>Google Distinguished Engineer, Google</title><department></department><company></company></author></item><item><title>A developer’s guide to training with Ironwood TPUs</title><link>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition toward trillion-parameter AI models has created an exponential demand for computational resources, testing the limits of traditional infrastructure. The seventh-generation Ironwood TPU features Google’s custom-designed AI infrastructure: It is engineered to scale as a holistic system supporting pods of up to 9,216 chips by combining Inter-Chip Interconnect (ICI), Optical Circuit Switch (OCS), Data Center Network (DCN) and massive aggregated High Bandwidth Memory (HBM) capacity. In addition, Ironwood features an integrated &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;co-design&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; between hardware architecture and software, introducing innovations such as compiler-centric XLA and Python-native kernels via &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/pallas/index.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pallas&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. 
Together, these features significantly scale organizations’ capacity to train and serve sophisticated frontier models, optimize the entire AI lifecycle, and enable sustained high performance.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_YpVMWLp.max-1000x1000.jpg"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This technical overview explores the specific methods and tools within the JAX and MaxText ecosystems designed to refine training efficiency and reach peak performance on Ironwood hardware.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Key optimization strategies for Ironwood&lt;/span&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. Leverage native FP8 with MaxText&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood is the first TPU generation with native 8-bit floating point (FP8) support in its Matrix Multiply Units (MXUs). By utilizing FP8 precision for weights, activations, and gradients, users can theoretically double throughput compared to Brain Floating Point 16 (BF16). When FP8 recipes are configured correctly, increased efficiency is achievable without compromising model quality. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To implement these FP8 training recipes, users can start with the &lt;/span&gt;&lt;a href="https://github.com/google/qwix" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Qwix&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; library. This functionality is enabled by specifying the relevant flags within the MaxText configuration.  &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See our blog post, &lt;/span&gt;&lt;a href="https://discuss.google.dev/t/inside-the-optimization-of-fp8-training-on-ironwood/336681" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inside the optimization of FP8 training on Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in the Google Developer forums for more details.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. Accelerate with Tokamax kernels&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/openxla/tokamax/tree/main" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tokamax&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a library of high-performance JAX kernels optimized for TPUs. These kernels are designed to mitigate specific bottlenecks through the following mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Splash Attention&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This mechanism addresses the I/O limitations inherent in standard attention processes. By maintaining computations within on-chip SRAM, it is particularly effective for processing long context lengths where memory bandwidth typically becomes a constraint. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Megablox Grouped Matrix Multiplication (GMM)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This manages the “ragged” tensors (data structures with inconsistent row lengths that typically create hardware idle time) often found in Mixture of Experts (MoE) models. By utilizing GMM, the system avoids inefficient padding and ensures higher utilization of the MXU. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Kernel tuning&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The Tokamax library includes &lt;/span&gt;&lt;a href="https://github.com/openxla/tokamax/blob/main/tokamax/experimental/utils/tuning/tpu/README.md" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Utilities&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for hyperparameter optimization. These tools allow for the adjustment of tile sizes and other configurations to align with the specific memory hierarchy of the Ironwood TPU.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
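&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To see why grouped matrix multiplication matters, consider the padding arithmetic for a ragged expert batch. The group sizes below are illustrative, and this is plain Python, not the Tokamax API.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Padding waste with ragged expert batches (illustrative sizes).
# A dense kernel must pad every group to the largest one; GMM does not.

group_sizes = [37, 512, 101, 350]   # tokens routed to each of 4 experts
padded = len(group_sizes) * max(group_sizes)
useful = sum(group_sizes)
print(useful / padded)              # about 0.49: half the MXU work is padding
&lt;/pre&gt;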
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. Offload collectives to SparseCore&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The fourth-generation SparseCores in Ironwood are processors specifically designed to manage irregular memory access patterns. By using specific &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/maxtext/blob/c0abc4c0c0a98e02413d7b6c669927d013467045/benchmarks/xla_flags_library.py#L70-L116" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;XLA flags&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, users can offload collective communication operations—such as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;All-Gather&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Reduce-Scatter&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—directly to the SparseCore.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This offloading mechanism allows the TensorCores to remain dedicated to primary model computations while communication tasks execute in parallel. This functional overlap is a critical strategy for hiding communication latency and ensuring consistent data throughput to the MXUs.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;4. Fine-tune the memory pipeline on VMEM&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VMEM, a critical part of the TPU memory architecture, is a fast on-chip SRAM that is designed to optimize kernel performance. You can improve the overall speed of execution by tuning the allocation of VMEM between  current operation and future weight prefetch. For example, increasing the VMEM reserved for the current scope allows increasing the tile sizes used by the kernel, which can increase kernel performance by removing potential memory stalls. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Refer to &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/pallas/tpu/pipelining.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;TPU Pipelining&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for more on TPU memory architecture.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;5. Choose optimal sharding strategies&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Lastly, MaxText supports various parallelism techniques which are available on all TPUs. The best choice depends on model size, architecture (Dense vs. MoE), and sequence length. Selecting a proper sharding strategy can improve the performance of the model:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fully Sharded Data Parallelism (FSDP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the preferred strategy for training large models that exceed the memory capacity of a single chip. FSDP shards model weights, gradients, and optimizer states across multiple chips. Increasing the per-device batch size and introducing more compute can hide the latency of the All-Gather operations and improve efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tensor Parallelism (TP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Shards individual tensors. Given Ironwood's high arithmetic intensity, TP is most effective for very large model dimensions. Leveraging TP with a dimension of 2 can take advantage of the fast die-to-die interconnect on Ironwood's dual-chiplet design.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Expert Parallelism (EP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Helpful for MoE models to distribute experts across devices.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Context Parallelism (CP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Necessary for very long sequences, sharding activations along the sequence dimension.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid approaches&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Combining strategies is often required to balance compute, memory, and communication on large-scale runs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
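&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is a minimal JAX sketch of a hybrid FSDP + tensor-parallel layout. The axis names, mesh shape, and tensor size are illustrative assumptions; in MaxText the equivalent knobs live in its config (for example, ici_fsdp_parallelism and ici_tensor_parallelism).&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Build a 2D device mesh: the trailing axis of size 2 maps tensor parallelism
# onto the fast die-to-die interconnect of Ironwood's dual-chiplet design.
devices = np.array(jax.devices()).reshape(-1, 2)
mesh = Mesh(devices, axis_names=("fsdp", "tensor"))

# Shard an (illustrative) weight matrix: rows across FSDP, columns across TP.
weights = jax.numpy.zeros((8192, 8192))
sharded = jax.device_put(weights, NamedSharding(mesh, PartitionSpec("fsdp", "tensor")))
print(sharded.sharding)
&lt;/pre&gt;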
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See the &lt;/span&gt;&lt;a href="https://discuss.google.dev/t/optimizing-frontier-model-training-on-tpu-v7x-ironwood/336983/2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optimizing Frontier Model Training on TPU v7x Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; post in the Developer forums for more detail on techniques 2-5 above.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong style="vertical-align: baseline;"&gt;The Ironwood advantage: System-level performance&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These optimization techniques, coupled with Ironwood's architectural strengths like the high-speed 3D Torus Inter-Chip Interconnect (ICI) and massive HBM capacity, create a highly performant platform for training frontier models. The tight co-design across hardware, compilers (XLA), and frameworks (JAX, MaxText) ensures you can extract maximum performance from your AI Infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to accelerate your AI journey? Explore the resources below to dive deeper into each optimization method.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Further reading&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/inside-the-optimization-of-fp8-training-on-ironwood/336681" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inside the optimization of FP8 training on Ironwood&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/optimizing-frontier-model-training-on-tpu-v7x-ironwood/336983/2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optimizing Frontier Model Training on TPU v7x Ironwood&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;A special thanks to &lt;/span&gt;&lt;em&gt;&lt;span data-rich-links='{"per_n":"Hina Jajoo","per_e":"hjajoo@google.com","type":"person"}' style="vertical-align: baseline;"&gt;Hina Jajoo&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;span data-rich-links='{"per_n":"Amanda Liang","per_e":"amandaliang@google.com","type":"person"}' style="vertical-align: baseline;"&gt;Amanda Liang&lt;/span&gt;&lt;/em&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for their contributions to this blog post.&lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 23 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</guid><category>AI &amp; Machine Learning</category><category>TPUs</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>A developer’s guide to training with Ironwood TPUs</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Lillian Yu</name><title>Product Strategy &amp; Operations</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Liat Berry</name><title>Product Manager, Google TPUs</title><department></department><company></company></author></item><item><title>Google Cloud and NVIDIA expand AI innovation across industries at GTC 2026</title><link>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The era of agentic AI is fundamentally changing enterprise infrastructure needs. As organizations build systems capable of dynamic reasoning and autonomous execution, the underlying infrastructure must evolve as well. Scaling these agentic workloads alongside massive mixture-of-experts (MoE) architectures demands a deeply optimized co-engineered stack.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To meet these demands, we’ve built the Google Cloud AI Hypercomputer, an AI-optimized infrastructure as a service, that integrates performance-optimized hardware, leading software, open frameworks, and flexible consumption models into a single, cohesive system to deliver ultra-low latency, high-throughput, and cost-effective inference. To give our customers even more options within this integrated architecture, we are expanding our partnership with NVIDIA.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This week at NVIDIA GTC 2026, Google Cloud and NVIDIA are expanding our partnership with a wave of new announcements, showcasing a co-engineered AI infrastructure foundation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure and hardware&lt;/strong&gt;&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Strong momentum for Google Cloud G4 VMs, powered by NVIDIA RTX PRO&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; 6000 Blackwell Server Edition&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Preview of flexible, fractional G4 VMs using NVIDIA vGPU technology — a first in the industry for NVIDIA RTX PRO&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; 6000 Blackwell Server Edition&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Upcoming support for NVIDIA Vera Rubin NVL72 Platform&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software and platform&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA Dynamo integration with GKE Inference Gateway&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Enhanced NVIDIA support across Vertex AI Training and Model Garden&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ecosystem&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Kaggle competition for NVIDIA Nemotron on G4 VMs&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Launch of a dedicated public sector AI startup accelerator program&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a closer look at the announcements.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Accelerating AI workloads with G4 VMs&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;G4 VMs, powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, are built to power a diverse spectrum of high-performance workloads — from advanced spatial computing to complete AI development lifecycles. For instance, companies like Otto Group &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;One.O &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and WPP use the G4 to run physically accurate simulations and real-time 3D rendering at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond simulation, the G4 also shines in model fine-tuning and inference, particularly for models ranging from 30B to more than 100B parameters. By leveraging 4-bit floating point (FP4) precision and Google’s peer-to-peer (P2P) communication, customers are achieving &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/g4-vms-p2p-fabric-boosts-multi-gpu-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;higher throughput for model serving and considerable latency reductions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling a new class of real-time, multimodal AI agents and highly responsive generative AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here are some examples of how customers are already leveraging the performance and efficiency of G4 VMs to accelerate their most demanding workloads:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Google Cloud’s G4 VMs give us the scalable GPU backbone we need to push billions of miles of photorealistic simulation through our pipeline. The 4x lift in throughput means our ML teams can iterate faster, train on richer data, and validate edge cases long before our models ever see the real world.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Sony Mohapatra, Director, AI/ML Engineering, General Motors&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Now with G4 VMs powered by NVIDIA Blackwell, we're pushing our multimodal models even further — faster inference, better reliability, instant replies across languages. The goal stays the same: making voice agents that work at enterprise scale without compromise. We are excited to keep building together and see what our customers deploy with this.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Mati Staniszewski, Cofounder, ElevenLabs&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Google Cloud G4 VMs provide the computational backbone for our Robotic Coordination Layer, allowing us to synchronize autonomous fleets across our logistics centers with millisecond precision. By simulating complex warehouse environments in a high-fidelity digital twin, we can optimize our entire supply chain virtually before a single robot moves on the floor.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; – &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Dr. Stefan Borsutzky, CEO of Otto Group One.O&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“After transitioning to G4 VMs, we achieved a 50% reduction in processing latency and 6x increase in throughput just by updating our Terraform scripts. It’s rare to get that kind of performance boost for our core workloads without adding any operational overhead.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; – Alfonso Acosta, Head of Engineering, Imgix&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Introducing fractional G4 VMs &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are excited to announce the preview of fractional G4 VMs, providing a highly efficient and cost-effective entry point for AI and graphics workloads. These new configurations, using NVIDIA virtual GPU (vGPU) technology, allow you to leverage the power of the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs in flexible, smaller increments, so you can right-size your infrastructure to match the specific demands of your applications.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;“Enterprises need unprecedented flexibility to scale complex, agentic AI workloads. With Google Cloud, we’re introducing fractional G4 VMs powered by NVIDIA RTX PRO 6000 to let customers right‑size GPU capacity and maximize ROI. Together with our co‑engineered stack – from NVIDIA NeMo on Vertex AI to NVIDIA Dynamo with GKE – we’re delivering an open, high‑performance platform for next‑generation reasoning and MoE models.” &lt;/span&gt;&lt;/em&gt;&lt;span style="vertical-align: baseline;"&gt;– Ian Buck, VP / General Manager, Hyperscale and HPC, NVIDIA&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By providing more granular access to advanced hardware, fractional G4 VMs let you optimize resource allocation and reduce overhead without sacrificing performance. You can now select from additional GPU slice sizes for your specific needs:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/2 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ideal for more intensive tasks such as LLM inference, robotics sensor simulation, and high-fidelity 3D rendering.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/4 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Optimized for mainstream workloads, including mid-range creative design, video transcoding, and real-time data visualization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/8 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Great for lightweight applications such as remote desktops, productivity tools, and entry-level streaming services.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This flexible G4 size portfolio lets you:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Right-size infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Precisely match GPU capacity to application demands, ranging from lightweight remote desktops to intensive data processing.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximize cost efficiency:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Lower operational overhead by utilizing — and paying for — only the fractional GPU resources you need for specific tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scale diverse workloads:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Power a broad spectrum of innovation, from high-fidelity creative design and streaming to complex robotics simulations and real-time inference.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These fractional G4 VMs can be managed by Google Kubernetes Engine (GKE), allowing developers to use advanced container bin packing to achieve even higher price-performance and resource utilization. When managed through Dynamic Workload Scheduler, you can set fallback priorities for fractional slices. This significantly improves obtainability by allowing the scheduler to automatically find available GPU configurations for each workload.&lt;/span&gt;&lt;/p&gt;
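&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough sketch of what scheduling onto a fractional slice can look like, assuming the slice is surfaced to pods through the standard nvidia.com/gpu resource on a vGPU-enabled G4 node pool (verify the exact resource name, node-pool settings, and image in the GKE documentation; the image path below is hypothetical):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
from kubernetes import client, config

# Request one fractional G4 slice for an inference pod. Assumption: the slice
# is exposed as a unit of the standard "nvidia.com/gpu" resource on a node
# pool configured for vGPU sharing.
config.load_kube_config()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="g4-fractional-inference"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="server",
                image="us-docker.pkg.dev/my-project/my-repo/server:latest",  # hypothetical
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one slice of the shared GPU
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
&lt;/pre&gt;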
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;The G4 vGPU’s flexible sizing allows us to precisely tailor compute resources to the scale of each molecular simulation, ensuring maximum efficiency across our drug discovery pipeline. This granular control means our researchers can seamlessly pivot between smaller workflows and massive parallel processing without being constrained by fixed hardware configurations.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Shane Brauner, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;EVP, CIO, Schrödinger&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scaling AI Hypercomputer with NVIDIA Vera Rubin NVL72&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on our deep engineering partnership with NVIDIA, we’re proud to support the successor to the NVIDIA Blackwell architecture, the recently announced NVIDIA Vera Rubin platform. We plan to be among the first cloud providers to offer NVIDIA Vera Rubin NVL72 rack-scale systems in the second half of 2026, integrating them into our AI Hypercomputer architecture to empower the next generation of reasoning and agentic AI. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Delivering efficiency across the AI infrastructure stack &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As part of our commitment to a fully open ecosystem, we are excited to announce the integration of Dynamo and GKE &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This integration provides a modular, open-source control plane across the application layer and the hardware. By combining Dynamo with Inference Gateway on GKE, teams can tailor their infrastructure to their exact needs, allowing them to extract the maximum ROI from accelerators, accelerate time-to-market for new AI models, and future-proof their deployments.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can learn to maximize performance for massive MoE architectures through new &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;advanced scaling recipes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for A4X VMs (powered by NVIDIA GB200 NVL72 and Dynamo). These configurations show how to overcome memory and interconnect bottlenecks when running AI inference workloads on AI Hypercomputer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also enhancing resource obtainability through the Dynamic Workload Scheduler, with Calendar Mode and Flex Start for A4X and A4X Max (powered by NVIDIA GB300 NVL72), as well as new Flex Start support for G4 VMs. Dynamic Workload Scheduler lets you reserve the precise capacity that you need, or use flexible start windows. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Snap, a long-time Google Cloud customer, achieved significant cost savings by migrating two of its primary data processing pipelines to Google Cloud G2 VMs powered by NVIDIA L4 Tensor Core GPUs. This was made possible by leveraging Spark on GKE alongside NVIDIA’s new cuDF libraries, which automated the optimization of its shuffle-heavy workloads for optimal GPU efficiency. &lt;/span&gt;&lt;a href="https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s81678/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Learn more at GTC session S81678.&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Advancing Vertex AI training and Model Garden &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are meeting the demands of next-generation AI with two major infrastructure advancements to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI training clusters&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. First, support for &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X VM domains&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; lets you leverage Vertex AI’s managed infrastructure and framework capabilities for massive-scale training on &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GB200 NVL72 &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;rack-scale systems. To ensure these intensive workloads remain uninterrupted, new hardware resiliency capabilities let you apply configurable, proactive fault detection scans, which identify and mitigate potential hardware issues before they can disrupt critical “hero” training runs. These capabilities enable higher goodput and help ensure that multi-week training jobs stay on track without costly restarts.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“We are setting a new standard for the agentic enterprise — delivering highly capable, consistent, accurate, and responsive AI agents with Google and NVIDIA. By leveraging Vertex AI training clusters on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;NVIDIA GB200 NVL72&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; to power our Agentforce 360 Platform, we’ve eliminated infrastructure bottlenecks to keep our GPUs fully saturated. This high-performance, resilient architecture allows our researchers to focus on innovation at scale, driving substantial gains for our most complex reasoning workloads.” - &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Silvio Savarese, Chief Scientist, Salesforce&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the same time, we continue to broaden Vertex AI Model Garden with support for &lt;/span&gt;&lt;a href="https://console.cloud.google.com/vertex-ai/publishers/nvidia/model-garden/nemotron-3-super" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA’s Nemotron 3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; family of open models. These include the Nemotron 3 Nano, featuring one-click deployment to simplify integration into private VPCs. We’ve also expanded our catalog to include the NVIDIA Nemotron 3 Super 120B model for immediate access to high-performance, large-scale reasoning. To maximize the value of these models, we’ve integrated NVIDIA’s latest performance libraries directly into Vertex AI to optimize popular open-source models on NVIDIA TensorRT-LLM. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To enable the community to get hands-on with NVIDIA Nemotron on Google Cloud, we &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;are also launching the NVIDIA Nemotron model reasoning challenge on Kaggle, powered by &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;G4 VMs. The competition invites the community to improve Nemotron 3 Nano’s reasoning accuracy on a new benchmark using techniques such as prompting, synthetic data generation, data curation, and fine-tuning – all running on cost-efficient G4 infrastructure so participants can iterate quickly and share their methods with the broader ecosystem. To learn more and register, &lt;/span&gt;&lt;a href="https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;visit the Kaggle competition page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Empowering public sector AI startups &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To foster continued innovation within the ecosystem, Google Public Sector and NVIDIA are launching an AI startup accelerator program. This year-long initiative will support a select cohort of AI-focused Independent Software Vendors (ISVs) building solutions for the public sector.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Participants gain dual access to both NVIDIA Inception and Google Cloud’s ISV accelerator resources. Kicking off at GTC and continuing through Google Cloud Next, this joint program will equip emerging technology leaders with the co-engineered infrastructure, technical guidance, and go-to-market support required to scale mission-critical public sector applications. To learn more about the program, please complete the &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSci71lEfkHJKb9wVN2UmXVGaOk3DeB84mW5dve8ulo9kl60pg/viewform" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;interest form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Additional cohorts will be selected and announced in the future.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Co-engineering collaboration powers every layer of the AI stack&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition to complex, agentic AI demands more than just raw compute. It requires a fully optimized, co-engineered stack. By integrating flexible hardware like fractional G4 instances and the upcoming Vera Rubin platform into our AI Hypercomputer architecture, and pairing it with deep software co-engineering, we provide the scale, resilience, and efficiency you need to turn your most ambitious AI visions into reality.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Coming to GTC? Stop by booth #513 to learn more and talk to our team. And you can always learn more about our collaboration with NVIDIA at &lt;/span&gt;&lt;a href="http://cloud.google.com/NVIDIA"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;cloud.google.com/NVIDIA&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 16 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</guid><category>AI &amp; Machine Learning</category><category>Partners</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_NVIDIA_Hero_Image_for_GTC26_Blo.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google Cloud and NVIDIA expand AI innovation across industries at GTC 2026</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_NVIDIA_Hero_Image_for_GTC26_Blo.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><link>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re announcing  the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;general availability of H4D VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, our latest high performance computing (HPC)-optimized VM, powered by the 5th Generation AMD EPYC&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;™ processors&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. H4D VMs deliver exceptional performance, scalability, and value for industries like manufacturing, health care and life sciences, weather forecasting, and electronic design automation (EDA).&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D supports orchestration via Cluster Toolkit with Slurm and via Google Kubernetes Engine (GKE). Each approach allows for near-instant deployment and scaling of demanding workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For the first time, the Google Cloud CPU portfolio features a VM family with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;C&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;loud Remote Direct Memory Access (RDMA).&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;H4D’s RDMA is on the &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium network adapter&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and lets you scale single-node H4D performance to multiple nodes, accelerating large production workloads. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Faster time to solution across domains and scales&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Powered by the high core density of the 5th Gen AMD EPYC CPU and Google’s innovative, low-latency &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Falcon hardware transport&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D VMs enable you to iterate and discover faster than ever before.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We demonstrated H4D performance through a series of industry-standard benchmarks, showing its capabilities across diverse domains and problem sizes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Healthcare and life sciences&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For researchers in healthcare and life sciences (HCLS), H4D VMs accelerate complex molecular simulations critical to scientific discovery. Compared to our previous C2D VMs, H4D VMs deliver up to a 4.3X speedup running LAMMPS (LJ benchmark) at 96 VMs, delivering 95% parallel efficiency on 18k cores. For drug discovery, we demonstrated a 5.8X speedup using GROMACS (water_33m) at 32 VMs, delivering 72% parallel efficiency on 6k cores. H4D also delivers further scalability, which we demonstrated by running the LAMMPS LJ benchmark on 192 VMs (&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;~37k cores) while maintaining 92% parallel efficiency (see Figure 3).&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
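&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For reference, parallel efficiency here is observed speedup divided by ideal linear speedup. A quick sketch of the arithmetic, using the 96-VM LAMMPS figure above as an illustration:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
def parallel_efficiency(observed_speedup, ideal_speedup):
    """Efficiency is observed speedup divided by ideal (linear) speedup."""
    return observed_speedup / ideal_speedup

# 95% efficiency on 96 VMs (vs. a single-VM baseline) implies an observed
# speedup of about 0.95 * 96 = 91.2x, rather than the ideal 96x.
print(parallel_efficiency(91.2, 96))  # 0.95
&lt;/pre&gt;&lt;/div&gt;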
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_JTLuwUW.max-1000x1000.jpg"
        
          alt="1-Figuer1&amp;amp;2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--medium
      
      
        h-c-grid__col
        
        h-c-grid__col--4 h-c-grid__col--offset-4
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_RA1vjLg.jpg"
        
          alt="2-Figuer3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Manufacturing&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For manufacturing, H4D VMs help engineers shorten design cycles, run larger simulations, and iterate faster by delivering a strong performance boost for mission-critical Computer-Aided Engineering (CAE) workflows. Compared to our previous C2D VMs when running complex Computational Fluid Dynamics (CFD) simulations, H4D VMs deliver a 4.1X speedup running Ansys Fluent (F1_RaceCar_140m benchmark) on 32 VMs with 85% parallel efficiency. When running open-source OpenFOAM  (Motorbike_100m), we demonstrated a 5.2X speedup over C2D using 16 VMs and achieving superlinear parallel efficiency of 122%.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_9YSJuty.jpg"
        
          alt="3-Figuer4&amp;amp;5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A new standard for HPC price/performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs are designed to deliver the best price-performance for HPC workloads on Google Cloud by pairing superior performance with flexible consumption models. H4D supports Dynamic Workload Scheduler (DWS), which adapts to your workflow with Flex Start mode for just-in-time capacity and Calendar mode for guaranteed reservations. This allows you to access compute for as low as 3 cents per core-hour without long-term commitments. The resulting performance and cost efficiencies over previous generation VMs are detailed in Figures 6 and 7. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/4_VFxG3YM.jpg"
        
          alt="4-Figuer6"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/5_FKrLh4Z.jpg"
        
          alt="5-Figuer7"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Comprehensive HPC management&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To manage and deploy large, dense clusters of H4D VMs, you can leverage Google Cloud’s &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/cluster-capabilities"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which offers advanced maintenance capabilities (you can sign up for the preview &lt;/span&gt;&lt;a href="https://forms.gle/dppWNms5DF44gCwV9" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) alongside the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for rapid cluster deployment via turnkey system blueprints. For job and workload management, H4D VMs integrate with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/batch/docs/get-started"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s fully managed, cloud-native service that handles queuing, scheduling, and resource provisioning. Additionally, there’s support for &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DWS&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which can be used in both Calendar mode for future reservations and Flex Start mode for time-limited, on-demand usage.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What customers and partners are saying&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/jump.max-1000x1000.jpg"
        
          alt="jump"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;“We were able to test the H4D platform in early access at&lt;/i&gt; &lt;a href="https://www.jumptrading.com/"&gt;&lt;i&gt;Jump Trading&lt;/i&gt;&lt;/a&gt;&lt;i&gt;, and were extremely impressed with the results. The successful testing process demonstrated that H4D offers the performance, stability, and efficiency we require for demanding, high-volume operations. We see up to 50% better price/performance compared to prior generation machines and are now accelerating integration with our critical grid workloads on Google Cloud."&lt;/i&gt; &lt;b&gt;- Alex Davies, Chief Technology Officer &amp;amp; Benjamin Stromski, HPC Linux Engineering, Jump Trading&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/hmx_labs.max-1000x1000.jpg"
        
          alt="hmx labs"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;“There lingers, especially in large-scale and compute-intensive domains, the idea that the fastest systems can only be built on premises and run on bare metal hardware. Terms such as ‘hypervisor tax” are often thrown around as justification for operating with bare metal. Our testing paints a different picture. The Google H4D VM performs better on our financial risk benchmark than the bare metal top of stack AMD CPU of the same generation."&lt;/i&gt; &lt;b&gt;- Hamza Mian/CEO, HMxLabs&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/totalcare.max-1000x1000.jpg"
        
          alt="totalcare"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;"As a leading provider of managed HPC solutions for the demanding CAE and manufacturing sectors, our evaluation of the H4D platform was focused heavily on its ability to handle our clients' largest, most tightly-coupled simulation workloads. We are extremely impressed with the results. The testing confirmed that the underlying RDMA fabric exhibits the outstanding low-latency and high-bandwidth performance required for massive parallel processing. This level of interconnect efficiency is non-negotiable for speeding up critical manufacturing simulations like crash testing and CFD. H4D has proven itself to be a true accelerator for high-throughput engineering workloads, and we are excited about its potential to redefine the performance ceiling for HPC in the engineering world."&lt;/i&gt; &lt;b&gt;- Rodney Mach/President, TotalCAE&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google.max-1000x1000.jpg"
        
          alt="Google"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;b&gt;&lt;i&gt;“&lt;/i&gt;&lt;/b&gt;&lt;i&gt;The new H4D instances are a significant step forward for our demanding next-generation TPU simulation workloads. We've seen a 30% performance improvement across a variety of EDA benchmarks compared to C2D, demonstrating the strong single core performance of H4D. This directly translates to faster development cycles and allows our engineering teams to iterate more quickly”&lt;/i&gt;&lt;b&gt; - Trevor Switkowski, Technical Lead of Chip Design Methodology, Google Cloud&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Experience H4D today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D is now available in &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;us-central1-a (Iowa), europe-west4-b (Netherlands), and asia-southeast1-a (Singapore)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, with additional regions coming soon. Check regional availability on our &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/regions-zones#available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Regions and Zones page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and deploy your most demanding HPC workloads by leveraging &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/create-vm-with-rdma"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud RDMA&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;The following configurations were run for the above benchmarks: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;LAMMPS version 20250722, GROMACS version 2023.1, OpenFOAM version 2312, Ansys Fluent version 2024R1. All runs used IntelMPI 2021.17.2. C2D/C3D/C4D used TCP; H4D used RDMA with RXM &amp;amp; SAR_LIMIT=2G. All runs used the full ppn (processes-per-node) available on each platform (56, 180, and 192 for C2D, C3D, and C4D/H4D, respectively). Ansys Fluent runs used 168 ppn on H4D and variable ppn for C4D. SMT was off for all runs. Cost comparison across single nodes of H4D-highmem-192 with DWS Flex Start price, c3d-standard-360 and c2d-standard-112 OD price.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Parallel efficiency and optimal node count depend on input size and communication patterns, and therefore vary across workloads.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 04 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</guid><category>HPC</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aysha Keen</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Felix Schürmann</name><title>Senior HPC Technologist</title><department></department><company></company></author></item><item><title>Simpler billing, clearer savings: A FinOps guide to updated spend-based CUDs</title><link>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Optimizing cloud spend is one of the most rewarding aspects of FinOps — and committed use discounts (CUDs) remain one of the most effective levers to pull.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In July 2025, we began rolling out &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;updates to the spend-based CUD model&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to make it easier to understand your costs and savings, expand coverage to new SKUs (including Cloud Run and H3/M-series VMs), and offer increased flexibility. These changes are now available to all customers. Let’s dive into how this new model simplifies your FinOps practice.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. What is the spend-based CUD change all about? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The most important shift is the move from a credit-based system to a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;direct discounted price model using &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice#consumption-model-intro"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;consumption models.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Under the old &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;credits model&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, you committed to an hourly on-demand amount. To find your &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;savings&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (the actual cost reduction realized), you had to use three different numbers: the full on-demand cost, the commitment fee, and the offsetting credit.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The old math:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;$10.00 (On-demand) + $5.50 (Commitment fee) - $10.00 (Credit) = $5.50 (Net Cost)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Savings = $10.00 (On-demand) - $5.50 (Net costs) = $4.50&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice#consumption-model-intro"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;direct discount model&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you don’t need to do that math to calculate your net costs. You commit directly to the net, discounted spend amount. Your usage is simply billed at that discounted rate.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The new math:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;$5.50 (Discounted costs)&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Savings = $10.00 (On-demand) - $5.50 (Discounted costs) = $4.50&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;  &lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can now see your net cost at a glance, and calculating the savings only requires comparing the on-demand price ($10.00) to your new discounted cost ($5.50), which equals &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;$4.50/hr.&lt;/strong&gt;&lt;/p&gt;
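&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The equivalence of the two models is easy to check. A minimal sketch using the example figures above:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
on_demand = 10.00       # hourly on-demand cost
commitment_fee = 5.50   # old model: hourly commitment fee
credit = 10.00          # old model: offsetting credit
discounted = 5.50       # new model: direct discounted rate

# Old credits model: three numbers to reach the net cost.
old_net = on_demand + commitment_fee - credit   # 5.50
old_savings = on_demand - old_net               # 4.50

# New direct-discount model: the bill is already the net cost.
new_savings = on_demand - discounted            # 4.50

assert old_savings == new_savings
print(f"net cost: ${discounted:.2f}/hr, savings: ${new_savings:.2f}/hr")
&lt;/pre&gt;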
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. How do I validate my savings before and after the changes?  &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The unified &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;CUD Analysis tool&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is your best resource for auditing the migration or performing deep dives on your spend. CUD Analysis for the new spend-based CUD model&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; lets you quickly verify the savings you are getting under the new model, and you can use it to confirm that your savings did not change between the old model and the new one. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can validate your savings by following these steps:&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Identify the date when the migration took place; you can see the migration date in the billing overview page.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_jzjRx1j.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Go to CUD Analysis to validate the savings before and after the migration. &lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;3. To quantify costs from before the migration:&lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Filter the view for one day before the migration, in this case &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Oct. 26, 2025.&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;Select a CUD Product, for example &lt;strong style="vertical-align: baseline;"&gt;Cloud SQL CUD.&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;In our example, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;we&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;paid a $50.35 CUD fee to get a $69.12 credit. When you subtract that fee from the credit, your actual take-home &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;savings were $18.77&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_2jbhCzc.max-1000x1000.png" alt="2"&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4. To validate costs after the migration&lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Change the date to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Oct. 28, 2025&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;Under the new model, you pay the discounted rates upfront. Your dashboard will reflect a Net Cost of $50.35, compared to the $69.12 on-demand cost, clearly showing your &lt;strong style="vertical-align: baseline;"&gt;$18.77 in savings.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nQjMUwd.max-1000x1000.png" alt="3"&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, this release also includes &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-verify-discounts#example_cost_reports"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;an update to &lt;/span&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cost Reports&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to include “Savings Programs,” which accurately reflects your actual net savings ($18.77 in our example above), rather than gross credit. &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;When comparing pre- and post-migration data in Cost Reports, ensure you include both usage SKUs and commitment fee SKUs to capture the full scope of the commitment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. What other capabilities are in the new CUD Analysis?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond support for the new model, the new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CUD Analysis tool&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; offers deeper visibility into your CUD coverage and CUD utilization. You can now analyze your CUDs with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;hourly data granularity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for up to 30 days. This is a major improvement for FinOps teams, as daily averages often hide underutilization spikes that occur during specific hours.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
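&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you prefer to work with the data directly, the same hourly check is straightforward to reproduce offline. Here is a minimal pandas sketch, where the file and column names are illustrative placeholders rather than the actual export schema:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;pre&gt;import pandas as pd

# Hypothetical hourly utilization data, e.g. a CSV downloaded from the
# CUD Analysis view; the column names here are placeholders only.
df = pd.read_csv("cud_hourly.csv", parse_dates=["hour"])
df["utilization"] = df["credited_spend"] / df["commitment_fee"]

# A healthy daily average can still hide individual low hours,
# so flag any hour below a 95% utilization threshold directly.
under = df[df["utilization"].lt(0.95)]
print(under[["hour", "utilization"]].to_string(index=False))&lt;/pre&gt;&lt;/div&gt;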
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_HLosdOT.max-1000x1000.png" alt="4"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="rirdr"&gt;CUD Analysis: Compute Flexible CUD coverage analysis&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_9A7ZjUx.max-1000x1000.png" alt="5"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="rirdr"&gt;CUD Analysis: Per CUD purchase utilization visibility&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you want to use your own data analysis tools, we offer a new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/export-data-bigquery-tables/cud-export"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;spend-based CUD metadata export&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;that lets you manage your spend-based CUDs programmatically. You can use this export to join with the Billing BigQuery Export datasets to run in-depth, programmatic analysis on all your commitment data. You can also export &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds#download_your_report"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;a CSV from the CUD Analysis view&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to see the raw data for every resource and its price without needing the full BigQuery export.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;4. How much commitment should I buy? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-recommender"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CUD recommendations&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; are the primary tool for determining how much of a commitment to purchase. We recently enhanced our Compute Flexible CUD commitment recommendations to provide greater accuracy by including data from GKE, Cloud Run, Cloud Run Functions, and Compute Engine. Additionally, CUD scenario modeling allows you to adjust these suggestions in real-time. You can adjust coverage thresholds, filter out specific dates with irregular usage, or extend the lookback analysis window up to 180 days to identify the exact commitment level that aligns with your specific risk profile.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/6_MpUcC4f.max-1000x1000.png" alt="6"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="rirdr"&gt;CUD scenario modeling: experiment with multiple options to identify your ideal CUD strategy&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;5. Is there anything else I should know about Flex CUDs? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the release of the new spend-based model, we’ve addressed the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;reporting limitation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; affecting customers who use a combination of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/committed-use-discounts-overview#spend_based"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Flex CUDs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and GKE/Cloud Run CUDs. Previously, our analysis tools were unable to accurately identify the source of specific credits, leading to discrepancies in KPI metrics like savings, coverage, and utilization. Under the new spend-based CUD model, this limitation has been corrected, so your CUD analysis now provides an accurate, granular view of your savings per Google Cloud service.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To begin navigating the updated spend-based model, visit the Billing console. You can learn more in our documentation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/docs/cuds-multiprice"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Enhancements to the Spend-based CUD program &lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/docs/cuds-multiprice-datamodel"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Insights into the multi-price data model&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-verify-discounts"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Verify your savings post-migration&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/compute/expanded-coverage-for-compute-flex-cuds/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Save more with expanded coverage for Compute Flex CUDs&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Compute Flexible Committed Use Discounts (Flex CUDs) now cover memory-optimized and HPC VM families and Cloud Run.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Thu, 12 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</guid><category>Compute</category><category>Cost Management</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Simpler billing, clearer savings: A FinOps guide to updated spend-based CUDs</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Alfonso Hernandez</name><title>Sr. Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rahul Sharma</name><title>Sr. Product Manager</title><department></department><company></company></author></item><item><title>High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run</title><link>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Running large-scale inference models can involve significant operational toil, including cluster management and manual VM maintenance. One solution is to leverage a serverless compute platform to abstract away the underlying infrastructure. Today, we’re bringing the serverless experience to high-end inference with support for &lt;/span&gt;&lt;a href="https://www.nvidia.com/en-us/data-center/rtx-pro-6000-blackwell-server-edition/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA RTX PRO™ 6000 Blackwell Server Edition GPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on Cloud Run. Now in preview, you can deploy massive models like Gemma 3 27B or Llama 3.1 70B with the 'deploy and forget' experience you’ve come to expect from Cloud Run. No reservations. No cluster management. Just code.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A powerful GPU platform&lt;/strong&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_qqUpivV.max-1000x1000.jpg" alt="1"&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NVIDIA RTX PRO 6000 Blackwell GPU provides a huge leap in performance compared to the NVIDIA L4 GPU, bringing 96GB vGPU memory, 1.6 TB/s of bandwidth and support for FP4 and FP6. This means you can serve up to 70B+ parameter models without having to manage any underlying infrastructure. Cloud Run lets you attach a NVIDIA RTX PRO 6000 Blackwell GPU to your Cloud Run service, job, or worker pools, on demand, with no reservations required. Here are some ways you can use the NVIDIA RTX PRO 6000 Blackwell GPU to accelerate your business:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Generative AI and inference:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With its FP4 precision support, the NVIDIA RTX PRO 6000 Blackwell GPU’s high-efficiency compute accelerates LLM fine-tuning and inference, letting you create real-time generative AI applications such as multi-modal and text-to-image creation models. By &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;running your model on Cloud Run services&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can also take advantage of rapid startup and scaling, going from zero instances to having a GPU with drivers installed under 5 seconds. When traffic eventually scales down zero and no more requests are being received, Cloud Run automatically scales your GPU instances down to zero.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Fine-tuning and offline inference&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: NVIDIA RTX PRO 6000 Blackwell GPUs can be used with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/jobs/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run jobs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to fine-tune your model. The fifth-generation NVIDIA Tensor Cores also work alongside AI models to help accelerate rendering pipelines and enhance content creation. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tailored scaling for specialized workloads&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/workerpools/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU-enabled worker pools&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to apply granular control over your GPU workers, whether you need to dynamically scale based on custom external metrics or manually provision "always-on" instances for complex, stateful processing.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We built Cloud Run to be the simplest way to run production-ready, GPU-accelerated tasks. Some highlights of Cloud Run include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Managed GPUs with flexible compute: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run pre-installs the necessary NVIDIA drivers so you can focus on your code. Cloud Run instances using NVIDIA RTX PRO 6000 Blackwell GPUs can configure up to 44 vCPU and 176GB of RAM.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Production-grade reliability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By default, Cloud Run offers zonal redundancy, helping to ensure enough capacity for your service to be resilient to a zonal outage; this also applies to Cloud Run with GPUs. Alternatively, you can turn off zonal redundancy and benefit from a lower price for best-effort failover of your GPU workloads in case of a zonal outage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tight integration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Cloud Run works natively with the rest of Google Cloud. You can load massive model weights by mounting Cloud Storage buckets as local volumes, or use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/iap/docs/enabling-cloud-run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Identity-Aware Proxy (IAP)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to secure traffic that’s bound for a Cloud Run service.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NVIDIA RTX PRO 6000 Blackwell GPU is available in preview on demand with availability in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;us-central1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;europe-west4&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and limited availability in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asia-south2&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asia-southeast1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. You can deploy your first service using &lt;/span&gt;&lt;a href="https://ollama.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ollama&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, one of the easiest way to run open models, on Cloud Run with NVIDIA RTX PRO 6000 GPUs enabled:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud beta run deploy my-service  \\\r\n--image ollama/ollama --port 11434 \\\r\n--cpu 20 --memory 80Gi \\\r\n--gpu-type nvidia-rtx-pro-6000 \\\r\n--no-gpu-zonal-redundancy \\\r\n--region us-central1&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f4a55cc3e80&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
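&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once the deploy finishes, you can smoke-test the endpoint. A minimal Python sketch follows; the service URL is a placeholder for the one gcloud prints, and it assumes a model has already been pulled into the Ollama instance and that the service accepts the request:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;pre&gt;import requests

# Minimal smoke test for the service deployed above. SERVICE_URL is a
# placeholder for the URL gcloud prints after deploy; this also assumes
# a model (gemma3 here) has been pulled into the Ollama instance, and
# that the service allows the call (add an auth header if your service
# requires authenticated invocations).
SERVICE_URL = "https://my-service-xxxxxx.us-central1.run.app"

resp = requests.post(
    SERVICE_URL + "/api/generate",
    json={"model": "gemma3", "prompt": "Say hello.", "stream": False},
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])&lt;/pre&gt;&lt;/div&gt;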
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more details, check out our updated &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu-best-practices"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI inference best practices&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 02 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><category>Serverless</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>James Ma</name><title>Sr. Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Oded Shahar</name><title>Sr. Engineering Manager</title><department></department><company></company></author></item><item><title>Unlock 2x better price-performance with Axion-based N4A VMs, now generally available</title><link>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;January 27, 2026: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The N4A is now generally available. You can get started by deploying &lt;/span&gt;&lt;a href="http://console.cloud.google.com/compute/instancesAdd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;N4A from the Google Cloud console&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Decision makers and builders today face a constant challenge: managing rising cloud costs while delivering the performance their customers demand. As applications evolve to use scale-out microservices and handle ever-growing data volumes, organizations need maximum efficiency from their underlying infrastructure to support their growing general-purpose workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image5_bCjzyyQ.max-1000x1000.png" alt="image5"&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To meet this need, we’re excited to announce our latest Axion-based virtual machine series: N4A, available in preview on Compute Engine, Google Kubernetes Engine (GKE), Dataproc, and Batch, with support in Dataflow and other services coming soon. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is the most cost-effective N-series VM to date, delivering &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;up to 2x better price-performance and 80% better performance-per-watt &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;than comparable current-generation x86-based VMs. This makes it easier for customers to further optimize the Total Cost of Ownership (TCO) for a broad range of general-purpose workloads. We see this with cloud-native businesses running scale-out web servers and microservices on GKE, enterprise teams managing backend application servers and mid-sized databases, and engineering organizations operating large CI/CD build farms. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we co-design our compute offerings with storage, networking and software at every layer of the stack, from orchestrators to runtimes, to deliver exceptional system-level performance and cost-efficiency. N4A’s breakthrough price-performance is powered by our latest-generation Google Axion Processors, built on the Arm® Neoverse® N3 compute core, Google &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/dynamic-resource-management"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Management&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (DRM) technology, and &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s custom-designed hardware and software system that offloads networking and storage processing to free up the CPU. Titanium is part of Google Cloud’s vertically integrated software stack — from the custom silicon in our servers to our planet-scale network traversing &lt;/span&gt;&lt;a href="https://cloud.google.com/about/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;7.75 million kilometers of terrestrial and subsea fiber&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Redefining general-purpose compute and enabling AI inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is engineered for versatility, with a feature set to support your general-purpose and CPU-based AI workloads. It comes in predefined and custom shapes, with up to 64 vCPUs and 512GB of DDR5 in high-cpu (2GB of memory per vCPU), standard (4GB per vCPU), and high-memory (8GB per vCPU) configurations, with instance networking up to 50 Gbps of bandwidth. N4A VMs feature support for our latest generation Hyperdisk storage options, including Hyperdisk Balanced, Hyperdisk Throughput, and Hyperdisk ML (coming later), providing up to 160K IOPS, 2.4GB/s of throughput per instance. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A performs well across a range of industry-standard benchmarks that represent the key workloads our customers run every day. For example, relative to comparable current-generation x86-based VM offerings, N4A delivers up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;105%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;compute-bound workloads&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;90%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;scale-out web servers&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;85%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Java applications&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, and up to&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; 20%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for general-purpose databases.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_q9MnCJ1.max-1000x1000.png" alt="1"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="dxvss"&gt;Footnote: As of October 2025. Performance based on the estimated SPECrate®2017_int_base, estimated SPECjbb2015, MySQL Transactions/minute (RO), and Google internal Nginx Reverse Proxy benchmark scores run in production on comparable latest-generation generally-available VMs with general purpose storage types. Price-performance claims based on published and upcoming list prices for Google Cloud.&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the real world, early adopters are seeing dramatic price-performance improvements from the new N4A instances.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_3I8oyl8.max-1000x1000.jpg"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="59dyk"&gt;&lt;i&gt;"At ZoomInfo, we operate a massive data intelligence platform where efficiency is paramount. Our core data processing pipelines, which are critical for delivering timely insights to our customers, run extensively on Dataflow and Java services in GKE. In our preview of the new N4A instances, we measured a 60% improvement in price-performance for these key workloads compared to their x86-based counterparts. This allows us to scale our platform more efficiently and deliver more value to our customers, faster."&lt;/i&gt; - &lt;b&gt;Sergei Koren, Chief Infrastructure Architect, ZoomInfo​&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nDU2gjP.max-1000x1000.jpg"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="xulw1"&gt;&lt;i&gt;“Organizations today need performance, efficiency, flexibility, and scale to meet the computing demands of the AI era; this requires the close collaboration and co-design that is at the heart of our partnership with Google Cloud. As N4A redefines cost-efficiency, customers gain a new level of infrastructure optimization, enabling enterprises to choose the right infrastructure for their workload requirements with Arm and Google Cloud.”&lt;/i&gt; - &lt;b&gt;Bhumik Patel, Director, Server Ecosystem Development, Infrastructure Business, Arm&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular control with Custom Machine Types and Hyperdisk&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A key advantage of our N-series VMs has always been flexibility, and with N4A, we are bringing one of our most popular features to the Axion family for the first time: Custom Machine Types (&lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CMT&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;). Instead of fitting your workload into a predefined shape, CMTs on N4A lets you independently configure the amount of vCPU and memory to meet your application's unique needs. This ability to right-size your instances means you pay only for the resources you use, minimizing waste and optimizing your total cost of ownership.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This same principle of matching resources to your specific workload applies to storage. N4A VMs feature support for our latest generation of &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing you to select the perfect storage profile for your application's needs:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Balanced:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Offers an optimal mix of performance and cost for the majority of general-purpose workloads, with up to 160K IOPs per N4A VM.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Throughput:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Delivers up to 2.4GiBps of max throughput for bandwidth-intensive analytics workloads like Hadoop or Kafka, providing high-capacity storage at an excellent value.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk ML &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(post GA)&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Purpose-built for AI/ML workloads, allows you to attach a single disk containing your model weights or datasets to up to 32 N4A instances simultaneously for large-scale inference or training tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Storage Pools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Instead of provisioning capacity and performance on a per-volume basis, allows you to provision performance and capacity in aggregate, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/cost-saving-strategies-when-migrating-to-google-cloud-compute?e=48754805#:~:text=2.%20Optimize%20your%20block%20storage%20selections"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;further optimizing costs by up to 50%&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and simplifying management.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
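&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As promised above, here is a minimal sketch of provisioning a Hyperdisk Balanced volume with decoupled capacity and performance via the Compute Engine Python client. The project, zone, and sizing values are placeholders, not recommendations:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;pre&gt;from google.cloud import compute_v1

# Illustrative Hyperdisk Balanced volume with independently provisioned
# capacity, IOPS, and throughput. Project/zone names and the sizing
# values below are placeholders for your own environment.
disk = compute_v1.Disk(
    name="n4a-data-disk",
    size_gb=500,
    type_="zones/us-central1-a/diskTypes/hyperdisk-balanced",
    provisioned_iops=10_000,
    provisioned_throughput=600,   # MB/s
)
op = compute_v1.DisksClient().insert(
    project="my-project", zone="us-central1-a", disk_resource=disk
)
op.result()  # wait for the create operation to finish&lt;/pre&gt;&lt;/div&gt;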
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_ZB4gdHF.max-1000x1000.jpg"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="7cqx3"&gt;&lt;i&gt;"At Vimeo, we have long relied on Custom Machine Types to efficiently manage our massive video transcoding platform. Our initial tests on the new Axion-based N4A instances have been very compelling, unlocking a new level of efficiency. We've observed a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs. This points to a clear path for improving our unit economics and scaling our services more profitably, without changing our operational model."&lt;/i&gt; - &lt;b&gt;Joe Peled, Sr. Director of Hosting &amp;amp; Delivery Ops, Vimeo&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A growing Arm-based Axion portfolio for customer choice&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C-series VMs are designed for workloads that require consistently high performance, e.g., medium-to-large-scale databases and in-memory caches. Alongside them, N-series VMs have been a key Compute Engine pillar, offering a balance of price-performance and flexibility, lowering the cost of running workloads with variable resource needs such as scale-out Java/GKE workloads. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;We released our first Axion-based machine series, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/try-c4a-the-first-google-axion-processor?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in October 2024, and the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;introduction of N4A complements C4A, providing a range of Google Axion instances suited to your workloads’ precise needs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On top of that, GKE unlocks significant price-performance advantages by orchestrating Axion-based C4A and N4A machine types. GKE leverages &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-custom-compute-classes"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Custom Compute Classes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provision and mix these machine types, matching workloads to the right hardware. This automated, heterogeneous cluster management allows teams to optimize their total cost of ownership across their entire application stack.&lt;/span&gt;&lt;span style="text-decoration: line-through; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Also &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;joining the Axion family is C4A.metal&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s first Axion bare metal instance that helps builders meet use cases that require access to the underlying physical server to run specialized applications in a non-virtualized environment, such as automotive systems development, workloads with strict licensing requirements, and Android software development. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A.metal will be available in preview soon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Supported by the broad and mature Arm ecosystem, adopting Axion is easier than ever, and the combination of C4A and N4A can help you lower the total cost of running your business, without compromising on performance or workload-specific requirements&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;N4A for cost optimization and flexibility.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Deliberately engineered for general-purpose workloads that need a balance of price and performance, including scale-out web servers, microservices, containerized applications, open-source databases, batch, data analytics, development environments, data preparation and AI/ML experimentation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;C4A for consistently high performance, predictability, and control.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Powering workloads where every microsecond counts, such as medium- to large-scale databases, in-memory caches, cost-effective AI/ML inference, and high-traffic gaming servers. C4A delivers consistent performance, offering a controlled maintenance experience for mission-critical workloads, networking bandwidth up to 100 Gbps, and next-generation Titanium Local SSD storage. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_m4GINGe.max-1000x1000.jpg"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="7cqx3"&gt;&lt;i&gt;"Migrating to Google Cloud's Axion portfolio gave us a critical competitive advantage. We slashed our compute consumption by 20% while maintaining low and stable latency with C4A instances, such as our Supply-Side Platform (SSP) backend service. Additionally, C4A enabled us to leverage Hyperdisk with precisely the IOPS we need for our stateful workloads, regardless of instance size. This flexibility gives us the best of both worlds - allowing us to win more ad auctions for our clients while significantly improving our margins. We're now testing the N4A family by running some of our key workloads that require the most flexibility, such as our API relay service. We are happy to share that several applications running in production are consuming 15% less CPU compared to our previous infrastructure, reducing our costs further, while ensuring that the right instance backs the workload characteristics required.”&lt;/i&gt; - &lt;b&gt;Or Ben Dahan, Cloud &amp;amp; Software Architect at Rise&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started with N4A today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is available in the following Google Cloud regions: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;us-central1 (Iowa)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;us-east4 (Virginia)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, us-east1 (South Carolina), us-west1 (Oregon), asia-southeast1 (Singapore), europe-west1 (Belgium), europe-west2 (London), &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;europe-west3 (Frankfurt) &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;europe-west4 (Netherlands)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; with more regions to follow. Learn&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; more about N4A &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines#n4a_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here in documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;; deploy N4A &lt;/span&gt;&lt;a href="http://console.cloud.google.com/compute/instancesAdd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here in the console&lt;/span&gt;&lt;/a&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 27 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Unlock 2x better price-performance with Axion-based N4A VMs, now generally available</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nate Baum</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mo Farhat</name><title>Group Product Manager</title><department></department><company></company></author></item><item><title>Scaling WideEP Mixture-of-Experts inference with Google Cloud A4X (GB200) and NVIDIA Dynamo</title><link>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As organizations transition from standard LLMs to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;massive Mixture-of-Experts (MoE) &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;architectures like DeepSeek-R1, the primary constraint has shifted from raw compute density to communication latency and memory bandwidth. Today, we’re releasing two new validated recipes designed to help customers overcome the infrastructure bottlenecks of the agentic AI era. 
These new recipes provide clear steps to optimize both throughput and latency, built on the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X machine series&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GB200 NVL72&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA Dynamo&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and they extend the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/ai-inference-recipe-using-nvidia-dynamo-with-ai-hypercomputer?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reference architecture&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; we published in September 2025 for disaggregated inference on A3 Ultra (NVIDIA H200) VMs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re bringing the best of both worlds to AI infrastructure by combining the multi-layered scalability of Google Cloud’s AI infrastructure with the rack-scale acceleration of the A4X. These recipes are part of a broader collaboration between our organizations that includes investments in important inference infrastructure like &lt;/span&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Allocation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (DRA) and &lt;/span&gt;&lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Highlights of the updated reference architecture include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Google Cloud’s A4X machine series, powered by NVIDIA GB200 NVL72, creating a single 72-GPU compute domain connected with fifth-generation NVIDIA NVLink.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving architecture:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NVIDIA Dynamo functions as the distributed runtime, managing KV cache state and kernel scheduling across the rack-scale fabric.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Performance: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For 8K/1K input sequence length (ISL)/ output sequence length (OSL) , we achieved &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;over 6K total tokens/sec/GPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in throughput-optimized configurations and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;10ms inter-token latency (ITL)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in latency-optimized configurations.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span&gt;&lt;strong style="vertical-align: baseline;"&gt;Deployment:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Two new recipes are available today in &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for deploying this stack on Google Cloud using Google Kubernetes Engine (GKE) for orchestration.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The modern inference stack&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To achieve exascale performance, inference cannot be treated as a monolithic workload. It requires a modular architecture where every layer is optimized for specific throughput and latency targets. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The AI Hypercomputer inference stack consists of three distinct layers:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The physical compute, networking, and storage fabric (e.g., A4X).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The specific model architecture and the optimized execution kernels (e.g., NVIDIA Dynamo, NVIDIA TensorRT-LLM, Pax) and runtime environment managing request scheduling, KV cache state, and distributed coordination.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Orchestration layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The control plane for resource lifecycle management, scaling, and fault tolerance (e.g., Kubernetes).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the reference architecture detailed below, we focus on a specific, high-performance instantiation of this stack designed for the NVIDIA ecosystem: A4X at the infrastructure layer, NVIDIA Dynamo at the serving layer, and GKE as the orchestrator.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Infrastructure layer: The A4X rack-scale architecture&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-a4x-vms-powered-by-nvidia-gb200-gpus?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4X launch announcement&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in February 2025, we described how the A4X VM addresses bandwidth constraints by implementing the GB200 NVL72 architecture, which fundamentally alters the topology available to the scheduler.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlike previous generations where NVLink domains were bound by the server chassis (typically 8 GPUs), the A4X exposes a unified fabric, with:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;72 NVIDIA Blackwell GPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; interconnected via the NVLink Switch System that enables the 72 GPUs to operate as one giant GPU with unified shared memory&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;130TB/s aggregate bandwidth&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling all-to-all communication with latency profiles comparable to on-board memory access (72 GPUs x 1.8 TB/s/GPU)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Native NVFP4 support:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Blackwell Tensor Cores support 4-bit floating point precision, effectively doubling throughput relative to FP8 for compatible model layers. We used &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;FP8 Precision Scaling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for this benchmark to support configuration and comparison with previously published results.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Serving layer: NVIDIA Dynamo&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Hardware of this scale requires a runtime capable of managing distributed state without introducing synchronization overhead. NVIDIA Dynamo serves as this distributed inference runtime. It moves beyond simple model serving to coordinate the complex lifecycle of inference requests across the underlying infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The serving layer optimizes utilization on the A4X through these specific mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Wide Expert Parallelism (WideEP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Traditional MoE serving shards experts within a single node (typically 8 GPUs), leading to load imbalances when specific experts become "hot." We use the A4X's unified fabric to distribute experts across the full 72-GPU rack. This WideEP configuration absorbs bursty expert activation patterns by balancing the load across a massive compute pool, helping to ensure that no single GPU becomes a straggler.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deep Expert Parallelism (&lt;/strong&gt;&lt;a href="https://github.com/deepseek-ai/DeepEP" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;DeepEP&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: While WideEP distributes the experts, DeepEP optimizes the critical "dispatch" and "combine" communication phases. DeepEP accelerates the high-bandwidth all-to-all operations required to route tokens to their assigned experts. This approach minimizes the synchronization overhead that typically bottlenecks MoE inference at scale.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Disaggregated request processing:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamo decouples the compute-bound prefill phase from the memory-bound decode phase. On the A4X, this allows the scheduler to allocate specific GPU groups within the rack to prefill (maximizing tensor core saturation) while other GPUs handle decode (maximizing memory bandwidth utilization), preventing resource contention.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Global KV cache management:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamo maintains a global view of the KV cache state. Its routing logic directs requests to the specific GPU holding the relevant context, minimizing redundant computation and cache migration.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;JIT kernel optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The runtime leverages NVIDIA Blackwell-specific kernels, performing just-in-time fusion of operations to reduce memory-access overhead during the generation phase.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
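&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the disaggregation mechanism concrete, here is a minimal sketch of the prefill/decode split at the orchestration level. It is not the published recipe: the names and images are placeholders, and real tensor-parallel workers that span multiple A4X nodes additionally require gang scheduling, which the recipes in the linked repo handle. The sketch only shows that the two phases run as separate, independently scaled workloads.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: prefill and decode as separate, independently scaled workloads.
# Placeholder names and images; not the published recipe.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dynamo-prefill            # compute-bound phase
spec:
  replicas: 5                     # scaled independently of decode
  selector:
    matchLabels: {role: prefill}
  template:
    metadata:
      labels: {role: prefill}
    spec:
      containers:
      - name: prefill
        image: example.com/dynamo-prefill:latest   # placeholder
        resources:
          limits:
            nvidia.com/gpu: 4     # one full A4X VM (4 GPUs) per pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dynamo-decode             # memory-bandwidth-bound phase
spec:
  replicas: 1                     # scaled independently of prefill
  selector:
    matchLabels: {role: decode}
  template:
    metadata:
      labels: {role: decode}
    spec:
      containers:
      - name: decode
        image: example.com/dynamo-decode:latest    # placeholder
        resources:
          limits:
            nvidia.com/gpu: 4
&lt;/code&gt;&lt;/pre&gt;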
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Orchestration layer: Mapping software to hardware&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While the A4X provides the physical fabric and Dynamo provides the runtime logic, the orchestration layer is responsible for mapping the software requirements to the hardware topology. For rack-scale architectures like the GB200 NVL72, container orchestration needs to evolve beyond standard scheduling. By making the orchestrator explicitly aware of the physical NVLink domains, we can fully unlock the platform’s performance and help ensure optimal workload placement.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GKE enforces this hardware-software alignment through these specific mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Rack-level atomic scheduling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With GB200 NVL72, the "unit of compute" is no longer a single GPU or a single node — the entire rack is the new fundamental building block of accelerated computing. We use GKE capacity reservations with specific affinity settings to target a reserved block of A4X infrastructure that guarantees dense deployment. By consuming this reservation, GKE helps ensure that all pods comprising a Dynamo instance land on the specific, physically contiguous rack hardware required to establish the NVLink domain, providing the hard topology guarantee needed for WideEP and DeepEP.&lt;/span&gt;&lt;/p&gt;
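&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough illustration of what this looks like at the pod level, the sketch below pins all pods of one Dynamo instance to a node pool backed by the dense reservation and co-schedules them within a single topology domain. The node pool name, image, and the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;cloud.google.com/gce-topology-block&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; topology key are illustrative assumptions; the published recipes encode the validated placement settings.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: keep one Dynamo instance inside a single reserved NVL72 domain.
# Node-pool name, image, and topology label are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: dynamo-worker-0
  labels:
    app: dynamo-instance-0
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: a4x-dense-pool   # pool consuming the reservation
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: dynamo-instance-0
        # Assumed rack/block-scoped node label used for topology-aware
        # placement; treat as illustrative.
        topologyKey: cloud.google.com/gce-topology-block
  containers:
  - name: worker
    image: example.com/dynamo-worker:latest          # placeholder
    resources:
      limits:
        nvidia.com/gpu: 4        # each A4X VM exposes 4 Blackwell GPUs
&lt;/code&gt;&lt;/pre&gt;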
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Low-latency model loading via GCS FUSE: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Serving massive MoE models requires loading terabytes of weights into high-bandwidth memory (HBM). Traditional approaches that download weights to local disk incur unacceptable "cold start" latencies. We leverage the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GCS FUSE CSI Driver&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to mount model weights directly from Google Cloud Storage as a local file system. This allows the Dynamo runtime to "lazy load" the model, streaming data chunks directly into GPU memory on demand. This approach eliminates the pre-download phase, significantly reducing the time-to-ready for new inference replicas and enabling faster auto-scaling in response to traffic bursts.&lt;/span&gt;&lt;/p&gt;
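&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The mount pattern itself is simple. Below is a minimal sketch using the GCS FUSE CSI driver's ephemeral volume support; the bucket, service account, and image names are placeholders, and the recipes add the performance-tuning mount options appropriate for large weight files.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: mount model weights from Cloud Storage via the GCS FUSE CSI driver.
# Bucket, service account, and image names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: dynamo-serving
  annotations:
    gke-gcsfuse/volumes: "true"       # injects the GCS FUSE sidecar
spec:
  serviceAccountName: inference-sa    # needs Workload Identity access to the bucket
  containers:
  - name: server
    image: example.com/dynamo-server:latest
    volumeMounts:
    - name: model-weights
      mountPath: /models              # runtime streams weights from here on demand
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-model-weights-bucket
        mountOptions: "implicit-dirs"
&lt;/code&gt;&lt;/pre&gt;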
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Kernel-bypass networking (GPUDirect RDMA): &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;To maximize the aggregate 130 TB/s bandwidth of the A4X, the networking stack must minimize CPU and I/O involvement. We configure the GKE cluster to enable GPUDirect RDMA over the Titanium network adapter. By injecting specific NCCL topology configurations and enabling IPC_LOCK capabilities in the container, we allow the application to bypass the OS kernel and perform Direct Memory Access (DMA) operations between the GPU and the network interface. This configuration offloads the NVIDIA Grace CPU from data path management, so that networking I/O does not become a bottleneck during high-throughput token generation.&lt;/span&gt;&lt;/p&gt;
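&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The container-level half of that configuration is small enough to sketch: the pod needs the IPC_LOCK capability so the transport layer can pin memory for DMA registration. The NCCL environment values themselves are topology-specific, so only the generic NCCL_DEBUG setting is shown here; the recipes inject the full validated set.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: container settings that kernel-bypass networking relies on.
# Image is a placeholder; recipes add the full RDMA device and NCCL config.
apiVersion: v1
kind: Pod
metadata:
  name: dynamo-rdma-worker
spec:
  containers:
  - name: worker
    image: example.com/dynamo-worker:latest
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]       # lets the transport pin (mlock) memory for DMA
    env:
    - name: NCCL_DEBUG          # surface NCCL topology/transport selection in logs
      value: "INFO"
    resources:
      limits:
        nvidia.com/gpu: 4
&lt;/code&gt;&lt;/pre&gt;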
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Performance validation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We assessed the scaling characteristics of an 8K/1K ISL/OSL workload on DeepSeek-R1 (FP8) with SGLang for two distinct optimization targets, and observed the following (both configurations are summarized in a sketch after the lists below).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Throughput-optimized configuration&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Setup:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; All 72 GPUs utilizing DeepEP. 10 prefill nodes with 5 workers (TP8) and 8 decode nodes with 1 worker (TP32).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Result:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We sustained over &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;6K total tokens/sec/GPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (1.5K output tokens/sec/GPU), which matches the performance published by InferenceMAX (&lt;/span&gt;&lt;a href="https://github.com/InferenceMAX/InferenceMAX/actions/runs/20356790608/job/58493812121" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;source&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Latency-optimized configuration&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Setup:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 8 GPUs (two nodes) without DeepEP. 1 prefill node with 1 prefill worker (TP4) and 1 decode node with 1 decode worker (TP4). &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Result:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We sustained a median Inter-Token Latency (ITL) of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;10ms&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; at a concurrency of 4, which matches the performance published by InferenceMAX (&lt;/span&gt;&lt;a href="https://github.com/InferenceMAX/InferenceMAX/actions/runs/20413316138/job/58653323053" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;source&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
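&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For quick reference, the same two validated configurations side by side:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# DeepSeek-R1 (FP8) with SGLang, 8K input / 1K output tokens
throughput_optimized:
  gpus: 72
  deepep: true
  prefill: {nodes: 10, workers: 5, parallelism: TP8}
  decode:  {nodes: 8,  workers: 1, parallelism: TP32}
  result: "over 6K total tokens/sec/GPU (1.5K output tokens/sec/GPU)"
latency_optimized:
  gpus: 8
  deepep: false
  prefill: {nodes: 1, workers: 1, parallelism: TP4}
  decode:  {nodes: 1, workers: 1, parallelism: TP4}
  result: "10ms median inter-token latency at concurrency 4"
&lt;/code&gt;&lt;/pre&gt;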
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Looking ahead&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As models evolve from static chat interfaces to complex, multi-turn reasoning agents, the requirements for inference infrastructure will continue to shift. We are actively updating and releasing benchmarks and recipes as we invest across all three layers of the AI inference stack to meet these demands:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure layer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;recently released A4X Max&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is based on the NVIDIA GB300 NVL72 in a single 72 GPU rack configuration, bringing 1.5X more NVFP4 FLOPs, 1.5X more GPU memory, and 2X higher network bandwidth compared to A4X. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving layer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We are actively exploring deeper integrations with components of NVIDIA Dynamo, e.g., pairing KV Block Manager with Google Cloud remote storage, funneling Dynamo metrics into our Cloud Monitoring dashboards for enhanced observability, and leveraging GKE Custom Compute Classes (CCC) for better capacity and obtainability, as well as setting a new baseline with FP4 precision.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Orchestration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We plan to incorporate additional optimizations into these tests, e.g. &lt;/span&gt;&lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as the intelligent inference scheduling component, following the design patterns established in the llm-d &lt;/span&gt;&lt;a href="https://llm-d.ai/docs/guide" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;well-lit paths&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. We aim to provide a centralized mechanism for sophisticated traffic orchestration — handling request prioritization, queuing, and multi-model routing before the workload ever reaches the serving-layer runtime.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you are deploying massive MoE models or architecting the next generation of reasoning agents, this stack provides the exascale foundation required to turn frontier research into production reality. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Get started today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we’re committed to providing the most open, flexible, and performant infrastructure for your AI workloads. With full support for the NVIDIA Dynamo suite — from intelligent routing and scaling to the latest NVIDIA AI infrastructure — we provide a complete, production-ready solution for serving LLMs at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We updated our deployment repository with two specific recipes for the A4X machine class: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md#32-sglang-deployment-with-deepep-72-gpus" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Recipe for throughput optimized&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; - 72 GPUs with DeepEP&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md#sglang-wo-deepep" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Recipe for latency optimized&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; - 8 GPUs without DeepEP&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We look forward to seeing what you build!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 22 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</guid><category>AI &amp; Machine Learning</category><category>AI Hypercomputer</category><category>GKE</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Scaling WideEP Mixture-of-Experts inference with Google Cloud A4X (GB200) and NVIDIA Dynamo</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sean Horgan</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ling Lin</name><title>Software Engineer</title><department></department><company></company></author></item><item><title>Simplify VM OS agent management at scale: Introducing VM Extensions Manager</title><link>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</link><description>&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_d395npc.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you're an IT administrator, you know that managing Operating System (OS) agents (Google calls them extensions) across a large fleet of VM instances can be complex and frustrating. Indeed, this operational overhead can be a major barrier to adopting extension-based services on VM fleets, despite the fact that they unlock powerful application-level capabilities.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To solve this problem, we’re excited to announce the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preview of&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;VM Extensions Manager&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a new capability integrated directly into the Compute Engine API that simplifies installing and managing these Google-provided extensions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager provides a centralized, policy-driven framework for managing the entire lifecycle of Google Cloud extensions on your VM instances. Instead of relying on manual scripts, startup scripts, or other bespoke solutions, you can now define a policy to ensure all your VM instances — both existing and new — conform to that state, reducing operational overhead from months to hours.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;How to get started with VM Extensions Manager&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager is integrated directly into the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;compute.googleapis.com&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; API, meaning there are no new APIs to discover or enable. You can get started in minutes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Define your extension policy&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;First, define a policy that specifies the desired state of your extensions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For the preview, you can create &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;zonal policies&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; at the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Project level&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This policy targets VM instances within a single, specific zone. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Over the coming months, we’ll expand support to include &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;global policies&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, as well as policies at the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Organization&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Folder levels&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This will allow you to build a flexible hierarchy of policies (using priorities) to manage your extensions across your enterprise fleet from a single control plane.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can create this policy directly from the Google Cloud console: &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_2Dllyl3.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Demo of Creating VM Extension policy using Cloud Console&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_Bayaqjl.gif"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Select your extensions&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In the policy, you select the Google Cloud extensions you want to manage. For the preview, VM Extensions Manager supports several critical Google Cloud extensions, including:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/logging/docs/agent/ops-agent/agent-vmem-policies"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Ops Agent&lt;/strong&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;(ops-agent): The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/workload-manager/docs/evaluate/set-up-agent-for-sap"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Agent for SAP&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (sap-extension): Google Cloud's Agent for SAP is provided by Google Cloud for the support and monitoring of SAP workloads running on Compute Engine instances and Bare Metal Solution servers.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/agent-for-compute-workloads"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Agent for Compute Workload&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (workload-extension): The Agent for Compute Workloads lets you monitor and evaluate workloads running on Compute Engine.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll be adding support for more extension-based services in the coming months.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can choose to pin a specific extension version or leave it empty (the default) to get the latest extension installed. If you choose the default, VM Extensions Manager automatically handles the rollout of new versions as they are released — no more waiting to access new features and improvements.&lt;/span&gt;&lt;/p&gt;
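&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate the pin-versus-latest choice, here is a hypothetical sketch of a policy body. The field names and version values below are illustrative only, not the actual API schema; see the documentation linked at the end of this post for the authoritative format.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical policy sketch -- field names and versions are illustrative,
# not the real VM Extensions Manager schema.
extensions:
  ops-agent:
    version: ""          # empty (default): always roll out the latest release
  sap-extension:
    version: "3.8"       # pinned: stays on 3.8 until the policy is updated
&lt;/code&gt;&lt;/pre&gt;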
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Roll out global policy with more control&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager gives you control over how global policy changes are deployed across many zones through configurable rollout speeds. Zonal policies don't offer rollout speeds; they are enforced immediately once the VMs are online.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the coming weeks, we will expand support for global policies via gcloud first and update the documentation accordingly. UI updates will follow in the coming months. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At preview, however, global policy lets you select two distinct rollout speeds:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;SLOW&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; (Recommended):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is the default option, designed for safety. It orchestrates a zone-by-zone rollout (within the scope of the policy) with a built-in wait time between waves, minimizing the potential blast radius of a problematic change over a period of time, by default 5 days. This is perfect for standard maintenance and updates.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;FAST&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This option eliminates the wait time between waves, executing the change across the entire fleet across zones as quickly as possible. It is intended for urgent use cases, such as deploying a critical security patch in a "break-glass" emergency scenario across all VMs in all zones.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once you save the policy, VM Extensions Manager takes over. The underlying progressive rollout engine manages the complex orchestration, and you can monitor its progress.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A flexible system for standardization and control&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager is designed to bring standardization and control to extensions on your VM fleets. You can start today by applying zonal policies to your projects to ensure extensions are correctly installed on VM instances in the correct zones.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started defining Extension policies for your Compute Engine VM instances, read the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/vm-extensions/about-vm-extension-manager"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to create your first policy. We're excited to see how you use VM Extensions Manager to standardize, secure, and simplify the management of your VM fleet.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 05 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</guid><category>Management Tools</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Simplify VM OS agent management at scale: Introducing VM Extensions Manager</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Omkar Suram</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mike Columbus</name><title>CE Director, Northam Platform Specialists</title><department></department><company></company></author></item><item><title>Automate AI and HPC clusters with Cluster Director, now generally available</title><link>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The complexity of the infrastructure behind AI training and high performance computing (HPC) workloads can really slow teams down. At Google Cloud, where we work with some of the world’s largest AI research teams, we see it everywhere we go: researchers hampered by complex configuration files, platform teams struggling to manage GPUs with home-grown scripts, and operational leads battling the constant, unpredictable hardware failures that derail multi-week training runs. Access to raw compute isn't enough. To operate at the cutting edge, you need &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;reliability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that survives hardware failures, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orchestration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that respects topology, and a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;lifecycle&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; management strategy that adapts to evolving needs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are delivering on those requirements with the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;General Availability (GA) of &lt;/strong&gt;&lt;a href="https://cloud.google.com/products/cluster-director" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Preview&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; of Cluster Director support for Slurm on &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine?utm_source=google&amp;amp;utm_medium=cpc&amp;amp;utm_campaign=na-CA-all-en-dr-bkws-all-all-trial-e-dr-1710134&amp;amp;utm_content=text-ad-none-any-DEV_c-CRE_772382725406-ADGP_Hybrid+%7C+BKWS+-+EXA+%7C+Txt-AppMod-GKE-Kubernetes+Engine-KWID_335784956140-kwd-335784956140&amp;amp;utm_term=KW_kubernetes+google-ST_kubernetes+google&amp;amp;gclsrc=aw.ds&amp;amp;gad_source=1&amp;amp;gad_campaignid=22976548925&amp;amp;gclid=Cj0KCQiAgP_JBhD-ARIsANpEMxxNCV54Smw89kgAplcXoolCw8LdVBSA9buRDhHT_4QlTybV4LZoqKIaAqJcEALw_wcB&amp;amp;e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director (GA) is a managed infrastructure service designed to meet the rigorous demands of modern supercomputing. It replaces fragile DIY tooling with a robust topology-aware control plane that handles the entire lifecycle of Slurm clusters, from the first deployment to the thousandth training run. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;We are expanding Cluster Director to support Slurm on GKE (Preview), designed to give you the best of both worlds: the familiar precision of high-performance scheduling and the automated scale of Kubernetes. It achieves this by treating GKE node pools as a direct compute resource for your Slurm cluster, allowing you to scale your workloads with Kubernetes' power without changing your existing Slurm workflows.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director, now GA&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director offers advanced capabilities at each phase of the cluster lifecycle, spanning preparation (Day 0), where infrastructure design and capacity are determined; deployment (Day 1), where the cluster is automatically deployed and configured; and monitoring (Day 2), where performance, health, and optimization are continuously tracked.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This holistic approach ensures that you get the benefits of fully configurable infrastructure while automating lower-level operations so your compute resources are always optimized, reliable, and available. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;So, what does all this cost? That’s the best part. There's no extra charge to use Cluster Director. You only pay for the underlying Google Cloud resources — your compute, storage, and networking.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How Cluster Director supports each phase of deployment&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 0: Preparation &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Standing up a cluster typically involves weeks of planning, wrangling Terraform, and debugging the network. Cluster Director changes the ‘Day 0’ experience entirely, with tools for designing infrastructure topology that’s optimized for your workload requirements. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_gBjYYUA.gif"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To streamline your Day 0 setup, Cluster Director provides:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reference architectures:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’ve codified Google’s internal best practices into reusable cluster templates, enabling you to spin up standardized, validated clusters in minutes. This helps ensure that every team in your organization is using the same security standards for their deployments and deploying on infrastructure that is configured correctly by default — right down to the network topology and storage mounting. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Guided configuration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We know that having too many options can lead to configuration paralysis. The Cluster Director control plane guides you through a&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;streamlined setup flow. You select your resources, and our system handles the complex backend mapping, ensuring that storage tiers, network fabrics, and compute shapes are compatible and optimized before you deploy.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Broad hardware support:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Cluster Director offers &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-director/docs/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;full support&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for large-scale AI systems, including Google Cloud’s &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X and A4X Max VMs powered by &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA GB200 and GB300 GPUs, and versatile CPUs such as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;N2 VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for cost-effective login nodes and debugging partitions.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible consumption options:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Cluster Director integrates with your preferred procurement strategy, with support for &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/reservations-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Reservations&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for guaranteed capacity during critical training runs, &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Workload Scheduler&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; Flex-start &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;for dynamic scaling, or &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/spot-vms?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Spot VMs&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for opportunistic low-cost runs.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Google Cloud's Cluster Director is optimized for managing large-scale AI and HPC environments. It complements the power and performance of NVIDIA's accelerated computing platform. Together, we're providing customers with a simplified, powerful, and scalable solution to tackle the next generation of computing challenges.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Dave Salvator, Director of Accelerated Computing Products, NVIDIA&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 1: Deployment &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying hardware is one thing, but maximizing performance is another thing entirely. Day 1 is the execution phase, where your configuration transforms into a fully operational cluster. The good news is that Cluster Director doesn't just provision VMs, it validates that your software and hardware components are healthy, properly networked, and ready to accept the first workload.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_MyVTseY.gif"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To ensure a high-performance deployment, Cluster Director automates:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting a clean "bill of health":&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Before your job ever touches a GPU, Cluster Director runs a rigorous suite of health checks, including &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;DCGMI&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; diagnostics and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NCCL&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; performance validation, to verify the integrity of the network, storage, and accelerators.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Keeping accelerators fed with data:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Storage throughput is often the silent killer of training efficiency. That’s why Cluster Director fully supports Google Cloud Managed Lustre with selectable performance tiers, allowing you to attach high-throughput parallel storage directly to your compute nodes, so your GPUs are never starved for data.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximizing Interconnect Performance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To achieve peak scaling, Cluster Director implements topology-aware scheduling and compact placement policies. By utilizing dense reservations on Google’s non-blocking fabric, the system ensures that your distributed workloads are placed on the shortest physical path possible, minimizing tail latency and maximizing collective communication (NCCL) speeds from the get-go.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Cluster Director is an amazing product, which has enabled me to spin up a ready to use Nvidia GPU cluster with Slurm, including all networking, routing, and high performance network file-system for large-scale distributed model training within less than an hour. The cluster was immediately ready to run our containerizedAI training workloads with excellent throughput with only minimal customization effort."&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Dr. Florian Eyben, Head of AI Foundation Models &amp;amp; Speech Technology, Agile Robots SE, Munich, Germany&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 2: Monitoring&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The reality of AI and HPC infrastructure is that hardware fails and requirements change. A rigid cluster is an inefficient cluster. As you move into the ongoing “Day 2” operational phase, you need to maintain cluster health and maximize utilization and performance. Cluster Director provides a control plane equipped for the complexities of long-term operations. Today we are introducing new &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;active cluster management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; capabilities to handle the messy reality of Day 2 operations.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_VSuBKiw.gif"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;New active cluster management capabilities include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Topology-level visibility:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can’t orchestrate what you can’t see. Cluster Director’s observability graphs and topology grids let you visualize your entire fleet, spot thermal throttles or interconnect issues, and optimize job placement based on physical proximity.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;One-click remediation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; When a node degrades, you shouldn't have to SSH in to debug it. Cluster Director allows you to replace faulty nodes with a single click directly from the Google Cloud console. The system handles the draining, teardown, and replacement, returning your cluster to full capacity in minutes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Adaptive infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; When your research needs change, so should your cluster. You can now modify active clusters, with activities such as adding or removing storage filesystems, on the fly, without tearing down the cluster or interrupting ongoing work.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director support for Slurm on GKE, now in preview&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Innovation thrives in the open. Google, the creator of Kubernetes, and SchedMD, the developers behind Slurm, have long championed the open-source technologies that power the world's most advanced computing. For years, NVIDIA and SchedMD have worked in lockstep to optimize GPU scheduling, introducing foundational features like the Generic Resource (GRES) framework and Multi-Instance GPU (MIG) support that are essential for modern AI. By acquiring SchedMD, NVIDIA is doubling down on its commitment to Slurm as a vendor-neutral standard, ensuring that the software powering the world's fastest supercomputers remains open, performant, and perfectly tuned for the future of accelerated computing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on this foundation of accelerated computing, Google is deepening its collaboration with SchedMD to answer a fundamental industry challenge: how to bridge the gap between cloud-native orchestration and high-performance scheduling. We are excited to announce the Preview of Cluster Director support for Slurm on GKE, utilizing SchedMD’s Slinky offering.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This initiative brings together the two standards of the infrastructure world. By running a native Slurm cluster directly on top of GKE, we are amplifying the strengths of both communities:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Researchers &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;get the uncompromised Slurm interface and batch capabilities, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sbatch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;squeue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, that have defined HPC for decades.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Platform teams&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; gain the operational velocity that GKE, with its auto-scaling, self-healing, and bin-packing, brings to the table.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Slurm on GKE is strengthened by our long-standing partnership with SchedMD, which helps create a unified, open, and powerful foundation for the next generation of AI and HPC workloads. &lt;/span&gt;&lt;a href="https://forms.gle/LaV116jNy2CvAnNV8" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Request preview access now&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Try Cluster Director today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to start using Cluster Director for your AI and HPC cluster automation? &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about the end-to-end capabilities in &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-director/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;Activate &lt;/span&gt;&lt;a href="http://console.cloud.google.com/cluster-director"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the console.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Wed, 17 Dec 2025 18:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Automate AI and HPC clusters with Cluster Director, now generally available</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ilias Katsardis</name><title>Sr. Product Manager, Cluster Director, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Jason Monden</name><title>Group Product Manager, AI Infrastructure, Google Cloud</title><department></department><company></company></author></item><item><title>Google named a Leader in The Forrester Wave™: AI Infrastructure Solutions, Q4 2025</title><link>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For most organizations, the question is no longer &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;if&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; they will use AI, but &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;how&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to scale it from a promising prototype into a production-grade service that drives business outcomes. In this age of inference, competitive advantage is defined by your ability to serve useful information to users around the world at the lowest possible cost. As you move from demos to production deployments at scale, you need to simplify infrastructure operations with integrated systems that provide the latest AI software and accelerator hardware platforms, while keeping costs and architectural complexity low. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Yesterday, Forrester released &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; report, evaluating 13 vendors, and we believe their findings validate our commitment to solving these core challenges. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google received the highest score of all vendors in the Current Offering category &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and received the highest possible score in 16 out of 19 evaluation criteria, including, but not limited to: Vision, Architecture, Training, Inferencing, Efficiency, and Security.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/resources/content/2025-forrester-wave-ai-infrastructure"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Access the full report&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerating time-to-value with an integrated system&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Enterprises don’t run AI in a vacuum. They need to integrate it with a diverse range of applications and databases while adhering to stringent security protocols. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Forrester recognized Google Cloud’s strategy of co-design by giving us the highest possible score in the Efficiency and Scalability criteria:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Google pursues a strategy of silicon-infrastructure co-design. It develops TPUs to improve inference efficiency and NVIDIA GPUs for access to broader ecosystem compatibility. Google designs TPUs to integrate tightly with its networking fabric, giving customers high bandwidth and low latency for inference at scale.”&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For over two decades, we have operated some of the world's largest services, from Google Search and YouTube to Maps, where their unprecedented scale required us to solve problems that no one else had. We couldn't simply buy the platform and infrastructure we needed; we had to invent it. This led to a decade-long journey of deep, system-level co-design, building everything from our custom network fabric and specialized accelerators to frontier models, all under one roof. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The result was an integrated supercomputing system, AI Hypercomputer, which has paid significant dividends for our customers. It supports a wide range of AI-optimized hardware, allowing you to optimize for granular, workload-level objectives — whether that's higher throughput, lower latency, faster time-to-results, or lower TCO. That means you can use our custom &lt;/span&gt;&lt;a href="https://cloud.google.com/tpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tensor Processing Units&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (TPUs), the latest &lt;/span&gt;&lt;a href="https://cloud.google.com/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA GPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, or both, backed by a system that tightly integrates accelerators with networking and storage for exceptional performance and efficiency. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;It’s&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; also why today, leading generative AI companies such as Anthropic, Lightricks, and LG AI Research trust Google Cloud to power their most demanding AI workloads.&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This system-level integration lays the foundation for speed, but operational complexity could still slow you down. To accelerate your time-to-market, we provide multiple ways to deploy and manage AI infrastructure, abstracting away the heavy lifting regardless of your preferred workflow. Google Kubernetes Engine (GKE) Autopilot automates management for containerized applications, helping customers like LiveX.AI reduce operational costs by 66%. Similarly, Cluster Director simplifies deployment for Slurm-based environments, enabling customers like LG AI Research to slash setup time from 10 days to under one day. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Managing AI cost and complexity&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Forrester gave Google Cloud the highest scores possible in the Pricing Flexibility and Transparency criterion. The price of compute is only one part of the AI infrastructure cost equation. A complete view should also account for development costs, downtime and inefficient resource utilization. We offer optionality at every layer of the stack to provide the flexibility businesses demand.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible consumption:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamic Workload Scheduler allows you to secure compute at up to 50% savings, by ensuring you only pay for the capacity you need, when you need it.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Load balancing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: GKE Inference Gateway improves throughput by using AI-aware routing to balance requests across models, preventing bottlenecks and ensuring servers aren't sitting idle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Eliminating data bottlenecks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Anywhere Cache co-locates data with compute, reducing read latency by up to 96% and eliminating the "integration tax" of moving data. By using Anywhere Cache together with our unified data platform BigQuery, you can avoid latency and egress fees while keeping your accelerators fed with data. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Mitigating strategic risk through flexibility and choice&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also committed to enabling customer choice across accelerators, frameworks and multicloud environments. This isn’t new for us. Our deep experience with Kubernetes, which we developed then open-sourced, taught us that open ecosystems are the fastest path to innovation and provide our customers with the most flexibility. We are bringing that same ethos to the AI era by actively contributing to the tools you already use.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Open source frameworks and hardware portability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We continue to support open frameworks such as PyTorch, JAX, and Keras. We’ve also directly addressed concerns about workload portability on custom silicon by investing in TPU support for vLLM, allowing developers to easily switch between TPUs and GPUs (or use both) with only minimal configuration changes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid and multicloud flexibility:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our commitment to choice extends to where you run your applications. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; brings our services to on-premises, edge and cloud locations, while &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cross-Cloud Network&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; securely connects applications and users with high-speed connectivity between your environments and other clouds. This powerful combination means you're no longer locked into a specific environment; you can easily migrate workloads and apply uniform management practices, streamlining operations, and mitigating the risk of lock-in.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
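&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the portability point concrete, below is a minimal, hedged sketch using vLLM’s Python API; the model name and sampling values are illustrative, and the intent is that the same application code runs whether the installed vLLM backend targets GPUs or TPUs.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal vLLM sketch; the model name and sampling values are illustrative.
from vllm import LLM, SamplingParams

# vLLM picks up the available accelerator from the installed backend, so
# moving between GPUs and TPUs is primarily an environment change, not a
# code change.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize the benefits of workload portability."], params)
for out in outputs:
    print(out.outputs[0].text)
&lt;/pre&gt;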
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Systems you can rely on&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When your entire business model depends on the availability of AI services, infrastructure uptime is critical. Google Cloud's global infrastructure is engineered for enterprise-grade reliability, an approach rooted in our history as the birthplace of Site Reliability Engineering (SRE).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We operate one of the world's largest private software-defined networks, handling approximately 25% of global internet egress traffic. Unlike providers that rely on the public internet, we keep your traffic on Google’s own fiber to improve speed, reliability, and latency. This global backbone is powered by our Jupiter data center fabric, which scales to 13 Petabits/sec of bandwidth, delivering 50x greater reliability than previous generations — to say nothing of other providers. Finally, to improve cluster-level fault tolerance, we employ capabilities like &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;elastic training and multi-tier checkpointing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which allow jobs to continue uninterrupted, by dynamically resizing the cluster around failed nodes while minimizing the time to recovery.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Building on a secure foundation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our approach is to secure AI from the ground up. In fact, Google Cloud maintains a leading track record for cloud security. Independent analysis from cloudvulndb.org (2024-2025) shows that our platform has up to 70% fewer critical and high vulnerabilities compared to the other two leading cloud providers. We were also the first in the industry to publish an AI/ML Privacy Commitment, which guarantees that we do not use your data to train our models. With those safeguards in place, security is integrated into the foundation of Google Cloud, based on the zero-trust principles that protect Google’s own services:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A hardware root of trust:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our custom Titan chips, as part of our Titanium architecture, create a verifiable hardware root of trust. We recently extended this with Titanium Intelligence Enclaves for &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-private-ai-compute/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Private AI Compute&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing you to process sensitive data in a hardened, isolated, and encrypted environment.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Built-in AI security:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/security-command-center"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Security Command Center (SCC)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; natively integrates with our infrastructure, providing &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/introducing-ai-protection-security-for-the-ai-era"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Protection&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by automatically discovering assets, preventing security issues, detecting active threats with frontline &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/threat-intelligence"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Threat Intelligence&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and discovering known and unknown risks before attackers can exploit them.  &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Sovereign solutions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We enable you to meet stringent data residency, operational control, and software sovereignty requirements through solutions like &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Data Boundary&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This is complemented by flexible options like partner-operated sovereign controls and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for air-gapped needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Platform controls for AI and agent governance: &lt;/strong&gt;&lt;a href="https://cloud.google.com/vertex-ai"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the essential governance layer for the enterprise builder to deploy models and agents at scale. This trust is anchored in Google Cloud’s secure-by-default infrastructure, utilizing platform controls like &lt;/span&gt;&lt;a href="https://cloud.google.com/security/vpc-service-controls"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;VPC Service Controls (VPC-SC)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kms/docs/cmek"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Customer-Managed Encryption Keys (CMEK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to sandbox environments and protect sensitive data, and Agent Identity for granular IAM permissions. At the platform level, Vertex AI and &lt;/span&gt;&lt;a href="https://cloud.google.com/products/agent-builder"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Builder&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; integrate &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/model-armor"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Model Armor&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provide runtime protection against emergent agentic threats, such as prompt injection and data exfiltration. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Delivering continuous AI innovation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are honored to be recognized as a Leader in The Forrester Wave™ report, which we believe validates decades of R&amp;amp;D and our approach to building ultra-scale AI infrastructure. Look to us to continue on this path of system-level innovation as we help you convert the promise of AI into a reality.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Access the full report:&lt;/strong&gt; &lt;a href="https://cloud.google.com/resources/content/2025-forrester-wave-ai-infrastructure"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;1. IDC Business Value Snapshot, Sponsored by Google Cloud, The Business Value of Google Cloud AI Hypercomputer, US53855425, October 2025&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 17 Dec 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google named a Leader in The Forrester Wave™: AI Infrastructure Solutions, Q4 2025</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Saurabh Tiwary</name><title>VP &amp; GM, Cloud AI</title><department></department><company></company></author></item><item><title>AI agents are here. Is your infrastructure ready?</title><link>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;&lt;strong&gt;Editor’s note&lt;/strong&gt;: Today we hear from Dave McCarthy of IDC about a total cost of ownership crisis for AI infrastructure — and what you can do about it. Read on for his insights.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI landscape is undergoing a seismic shift. For the past few years, the industry has been focused on the massive, resource-intensive process of training generative AI models. But the focus is now rapidly pivoting to a new, even larger challenge: inference.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Inference — the process of using a trained model to make real-time predictions — is no longer just one part of the AI lifecycle; it is quickly becoming the dominant workload. In a recent IDC global survey of over 1,300 AI decision-makers, inference was already cited as the largest AI workload segment, accounting for 47% of all AI operations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This dominance is driven by the sheer volume of real-world applications. While a model is trained periodically, it is used for inference non-stop, with every user query, API call, and recommendation. It is also critical to recognize that this inference surge will be distributed across hybrid environments. According to IDC survey respondents, 63% of workloads will reside in the cloud, which remains the standard for scalable applications like content creation and chatbots. In contrast, 37% will be deployed on on-premises infrastructure, usually related to use cases such as robotics and other systems that interact directly with the physical world.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, a new factor is set to multiply this demand: the rise of autonomous and semi-autonomous AI agents.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These "agentic workflows" represent the next logical step in AI, where models don't just respond to a single prompt but execute complex, multi-step tasks. An AI agent might be asked to "plan a trip to Paris," requiring it to perform dozens of interconnected operations: browsing for flights, checking hotel availability, comparing reviews, and mapping locations. Each of these steps is an inference operation, creating a cascade of requests that must be orchestrated across different systems.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This surge in demand is exposing a critical vulnerability for many organizations: the AI efficiency gap.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The TCO crisis in an age of agents&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI efficiency gap is the difference between the theoretical performance of an AI stack and the actual, real-world performance achieved. This gap is the source of a Total Cost of Ownership (TCO) crisis, and it’s driven by system-wide inefficiencies.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our research shows that more than half (54.3%) of organizations use multiple AI frameworks and hardware platforms. While this flexibility seems beneficial, it has a staggering downside: 92% of these organizations report a negative effect on efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This fragmented "patchwork" approach, stitched together from disparate and non-optimized services, creates a ripple effect of problems:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;41.6% reported increased compute costs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Redundant processes and poor utilization drive up spending.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;40.4% reported increased engineering complexity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Teams spend more time managing the fragmented stack than delivering value.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;40.0% reported increased latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Bottlenecks in one part of the system (like storage or networking) degrade the overall performance of an application.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core problem is that organizations are paying for expensive, high-performance accelerators, but are failing to keep them busy. Our data shows that 29% of all AI budget waste is tied to inference. This waste is a direct result of idle GPU time (cited by 29.4% of respondents) and inefficient use of resources (22.3%).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When an expensive accelerator is idle, it’s often waiting for data from a slow storage system or for the application server to prepare the next request. This is a system-level failure, not a component failure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This failure is often compounded by significant hurdles in data management, which serves as the fuel for these AI engines. Survey respondents highlighted three primary challenges contributing to this gap: 47.7% struggle with ensuring data quality and governance, 45.6% grapple with data storage management and related costs, and 44.1% cite the complexity and time required for data cleaning and preparation. When data pipelines cannot keep pace with high-speed accelerators, the entire infrastructure becomes inefficient.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Closing the gap: From fragmented stacks to integrated systems&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To scale cost-effectively in the age of AI agents, we must stop thinking about individual components and start focusing on system-level design.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;An agentic workflow, for example, requires tight coordination between two distinct types of compute:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;General-purpose compute&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the operational backbone. It runs the application servers, orchestrates the workflow, pre-processes data, and handles all the logic around the model.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Specialized accelerators&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the high-performance engine that runs the AI model itself.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a fragmented environment, these two sides are inefficiently connected, and latency skyrockets. The path forward is an optimized architecture where the software, networking, storage, and compute — both general-purpose and specialized — are designed to work as a single, cohesive system.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This holistic approach is the only sustainable way to manage the TCO of AI. It redefines the goal away from simply buying faster accelerators and toward improving the overall "price-performance" and "unit economics" of the entire end-to-end workflow. By eliminating bottlenecks and maximizing the utilization of every resource, organizations can finally close the efficiency gap. Organizations are actively shifting strategies to capture this value. Our survey indicates that 28.9% of respondents are prioritizing model optimization techniques, while 26.3% are partnering with AI service providers to navigate this complexity. Additionally, 25% are investing in training to upskill their teams, ensuring they can increase the value of their AI investments.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The age of inference is here, and the age of agents is right behind it. This next wave of innovation will be won not by the organizations with the most powerful accelerators, but by those who build the most efficient, integrated, and cost-effective systems to power them.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A message from Google Cloud&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;We sponsored this IDC research to help IT leaders navigate the critical shift to the "Age of Inference." We recognize that the "efficiency gap" identified here — driven by fragmented stacks and idle resources — is the primary barrier to sustainable ROI. That is why we created AI Hypercomputer: an integrated supercomputer system designed to deliver exceptional performance and efficiency for demanding AI workloads. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;IDC surveyed 1,300 global IT leaders to uncover how they are designing their stack for maximum efficiency and ROI. Get your free copy of the whitepaper to learn more: &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/content/ai-efficiency-gap"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;The AI Efficiency Gap: From TCO Crisis to Optimized Cost and Performance&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 11 Dec 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI agents are here. Is your infrastructure ready?</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dave McCarthy</name><title>Research Vice President, Cloud and Edge Infrastructure Services, IDC</title><department></department><company></company></author></item><item><title>Nutanix NC2 is now officially supported on Google Cloud</title><link>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are thrilled to announce Nutanix Cloud Clusters (NC2) is generally available on Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Google Cloud is designed to migrate and modernize specialized, regulated, and mission-critical applications without refactoring your workloads or compromising on performance. This partnership brings the power of Google Cloud’s infrastructure and advanced AI models to your hybrid cloud, without compromising on data residency, connectivity, or operational consistency. You can now run your Nutanix Hybrid Cloud directly on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"The General Availability of Nutanix Cloud Clusters (NC2) on Google Cloud is a significant milestone empowering our joint customers to become AI-ready. We are excited to extend the simplicity and resilience of Nutanix NC2 onto Google Cloud's high-performance workload-optimized compute. Nutanix on Google Cloud enables our customers to migrate and modernize their critical workloads while unlocking the full power of Google’s industry-leading data and AI capabilities." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Saveen Pakala, VP, Product Management, Hybrid Cloud, Nutanix&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Nutanix and Google Cloud allow you to maximize agility and minimize disruption for your critical applications. By combining NC2's enterprise flexibility with Google Cloud's power, you gain access to three core advantages. First, your workloads run on Compute Engine’s dynamically scalable workload-optimized infrastructure powering all machine families. Nutanix NC2 supports Compute Engine bare metal instances in the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/storage-optimized-machines?_gl=1*2vt8da*_up*MQ..&amp;amp;gclid=CjwKCAiAqfe8BhBwEiwAsne6gduqCwwkpJZbE9aPtQmusSUIJYOzGeKiVzaE-1_M9aml0iqY5L8_IBoCh90QAvD_BwE&amp;amp;gclsrc=aw.ds#z3_machine_types"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Z3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines#c4_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; families. These are powered by the &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium offload system&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and leverage &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/local-ssd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium SSDs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;low-latency, high-throughput storage performance&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, hosted in Google Cloud with &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;global reach, enterprise-grade security&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and commitment to sustainability. Second, you accelerate AI innovation&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; by co-locating data and machine learning services like &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Gemini Enterprise&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; and &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;. Finally, you can save costs by dynamically &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;scaling capacity&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; and  utilizing &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;committed use discounts (CUDs)&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; and &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Flex CUDs&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Key use cases to accelerate your cloud journey&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The integration of NC2 on Google Cloud offers flexible, strategic options for hybrid cloud operations. Beyond consolidation and cost control, these capabilities set the stage for true modernization:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Seamless workload migration: Move entire applications between your on-premises Nutanix environment and Google Cloud without re-factoring or re-architecting. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This capability saves significant time during data center consolidation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Consistent operations: Maintain the &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;same management plane, security policies, and automation&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; across your private data center and Google Cloud, which dramatically reduces operational complexity and training costs.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Disaster recovery (DR): Leverage Google Cloud as a robust and cost-efficient recovery target. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Usage of a minimal “pilot light” cluster reduces compute costs, so you scale up only when a disaster event occurs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Capacity bursting: Instantly add capacity in the cloud to handle seasonal demands, VDI workloads, development/test &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;cycles&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, or requirements from mergers and acquisitions (M&amp;amp;A).&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;License portability: Protect your software investments by easily moving your existing &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Nutanix software licenses&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; to Google Cloud as your business needs evolve.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Like many others, we are always on a journey to modernize and shift to achieve the best outcomes for our customers. Nutanix Cloud Clusters (NC2) on Google Cloud brings us a solid platform to continue our hybrid cloud expansion. Our ability to seamlessly run workloads on-premises and on NC2 on Google Cloud without having to re-factor is increasingly valuable as we continue our modernization journey. We look forward to continuing our strong partnership with Google Cloud and Nutanix.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- VP of IT at a global oil &amp;amp; gas company based in Oklahoma&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The architecture &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Compute Engine simplifies building a hybrid cloud by deploying the Nutanix Cloud Infrastructure (NCI) software stack, including the Acropolis Hypervisor (AHV), directly onto high-performance Compute Engine infrastructure.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_dJgDPX1.max-1000x1000.png"
        
          alt="image1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The key components of the solution include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Compute Engine instances:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NC2 runs on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine bare metal instances&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the recently introduced C4 and Z3 machine families.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;These powerful instances provide the foundation with high-density compute, memory, local NVMe storage, and high network bandwidth.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div align="center"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Machine Family &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;GCE Machine Type&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;vCPUs&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Memory&lt;/strong&gt;  &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Storage&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Processor&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Z3, Storage Optimized &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;z3-highmem-192-highlssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1536GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;72TB of NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Sapphire Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4, General Purpose &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;c4-highmem-288-lssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;288&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1080GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;18TB of NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Granite Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4, General Purpose &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;c4-standard-288-lssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;288&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;2232GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;18TB of NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Granite Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Simplified networking :&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NC2 runs entirely within your existing Google Cloud Virtual Private Cloud (&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;VPC&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;). Built-in Nutanix Flow Virtual Networking for overlay is integrated to reduce hybrid cloud complexity. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unified management:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The entire environment, both on-premises and in Google Cloud, is managed through the familiar&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Prism Central&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;console, simplifying day-to-day operations and skill requirements for your IT teams.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Easy procurement:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Later this month, you’ll be able to purchase Nutanix NC2 licensing directly from &lt;/span&gt;&lt;a href="https://cloud.google.com/marketplace?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Marketplace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; . This offers a single, unified billing experience for both your Google Cloud infrastructure and Nutanix NC2, in one simple process. A key benefit is the ability to use your existing Google Cloud spend commitments for Nutanix NC2 software. This helps you maximize your investment and streamline your financial operations, providing more value from your cloud budget.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Connect your data to Google Cloud AI and analytics&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A significant modernization opportunity comes from connecting your stable, trusted Nutanix workloads with Google Cloud's powerful data and AI tools. Your applications running on NC2 can tap directly into services like &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;BigQuery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with low latency, enabling you to:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Derive deeper business value:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Easily send application log data, transactional records, and other operational data from your Nutanix VMs to BigQuery for real-time, scalable data warehousing and complex analysis.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build custom machine learning models:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Use Vertex&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;AI to create, deploy, and manage custom ML models that analyze data generated by your core applications (e.g., predictive maintenance or fraud detection).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Use conversational AI:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Quickly build and deploy conversational agents using technologies like Dialogflow that interact directly with the application data residing on your NC2 cluster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
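&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a hedged illustration of the first item above, the sketch below streams a few rows into BigQuery with the google-cloud-bigquery Python client; the project, dataset, table, and row fields are hypothetical placeholders.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: stream application records into BigQuery.
# The table ID and row fields are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.app_logs.transactions"

# Rows might originate from an application running on an NC2-hosted VM;
# streaming inserts make them queryable in BigQuery within seconds.
rows = [
    {"order_id": "A-1001", "amount": 42.50, "region": "us-central1"},
    {"order_id": "A-1002", "amount": 17.25, "region": "us-central1"},
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Encountered errors:", errors)
&lt;/pre&gt;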
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ready to simplify your cloud operations?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Google Cloud is currently available  across 17 Google Cloud regions, with a planned expansion continuing through 2026. For precise details on regional and zonal availability, please check the official &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances#regions_zones"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine bare metal regional availability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; documentation, and reference the &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/all-pricing?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Compute Engine pricing page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for infrastructure costs. To learn more about the solution, try taking a &lt;/span&gt;&lt;a href="https://cloud.nutanixtestdrive.com/login?type=nc2gcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;test drive&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or visit &lt;/span&gt;&lt;a href="https://cloud.google.com/find-a-partner/partner/nutanix-inc"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nutanix partner page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Available later this month, you will be able to explore NC2 on Google Cloud licensing through the Google Cloud Marketplace.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 09 Dec 2025 14:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</guid><category>Compute</category><category>Infrastructure</category><category>Partners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Nutanix NC2 is now officially supported on Google Cloud</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yarden Halperin</name><title>Product Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ziv Kalmanovich</name><title>Group Product Manager, Google Cloud</title><department></department><company></company></author></item><item><title>Running high-scale reinforcement learning (RL) for LLMs on GKE</title><link>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As Large Language Models (LLMs) evolve, Reinforcement Learning (RL) is becoming the crucial technique for aligning powerful models with human preferences and complex task objectives.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, enterprises that need to implement and scale RL for LLMs are facing infrastructure challenges. The primary hurdles include the memory contention from concurrently hosting multiple large models (such as the actor, critic, reward, and reference models), iterative switching between high latency inference generation, and high throughput training phases.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This blog details Google Cloud's full-stack, integrated approach, from custom TPU hardware to the GKE orchestration layer — and shares how you can solve the hybrid, high-stakes demands of RL at scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A quick primer: Reinforcement Learning (RL) for LLMs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RL is a continuous feedback loop that combines elements of both training and inference. At a high level, the RL loop for LLMs functions as follows:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The LLM generates a response to a given prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A "reward model" (often trained on human preferences) assigns a quantitative score, or reward, to the output.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;An RL algorithm (e.g., DPO, GRPO) uses this reward signal to update the LLM's parameters, adjusting its policy to generate higher-rewarding outputs in subsequent interactions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This generation, evaluation, and optimization continually improves the LLM's performance based on predefined objectives.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RL workloads are hybrid and cyclical. The main goal of RL is not to minimize error (training) or fast prediction (inference), but to maximize reward through iterative interaction. The primary constraint for the RL workload is not just the computational power, but also system-wide efficiency, specifically minimizing aggregate sampler latency and maximizing the speed of weight copying for efficient end-to-end step time.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud's full-stack approach to RL&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Solving these system-wide challenges requires an integrated approach. You can't just have fast hardware or a good orchestrator; you need every layer of the stack to work together. Here is how our full-stack approach is built to solve the specific demands of RL:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Flexible, high-performance compute (TPUs and GPUs):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Instead of locking customers into one path, we provide two high-performance options. Our &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU stack&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is a vertically integrated, JAX-native solution where our custom hardware (excelling at matrix operations) is co-designed with our post-training libraries (MaxText and Tunix). In parallel, we fully support the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GPU ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, partnering with NVIDIA on optimized NeMo RL recipes so customers can leverage their existing expertise directly on GKE.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Holistic, full-stack optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We integrate optimization from the bare metal up. This includes our custom TPU accelerators, high-throughput storage (Managed Lustre, Google Cloud Storage), and — critically — the orchestration and scheduling that GKE provides. By optimizing the entire stack, we can attack the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;system-wide&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; latencies that bottleneck hybrid RL workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Leadership in open-source:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; RL infrastructure is complex and built on a wide range of tools. Our leadership starts with open-sourcing Kubernetes and extends to active partnerships with orchestrators like Ray. We contribute to key projects like vLLM, develop open-source solutions like llm-d for cost-effective serving, and open-source our own high-performance MaxText and Tunix libraries. This helps ensure you can integrate the best tools for the job, not just the ones from a single vendor.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Proven, mega-scale orchestration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Post-training RL can require compute resources that rival pre-training. This requires an orchestration layer that can manage massive, distributed jobs as a single unit. GKE AI mega-clusters support up to 65,000 nodes today, and we are heavily investing in multi-cluster solutions like MultiKueue to scale RL workloads beyond the limits of a single cluster.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Running RL workloads on GKE&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Existing GKE infrastructure is well-suited for demanding RL workloads and provides several infrastructure-level efficiencies. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The image below outlines the architecture and key recommendations for implementing RL at scale. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_HnbQkXW.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="drc60"&gt;Figure : GKE infrastructure for running RL&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the base, the infrastructure layer provides the foundational hardware, including supported compute types (CPUs, GPUs, and TPUs). You can use the Run:ai model streamer to accelerate the model streaming for all three compute types. High performance storage (Managed Lustre, Cloud Storage) can be used for storage needs for RL. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The middle layer is the managed Kubernetes layer powered by GKE, which handles resource orchestration, resource obtainability (via Spot VMs or Dynamic Workload Scheduler), autoscaling, placement, job queueing, job scheduling, and more at mega scale. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, the open-frameworks layer runs on top of GKE, providing the application and execution environment. This includes managed support for open-source tools such as KubeRay and Slurm, plus the gVisor sandbox for secure, isolated task execution. &lt;/span&gt;&lt;/p&gt;
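&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate the sandboxing piece, the sketch below uses the official Kubernetes Python client to schedule a task under GKE Sandbox (gVisor) by setting the pod's runtime class. The pod name, image, and command are hypothetical, and it assumes a cluster with a Sandbox-enabled node pool.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Sketch: run an isolated task under GKE Sandbox (gVisor) by setting the
# pod's runtimeClassName. Assumes the official `kubernetes` Python client
# and a Sandbox-enabled node pool; name, image, and command are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rl-rollout-task"),
    spec=client.V1PodSpec(
        runtime_class_name="gvisor",  # request the gVisor sandbox
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rollout",
                image="us-docker.pkg.dev/my-project/rl/rollout:latest",
                command=["python", "rollout.py"],
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
&lt;/pre&gt;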
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Building an RL workflow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before creating an RL workload, you must first identify a clear use case. With that objective defined, you then architect the core components: selecting the algorithm (e.g., DPO, GRPO), the model server (such as vLLM or SGLang), the target GPU/TPU hardware, and other critical configurations.&lt;/span&gt;&lt;/p&gt;
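&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The sketch below captures those decision points as a plain configuration object. Every value shown is an illustrative placeholder drawn from the recipe referenced below, not a recommended default.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative-only configuration capturing the core RL design decisions.
# Every value is a placeholder; swap in whatever fits your use case.
from dataclasses import dataclass

@dataclass
class RLJobConfig:
    base_model: str           # model being post-trained
    algorithm: str            # e.g., "dpo" or "grpo"
    model_server: str         # rollout engine, e.g., "vllm" or "sglang"
    accelerator: str          # target GPU machine family or TPU type
    num_samplers: int         # parallel rollout workers
    rollouts_per_prompt: int  # responses sampled per prompt (GRPO group size)

config = RLJobConfig(
    base_model="Qwen/Qwen2.5-1.5B-Instruct",  # matches the A4 recipe below
    algorithm="grpo",
    model_server="vllm",
    accelerator="a4",
    num_samplers=8,
    rollouts_per_prompt=16,
)
&lt;/pre&gt;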
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, you can provision a GKE cluster configured with Workload Identity, Cloud Storage FUSE, and DCGM metrics. For robust batch processing, install the Kueue and JobSet APIs. We recommend deploying Ray as the orchestrator on top of this GKE stack. From there, you can launch the NeMo RL container, configure it for your GRPO job, and begin monitoring its execution. For the detailed implementation steps and source code, please refer to this &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/RL/a4/recipes/qwen2.5-1.5b/nemoRL" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
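&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The sketch below strings those provisioning steps together from Python via subprocess. The project, cluster name, region, and release versions are hypothetical placeholders; check every command against the recipe repository and current GKE documentation before running it.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical provisioning sequence driven from Python. Names, versions,
# and flags are placeholders to verify against the recipe repository.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

PROJECT, CLUSTER, REGION = "my-project", "rl-cluster", "us-central1"

# GKE cluster with Workload Identity, the Cloud Storage FUSE CSI driver,
# and DCGM GPU metrics enabled.
run([
    "gcloud", "container", "clusters", "create", CLUSTER,
    "--project", PROJECT, "--region", REGION,
    "--workload-pool", f"{PROJECT}.svc.id.goog",
    "--addons", "GcsFuseCsiDriver",
    "--monitoring", "SYSTEM,DCGM",
])

# Kueue and JobSet APIs for robust batch processing (versions are placeholders).
KUEUE = "https://github.com/kubernetes-sigs/kueue/releases/download/v0.9.0/manifests.yaml"
JOBSET = "https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml"
run(["kubectl", "apply", "--server-side", "-f", KUEUE])
run(["kubectl", "apply", "--server-side", "-f", JOBSET])

# KubeRay operator so Ray can orchestrate the NeMo RL job on top of GKE.
run(["helm", "repo", "add", "kuberay", "https://ray-project.github.io/kuberay-helm/"])
run(["helm", "install", "kuberay-operator", "kuberay/kuberay-operator"])
&lt;/pre&gt;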
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting started with RL&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Run RL on GPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Try the RL recipe on TPUs using &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/tutorials/grpo_with_pathways.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText and Pathways&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for GRPO algorithm, or if you use GPUs, try the &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/RL/a4/recipes" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NemoRL recipes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Partner with the open-source ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Our leadership in AI is built on open standards like Kubernetes, llm-d, Ray, MaxText or Tunix. We invite you to partner with us to build the future of AI together. Come contribute to llm-d! Join the &lt;/span&gt;&lt;a href="https://llm-d.ai/docs/community" rel="noopener" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;llm-d community&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, check out the repository on GitHub, and help us define the future of open-source LLM serving.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;</description><pubDate>Mon, 10 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Running high-scale reinforcement learning (RL) for LLMs on GKE</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Poonam Lamba</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Bogdan Berce</name><title>Software Engineer</title><department></department><company></company></author></item></channel></rss>