Nvidia Unveils Rubin CPX: A Leap to 30 PetaFLOPS and 1M+ Token Contexts
September 9, 2025 — Nvidia today announced the Rubin CPX, a purpose-built GPU designed to turbocharge inference workloads that demand massive context windows, setting a new bar for generative AI computing. Systems are slated to ship by the end of 2026, and expectations are sky-high.
What is Rubin CPX?
At its core, Rubin CPX packs a monstrous 30 petaFLOPS of NVFP4 compute, paired with a hefty 128 GB of GDDR7 memory optimized for long-context AI inference. Nvidia reports 3× faster attention processing than its GB300 NVL72 systems, substantially elevating throughput for workloads that process contexts of a million tokens or more.
This boost is critical for advanced AI tasks such as software development assistants, HD video generation, and other applications that generate or reason over very long sequences.
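Why does attention acceleration matter so much at this scale? Attention cost grows quadratically with sequence length, so a single million-token prefill pass is dominated by attention math. Here is a rough back-of-envelope sketch in Python; the model dimensions (d_model, n_layers) are illustrative assumptions, not figures from the announcement:

```python
# Back-of-envelope: attention FLOPs for one long-context prefill pass.
# Model dimensions below are illustrative assumptions, not a real model spec.

seq_len = 1_000_000   # 1M-token context
d_model = 8_192       # hidden size (assumed)
n_layers = 100        # transformer layers (assumed)

# Per layer, the QK^T score matrix and the attention-weighted V product
# each cost roughly 2 * seq_len^2 * d_model FLOPs.
attn_flops = n_layers * 2 * (2 * seq_len**2 * d_model)

rubin_cpx_flops = 30e15  # 30 petaFLOPS of NVFP4 (from the announcement)

print(f"Attention FLOPs per prefill: {attn_flops:.2e}")             # ~3.28e+18
print(f"Seconds on one Rubin CPX at peak: {attn_flops / rubin_cpx_flops:.0f}")  # ~109
```

Even at peak throughput, that is minutes of pure attention compute per request, and it is exactly the phase Rubin CPX is built to attack.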
Architecture and Ecosystem
Rubin CPX is not a standalone giant but part of a holistic inference solution. It deploys alongside Nvidia's Vera CPUs and Rubin GPUs in the Vera Rubin NVL144 CPX rack, which packs:
- 144 Rubin CPX GPUs
- 144 Rubin GPUs
- 36 Vera CPUs
Together, these deliver a staggering 8 exaFLOPS of NVFP4 compute (roughly a 7.5× leap over the GB300 NVL72), along with 100 TB of ultra-fast memory and 1.7 PB/s of memory bandwidth within a single rack.
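The headline number survives a quick sanity check. In the sketch below, the 30 PF per CPX chip comes from the announcement, while the ~25 PF per Rubin GPU is an assumption inferred from the separately announced ~3.6 exaFLOPS Vera Rubin NVL144, and the GB300 NVL72 baseline is approximate:

```python
# Sanity-check of the rack-level NVFP4 figure from the announcement.
cpx_gpus, cpx_pf_each = 144, 30.0      # Rubin CPX: 30 PF each (announced)
rubin_gpus, rubin_pf_each = 144, 25.0  # Rubin: ~25 PF each (assumed)

total_pf = cpx_gpus * cpx_pf_each + rubin_gpus * rubin_pf_each
print(f"Rack total: {total_pf / 1000:.2f} exaFLOPS")  # ~7.92, i.e. ~8 EF

gb300_nvl72_ef = 1.1  # GB300 NVL72 NVFP4 inference (approximate)
# ~7.2x with these inputs, in the ballpark of Nvidia's quoted ~7.5x
print(f"Speedup vs GB300 NVL72: {total_pf / 1000 / gb300_nvl72_ef:.1f}x")
```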
This design fully embraces disaggregated inference: the compute-bound context (prefill) phase runs on the compute-dense CPX chips, while the bandwidth-bound generation (decode) phase runs on HBM-equipped Rubin GPUs, keeping giant, long-context generative AI workloads both scalable and responsive.
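Conceptually, the split looks like the toy scheduler below. The worker pools and function names are hypothetical illustrations, not an Nvidia API (in practice, orchestration happens at the platform layer, e.g., Nvidia Dynamo):

```python
# Minimal sketch of disaggregated inference routing: prefill to the
# compute-dense CPX pool, decode to the HBM-equipped Rubin pool.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

def schedule(req: Request) -> list[str]:
    plan = []
    # Prefill: O(n^2) attention over the whole prompt -> compute-bound.
    plan.append(f"prefill {req.prompt_tokens} tokens on Rubin CPX pool")
    # KV-cache handoff between the two pools over the rack fabric.
    plan.append("transfer KV cache to Rubin (HBM) pool")
    # Decode: one token at a time, dominated by KV-cache reads -> bandwidth-bound.
    plan.append(f"decode {req.max_new_tokens} tokens on Rubin pool")
    return plan

for step in schedule(Request(prompt_tokens=1_000_000, max_new_tokens=2_048)):
    print(step)
```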
Maximizing Context Windows, Maximizing ROI
The Rubin CPX dramatically increases the number of tokens AI models can efficiently process in context, pushing into million-token territory. This expansion lets next-gen chatbots, assistants, and creative AI “remember” and reason over far more information in one pass, improving quality and coherence.
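To make the memory side concrete: the KV cache alone for a single million-token request runs into the hundreds of gigabytes. A quick sketch, assuming an illustrative grouped-query-attention configuration (every parameter below is an assumption, not an Nvidia figure):

```python
# KV-cache size for one request at a given context length.
# Model configuration is an illustrative assumption (roughly 70B-class).

n_layers = 80        # transformer layers (assumed)
n_kv_heads = 8       # grouped-query-attention KV heads (assumed)
head_dim = 128       # per-head dimension (assumed)
bytes_per_elem = 2   # FP16/BF16 cache

def kv_cache_gb(seq_len: int) -> float:
    # 2 tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(f"128k tokens: {kv_cache_gb(131_072):.1f} GB")    # ~42.9 GB
print(f"1M tokens:   {kv_cache_gb(1_000_000):.1f} GB")  # ~327.7 GB
```

Numbers like these are why the rack pairs compute-dense CPX chips with 100 TB of fast memory.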
Nvidia emphasized the business angle, projecting 30x to 50x returns on investment for the platform: as much as $5 billion in token revenue for every $100 million of capital invested.
This speaks volumes about the explosive demand and ROI potential for accelerated, memory-heavy AI inference infrastructure.
When and Where
Nvidia aims to deliver Rubin CPX-powered systems by the end of 2026, giving cloud providers and AI-focused enterprises time to prepare for the next era of AI workloads that rely on extreme throughput and massive context awareness.
Why It Matters
- First GPU purpose-built to massively scale context windows for inference workloads
- 3× faster attention processing driving smoother, faster generative AI responses
- Enabling richer, longer, and more complex AI conversations & creations
- Part of a full-stack disaggregated system that maximizes performance and developer ROI
Final Thought
With Rubin CPX, Nvidia is doubling down on inference acceleration and memory capacity exactly where generative AI needs it most. The era of million-token contexts is no longer a distant dream — it’s swiftly becoming the new standard, unlocked by GPU architectures designed for a sprawling future of AI creativity and utility.
The AI cloud arms race just leveled up again, and Rubin CPX promises to be a powerhouse centerpiece. Keep an eye on late 2026 — this is when next-gen AI workloads will start running at a scale and speed previously imaginable only in theory.
Sources: Nvidia Rubin CPX official blog (Sep 9, 2025)