Today, NVIDIA showed a bit more of its Top 10 supercomputer, EOS. NVIDIA EOS currently sits at #9 on the November 2023 Top500 list of the world's most powerful supercomputers, ranked by the double-precision Linpack benchmark. That is notable because NVIDIA de-emphasized double-precision gains in favor of AI performance generations ago. Hitting the top 10 in a benchmark the system was not designed for is a big achievement, and it underscores just how massive an AI supercomputer this is.
NVIDIA EOS A Top 10 Supercomputer Shown
NVIDIA EOS has 576 NVIDIA DGX H100 systems connected by NVIDIA Quantum-2 400Gb/s InfiniBand, giving it an Rmax of 121.4 PFLOPS on double-precision Linpack, but 18.4 exaflops of FP8 AI compute.
Given that there are 576 DGX H100 systems, each with eight GPUs, we have 4608 GPUs, which would likely cost well over $200M if you tried to put the system together at street pricing.
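As a quick sanity check on the figures above, here is a back-of-the-envelope sketch. The 576-system and 18.4 exaflops numbers come from the article; the per-GPU FP8 rate is simply derived from them and is our own rough calculation, not an NVIDIA-published figure.

```python
# Back-of-the-envelope check of the EOS numbers cited above.
dgx_systems = 576
gpus_per_dgx = 8                      # each DGX H100 holds 8 H100 GPUs
total_gpus = dgx_systems * gpus_per_dgx
print(total_gpus)                     # 4608 GPUs

fp8_total_exaflops = 18.4             # cluster-wide FP8 AI compute
fp8_per_gpu_pflops = fp8_total_exaflops * 1000 / total_gpus
print(round(fp8_per_gpu_pflops, 2))   # ~3.99 PFLOPS FP8 per GPU
```

That ~4 PFLOPS per GPU lines up with the H100's FP8 rate with sparsity, so the cluster-wide figure is the straightforward product of the GPU count and the per-GPU peak.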
NVIDIA is using its SuperPOD architecture to build EOS in blocks that it can then scale to larger topologies. A big part of this announcement also reminds folks in the industry that NVIDIA can scale to 4608 accelerators (and more), while many of the other AI training cards are not able to scale as easily to that number.
Here, we can see that NVIDIA has four systems per rack, which should put each rack at 32kW or less. That is our working assumption until we see the rear of the systems.
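Under those assumptions, the rack math is simple. A short sketch, using the article's four-systems-per-rack layout and its unconfirmed 32kW-per-rack working assumption:

```python
# Rack-level sketch under the article's assumptions: 4 DGX H100
# systems per rack, 32 kW per rack (unconfirmed until we see the
# rear of the systems).
dgx_systems = 576
systems_per_rack = 4
racks = dgx_systems // systems_per_rack
print(racks)                    # 144 racks of DGX H100 systems

rack_kw = 32                    # the article's working assumption
total_mw = racks * rack_kw / 1000
print(round(total_mw, 2))       # ~4.61 MW for the DGX racks alone
```

That roughly 4.6 MW figure covers only the compute racks themselves, not networking, storage, or cooling overhead.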
In the shot above, we can see the liquid cooling rack manifolds behind the individual at the KVM cart. If you want to learn how liquid cooling works, you can see How Liquid Cooling Servers Works with Gigabyte and CoolIT.
We recently also looked at a liquid-cooled Supermicro SYS-821GE-TNHR 8x NVIDIA H100 AI Server.
And also QCT’s liquid cooling solution:
For future AI servers, folks are going to want liquid cooling because of the power efficiency gains of liquid over air. If you are deploying AI servers, you need to be thinking about liquid cooling, and NVIDIA is showing why with EOS.
Final Words
The NVIDIA EOS supercomputer is one that the company can use for its internal development and also for customer workloads. Having a large-scale cluster available differentiates NVIDIA from some of its competitors.
This was another small look at EOS, the system NVIDIA showed last November. I know Patrick has been trying to do an in-person tour for a long time, similar to the Intel Developer Cloud tour.
Maybe in the B100 generation, that will happen?