This week at SC23, Microsoft Azure’s “Eagle” made its debut on the semi-annual Top500 list. Microsoft has a machine that has more HPC compute power than all but two systems on the list. Perhaps the coolest part: you can use it. While traditional HPC clusters are built in places like national labs and access is limited to those who are approved by the institutions that run them, Microsoft’s is a cloud solution.
Microsoft Azure Eagle is a Paradigm-Shifting Cloud Supercomputer
Eagle edged out Fugaku, the former #1 supercomputer from Fujitsu and RIKEN that debuted in 2020. With 14,400 NVIDIA H100 GPUs (14,400 / 8 = 1,800 nodes?), InfiniBand, and Intel Xeon Sapphire Rapids CPUs, the new system turned in a #3 result that was only around 4% below #2 on the November 2023 list.
At SC23 we got to see a Microsoft Azure NDv5 node on display on the show floor.
This system is massive, but at its heart is a block of 8x NVIDIA H100 GPUs on an HGX H100 8-GPU platform.
We were told that these are using Intel Xeon Platinum processors from the 4th Gen Intel Xeon Scalable “Sapphire Rapids” line, so they use DDR5 memory.
Beyond InfiniBand networking, the nodes also have space for Microsoft’s own NIC solution.
Putting together well over a thousand of these systems with all of the networking, storage, and all of the other bits makes a #3 supercomputer.
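As a quick sanity check on the "well over a thousand" figure, here is a minimal back-of-the-envelope sketch. It assumes every one of the 14,400 GPUs sits in an 8-GPU HGX H100 node, which is our assumption rather than something Microsoft confirmed, and the `node_count` helper is purely illustrative.

```python
# Back-of-the-envelope node count for Eagle.
GPUS_TOTAL = 14_400   # H100 count quoted for the Top500 entry
GPUS_PER_NODE = 8     # assumption: one HGX H100 8-GPU board per node


def node_count(total_gpus: int, gpus_per_node: int) -> int:
    """Return the number of nodes, insisting on an exact division."""
    nodes, remainder = divmod(total_gpus, gpus_per_node)
    if remainder:
        raise ValueError("GPU total is not a whole number of nodes")
    return nodes


nodes = node_count(GPUS_TOTAL, GPUS_PER_NODE)
print(f"{nodes} nodes")  # 1800 nodes
```

If that assumption holds, the Top500 run used roughly 1,800 of these nodes.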
Final Words
A fun X (or Twitter) nugget is that Microsoft’s benchmarking team only got three days with the system to turn in a Linpack run for a Top500 result before the system was made available to customers. Contrast that with Aurora, the #2 system on this list, where the run happened on about half of the cluster, and even that half was not running at its final expected performance. This is an awesome result by Microsoft and really shows a paradigm shift.
Microsoft has not just a capable system, but a top 3 system for traditional HPC workloads. The “dirty” secret is that Microsoft is not using the traditional 4x GPU node. Instead, it is using the NVIDIA HGX H100 platform. That makes the Eagle supercomputer also a major AI system using today’s most popular GPUs and AI platforms. While Microsoft’s HPC team was focused on providing cloud alternatives to big FP64 on-prem supercomputers, the architecture means it also built a giant AI supercomputer, one that competes head to head with NVIDIA Eos in MLPerf Training v3.1 results while using only a part of the Azure machine.
The scale and performance of Microsoft’s offering are undeniable at this point. Eagle is using its Azure flavor of a standard NVIDIA HGX H100 platform, making it also a mainstream AI supercomputer instead of some kind of HPC-only exotic system architecture. Perhaps the best part of the new machine, and what makes it paradigm-shifting, is the access. One does not need to go through a site selection, a 6+ month installation and bring-up, and so forth to use the system, which expands the accessibility of HPC. Eagle is usable by many customers in a cloud model instead of having access restricted by institutional gatekeepers (beyond Microsoft), which again increases the reach of HPC, so long as customers can pay for their use.
The statement “Eagle is using its Azure flavor of a standard NVIDIA HGX H100 platform making it also a mainstream AI supercomputer, instead of some kind of HPC-only exotic system architecture” is interesting because only last week the half and quarter precision sparse matrix accelerators used for AI were considered exotic while the floating-point standards from the 70’s used in statistics, engineering, scientific computing, spreadsheets and so forth were considered traditional.
Huh? An HPC site is a cloud provider. Yes, normally a container rather than a full VM, but we still call that cloud, no?
I’m not sure what makes you assert that HPC is “exotic”. Yes, you can find well-justified “specialized” devices, but most HPC is utterly mundane (whitebox, or if you want to pay more, labeled). InfiniBand? Once exotic, sure, but not any more (and besides, it is meeting serious competition from COTS Ethernet!)
HPC systems are tuned differently, but that’s just because their typical workload is not web hosting.