US HPC and the DOGE Impact on Next-Gen Supercomputers

XAI Colossus Data Center Supermicro Liquid Cooled Nodes Low Angle

This week, we saw El Capitan land atop the Top500 list, putting another HPE and AMD system in the #1 spot. Through a packed meeting schedule, the constant questions involved the impact of AI. One discussion, however, made folks uncomfortable: the potential impact of the new administration on the procurement of large-scale high-performance computing systems. We can point to AI hardware and HPC mixed-precision algorithms, but perhaps the incoming Department of Government Efficiency, or DOGE, is the more interesting angle. I figured it would be worth jotting down some thoughts and viewpoints I heard this week and starting the discussion.

US HPC and the DOGE Impact on Next-Gen Supercomputers

In our recent xAI Colossus tour, I mentioned that standing up the 100,000 GPU Phase 1 in 122 days was much faster than public supercomputers. Since FP64 performance differs between the AMD and Intel GPUs and NVIDIA's parts, node and accelerator counts give us a better sense of approximate system scale:

  • Frontier: 1 CPU, 4 GPUs, ~9,400 Nodes (2022) / ~9,800 Nodes (2024)
  • Aurora: 2 CPUs, 6 GPUs, ~9,000 Nodes
  • El Capitan: 4 APUs, ~11,000 Nodes

Those three systems combined should use just over 100MW with ~140K accelerators. The xAI team has already stated a goal to grow quickly to 200,000 GPUs. At least from an order-of-magnitude perspective, I would not be surprised if xAI Colossus surpassed the combined CPU and accelerator count of Frontier, Aurora, and El Capitan within a year. At 200,000 GPUs, it should easily be a bigger cluster than those three combined.
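
Here is the napkin math behind that ~140K figure, a minimal sketch using only the approximate node counts listed above:

```
# Napkin math: accelerator counts for the three US DoE Exascale systems,
# using the approximate node counts from the list above.
systems = {
    "Frontier (2024)": (9_800, 4),   # nodes, accelerators per node
    "Aurora": (9_000, 6),
    "El Capitan": (11_000, 4),       # 4x MI300A APUs per node
}

total = 0
for name, (nodes, per_node) in systems.items():
    accelerators = nodes * per_node
    total += accelerators
    print(f"{name:>16}: ~{accelerators:,} accelerators")

print(f"{'DoE combined':>16}: ~{total:,} accelerators")  # ~137,200, call it ~140K
```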

Networking is really interesting. Many HPC folks will point to AI clusters and say their interconnects are inferior. On the other hand, we showed xAI Colossus's Supermicro liquid-cooled systems, and you will see HGX H200 reviews this quarter on STH, with one 400GbE NVIDIA BlueField-3 SuperNIC for each NVIDIA GPU, plus an additional 400GbE BlueField-3 DPU on the CPU side of each system. Even in its Phase 1, 100,000 GPU form, that was over 110,000 NICs. The switches are also much newer and faster than the first-generation Rosetta-based HPE Slingshot 10/11 (100G/200G) switches deployed in the three current Top500 Exascale systems.

XAI Colossus Data Center NICs

If we use a 2019-2020 start date, the systems took perhaps 3-5 years to deliver after the procurement contracts were announced, and a procurement process generally happens before a system is even announced. Put another way, the US DoE effectively built three very different systems in 5-6 years, something Elon Musk's xAI team will do in well under a year, and got reasonably close to in 122 days.

Intel Aurora Sapphire Rapids And Ponte Vecchio HPE Cray EX Node ISC 2022 PVC 1

Of course, the US DoE is scrappy. It is not directly competing with companies like Microsoft, Meta, and others buying NVIDIA GPUs at high prices. The three public US Exascale supercomputers had publicly floated budgets totaling under $2B when they were announced, which is probably a third of what xAI Colossus costs, based simply on NVIDIA's supply-constrained pricing.

HPE Cray Frontier AMD Blade Nozzle Side

Summing this up with some rough napkin math: in 5-6 years, the US DoE bought and deployed three systems that, combined, are roughly what Elon Musk's team built, or will have built, in under a year, and at a cost perhaps closer to a third of the AI system's.

Here is my law that I have been floating with hyper-scalers and many in the industry:

The value of an AI accelerator is greatest when it is first deployed, and decreases thereafter.

Some say peak value comes when the accelerator rolls off the line at the fab, but until it is packaged, installed in a system, and deployed into a cluster, it is effectively a paperweight. If you look at GPU pricing over time, it consistently decreases as each successive (NVIDIA) generation of accelerator is deployed at greater scale. An available NVIDIA H100 in Q4 2024 costs much less than it did a year ago and much more than it will in a year.
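
To make the shape of that curve concrete, here is a minimal sketch. The launch price and decay rate are hypothetical placeholders, not actual H100 pricing data; the point is only that the earliest days of a deployment are its most valuable:

```
# Illustrative sketch of the law above: accelerator value decays from the
# moment of deployment. LAUNCH_PRICE and ANNUAL_DECAY are made-up assumptions
# to show the shape of the curve, not real pricing data.
def accelerator_value(initial_price: float, annual_decay: float, years: float) -> float:
    """Estimated value after `years` of smooth exponential decay."""
    return initial_price * (1 - annual_decay) ** years

LAUNCH_PRICE = 30_000  # hypothetical per-GPU price in USD
ANNUAL_DECAY = 0.35    # hypothetical 35%/year decline as supply ramps

for years in range(4):
    print(f"Year {years}: ~${accelerator_value(LAUNCH_PRICE, ANNUAL_DECAY, years):,.0f}")
```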

Patrick With The NVIDIA H100 At NVIDIA HQ April 2022

Bringing this back to the Department of Government Efficiency, or "DOGE," with its aim of increasing the efficiency of government expenditures, we can imagine the meeting between DOGE and the agencies involved in next-gen system procurement. Perhaps 1/6th the delivery speed, at 1/3rd the nominal cost (ignoring the time value of money), for systems that are 1/3rd the size.
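
As a quick sanity check on those ratios, a minimal sketch using only this article's rough figures, not official budget numbers: cost per unit of system size comes out roughly a wash, so the gap such a meeting would focus on is almost entirely delivery speed.

```
# Hedged napkin math for that imagined DOGE meeting, using only the rough
# ratios from this article (1/6th the speed, 1/3rd the cost, 1/3rd the size).
builders = {
    "US DoE":       {"years": 6.0, "rel_size": 1.0, "rel_cost": 1.0},
    "xAI Colossus": {"years": 1.0, "rel_size": 3.0, "rel_cost": 3.0},
}

for name, b in builders.items():
    size_per_cost = b["rel_size"] / b["rel_cost"]  # roughly a wash at these ratios
    per_year = size_per_cost / b["years"]          # the gap is delivery speed
    print(f"{name:>12}: size/cost={size_per_cost:.2f}, size/cost/year={per_year:.2f}")
```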

In theory, the US Government should be able to deploy faster than a private company. Everyone involved in IT hardware deployment knows that the US Federal Government can check a box on an order and legally jump to the head of the line in an NVIDIA supply-constrained environment. It also has broad powers and existing facilities that it could use to streamline site preparation.

Given that, it feels like, ahead of a DOGE meeting, agencies will want to rethink their speed to deliver systems. How do you tell someone who built a bigger system in a year with a private organization that it takes 5-6 times longer to build smaller systems with the weight of the US Government behind you?

Non-Recurring Engineering Components

Of course, there is a lot more to the procurement of these Exascale systems that goes beyond the hardware. The programs often include extras because they are deploying novel architectures.

Patrick With HPE Cray EX AMD EPYC And Instinct MI250X Node At CoolIT Systems

For example, under the current contracts, work has gone into porting HPC codes to GPUs, making codes work on the Intel and AMD GPU architectures as well as the APU architecture of the AMD MI300A. There is a lot of great work happening and being highlighted in forums like Supercomputing 2024. That work certainly has value.

One of the more notable impacts of this work is that it has probably helped AMD become the #2 to NVIDIA. Microsoft is deploying many AMD GPUs in its infrastructure. Without AMD, and perhaps Broadcom/Marvell working with hyper-scalers on AI chips, NVIDIA would have even more pricing power in the AI space. We will just say that Intel desperately needs Falcon Shores at this point. The work that went into Frontier and El Capitan has probably saved the industry billions of dollars by helping create a lower-cost second source for some AI installations.

CoolIT HPE Cray EX AMD EPYC And Instinct MI250X Node On Bench 3

It is strange, but if AMD could quantify the impact that the Frontier/El Capitan NRE efforts and its subsequent wins have had on NVIDIA's pricing dominance, it would probably help both AMD and NVIDIA.

If the purpose of building a supercomputer is to do as much work in a science domain as possible at the best value, then the next question is the fastest way to achieve that value. A big difference between those building AI clusters today and those building government supercomputers is the voracity with which they pursue that value.

Final Words

Some have speculated that if xAI Colossus had submitted an HPL result to the Top500, it would have easily been #1. Others think the interconnect might have held it back. Our guess is that before being notified of their #1 Top500 position last week, the LLNL folks with El Capitan were preparing SC24 materials for being either #1 or #2. We know that several hyper-scalers could have submitted Top500 results. Our best guess on why they did not comes down to the law on AI accelerators above: running HPL on an AI system when it is first deployed consumes compute time that is more valuable than it will be at any future point, making it an inefficient use of multi-billion dollar systems.

After filming the xAI Colossus video, and seeing it being built on a prior trip, it is immediately obvious that there is an enormous gap between the speed at which private companies build large-scale systems and the pace of the government. Since the new administration has announced plans to review large-scale government spending with DOGE, and one of the DOGE leads can legitimately say his teams build much bigger systems much faster, those involved in the solicitation and planning for next-generation US supercomputers need to prepare answers.

To be clear, I think the US building supercomputers is great, and I do not think this is a waste of taxpayer dollars by any means. It is far from the most wasteful spending category since there are tangible science benefits; we are really discussing deltas in scale, cost, and speed. Perhaps the big impact, if DOGE takes up supercomputing, will be on the voracity with which these systems are deployed.

3 COMMENTS

  1. DOGE/Musk is more of the same – grift. I am very doubtful anything positive for the nation comes from this. How does an efficiency team have 2 leaders? It's a farce. They'll cut regulations that make it easier for their companies to make money and put the rest of us at risk.

    The whole cabinet is scoundrels – Trump, Gaetz, Fox Talking Head, Tulsi, Brain Worms Jr, WWE Lady. Seems pretty swampy to me.

    Lower your expectations of what they can actually get done.

  2. LLNL should be thanking Elon's team for not submitting Linpack numbers. 1,024 H100s are around 45PF. If you've got 100 times as many GPUs, that's 4.5EF. Running at half the scaling efficiency would still make it #1 by a great margin.

    There’s grift in all government contracts. I don’t think this is the first area they’re going after and we’ll probably see the next contract announced before they get to it. But you’ve made some good points.

  3. When you have to do things like actually honor agreements with unions, not overwork your employees, not cut corners or burn cash to get ahead, and plan for what you build to function over a longer term, all these things tend to take longer.

    Reworded: Doing things the right way is rarely ever the fastest way.

    Government installations tend to be online for long periods of time and are expected to function across that lifespan. I seriously doubt the xAI cluster is expected to survive longer than 5 years at most before being completely replaced.
