Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk


The Colossus Facility

Since we have liquid-cooled racks of AI servers, power and facility water are essential to the installation. Here is a look at the massive water pipes. There are sets of pipes for cooler and warmer water. Cooler water is brought into the facility and circulates through the CDU in each rack. Heat is transferred from the GPUs and rear door heat exchanger loops to the facility water loops at the CDU. The warmer water is then brought outside the facility to chillers. Of course, the chillers are not the type that will make you ice cubes. Instead, the goal is just to lower the temperature of the water enough that it can be recycled through the facility again.

XAI Colossus Data Center Facility Water Pipes
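To get a feel for why the pipes are so large, here is a minimal sketch of the standard heat balance for a water loop, Q = m_dot * c_p * delta_T. The 100 kW heat load and 10 K temperature rise below are illustrative assumptions, not figures from xAI or Supermicro, and the function name is hypothetical.

```python
# Approximate properties of liquid water near room temperature.
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_L = 1.0


def required_flow_l_per_s(heat_load_w: float, delta_t_k: float) -> float:
    """Water flow (liters/second) needed to carry heat_load_w watts
    when the water warms by delta_t_k kelvin across the loop."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    # Q = m_dot * c_p * delta_T, solved for the mass flow m_dot.
    mass_flow_kg_per_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L


# Assumed example: 100 kW of heat, water returning 10 K warmer.
flow = required_flow_l_per_s(100_000.0, 10.0)
print(f"{flow:.2f} L/s")  # prints 2.39 L/s
```

Flow scales linearly with heat load and inversely with the temperature rise, so a bigger heat load or a smaller allowed temperature rise both call for more water per second.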

Power is fascinating. When we were in Memphis while the system was being built, we saw the teams moving huge power cables into place.

XAI Colossus Data Center Some Electrical Infrastructure With Patrick For Scale

Outside of the facility, we saw containers with Tesla Megapacks. This is one of the really neat learning points that the teams had building this giant cluster. AI servers do not run at 100% rated power consumption 24×7. Instead, they have many peaks and valleys in power consumption. With so many GPUs on site, the power consumption fluctuates as the workload moves to the GPUs, and then results are collated, and new jobs are dispatched. The team found that the millisecond spikes and drops in power were stressful enough that putting the Tesla Megapacks in the middle to help buffer those spikes in power helped make the entire installation more reliable.

Tesla Megapacks Ready For Installation At XAI Colossus
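The buffering idea can be sketched with a toy model. This is not how the Megapacks are actually controlled. It assumes the upstream supply is allowed to change by only a fixed amount per time step, and that a battery supplies or absorbs whatever the load needs beyond that. The load numbers, the ramp limit, and the function name are made up for illustration, and battery capacity and state of charge are ignored.

```python
def buffer_with_battery(load_mw, max_grid_ramp_mw):
    """Return (grid_draw, battery_power), one value per time step.

    grid_draw follows the load but changes by at most max_grid_ramp_mw
    per step. battery_power is positive when the battery discharges to
    cover a spike and negative when it absorbs a sudden drop.
    """
    grid_draw = []
    battery_power = []
    current = load_mw[0]  # start with the grid matching the load
    for demand in load_mw:
        # Move the grid draw toward the demand, clamped to the ramp limit.
        step = max(-max_grid_ramp_mw, min(max_grid_ramp_mw, demand - current))
        current += step
        grid_draw.append(current)
        battery_power.append(demand - current)
    return grid_draw, battery_power


# Made-up load trace with sharp swings, in MW.
load = [100, 100, 150, 150, 90, 90, 140, 140]
grid, battery = buffer_with_battery(load, max_grid_ramp_mw=10)
print(grid)     # [100, 100, 110, 120, 110, 100, 110, 120]
print(battery)  # [0, 0, 40, 30, -20, -10, 30, 20]
```

In this toy trace the load jumps by as much as 60 MW between steps, while the supply side never sees a change larger than 10 MW. The battery takes up the difference in both directions.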

Of course, the facility is just getting started. While the initial cluster of four 25,000 GPU data halls was up and running with around 100,000 GPUs at the time of our visit, the cluster expansion work is moving rapidly.

XAI Colossus Data Center In Memphis Outside Expansion

This seems to be the start of something truly awesome.

Final Words

One of the key themes I learned while doing this is that the xAI team has no time for petty vendor differences. The only way this got built was a surge of experts building the systems together with a vision of building a giant AI cluster at an unheard-of speed. If I had just seen it the day we filmed the video, I would have had a different perspective on how many people were working together to build something of this scale. It was cool going on-site both times and having folks come up to me and tell me they have been avid readers or viewers of STH for so long.

XAI Data Center Tour Patrick In Compute Hall

If you want to get involved in this project or large AI installations, check out the job postings at xAI and Supermicro. I hear folks in the AI community talk about how LLMs continue to scale with more compute and how they can be more generally applicable than just chatbots. As I walked around Colossus, one thought I had is that something of this scale only gets built if data-driven folks see huge value on the horizon. Grok and the xAI team’s future work feels destined to be much more than a simple 2024-era chatbot. A lot of very smart people are spending a lot of money and time to make that happen as fast as possible.

We have come a long way since I first fielded the call on this from the hospital the day after my son was born. In the end, it was a fantastic experience to see this get built. Thank you to all of those who went out of their way to make this possible.

If you are working on a large AI cluster, let us know. It is exciting to see what will happen next.

If you want to learn more, here is the Supermicro AI link and the company’s landing page for the AI Supercluster. Or, just watch the video.

17 COMMENTS

  1. In the latter part (2017-2021) of my almost decade working in the primary data center for an HFT firm, we moved from air cooled servers to immersion cooling.

    From the server side that basically meant finding a vendor willing to warranty servers cooled this way, removing the fans, replacing thermal paste with a special type of foil and (eventually) using power cords made with a more expensive outer coating (so they didn’t turn rock hard from the mineral oil cooling fluid.)

    But from the switch side (25 GbE) no way the network team was going to let me put their Arista switches in the vats… Which made for some awkwardly long cabling and eventually a problem with oil wicking out of the vats via the twinax cabling (yuck!).

  2. I would look at immersion cooling as a crude (but effective) “bridge technology” between the worlds of the past with 100% air cooling for mass market scale out servers, and a future heavy on plumbing connections and water blocks.

  3. This is extremely impressive.

    However, coming online in 122 days is not without issue. For example, this facility uses at least 18 unlicensed/unpermitted portable methane gas generators that are of significant concern to the local population – one that already struggles with asthma rates and air quality alerts. There is also some question as to how well the local utility can support the water requirements of liquid cooling at this scale. One of the big concerns about liquid cooling with datacenters is the impact to the water cycle. When water is typically consumed it ends up as wastewater feeding back to treatment facilities where it ends up back in circulation relatively quickly.

    Water-based cooling systems used in datacenters use evaporation – which has a much longer cycle: atmosphere -> clouds -> rainwater -> water table.

    Other clusters and datacenters used by the likes of Meta, Amazon, Google, Microsoft, etc take the time and caution to minimize these kinds of environmental impact.

    Again, very impressive from a technical standpoint but throwing this together to have it online in record time should not have to come at the expense of the local population for the arbitrary bragging rights of a billionaire.

  4. Musk is a shitty person and should not run companies that the USA depends on strategically, but yeah it’s a cool datacenter.

  5. 100% agreed on the Musk comments. So much god worship out there overlooking the accomplishments from Shotwell, Straubel, Eberhard, Tarpenning and countless others. Interesting article though ;-)

  6. What would be even cooler than owning 100k GPUs would be putting out any AI products, models, or research that was interesting and impactful. xAI is still-born as a company because no researcher with a reputation to protect is willing to join it, the same reason Tesla’s self driving models make no significant progress.

  7. > There is no 9 links per server but only 8 . 1 is for management …

    On each GPU node: one 400GbE link for each of 8 GPUs, plus another 400GbE for the CPU, plus gigabit IPMI.

  8. To Skywalker: I guess it’s most likely caused by schedule (H100 rather than a Blackwell SKU) and the X software environment.

  9. The most impressive part is that they will soon double that capacity with the new Nvidia H200 batch deployment.
