Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk

Supermicro-based Storage at xAI

Storage was really interesting. In AI clusters, you generally see large storage arrays. Here, we saw storage software from different vendors running, but almost every storage server was Supermicro as well. That should not be a surprise; Supermicro is the OEM for many storage vendors.

XAI Colossus Data Center Supermicro 1U NVMe Storage Node

One aspect that was very neat to see while we toured the facility was how similar some of the storage servers look to the CPU compute servers.

XAI Colossus Data Center Supermicro 1U NVMe Storage Node 2

In either case, you will see a lot of 2.5” NVMe storage bays in our photos and video. Something we have covered on our Substack is that large AI clusters have been moving away from disk-based storage to flash because it can save significant amounts of power while offering more performance and more density. Flash can cost more per petabyte, but in clusters of this scale, flash tends to win on a TCO basis.
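
To illustrate why flash can win on total cost despite a higher price per petabyte, here is a minimal back-of-the-envelope sketch in Python. Every number in it is a placeholder assumption for illustration, not a figure from xAI, Supermicro, or any storage vendor; the point is simply that disk often has to be overprovisioned to hit an AI cluster's throughput targets, while flash is bought much closer to the usable capacity actually needed and draws far less power.

```python
# Back-of-the-envelope flash vs. disk sketch. Every number is a placeholder
# assumption for illustration, not a figure from xAI or any vendor.
def cluster_cost(pb_needed, overprovision, capex_per_pb, watts_per_pb,
                 years=5, usd_per_kwh=0.08, pue=1.3):
    """Capex plus powered-on energy cost for the deployed capacity."""
    pb_deployed = pb_needed * overprovision
    capex = pb_deployed * capex_per_pb
    energy_kwh = pb_deployed * watts_per_pb / 1000 * years * 8760 * pue
    return capex + energy_kwh * usd_per_kwh

# Hypothetical 10 PB usable target: disk is overprovisioned for throughput,
# flash is bought near the capacity target and uses much less power.
disk = cluster_cost(10, overprovision=3.0, capex_per_pb=120_000, watts_per_pb=10_000)
flash = cluster_cost(10, overprovision=1.2, capex_per_pb=350_000, watts_per_pb=2_000)

print(f"disk : ${disk:,.0f} over 5 years")
print(f"flash: ${flash:,.0f} over 5 years")
```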

Supermicro-based CPU Compute at xAI

With all of these clusters, you generally see a solid number of traditional CPU compute nodes. Many processing and data manipulation tasks still run better on CPUs than on GPUs, and you also want to keep the GPUs working on AI training or inference rather than other tasks.
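
As a concrete example of that division of labor, here is a minimal, hypothetical PyTorch-style sketch of pushing data loading onto CPU worker processes so the GPUs stay on training math. It is a generic pattern, not a description of xAI's actual pipeline.

```python
# Generic sketch: CPU worker processes handle data loading/manipulation so
# the GPU spends its time on the training step. This illustrates the
# CPU-vs-GPU division of labor, not xAI's actual software stack.
import torch
from torch.utils.data import DataLoader, Dataset

class RandomTokens(Dataset):
    """Stand-in dataset; a real pipeline would decode and tokenize here on CPU."""
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        return torch.randint(0, 50_000, (1024,))

def main():
    loader = DataLoader(
        RandomTokens(),
        batch_size=8,
        num_workers=8,       # CPU processes do the data manipulation
        pin_memory=True,     # faster host-to-device copies
        prefetch_factor=4,   # keep batches queued ahead of the GPU
    )
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for batch in loader:
        batch = batch.to(device, non_blocking=True)
        # ... the GPU runs the forward/backward pass here, not I/O ...
        break

if __name__ == "__main__":   # needed when spawning multiple worker processes
    main()
```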

XAI Colossus Data Center CPU Compute Rack

Here, we see racks of 1U servers. Each server is designed to balance compute density against the heat being generated. A great example is that we can see the orange tabs for the NVMe storage bays on the front, but also roughly a third of the faceplate dedicated to drawing cool air into the system.

XAI Colossus Data Center CPU Compute Rack 2

These 1U compute servers are cooled by fans, and a rear door heat exchanger then removes the heat and transfers it to the facility water loops. Because the data center is designed around rear door heat exchangers, xAI can handle both liquid-cooled and air-cooled gear.

Networking at xAI Colossus

Networking is one of the fascinating parts. If your computer uses an Ethernet cable, that is the same base technology as the networking here, except that this is 400GbE, or 400 times faster per optical connection than the common 1GbE networking we see elsewhere. There are also nine of these links per system, which means we have about 3.6Tbps of bandwidth per GPU compute server.
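
The per-server arithmetic behind that figure is simple; a quick sketch:

```python
# Aggregate network bandwidth per GPU compute server, per the figures above.
links_per_server = 9      # nine 400GbE optical links per system
link_speed_gbps = 400     # each is 400x the common 1GbE

total_gbps = links_per_server * link_speed_gbps
print(total_gbps, "Gbps =", total_gbps / 1000, "Tbps")   # 3600 Gbps = 3.6 Tbps
```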

XAI Colossus Data Center NICs

The RDMA network for the GPUs makes up the majority of this bandwidth. Each GPU gets its own NIC. Here, xAI is using NVIDIA BlueField-3 SuperNICs and Spectrum-X networking. NVIDIA has some special sauce in its network stack that helps ensure the right data gets to the right place while navigating around bottlenecks in the cluster.
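
To make the one-NIC-per-GPU idea concrete, here is a hypothetical sketch of pinning each rank's GPU to the RDMA device on the same rail. The device names, the one-to-one mapping, and the use of NCCL-style environment variables are illustrative assumptions, not xAI's actual configuration.

```python
# Hypothetical sketch: keep each GPU's RDMA traffic on its own NIC ("rail").
# Device names (mlx5_0..mlx5_7) and the simple one-to-one mapping are
# illustrative assumptions, not xAI's actual configuration.
import os

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # set by the job launcher

# One NIC per GPU: GPU 0 <-> mlx5_0, GPU 1 <-> mlx5_1, and so on.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)
os.environ["NCCL_IB_HCA"] = f"mlx5_{local_rank}"       # RDMA device for this rank

# The collective library then moves GPU-to-GPU traffic over the local NIC,
# while the host's separate front-end network handles everything else.
```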

XAI Colossus Data Center Switch Fiber 1

That is a big deal. Many supercomputer networks use InfiniBand or other technologies, but this is Ethernet. Ethernet means it can scale. Everyone reading this on STH will have the page delivered over an Ethernet network at some point. Ethernet is the backbone of the Internet, so it is a technology that is immensely scalable. These enormous AI clusters are scaling beyond the point that some of the more exotic interconnect technologies have reached. This is a really bold move by the xAI team.

Beyond the GPU RDMA network, the CPUs also get a 400GbE connection that uses an entirely different switch fabric. xAI is running one network for its GPUs and another for the rest of the cluster, which is a very common design point in high-performance computing clusters.

XAI Colossus Data Center Single Mode And Multi Mode Fiber

Just to give you some sense of how fast 400GbE is, it is more connectivity than a top-of-the-line early 2021 Intel Xeon server processor could handle across all of its PCIe lanes combined. That level of networking is being used nine times per server here.
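
For a rough sense of that comparison, here is a sketch assuming a Xeon Scalable generation with 48 lanes of PCIe Gen3; the exact processor the article has in mind is not named, so treat the lane count and generation as my assumptions.

```python
# Rough comparison: one 400GbE link vs. all of a CPU's PCIe bandwidth.
# Assumes 48 lanes of PCIe Gen3 (pre-Ice Lake Xeon Scalable); the exact CPU
# is my assumption, so treat these numbers as illustrative.
gen3_gbps_per_lane = 8 * (128 / 130)   # 8 GT/s with 128b/130b encoding
lanes = 48

total_pcie_gbps = lanes * gen3_gbps_per_lane
print(f"48 lanes of PCIe Gen3: ~{total_pcie_gbps:.0f} Gbps")   # ~378 Gbps
print("one 400GbE link:        400 Gbps per direction")
```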

XAI Colossus Data Center Switch Stack

All of that networking means we have a huge number of fiber runs. Each fiber run is cut to the correct length, terminated, and labeled.

XAI Colossus Data Center Fiber Runs

I had the opportunity to meet some of the folks doing this work back in August. Structured cabling is always neat to see.

XAI Colossus Data Center Overhead Cabling

In addition to the high-speed cluster networking, there is lower-speed networking that is used for the various management interfaces and environmental devices that are a part of any cluster like this.

Something that was very obvious walking through this facility is that liquid-cooled network switches are desperately needed. We recently reviewed a 64-port 800GbE switch in the same 51.2T class as the ones used in many AI clusters. Something the industry needs to solve is cooling not just the switch chips, but also the optics, which in a modern switch can use significantly more power than the switch chip itself. Perhaps enormous installations like this one will push the industry toward co-packaged optics so that switch cooling can follow compute to liquid cooling. We have seen liquid-cooled co-packaged optics switch demos before, so hopefully a look at this installation will help those move from prototypes to production.
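
To put the optics-versus-ASIC point in rough numbers, here is a sketch; the wattages are ballpark assumptions for illustration, not measurements from this switch or this installation.

```python
# Ballpark power budget for a 64-port 800GbE (51.2T-class) switch.
# Wattages are rough assumptions for illustration only.
switch_asic_w = 500        # assumed power for a 51.2T switch ASIC
ports = 64
optic_w = 16               # assumed power per pluggable 800G optic

optics_total_w = ports * optic_w
print(f"switch ASIC: ~{switch_asic_w} W")
print(f"optics:      ~{optics_total_w} W across {ports} ports")
```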

4 COMMENTS

  1. In the latter part (2017-2021) of my almost decade working in the primary data center for an HFT firm, we moved from air-cooled servers to immersion cooling.

    From the server side, that basically meant finding a vendor willing to warranty servers cooled this way, removing the fans, replacing thermal paste with a special type of foil, and (eventually) using power cords with a more expensive outer coating (so they didn’t turn rock hard from the mineral oil cooling fluid).

    But from the switch side (25GbE), there was no way the network team was going to let me put their Arista switches in the vats… which made for some awkwardly long cabling and, eventually, a problem with oil wicking out of the vats via the twinax cabling (yuck!).

  2. I would look at immersion cooling as a crude (but effective) “bridge technology” between the world of the past, with 100% air cooling for mass-market scale-out servers, and a future heavy on plumbing connections and water blocks.

  3. This is extremely impressive.

    However, coming online in 122 days is not without issue. For example, this facility uses at least 18 unlicensed/unpermitted portable methane gas generators that are of significant concern to the local population – one that already struggles with asthma rates and air quality alerts. There is also some question as to how well the local utility can support the water requirements of liquid cooling at this scale. One of the big concerns about liquid cooling in datacenters is the impact on the water cycle. When water is consumed in the usual way, it ends up as wastewater feeding back to treatment facilities, where it returns to circulation relatively quickly.

    Water-based cooling systems used in datacenters rely on evaporation, which has a much longer cycle: atmosphere -> clouds -> rainwater -> water table.

    Other clusters and datacenters used by the likes of Meta, Amazon, Google, Microsoft, etc. take the time and caution to minimize these kinds of environmental impacts.

    Again, very impressive from a technical standpoint but throwing this together to have it online in record time should not have to come at the expense of the local population for the arbitrary bragging rights of a billionaire.
