Supermicro-based Storage at xAI
Storage was really interesting. In AI clusters, you generally see large storage arrays. Here, storage software from different vendors was running, but almost every storage server we saw was a Supermicro system as well. That should not be a surprise, since Supermicro is the OEM for many storage vendors.
One aspect that was very neat to see while we toured the facility was how similar some of the storage servers looked to the CPU compute servers.
In either case, you will see a lot of 2.5” NVMe storage bays in our photos and video. Something we have covered on our Substack is that large AI clusters have been moving away from disk-based storage to flash because it can save significant amounts of power while offering more performance and more density. Flash can cost more per petabyte, but in clusters of this scale, flash tends to win on a TCO basis.
Supermicro-based CPU Compute at xAI
With all of these clusters, you generally see a solid number of traditional CPU compute nodes. Data processing and manipulation tasks still run very well on CPUs compared to GPUs. You may also want to keep the GPUs busy with AI training or inference workloads instead of other tasks.
Here, we see racks of 1U servers. Each server is designed to balance compute density with the heat being generated. A great example of this is that we can see the orange tabs for NVMe storage bays on the front, but also roughly a third of the faceplate dedicated to drawing cool air into the system.
These 1U compute servers are cooled by fans, and a rear door heat exchanger then removes the heat and transfers it to the facility water loops. Because the data center design uses rear door heat exchangers, xAI can handle both liquid-cooled and air-cooled gear.
Networking at xAI Colossus
Networking is one of the most fascinating parts. If your computer uses an Ethernet cable, that is the same base technology as the networking here, except that this is 400GbE, or 400 times faster per optical connection than the common 1GbE networking we see elsewhere. There are also nine of these links per system, which means we have about 3.6Tbps of bandwidth per GPU compute server.
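To put a quick number check on that figure, here is a minimal back-of-the-envelope sketch in Python, assuming the nine 400GbE links per GPU compute server described above:

```python
# Back-of-the-envelope bandwidth check for one GPU compute server.
# Assumption (from the description above): nine 400GbE optical links per server.
links_per_server = 9
gbps_per_link = 400

total_gbps = links_per_server * gbps_per_link
print(f"~{total_gbps} Gbps, or ~{total_gbps / 1000} Tbps, per GPU compute server")
# -> ~3600 Gbps, or ~3.6 Tbps, per GPU compute server
```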
The RDMA network for the GPUs makes up the majority of this bandwidth. Each GPU gets its own NIC. Here, xAI is using NVIDIA BlueField-3 SuperNICs and Spectrum-X networking. NVIDIA has some special sauce in its network stack that helps ensure the right data gets to the right place, navigating around bottlenecks in the cluster.
That is a big deal. Many supercomputer networks use InfiniBand or other technologies, but this is Ethernet. Ethernet means it can scale. Everyone reading this on STH will have the page delivered over an Ethernet network at some point. Ethernet is the backbone of the Internet. As a result, it is a technology that is immensely scalable. These enormous AI clusters are scaling to a size that some of the more exotic technologies have not yet reached. That makes this a really bold move by the xAI team.
Beyond the GPU RDMA network, the CPUs also get a 400GbE connection, which uses a different switch fabric entirely. xAI is running a network for its GPUs and one for the rest of the cluster, which is a very common design point in high-performance computing clusters.
Just to give you some sense of how fast 400GbE is, it is more connectivity than a top-of-the-line early 2021 Intel Xeon server processor could handle across all of its PCIe lanes combined. That level of networking is being used nine times per server here.
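As a rough sanity check on that comparison, here is a small sketch. The 48 lanes of PCIe Gen3 are an assumption used for illustration of what a top-end early-2021 Xeon offered:

```python
# Comparing one 400GbE link against a CPU's entire PCIe complex.
# Assumption: a top-end early-2021 Xeon with 48 lanes of PCIe Gen3
# (8 GT/s per lane with 128b/130b encoding, so ~7.88 Gbps usable per lane).
lanes = 48
gbps_per_lane = 8 * (128 / 130)

total_pcie_gbps = lanes * gbps_per_lane
print(f"All PCIe lanes combined: ~{total_pcie_gbps:.0f} Gbps vs. one 400GbE link: 400 Gbps")
# -> ~378 Gbps, so a single 400GbE link already outruns the whole PCIe complex
```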
All of that networking means that we have a huge number of fiber runs. Each fiber run is cut and terminated to the correct length and labeled.
I had the opportunity to meet some of the folks doing this work back in August. Structured cabling is always neat to see.
In addition to the high-speed cluster networking, there is lower-speed networking that is used for the various management interfaces and environmental devices that are a part of any cluster like this.
Something that was very obvious walking through this facility is that liquid-cooled network switches are desperately needed. We recently reviewed a 64-port 800GbE switch in the same 51.2T class as the ones used in many AI clusters. Something the industry needs to solve is cooling not just the switch chips, but also the optics, which in a modern switch can use significantly more power than the switch chip itself. Perhaps enormous installations like these will move the industry toward co-packaged optics so that switch cooling can follow the compute to liquid cooling. We have seen liquid-cooled co-packaged optics switch demos before, so hopefully a look at this installation will help those go from prototypes to production.
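To illustrate why the optics are the harder cooling problem, here is a minimal sketch. The per-optic and ASIC wattages are assumptions chosen for illustration, not measured figures for any particular switch:

```python
# Rough look at why optics cooling matters in a 64-port, 51.2T-class switch.
# The wattages below are illustrative assumptions, not measured values.
ports = 64
watts_per_optic = 18       # assumed draw of one pluggable 800G optic
watts_switch_asic = 500    # assumed draw of the 51.2T switch chip

optics_total_watts = ports * watts_per_optic
print(f"Optics total: ~{optics_total_watts} W vs. switch ASIC: ~{watts_switch_asic} W")
# With these assumptions, the optics draw more than double the ASIC's power,
# which is why co-packaged optics plus liquid cooling looks attractive.
```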
In the latter part (2017-2021) of my almost decade working in the primary data center for an HFT firm, we moved from air-cooled servers to immersion cooling.
From the server side, that basically meant finding a vendor willing to warranty servers cooled this way, removing the fans, replacing thermal paste with a special type of foil, and (eventually) using power cords made with a more expensive outer coating (so they didn't turn rock hard from the mineral oil cooling fluid).
But from the switch side (25GbE), there was no way the network team was going to let me put their Arista switches in the vats… which made for some awkwardly long cabling and eventually a problem with oil wicking out of the vats via the twinax cabling (yuck!).
I would look at immersion cooling as a crude (but effective) “bridge technology” between the worlds of the past with 100% air cooling for mass market scale out servers, and a future heavy on plumbing connections and water blocks.
This is extremely impressive.
However, coming online in 122 days is not without issue. For example, this facility uses at least 18 unlicensed/unpermitted portable methane gas generators that are of significant concern to the local population – one that already struggles with asthma rates and air quality alerts. There is also some question as to how well the local utility can support the water requirements of liquid cooling at this scale. One of the big concerns about liquid cooling in datacenters is the impact on the water cycle. When water is consumed in most settings, it ends up as wastewater feeding back to treatment facilities, where it re-enters circulation relatively quickly.
Water-based cooling systems used in datacenters rely on evaporation, which has a much longer cycle: atmosphere -> clouds -> rainwater -> water table.
Other clusters and datacenters used by the likes of Meta, Amazon, Google, Microsoft, etc. take the time and care to minimize these kinds of environmental impacts.
Again, very impressive from a technical standpoint, but throwing this together to have it online in record time should not have to come at the expense of the local population for the arbitrary bragging rights of a billionaire.
Patrick – great video. Before you were born, there was a terrific movie (based on a book) titled: Colossus: The Forbin Project. https://www.imdb.com/title/tt0064177/ Highly recommended
There are not 9 links per server but only 8. One is for management…
Musk is a shitty person and should not run companies that the USA depends on strategically, but yeah, it's a cool datacenter.
What an amazing cluster and datacenter. Super cool that you got to tour it and show all of this to us!
I hope for Supermicro's sake that the bill was prepaid, not like what X/Twitter/Musk did with Wiwynn.
Interesting data centre. Too bad Musk is behind it. What a jerk.
100% agreed on the Musk comments. So much god worship out there overlooking the accomplishments from Shotwell, Straubel, Eberhard, Tarpenning and countless others. Interesting article though ;-)
What would be even cooler than owning 100k GPUs would be putting out any AI products, models, or research that was interesting and impactful. xAI is stillborn as a company because no researcher with a reputation to protect is willing to join it, the same reason Tesla's self-driving models make no significant progress.
Why do so many guys hate Musk? Anyway, his giant xAI DC is an amazing project.
> There are not 9 links per server but only 8. One is for management…
On each GPU node: one 400GbE link for each of the 8 GPUs, plus another 400GbE link for the CPU, plus gigabit IPMI for management. That is nine 400GbE links in total.
Good article. Wondering why they chose x86-based CPUs instead of Grace CPUs?
To Skywalker: I guess it's most likely due to schedule (H100 rather than a Blackwell SKU) and X's software environment.
With Elon Musk, the news is fun and interesting to watch! :D
What he is doing is amazing!
The most impressive part is that they will soon double that capacity with the new Nvidia H200 batch deployment.