Performance and Power Impact
There are two main areas we are going to focus on here: performance and power. The systems are effectively the same except for some miscellaneous network cards that are either not hooked up to transceivers or will not have an appreciable impact. These dual AMD EPYC 7713 systems are as close as we could reasonably get. That allowed us to issue the same commands to both systems, with the only differences being the GPUs and the cooling.
First, we ran HPL from the NVIDIA NGC repository. During the runs, we confirmed that the 400W GPUs were running at around 400W each and the 500W GPUs were pulling around 500W each. With eight GPUs, that meant ~3.2kW of GPU power in the air-cooled system and ~4kW in the liquid-cooled system, so the liquid-cooled GPUs were indeed pulling a combined 800W more.
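For readers who want to replicate the per-GPU power check, here is a minimal sketch. It simply polls nvidia-smi and totals the reported draw across the GPUs; the query flags are standard nvidia-smi options, but the polling loop around them is our own illustration rather than the exact tooling used for this article.

```python
# Minimal sketch: poll per-GPU power draw with nvidia-smi and total it.
# The nvidia-smi query flags are standard; the loop and interval are assumptions.
import subprocess
import time

def gpu_power_draw_watts():
    """Return a list of per-GPU power draw readings in watts."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [float(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    for _ in range(10):                       # sample roughly once per second
        draws = gpu_power_draw_watts()
        per_gpu = ", ".join(f"{w:.0f}W" for w in draws)
        print(f"per-GPU: {per_gpu}  |  total: {sum(draws) / 1000:.2f}kW")
        time.sleep(1)
```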
The performance numbers came out, and they were a bit shocking. We got around 100TFLOPS on the 500W system and around 80TFLOPS on the 400W system, likely due in part to the 80GB memory configuration on the 500W SXM modules. That is roughly 20% better performance, and we did not expect the delta to be that big. For many buyers, getting 20% more performance from GPUs using 25% more power is perfectly reasonable. That, however, was only part of the story.
Taking a quick look at the power consumption during the runs, we could see a notable delta.
- Air-cooled 8x NVIDIA A100 40GB 400W: 5164W
- Liquid-cooled 8x NVIDIA A100 80GB 500W: 5364W
We effectively only saw a 200W difference between the systems running Linpack.
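As a quick worked calculation from those wall-power figures (nothing beyond the numbers already listed above):

```python
# Worked arithmetic from the measured wall-power figures above.
air_w = 5164      # air-cooled 8x A100 40GB 400W system, watts at the wall
liquid_w = 5364   # liquid-cooled 8x A100 80GB 500W system, watts at the wall

delta_w = liquid_w - air_w
delta_pct = 100 * delta_w / air_w
gpu_delta_w = 8 * (500 - 400)   # expected delta from GPU power limits alone

print(f"System delta: {delta_w}W ({delta_pct:.2f}% more at the wall)")
print(f"GPU power-limit delta: {gpu_delta_w}W")
# System delta: 200W (3.87% more at the wall)
# GPU power-limit delta: 800W
```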
At this point, you may be wondering what is going on. We watched each of the eight GPUs consume 100W more during the runs, yet instead of an 800W delta, we saw only a 200W delta. The answer is perhaps the coolest part: it is the liquid cooling.
What was actually happening here is that the chassis fans were not spinning up to cool the system. At a server level, that means that even with 800W more electrical load from the GPUs, we saw a net increase of only 200W in power consumption.
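If you want to verify that the fans are the difference on your own systems, one common approach is to read the BMC fan sensors during a run. The sketch below shells out to ipmitool; sensor names and output formats vary by platform, so treat this as an illustrative approach rather than the exact procedure we used.

```python
# Illustrative sketch: read chassis fan sensors over IPMI during a run.
# "ipmitool sdr type Fan" is a standard ipmitool query; the sensor names
# and reading format depend on the platform's BMC.
import subprocess

def print_fan_sensors():
    out = subprocess.check_output(["ipmitool", "sdr", "type", "Fan"], text=True)
    for line in out.strip().splitlines():
        # Typical line: "FAN1 | 41h | ok | 29.1 | 4200 RPM" (format varies)
        print(line)

if __name__ == "__main__":
    print_fan_sensors()
```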
That net power figure is one many vendors will show, but it is also a bit misleading. In the rack, there is an Asetek cooling unit, including pumps, that uses power as well. This setup is also using water coming in from the building, so there are resources being consumed outside of the servers. One could, of course, argue that the air-cooled system requires air conditioning in the room or a rear heat exchanger to keep ambient temperatures reasonable, and that requires power as well.
All told, this is exactly the type of benefit folks talk about with liquid cooling. At a server level, we managed to get approximately 20% more performance for less than 4% more power consumption. While the GPUs cost a bit more (we were not given firm pricing, but NVIDIA does charge a premium), and liquid cooling costs more, getting that extra performance out of each system means we need fewer servers, NICs, switch ports, and so forth, which leads to massive cost savings. On a $100,000 server, 20% better performance equates to around $20,000 in benefit, which pays for a lot of quick-connect fittings.
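As a back-of-the-envelope check on that argument, here are the server-level efficiency numbers and the cost framing using the figures above; the $100,000 server price is the illustrative number from the text, not a quote.

```python
# Back-of-the-envelope efficiency and cost figures from the numbers above.
air = {"tflops": 80, "watts": 5164}      # 8x A100 40GB 400W, air-cooled
liquid = {"tflops": 100, "watts": 5364}  # 8x A100 80GB 500W, liquid-cooled

for name, system in (("air", air), ("liquid", liquid)):
    gflops_per_watt = 1000 * system["tflops"] / system["watts"]
    print(f"{name}: {gflops_per_watt:.1f} GFLOPS/W at the wall")
# air: 15.5 GFLOPS/W, liquid: 18.6 GFLOPS/W

# Illustrative cost framing from the article: on a $100,000 server,
# ~20% more performance per box is roughly $20,000 of hardware
# (servers, NICs, switch ports) you do not have to buy elsewhere.
server_cost = 100_000
perf_uplift = 0.20
print(f"Benefit per server: ${server_cost * perf_uplift:,.0f}")
```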
Final Words
There are a lot of folks hesitant about liquid cooling for obvious reasons. At the same time, the benefits of liquid cooling are already here today, offering gains like 20% better Linpack performance for a ~3.8% increase in power consumption.
As chips use more power and generate more heat in 2022, liquid cooling will be required for an incrementally greater portion of the server market.
The fact is that adding liquid cooling to a data center, or even just to servers, is not a trivial task. Each facility is different, and environments differ based on physical location as well. As a result, if you want to cool the top-end gear that will arrive in 2022 and beyond, you need to be thinking about that today.
I just wanted to say thank you again to Supermicro for helping me with this article and getting the lab set up for this testing. If you are looking at liquid cooling in the future, they have been providing these solutions to large customers for years, and we highlighted some of both the older and newer generations in this article. They now have an entire practice set up to help customers transition to different types of liquid cooling.
For our readers, our hope is that this gives some tangible independent data on the impact of liquid cooling. It is a big project to add newer cooling to infrastructure, but it is also something that those looking to deploy high-end 2022-and-beyond solutions will need to utilize. Indeed, this was the first time we could show the benefit because we had access to the top-end NVIDIA A100 80GB 500W GPUs.
Comments
The kiddies on desktop showed the way.
If data centers want their systems to be cool, just add RGB all over.
I’m surprised to see that Fluorinert® is still being used, due to its GWP and direct health issues.
There are also likely indirect health issues for any coolant that relies on non-native halogen compounds. They really need to be checked for long-term thyroid effects (and not just via TSH), as well as both short-term and long-term human microbiome effects (and the human race doesn’t presently know enough to have confidence).
Photonics can’t get here soon enough; otherwise, plan on a cooling tower for your cooling tower.
Going from 80TFLOP/s to 100TFLOP/s is actually a 25% increase, rather than the 20% mentioned several times. So it seems the Linpack performance scales perfectly with power consumption.
Very nice report.
Data center transition for us would be limited to the compute cluster, which is already managed by a separate team. But still, the procedures for monitoring a compute server liquid loop are going to take a bit of time before we become comfortable with them.
Great video. Your next-gen server series is just fire.
Would be interesting to know the temperature of the inflow vs. outflow. Is the warmed water used for something?
Hello
I am a tailor and I sew traditional pagne garments, etc.
I want to create a workshop for the children of tomorrow.
Thank you very much.
Hi Patrick.
Liked your video Liquid Cooling High-End Servers Direct to Chip.
Have you seen ZutaCore’s innovative direct-on-chip, waterless, two-phase liquid cooling?
Will be happy to meet at the OCP Summit and give you the two-phase story.
Regards,