At Supercomputing 2018, we saw a really interesting phase change immersion cooling machine from Sugon. Liquid cooling was a key theme of the show, with next-generation CPUs, GPUs, and other accelerators marching to higher TDPs. We saw a variety of big tank immersion cooling solutions. There were cold plate water cooling solutions galore, as we showed with Cray Shasta Cooling by CoolIT. Sugon is taking a different approach with their Nebula solution.
Sugon Nebula Phase Change Immersion Cooling
Here is a view of the full system from the front. What you can see is the main immersion cooling pumping and heat exchange system in the middle (labeled Sugon) and a compute rack to the right. Each compute rack is made up of front-serviceable blades.
The blade trays themselves were extremely interesting. One can see a design with four CPUs and four GPUs per sled. The GPUs are SXM2 NVIDIA Tesla cards. The CPU sockets appeared to use the socket SP3 mounting pattern of AMD EPYC. When we asked someone in the booth whether they were Hygon Dhyana, we received a nod yes. [Editor’s note: Sugon contacted us after this article was published and said they can support Intel Xeon and AMD as well.]
Here is a view of one of these units operating on the show floor. Fluid is circulated into each sled to remove heat from components.
Moving to the rear, we can see the pipes and tubes that deliver the coolant behind the chassis. This coolant also keeps the network switches cool. In these pictures, you can see the backplane, which supports not only power to the node but also the communication fabric. Even the fabric chassis are being liquid cooled.
Making this entire system work is the Sugon Nebula unit that takes care of heat exchange, pumping, and circulation. Here is a shot from behind the center unit.
Final Words
This was an extremely interesting demo. Some of our US and European readers may not know Sugon. They are a large Chinese OEM vendor. You can see from our Top500 November 2018 New Systems Analysis that Sugon is a major HPC player capable of designing and delivering Top500 systems. Indeed, it has 57 of 500 systems on the November 2018 list. As companies look to higher TDP parts in the future, liquid and immersion cooling will be important to maintain system density.
This is a really interesting system, oil I take it? Looks like a really scaled up custom loop, like you’d see in a gaming rig. Except instead of cold plates they’re individually filling the sealed sleds and fabric with circulating fluid. What happens when a bubble of air enters the system after servicing one of those blades? Interesting demo, but being locked into proprietary sleds, rails, and fabric… Probably a no go just there. I like the idea of the full lockers/vats with standard racks set sideways (lifting servers out to maintain them) and drip drying. Just need to be careful with the heavy systems, maybe a crane apparatus or something overhead?
Michah: phase change means boiling point is about 60°C
Example of coolant: https://en.wikipedia.org/wiki/Fluorinert
interesting motherboard layout, i guess this means ROME might take the multi CPU to x4 rather than maxing out at x2 only. Might be a useful case for these super clusters/computers.
PS: @STH, why dont you add something like disqus, its easier to comment and talk about it.
The problem with such a system is that you are taking heat from the rack, adding some more thermal energy (there must be a compressor there or something) and releasing all that heat to the same environment where the rack is located. Then you will need more energy to transport that heat outdoors.
There is a reason why your AC unit has 2 pieces, one that stays outdoors and another that is indoor. We would need a similar solution here.
@Abhishek What’s wrong with the current comment system?
I wonder if these make sense for companies like AWS and DO? Sounds like it is merely for Top500 systems.
Well .. i/we have to type in name and email each time to comment and you could potentially use different names each time as well (maybe least of the issue) … but just saying .. would be better engaging with readers if STH had some kind of login or something like disqus(i am not proposing to use disqus by any means..but maybe consider if it fits STH).
@Moses
Fluorinert is still one of the best, some 35 years later (that I know of). 3M also knows this, hence it’s still way damn pricey too. Mineral oil works … for about a month or two, then the PCB degrades.
@Marcelo
heat taken from rack can be used elsewhere. E.g. some power plants use “excess” heat to provide heat for homes. Not every situation can use that, but “combined cycle” is win-win. Like 50 – 60 % thermal efficiency. China in particular has building pairs: factory / dormitory. [Not personally a fan of termite living].