AMD has hit the reset button on ROCm and quietly launched v5.0 this week. I could not be more excited. With it, they have added support for the MI200 server accelerator announced last November as well as enhanced support for RDNA2. It also seems to be doing a better job of providing a more robust experience. There is even a timely mention of FPGAs.
AMD Hits the ROCm Reset Switch with 5.0 – Get Excited
For those that do not know ROCm (just ROCm now, not an acronym), this is AMD’s accelerated computing framework. You are most likely to encounter it when using AMD GPUs for compute and it is in many ways AMD’s approach to counter something like NVIDIA CUDA. To put it bluntly, NVIDIA has a head-start in the space with CUDA, and AMD is catching up.
The MI25 is deprecated and will continue to be supported under the 4.5 branch until November 2022. This is important for several reasons. First, the MI25 has a small deployment base, and it either lacks the hardware features needed for continued development or would support them only at a great performance penalty. Secondly, they are not dropping support for it in the driver, they are simply moving forward as NVIDIA does, with new hardware, a new CUDA version.
The new ROCm installer supports multi-version installs, uninstalls, as well as updates.
With this release, AMD is in a much better place than it was just 6 months ago but challenges persist. The most notable is their relatively fragile software recipe. This has to do with communication with OS partners, their bug tracking, tester hardware resources, and documentation fragmentation.
To address their documentation issues, AMD has launched a new documentation hub: https://docs.amd.com/. It covers ROCm V4.5 and 5.0. For older versions, https://rocmdocs.amd.com/ remains active. While this increases the fragmentation, it is also a fresh start and a renewed focus on accurate documentation.
Each release of ROCm has a supported kernel list, and with previous ROCm releases I have encountered a broken recipe less than a week after launch. Even now on V5.0 they list support for Ubuntu 20.04.3 5.11 HWE, even as Ubuntu has just rolled to 5.13. In the past, this has broken things. Today it has not. I did a quick install to test 5.13 HWE as well as RDNA2 support and was pleasantly surprised that I did not have to drop back to the 5.8 kernel, which would be the default recommendation in this situation. HWE kernels are moving targets, so this is a big improvement. This is growth for AMD in terms of the stability of recipes, but it is also an area of communication they need to develop with Canonical.
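For readers who want to catch this kind of drift before installing, here is a minimal sketch of comparing the running kernel against a supported list. It is only an illustration: the `LISTED_KERNELS` set and both function names are hypothetical, and the single 5.11 entry simply mirrors the HWE kernel mentioned above. Take the real values from the release notes for the ROCm version you are installing.

```python
import platform
import re

# Hypothetical allow-list for illustration only. The (5, 11) entry mirrors the
# Ubuntu 20.04.3 HWE kernel mentioned in the article; consult the ROCm release
# notes for the actual supported kernels.
LISTED_KERNELS = {(5, 11)}


def kernel_major_minor(release):
    """Extract (major, minor) from a string such as '5.13.0-30-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release string: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_listed(release, listed=LISTED_KERNELS):
    """Return True if the kernel's major.minor version is on the list."""
    return kernel_major_minor(release) in listed


if __name__ == "__main__":
    running = platform.release()  # on Linux, the same string `uname -r` prints
    try:
        status = "listed" if is_listed(running) else "NOT listed"
    except ValueError:
        status = "not parseable as a Linux kernel version"
    print(f"Running kernel {running}: {status}")
```

A kernel that is not on the list is not necessarily broken, as the 5.13 HWE test above shows. A check like this only flags that you are off the documented recipe.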
Add another reason to be excited. AMD dropped this easter egg in the documentation.
With the upcoming acquisition of Xilinx comes FPGAs and they appear to be driving to support them under the same ROCm framework as their GPU-based accelerators. Truly exciting.
Final Words
Evolving ROCm is part of AMD’s charge when it comes to getting its accelerators adopted in the data center. With two Exascale systems coming from AMD and using its GPUs, AMD needs to keep investing in a software platform that scales.
In a few days, the AMD-Xilinx acquisition will close after a statutory waiting period. At that point, AMD will have CPU, GPU, and FPGA compute resources, along with a number of acceleration technologies. NVIDIA has its CUDA base, and Intel has its oneAPI. The big question is how AMD will integrate ROCm and Xilinx Vitis going forward.
“…documentation fragmentation…”
You are a very kind person, Bryan.
@hoohoo
+1
“…they are not dropping support for it in the driver, they are simply moving forward as NVIDIA does, with new hardware, a new CUDA version.”
The latest CUDA 11.6 supports the K20/K40 GPUs from 2013 (they are deprecated, but still supported in this version of CUDA).
ROCm 5.0 on the other hand drops support for the MI25 GPU from 2017…
I’d say it’s not really “as NVIDIA does” – hardware support in ROCm is still much worse than in CUDA.
Does ROCm 5.0 deprecating MI25 and presumably other Vega 10 chips mean they are not supported by the latest 5.0 release or that support is there but slated to be removed later?
if I could blink and 3 years of Intel development transpire when I open my eyes, my site would gladly shut our eyes for 3 years necessary for that to happen
if I could have anything to make work easier it’s a unified API to AMD and Intel GPU and FPGA w some shared hard ip blocks especially RISCV control cores in common.
the rate at which any business using hardware acceleration successfully actually gets to expand so quickly leaves behind the ability to live safely in a monoculture even from a supplier negotiations pov that is frightening the moment you take off your dollar pink shades…
I felt like this sentence in Para 1 is orphaned by some missing referential context or parent clause:
” It also seems to be doing a better job of providing a more robust experience. ”
am I missing meaning?
is there any chance of getting review samples of any kind of PCI switching systems as featured in your SC21 top ten?
+*awesome SC21 report..might have been minimalist for the current blebs [bling plus Web plus plebs?] but it gave us an instant lead into our fiscal 23 budget direction decisions. Direct. we trade advertising so this is like the most gilt edged triple A rating from every agency plus Grant’s observer all squared. Bullseye.*+
jmk – we have the AIC JBOX incoming. I believe that is what you are referring to for the PCIe switching systems.
@Eric, looking at the release notes for ROCm 4.5 and 5.0, MI25 support was deprecated in 4.5.x and is now removed in 5.0.x.
RDNA support for W6800 overpriced card, if you look at the docs. We (the SME crowd) have been asking for ROCm to support the gaming cards.
I guess Dr. Lisa Su still has not figured it out: you get broad uptake of a tech by making it broadly applicable.
nVidia took over the world with CUDA by making it work on the entire product stack.
@Andrea Bocci Does this mean that MI25 can not be used (run or be detected) with ROCm 5.0+? or can we still ‘run’ the OpenCL and HIP codes using ROCm 5.0+?
The time between deprecation in 4.5 and removed in 5.0 was like less than a few months! What help is it if there continues to be driver support but no library/compiler/runtime support? I want to support AMD but they always seem to do everything in their power to frustrate anyone who tries to use their devices.