The Most Important Server of 2022: The Gigabyte Ampere Altra Max and NVIDIA A100


Gigabyte G242-P32 Topology and Running

Here is the system topology. One can see that a number of the x16 slots are actually only x8 electrical. Still, the Ampere Altra provides more PCIe connectivity than the Ice Lake generation Xeons in single socket configurations.

Gigabyte G242 P32 Block Diagram
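
As a rough illustration of what the x16 versus x8 electrical difference means, here is a minimal Python sketch of theoretical per-direction PCIe link bandwidth. It assumes the standard per-lane signaling rates (8, 16, and 32 GT/s for Gen3, Gen4, and Gen5) and 128b/130b encoding, and it ignores packet and protocol overhead, so real-world throughput will be lower. The function name `pcie_bandwidth_gbytes` is ours for illustration and does not come from any tool.

```python
# Theoretical per-direction PCIe bandwidth, before protocol overhead.
# Assumption: standard per-lane rates and 128b/130b encoding (Gen3 and later).
GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}  # giga-transfers per second, per lane
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Return approximate GB/s in one direction for a link of the given width."""
    gbits_per_sec = GT_PER_SEC[gen] * ENCODING_EFFICIENCY * lanes
    return gbits_per_sec / 8  # bits to bytes


for lanes in (16, 8):
    print(f"Gen4 x{lanes}: {pcie_bandwidth_gbytes(4, lanes):.2f} GB/s per direction")
# Gen4 x16: 31.51 GB/s per direction
# Gen4 x8: 15.75 GB/s per direction
```

In other words, a physical x16 slot that is only x8 electrical offers roughly half the theoretical link bandwidth to whatever card is installed in it.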

This system looks more like an AMD EPYC server than an Intel Xeon server with a PCH. The CPU is the center of the whole system rather than offloading low-value functionality to a PCH.

Once the server is running, here is the lscpu output with the Arm Neoverse-N1 cores. Since we are using the Ampere Altra Max 128-core CPU, we get 128 cores. These are all physical cores, without the use of SMT/Hyper-Threading.

Gigabyte G242 P32 Ampere Altra Max Ubuntu 22.04.1 With 128 Cores Lscpu
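
For readers who want to check the "cores without SMT" point on their own systems, here is a small Python sketch that parses lscpu-style `key: value` text and derives the physical core count from the logical CPU count and threads per core. The sample text is an illustrative stand-in, not the output captured from this system, and the helper names `parse_lscpu` and `physical_cores` are hypothetical. Some lscpu versions label fields differently on Arm, so treat the field names as an assumption.

```python
def parse_lscpu(text: str) -> dict[str, str]:
    """Turn lscpu-style 'Key: value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def physical_cores(fields: dict[str, str]) -> int:
    """Logical CPUs divided by threads per core gives physical cores."""
    return int(fields["CPU(s)"]) // int(fields["Thread(s) per core"])


# Illustrative sample only; not copied from the screenshot in this article.
SAMPLE = """Architecture: aarch64
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
"""

fields = parse_lscpu(SAMPLE)
print("physical cores:", physical_cores(fields))
print("SMT enabled:", int(fields["Thread(s) per core"]) > 1)
```

On a live Linux system, the input text could come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout` instead of the sample string.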

Here is a screenshot of nvidia-smi with a single GPU.

Gigabyte G242 P32 Ampere Altra Max Ubuntu 22.04.1 With NVIDIA A100 80GB

As a quick note, we simply had to download and install drivers, similar to how we do for NVIDIA x86 systems, and features like nvidia-smi worked immediately. We also installed NVIDIA for Docker and ran NVIDIA NGC containers quickly.

Gigabyte G242-P32 Performance

A big part of this is “why Ampere?” The Ampere Altra and Altra Max are very interesting parts. In some tasks, software is well optimized and Ampere is able to take advantage of its high core counts. At the same time, the hyper-scale deployments at Oracle Cloud, Google Cloud, and Microsoft Azure have used the Altra, not the Altra Max, with lower core counts than we are using here. In general, the Ampere Altra line is designed to have smaller cores that prioritize fairness for cloud VMs over absolute single-core performance.

Ampere Altra Max M128 30 With 2x NVIDIA A100 80GB PCIe 6

NVIDIA came out with a number of benchmarks showing Arm vs. x86 with the NVIDIA A100.

NVIDIA MLperf Inference 2

Part of the Ampere Altra CPU's value proposition is a focus on integer performance. After all, if you want faster floating point performance for AI or HPC, you may as well offload to the GPU at that point.

NVIDIA A100 80GB PCIe 3

We came pretty close to NVIDIA’s ResNet-50 numbers so it seems reasonable. Also, NVIDIA is probably better at tuning its MLPerf submissions than we are, but our results are close even on the Arm platform.

NVIDIA Arm ResNet 50 Inference

The Ampere platform is also using ~80W less power than what we would expect to see from an EPYC platform in this configuration.

Ubuntu installed without a hitch. We could compile and run benchmarks like STREAM quickly. There is a bit more tuning to do here, but decent results for sure.

Gigabyte G242 P32 Ampere Altra Max Ubuntu 22.04.1 With NVIDIA A100 Stream Example
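
For context on what a STREAM number represents, here is a minimal Python sketch of the Triad kernel and how its rate is derived: each iteration computes `a[i] = b[i] + scalar * c[i]`, and the benchmark counts three 8-byte words moved per element (two reads and one write), reporting bytes divided by elapsed time in MB/s. This is not the real benchmark, which is compiled code. In pure Python the result reflects interpreter overhead rather than memory bandwidth, so the sketch is only meant to show the arithmetic. The helper names are hypothetical.

```python
import time
from array import array


def triad_bytes(n: int, word_size: int = 8) -> int:
    """Bytes counted for Triad: read b, read c, write a."""
    return 3 * word_size * n


def run_triad(n: int, scalar: float = 3.0):
    """Run one Triad pass over n doubles and return (result, MB/s)."""
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    a = array("d", (b[i] + scalar * c[i] for i in range(n)))
    elapsed = time.perf_counter() - start
    return a, triad_bytes(n) / elapsed / 1e6  # MB/s, using 10^6 bytes per MB


a, rate = run_triad(100_000)
print(f"Triad over {len(a)} elements: {rate:.1f} MB/s (interpreter-bound)")
```

The tuning mentioned above would apply to the compiled benchmark on the real system, and none of it is captured by this sketch.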

When I discuss the Ampere Altra Max CPUs, I often say they are somewhere between an AMD EPYC 7763 and 7773X. Here are a few examples of that. NGINX is a highly optimized workload. Better put, this is perhaps one of the best workloads for Arm and is why we see it in every benchmark. This is running STH’s website access traces and one can see that, for our hosting, Ampere would be great.

Gigabyte Ampere Altra MAX Q128 30 STH Nginx CDN Performance

In our MariaDB pricing analytics workload (deal desk analytics via a sanitized transaction list from a major enterprise OEM), we see that the AMD EPYC 7773X performs exceedingly well because of its larger caches. Usually this is a workload that the Arm part performs well in, but the AMD 3D V-Cache is a huge benefit. As a result, the Altra Max falls between the 64-core AMD parts.

Gigabyte Ampere Altra MAX Q128 30 MariaDB Pricing Analytics

Since the focus here is really on GPU/PCIe performance, here is the Ampere Altra Max performance compared to other PCIe Gen4 platforms we have tested with a Kioxia CM6 PCIe Gen4 NVMe SSD. One can see better performance than we saw on the Huawei Ascend 910 with the Kunpeng 920 Arm CPUs.

Gigabyte Ampere Altra MAX PCIe Performance Kioxia CM6 6.4TB In Adapter

The best benchmark we found was actually our c-ray 8K benchmark where simply having 128 cores allowed the Ampere Altra Max to be a big winner.

Gigabyte Ampere Altra MAX C Ray 8K Performance

With all of that said, a pretty large portion of our traditional benchmarks are x86 or x86 optimized. That is something we are going to fix in our next-generation benchmarks we will unveil with Genoa later this quarter. Still, these are not anemic CPUs by any means and the single socket platform actually has some advantages over the dual socket Altra configuration.

Next, I wanted to discuss the market impact of this kind of system.

3 COMMENTS

  1. I guess I’m curious as to what makes the Ampere Altra processor so “unpleasant” to install. Why would installing this chip be any different from installing any other, Patrick? Apart from the physical size, there doesn’t appear to be anything much different between this processor and all the other Intel, AMD and IBM chips I’ve seen.

  2. Stephen, super easy to answer. AMD and Intel both use carriers. AMD SP3 uses the CPU carrier that you slide in, then it is lowered in place. On the Intel side, if you install the CPUs correctly, then you are not touching the CPU when it is lowered, aligned, into the socket. With Ampere, you need to align it by hand, whereas AMD and Intel both have mechanical alignment. Even with suction cups, the Ampere socket does not have the same level of mechanical alignment before it is lowered into the socket.

    See how to install LGA4189 correctly Article and Video.

  3. @Patrick Kennedy

    Thanks for the very fast reply. Sounds like a possible engineering/design problem with the type of socket, maybe the ILM needs to be re-designed to allow for the alignment features you mention. Otherwise, I still don’t see how this is an absolute deal-breaker. The chip package itself should have keying notches in the edges that ought to provide enough positive feedback even when loaded by hand.

    Believe it or not, I did see your LGA-4189 installation article when it came out. Very well done and very useful. Now, if only we could have some higher-resolution photos of the chips you feature. I noticed that the photos of the Ponte Vecchio chip you had earlier were only a mere 800×600-ish pixels. Seems we need an upgrade to at least 1920×1080 minimum. I have a hard time seeing much detail, specifically where contact pads are concerned. I like drawing pictures of these things and being able to see the pads well enough to count them is a must.

    After all the recent coverage of the Altra and Altra Max, I’m kind of wishing someone on eBay would start selling used ones so I can buy one. CPU collecting is one of my hobbies and I’m longing to get my first LGA-4189 Xeon, LGA-4094 Threadripper/Epyc and one of these Altra chips. I can’t afford to buy these things when new and typically have to wait at least ten years before they’ll even show up in the used listings. Of course, somebody could donate one if they wanted to, though that’s probably asking too much of the Universe.
