Ampere AmpereOne A192-32X Review: A 192 Arm Core Server CPU


Supermicro MegaDC ARS-211M-NR Performance

Now for what everyone wants to know: let us chat a bit about what happens when you get an Ampere AmpereOne A192-32X processor. There are 192 cores without SMT, so one gets 192 threads. This is similar to the Intel Xeon 6 6700E Sierra Forest insofar as SMT is absent, and this is more of an E-core than a P-core design in x86 parlance.

One area where this CPU does extremely well is simply letting all of the cores run at 3.2GHz. Here is stress-ng running across 192 cores, with the 3.2GHz clock speed held on all 192 cores.

AmpereOne A192 32X 3.2GHz All Cores Stress Ng Small
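If you want to sanity-check this kind of all-core clock behavior yourself, here is a minimal sketch (assuming a Linux system with the cpufreq sysfs interface exposed, run while stress-ng is loading the cores); it is not our test harness, just an illustration:

```python
# Minimal sketch: read the current frequency of every online core via sysfs.
# Assumes Linux with cpufreq exposed; run while stress-ng is loading the CPU.
from pathlib import Path

freqs_mhz = []
for cpu_dir in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    freq_file = cpu_dir / "cpufreq" / "scaling_cur_freq"
    if freq_file.exists():
        freqs_mhz.append(int(freq_file.read_text()) / 1000)  # kHz -> MHz

print(f"cores reporting: {len(freqs_mhz)}")
print(f"min/max clock: {min(freqs_mhz):.0f} / {max(freqs_mhz):.0f} MHz")
```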

In some server CPU architectures, one might see a few cores run faster while others run a bit slower; that was especially the case in generations past. AmpereOne is designed so all cores can run up to the same speed. Starting with the AMD EPYC Bergamo series, we have seen some of the competitive x86 architectures behave similarly.

SPEC CPU2017 Results

SPEC CPU2017 is perhaps the most widely known and used benchmark in server RFPs. We do our own SPEC CPU2017 testing, and our results are usually a few percentage points lower than what OEMs submit as official results. The gap is a consistent ~5%, simply because of all of the optimization work OEMs do for these important benchmarks. Since there are official numbers at this point, it feels right to use them if we are talking about an industry benchmark.

Ampere AmpereOne A192-32X SPECrate2017_int_base Results

We are using the official results here so that means optimized compilers. Ampere would suggest using all gcc and shows its numbers for de-rating AMD and Intel to gcc figures for this benchmark. That discussion is like debating religion.

One could argue that GCC is the least common denominator, so that is the right way to look at this. On the flip side, the biggest chip company, NVIDIA, has gotten to where it is using CUDA and an optimized toolchain. If we say we need to use GCC here, then do we need to shy away from using NVIDIA’s tools to look at its AI performance? We think that since official submissions are free to use whatever compiler they want, that should be the methodology we use.

First off, there is a huge jump in performance between the Altra Max and AmpereOne. 50% more cores, but almost double the performance.

When we compare the results to AMD, Turin Dense is just a monster. AMD has twice the per-socket performance and twice the threads. Ampere might argue compilers, and that its 1 core/ 1 thread design gives it the ability to do 192x 1 vCPU VMs per CPU. AMD would likely counter that it can do 192x 2 vCPU VMs per CPU.

Compared to Intel, again using a different compiler, the 144-core Intel Xeon 6780E is very close. Ampere can say it can do 192x 1 vCPU VMs per CPU while Intel can only do 144. If you are a cloud provider, more vCPUs means more customers per system.

With Sierra Forest-AP launching in Q1 2025, we would assume that the 288-core part will hit 1250-1410. AmpereOne M, if it scales linearly to 256 cores, would be at 936. That is close to the AMD EPYC Bergamo 128-core/ 256-thread part.
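To make that linear-scaling assumption explicit, here is a back-of-the-envelope sketch; the baseline value is a placeholder, not a published SPECrate2017_int_base result:

```python
# Back-of-the-envelope core-count scaling; base_score is a placeholder value,
# not an official published result.
def scale_by_cores(base_score: float, base_cores: int, target_cores: int) -> float:
    # Assumes per-core throughput holds constant as core count grows.
    return base_score * target_cores / base_cores

# e.g. estimating a hypothetical 256-core part from a 192-core baseline
print(round(scale_by_cores(base_score=700.0, base_cores=192, target_cores=256)))  # ~933
```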

STH nginx CDN Performance

On the nginx CDN test, we are using an old snapshot and access patterns from the STH website, with DRAM caching disabled, to show what the performance looks like fetching data from disks. This requires low-latency nginx operation plus an additional step of low-latency I/O access, which makes it interesting at a server level. Here is a quick look at the distribution:

Ampere AmpereOne A192-32X STH nginx CDN

Just as a quick note, the configuration we use is a snapshot of our live configuration. Here, nginx is one of the workloads that is very well-optimized for Arm, but we probably have some room to grow in terms of optimizing our own configuration for Arm. Still, it is about what we would expect, with AmpereOne being roughly core-for-core competitive with Sierra Forest and ahead of AMD EPYC Bergamo on a per-socket basis. We also get slightly better than per-core scaling here over Altra Max. Note: We ended up buying an ASRock Rack barebones based on the ASRock Rack ALTRAD8UD-1L2T we reviewed and building a single-socket Altra Max platform for this.
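For a rough sense of what a test like this exercises, here is a minimal sketch (not our actual harness) that replays a small set of paths against an nginx instance and reports latency percentiles; the endpoint and paths are placeholders:

```python
# Minimal sketch of replaying an access pattern against nginx and measuring latency.
# BASE_URL and PATHS are placeholders, not the STH snapshot or access log.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8080"
PATHS = ["/", "/tag/arm/", "/wp-content/uploads/sample.jpg"]

def fetch(path: str) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(BASE_URL + path) as resp:
        resp.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=64) as pool:
    latencies = sorted(pool.map(fetch, PATHS * 1000))

print("p50:", statistics.median(latencies))
print("p99:", latencies[int(len(latencies) * 0.99)])
```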

This is somewhat of a strange benchmark for a different reason. To give you some sense, if you had under 1 million page views/day, or roughly 11.5 page views/second, of just serving relatively static content, you could probably handle that on a single-socket AmpereOne 192-core part. If you look at a distribution of websites, that site would probably be one of the top 25,000 sites in the world. For most websites, you are probably running on a few vCPUs instead of on an entire server.

MariaDB Pricing Analytics

This is a very interesting one for me personally. The origin of this test is that we have a workload that runs deal management pricing analytics on a data set that has been anonymized from a major data center OEM. The application effectively looks for pricing trends across product lines, regions, and channels to determine good deal/ bad deal guidance based on market trends to inform real-time BOM configurations. If this seems very specific, it is; the big difference between this and something deployed at a major vendor is the data we are using. This is the kind of application that has moved to AI inference methodologies, but it is a great real-world example of something a business may run in the cloud.
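To give a feel for the shape of this workload, here is a hypothetical sketch of the kind of aggregation-heavy query it issues; the table and column names are invented for illustration and are not the real application or schema:

```python
# Hypothetical sketch of an aggregation query over anonymized deal data.
# Schema, credentials, and names are invented for illustration only.
import mariadb  # pip install mariadb

conn = mariadb.connect(host="localhost", user="analytics",
                       password="changeme", database="deals")
cur = conn.cursor()
cur.execute(
    """
    SELECT region, product_line, channel,
           AVG(discount_pct) AS avg_discount,
           COUNT(*)          AS deal_count
    FROM deals
    WHERE quote_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
    GROUP BY region, product_line, channel
    ORDER BY avg_discount DESC
    """
)
for region, product_line, channel, avg_discount, deal_count in cur:
    print(region, product_line, channel, round(avg_discount, 2), deal_count)
conn.close()
```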

Ampere AmpereOne A192-32X MariaDB Pricing Analytics

This is very similar to the nginx test in terms of optimization notes. Open databases are widely used in cloud instances, so the underlying software may be better optimized than our port of the application at this point. Still, this is effectively a real-world tool that has run tens of billions of dollars of data center hardware deals through it (of course, using different data), making it a very real-world business application.

This is a big upgrade over something like a Cascade Lake Xeon (2nd Gen Intel Xeon Scalable). If you were upgrading, then you would be looking at a >5:1 socket consolidation.

STH STFB KVM Virtualization Testing

One of the other workloads we wanted to share is from one of our DemoEval customers. We have permission to publish the results, but the application itself being tested is closed source. This is a KVM virtualization-based workload where our client is testing how many VMs it can have online at a given time while completing work under the target SLA. Each VM is a self-contained worker. This is very akin to VMware VMmark in terms of what it is doing, just using KVM to be more general.
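As a trivial illustration of the KVM side of this (a sketch assuming libvirt-python and a local qemu:///system hypervisor, not the client's closed-source workload), counting the worker VMs currently online looks like this:

```python
# Minimal sketch: count VMs currently online on a KVM host via libvirt.
# Assumes libvirt-python is installed and qemu:///system is reachable.
import libvirt

conn = libvirt.open("qemu:///system")
try:
    running = [dom.name() for dom in conn.listAllDomains() if dom.isActive()]
    print(f"{len(running)} VMs online:", running)
finally:
    conn.close()
```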

Ampere AmpereOne A192-32X KVM STFB SLA Virtualization

Here, the XL and L VM tile results are being impacted by the memory footprints, so really, we are looking at the right three results. This is tuned more for dual-socket workloads, but it is still interesting. Overall, the results are very good. The Intel Xeon 6780E 1P did not make this cut, but you can assume it is a bit higher than the 6766E here since it is a similar core count, just at higher clock speeds due to higher TDP.

AmpereOne does well here, albeit the x86 camp’s higher thread counts help it in this test. Also, with 64GB DIMMs, the 12-channel platforms get more RAM and memory bandwidth, which helps as well.

SSD and NIC Performance

Something that is important with these new platforms is looking at the performance of not just the CPU itself, but all of the connected devices. We tried the Kioxia CM7 and Solidigm D7-PS1010 SSDs, and they were within 1.5% of our Sapphire Rapids baseline platform. That is within a margin of error.

Kioxia CM7 NVMe SSD Front 2

We did not see a big impact on performance based on the SSD used, as we did with the AMD EPYC 9965.

In our recent testing, we also saw a fairly big impact on the AMD EPYC 9965 based on the NIC used. Here, we tried the Broadcom 400GbE NIC.

Broadcom 400GbE OCP NIC 3.0 Angle 2

We also tested NVIDIA ConnectX-6 and ConnectX-7 solutions to cover dual 100GbE and dual 200GbE configurations.

Supermicro AOC A100 M2CM 1

We also utilized an NVIDIA BlueField-3 DPU with its dual 200GbE connections. Is this needed? Certainly not. However, adding a card with 16 Armv8.2+ A78 cores to a 192 Arm core server is a bit fun and Inception-like.

NVIDIA BlueField 3 DPU Front

Again, we were within a margin of error for these NIC platforms compared to the 4th Gen Intel Xeon baseline. We did not need to get a 400GbE NIC on the CPU in order to alleviate a bottleneck as we saw with the AMD EPYC 9965.

Next, let us talk about power consumption.

9 COMMENTS

  1. How significant do you think the 8 vs. 12 channel memory controller will be for the target audience?

    Lots of vCPUs for cloud-scale virtualization is all well and good as long as you aren’t ending up limited by running out of RAM before you run out of vCPUs. Otherwise you’re stuck offering really awkward ‘salvage’ configs: either giving people more vCPUs than they actually need because you’ve still got cores to allocate after you’ve run out of RAM, or compute-only VMs with barely any RAM and whatever extra cores you have on hand; or you’re just paying a premium for the densest DIMMs going.

    Is actual customer demand in terms of VM configuration/best per-GB DIMM pricing reasonably well aligned for 192 core/8 channel; or is this a case where potentially a lot of otherwise interested customers are going to go with Intel or AMD for many-cores parts just because their memory controllers are bigger?

  2. You’ve gotta love the STH review. It’s very fair and balanced taking into account real market forces. I’m so sick of consumer sites just saying moar cores fast brrr brrr. Thank you STH team for knowing it isn’t just about core counts.

    I’m curious what real pricing is on AMD and Intel now. I don’t think their published lists are useful

  3. We might finally pick up an Arm server with one of these. You’re right they’re much cheaper than a $50K GH200 to get into.

  4. “We are using the official results here so that means optimized compilers. Ampere would suggest using all gcc and shows its numbers for de-rating AMD and Intel to gcc figures for this benchmark. That discussion is like debating religion.”

    Question to ask is “Do any real server chip customers actually use AOCC or ICC compilers for production software?”

    Also, to use CUDA in the argument is suspect, IMO, given it’s GPU, not CPU, centric optimizations.

  5. It’s a great review.

    JayBEE I don’t see it that way. It’s like you’ve got a race with rules. They’re showing the results based on the race and the rules of the race.

    I’d argue it hurts Ampere and other ARM CPUs that they’re constantly having to say well we’re going to use not official numbers and handicap our competition. It’s like listening to sniveling reasons why they can’t compete according to race rules. I’d rather just see them say this is what we’ve got. This whole message of we can’t use ICC or AOCC just makes customers also think if they can’t use ICC or AOCC what else can’t these chips do? I can’t just spin up my x86 VM’s as is to ARM, forget any hope of live migration. Arm’s marketing message just falls flat because it’s re-enforcing what they can’t do. For the cloud providers that own software stacks they don’t care. It’s also why the HPE RL300 G11 failed so hard they don’t have AmpereOne.

    That’s something I think STH could have harped on more. If you’re migrating x86 instances, even if it isn’t a live migration, it is turn off, then on to go between AMD and Intel. You’re rebuilding for ARM. Even if the software works great, there’s extra steps.

  6. I can tell you that my company does not use specialized compilers, namely AOCC or ICC, when evaluating AMD, Intel, and Ampere products. We want as best “apples to apples” comparisons as possible when evaluating performance across different server offerings. Results generated by special compilers, compilers my company will never use, are of no interest to our performance evaluations.

    And let’s not forget that some of the specialty compiler optimizations were deemed invalid by SPEC.
    https://www.servethehome.com/impact-of-intel-compiler-optimizations-on-spec-cpu2017-example-hpe-dell/

  7. I don’t think most enterprises run their own apples to apples on this kind of thing. How do they know they’ve tuned properly for each? The server vendor tells them? In this case, that isn’t Dell Lenovo or HPE. That’s why most orgs just have the SPEC CPU2017 in their RFP’s.

    SJones that was 3 generations ago, and stopped being relevant with emerald, right? It’s only Intel not AMD too, right?

  8. xander1977

    SPEC ruled an AOCC and ICC optimization for 505.mcf_r as a violation, but there had been so many scores already published with it, they withdrew it. Can’t find the link at the moment. This was an optimization that GCC did not implement. With 505.mcf_r being one of the lower resulting tests, this huge improvement from the optimization had a large impact on the overall SIR score since the overall is the geomean of the 10 individual tests.

    While “apples to apples” is difficult to achieve, a critical part of that work for us is in fact using common GCC versions across architectures. This also helps us identify areas of potential code/compiler improvements to pursue.

  9. JayBEE asked “Do any real server chip customers actually use AOCC or ICC compilers for production software?”

    From my perspective the kind of customers who run the kind of software focused on by SPEC CPU are likely to employ experts whose main job is helping others tune the compiler and application to the hardware. If you are not that customer, then making a hardware decision based on SPEC is similar to choosing the family car based on the success of a racing team sponsored by the same manufacturer.

    On the other hand Intel had been donating much of their proprietary compiler technology to GCC and LLVM. The result allows Intel to focus on x86 performance optimisations while language standards and conformance are handled by others. Something similar needs to happen for ARM and I suspect it does.
