Supermicro MegaDC ARS-211M-NR Performance
Now, for what everyone wants to know, let us chat a bit about what happens when you get an Ampere AmpereOne A192-32X processor. There are 192 cores without SMT, so one gets 192 threads. That is similar to the Intel Xeon 6 6700E Sierra Forest insofar as SMT is absent, and this is more of an E-core than a P-core design in x86 parlance.
One area where this CPU does extremely well is simply letting all of the cores run at 3.2GHz. This is stress-ng running across 192 cores, and the 3.2GHz clock speed holds on all 192 cores.
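For those who want to reproduce this kind of check, here is a minimal sketch of how one might load every core and then sample per-core clocks on Linux. It assumes stress-ng is installed, cpufreq is exposed through sysfs, and 192 logical CPUs as on the A192-32X.

```python
# Minimal sketch: load all cores, then sample per-core clock speeds.
# Assumptions: Linux with cpufreq sysfs, stress-ng installed, 192 logical CPUs.
import glob
import subprocess
import time

CORES = 192  # 192 cores, no SMT, so 192 threads

# Start an all-core CPU load in the background.
load = subprocess.Popen(
    ["stress-ng", "--cpu", str(CORES), "--timeout", "120s"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

time.sleep(30)  # let the clocks settle under load

# Sample the current frequency of every core (cpufreq reports kHz).
freqs_mhz = []
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"):
    with open(path) as f:
        freqs_mhz.append(int(f.read()) / 1000)

print(f"cores sampled: {len(freqs_mhz)}")
print(f"min / max clock: {min(freqs_mhz):.0f} / {max(freqs_mhz):.0f} MHz")

load.wait()
```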
In some server CPU architectures, one might see a few cores run faster while others run a bit slower. AmpereOne is designed so that all cores can run up to the same speed. That differentiation was especially pronounced in generations past; starting with the AMD EPYC Bergamo series, we have seen some of the competitive x86 architectures behave similarly.
SPEC CPU2017 Results
SPEC CPU2017 is perhaps the most widely known and used benchmark in server RFPs. We do our own SPEC CPU2017 testing, and our results are usually a few percentage points lower than what OEMs submit as official results. It is a consistent ~5% gap, simply because of all of the optimization work OEMs do for these important benchmarks. Since official numbers exist at this point, it feels right to use them when we are talking about an industry benchmark.
We are using the official results here, so that means optimized compilers. Ampere would suggest using GCC across the board and publishes figures that de-rate AMD and Intel to GCC results for this benchmark. That discussion is like debating religion.
One could argue that GCC is the least common denominator, so that is the right way to look at this. On the flip side, the biggest chip company, NVIDIA, got to where it is using CUDA and an optimized toolchain. If we say we need to use GCC here, then do we need to shy away from using NVIDIA's tools to look at its AI performance? Since official submissions are free to use whatever compiler they want, we think that is the methodology we should follow.
First off, there is a huge jump in performance between the Altra Max and AmpereOne. 50% more cores, but almost double the performance.
When we compare the results to AMD, Turin Dense is just a monster. AMD has twice the per-socket performance and twice the threads. Ampere might argue compilers, and that its 1 core/ 1 thread design lets it run 192x 1 vCPU VMs per CPU. AMD would likely counter that it can run 192x 2 vCPU VMs per CPU.
Compared to Intel, again using a different compiler, the 144-core Intel Xeon 6780E is very close. Ampere can say it can run 192x 1 vCPU VMs per CPU, while Intel can only do 144x. If you are a cloud provider, more vCPUs mean more customers per system.
With Sierra Forest-AP launching in Q1 2025, we would assume that the 288-core part will hit 1250-1410. AmpereOne M, if it scales linearly to 256 cores, would land around 936. That is close to the AMD EPYC Bergamo 128 core/ 256 thread part.
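For reference, here is the simple linear core-count extrapolation behind those projections. The per-socket baselines below are back-calculated from the estimates above rather than taken from official submissions.

```python
# Linear core-count extrapolation behind the projections above.
# Baseline scores are back-calculated from those estimates
# (936 at 256 cores, 1250-1410 at 288 cores), not official SPEC submissions.

def scale_linearly(score: float, cores_now: int, cores_future: int) -> float:
    """Project a throughput score assuming perfect per-core scaling."""
    return score * cores_future / cores_now

# AmpereOne A192-32X (192 cores) -> hypothetical 256-core AmpereOne M
print(scale_linearly(702, 192, 256))   # ~936

# Xeon 6780E (144 cores) -> 288-core Sierra Forest-AP, low/high bounds
print(scale_linearly(625, 144, 288))   # ~1250
print(scale_linearly(705, 144, 288))   # ~1410
```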
STH nginx CDN Performance
On the nginx CDN test, we are using an old snapshot and access patterns from the STH website, with DRAM caching disabled, to show what performance looks like fetching data from disks. This requires not just low-latency nginx operation but also low-latency I/O access, which makes it interesting at a server level. Here is a quick look at the distribution:
Just as a quick note, the configuration we use is a snapshot of our live configuration. nginx is one of the workloads that is already very well optimized for Arm, but we probably have some room to grow in terms of optimizing our own configuration for Arm. Still, it is about what we would expect, with AmpereOne being roughly core-for-core competitive with Sierra Forest and ahead of AMD EPYC Bergamo on a per-socket basis. We also get slightly better than per-core scaling here over Altra Max.
This is somewhat of a strange benchmark for a different reason. To give you some sense, if you had under 1 million page views per day, or about 11.5 PV/second, of relatively static content, you could probably handle that on a single-socket AmpereOne 192-core part. If you look at a distribution of websites, that site would probably be one of the top 25,000 sites in the world. Most websites are probably running on a few vCPUs instead of on an entire server.
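For reference, the back-of-the-envelope conversion behind that page view figure is simply:

```python
# Quick conversion behind the "11.5 PV/second" figure above.
page_views_per_day = 1_000_000
seconds_per_day = 24 * 60 * 60            # 86,400

# ~11.6 at exactly 1M/day; "under 1 million" works out to roughly 11.5.
print(page_views_per_day / seconds_per_day)
```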
MariaDB Pricing Analytics
This is a very interesting one for me personally. The origin of this test is a workload that runs deal management pricing analytics on an anonymized data set from a major data center OEM. The application effectively looks for pricing trends across product lines, regions, and channels to determine good deal/ bad deal guidance based on market trends and to inform real-time BOM configurations. If this seems very specific, the big difference between this and something deployed at a major vendor is the data we are using. This is the kind of application that has moved to AI inference methodologies, but it is a great real-world example of something a business may run in the cloud.
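Since the application itself is not public, here is only a heavily simplified, hypothetical sketch of the kind of aggregation it performs against MariaDB. The schema, table, and column names (deals, product_line, region, channel, discount_pct) are illustrative stand-ins, not the OEM's actual data model.

```python
# Hypothetical sketch of a pricing-trend aggregation against MariaDB.
# The real application and schema are not public; names here are illustrative only.
import pymysql

conn = pymysql.connect(host="db-host", user="analytics",
                       password="...", database="pricing")

SQL = """
SELECT product_line,
       region,
       channel,
       AVG(discount_pct)         AS avg_discount,
       STDDEV_SAMP(discount_pct) AS discount_stddev,
       COUNT(*)                  AS deal_count
FROM   deals
WHERE  quote_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
GROUP  BY product_line, region, channel
ORDER  BY deal_count DESC
"""

with conn.cursor() as cur:
    cur.execute(SQL)
    for product_line, region, channel, avg_disc, stddev, count in cur.fetchall():
        # Deals far outside the trailing average/stddev band would be flagged
        # as good deal/ bad deal guidance in the real tool.
        print(product_line, region, channel, avg_disc, stddev, count)

conn.close()
```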
This is very similar to the nginx test in terms of optimization notes. Open databases are widely used in cloud instances, so the underlying software may be better optimized than our port of the application at this point. Still, this is effectively a real-world tool that has had tens of billions of dollars of data center hardware deals run through it (of course, using different data), making it a very relevant business application.
This is a big upgrade over something like a Cascade Lake Xeon (2nd Gen Intel Xeon Scalable). If you were upgrading, you would be looking at a >5:1 socket consolidation.
STH STFB KVM Virtualization Testing
One of the other workloads we wanted to share comes from one of our DemoEval customers. We have permission to publish the results, but the application being tested is closed source. This is a KVM virtualization-based workload where our client tests how many VMs it can have online at a given time while still completing work under the target SLA. Each VM is a self-contained worker. This is very akin to VMware VMmark in terms of what it is doing, just using KVM to be more general.
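Because the application under test is closed source, we cannot show it, but a minimal sketch of the measurement side might look like the following. This assumes libvirt-python and a local qemu:///system hypervisor, with meets_sla() as a hypothetical stand-in for the client's SLA check.

```python
# Sketch of the measurement side only: count how many KVM guests are online.
# Assumptions: libvirt-python installed, qemu:///system hypervisor,
# meets_sla() is a hypothetical placeholder for the client's SLA check.
import libvirt

def meets_sla(domain) -> bool:
    # Placeholder: the real test measures each VM's completed work against a target SLA.
    return True

conn = libvirt.open("qemu:///system")
active = [dom for dom in conn.listAllDomains() if dom.isActive()]
within_sla = [dom for dom in active if meets_sla(dom)]

print(f"VMs online: {len(active)}, meeting SLA: {len(within_sla)}")
conn.close()
```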
Here, the XL and L VM tile results are impacted by memory footprints, so really we are looking at the right three results. This test is tuned more for dual-socket systems, but it is still interesting. Overall, the results are very good. The Intel Xeon 6780E 1P did not make the cut here, but you can assume it would land a bit higher than the 6766E since it has a similar core count, just at higher clock speeds thanks to a higher TDP.
AmpereOne does well here, although the x86 camp's higher thread counts help it. Also, using 64GB DIMMs, the 12-channel platforms get more RAM capacity and memory bandwidth, which helps as well.
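As a quick illustration of the capacity gap, assuming one 64GB DIMM per channel:

```python
# Capacity math behind the DIMM note above, assuming one 64GB DIMM per channel (1DPC).
dimm_gb = 64
ampereone_channels = 8   # AmpereOne A192-32X: 8 DDR5 channels
epyc_9005_channels = 12  # 12-channel x86 platforms such as the EPYC 9965

print(ampereone_channels * dimm_gb)  # 512 GB per socket
print(epyc_9005_channels * dimm_gb)  # 768 GB per socket
```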
SSD and NIC Performance
Something that is important with these new platforms is looking at the performance of not just the CPU itself, but all of the connected devices. We tried the Kioxia CM7 and Solidigm D7-PS1010 SSDs, and they were within 1.5% of our Sapphire Rapids baseline platform. That is within our margin of error.
We did not see a big performance impact based on the SSD used, as we did with the AMD EPYC 9965.
In our recent testing, we also saw a fairly big impact based on the NIC used on the AMD EPYC 9965. We tried the Broadcom 400GbE NIC.
We also tested the NVIDIA ConnectX-6 and ConnectX-7 solutions to cover dual 100GbE and dual 200GbE configurations.
We also utilized an NVIDIA BlueField-3 DPU with its dual 200GbE connections. Is this needed? Certainly not. However, adding a card with 16 Armv8.2+ Cortex-A78 cores to a 192 Arm core server is a bit fun and Inception-like.
Again, we were within the margin of error with these NICs compared to the 4th Gen Intel Xeon baseline. We did not need a 400GbE NIC on this CPU in order to alleviate a bottleneck, as we did with the AMD EPYC 9965.
Next, let us talk about power consumption.
How significant do you think the 8 vs. 12 channel memory controller will be for the target audience?
Lots of vCPUs for cloud-scale virtualization is all well and good as long as you aren’t ending up limited by running out of RAM before you run out of vCPUs, or needing to offer really awkward ‘salvage’ configs that either give people more vCPUs than they actually need (because you’ve still got more to allocate after you’ve run out of RAM) or compute-only VMs with barely any RAM and whatever extra cores you have on hand, or just paying a premium for the densest DIMMs going.
Is actual customer demand, in terms of VM configuration and best per-GB DIMM pricing, reasonably well aligned with 192 cores/8 channels; or is this a case where a lot of otherwise interested customers are going to go with Intel or AMD many-core parts just because their memory controllers are bigger?
You’ve gotta love the STH review. It’s very fair and balanced taking into account real market forces. I’m so sick of consumer sites just saying moar cores fast brrr brrr. Thank you STH team for knowing it isn’t just about core counts.
I’m curious what real pricing is on AMD and Intel now. I don’t think their published lists are useful.