Arm Neoverse N1 and Architectural Details
I wanted to start this discussion with my absolute least favorite and also my favorite slide from the two-day event. The Arm Neoverse roadmap.
Arm is committing to new cores every year here, following the company’s consumer strategy. Here is the issue: the enterprise market is different from the consumer market. Case in point: if you go into the average tech company, the phones in your meeting will all be 0-3 years old. In the data center and enterprise space, 3-year-old servers are still perfectly good machines, and these days companies are looking at five-year-old servers (Ivy Bridge/ Haswell-EP) and thinking they may still have a few years left.
A one-year product cycle for the Arm Neoverse has implications. We are talking about the Arm Neoverse N1 architecture here. Realistically, in late 2019 or early 2020 we will start to see volume products in the market based on Neoverse N1. By then, we will start talking about the Zeus platform. Essentially, in 2020 you may be able to buy N1 Ares, but you will want Zeus.
Arm may be setting itself up for an annual Osborne effect. On the consumer side, Arm can do this. Consumers break phones, have contractual incentives to get new phones via service agreements, and pass older phones along as hand-me-downs. On the server side, there are a lot of organizations that are perfectly happy to squeeze another quarter from CapEx assets. Intel saw this firsthand in 2018 when it raised prices with Intel Xeon Scalable and had to create an Intel Xeon E5-2600 V4 “shortage” to push customers to new platforms. If you were thinking of buying an Arm server today to deploy, the message is to wait a few quarters and they will be 60% faster. If you can wait two years, the roadmap says you should get roughly 2.08x the performance (about 108% faster). When companies are extending lifecycles of servers 1-2 years, promising an annual cadence of 30-60% gains is not how you get people to invest today.
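The roadmap arithmetic above can be checked with a short calculation. This is a minimal sketch that assumes the quoted gains compound multiplicatively, with a 60% step followed by a 30% step. The function name and the assignment of each gain to a particular generation are illustrative assumptions, not figures taken from Arm’s slides.

```python
def compound_speedup(gains):
    """Multiply per-generation fractional gains into one overall speedup factor."""
    factor = 1.0
    for gain in gains:
        factor *= 1.0 + gain
    return factor

# Assumed sequence: a 60% gain for the first step, then 30% for the next.
one_step = compound_speedup([0.60])
two_steps = compound_speedup([0.60, 0.30])

print(f"one generation:  {one_step:.2f}x ({(one_step - 1) * 100:.0f}% faster)")
print(f"two generations: {two_steps:.2f}x ({(two_steps - 1) * 100:.0f}% faster)")
```

Under those assumptions, the two-generation figure works out to about 2.08 times the starting performance, which is the same thing as roughly 108% faster.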
A counterpoint, and one we appreciate, is that Arm needs to show it has a unified roadmap before customers will invest and commit their earmarked x86 spend to Arm Neoverse. If we end up in a world in five years where every ounce of performance from servers is needed, then this makes perfect sense. With 5G, that may be the case.
Time will tell, but if you look at average CPU utilization in servers, there is a good chance that we are using less than 50% of deployed capacity worldwide. That underutilization problem throughout the industry is well known. Companies can hold off on upgrades because excess capacity exists in great quantities. At our Dell EMC PowerEdge MX launch coverage (read our exclusive PowerEdge MX Review for more on that platform), Dell EMC, the world’s largest server vendor, acknowledged that we have >50% stranded capacity.
An annual Osborne effect is something that Arm risks with its aggressive roadmap, but it also needs to press investment in the space. It is a double-edged sword.
With that out of the way, we wanted to take a look at the expected performance of the Arm Neoverse N1 platform. Arm is using a 64-core hyperscale reference design with a theoretical 105W TDP. We asked if this was server TDP or CPU TDP at the workshop. We were told it is CPU TDP.
We are going to go into the benchmarks after the architecture, but Arm is promising ~2x gains in many benchmarks compared to Cortex-A72.
We are going to go into the Arm Neoverse N1 microarchitecture next. However, this illustrates the point: why would you buy today’s eMAG or ThunderX2 for anything other than a development/ proof-of-concept cluster when you can wait twelve months and get 2x the performance, and potentially more, from an Arm server?
Arm Neoverse N1 Microarchitecture
Of the two cores being launched today, the Arm Neoverse N1 is the higher frequency part that is designed to compete in the mainstream CPU market. You may have previously read about the Arm Neoverse N1 platform when it was codenamed “Ares.”
The Arm Neoverse N1 is tangentially targeted at enterprise workloads. When the company talks about the N1 it is not to discuss running Oracle EBS. Instead, it is to run today’s open source cloud workloads at hyperscale customers and in edge aggregation compute devices.
Underpinning the Neoverse N1 is Armv8.2 compatibility and a heavy focus on cache hierarchy which we will see soon.
The actual pipeline is an 11-stage pipeline that we are told can collapse into a 9-stage pipeline with predict/ fetch pipeline overlap. Features like dual native quadword AdvSIMD datapaths are set to make Arm server cores faster clock for clock. Building a deep and wide pipeline is key to keeping higher frequency cores fed.
Cache design is focused on improving branch-heavy workloads. There are large 64KB instruction and data caches. Beyond those caches are larger 512KB or 1MB L2 caches. Perhaps one of the more interesting features is that the Arm Neoverse N1 64-core CPUs are targeted to have 128MB of system level cache. For some context, AMD EPYC parts generally have 64MB of L3 cache save for a few 32MB parts.
Part of what Arm Neoverse N1 is doing is heavy branch prediction to keep its caches and pipelines full. Arm is using this more aggressive engine to increase utilization. We asked about speculative execution and Arm said that it has been doing work on specific cases and believes the Neoverse N1 microarchitecture plus software will make mitigations performance neutral.
Arm is also adding enhancements for virtual machine context switching since, with 64 cores, VMs will certainly be a target workload. It helps that KVM support on AArch64 is solid and that VMware has announced 64-bit Arm support for ESXi.
One example of how the branch predictor will help: PHP is known to stress the indirect branch predictor on modern CPUs. Arm is developing its solution for hyperscalers and for workloads like running PHP applications.
An entire slide and discussion was dedicated to the cache hierarchy. If you are buying servers, the particulars may not be of interest, but the three main points were large, high bandwidth, and low latency caches to maintain performance. The L1 caches are certainly large, with the L2 caches being about the size we see on this generation of AMD EPYC “Naples” and Intel Xeon Scalable (Skylake-SP).
The coherent L1 cache is a big deal for Arm. Technically, the Arm ISA allows for software coherency. In larger scale systems, software mechanisms swamp performance, so hardware coherency is required.
On the performance side, Arm is claiming a substantial floating point and integer performance bump. For AI inferencing, that includes INT8 compute support that Arm sees as a way to give free machine learning capability to hyperscalers using underutilized CPUs.
In terms of power efficiency, the core architecture is targeted at the 2.6 to 3.1GHz range. We were told that the 3.1GHz part would be a higher voltage part. For some context, 3.0GHz is the maximum clock speed of Skylake-SP’s low-end Intel Xeon Silver 4100 series parts. At the same time, Intel’s CPUs throttle performance back to lower clock speeds when all cores are used, and the current Intel Xeon Silver 4116 will throttle below 3.0GHz even with only 12 cores.
With a figure of 1-1.8W per core including L2, a 64-core part lands in the 64-115W range for cores and L2 alone. Arm is also touting the space savings from the Arm Neoverse N1 cores on 7nm. Note here that a lot more goes on in a CPU than simply the core and the L2 caches. We were quoted a die of about 400mm² for 64 cores.
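The power range quoted above is a straight multiplication, and the same approach gives a rough sense of how much die area there is per core. This is a minimal sketch using only the numbers quoted in the paragraph. The constant and function names are illustrative, and the area figure is the whole die divided by the core count, not a measurement of core size.

```python
CORES = 64
WATTS_PER_CORE_LOW = 1.0    # core + L2, low end of the quoted range
WATTS_PER_CORE_HIGH = 1.8   # core + L2, high end of the quoted range
DIE_AREA_MM2 = 400.0        # "about 400" as quoted, so treat as approximate

def core_power_range(cores, low_w, high_w):
    """Return (low, high) total watts for the cores and their L2 caches only."""
    return cores * low_w, cores * high_w

low, high = core_power_range(CORES, WATTS_PER_CORE_LOW, WATTS_PER_CORE_HIGH)

# Whole-die area divided by core count. This includes the mesh, system level
# cache, memory controllers, and I/O, so it overstates the core itself.
area_share = DIE_AREA_MM2 / CORES

print(f"cores + L2 power: {low:.1f}W to {high:.1f}W")
print(f"die area per core (all-in): {area_share:.2f} mm^2")
```

The gap between these core-only numbers and the full package is the uncore: fabric, system level cache, memory controllers, and I/O all draw power and take area as well.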
Arm Neoverse N1 Platform Features
Part of Arm Neoverse N1’s value proposition is power/ performance benefits over x86. Part of that will be an early transition to TSMC 7nm. Another part of that is on-chip power management. Arm Neoverse N1 is set to manage higher power threads and keep the hardware running efficiently. One example is that the AMU can prioritize threads that are frequency limited versus threads that are throughput limited, distributing SoC power accordingly.
While the Intel x86 world has been progressively getting better virtual machine support for over a decade, Arm is forging ahead with improvements of its own. We are going to show this a bit more in the performance section later in this article.
Modern servers have RAS features built-in. Neoverse N1 has Armv8.2 RAS features including:
- Write-once error-handler software – works across cores
- Seamless kernel support with ServerReady compliance
- No errors exposed until/ unless poisoned data is consumed
- Single-error-correct, double-error-detect (SECDED) ECC in writeable caches
There is a certain level of RAS features customers expect, and Arm thinks it meets or exceeds requirements.
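To make the last item in the list concrete, here is a minimal sketch of single-error-correct, double-error-detect behavior using a textbook extended Hamming code over 4 data bits. This is not Arm’s cache implementation, which protects much wider words in hardware. The function names and the 8-bit layout are assumptions chosen for illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Layout: index 0 is an overall parity bit, indices 1, 2, 4 are Hamming
    parity bits, and indices 3, 5, 6, 7 hold the data bits.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2  # makes the parity of all 8 bits even
    return word

def decode(word):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos      # XOR of the positions of all set bits
    overall = sum(word) % 2      # 0 when total parity is still even
    if syndrome == 0 and overall == 0:
        status = "clean"
    elif overall == 1:
        # Odd total parity means exactly one flipped bit; the syndrome is its
        # position (0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even total parity with a nonzero syndrome means two bits flipped.
        return None, "uncorrectable"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status

codeword = encode(0b1011)
single = list(codeword)
single[6] ^= 1
double = list(single)
double[2] ^= 1
print(decode(codeword), decode(single), decode(double))
```

The uncorrectable case is where the poisoning behavior in the list matters: rather than raising an error immediately, the data can be marked bad and an error surfaced only if something consumes it.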
Arm also mentions that it has software profiling tools to help customers tune their workloads for its CPUs. Also, Arm is able to use these tools to see where hotspots are in its design and refine its core design to address workloads.
Now that we have discussed the cores, the next step is putting the cores together to make a larger system.
Arm Neoverse N1 Scaling
A major undertaking in modern architectures is the fabric connecting cores. Part of the Neoverse N1 platform is that this fabric is a high bandwidth, low latency mesh which allows for many cores and moves data to where it needs to go.
With support for technologies like PCIe Gen4 and CCIX, a large 128MB system level cache, and 128+ cores, the Arm Neoverse N1 needs a fabric to help tie everything together. With Cortex-A72, Arm utilized a ring architecture. As core counts increase, it is moving to a mesh architecture. In 2017, Intel also switched its fabric from a ring to a mesh, which you can learn about in our piece: The New Intel Mesh Interconnect Architecture and Platform Implications.
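One way to see why higher core counts push designs from a ring to a mesh is to compare average hop counts between nodes. The sketch below is a simplified model, not a description of Arm’s fabric: it assumes one node per core, a bidirectional ring, a plain 2D mesh with shortest-path routing, and it ignores link width, contention, and the fact that real designs may attach several cores to a single mesh stop. The function names and grid shapes are illustrative.

```python
from itertools import product

def ring_avg_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring."""
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)

def mesh_avg_hops(rows, cols):
    """Average Manhattan distance between distinct nodes on a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1), (r2, c2) in product(nodes, nodes))
    ordered_pairs = len(nodes) * (len(nodes) - 1)
    return total / ordered_pairs

# Hypothetical node counts and grid shapes chosen for illustration.
for n, (rows, cols) in [(16, (4, 4)), (64, (8, 8)), (128, (8, 16))]:
    print(f"{n:>3} nodes: ring {ring_avg_hops(n):5.2f} hops, "
          f"{rows}x{cols} mesh {mesh_avg_hops(rows, cols):5.2f} hops")
```

In this model the ring’s average distance grows linearly with node count, while the mesh grows roughly with the square root, which is the basic scaling argument for a mesh at 64 cores and above.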
From a scalability perspective, with features like SmartNIC support, chiplet support, and CCIX, Arm is stating that the Neoverse N1 is scalable. It believes that 64 cores will be the hyperscaler sweet spot in this generation. 128 cores may be a maximum, but the technologies are there to go further.
In terms of segmentation, Arm has a number of segments it is targeting. The Edge Compute and Network-Storage Security segments are targeted at the Intel Xeon D and AMD EPYC 3000 series. The Hyperscale Datacenter segment is aimed at mainstream Intel Xeon and AMD EPYC 7000.
Next, we are going to look at the state of the Arm Neoverse N1 and the development testbed. Arm Neoverse N1 silicon exists.
This was a great long read.
STH is now like a mix of the technical side of Anandtech, the business side of TNP, and adding in its own mix of hands on experience working with this hw. I can’t wait for your N1 review
This article took me 2 hours to read this morning between meetings and tea. Great read STH.
Amazing article!
Arm is set to dominate the EDGE, I don’t really see how Intel hopes to gain any market share with the power draw of the x86 ecosystem. Given how much money they can put into R&D, we should expect to see something from them … and the Big.Little using Atom little cores doesn’t sound like the right approach
M4r1k I don’t know if Arm really has a power consumption advantage when it comes to bigger chips.
https://www.servethehome.com/updated-cavium-thunderx2-power-consumption-results/
Both STH and AT had crazy high power for the ThunderX1 too.
Maybe in mobile they’re way better, but when they’re trying to get feature parity they haven’t proven to use less than modern x86.
How much is the motherboard with an Arm Neoverse N1 8 or 16 cores cpu?
Risky89 – Arm Neoverse N1 CPUs will be coming out in a few quarters. The development board with the Neoverse N1 CPU is a low production unit that is primarily going to companies that are building chips.
thanks Patrick