We recently covered the performance and architecture of the 192-core Ampere AmpereOne. A few days ago, we showed the AmpereOne A192-32X, a 192-core Arm server CPU. We were told that there would be new information in the Hot Chips 2024 talk, so let us get to it.
Please excuse typos, as this is being written live.
Ampere AmpereOne at Hot Chips 2024
Here is the company’s roadmap. The current AmpereOne has up to 192 cores and 8-channel memory. There will be a 12-channel DDR5, 192-core, 5nm part shipping next quarter, followed by a 256-core 3nm part next year.
Here is the AmpereOne core.
AmpereOne has a new branch prediction engine. Something interesting here is that Ampere has common cloud workloads from its customers, and it uses them to size features in its processors.
There are eight schedulers feeding 12 execution pipes. Ampere has floating point/vector execution pipelines, but the core is designed more for the integer workloads common in the cloud.
Here are the specs on the load store unit. A big feature is the memory tagging solution that can be used in production environments.
All entries are universal in the TLBs. There are not different classes of TLBs in Ampere’s architecture.
Instead of having a large L3 cache, AmpereOne has a larger 2MB L2 cache. These L2 caches are private, which helps keep cloud tenants isolated from one another.
On the SoC side, Ampere has compute, memory, and PCIe subsystems. The PCIe and memory controllers are TSMC 7nm. Connecting these dies together is a die-to-die interconnect with up to 2.8TB/s of bandwidth. Ampere can scale up to 12-channel DDR5 using the same platform. Ampere can also integrate customer IP.
The main AmpereOne compute chiplet is built on TSMC 5nm. Each core cluster is a group of four Ampere custom cores. There are also 64 distributed coherency engines each with 1MB of L3 cache. That is 64MB L3 combined. On the east and west sides, there are die-to-die interconnects.
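The cache and cluster figures above can be sanity-checked with some quick arithmetic. This is only a back-of-the-envelope sketch: the constant names are ours, and treating the 2MB L2 as a per-core figure is an assumption based on the caches being described as private.

```python
# Figures quoted from the talk coverage; the names are illustrative, not Ampere's.
CORES = 192                # top AmpereOne core count
CORES_PER_CLUSTER = 4      # each core cluster groups four custom cores
L2_PER_CORE_MB = 2         # private L2 (ASSUMPTION: 2MB applies per core)
COHERENCY_ENGINES = 64     # distributed coherency engines
L3_PER_ENGINE_MB = 1       # L3 slice attached to each engine

clusters = CORES // CORES_PER_CLUSTER
total_l2_mb = CORES * L2_PER_CORE_MB
total_l3_mb = COHERENCY_ENGINES * L3_PER_ENGINE_MB

print(f"core clusters:        {clusters}")
print(f"aggregate private L2: {total_l2_mb} MB")
print(f"aggregate L3:         {total_l3_mb} MB")
```

Under that assumption, the aggregate private L2 is several times larger than the 64MB of L3, which lines up with the design choice of favoring L2 over a large shared L3.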
The compute chiplet is connected to the MCU (memory) and PCIe I/O dies. The memory die has two channels of DDR5. The PCIe I/O die has 32 lanes of PCIe Gen5. Four PCIe dies on the package means there are 128 PCIe Gen5 lanes.
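The same kind of arithmetic applies to the I/O dies. The sketch below assumes every memory die contributes exactly two DDR5 channels and every PCIe die exactly 32 lanes, as described above; the function names are hypothetical, and the memory die counts for the 8-channel and 12-channel parts are derived here rather than quoted from the talk.

```python
# Per-die figures quoted in the talk coverage.
LANES_PER_PCIE_DIE = 32
DDR5_CHANNELS_PER_MEMORY_DIE = 2


def pcie_lanes(pcie_dies: int) -> int:
    """Total PCIe Gen5 lanes for a package with the given number of PCIe dies."""
    return pcie_dies * LANES_PER_PCIE_DIE


def memory_dies_needed(channels: int) -> int:
    """Memory dies needed for a channel count, rounding up to whole dies."""
    return -(-channels // DDR5_CHANNELS_PER_MEMORY_DIE)  # ceiling division


print(pcie_lanes(4))            # four PCIe dies on the package
print(memory_dies_needed(8))    # current 8-channel part
print(memory_dies_needed(12))   # upcoming 12-channel part
```

If those assumptions hold, moving from 8 to 12 channels is a matter of adding memory dies around the same compute chiplet, which fits the point about scaling on the same platform.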
Memory tagging can help find software errors and also help mitigate buffer overflow attacks.
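To make the idea concrete, here is a toy software model of memory tagging. It is loosely modeled on the general approach of Arm's Memory Tagging Extension (4-bit tags on 16-byte granules); whether AmpereOne's hardware matches those details is an assumption, and the `TaggedMemory` class and its methods are invented purely for illustration.

```python
GRANULE = 16   # bytes covered by one tag (ASSUMPTION: MTE-style granule size)
TAG_MASK = 0xF # 4-bit tags


class TagMismatch(Exception):
    """Raised when a pointer's tag does not match the memory it touches."""


class TaggedMemory:
    def __init__(self, size: int):
        self.data = bytearray(size)
        self.tags = [0] * ((size + GRANULE - 1) // GRANULE)

    def alloc(self, addr: int, length: int, tag: int):
        """Tag the granules covering [addr, addr + length) and return a tagged pointer."""
        tag &= TAG_MASK
        for g in range(addr // GRANULE, (addr + length - 1) // GRANULE + 1):
            self.tags[g] = tag
        return (tag, addr)

    def _check(self, ptr, offset: int) -> int:
        tag, base = ptr
        addr = base + offset
        if self.tags[addr // GRANULE] != tag:
            raise TagMismatch(f"pointer tag {tag:#x} != memory tag "
                              f"{self.tags[addr // GRANULE]:#x} at address {addr}")
        return addr

    def store(self, ptr, offset: int, value: int) -> None:
        self.data[self._check(ptr, offset)] = value

    def load(self, ptr, offset: int) -> int:
        return self.data[self._check(ptr, offset)]


mem = TaggedMemory(64)
buf_a = mem.alloc(0, 16, tag=0x3)    # first 16-byte buffer
buf_b = mem.alloc(16, 16, tag=0x9)   # neighboring buffer with a different tag

mem.store(buf_a, 15, 0xFF)           # last valid byte of buf_a: allowed
try:
    mem.store(buf_a, 16, 0xFF)       # one byte past the end, into buf_b
except TagMismatch as err:
    print("overflow caught:", err)
```

The overflowing write is rejected because the pointer still carries the tag of its own buffer while the neighboring memory carries a different one. That is the mechanism behind both the bug-finding and the buffer overflow mitigation benefits.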
AmpereOne has adaptive traffic management on the SoC to minimize noisy neighbor effects in cloud CPUs.
We have shown these before, but here are Ampere’s performance figures.
Here is another slide. We talked through these in Ampere AmpereOne 192 Core Performance Outlined a few weeks ago.
Ampere supports common AI frameworks out of the box.
Here is Ampere AI inference performance.
This slide talks about the ecosystem. Ampere is based on Arm, so there is a lot more support. We talked about this a few days ago in our hands-on piece, This is Ampere AmpereOne A192-32X a 192 Core Arm Server CPU.
It is much easier to use an Arm server today than it was eight years ago. It has been cool to see.
Final Words
Of course, stay tuned for our AmpereOne hands-on series on STH. We have already shown the parts, and you can see two of the PCIe and memory I/O dies in the photo below. We are going to have a hands-on piece soon, but for now I am here at Hot Chips.
Supplement on Substack
Since we often get requests when we take chip photographs, we have high-resolution images for the chip shots you see here (and more) available to our paid Substack subscribers.