This week we got some significant updates for the server roadmaps of Intel, AMD, and NVIDIA, alongside Ampere’s recent AmpereOne update. At the same time, there appears to be massive confusion about the difference between an announced spec, a product launch, and product availability. Even with all of that, we know that in 12 months, a 32-40 core server chip is going to feel small, and no longer mid-range.
Server Core Counts Going Supernova by Q1 2025
Here is a quick summary of the announcements:
- Intel:
  - P-core: Emerald Rapids, up to 64 cores, today
  - P-core: Granite Rapids-AP, up to 128 cores, Q3 2024 (Intel Innovation 2024?)
  - P-core: Granite Rapids-SP, up to 86 cores, Q1 2025
  - E-core: Sierra Forest-SP, up to 144 cores, today
  - E-core: Sierra Forest-AP, up to 288 cores, Q4 2024/Q1 2025
- AMD:
  - P-core: Genoa/Genoa-X, up to 96 cores, today
  - Pc-core: Bergamo, 128 cores, today
  - Pc-core: Siena, 64 cores, today, single socket
  - P-core: Turin, 128 cores, announced for Q4 (SC24?)
  - Pc-core: Turin Dense, 192 cores, announced (also Q4?)
- NVIDIA:
  - P-core: Grace in 2024 with 72 Arm cores, or 144 with the dual-chip Grace Superchip module. For our charts we are going to use 144 cores, but that is 72 x 2 in a module, so it might be more accurate to call it 72 cores.
  - P-core: Vera in 2026. NVIDIA did not give a core count or architecture details, but I would bet on more than 72 cores per CPU and an Arm architecture.
- Ampere:
  - 128-core Ampere Altra Max, 2021
  - 192-core AmpereOne, 2024. We saw a chip in 2023, but neither STH nor Phoronix has the parts, so we are going to call these 2024 parts under the assumption they will be made available publicly this year.
  - 256-core, 12-channel DDR5 AmpereOne, 2025
One can argue that the Grace Superchip is a 144-core CPU split across two chips on a single module. If you count it that way, four vendors in the market have now reached 128 or more cores per socket with publicly available parts. At this point, we have seen enough of the Grace Superchip that we will call it public, and we first saw AmpereOne in person a year ago.
So here is what the announcement pace looks like from 2010-2025, assuming Turin Dense is announced in the November timeframe and the 256-core AmpereOne arrives in 2025. Note that NVIDIA and AMD have not announced parts for 2025 yet.
Over the past week, a lot of folks have been confusing AMD EPYC Turin's announced specs with shipping parts. AMD's gap between product launch and chip availability tends to be longer than Intel's, though nowhere near an FPGA vendor's lag. Genoa, for instance, was released in November 2022, but it really started shipping in decent volume in 2023.
What is really interesting here is that the P-core crown will likely pass from AMD to Intel in Q3 2024. Our best guess is September for Intel Innovation, but the materials just say Q3. Granite Rapids-AP will bring new technologies beyond just the P-cores, but it will have 128 P-cores while AMD EPYC Genoa tops out at 96. Given features like MCR DIMM support, higher TDP, and more acceleration, we would expect it to be in a different performance class than the 128-core AMD EPYC Bergamo.
Here is what happens when we fill in the timeline with the max available dual socket core count by year.
In Q4 2024 (our best guess is November, given Supercomputing 2024), AMD will have its 128-core P-core Turin part out, and we expect at least more detail on the 192-core Turin Dense. We do not expect Turin Dense to be out before Q4 2024. Also, as we saw with the AMD EPYC Genoa launch just before SC22, and then with Bergamo last summer, there tends to be a gap between AMD's launch and availability. Again, it is generally a much shorter window than on AMD's embedded side.
For quite some time, we had heard that Intel was going to release chips in Q4 2024, but now we have a Q1 2025 date on the Computex slide. If AMD launches Turin Dense in November 2024, there is a decent chance Intel will launch Sierra Forest-AP at 288 cores in the same quarter. If that happens, both parts may ship in ramped volumes in 2025 instead of 2024. Often, executives at chip companies have bonus structures based on key metrics like launching certain chips by the end of the year. A well-known practice is “Launching” in late November or December, even if volume shipments start the following year.
To see why we say it is getting exciting, let us take a moment to look at the historical trendlines. Even just looking at Intel's P-cores, the trendline would have put the count around 80 in 2024, but it will be at 128 next quarter. AMD's P-cores seem to be on their trendline. Where it gets interesting is what we are calling the AMD Pc cores (Zen 4c, Zen 5c); those core counts go way above the AMD trendline. Likewise, Intel's E-core counts are nowhere near its P-core trendline.
Final Words
Given all of this, sometime around Q4 2024 to Q1 2025, we are going to have AMD with 128-core and 192-core parts; Intel with lines topping out at 86, 128, 144, and 288 cores; Ampere with 192 to 256 maximum cores; and NVIDIA with 72 cores (144 in a Grace Superchip module).
As that happens, systems with 24, 32, or even 64 cores are going to have half or less of the core counts of modern server architectures. As we move to new architectures, we expect core counts to increase further.
Get ready for core counts to go supernova over the next year or two.
One problem I see is that RAM speeds are stalling somewhat, as usual. It took a very long time to go from DDR4 to DDR5. MCR DIMMs might help, and CXL memory as well, but currently it is more and more cores sharing the same bandwidth.
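To put a rough number on that squeeze, here is a minimal back-of-the-envelope sketch; the platform configurations are illustrative assumptions, not measured systems:

```python
# Rough peak memory bandwidth: channels * 8 bytes/transfer * MT/s.
# The example platforms below are illustrative assumptions, not real SKUs.
def peak_bw_gbs(channels: int, mt_per_s: int) -> float:
    return channels * 8 * mt_per_s / 1000  # decimal GB/s

platforms = [
    ("8ch DDR4-3200, 64 cores", peak_bw_gbs(8, 3200), 64),
    ("12ch DDR5-5600, 128 cores", peak_bw_gbs(12, 5600), 128),
    ("12ch DDR5-6400, 192 cores", peak_bw_gbs(12, 6400), 192),
]

for name, bw, cores in platforms:
    print(f"{name}: {bw:.0f} GB/s total, {bw / cores:.2f} GB/s per core")
```

Even moving to a 12-channel DDR5 platform, the per-core share barely moves once the core count doubles or triples.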
I do not see this trend continuing much beyond 2025, which is pretty likely to be a bit of an inflection point, as cooling physics will force a plateau. The low-power chips may continue a bit further for niche efficiency-focused SKUs, but not by much either.
In that context, it would be interesting to superimpose this with rising TDPs. It was not so long ago that a 165W, 22-core Broadwell was a "high power" SKU.
These days we have high-power "mainstream" SKUs at 400W and going to 500W+. This is at the limit of what air cooling in 1U can do. Even 2U is challenging.
IMO, putting aside custom stuff for hyperscalers and HPC, this is pretty much guaranteed to stop not far beyond 500W/socket for practical cooling reasons. For the same reasons, IBM Power has not gone much beyond 500W for about a decade now.
These changes indicate that the new standard power density per rack is going to be around 20kW (roughly forty single-socket 1U nodes at ~500W each) just to allow racks to be fully populated with single-socket systems.
Which poses the question of where the ideal balance for density lies. Hopefully STH can add performance per watt into the standard testing as well to help support these decisions, for example the total energy consumed for a fixed-problem benchmark.
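As a sketch of how that metric could be reported: energy-to-solution for a fixed-size job is just average power times runtime, so a bigger, hotter chip can still win on efficiency if it finishes enough faster. The numbers below are hypothetical placeholders, not measurements:

```python
# Energy-to-solution for a fixed-size benchmark: E = average_power * runtime.
# Both systems below are hypothetical placeholders, not measured results.
def energy_wh(avg_power_w: float, runtime_s: float) -> float:
    return avg_power_w * runtime_s / 3600  # watt-hours

systems = {
    "128-core, 400W part": energy_wh(400, 900),   # finishes the fixed job in 900 s
    "64-core, 250W part": energy_wh(250, 1700),   # same job, 1700 s
}
for name, wh in systems.items():
    print(f"{name}: {wh:.0f} Wh to solution")
```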
Nvidia just showed off a 15,000 watt single board (HGX B200), so nothing from AMD or Intel that is "only" pulling 400 watts per socket and only running 2P per board is going to present a challenge versus what Nvidia has done.
Nvidia shows that if you have enough money, the limits are sky high – so much higher than anyone is doing with CPU cores today.
From my point of view, these high core count systems necessitate a new review and analysis strategy that focuses on processor scaling.
I would be interested in how memory bandwidth scales with the number of CPU cores. Somewhat paradoxically, total bandwidth typically decreases as more cores become memory bound; RAM has limits on the number of open pages and starts thrashing as more processors try to read and write.
I'd also be interested in the standard HPCG and HPL benchmarks (with emphasis on HPCG) to determine multi-threaded performance on traditional numerical tasks. Again, a scaling analysis with a varying number of cores devoted to the task and the others idle would be useful; a rough sketch of such a sweep follows.
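A minimal sketch of that kind of sweep, assuming a compiled OpenMP STREAM binary at ./stream; the same loop pattern would apply to an HPCG or HPL run by swapping the command and the result parsing:

```python
# Sweep thread counts against the OpenMP STREAM binary and report Triad bandwidth.
# The "./stream" path and the thread counts are assumptions for illustration.
import os
import re
import subprocess

for threads in (1, 2, 4, 8, 16, 32, 64, 128):
    env = dict(os.environ, OMP_NUM_THREADS=str(threads), OMP_PROC_BIND="spread")
    out = subprocess.run(["./stream"], env=env, capture_output=True, text=True).stdout
    match = re.search(r"Triad:\s+([\d.]+)", out)  # best Triad rate in MB/s
    if match:
        mbs = float(match.group(1))
        print(f"{threads:>4} threads: {mbs / 1000:8.1f} GB/s total, "
              f"{mbs / 1000 / threads:6.2f} GB/s per thread")
```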
I feel there should be a test related to PCIe throughput. Maybe one could install multiple GPUs and simultaneously copy data between host and device memory. Throughput to pools of NVMe SSDs would also be interesting, especially ones built on dm-crypt encrypted block devices.
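For the NVMe side, a hedged sketch of what that could look like with fio; the device paths and job parameters are assumptions, and pointing the filenames at dm-crypt mappings instead would exercise the encrypted path:

```python
# Measure aggregate sequential read throughput across a pool of NVMe devices with fio.
# Device paths, block size, and queue depth are assumptions; adjust for your system.
import json
import subprocess

devices = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]

result = subprocess.run(
    ["fio", "--name=seqread", "--rw=read", "--bs=1M", "--direct=1",
     "--ioengine=libaio", "--iodepth=32", "--runtime=30", "--time_based",
     "--group_reporting", "--output-format=json",
     "--filename=" + ":".join(devices)],  # fio separates multiple files with ':'
    capture_output=True, text=True, check=True)

job = json.loads(result.stdout)["jobs"][0]
print(f"Aggregate read: {job['read']['bw'] / (1024 * 1024):.2f} GiB/s")  # 'bw' is KiB/s
```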
The build speed of the Linux kernel is interesting, but for 256 cores one needs to test an entire build farm, for example a build-world type activity as seen in the BSDs and the various package build systems for Linux.
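For the simpler single-tree case, a sketch like this could at least show where a lone kernel build stops scaling before moving to a full build-farm style test; it assumes a kernel source tree in the current directory:

```python
# Time a kernel defconfig build at a few parallelism levels to find the scaling knee.
# Run from inside a kernel source tree; the job counts are assumptions for illustration.
import subprocess
import time

for jobs in (32, 64, 128, 256):
    subprocess.run(["make", "mrproper"], check=True, stdout=subprocess.DEVNULL)
    subprocess.run(["make", "defconfig"], check=True, stdout=subprocess.DEVNULL)
    start = time.monotonic()
    subprocess.run(["make", f"-j{jobs}"], check=True, stdout=subprocess.DEVNULL)
    print(f"-j{jobs}: {time.monotonic() - start:6.1f} s")
```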
Some sort of test of VM performance would also be important. For example, partition a machine into 8-core pieces, then run the same benchmarks in the corresponding VMs. How much does running a memory bandwidth test in half the VMs affect the performance of HPCG in the others? For me, which benchmarks are run is not as important as how they affect each other when multiple copies run in separate virtual machines.
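Once both runs exist, the comparison itself reduces to a simple ratio; a sketch with placeholder numbers:

```python
# Noisy-neighbor interference: the same 8-core VM measured with the rest of the host
# idle versus with half the VMs running a memory bandwidth stressor.
# The HPCG scores below are hypothetical placeholders in GFLOP/s.
isolated_gflops = 42.0    # neighbors idle (placeholder)
colocated_gflops = 29.0   # half the VMs running STREAM (placeholder)

interference = 1 - colocated_gflops / isolated_gflops
print(f"Performance lost to noisy neighbors: {interference:.0%}")
```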
Energy efficiency when performing each type of task is also important.
My suspicion is that the ARM systems will show less degradation as more cores become busy, but that is just a guess. At any rate, adding some new tests to illustrate multi-core scaling would, in my opinion, add a lot to the combination of news and analysis that appears on this site.
Tons of cores is nice and fine, until you have to pay for Windows or VMware licensing.
So for those uses, these high core count servers are non-starters in many cases.
I don't think the things that stalled the Genoa release will affect the Turin release that much.
Genoa was stalled due to two things:
1. There was time needed to iron out all the quirks in the new platform – especially in regard to the DDR5 memory
2. Availability was artificially limited due to the need to wait for Sapphire Rapids, as Intel was sponsoring the marketing for the new generation of server platforms, and those funds were available in Q1
Turin is the drop-in replacement for Genoa. It might need just a firmware refresh to get things working. In the worst case, only a motherboard refresh should be needed.
Wow, I wonder if 16-core mobile processors are soon to come as well, what with the impending “supernova” of server processor core counts. It almost seems inevitable that laptops will go up as well. Who knows, in another ten years…32 cores in your high-end gaming laptop PCs? Wow, just WOW.
All we need next is for software to finally catch up. After decades of multicore, almost all of it is still single-core only, and the use cases for more cores remain running VMs and containers.
To the comment about paying for licenses: I use a flavour of Linux with LXC (with LXD in testing mode) for containers, and libvirt for VMs, 100% free to use, and it has been working really well in a production environment for a few years. I've almost fully switched over from VMs to containers, which are much lighter on resources. VMs are now only useful to me when I need hardware emulation; otherwise I find containers far easier to set up and manage, and much more efficient with resources. I'm not sure how many containers I could run on a 196-core system with 256GB RAM, but it would be in the hundreds depending on the applications running on them.
Intel Tech has a video interview with Supermicro. They are showing a Gaudi 3 server with dual Granite Rapids-AP and saying it will launch in September.
Why would it take 10 years to get 32 cores in a laptop? They already have 20 cores in laptops right now…
Linux is great – except when you NEED Windows VMs (none of them have great remote management or VM tools)… I support 3 different KVM/QEMU-based systems – their Windows guest tools all suck vs Hyper-V or VMware.
Containers are great – except when you need high availability and/or don’t want a workload to restart on host migration. This is often a requirement in my space.
Love how the nVidia brainwashing works
AMD puts X cores in a CPU and that is an X-core CPU.
Intel puts Y cores in a CPU and that is a Y-core CPU.
nVidia puts 0.5*Z cores in a CPU then puts two such CPUs on a freaking enormous carrier board and that is a Z-core CPU.
This is an example of your new “influencer” revenue model, Patrick?
@Gas, I've seen laptops of late with Intel CPUs containing 2 P-cores, some number of area-efficient cores, and 2 really, really area-efficient cores. Is that what you are talking about? Because if you think having three core types supporting non-identical instruction sets in the same CPU is a good idea, well, I have a bridge to sell you.
@hoohoo
The sort of cores I was thinking of when I commented were just the normal, full-sized type of x86-64 core, like the regular Zen cores found in your typical Ryzen CPU. Not whatever sort of mixed-architecture thing you are referring to and not Zen 4c. Just the normal core type found in any full-sized desktop processor.
@Gas
I’m specifically referring to the kind of cores typified by the full-capability Zen 4 or similar core, not Zen 4c or any sort of mixed-architecture thing. Why would it take ten years, you ask? I don’t know. I was guessing based on speculations of what I might be able to buy in 6 to 10 years from now. I know it probably won’t be literally ten years before we see 32-core Zen-whatever laptops. My guesses are conservative and are bound to be quite off, timelines-wise.