Key Lessons Learned: Market Impact
Let us get out in front of this and put it out there: while the Intel Xeon 6780E may have the most physical cores of any commercially available CPU right now (we still do not have the 192-core AmpereOne, so we assume it is a part for select hyperscale customers only), Sierra Forest is not the highest-performance chip. It is not even that close, if we are being real about it. What it is, is the best chip for a surprisingly large segment of the market that is completely non-sexy. Let us take a look at some of the competition.
Intel Xeon 6700E versus Ampere Altra Max
When we first looked at the Ampere Altra Max M128-30 in 2021, it was a novel part offering 128 physical cores without SMT and an Arm architecture. Until the AMD EPYC Bergamo launch in 2023, it was still the highest-physical-core-count CPU you could get, with the caveat that it required a change from x86 to Arm.
AmpereOne may have 192 cores, and we have seen plenty of systems for the part, but OEMs are not pushing it, and we do not have a sample yet. It is hard to consider that a competitor. Now, with fewer cores at a 250W TDP, there are going to be two reasons to buy Ampere Altra Max:
- One is that pricing may end up being attractive when the dust settles. Ampere Altra (Max) is still using DDR4 and PCIe Gen4, so system costs can be lower, volume considerations aside. The flip side is that you do not get things like DDR5, PCIe Gen5, and onboard accelerators for things like QAT. Now that Ubuntu 24.04 server supports Intel QAT, a QAT-accelerated nginx server gets an extra boost on Intel's solution. Remember, that is a prime application for Ampere.
- The second is that you want to develop on Arm and do not want the power and cost increase that comes with the NVIDIA Grace Superchip. NVIDIA's chip is fast, but NVIDIA is all about high-margin parts these days. There is actually a lot of value in a lower-cost Arm platform, but that is a different value proposition than trying to be the cloud-native data center option.
For those buying smaller numbers of servers, an E-core x86 will be the better option since things will just work out of the box, meaning that Intel can justify value at a higher cost. Intel can also support 400GbE single-slot networking with the more modern platform. AmpereOne needs to be here… yesterday.
Intel Xeon 6700E versus 5th Gen Intel Xeon “Emerald Rapids”
Here is a hot take: Xeon 6 is out, but a lot of folks should still buy 5th Gen Xeon Emerald Rapids.
The E-core-only CPU concept is going to be very hard for some folks to grasp. 144 E-cores are not 144 P-cores. They might be closer to 72 Emerald Rapids P-cores, and even then, it is not going to be a win in every area. If you are doing HPC, Sierra Forest is the wrong choice. If you are dealing with per-core licensing, you need to be on P-cores.
A somewhat terrible outcome for those still on Windows Server and VMware virtualization infrastructure is that you want P-cores if you are paying per core. Those on Linux virtualization who are either using free solutions, per-socket (unlimited cores), or per server (unlimited cores), will be able to use E-core CPUs and lower their TCO dramatically for a portion of their workloads.
It turns out that organizations spend a ton supporting applications with low-performance needs. E-cores help fix that.
Intel Xeon 6700E versus AMD EPYC 9004 “Bergamo”
I think this is the comparison a lot of folks will jump to when they look at Intel v. AMD. To me, it is really the wrong comparison. Sierra Forest is more of a "midrange socket" part. We know that the 288-core Sierra Forest-AP (presumably the Xeon 6900E) will have features like 12-channel memory as well. To us, that is the real comparison point. Just take a look at the CPUs, and it will make a bit of sense.
One of the fun things is that during our AMD EPYC "Bergamo" coverage, a theme that kept coming up was whether AMD has too much per-core performance with its Zen 4c architecture. If the idea is to move workloads that look like Intel Xeon E5 V4 or 1st/2nd Gen Intel Xeon Scalable VMs with 2-8 vCPUs running at 5-70% load, then Zen 4c is already too big for that. If you are truly taking perhaps half of your CPU compute infrastructure and running projects to lower the power and space used to host it, then having the new cores 3-40% loaded may achieve consolidation, but not to the level that is possible.
Here is the strange thing: if you are the type of organization that wants SMT off due to security posture, then Bergamo is a 128-core chip. At 144 cores, even the smaller Xeon 6700E can offer more cores at a lower TDP. Some will scream, "What about the performance?" but again, remember, a large double-digit percentage of all infrastructure is determined by the ability to place a number of VMs and their vCPUs onto hardware. 256 threads is more than 144, but without SMT that becomes a 128 v. 144 discussion.
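To make that placement math concrete, here is a minimal sketch, assuming hypothetical fixed-size VMs and a simple one-vCPU-per-physical-core policy when SMT is disabled (the numbers are illustrative, not a placement algorithm any specific cloud uses):

```python
# Illustrative sketch: with SMT disabled, each vCPU maps to one physical
# core, so VM placement capacity is simply cores // vCPUs-per-VM.

def vms_per_socket(physical_cores: int, vcpus_per_vm: int) -> int:
    """How many fixed-size VMs fit when one vCPU is pinned per core."""
    return physical_cores // vcpus_per_vm

# SMT off: Bergamo's 256 threads collapse to 128 cores, vs. 144 E-cores.
bergamo_no_smt = vms_per_socket(128, 8)  # 16 eight-vCPU VMs per socket
xeon_6700e = vms_per_socket(144, 8)      # 18 eight-vCPU VMs per socket
print(bergamo_no_smt, xeon_6700e)
```

Under these assumptions the 144-core part simply places more VMs per socket, which is the 128 v. 144 discussion in a nutshell.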
To me though, Sierra Forest Xeon 6700E is aimed downmarket of the AMD EPYC Bergamo. We are going to end up in a cycle where the Xeon 6900E will be competing against Bergamo and Turin Dense early next year, but Clearwater Forest will also compete against Turin Dense, and things should get really interesting.
Intel Xeon 6700E versus AMD EPYC 8004 “Siena”
The AMD EPYC 8004 "Siena" series might actually be the better comparison point if not for one fact: it is single-socket only. Even so, if you asked me what AMD's lower-power and lower-cost server option is, I would probably point you to Siena. We have done many platform reviews, and another Siena server review is coming, but this is an underappreciated platform for AMD. One of the big reasons for that is that it does not scale to two-socket servers.
There is an Intel Xeon 6731E, a 96-core part with a lot of PCIe Gen5 I/O and eight DDR5 memory channels. It also has built-in QAT acceleration. That is a chip to watch if we get good single-socket Xeon 6700 platforms. To put it another way, there are probably fewer than 50,000 websites in the world that could not be served by a single Intel Xeon 6731E, with OpenSSL QAT offload, as their front-end host. Add a second for redundancy. Of course, beyond the top 50,000, needs can grow exponentially, and many sites have huge back-ends, but that should give some sense of scale for the real performance needs of hosting websites.
As much as I love the AMD EPYC 8004 series, depending on the Xeon 6731E total system availability and pricing, that may be a category killer chip.
What Intel has here is something like a Siena: a lower-power chip, but with the ability to scale to more cores and threads, potentially at lower power figures. Maybe the biggest feature is scaling to two sockets, because someone will undoubtedly be reading this and thinking that for low-utilization VM hosting, it is cheaper to use one high-speed NIC and network uplink per 2P server than per 1P server.
Looking Ahead to Clearwater Forest
Clearwater Forest is slated to be the next-generation E-core part. Intel is putting a lot of engineering resources into its E-core parts, so we expect better performance.
The real question is Intel 18A. If Intel can deliver a leading process with 18A, and if the packaging does not eat up a lot of power (see Sapphire Rapids as an example), then Clearwater Forest is shaping up to be a beast of a CPU.
Something to keep in mind is that on a modern CPU, the actual CPU core logic is only a relatively small fraction of the overall die. A decade ago, I penned a piece titled Intel can hold ARM (largely) out of the datacenter for 3 years. The basic premise was that if Intel had process leadership, the part of the die that was not SRAM, PCIe controller, interconnect, and so forth was so small that the net impact of Arm cores versus Intel cores would be negated by the process delta. The crazy success of the Intel Atom C2000 series (even with the AVR54 bug) really highlighted this. Of course, with the slow transition to 10nm and beyond, Intel lost that process leadership, and soon Arm CPUs marched into the data center. Clearwater Forest is the chip that will be able to go after Arm-based offerings again with what could be a superior process technology. The difference now versus in 2014 is that to make the Intel Foundry model work, Intel will want to fab Arm CPUs too.
To me, Sierra Forest is not going to be the chip that achieves 20-40% of Xeon sales. This is more of a proof-of-concept part. Clearwater Forest is the chip that will start to move the needle.
Final Words
I am not sure whether, looking ahead five years, the market is 30% cloud-native processors or 70%. In either case, on the Intel side, Sierra Forest is the first offering Intel has in what is a growing market that will be huge in the coming years. Sierra Forest and the Intel Xeon 6700E series are Intel's first foray into cloud-native processors. While the focus is squarely on AI these days, and there is a school of thought that little else matters, one thing is clear: Sierra Forest and its kin are like the AI GPUs of traditional virtualized computing.
For years, the metric everyone looked at was total performance per chip. Realistically, not every workload can effectively scale to over 100 threads or cores. Ask a cloud provider, and they will tell you that 8 vCPUs and smaller VMs make up the majority of the VMs that their customers deploy. At the same time, we evaluate a chip based on how fast it can do one task, while chips are often being used to host a dozen or more customers simultaneously, all doing different tasks. That is really weird.
Sierra Forest is going to be the first chip designed with that reality in mind. There are millions? tens of millions? hundreds of millions? more? VMs out there running happily at 1%-70% utilization on anything from current-generation to N-3 CPUs. That infrastructure is taking up space, but more importantly, power. Sierra Forest is not trying to replace the HPC workloads or heavy AI workloads running at 70-100% CPU utilization 24×7. Instead, it is trying to appeal to the masses. Over the past two years, this has been the concept that most frequently gets me looked at as if I were speaking an alien language when going through Cloud Native Efficient Computing, which is why we have a series on it. There are a ton of applications that need cores, but not necessarily high performance per core. That is at odds with a desktop processor mentality, but it is the reality in servers.
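As a back-of-the-envelope sketch of the consolidation math behind this argument, here is a small model. Every number in it is hypothetical and chosen only to illustrate the arithmetic; the per-core performance ratio in particular is an assumption, not a benchmark result:

```python
# Hypothetical consolidation math: how many old, low-utilization hosts
# could fold onto one 144-core socket. All figures are illustrative.

def hosts_consolidated(old_cores: int, old_util: float,
                       new_cores: int, target_util: float,
                       perf_ratio: float = 1.0) -> int:
    """Old hosts absorbed per new socket, by average-utilization math.

    perf_ratio = assumed new-core throughput / old-core throughput.
    """
    old_work = old_cores * old_util               # busy old-core equivalents
    new_capacity = new_cores * target_util * perf_ratio
    return int(new_capacity // old_work)

# e.g. 20-core Xeon E5 v4 hosts averaging 15% CPU, consolidated onto a
# 144-core part run at 60%, assuming ~1.3x per-core throughput:
print(hosts_consolidated(20, 0.15, 144, 0.60, 1.3))  # 37 under these inputs
```

The point is not the specific ratio; it is that average utilization, not peak single-thread speed, dominates this kind of sizing.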
After using the chips, it is hard not to think that they are super cool. The bigger question is which is cooler: today's Intel Xeon 6700E series, the Xeon 6900E series a few months from now, or hopefully next year's Clearwater Forest? We know this is at least a step in the right direction, especially for those using legacy servers who can get almost silly consolidation ratios from these chips while fitting more closely into legacy server power envelopes.
What we can say is simply this: if you still have Xeon E5 servers or 1st/2nd Gen Intel Xeon Scalable virtualization or container hosts, Sierra Forest offers wild consolidation gains that will drive big power savings. Those power savings can be directly channeled into adding more AI servers, even if your traditional computing demands are slowly growing.
The 2S core-to-core latency image is too low-res to read unfortunately.
The discussion about needing more instances of the workload for a good kernel compile benchmark is exactly the weak-scaling vs. strong-scaling concept from the HPC world. It is a property of the application. If, when using N cores, the calculation finishes in 1/Nth the time of a single-threaded instance, it is said to have a strong scaling property. If, when using N cores, the problem size is also N times bigger and the calculation finishes in the same time as the reference problem on 1 core, it is said to have a weak scaling property. In both cases, perfect scaling means that using N cores is N times faster, but for weak scaling that only holds if you also increase the problem size.
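That distinction can be sketched in a few lines with an idealized model (no overheads, perfect scaling; the 120-second job time is a made-up stand-in for something like a kernel compile):

```python
# Idealized scaling model. t1 = single-core time on the reference problem.

def strong_scaling_time(t1: float, n: int) -> float:
    """Fixed problem size: perfect strong scaling finishes in t1/n."""
    return t1 / n

def weak_scaling_time(t1: float, n: int) -> float:
    """Problem grows n-fold with n cores: perfect weak scaling keeps
    wall time constant, because (n * t1) / n == t1."""
    return t1

t1 = 120.0  # seconds, hypothetical single-core job
print(strong_scaling_time(t1, 8))  # 15.0 s on 8 cores, same total work
print(weak_scaling_time(t1, 8))    # 120.0 s on 8 cores, 8x the work
```

A kernel compile of one fixed tree is a strong-scaling test; running N independent compiles at once, as suggested above, is effectively a weak-scaling test.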
I am curious about the PCH. If the PCH goes away will the SATA controller also move into the CPU or will the motherboard manufacturers have to add a SATA controller, or is SATA simply dead?
I see on the diagram the network cards, the M.2 and the other PCIe lanes, but I don’t see the SATA controller that usually lived in the PCH.
Or maybe there will be a simpler PCH?
I hope you'll do more of that Linux kernel compile benchmark partitioning. That's what's needed for chips like these. If you're consolidating from old servers to new big CC CPUs you don't need an app on all cores. You need to take an 8 vCPU VM on the old server and run it on the new one. My critique is that instead of 36 vCPU I'd want to see 8 vCPU.
“Ask a cloud provider, and they will tell you that 8 vCPUs and smaller VMs make up the majority of the VMs that their customers deploy.”
As someone who runs a small cloud hosting data center I can confirm this. We have 9 out of 150 VMs that have more than 8 vCPUs. Of those 9, I can tell you that they could easily be cut in half and not affect performance of the VMs at all. However, the customers wanted and are paying for the extra vCPUs so they get them.
OK, auto-refresh of the page wiped my comment… an, ahem, great way to prevent any bigger posts. So I will keep the rant and skip the analytical part of the comment I wanted to post for others.
——–
The conclusion is, well, Intel-sponsored I guess.
On one hand you give one to Intel, stating Sierra Forest is really positioned a step below Bergamo – which is correct.
Then, one paragraph later, you criticise Siena for the same fact – that it is positioned (even more so) below Sierra Forest.
A lost opportunity.
For the Bergamo comparison – “but again, remember, a large double-digit percentage of all infrastructure is determined by the ability to place a number of VMs and their vCPUs onto hardware. 256 threads is more than 144, but without SMT that becomes a 128 v. 144 discussion.” That is such a contrived conclusion. I doubt many service providers actually think like this.
Divyjot I work at a big cloud provider so my thoughts are my own. You might notice all the cloud providers aren’t bringing SMT to their custom silicon designs. SMT with the side channel attacks is a nightmare. You don’t see scheduling across different physical cores in large public clouds for this reason.
That conclusion that Bergamo’s Zen 4c is too much perf per core is also on target.
I’d say they did a great job, but I’d also say the 288 core is going to be a step up. I’d rather have 288 physical cores than 384 threads using SMT.
AMD needs a 256-core Turin Dense. What they’ve missed is that Intel offers more than twice the E-cores versus the P-cores. We’re buying top-end Genoa, not Bergamo, even with STH saying Bergamo is great, because we didn’t want to hit that low-cache case in our infrastructure. 96 to 128 is only 33% more. You’re needing to show a bigger jump in core counts. 128 to 192 is only 50% more. AMD needs 256.
I think this is just an appetizer for Clearwater Forest next year with Darkmont cores on Intel 18A. That would be a serious product for most workloads except ones requiring AVX512.
Oh wow, a truly rare unicorn here, a Patrick/STH article right out of the funny pages, which is great, everybody likes to laugh once in a while!
Hurray, cloud providers are getting more efficient. Meanwhile, I’m not seeing the costs for these low end minimalist servers going down. It’s impressive how many more cores and how much more RAM and how many more gigabits of networking you can buy per $ only for the price from year to year to stay the same…
It would be great if your benchmark suite included some more use cases to reflect the weird CPUs, especially the embedded parts.
Things like QAT for nginx or an OPNsense router or a Tailscale exit node or an SMB server. I know they aren’t traditional compute tasks but they do need CPUs and it’s what most STH readers probably actually use the devices for.
@Patrick: Please stop with this ridiculous pro-Intel framing
You say that Bergamo is “above” Sierra Forest but they basically have the same list prices. The 9754S with 1T per core is even cheaper and I would have loved to see a comparison of that 1T1C for both AMD and Intel.
“What Intel has here is something like a Siena”: No, you really need to change your conclusion after Intel published their price list.
BTW
Bergamo 9754 is going for 5400 Euros (including 19% VAT) at regular retailers in Europe, and the 9734 for 3600 Euros. I really don’t think Bergamo will be “above” Sierra Forest even at “real world” prices for larger customers.
Forget AMD. I think this article is sponsored by Ampere or Arm. Ampere or Arm must have paid to not have its chips in these charts. Intel’s 1G E Core Xeon is more than 30% faster per core than the Altra Max M128 even with more cores in the same power. You’re also not being fair since Sierra’s using DDR5, so there’s a gap for memory. PCIe Gen5 is higher power and faster, so some of Intel’s 250W is being used for that. 144 cores at 250W is amazing. We’ve got so much older gear and even still low utilization, so BOTE math makes this a big winner. We’ve got renewal at the colo coming. I can’t wait to watch how they’ll take reducing 40 cabs to 4.
I think AMD’s faster on AVX512 but web servers will get much more benefit from QAT than they do AVX512. I don’t think that’s being taken into account enough. You’re handicapping Intel versus AMD by not using that.
If you do the math on the 9754S loss of threads that’s about 14% below the 9754. Intel’s got integer performance 25% above the 8594+ so you’d end up at 19% lower perf for the 6780E than the 9754S, not taking into account QAT which you should but it won’t work for integer workloads.
With that 19% lower performance you’ve got 12.5% more cores on Intel, so that’ll have a larger impact on how many vCPUs you can provision. You’re at a lower perf per core with Intel, but more vCPU capacity.
When we look at power though, that 6780E screenshot is 302W so it’s 58W less than the 360W TDP 9754S since AMD typically uses its entire TDP. That’s just over 16% less power. I’d assume that extra 28W is for accelerators and other chip parts.
So Intel’s 19% less perf than Bergamo without SMT at 16% less power. Yet Intel’s delivering 12.5% more vCPUs and that QAT if you’re enabling it for OpenSSL offload or IPsec will more than outweigh the 3% perf/power difference. I don’t think QAT’s as important on super computer chips, but in this market, it’s aimed directly in target workloads.
If you’re just going vCPU / power and don’t care about 20% performance, then the 6766E is the clear winner in all of this. We’ve got over 70,000 VMs where I work and I can tell you that 97% are 8 vCPUs or fewer. Less than 15% of those VMs have hit 100% CPU in the last quarter.
What this article fails to cover is the 1 vCPU VM. If you’re a poor cloud provider like a tier 5 one maybe you’re putting two different VMs on a SMT core’s 2 threads. For any serious tier 1 or tier 2 cloud provider, and any respectable enterprise cloud, they aren’t putting 2 different VMs on the same physical core’s 2 threads.
I’d say this is a great article and very fair. I don’t think SF is beating AMD in perf. It’s targeting what we’ve all been seeing in the industry where there’s so many small VMs that aren’t using entire vCPU performance. It’s the GPU moment for Linux virtualization.
@ModelAirplanesInSeattleGuy
“more vCPUs”: Don’t know where you’re working but no company I’ve been at cares about just more VMs. It’s about cost(including power) and performance. We never consolidate to a server where the VMs don’t offer significant performance upgrades. It’s about future proofing.
“since AMD typically uses its entire TDP” : Like all CPUs it depends on the workload. Your calculation is worthless
Regarding QAT: What is the performance of these 2 QAT units (at least 4th/5th Gen Xeon Platinum has 4 units) when you have 144 VMs (like your example, or just 32) accessing QAT through SR-IOV? The fact that it’s hard to find any information on this shows that very few are using QAT despite all the talk. Anyone looking for such an extreme use case would use DPUs.
Is Xeon 6 a new socket, or can it be used in Sapphire Rapids motherboards?
Xeon 6 processors use a new socket and are not compatible with Sapphire Rapids motherboards/sockets.