Intel Xeon 6 6700E Sierra Forest Shatters Xeon Expectations


Intel P-Core Versus E-Core The Great Divide

The reason this is, by far, the biggest change to Xeon in decades is that it is the first time we will really have two different core architectures in the mainstream sockets. Intel’s observation (although it is far from the first to arrive here) is that only one set of compute runs 24×7 at near 100% utilization. The Performance cores, or P-cores, serve that set and are designed to deliver the maximum performance per thread. If you have per-core licensed software, HPC applications, big, expensive AI servers, and so forth, you want P-core CPUs. The Efficient cores, or E-cores, are designed for the huge body of workloads that need to exist but are better suited to chasing power efficiency for cost savings rather than maximum performance.

Intel Xeon 6 Workloads 2

Here is Intel’s shot at which workloads are best suited to P-cores versus E-cores. This is generally pretty good, but virtualization should have a big note on it. A target market for this is cloud service providers (CSPs) with 2-8 vCPU customer instances that they want to deliver at a lower cost. If your idea of virtualization is VMware or Microsoft Windows Server with per-core and per-socket style licensing, then you would rightly look to P-cores and will miss out on the benefits of E-cores from an efficiency standpoint.

Intel Xeon 6 P Core E Core Workloads
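
To put the 2-8 vCPU instance point in concrete terms, here is a minimal sketch of how many instances fit on one socket. It assumes one vCPU per physical core with no oversubscription, which is our simplification rather than any provider's policy. The `instances_per_socket` function and its `reserved_cores` parameter are hypothetical names used for illustration, and 144 is the top core count cited for this generation.

```python
def instances_per_socket(cores: int, vcpus_per_instance: int, reserved_cores: int = 0) -> int:
    """Whole instances that fit when every vCPU gets its own physical core.

    reserved_cores is a hypothetical allowance for the hypervisor or host OS.
    """
    if vcpus_per_instance <= 0:
        raise ValueError("vcpus_per_instance must be positive")
    usable = cores - reserved_cores
    if usable < 0:
        raise ValueError("reserved_cores exceeds the core count")
    return usable // vcpus_per_instance


CORES = 144  # core count cited in the article for the top small-socket E-core part

for size in (2, 4, 8):
    count = instances_per_socket(CORES, size)
    print(f"{size} vCPU instances per socket: {count}")
```

Real placement also depends on memory, I/O, and any oversubscription ratio, so treat the output as an upper bound on pinned instances rather than a capacity plan.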

E-cores are architecturally different from P-cores. While the two share many features, AVX-512 and AMX are not present on the E-cores.

Intel Xeon 6 Feature Comparison
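
Since AVX-512 and AMX are missing on the E-cores, software that assumes them should check before dispatching. Below is a minimal sketch for x86 Linux that parses the `flags` line of `/proc/cpuinfo`. The helper names are ours for illustration, and `avx512f` and `amx_tile` are the flag names the Linux kernel reports for the AVX-512 Foundation and AMX tile features.

```python
from pathlib import Path

# Kernel-reported flag names for AVX-512 Foundation and AMX tile support.
WANTED = ("avx512f", "amx_tile")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line, e.g. on a non-x86 system


def missing_features(flags: set[str], wanted=WANTED) -> list[str]:
    """List the wanted features that the CPU does not advertise."""
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_features(parse_cpu_flags(cpuinfo.read_text()))
        print("missing:", ", ".join(missing) if missing else "none")
```

Production code would more likely query CPUID through a library or compiler builtin. Reading `/proc/cpuinfo` is simply the quickest way to see what a given host exposes.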

Here is a bit more on the cores. At this point, let us simply look at the codenames. The P-core “Redwood Cove” is the Meteor Lake generation P-core on the client side. The E-core “Crestmont” is the Meteor Lake generation E-core. Of course, that is a bit of an oversimplification, but as Intel discusses Lunar Lake today, we want to keep folks grounded on where the new chips fall compared to their client counterparts. If you want to know why we do things like review Intel i3-N305 2.5GbE firewalls, it is because we have been keeping tabs on E-core progress up to this point. That is also the reason we are really excited about larger all-E-core processors.

Intel Xeon 6 Workloads 3

There may be folks that hear E-core and think Atom. That is correct, since Atom is the lineage of the E-core designs. In 2012, for example, we reviewed the all-E-core Supermicro X7SPA-HF-D525 with the Atom D525, and then in 2013 the Intel Atom S1260. 2013 also saw Avoton/Rangeley. In 2017, we reviewed the Atom C3000 series. We have been reviewing the Snow Ridge and Alder Lake-N parts more recently. Still, the theme is that Intel has been increasing the performance per core on its E-cores at an almost absurd rate over the past few years. When we say E-core, instead of thinking of some super slow core, think of something more like an Intel Xeon E5 V4 core level of performance, or a thread of 5th Gen Xeon Emerald Rapids.

Intel Xeon 6 Roadmap

Still, P-cores are far from done. Next quarter we are getting the big chips, the Intel Xeon 6900P series. We will get up to 128 cores, 12 memory channels, and the ability to use MCR DIMMs. MCR DIMMs will give this new chip bandwidth competitive with the Intel Xeon Max, but without requiring lower capacity HBM onboard. P-core Xeon core counts are set to double in about three quarters.

Intel Xeon 6900P Q3 2024
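
To see why 12 channels plus MCR DIMMs matter for bandwidth, here is a rough sketch of the theoretical peak: channels times transfer rate times bytes per transfer. The 8 bytes per transfer reflects a 64-bit DDR5 data path per channel. The 6400 MT/s and 8800 MT/s rates are our assumptions for standard DDR5 and MCR DIMMs, not figures from this article, and the function name is hypothetical.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal), ignoring all overheads."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


CHANNELS = 12  # memory channels cited for the Xeon 6900P series

# Assumed transfer rates; swap in the real platform values when known.
for label, rate in (("DDR5 (assumed 6400 MT/s)", 6400), ("MCR DIMM (assumed 8800 MT/s)", 8800)):
    print(f"{label}: {peak_bandwidth_gbs(CHANNELS, rate):.1f} GB/s per socket")
```

These are ceiling numbers only. Sustained bandwidth in practice lands below the theoretical peak.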

Taking a step back, this means the small socket gets E-cores first and P-cores later, while the big socket gets P-cores first and E-cores later. Today we are focusing on the small socket E-core parts.

Intel Xeon 6700 Die Packages

Next quarter, we will have a big socket P-core CPU for the 128 core Granite Rapids and that socket will also support an E-core variant.

Intel Xeon 6900 Series Die Packages

We will get the other models in Q1 2025, like the big socket E-core Sierra Forest-AP with 288 cores per socket and up to 86 P-cores in the smaller socket. There is also a Xeon 6 SoC on the roadmap for Q1. If this is confusing, here is what we can tell from Intel’s various slides in a single table.

Intel Xeon 6 Rollout Plan

Small socket is LGA4710 and big socket will likely be LGA7529.

The other big feature here is how these are built. Intel is using an Intel 3 process die for its E-core tiles.

Intel Xeon 6 Intel 3 Compute Tile

Then, on either side, there are Intel 7 process I/O dies with things like socket-to-socket UPI links, PCIe Gen5 controllers, CXL, and accelerators.

Intel Xeon 6 Intel 7 IO Tile

We recently went into how big of an update the XCC 5th Gen Intel Xeon “Emerald Rapids” was over the 4th Gen. Intel is now able to put 144 cores and memory controllers on the same tile.

5th Gen Xeon To Intel Xeon 6 Tile Design

Here are those chips in reverse and you can see the packages are about the same size. Both are much smaller than an AMD EPYC Genoa or Bergamo. This is important since the 6700 series seems to be positioned below the 6900 series in terms of platform cost, performance, and power consumption.

Intel Xeon 6780E And 5th Gen Intel Xeon EMR 1

Next, let us get to some of the Intel Xeon 6 platform features.

18 COMMENTS

  1. The discussion about needing more instances of the workload for a good kernel compile benchmark is exactly what the concepts of weak scaling vs. strong scaling in the HPC world are. It is a property of the application. If when using N cores, the calculation finishes in 1/Nth the time of a single threaded instance, it is said to have a strong scaling property. If when using N cores, the problem size is also N times bigger and the calculation finishes in the same time as the reference problem on 1 core, it is said to have a weak scaling property. In both cases perfect scaling means that using N cores is N times faster, but for weak scaling that only holds if you also increase the problem size.

  2. I am curious about the PCH. If the PCH goes away will the SATA controller also move into the CPU or will the motherboard manufacturers have to add a SATA controller, or is SATA simply dead?
    I see on the diagram the network cards, the M.2 and the other PCIe lanes, but I don’t see the SATA controller that usually lived in the PCH.
    Or maybe there will be a simpler PCH?

  3. I hope you’d do more of that linux kernel compile benchmark partitioning. That’s what’s needed for chips like these. If you’re consolidating from old servers to new big CC CPUs you don’t need an app on all cores. You need to take an 8 vCPU VM on old and run it on new. My critique is instead of 36 vCPU I’d want to see 8 vCPU

  4. “Ask a cloud provider, and they will tell you that 8 vCPUs and smaller VMs make up the majority of the VMs that their customers deploy.”

    As someone who runs a small cloud hosting data center I can confirm this. We have 9 out of 150VMs that have more than 8 vCPUs. Of those 9 I can tell you that they could easily be cut in half and not affect performance of the VMs at all. However, the customers wanted and are paying for the extra vCPUs so they get them.

  5. OK, auto-refresh of page wiped my comment … a ahem great way to prevent any bigger posts. So will keep the rant and skip analytical part of the comment I wanted to post for others ..

    ——–
    The conclusion is, well, Intel-sponsored I guess.
    On one hand you give one to Intel, stating Sierra Forest is really positioned a step below Bergamo – which is correct.
    Then, one paragraph later, you criticise Siena for the same fact – that it is positioned (even more so) below Sierra Forest.

    A lost opportunity.

  6. For the Bergamo comparison: “but again, remember, a large double-digit percentage of all infrastructure is determined by the ability to place a number of VMs and their vCPUs onto hardware. 256 threads is more than 144, but without SMT that becomes a 128 v. 144 discussion.” That is such a contrived conclusion. I doubt many service providers actually think like this.

  7. Divyjot I work at a big cloud provider so my thoughts are my own. You might notice all the cloud providers aren’t bringing SMT to their custom silicon designs. SMT with the side channel attacks is a nightmare. You don’t see scheduling across different physical cores in large public clouds for this reason.

    That conclusion that Bergamo’s Zen 4c is too much perf per core is also on target.

    I’d say they did a great job, but I’d also say the 288 core is going to be a step up. I’d rather have 288 physical cores than 384 threads using SMT.

  8. AMD needs a 256 core Turin Dense. What they’ve missed is that Intel offers more than twice the E cores than the P. We’re buying Genoa not Bergamo top end even with STH saying Bergamo is great because we didn’t want to hit that low cache case in our infrastructure. 96 to 128 is only 33% more. You’re needing to show a bigger jump in core counts. 128 to 192 is only 50% more. AMD needs 256.

  9. I think this is just an appetizer for Clearwater Forest next year with Darkmont cores on Intel 18A. That would be a serious product for most workloads except ones requiring AVX512.

  10. Oh wow, a truly rare unicorn here, a Patrick/STH article right out of the funny pages, which is great, everybody likes to laugh once in a while!

  11. Hurray, cloud providers are getting more efficient. Meanwhile, I’m not seeing the costs for these low end minimalist servers going down. It’s impressive how many more cores and how much more RAM and how many more gigabits of networking you can buy per $ only for the price from year to year to stay the same…

    It would be great if your benchmark suite reflected some more use cases to reflect the weird CPUs, especially for the embedded parts.

    Things like QAT for nginx or an OPNsense router or Tailscale exit node or SMB server. I know they aren’t traditional compute tasks but they do need CPUs and it’s what most STH readers probably actually use the devices for.

  12. @Patrick: Please stop with this ridiculous pro-Intel framing

    You say that Bergamo is “above” Sierra Forest but they basically have the same list prices. The 9754S with 1T per core is even cheaper and I would have loved to see a comparison of that 1T1C for both AMD and Intel.

    “What Intel has here is something like a Siena”: No, you really need to change your conclusion after Intel published their price list.

  13. BTW

    Bergamo 9754 is going for 5400 Euros (including 19% VAT) at regular retailers in Europe and 9734 for 3600 Euro. I really don’t think Bergamo will be “above” Sierra Forest even at “real world” prices for larger customers.

  14. Forget AMD. I think this article is sponsored by Ampere or Arm. Ampere or Arm must have paid to not have its chips in these charts. Intel’s 1G E Core Xeon is more than 30% faster per core than the Altra Max M128 even with more cores in the same power. You’re also not being fair since Sierra’s using DDR5 so that’s gap for memory. PCIe Generation 5 is higher power and faster. So Intel’s 250W is being used some for that. 144 cores at 250W is amazing. We’ve got so much older gear and even still low utilization so BOTE math makes this a big winner. We’ve got renewal at the colo coming. I can’t wait to watch how they’ll take reducing 40 cabs to 4.

  15. I think AMD’s faster on AVX512 but web servers will get much more benefit from QAT than they do AVX512. I don’t think that’s being taken into account enough. You’re handicapping Intel versus AMD by not using that.

    If you do the math on the 9754S loss of threads that’s about 14% below the 9754. Intel’s got integer performance 25% above the 8594+ so you’d end up at 19% lower perf for the 6780E than the 9754S, not taking into account QAT which you should but it won’t work for integer workloads.

    With that 19% lower performance you’ve got 12.5% more cores on Intel, so that’ll have a larger impact on how many vCPUs you can provision. You’re at a lower perf per core with Intel, but more vCPU capacity.

    When we look at power though, that 6780E screenshot is 302W so it’s 58W less than the 360W TDP 9754S since AMD typically uses its entire TDP. That’s just over 16% less power. I’d assume that extra 28W is for accelerators and other chip parts.

    So Intel’s 19% less perf than Bergamo without SMT at 16% less power. Yet Intel’s delivering 12.5% more vCPUs and that QAT if you’re enabling it for OpenSSL offload or IPsec will more than outweigh the 3% perf/power difference. I don’t think QAT’s as important on super computer chips, but in this market, it’s aimed directly in target workloads.

    If you’re just going vCPU / power and don’t care about 20% performance, then the 6766E is the clear winner in all of this. We’ve got over 70,000 VM’s where I work and I can tell you that we are 97% 8 vCPU and fewer. Less than 15% of those VMs have hit 100% CPU in the last quarter.

    What this article fails to cover is the 1 vCPU VM. If you’re a poor cloud provider like a tier 5 one maybe you’re putting two different VMs on a SMT core’s 2 threads. For any serious tier 1 or tier 2 cloud provider, and any respectable enterprise cloud, they aren’t putting 2 different VMs on the same physical core’s 2 threads.

    I’d say this is a great article and very fair. I don’t think SF is beating AMD in perf. It’s targeting what we’ve all been seeing in the industry where there’s so many small VMs that aren’t using entire vCPU performance. It’s the GPU moment for Linux virtualization.

  16. @ModelAirplanesInSeattleGuy

    “more vCPUs”: Don’t know where you’re working but no company I’ve been at cares about just more VMs. It’s about cost(including power) and performance. We never consolidate to a server where the VMs don’t offer significant performance upgrades. It’s about future proofing.

    “since AMD typically uses its entire TDP” : Like all CPUs it depends on the workload. Your calculation is worthless

    Regarding QAT: What is the performance of these 2 QAT (at least Xeon 4th/5th gen platinum has 4 units) units when you use 144 VMs(like your example, or just 32) accessing QAT through SR-IOV? The fact that it’s hard to find any information on it shows that very few are using QAT despite all this talk. Anyone looking for such an extreme use case would use DPUs.
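
For readers who want to check the back-of-the-envelope math in comment 15, here is a small sketch using only the figures quoted there: 302W versus 360W, 144 versus 128 cores, and an estimated 19% performance deficit. The 19% figure is the commenter's own estimate and is taken as given, not verified. The function names are ours for illustration.

```python
def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new (negative means a reduction)."""
    return (new - old) / old * 100


def relative_perf_per_watt(perf_ratio: float, power_new: float, power_old: float) -> float:
    """Perf-per-watt of the new part relative to the old part (1.0 means equal)."""
    return perf_ratio / (power_new / power_old)


power_delta = pct_change(302, 360)   # power figures quoted in comment 15
core_delta = pct_change(144, 128)    # physical cores, Bergamo run without SMT
perf_ratio = 1 - 0.19                # commenter's estimate: 19% lower performance

print(f"power change: {power_delta:.1f}%")
print(f"core count change: {core_delta:+.1f}%")
print(f"relative perf/W: {relative_perf_per_watt(perf_ratio, 302, 360):.3f}")
```

Dividing the ratios gives a perf-per-watt gap of roughly 3 to 4%, close to the 3% the commenter gets by subtracting the two percentages.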
