Intel Xeon 6 6700E Sierra Forest Shatters Xeon Expectations


Intel Xeon 6 Platforms

The challenge is that delivering at least four different high-level combinations of P-core or E-core parts and sockets means there is a lot of variability in specs. You will see a lot of “Up to” here because Intel is differentiating on features depending on the core type and SKU.

Intel Xeon 6P And 6E Comparisons

A great example of this is that the Xeon 6700E has 88 PCIe Gen5 lanes, but the slide says 96 lanes because it does not just cover the 6700E series. The P-cores, however, can get up to 136 PCIe Gen5 lanes for single-socket designs. In the 6900 series, Intel also has 12-channel memory, MCR DIMM support with P-cores, and more UPI links for more inter-socket bandwidth.

Intel Xeon 6 6700 And 6900 Platform 3

At STH we have been talking about CXL a lot for the past several years. Intel Xeon 6 supports CXL 2.0 Type 1, Type 2, and Type 3 devices on the 64 lanes that support CXL.

Intel Xeon 6 CXL 2.0 1

Of course, there are also caveats to things like the CXL Type 3 devices: the P-core variants get interleaved memory options in both sockets, but the E-core variants do not. We have seen the CXL Heterogeneous Interleaved mode and it is really cool. Imagine having ~8 additional channels of memory bandwidth and running DIMMs and CXL memory modules together in a giant pool of memory with bandwidth striped across all of them. Sierra Forest does not get that. Instead, CXL memory can either be its own NUMA node or run in a hardware-assisted flat memory mode.

Intel Xeon 6 CXL 2.0 2

To make this a bit more concrete, with something like an Astera Labs Aurora A1000 card, we can simply put it in one of the PCIe Gen5 x16 riser slots.

Astera Labs Aurora A1000 With 4x 64GB DDR5 DIMMs 2

We can fill the card with 4x 64GB DDR5 RDIMMs and get another 256GB of memory capacity at about the same latency as adjacent socket memory (i.e., the memory pool connected to the opposite CPU in a 2-socket server). We can then do this four times and get an extra 1TB of memory in the server and roughly 8 channels of DDR5 worth of bandwidth.
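The capacity math for that configuration can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the figures quoted in the text (4x 64GB DIMMs per card, four cards, one x16 slot each); the function name and dictionary keys are made up for illustration, and it does not attempt to model bandwidth or latency.

```python
# Back-of-the-envelope CXL expansion math using the figures from the text.
# The function and key names are illustrative only, not from any tool.
DIMM_GB = 64          # 64GB DDR5 RDIMMs
DIMMS_PER_CARD = 4    # four DIMMs per CXL memory card
CARDS = 4             # "do this four times"
LANES_PER_CARD = 16   # each card sits in a PCIe Gen5 x16 slot


def cxl_expansion(dimm_gb=DIMM_GB, dimms_per_card=DIMMS_PER_CARD,
                  cards=CARDS, lanes_per_card=LANES_PER_CARD):
    """Return added capacity and CXL lane usage for a set of expansion cards."""
    per_card_gb = dimm_gb * dimms_per_card
    return {
        "per_card_gb": per_card_gb,                 # capacity of one card
        "total_gb": per_card_gb * cards,            # capacity of all cards
        "cxl_lanes_used": lanes_per_card * cards,   # lanes consumed in total
    }


if __name__ == "__main__":
    print(cxl_expansion())
```

Four x16 cards also line up with the 64 CXL-capable lanes mentioned earlier, which is why four is the natural ceiling in this example.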

Astera Labs Aurora A1000 With 4x 64GB DDR5 DIMMs 3

CXL has another trick, however. Suppose you are a hyper-scale customer with loads of working DDR4 modules. In that case, you can use a compatible CXL DDR4 controller card and plug memory modules into your DDR5 server via a similar methodology to the above. We have a ton of DDR4 memory and cannot wait until we can get a CXL DDR4 memory shelf that we can cable (using retimers) and use with these systems.

Intel Xeon 6 CXL 2.0 3

All told, it comes together in something like this as the Intel-QCT example:

Intel QCT Birch Stream Xeon 6 Platform 7

Or as a Supermicro example:

Supermicro SYS 222H TN Xeon 6 Birch Stream Platform 3

Here we can see the dual sockets, each with 8 channels of DDR5 and 2 DIMMs per channel. In the Supermicro system, we have 32x 64GB DIMMs for 2TB of memory along with the 288 cores. This is awesome.
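As a quick sanity check on those numbers, here is a small Python sketch that derives the DIMM count, total capacity, and core count from the per-socket figures. The 144 cores per socket is simply the 288-core total split across two sockets; the function name and the memory-per-core figure are added for illustration and are not from the original text.

```python
# Derive the dual-socket totals from per-socket figures quoted in the text.
# platform_totals is a hypothetical helper name used for illustration.
SOCKETS = 2
CHANNELS_PER_SOCKET = 8   # 8-channel DDR5
DIMMS_PER_CHANNEL = 2     # 2DPC
DIMM_GB = 64
CORES_PER_SOCKET = 144    # 288 cores across two sockets


def platform_totals(sockets=SOCKETS, channels=CHANNELS_PER_SOCKET,
                    dpc=DIMMS_PER_CHANNEL, dimm_gb=DIMM_GB,
                    cores_per_socket=CORES_PER_SOCKET):
    """Return (dimm_slots, total_gb, total_cores, gb_per_core)."""
    dimm_slots = sockets * channels * dpc
    total_gb = dimm_slots * dimm_gb
    total_cores = sockets * cores_per_socket
    return dimm_slots, total_gb, total_cores, total_gb / total_cores


if __name__ == "__main__":
    slots, gb, cores, gb_per_core = platform_totals()
    print(f"{slots} DIMMs, {gb} GB, {cores} cores, {gb_per_core:.1f} GB/core")
```

With 2TB spread over 288 cores, that works out to a little over 7GB per core before adding any CXL memory.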

Intel Sierra Forest 6700E Pads 1

Next, let us get to the performance.

18 COMMENTS

  1. The discussion about needing more instances of the workload for a good kernel compile benchmark is exactly what the concepts of weak scaling vs. strong scaling in the HPC world are. It is a property of the application. If when using N cores, the calculation finishes in 1/Nth the time of a single threaded instance, it is said to have a strong scaling property. If when using N cores, the problem size is also N times bigger and the calculation finishes in the same time as the reference problem on 1 core, it is said to have a weak scaling property. In both cases perfect scaling means that using N cores is N times faster, but for weak scaling that only holds if you also increase the problem size.

  2. I am curious about the PCH. If the PCH goes away will the SATA controller also move into the CPU or will the motherboard manufacturers have to add a SATA controller, or is SATA simply dead?
    I see on the diagram the network cards, the M.2 and the other PCIe lanes, but I don’t see the SATA controller that usually lived in the PCH.
    Or maybe there will be a simpler PCH?

  3. I hope you’d do more of that linux kernel compile benchmark partitioning. That’s what’s needed for chips like these. If you’re consolidating from old servers to new big CC CPUs you don’t need an app on all cores. You need to take an 8 vCPU VM on old and run it on new. My critique is instead of 36 vCPU I’d want to see 8 vCPU

  4. “Ask a cloud provider, and they will tell you that 8 vCPUs and smaller VMs make up the majority of the VMs that their customers deploy.”

    As someone who runs a small cloud hosting data center I can confirm this. We have 9 out of 150VMs that have more than 8 vCPUs. Of those 9 I can tell you that they could easily be cut in half and not affect performance of the VMs at all. However, the customers wanted and are paying for the extra vCPUs so they get them.

  5. OK, auto-refresh of page wiped my comment … a ahem great way to prevent any bigger posts. So will keep the rant and skip analytical part of the comment I wanted to post for others ..

    ——–
    The conclusion is, well, Intel-sponsored I guess.
    On one hand you give one to Intel, stating Sierra Forest is really positioned a step below Bergamo – which is correct.
    Then, one paragraph later, you criticise Siena for the same fact – that it is positioned (even more so) below Sierra Forest.

    A lost opportunity.

  6. For the Bergamo comparison -“but again, remember, a large double-digit percentage of all infrastructure is determined by the ability to place a number of VMs and their vCPUs onto hardware. 256 threads is more than 144, but without SMT that becomes a 128 v. 144 discussion.” That is such a contrived conclusion. I doubt how many service providers actually think like this.

  7. Divyjot I work at a big cloud provider so my thoughts are my own. You might notice all the cloud providers aren’t bringing SMT to their custom silicon designs. SMT with the side channel attacks is a nightmare. You don’t see scheduling across different physical cores in large public clouds for this reason.

    That conclusion that Bergamo’s Zen 4c is too much perf per core is also on target.

    I’d say they did a great job, but I’d also say the 288 core is going to be a step up. I’d rather have 288 physical cores than 384 threads using SMT.

  8. AMD needs a 256 core Turin Dense. What they’ve missed is that Intel offers more than twice the E cores than the P. We’re buying Genoa not Bergamo top end even with STH saying Bergamo is great because we didn’t want to hit that low cache case in our infrastructure. 96 to 128 is only 33% more. You’re needing to show a bigger jump in core counts. 128 to 192 is only 50% more. AMD needs 256.

  9. I think this is just an appetizer for Clearwater Forest next year with Darkmont cores on Intel 18A. That would be a serious product for most workloads except ones requiring AVX512.

  10. Oh wow, a truly rare unicorn here, a Patrick/STH article right out of the funny pages, which is great, everybody likes to laugh once in a while!

  11. Hurray, cloud providers are getting more efficient. Meanwhile, I’m not seeing the costs for these low end minimalist servers going down. It’s impressive how many more cores and how much more RAM and how many more gigabits of networking you can buy per $ only for the price from year to year to stay the same…

    It would be great if your benchmark suite reflected some more use cases to reflect the weird CPUs, especially for the embedded parts.

    Things like QAT for nginx or an OPNsense router or Tailscale exit node or SMB server. I know they aren’t traditional compute tasks but they do need CPUs and it’s what most STH readers probably actually use the devices for.

  12. @Patrick: Please stop with this ridiculous pro-Intel framing

    You say that Bergamo is “above” Sierra Forest but they basically have the same list prices. The 9754S with 1T per core is even cheaper and I would have loved to see a comparison of that 1T1C for both AMD and Intel.

    “What Intel has here is something like a Siena”: No, you really need to change your conclusion after Intel published their price list.

  13. BTW

    Bergamo 9754 is going for 5400 Euros (including 19% VAT) at regular retailers in Europe and 9734 for 3600 Euro. I really don’t think Bergamo will be “above” Sierra Forest even at “real world” prices for larger customers.

  14. Forget AMD. I think this article is sponsored by Ampere or Arm. Ampere or Arm must have paid to not have its chips in these charts. Intel’s 1G E Core Xeon is more than 30% faster per core than the Altra Max M128 even with more cores in the same power. You’re also not being fair since Sierra’s using DDR5 so that’s gap for memory. PCIe Generation 5 is higher power and faster. So Intel’s 250W is being used some for that. 144 cores at 250W is amazing. We’ve got so much older gear and even still low utilization so BOTE math makes this a big winner. We’ve got renewal at the colo coming. I can’t wait to watch how they’ll take reducing 40 cabs to 4.

  15. I think AMD’s faster on AVX512 but web servers will get much more benefit from QAT than they do AVX512. I don’t think that’s being taken into account enough. You’re handicapping Intel versus AMD by not using that.

    If you do the math on the 9754S loss of threads that’s about 14% below the 9754. Intel’s got integer performance 25% above the 8594+ so you’d end up at 19% lower perf for the 6780E than the 9754S, not taking into account QAT which you should but it won’t work for integer workloads.

    With that 19% lower performance you’ve got 12.5% more cores on Intel, so that’ll have a larger impact on how many vCPUs you can provision. You’re at a lower perf per core with Intel, but more vCPU capacity.

    When we look at power though, that 6780E screenshot is 302W so it’s 58W less than the 360W TDP 9754S since AMD typically uses its entire TDP. That’s just over 16% less power. I’d assume that extra 28W is for accelerators and other chip parts.

    So Intel’s 19% less perf than Bergamo without SMT at 16% less power. Yet Intel’s delivering 12.5% more vCPUs and that QAT if you’re enabling it for OpenSSL offload or IPsec will more than outweigh the 3% perf/power difference. I don’t think QAT’s as important on super computer chips, but in this market, it’s aimed directly in target workloads.

    If you’re just going vCPU / power and don’t care about 20% performance, then the 6766E is the clear winner in all of this. We’ve got over 70,000 VM’s where I work and I can tell you that we are 97% 8 vCPU and fewer. Less than 15% of those VMs have hit 100% CPU in the last quarter.

    What this article fails to cover is the 1 vCPU VM. If you’re a poor cloud provider like a tier 5 one maybe you’re putting two different VMs on a SMT core’s 2 threads. For any serious tier 1 or tier 2 cloud provider, and any respectable enterprise cloud, they aren’t putting 2 different VMs on the same physical core’s 2 threads.

    I’d say this is a great article and very fair. I don’t think SF is beating AMD in perf. It’s targeting what we’ve all been seeing in the industry where there’s so many small VMs that aren’t using entire vCPU performance. It’s the GPU moment for Linux virtualization.

  16. @ModelAirplanesInSeattleGuy

    “more vCPUs”: Don’t know where you’re working but no company I’ve been at cares about just more VMs. It’s about cost(including power) and performance. We never consolidate to a server where the VMs don’t offer significant performance upgrades. It’s about future proofing.

    “since AMD typically uses its entire TDP” : Like all CPUs it depends on the workload. Your calculation is worthless

    Regarding QAT: What is the performance of these 2 QAT (at least Xeon 4th/5th gen platinum has 4 units) units when you use 144 VMs(like your example, or just 32) accessing QAT through SR-IOV? The fact that it’s hard to find any information on it shows that very few are using QAT despite all this talk. Anyone looking for such an extreme use case would use DPUs.
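
On the strong versus weak scaling distinction raised in the first comment, a minimal Python sketch of the two efficiency measures may help. The function names and the sample timings are invented for illustration; the definitions follow the comment's description, where an efficiency of 1.0 means perfect scaling.

```python
# Strong vs. weak scaling efficiency, following the definitions in comment 1.
# Function names and sample timings are hypothetical.

def strong_scaling_efficiency(t1, tn, n):
    """Fixed problem size: ideal runtime on n cores is t1 / n."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1, tn):
    """Problem size grows n times with n cores: ideal runtime stays at t1."""
    return t1 / tn


if __name__ == "__main__":
    # One job that takes 100s on 1 core and 25s on 4 cores scales perfectly.
    print(strong_scaling_efficiency(100.0, 25.0, 4))
    # Four times the work on 4 cores taking 125s instead of 100s is 80% efficient.
    print(weak_scaling_efficiency(100.0, 125.0))
```

Running several independent kernel compiles at once, as the article does, is closer to the weak scaling case, since the total amount of work grows with the number of cores in use.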
