Intel Sierra Forest Power Consumption is Super
At this point, some folks may look at the chips and wonder why Intel would go down the path of lower-performance E-cores. One of the coolest things about Sierra Forest is the power consumption.
Let us take a quick look at what is going on here. At idle, we were seeing around 56W for each Xeon 6766E. The Xeon 6780E was slightly higher at just over 60W. Again, this is on early firmware, so the power may change.
At load, the power consumption was showing a bit less than the maximum TDP of the chips.
Remember we have the 250W Intel Xeon 6766E and the 330W Xeon 6780E.
Just to double-check, we also ran a quick powerstat with stress-ng running, and on the Intel Xeon 6780E, for example, we were at 644W. Not bad for two 330W TDP CPUs.
Let us move up to the server power consumption level, using the Supermicro platform since that is commercially available as of when this review goes live. Starting with the base BMC power consumption, with the server off but the NICs connected, keyboard plugged in, and so forth, we are in the 6-12W range.
Turning the system on, at idle, we are at around 315W. That is higher than some would expect, but it is almost exactly what we would have expected to see from an all E-core CPU. These cores are not throttling down to oblivion. Instead, the min to max power states on E-core implementations we have seen tend to be relatively close. When we load the system up, we hit a peak of 905W via the BMC's power supply readings.
A couple of quick notes. First, we are losing a small amount here due to using redundant power supplies, so single power supply use would be under 900W. Second, this is not the typical lean configuration that we use for power consumption testing. Instead, we have 32x 64GB DDR5 DIMMs for a total of 2TB of memory. We had two 100GbE ports up, and even a quad-port Intel X710-T4 10Gbase-T NIC linked because we needed an RJ45 connection, given we only had time to test this in the studio, not the data center lab. Also, we needed to use the Supermicro here since the Intel-QCT box was more of a development platform and gave higher power consumption figures when we double-checked at the wall.
Shedding half of the DIMMs (~5W each) and peripherals, we got this under 800W and there was probably room to go down from there. That is quite good with two 330W TDP CPUs.
With the Xeon 6766E, we managed to get the power consumption in the full 2TB configuration to around 725W. Dropping 80W TDP per CPU gives us 160W total towards that drop, and cooling accounts for a few watts as well. That is absolutely a crazy figure. 288 cores, 2TB of DDR5 memory, 100GbE and 10Gbase-T networking, and only 725W? Going to 1DPC and only 1TB of memory, we were sub 650W maximum, or around 2.25W per core, with the 100GbE networking lit up.
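As a rough sanity check, the per-core figures can be reproduced from the peak server readings quoted in this section. This is only a sketch: the helper name is made up for illustration, and the "sub 650W" reading is treated as an upper bound of 650W.

```python
# Peak whole-server power readings quoted in this section (watts).
CORES = 2 * 144  # dual-socket system with two 144-core E-core CPUs


def watts_per_core(server_watts: float, cores: int = CORES) -> float:
    """Whole-server watts divided by total core count (illustrative helper)."""
    return server_watts / cores


readings = {
    "Xeon 6780E, 2TB": 905,
    "Xeon 6766E, 2TB": 725,
    "Xeon 6766E, 1TB (1DPC)": 650,  # reported as "sub 650W", so an upper bound
}

for config, watts in readings.items():
    print(f"{config}: {watts_per_core(watts):.2f} W/core")

# Compare the TDP gap between the two SKUs with the measured server-level gap.
tdp_delta = 2 * (330 - 250)   # watts explained by TDP alone
measured_delta = 905 - 725    # watts observed in the 2TB configuration
print(f"TDP delta: {tdp_delta}W, measured delta: {measured_delta}W")
```

The gap between the two deltas is the portion the text attributes to cooling and other platform effects, not something this arithmetic can break down further.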
Something that we did not test, but seems correct based on what we saw, is that there is a sweet spot. Somewhere in the sub 70% utilization rate, these chips are much better than when we run them at 100%.
Intel is highlighting the Xeon 6766E here because 144 cores in 250W is great.
We are just going to leave these since we did not get to do that detailed testing, but from what we have seen, they are directionally accurate.
Next, let us get to market impact.
The 2S core-to-core latency image is too low-res to read unfortunately.
The discussion about needing more instances of the workload for a good kernel compile benchmark is exactly what the concepts of weak scaling vs. strong scaling in the HPC world are. It is a property of the application. If when using N cores, the calculation finishes in 1/Nth the time of a single threaded instance, it is said to have a strong scaling property. If when using N cores, the problem size is also N times bigger and the calculation finishes in the same time as the reference problem on 1 core, it is said to have a weak scaling property. In both cases perfect scaling means that using N cores is N times faster, but for weak scaling that only holds if you also increase the problem size.
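The two definitions above can be written as efficiency ratios, where 1.0 means perfect scaling. A minimal sketch in Python; the function names and the timings are hypothetical and only illustrate the definitions.

```python
def strong_scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Fixed problem size: the ideal runtime on n cores is t1 / n."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1: float, tn: float) -> float:
    """Problem size grows n times with n cores: the ideal runtime stays at t1."""
    return t1 / tn


# Hypothetical timings in seconds, for illustration only.
# Strong scaling: the same single job spread over 144 cores.
print(strong_scaling_efficiency(t1=1000.0, tn=10.0, n=144))

# Weak scaling: 144 independent copies of the job, one per core.
print(weak_scaling_efficiency(t1=1000.0, tn=1100.0))
```

Running many smaller instances of a compile job side by side is the weak scaling case: the total amount of work grows with the core count instead of one job being split ever finer.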
I am curious about the PCH. If the PCH goes away will the SATA controller also move into the CPU or will the motherboard manufacturers have to add a SATA controller, or is SATA simply dead?
I see on the diagram the network cards, the M.2 and the other PCIe lanes, but I don’t see the SATA controller that usually lived in the PCH.
Or maybe there will be a simpler PCH?
I hope you’d do more of that linux kernel compile benchmark partitioning. That’s what’s needed for chips like these. If you’re consolidating from old servers to new big CC CPUs you don’t need an app on all cores. You need to take an 8 vCPU VM on old and run it on new. My critique is instead of 36 vCPU I’d want to see 8 vCPU
“Ask a cloud provider, and they will tell you that 8 vCPUs and smaller VMs make up the majority of the VMs that their customers deploy.”
As someone who runs a small cloud hosting data center I can confirm this. We have 9 out of 150VMs that have more than 8 vCPUs. Of those 9 I can tell you that they could easily be cut in half and not affect performance of the VMs at all. However, the customers wanted and are paying for the extra vCPUs so they get them.
OK, auto-refresh of page wiped my comment … a ahem great way to prevent any bigger posts. So will keep the rant and skip analytical part of the comment I wanted to post for others ..
——–
The conclusion is, well, Intel-sponsored I guess.
On one hand you give one to Intel, stating Sierra Forest is really positioned a step below Bergamo – which is correct.
Then, one paragraph later, you criticise Siena for the same fact – that it is positioned (even more so) below Sierra Forest.
A lost opportunity.
For the Bergamo comparison: “but again, remember, a large double-digit percentage of all infrastructure is determined by the ability to place a number of VMs and their vCPUs onto hardware. 256 threads is more than 144, but without SMT that becomes a 128 v. 144 discussion.” That is such a contrived conclusion. I doubt many service providers actually think like this.
Divyjot I work at a big cloud provider so my thoughts are my own. You might notice all the cloud providers aren’t bringing SMT to their custom silicon designs. SMT with the side channel attacks is a nightmare. You don’t see scheduling across different physical cores in large public clouds for this reason.
That conclusion that Bergamo’s Zen 4c is too much perf per core is also on target.
I’d say they did a great job, but I’d also say the 288 core is going to be a step up. I’d rather have 288 physical cores than 384 threads using SMT.
AMD needs a 256 core Turin Dense. What they’ve missed is that Intel offers more than twice the E cores than the P. We’re buying Genoa not Bergamo top end even with STH saying Bergamo is great because we didn’t want to hit that low cache case in our infrastructure. 96 to 128 is only 33% more. You’re needing to show a bigger jump in core counts. 128 to 192 is only 50% more. AMD needs 256.
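The percentages in the comment above are plain core-count ratios between parts. A quick check in Python, using a hypothetical helper name:

```python
def percent_increase(old_cores: int, new_cores: int) -> float:
    """Relative increase in core count, in percent."""
    return (new_cores - old_cores) / old_cores * 100


# Core-count steps mentioned in the comment above.
for old, new in [(96, 128), (128, 192), (128, 256)]:
    print(f"{old} -> {new}: {percent_increase(old, new):.1f}% more cores")
```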
I think this is just an appetizer for Clearwater Forest next year with Darkmont cores on Intel 18A. That would be a serious product for most workloads except ones requiring AVX512.
Oh wow, a truly rare unicorn here, a Patrick/STH article right out of the funny pages, which is great, everybody likes to laugh once in a while!
Hurray, cloud providers are getting more efficient. Meanwhile, I’m not seeing the costs for these low end minimalist servers going down. It’s impressive how many more cores and how much more RAM and how many more gigabits of networking you can buy per $ only for the price from year to year to stay the same…
It would be great if your benchmark suite reflected some more use cases to reflect the weird CPUs, especially for the embedded parts.
Things like QAT for nginx or an OPNsense router or Tailscale exit node or SMB server. I know they aren’t traditional compute tasks but they do need CPUs and it’s what most STH readers probably actually use the devices for.
@Patrick: Please stop with this ridiculous pro-Intel framing
You say that Bergamo is “above” Sierra Forest but they basically have the same list prices. The 9754S with 1T per core is even cheaper and I would have loved to see a comparison of that 1T1C for both AMD and Intel.
“What Intel has here is something like a Siena”: No, you really need to change your conclusion after Intel published their price list.
BTW
Bergamo 9754 is going for 5400 Euros (including 19% VAT) at regular retailers in Europe and 9734 for 3600 Euro. I really don’t think Bergamo will be “above” Sierra Forest even at “real world” prices for larger customers.
Forget AMD. I think this article is sponsored by Ampere or Arm. Ampere or Arm must have paid to not have its chips in these charts. Intel’s 1G E Core Xeon is more than 30% faster per core than the Altra Max M128 even with more cores in the same power. You’re also not being fair since Sierra’s using DDR5 so that’s gap for memory. PCIe Generation 5 is higher power and faster. So Intel’s 250W is being used some for that. 144 cores at 250W is amazing. We’ve got so much older gear and even still low utilization so BOTE math makes this a big winner. We’ve got renewal at the colo coming. I can’t wait to watch how they’ll take reducing 40 cabs to 4.
I think AMD’s faster on AVX512 but web servers will get much more benefit from QAT than they do AVX512. I don’t think that’s being taken into account enough. You’re handicapping Intel versus AMD by not using that.
If you do the math on the 9754S loss of threads that’s about 14% below the 9754. Intel’s got integer performance 25% above the 8594+ so you’d end up at 19% lower perf for the 6780E than the 9754S, not taking into account QAT which you should but it won’t work for integer workloads.
With that 19% lower performance you’ve got 12.5% more cores on Intel, so that’ll have a larger impact on how many vCPUs you can provision. You’re at a lower perf per core with Intel, but more vCPU capacity.
When we look at power though, that 6780E screenshot is 302W so it’s 58W less than the 360W TDP 9754S since AMD typically uses its entire TDP. That’s just over 16% less power. I’d assume that extra 28W is for accelerators and other chip parts.
So Intel’s 19% less perf than Bergamo without SMT at 16% less power. Yet Intel’s delivering 12.5% more vCPUs and that QAT if you’re enabling it for OpenSSL offload or IPsec will more than outweigh the 3% perf/power difference. I don’t think QAT’s as important on super computer chips, but in this market, it’s aimed directly in target workloads.
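The back-of-the-envelope comparison in this comment can be laid out step by step. Every input below is the commenter's own estimate (19% lower performance, a 302W screenshot reading, AMD assumed to run at its full 360W TDP), not a measurement, and the sketch assumes one vCPU per physical core with SMT off.

```python
# Inputs taken from the comment above; all are estimates, not measurements.
perf_ratio = 1 - 0.19                 # 6780E performance relative to the 9754S
intel_cores, amd_cores = 144, 128     # physical cores, one vCPU per core assumed
intel_watts, amd_watts = 302, 360     # screenshot reading vs. assumed full TDP

core_gain = intel_cores / amd_cores - 1
power_saving = 1 - intel_watts / amd_watts
perf_per_watt_ratio = perf_ratio / (intel_watts / amd_watts)
vcpus_per_watt_ratio = (intel_cores / intel_watts) / (amd_cores / amd_watts)

print(f"Core (vCPU) gain:          {core_gain:.1%}")
print(f"Power saving:              {power_saving:.1%}")
print(f"Perf per watt vs. 9754S:   {perf_per_watt_ratio:.3f}x")
print(f"vCPUs per watt vs. 9754S:  {vcpus_per_watt_ratio:.3f}x")
```

Under these assumptions perf per watt lands a few percent below the 9754S while vCPUs per watt comes out well ahead, which is the trade-off the comment describes.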
If you’re just going vCPU / power and don’t care about 20% performance, then the 6766E is the clear winner in all of this. We’ve got over 70,000 VM’s where I work and I can tell you that we are 97% 8 vCPU and fewer. Less than 15% of those VMs have hit 100% CPU in the last quarter.
What this article fails to cover is the 1 vCPU VM. If you’re a poor cloud provider like a tier 5 one maybe you’re putting two different VMs on a SMT core’s 2 threads. For any serious tier 1 or tier 2 cloud provider, and any respectable enterprise cloud, they aren’t putting 2 different VMs on the same physical core’s 2 threads.
I’d say this is a great article and very fair. I don’t think SF is beating AMD in perf. It’s targeting what we’ve all been seeing in the industry where there’s so many small VMs that aren’t using entire vCPU performance. It’s the GPU moment for Linux virtualization.
@ModelAirplanesInSeattleGuy
“more vCPUs”: Don’t know where you’re working but no company I’ve been at cares about just more VMs. It’s about cost(including power) and performance. We never consolidate to a server where the VMs don’t offer significant performance upgrades. It’s about future proofing.
“since AMD typically uses its entire TDP” : Like all CPUs it depends on the workload. Your calculation is worthless
Regarding QAT: What is the performance of these 2 QAT (at least Xeon 4th/5th gen platinum has 4 units) units when you use 144 VMs(like your example, or just 32) accessing QAT through SR-IOV? The fact that it’s hard to find any information on it shows that very few are using QAT despite all this talk. Anyone looking for such an extreme use case would use DPUs.
Is Xeon 6 a new socket, or can it be used in Sapphire Rapids motherboards?
Xeon 6 processor is based on a new socket and is not compatible with the Sapphire Rapids motherboard/socket.