We have been taking a more conservative approach to the Spectre and Meltdown design issues at STH. Many of the systems folks have told us we are still (at minimum) a few weeks away from end-state OS patches and microcode updates. One of the reasons is that we expect a few iterations before final performance and stability shake out. Recently, we became aware of an issue, confirmed by Intel, on Haswell-EP and Broadwell-EP (Xeon E5 V3 and V4 series) CPUs: new Intel microcode released as part of the patching process for Meltdown and Spectre is causing instability. We are seeing evidence that the recently released fixes are causing more crashes and reboots on impacted systems.
Evidence That The Initial Haswell-EP and Broadwell-EP Intel Microcode Fixes Have Issues
We recently noticed that some vendors are holding back on Haswell and Broadwell generation BIOS and microcode patches. One example is Supermicro where, as of 12 January 2018, the vulnerability tracking page shows there is an issue with Intel’s fixes for LGA2011-3 CPUs. Most of the “X11” generation (Skylake-SP) systems have updates. The “X10” generation Haswell-EP and Broadwell-EP systems are still pending. Supermicro “X9” generation (Sandy Bridge-EP and Ivy Bridge-EP) systems are awaiting a fix from the vendor (Intel). For its part, Intel committed to getting Skylake-SP, Broadwell-EP, and Haswell-EP systems patched first.
Our takeaway is that there must be a reason vendors have still not updated the Haswell/Broadwell generation microcode. Then, Intel released a statement on the matter:
We have received reports from a few customers of higher system reboots after applying firmware updates. Specifically, these systems are running Intel Broadwell and Haswell CPUs for both client and data center. We are working quickly with these customers to understand, diagnose and address this reboot issue. If this requires a revised firmware update from Intel, we will distribute that update through the normal channels. We are also working directly with data center customers to directly discuss the issue.
(Source: Intel)
From what we are hearing from vendors and data center operators, “higher system reboots” translates to crashes and random restarts.
Final Words
Intel is still working with the industry to rectify the design flaw(s) that led to Meltdown and Spectre. We know there will be some performance impact that is workload dependent. In the meantime, if you have mainstream Intel Xeon E5-2600 V3 and V4 systems, which now make up the bulk of servers deployed, it may pay to wait for the microcode fixes to become more stable before applying them to your server farm. For server uptime, this is not a great solution, as users will reboot once to apply the already available OS kernel patches and a second time for microcode updates. On systems with long boot times, two reboots in a quarter can mean significant downtime.
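For those deciding which machines to hold back, a first step is knowing which hosts have the impacted CPUs and which microcode revision they are currently running. The following is a minimal sketch, not a vetted tool. It assumes a Linux x86 host where /proc/cpuinfo exposes the "cpu family", "model", and "microcode" fields, and it assumes that family 6 models 63 and 79 identify Haswell-EP and Broadwell-EP (those model numbers also cover the related high-end desktop parts). The function names are our own for illustration. It does not try to judge whether a given revision is good or bad, since that depends on Intel's and the system vendor's guidance.

```python
from pathlib import Path

# Assumption: Intel family 6, model 63 = Haswell-EP, model 79 = Broadwell-EP.
AFFECTED_MODELS = {
    63: "Haswell-EP (Xeon E5 V3)",
    79: "Broadwell-EP (Xeon E5 V4)",
}


def parse_cpuinfo(text):
    """Return one dict of fields per logical CPU from /proc/cpuinfo-style text."""
    cpus = []
    # Each logical CPU is a block of "key : value" lines separated by a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus


def summarize(text):
    """Return the set of (generation, microcode revision) pairs for impacted CPUs."""
    found = set()
    for cpu in parse_cpuinfo(text):
        if cpu.get("vendor_id") != "GenuineIntel" or cpu.get("cpu family") != "6":
            continue
        try:
            model = int(cpu.get("model", ""))
        except ValueError:
            continue
        if model in AFFECTED_MODELS:
            found.add((AFFECTED_MODELS[model], cpu.get("microcode", "unknown")))
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for generation, microcode in sorted(summarize(cpuinfo.read_text())):
            print(f"{generation}: running microcode {microcode}")
```

Running this before and after a patch window shows whether the microcode revision changed, including when the change arrived through an OS package rather than a BIOS update.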
Who wrote that gem of a press release? “Higher system reboots”? “Higher” seems like an odd choice of wording, and the whole phrasing leaves out the fact that a crash would normally precede such a reboot, which should probably be classified as a reset.
Are microcode updates only going to come from Intel and/or the system/motherboard manufacturers? Or is it possible that they will get included in some OS-delivered updates?
I’m concerned that those who wish to wait on applying the microcode updates may not be aware of them and/or able to defer them if they come through OS-delivered updates.
We had OS patches make RHEL systems unavailable until the system BIOS microcode firmware was updated to resolve the issue. We are holding back on the patch blast for a bit yet (a month or so, until things settle down).