Red Hat has been heavily involved in the Meltdown and Spectre patch efforts and had its initial patches ready well before the originally planned disclosure date of January 9, 2018. Red Hat is also in a unique position: it has the most robust set of enterprise open source OS customers, and those same customers are clamoring for information on the performance impact of the Meltdown and Spectre patch series.
Red Hat Meltdown and Spectre Patch Performance Impacts
To classify the performance impact of its Meltdown and Spectre patches, Red Hat tested on Haswell, Broadwell, and Skylake generation servers and grouped applications into four basic categories. Here are the four buckets:
- Measurable: 8-19% – Highly cached random memory, with buffered I/O, OLTP database workloads, and benchmarks with high kernel-to-user space transitions are impacted by 8-19%. Examples include OLTP workloads (tpc), sysbench, pgbench, netperf (< 256 byte), and fio (random I/O to NVMe).
- Modest: 3-7% – Database analytics, Decision Support System (DSS), and Java VMs are impacted less than the “Measurable” category. These applications may have significant sequential disk or network traffic, but kernel/device drivers are able to aggregate requests to a moderate level of kernel-to-user transitions. Examples include SPECjbb2005, Queries/Hour, and overall analytic timing (sec).
- Small: 2-5% – HPC (High Performance Computing) CPU-intensive workloads see only a 2-5% performance impact because jobs run mostly in user space and are scheduled using cpu-pinning or numa-control. Examples include Linpack NxN on x86 and SPECcpu2006.
- Minimal: <2% – Linux accelerator technologies that generally bypass the kernel in favor of direct user access are the least affected, with less than 2% overhead measured. Examples tested include DPDK (VsPERF at 64 byte) and OpenOnload (STAC-N). Userspace accesses to VDSO like get-time-of-day are not impacted. We expect similar minimal impact for other offloads.
- NOTE: Because microbenchmarks like netperf/uperf, iozone, and fio are designed to stress a specific hardware component or operation, their results are not generally representative of customer workloads. Some microbenchmarks have shown a larger performance impact, related to the specific area they stress.
(Source: Red Hat)
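The common thread across these buckets is the rate of kernel-to-user transitions: the more often a workload crosses into the kernel, the more often it pays the mitigation cost. As a rough, illustrative sketch (this is not Red Hat's test methodology), the short Python script below times a syscall-heavy loop against a pure user-space loop; run it once on an unpatched kernel and once on a patched one, and the syscall-heavy number should regress noticeably more than the user-space one.

```python
#!/usr/bin/env python3
"""Illustrative only: contrast a syscall-heavy loop with a pure user-space
loop. KPTI/IBRS overhead is paid on each kernel entry, so the first number
should move far more than the second when comparing unpatched vs. patched
kernels on the same hardware."""
import os
import tempfile
import time

N = 200000  # iterations; adjust for your machine

# Syscall-heavy: one pread() per iteration forces a kernel/user transition.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * 4096)
    f.flush()
    fd = f.fileno()
    start = time.perf_counter()
    for _ in range(N):
        os.pread(fd, 64, 0)
    syscall_rate = N / (time.perf_counter() - start)

# User-space only: no kernel entries inside the timed loop.
start = time.perf_counter()
total = 0
for i in range(N):
    total += i * i
userspace_rate = N / (time.perf_counter() - start)

print("syscall-heavy loop : {:,.0f} ops/s".format(syscall_rate))
print("user-space loop    : {:,.0f} ops/s".format(userspace_rate))
```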
Overall, Red Hat tells RHEL customers to expect a 1-20% performance impact. The testing was done on RHEL 7, but RHEL 6 and RHEL 5 should behave similarly.
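Before attributing a slowdown to these patches, it is also worth confirming which mitigations are actually active on a given host. Here is a minimal sketch assuming the debugfs tunables Red Hat describes for its patched RHEL 7 kernels (reading debugfs requires root); the exact interface can differ on other distributions, and newer mainline kernels report the status under /sys/devices/system/cpu/vulnerabilities/ instead.

```python
#!/usr/bin/env python3
"""Quick check of which Meltdown/Spectre mitigations are active.
The debugfs paths are the tunables Red Hat describes for its patched
RHEL 7 kernels (root required for debugfs); other kernels may expose
the state under /sys/devices/system/cpu/vulnerabilities/ instead."""
import glob
import os

RHEL_TUNABLES = [
    "/sys/kernel/debug/x86/pti_enabled",   # Meltdown / KPTI
    "/sys/kernel/debug/x86/ibrs_enabled",  # Spectre v2 (microcode-assisted)
    "/sys/kernel/debug/x86/ibpb_enabled",  # Spectre v2 (branch predictor barrier)
]

for path in RHEL_TUNABLES:
    if os.path.exists(path):
        with open(path) as f:
            print("{}: {}".format(path, f.read().strip()))
    else:
        print("{}: not present on this kernel".format(path))

# Newer mainline kernels report mitigation status here instead:
for path in sorted(glob.glob("/sys/devices/system/cpu/vulnerabilities/*")):
    with open(path) as f:
        print("{}: {}".format(path, f.read().strip()))
```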
Final Words
The OLTP benchmarks are certainly an area where we have seen significant performance impacts. If you have a database server that spikes above 65% CPU utilization, it is worth checking whether that server will still be sufficient post-patches: a 20% throughput hit means the same work takes roughly 1/(1 - 0.2) = 1.25x the CPU time, which pushes a 65% peak past 80%. As previously stated, STH will be providing updated benchmarks as the patches for these vulnerabilities mature. These are not two simple patches but rather a significant, ongoing series of patches.
Finally, if you are running databases on AWS, GCP, or Azure instances, or have databases in VMs, VPSs, or shared hosting, we suggest looking at your CPU monitoring data. Likewise, these vulnerabilities impact containers (e.g., Dockerized MySQL) as well. For simple web apps, this is not going to be a major impact. On the other hand, if you are running with a smaller resource pool, such as an NVMe-backed VPS or VM, you will want to monitor these patches closely.
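To put a rough number on that advice, the sketch below samples overall CPU utilization from /proc/stat and projects it under an assumed post-patch overhead. The 19% default is simply the top of Red Hat's "Measurable" bucket, not a prediction for any particular workload, and the projection is only a back-of-the-envelope illustration.

```python
#!/usr/bin/env python3
"""Rough headroom check (illustrative only): sample CPU utilization from
/proc/stat and project it under an assumed post-patch overhead. The 19%
default is the top of Red Hat's "Measurable" bucket, not a prediction."""
import time

ASSUMED_OVERHEAD = 0.19  # hypothetical worst case for OLTP-style workloads

def cpu_times():
    """Return (idle, total) jiffies from the aggregate 'cpu' line."""
    with open("/proc/stat") as f:
        fields = [float(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait
    return idle, sum(fields)

idle_a, total_a = cpu_times()
time.sleep(5)
idle_b, total_b = cpu_times()

busy = 1.0 - (idle_b - idle_a) / (total_b - total_a)
projected = busy / (1.0 - ASSUMED_OVERHEAD)

print("current CPU utilization       : {:6.1%}".format(busy))
print("projected with +{:.0%} overhead : {:6.1%}".format(ASSUMED_OVERHEAD, min(projected, 1.0)))
```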
It’s probably also worthwhile to compare a few kernels. RHEL is known for using ancient, heavily patched kernel versions, and many other distributions also seem to find it necessary to ship an outdated kernel (sometimes non-LTS) with a lot of patches heaped on top. PCID, which should help reduce the performance impact of the TLB flushes, is first available in 4.14.
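For what it’s worth, here is a quick illustrative check of whether a box even advertises the PCID/INVPCID CPU features that a PCID-aware kernel can use to soften KPTI’s TLB flushes. It only shows hardware capability, not whether the running kernel actually uses it.

```python
#!/usr/bin/env python3
"""Check whether this CPU advertises PCID/INVPCID. This only shows
hardware capability; it says nothing about whether the running kernel
actually makes use of these features."""
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for feature in ("pcid", "invpcid"):
    print("{:8s}: {}".format(feature, "yes" if feature in flags else "no"))
```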
@Nils,
The reason RHEL uses ancient kernels is that in the enterprise arena you prioritize stability, not performance. That is the reason enterprise companies still use decades-old mainframes. If you have to choose a server that powers large parts of the financial system, do you use the latest bleeding-edge Linux version, or an old, mature system that has been battle tested? Some people don’t use a new Windows version until at least Service Pack 1 has arrived – because by then the bugs have been ironed out. New code is immature and full of bugs. Old code is stable and mature. The same goes for space rockets: they use ancient, tested CPUs instead of the latest ARM CPU, because the old CPU is battle tested. If it is not broken, don’t touch it. Enterprises have very long support cycles, and they don’t want to change anything or upgrade.
So I don’t see the point of using the latest Linux kernel. RHEL targets servers, not desktops. RHEL will never use the latest bleeding-edge Linux kernel. Better to test what is actually used in the enterprise: old kernels.
@kebabbert
I think slapping the label of “enterprise” on it and using it as the gold standard for how to run servers is misguided, and just because many companies act that way doesn’t mean that it’s a good idea. It’s very often to the detriment of innovation.
I’ve worked in a few large companies (mostly retail and e-commerce) and the differences can be staggering. One place runs what you would call “battle tested” systems: very large, expensive servers running stable but stale Linux distributions, boasting uptimes of sometimes over a year. The software is very often “enterprise” (read: expensive) as well – proprietary databases, large monolithic software. The other place runs microservices, uses container technology, and uses DevOps and other agile methodologies. Guess which one is growing and which one is stagnating. Guess who is attracting the better talent. Guess who is faster to fix things when they break.
Also, the LTS series of the Linux kernel is hardly “bleeding edge”, and I wouldn’t call the stable series “bleeding edge” either. The fact that backports happen frequently also tells us that the old software very often has the same bugs. Running an outdated kernel didn’t protect you from Dirty CoW or Meltdown, for example. I believe we’re going to see a lot more of these types of security bugs in the time to come. I’d rather kernel developers focus their effort on continuing to turn out a great product instead of backporting something because some suit in a company is too afraid to do an upgrade – but after all, that’s Red Hat’s business.
Something I’ve also seen very often with those companies focused on stability and uptime is that security updates don’t happen at all, or at a snail’s pace. Simply rebooting a system requires weeks of planning – hence live patching, which to me seems like far more black magic, and more fraught with risk, than simply booting a new kernel.
Your experience may differ, but in mine, a corporate culture that embraces change, uses proven methodologies instead of dogma (“never change a running system”), and focuses on fixing problems when they arise instead of preventing them through stagnation is the superior one. There are no mainframes at Google, Facebook, Amazon, or Netflix.
Just my 10c: the problem with I/O in an enterprise environment is that you never have enough of it; there is always a bottleneck. And as most if not all servers are virtualised in one way or another, companies can be forced to rethink their strategy (overcommit, how many VMs per host, …), which will make it more expensive to run their datacenters – and VMware is not exactly giving its licenses away…
The performance impact from Meltdown and Spectre might not look high when we go by the numbers, but the financial impact for the enterprise might be staggering.
We will have to wait and see.
@Nils
You’re funny :) Your own dogma is pure FUD…
Hope you enjoy managing a million/billion microservices…
:)