AMD EPYC 7002 Rome CPUs Hang After Less Than 3 Years of Uptime

Thanks to a reader who sent in a Reddit post, we were alerted that cores in AMD EPYC 7002 “Rome” series processors can hang after just under 3 years of uptime, or around 1044 days. Processors are complex and ship with many bugs, but this one is particularly interesting.

This is not just speculation. It is official: AMD lists it as Erratum 1474 in revision guide 56323-PUB_1.01.

Description: A core will fail to exit CC6 after about 1044 days after the last system reset. The time of failure may vary depending on the spread spectrum and REFCLK frequency.
Potential Effect on System: A core will hang.
Suggested Workaround: Either disable CC6 or reboot system before the projected time of failure.
Fix Planned: No fix planned
(Source: AMD Revision Guide for AMD Family 17h Models 30h-3Fh Processors)
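
For those looking at the "disable CC6" side of the workaround on Linux, a reasonable first step is simply seeing which idle states the kernel exposes. The sketch below reads the standard cpuidle sysfs files for one CPU. The function name `list_idle_states` is our own, and which reported state (if any) corresponds to AMD's CC6 is platform-specific and not something the erratum text spells out, so treat this as an inspection aid rather than a fix.

```python
from pathlib import Path


def list_idle_states(cpu: int = 0, sysfs_root: str = "/sys/devices/system/cpu"):
    """Return (state_dir, name, disabled) for each cpuidle state of one CPU.

    Reads the Linux cpuidle sysfs layout:
        <sysfs_root>/cpu<N>/cpuidle/state<M>/name
        <sysfs_root>/cpu<N>/cpuidle/state<M>/disable
    Returns an empty list if the directory does not exist.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    states = []
    # Sort numerically so state10 comes after state9.
    for state_dir in sorted(base.glob("state[0-9]*"),
                            key=lambda p: int(p.name[len("state"):])):
        name = (state_dir / "name").read_text().strip()
        # The kernel reports "0" when the state is enabled.
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    for state_dir, name, disabled in list_idle_states(cpu=0):
        print(f"{state_dir}: {name} ({'disabled' if disabled else 'enabled'})")
```

On Linux, writing 1 to a state's `disable` file as root turns that state off at runtime for that CPU. The setting does not persist across reboots, and it has to be applied per CPU.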

Most of our readers reboot machines every so often for security patches or during other maintenance windows. Still, this is a fairly big deal since the remedy, short of disabling CC6, is effectively rebooting the system.

We checked the STH lab, and it turns out we had an HPE AMD EPYC 7002 “Rome” system running Proxmox VE that reached 2 years and 261 days, or 991 days, of uptime before it was decommissioned. It stayed up that long because it was part of a lab project outside our normal management tools, and we apparently forgot it was there.

Final Words

If a typical server lifecycle is 5 years these days, then one might need a minimum of a single reboot over the server's lifetime to avoid this bug, so long as that reboot happens between days 913 and 1044. Then again, a number of our readers are going to think this is silly given regular security patches. Others are going to think this bug is a major pain to track and deal with. If you are the type of admin who keeps a server up for around three years, then this might impact you.
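
To make the single-reboot arithmetic concrete, here is a minimal sketch. It assumes a 5-year lifecycle of 5 × 365 = 1825 days and takes the roughly 1044-day figure at face value, even though the erratum says the actual time can vary. The function name `single_reboot_window` is ours. Under those assumptions, one reboot is enough if neither the stretch before it nor the stretch after it reaches 1044 days, and the 913 to 1044 range mentioned above sits inside the window this computes.

```python
def single_reboot_window(lifetime_days: int, limit_days: int = 1044):
    """Return (earliest, latest) as an open interval of days in which one
    reboot keeps every continuous run shorter than limit_days, or None if
    a single reboot cannot cover the whole lifetime.

    A reboot on day d works when:
        d < limit_days                   (run before the reboot)
        lifetime_days - d < limit_days   (run after the reboot)
    """
    earliest = lifetime_days - limit_days
    latest = min(limit_days, lifetime_days)
    if earliest >= latest:
        return None  # lifetime too long for a single reboot
    return (max(earliest, 0), latest)


if __name__ == "__main__":
    # 5-year lifecycle, assuming 365-day years
    print(single_reboot_window(5 * 365))
```

Since the limit is approximate, anyone planning around this would want to leave a healthy margin before the upper bound rather than aim for day 1043.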

Perhaps the next week should be dedicated to looking for older AMD EPYC “Rome” systems and seeing if any have had >900 day uptime.
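
For anyone doing that hunt on Linux hosts, here is a minimal sketch of the check. It parses the first field of `/proc/uptime` (seconds since boot) and compares it against the approximate 1044-day figure. The function names and the 900-day warning threshold are our own choices for illustration. Note that OS uptime is only a proxy: the erratum counts from the last system reset, and the real time of failure may vary.

```python
ERRATUM_DAYS = 1044      # approximate figure from AMD Erratum 1474
SECONDS_PER_DAY = 86_400


def uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime into days of uptime.

    The first whitespace-separated field is seconds since boot.
    """
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def days_until_projected_hang(current_days: float,
                              limit_days: float = ERRATUM_DAYS) -> float:
    """Days left before the projected failure time (negative if past it)."""
    return limit_days - current_days


def needs_attention(current_days: float, warn_after_days: float = 900) -> bool:
    """True once uptime reaches the (hypothetical) warning threshold."""
    return current_days >= warn_after_days
```

On a host, this could be used as `uptime_days(open("/proc/uptime").read())`, with the result fed into the other two functions. Our forgotten lab system at 991 days would have been flagged with 53 days to spare.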

9 COMMENTS

  1. re: No fix planned
    Although that might seem harsh, it would probably be at least a firmware revision, and the update process would likely require a reset which itself provides another 2.85yr.

  2. Does the singular in “a core will fail to exit CC6” imply that the behavior will only occur on a single core in a given CPU; or will all cores be subject to the same risk?

  3. I understand that each core has its own timer, so the whole CPU will eventually go Sleeping Beauty, one core after the other. Cue in Prince Charming Reboot…

  4. Early in my career system stability/uptime numbers were bragging rights.

    These days it makes no sense. Consistently patching/updating/securing your systems is part of the responsibility of running them.

    Now I see uptimes of 100+ days as a badge of an inattentive admin.

  5. I once encountered a VMS x86 system running a company’s payroll. It had been running uninterrupted for about 7.5 years.

    During that time the company had gone thru several comptrollers. When they asked me to look into replacing it with a Windows system I asked where the machine was. Nobody knew. We had to trace the Ethernet cable to find it.

  6. @James on the other hand it was also a badge of honor for software vendors to provide updates that did not require a reboot. In core patching and such. Linux can patch a running kernel afaik, other OS’s can too.

    It’s really just MS that patches by replacing the entire bloody OS installation.

  7. I read the “fix” differently. We patch monthly in most cases, we never restart (reset) the system, only the OS that is on the system. It is 100% possible we will not restart a system in 5-7 years of use.

  8. Or just disable C6 and the bug never happens? This can be done at runtime, so there’s no reason to reboot. There’s even a systems service available that automatically disables it.
