In 2017, we covered the AVR54 bug in Intel Atom C2000 series chips. Intel quietly worked with partners before news of this bug broke. See our piece: The Intel Atom C2000 Series Bug: Why it is so quiet for more information. Eventually, that was fixed with a new C0 stepping. See Intel Atom C2000 C0 Stepping Fixing the AVR54 Bug. One of our eagle-eyed STH forum members, Petr, noticed a pattern with the Intel Atom E3800 series that looks eerily similar. As it did with the related Atom C2000 series, Intel published a PCN for the Intel Atom E3800 series announcing a new stepping to fix LPC issues that can cause those systems to not boot. There seems to be a pattern developing.
As a note to our readers, we asked Intel for clarification on this 48 hours before publishing as a courtesy, and sent another reminder before going live. We will update this page with the response if we get one. Sometimes spring break delays responses this time of year, but Intel has already published formal documentation on the bug, so we are going to press based on what Intel has publicly shared.
Update 2018-04-03 3:20 PM Pacific – We received a statement from an Intel spokesperson: “We have not received customer reports of this issue on any of our Bay Trail products. Analysis of some use conditions – as might be applicable to our Internet of Things (IOT) customers, led us to make changes in hardware for the Intel Atom Processor E3800 Series and the Intel Celeron Processors N2930, N2807, J1900. In these cases, we have previously communicated those plans to customers.”
Intel Atom E3800 Killer VLI89 Bug
We are going to call this one the Intel Atom E3800 VLI89 bug. Here is an excerpt from the current Intel Atom E3800 series spec sheet.
VLI89 System May Experience Inability to Boot or May Cease Operation
Problem: Under certain conditions where activity is high for several years the LPC, USB (low speed and full speed) and SD Card circuitry may stop functioning in the outer years of use.
Implication: LPC circuitry that stops functioning may cause operation to cease or inability to boot.
Here is a screenshot of the current version of the Errata:
Every CPU has errata and bugs. It is a fact of silicon that we have seen for decades. While the Intel Atom C2000 series bug was a major issue as it powered a large number of 24×7 embedded systems such as network appliances and storage, we did not hear much about related Atom families.
VLI89 is a big enough deal for the Atom E3800 that Intel has planned a fix for it. See the current errata summary table:
Coming full circle, the VLI89 LPC bus issues are forcing the D-1 stepping (per Intel PCN P116196-00). Here is an excerpt:
Summary of changes of the D-1 stepping:
1. Intel identified possible circuit design issues in the LPC bus, USB2.0 LS/FS and SD Card logic which may result in degradation of the LPC bus, USB2.0 LS/FS and SD Card signals over time at a rate higher than Intel’s quality goals. The D-1 stepping dispositions these possible issues.
It makes sense that this is the VLI89 fix that Intel has planned for the Atom E3800 series.
Intel Atom E3800 Series Impact
The Intel Atom E3800 series needs to function. The design is focused on powering 24×7 solutions such as those for kiosks and digital signage, but also for some handheld medical equipment. The E3800 series was intended for important enough applications that it could be configured to utilize ECC memory.
Here is the Intel ARK page for current SKUs. As it was explained to us, the Intel Atom E3800 series was the Atom you used if you had a mission-critical low power application that needed the video/audio I/O that you could not get from the Intel Atom C2000 series. These A/V features are one reason why we did not see the Intel Atom E3800 series in switches and other infrastructure like we did the C2000 series.
Still, these are 24×7 embedded devices, and VLI89 is big enough to force a silicon stepping change. The erratum carries the same title as AVR54, the bug we saw with the Intel Atom C2000 series.
Other Bay Trail: Intel Celeron and Pentium N- and J- Series
So what about the other Bay Trail designs, specifically the J and N series? When we went to look up the datasheet, here is what we saw.
VLP74 looks familiar. We saw it in VLI90 above. It was also present on the Intel Atom C2000 series documented in AVR50:
In both the E3800 series and the C2000 series, VLI90 and AVR50 were not fixed in subsequent steppings. Perhaps that is not really a big deal. If you wanted a reason to be slightly nervous, this is what a snapshot of the Bay Trail Datasheets and Specification Updates page looks like.
The Intel Atom C2000 series, as we mentioned at the start of this article, had the major flaw that was being worked on for months under NDA. STH does not have access to this document, but we have put in a request to understand how much of Bay Trail is impacted by the “System May Experience Inability to Boot or May Cease Operation” bugs, since both the Intel Atom C2000 series and the Atom E3800 series had issues with that title in their original steppings. All we know is that the other architectures seem to have the issue and that there is an NDA spec update for the third piece of the puzzle.
Update 2018-04-03 3:20 PM Pacific – An Intel spokesperson commented that these are impacted and are receiving a hardware change (see above).
Final Words
One way to view this is that it is another tough case for the industry. If you are an OEM or SMB service provider selling these products to small customers in the field (think of a system powering a coffee shop’s digital signage menu), then increased failure rates are bad. That means more service calls to endpoints. If you are an STH reader and have these systems deployed in the field, you should be aware of the issue.
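For readers who want a quick first pass over deployed Linux systems, here is a minimal Python sketch that flags CPUs named in the Intel statement quoted earlier (the Atom E3800 series plus the Celeron N2930, N2807, and J1900). The function names, the regular expression, and the sample model-name strings are our assumptions for illustration. A match only means the part number appears in that statement. It says nothing about which stepping is installed, so the PCN and specification update remain the authority.

```python
import re
from typing import List

# Part numbers taken from the Intel statement quoted in this article.
# Assumption: they appear as standalone tokens in the CPU model-name string.
NAMED_PARTS = re.compile(r"\b(E38\d{2}|N2930|N2807|J1900)\b")


def named_in_intel_statement(model_name: str) -> bool:
    """Return True if the model name contains one of the listed part numbers."""
    return NAMED_PARTS.search(model_name) is not None


def cpuinfo_model_names(cpuinfo_text: str) -> List[str]:
    """Extract the distinct 'model name' values from /proc/cpuinfo-style text."""
    names = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            names.add(value.strip())
    return sorted(names)


if __name__ == "__main__":
    # Illustrative sample text, not captured from a real system.
    sample = (
        "processor\t: 0\n"
        "model name\t: Intel(R) Atom(TM) CPU  E3845  @ 1.91GHz\n"
        "processor\t: 1\n"
        "model name\t: Intel(R) Atom(TM) CPU  E3845  @ 1.91GHz\n"
    )
    for name in cpuinfo_model_names(sample):
        status = "listed" if named_in_intel_statement(name) else "not listed"
        print(f"{name}: {status}")
```

On a real Linux host, the text passed to `cpuinfo_model_names` would be the contents of `/proc/cpuinfo`.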
At the same time, bugs are a fact of life. Boxes fail. Although we would prefer to not see this happen at all, this is the way we would like to see it handled: notify customers of the problem, then release a silicon stepping fix.
The big question mark now is the Intel Celeron and Pentium N- and J- series Bay Trail SoCs as they are contemporaneous Silvermont architectures that also have some of the same LPC clock errata.
Please note: This article links to many external sources. We double-checked the accuracy of these public sources during editing and they were still accurate as of 8:45 AM Pacific on April 3, 2018.
I’ll give you the reason it hasn’t hit those Pentiums and Celerons. They’re older platforms and Intel doesn’t support them as long as their embedded counterparts. Why start a ruckus in the consumer platforms with angry people if you can replace a few that fail? If they’re not on 24×7 then they’re not gonna see the failure in this short of a timeframe.
STH does so many Intel articles that I’d call them a shill but then you’re doing investigative journalism on Intel bugs. This is the only tech site that’s talked about this one.
If OS kernel relocation (Intel’s “Meltdown inside”) has been acceptable, what else couldn’t be accepted?
“you’re doing investigative journalism”
STH is the site I use as my reference when researching various issues, while many other sites, e.g. ExtremeIntel, HotIntel, …, have been deleting comments and questions about the security issues related to Intel chips.
As an electrical engineer I can provide some clarification on VLI90 and AVR50. These do not sound like bugs where there is a risk of a fault happening over time.
An open-drain output pin is one which can assume two states, open-circuit (doesn’t try to drive any value on the line) and ground (sinks current, tries to take the line to 0 volts). Typically, an open-drain output will have a weak pullup resistor present on the board so that while the chip is “driving” the open state, the line will be pulled high by the resistor, but when the chip drives ground, the line goes to 0 volts.
Open-drain is frequently used for pins that are bidirectional and may have multiple drivers. While the output part of the pad is set to open-circuit, the chip can be watching the input path to see whether anybody else is pulling the line down to ground, and when the chip wants to drive ground level for others to see, it can do so.
In this case, it sounds as though this erratum makes it possible to misconfigure the LPC_CLKRUN# signal such that it actively drives a high level rather than going open circuit, in which case an external device trying to drive ground level on LPC_CLKRUN# will probably not be able to pull the signal all the way to ground, and things won’t work right.
This isn’t a degradation, it’s a design error. If they’re not interested in fixing it, it’s probably because, in practice, it doesn’t matter much — for example, it might be a very optional LPC bus feature which isn’t really needed in 99.99% of modern systems. (Without bothering to go out and read the LPC spec, I can’t say whether that’s actually true, I’m just saying that if they haven’t bothered to fix it there’s probably some reason along those lines.)
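To illustrate the open-drain behavior described above, here is a minimal Python sketch that computes the voltage on a shared line with a pull-up resistor and any number of drivers. It is a simplified resistor model, not taken from Intel documentation: the 3.3 V rail, the 10 kΩ pull-up, and the 25 Ω driver output resistance are assumed values chosen only to show the effect, and the `Driver` and `line_voltage` names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

VCC = 3.3  # volts; assumed supply rail for this illustration


@dataclass
class Driver:
    """A pin on the shared line.

    level is the voltage the pin drives, or None when the pin is
    open-circuit (high impedance), as an open-drain output is when idle.
    """
    level: Optional[float]
    resistance: float = 25.0  # ohms; assumed output resistance when driving


def line_voltage(drivers: List[Driver],
                 pullup_ohms: float = 10_000.0,
                 vcc: float = VCC) -> float:
    """Node voltage of the line, treating every source as a voltage behind
    a resistor and taking the conductance-weighted average (Millman's theorem)."""
    sources = [(vcc, pullup_ohms)]
    sources += [(d.level, d.resistance) for d in drivers if d.level is not None]
    numerator = sum(v / r for v, r in sources)
    denominator = sum(1.0 / r for _, r in sources)
    return numerator / denominator


if __name__ == "__main__":
    idle = [Driver(None), Driver(None)]
    # Correct open-drain use: one device pulls low, the other stays open.
    proper_low = [Driver(None), Driver(0.0)]
    # Misconfigured case: one pin actively drives high while another pulls low.
    contention = [Driver(VCC), Driver(0.0)]

    print(f"idle:       {line_voltage(idle):.3f} V")        # pulled up to the rail
    print(f"proper low: {line_voltage(proper_low):.3f} V")  # close to 0 V
    print(f"contention: {line_voltage(contention):.3f} V")  # stuck near mid-rail
```

With these assumed values, the contended line settles around half the rail instead of near ground, which is the "will probably not be able to pull the signal all the way to ground" situation the comment describes.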
I had a QNAP NAS with an Intel Celeron J1800 that bricked out of nowhere a few weeks ago. This article is giving me some closure since I was afraid I had mismanaged it in some way, so thanks.
My case is exactly as cuibonobo mentions. My QNAP NAS bricked last Friday, out of nowhere. It had a Celeron J3355 proc. So this confirms the Celeron J series was affected as well, not just Atom.