Intel confirms no recall for Raptor Lake CPUs, microcode won’t fix affected units

Karna@lemmy.ml to Technology@lemmy.ml – 197 points –
videocardz.com

Because hey we already got your money and doing the right thing doesn't line my already full pockets.

Yes, but then who will dare to buy from them in future?

Frankly? Most people.

Most people are really dumb about this type of thing.

I agree that most people won't care but take issue with calling them "dumb". Everyone has a limited amount of time on this planet to build skills and chase hobbies. A lot of people on this site have tech-related jobs and hobbies, so of course this matters to us. I might expect someone who buys pre-built gaming PCs to keep this on their radar, but the vast majority of folks who use computers as email and social media machines, including those who only use it for data entry type jobs, have little reason to care about the specifics of their CPU or any other single component of their computer. If their computer breaks, that's annoying, but that's life. They'll spend the same amount on a new laptop as we might spend on a new CPU and get on with their day.

I don't know what brand of spark plugs are in my car, and maybe a mechanic or car enthusiast would find that dumb. But hey, I'm too busy caring about my CPU to spend time worrying about my car unless it breaks.

Anyone who ritualistically buys Dell. I believe Intel is on the record as having called Dell "the best friend money can buy."

So that's the thing, realistically. If this affected just consumer chips that you can buy off the shelf, sure. A bit of market lost, but the OEMs and the data center clients are still there.

But this time it's all different. Intel fucked everyone's stuff up, and said they're not doing anything about it.

That will certainly not go unnoticed.

They're not doing a recall, but that doesn't mean they won't somehow compensate big OEMs for their warranty issues.

You don't really have much of a choice in the high end laptop world. Maybe this will be enough to push manufacturers to put AMD CPUs into high end workstations. I'd kill for a Thinkpad P1 with AMD.

I'd love an AMD P1 as well. I just bought a P1 after looking at their offerings for a while. I did look into their AMD options, but Lenovo just seems to have intentionally neutered them.

They absolutely do. I know the X1 Carbon was co-designed with Intel, so you can't get an AMD X1, but I don't think the P1 was.

Also which P1 did you get? I'm contemplating trying to find a fully loaded gen 6, or maybe going with the gen 7/8 (whenever that comes out).

I got a Gen 5 - 21DDS70700 from Newegg shortly before the Gen 7 was released. I originally bought a refurbished higher end model from their outlet store but they sent the wrong model so I returned it. It is built and performs well and has a great screen and keyboard.

Have you done any stress tests on the system? And if you did, did you monitor CPU power?

I have a gen 4 and at least whenever the CPU and GPU are active the CPU runs absurdly slow. I haven't tested just CPU loads, but rendering a video where the CPU was 100% loaded and the GPU was about 5% loaded, my CPU would drop to only 25 watts and throttle hard. With ThrottleStop I'm able to get 35 watts at the very most out of my CPU for long loads. I'm curious if they've done anything for the newer models so the CPU can actually stretch its legs. It really sucks buying this i9 but having it perform worse than an i5, even if there's PLENTY of thermal headroom.

I have not. I'm mostly doing webdev on it and it works fine for that. I honestly don't even need the dedicated GPU. If there's something you'd like me to run lmk and I'll see if I can find time for it. I'm running Debian stable on it and it's the i7-12700H with an RTX A1000 and 32GB 4800MHz RAM version.

Honestly I think any CPU stress test is enough for it to start to choke sooner or later. It's just faster with the GPU active. If you have video editing software you could render a long video with CPU rendering. If not, I'm sure any FurMark-like stress test plus running yes > /dev/null a number of times is plenty.

The biggest thing is that it takes time. Sometimes it starts after 5 minutes, sometimes it's 20 minutes, sometimes it could be longer depending on the load. Since you're on Linux, IDK if Lenovo's tuning of TDP would apply, so you might get a higher 35 watts, like I can if I manually override their power limits.

I pretty much only use the GPU if I'm traveling and want to play a game. Normally I just want the CPU to do things and it's fine using 55+ watts for short loads, but I tax the shit out of my CPU for long periods of time and only 25 watts sucks on 11th gen Intel.
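For anyone who wants to try the test described above on Linux, here is a rough sketch rather than a definitive tool. It loads every core with a busy loop (standing in for several copies of yes > /dev/null) and prints average package power from the RAPL energy counter. The sysfs path is an assumption: it is where the kernel's powercap driver usually exposes the counter, but it can differ per machine and generally needs root to read. The function names (average_watts, burn, read_uj, run) are made up for this example.

```python
import multiprocessing as mp
import time
from pathlib import Path

# Assumption: the Intel RAPL package counter lives here on this machine.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(start_uj: int, end_uj: int, seconds: float, max_uj: int) -> float:
    """Average power between two energy-counter samples given in microjoules."""
    delta = end_uj - start_uj
    if delta < 0:
        # The counter wrapped around between the samples (approximate correction).
        delta += max_uj
    return delta / 1_000_000 / seconds


def burn(stop_at: float) -> None:
    """Busy-loop on one core until the deadline."""
    while time.time() < stop_at:
        pass


def read_uj(name: str) -> int:
    """Read one integer value from the RAPL sysfs directory."""
    return int((RAPL_DIR / name).read_text())


def run(minutes: float = 30.0, interval: float = 10.0) -> None:
    """Load all cores and print average package power every `interval` seconds."""
    stop_at = time.time() + minutes * 60
    workers = [mp.Process(target=burn, args=(stop_at,)) for _ in range(mp.cpu_count())]
    for worker in workers:
        worker.start()
    max_uj = read_uj("max_energy_range_uj")
    last_e, last_t = read_uj("energy_uj"), time.time()
    while time.time() < stop_at:
        time.sleep(interval)
        e, t = read_uj("energy_uj"), time.time()
        print(f"{average_watts(last_e, e, t - last_t, max_uj):.1f} W")
        last_e, last_t = e, t
    for worker in workers:
        worker.join()
```

To use it, call run() from under an if __name__ == "__main__": guard and let it go for at least 20 to 30 minutes, since the throttling can take a while to show up. If the printed wattage settles around 25 W on a long load, that would match what is described above.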

I stopped in 2007 and haven't looked back, and advise friends and family to do the same. This is just more ammo for the "but why" rebuttal speech, and baby, "wanting your cpu to not die" is an awfully juicy bullet.

Watching Intel fuck themselves the last decade has been an absolute delight, but this, I could almost fap to this news.

The worst part is Intel honestly could've spun this into something of a win if they actually handled it properly. They've got over $25B in cash reserves, they could easily afford to do a recall and a big PR campaign about how good they are at accepting responsibility and fixing mistakes.

They've got over $25B in cash reserves

What. I haven't heard about this massive reserve

That's not too insane for a company of their size. It's higher than many, but far from out of the norm.

Can’t wait for Steve’s next video. Oh boy.

Moore's Law is Dead shared an interesting video yesterday about these chips. Supposedly, leaks from his sources at Intel say that high voltages being pushed through the ring bus cause degradation. The leaks claim it shares the same power rail as the P and E cores, meaning it's influenced by the voltage requested by the cores.

For context, the ring bus is responsible for communication between cores, peripherals, and the platform. This includes memory accesses, which means that if the ring bus fails and does something incorrectly, it could appear normal but result in errors far down the line.

Going beyond the video specifically, and considering what others have suggested as workarounds, it seems like ring bus degradation might be a decent candidate for the actual root cause of these issues.

Some observations around chips degrading were:

  • High memory pressure exacerbates the issue.
  • Chips with more cores deteriorate faster.

Some of the suggestions to work around the issue were:

  • Lower the memory speed.
  • Lower the voltage and clock speeds.
  • Disable E cores.

All of those can be related to stress being put on the ring bus:

  • Higher voltage being put through the bus -> higher likelihood of physical damage
  • More memory pressure -> more usage of the bus, more opportunity for damage to accumulate
  • More cores -> more memory pressure
  • Slower memory speeds -> less maximum throughput -> less stress

I'm not claiming anything definitive, but I think my money is on this one.

Thanks for the additional details.

The scariest part of this whole problem is there is no way for the owners of 13th/14th gen CPUs to figure out to what extent the CPU is damaged. It's like holding a ticking bomb without knowing when it will go off!

100%. Whatever Intel does at this point, I don't trust it to be a fix so much as a mitigation or attempt to delay the inevitable until a few years after the warranty period.

If it's possible for people to return their 13th/14th gen processor and trade up for a 12th gen, that would be the safest solution.

I've heard speculation that this is exacerbated by a feature where the CPU increases the voltage to boost clocks when running single core workloads at low temperatures. If that's true, having less load or better cooling may be detrimental to the life of the processor.

If the product has issues it should be legally required to either have a warranty extension, recall, or both. Heck they shouldn't be selling more units until it's figured out and patched.

It's absurd to say: "it might have problems but we'll keep selling it as is".

We have safety recalls. There should be product degradation recalls.

Looks like a class action lawsuit, smells like a class action lawsuit

It is Intel so I'm sure they have already done the cost-benefit analysis of a recall vs losing a class-action lawsuit

Another reason added to the CVS-receipt-long list of reasons to never buy Intel.

Thanks, Steve Intel.

Class action lawsuit, but demand that the entire company be disqualified from operating for some time, instead of just asking for money that will amount to you getting like 10€.

HANG ON BEFORE YOU HIT THE DOWNVOTE BUTTON!

They don't need a recall. If your processor ain't broke yet then the patch will (supposedly) prevent it from breaking and if it's ALREADY broke then Intel will (supposedly) replace it via RMA.

So what's the big fuggin' problem here? That Intel won't use the term "recall"?

The "problem" is that the more you understand the engineering, the less you believe Intel when they say they can fix it in microcode. Without writing an entire essay, the TL;DR is that the instability gets worse over time, and the only way that happens is if applied voltages are breaking down dielectric barriers within the chip. This damage is irreparable; 100% of chips in the wild are irreparably damaging themselves over time.

Even if Intel can slow the bleeding with microcode, they can't repair the damage, and every chip that has ever run under the bad code will have a measurably shorter lifespan. For the average gamer, that lifespan has sometimes been shorter than even the warranty period.

+1. Lots of people are also likely to not have any idea about the situation and just think their PC crashes or acts up more. More of these issues can pop up over time.

A recall forces them to notify customers of the issue so the customer can act on it.

They can most likely prevent further breakdown through software. If the meters and controls are functioning correctly, they can undervolt the CPU. But it's not really a fix if that comes with a performance penalty. If it's a bug where the CPU maxes out the voltage when idle so it can do nothing faster, that could be fixed with no performance penalty, but that seems unlikely.

I'm sorry but this is just a fundamentally incorrect take on the physics at play here.

You unfortunately can't ever prevent further breakdown. Every time you run any voltage through any CPU, you are slowly breaking down the gate oxides. This is a normal, non-thermal failure mode of consumer CPUs. The issue is that this breakdown is non-linear. As the breakdown progresses, it increases resistance inside the die, and as a consequence the chip requires a higher minimum voltage to remain stable. That higher voltage accelerates the rate of idle damage, so the same amount of time does disproportionately more damage the more damaged a chip already is.
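To make that feedback loop concrete, here is a toy model, not a physical simulation. It assumes wear per hour grows exponentially with any extra voltage, and that accumulated damage raises the minimum stable voltage in proportion. Both the wear law and every constant (base_rate, accel_per_volt, volts_per_damage, failure_level) are invented for illustration; none of them is measured or comes from Intel or from the papers cited in this comment.

```python
import math


def hours_to_failure(
    feedback: bool,
    base_rate: float = 1e-4,        # made-up damage per hour at the starting voltage
    accel_per_volt: float = 8.0,    # made-up exponential acceleration per extra volt
    volts_per_damage: float = 0.10, # made-up rise in minimum voltage per unit of damage
    failure_level: float = 1.0,     # arbitrary damage level treated as "failed"
) -> int:
    """Count the hours until accumulated damage reaches failure_level."""
    damage = 0.0
    hours = 0
    while damage < failure_level:
        # With feedback on, existing damage forces a higher minimum stable voltage.
        extra_volts = volts_per_damage * damage if feedback else 0.0
        # Assumed wear law: hourly damage grows exponentially with the extra voltage.
        damage += base_rate * math.exp(accel_per_volt * extra_volts)
        hours += 1
    return hours
```

Comparing hours_to_failure(False) with hours_to_failure(True) shows the point of the argument: with the voltage feedback switched on, the same chip reaches the failure level in noticeably fewer hours, and most of the damage piles up near the end. The absolute numbers mean nothing; only the shape of the curve does.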

If you want to read more on these failure modes, I'd recommend the following papers:

L. Shi et al., "Effects of Oxide Electric Field Stress on the Gate Oxide Reliability of Commercial SiC Power MOSFETs," 2022 IEEE 9th Workshop on Wide Bandgap Power Devices & Applications

Y. Qian et al., "Modeling of Hot Carrier Injection on Gate-Induced Drain Leakage in PDSOI nMOSFET," 2021 IEEE International Conference on Integrated Circuits, Technologies and Applications

I’ve only recently become aware of the issue and that’s the way it feels.

But in the absence of a definitive test I think folks are concerned that they will be stuck with a CPU that continues to degrade prematurely. That seems like a valid concern.

So what's the big fuggin' problem here? That Intel won't use the term "recall"?

Would you say the same thing about a car?

"We know the door might fall off but it has not fallen off yet so we are good."

The chances of that door hurting someone are low and yet we still replace all of them because it's the right thing to do.

These processors might fail any minute and you have no way of knowing. There are people who depend on these for work and systems that are running essential services. Even worse, they might fail silently and corrupt something in the process or cause unnecessary debugging effort.

If I were running those processors in a company I would expect Intel to replace every single one of them at their cost, before they fail or show signs of failing.

Those things are supposed to be reliable, not a liability.