Rivian blames “fat finger” for infotainment-bricking software update

AnActOfCreation@programming.dev to Technology@lemmy.world – 129 points –
arstechnica.com

Says a lot about their internal organisational structure for something like this to happen. An intern is the only tolerable excuse here, but even then, why would you put a newbie in a position where they could brick thousands of vehicles with a slip of the finger?

I'd expect a tech company like Rivian that happens to sell vehicles to know better than this 🤦‍♂️

wrong build with the wrong security certificates was sent out

Isn't it standard practice to validate signed code before installing it? Hope the next update allows the car's computer to check the firmware signature before doing what I assume is an automatic installation...
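For what it's worth, the car-side check doesn't have to be complicated. Here's a rough sketch of what I mean, in Python for readability; the file names and key handling are all made up, and I'm assuming an Ed25519 vendor key baked into the device:

```python
# Hypothetical firmware gate: refuse to install any image whose
# signature doesn't verify against the vendor's public key.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.serialization import load_pem_public_key

def verify_firmware(image_path: str, sig_path: str, pubkey_path: str) -> bool:
    with open(pubkey_path, "rb") as f:
        public_key = load_pem_public_key(f.read())  # assumed Ed25519 key
    with open(image_path, "rb") as f:
        image = f.read()
    with open(sig_path, "rb") as f:
        signature = f.read()
    try:
        public_key.verify(signature, image)  # raises on any mismatch
        return True
    except InvalidSignature:
        return False

# Never flash an image that fails verification
if not verify_firmware("update.img", "update.sig", "vendor_pub.pem"):
    raise SystemExit("signature check failed; aborting install")
```

The point is that the update agent fails closed: a build signed with the wrong certificates never gets written to flash, which is exactly the failure mode the article describes.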

may require physical repair in some cases

Ouch

I don't follow your line about an intern. I don't see it in the article, and even if it were the case, an unqualified person being able to do this is on the seniors/leads. Throwing the intern under the bus is what scummy companies do to shift blame - see SolarWinds, where (spoiler) this strategy doesn't seem to be working out.

It's more incompetent to allow an intern to fuck up production than it is to have normal developers make mistakes. It shows a complete lack of controls and care.

Yeah, not the intern's fault; it's the fault of the system that allowed the intern to do it at all

Anyone who builds software that runs on actual hardware should know that you NEVER deploy builds that haven't been fully exercised on actual hardware.

This tells me that their software QC process is non-existent at best and actually malignant at worst.

If their software is supposed to be their defining feature this is the equivalent of McDonald's "accidentally" shipping frozen discs of literal shit instead of burger patties to franchises who then serve them to customers without question.

If their company dies because of this (it fucking should imo) they 100% deserve it for the countless unknown dangers they've exposed their customers to. It's not this particular thing, bricking the infotainment system; it's the demonstration that their processes are bad or nonexistent.

Ok, calm down. Seems like a bit of an overreaction to link a bad software update for an infotainment system to “countless unknown dangers”

They screwed up, it happens to the best of us. There isn’t a company on the planet that hasn’t made a mistake and rolled out something that is broken.

What’s important here is that they said “yep, we fucked up, we are prioritizing fixing this problem for customers” instead of trying to hide it or blaming the customer for the problem.

If anything, Rivian should be applauded for how they handled it, and if this kind of thing continues to happen, then maybe we get the pitchforks out.

Dollars to donuts their infotainment system shares a CAN bus with nodes that affect control systems. If they can't handle the easy stuff, what the hell else are they fucking up?

It's not about the infotainment system, it's about the culture that leads to this problem.

This company will not end because of this issue. Boeing is still kicking, and you can actually count the number of people they've killed with their shitty software/system integration processes.

I've spent my career working in embedded systems and embedded test and verification. This issue is not the first or only one to slip through. Maybe they take this like the red hot poker it is and fix their problems, maybe not. I'm not gonna gamble on their products though.

So, if I'm understanding this correctly: if anyone ever rolls out a software update that causes a failure like this, it's instantly a sign that the company has a culture that leads to problems? Hard and fast? No exceptions? Nobody ever just makes a huge mistake that slipped through the cracks?

As for it being connected to the CAN bus, so what? It isn’t some sort of magical system where if something fails all the rest of the connected systems do too. That’s like saying if the monitor on my computer fails and it’s connected to the rest of my computer via the PCIe lanes on my graphics card, then everything else is going to be affected. It doesn’t work like that.

I don't even have an opinion on the company; I just don't think it's the end of times because the wrong build rolled out. They fucked up, they owned up to it, and based on the response they will learn from it.

The issue is not just that a bad update went out. Freak accidents can happen. Software is complicated and you can never be 100% sure. The problem is the specifics. A fat finger should never be able to push a bad update to a system in customers' hands, let alone a system easily capable of killing people in a multitude of ways. I'm not quite as critical as the commenter above, but this is a serious issue that should raise major questions about their culture and procedures.

This isn't just some website where a fat finger at worst means the site is down for a while (assuming you do the bare minimum and back up your db). This is a vehicle. That's what they meant about the CAN bus - not that that's really a concern when the infotainment system just gets bricked, but that they have such lax procedures around software that touches a safety-critical system.

Having systems in place to ensure only tested, known-good builds are pushed is pretty damn basic safety practice. Swiss cheese model - see the sketch below. If they can't even handle the basics, what other bad practices do they have?
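To be concrete, one slice of the cheese could be as dumb as this (a sketch; every name here is hypothetical):

```python
# Hypothetical release gate: only ship builds whose hash appears in a
# manifest of builds that already passed hardware-in-the-loop testing.
import hashlib
import json

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def gate_release(build_path: str, manifest_path: str) -> None:
    with open(manifest_path) as f:
        approved = set(json.load(f)["approved_hashes"])
    digest = sha256_of(build_path)
    if digest not in approved:
        raise RuntimeError(f"build {digest[:12]} was never validated; refusing to ship")
```

A fat finger can still pick the wrong file, but the wrong file no longer makes it out the door.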

Again, not that I think this is necessarily as bad as the other person - perhaps this is the only mistake they've made in their safety procedures and otherwise they're industry leaders - we don't know that yet. But this is extremely concerning and until proven otherwise should be investigated and treated as a very serious safety violation. Safety first.

Thank you for this response. I can agree with this perspective.

My comments were more "hey, let's be a little more level-headed about this" and less "this company should die and heads should roll".

Interconnected hardware and software systems do affect each other. It's not magic, it's physics.

And yes, your graphics card spewing garbage onto the PCIe bus can affect the rest of your system.

It actually does work like that.

If I have offended you, that wasn’t my intent. You seem defensive about what I said but I wasn’t trying to upset you.

I said broken monitors don't necessarily affect the rest of the system. Just like, you know, broken infotainment systems don't necessarily affect the rest of the car. It can happen sometimes; it doesn't seem to have happened this time. So yes, what you are implying is that magic is happening when it clearly isn't, and to sit here and say it will definitely affect other systems is misleading.

People make mistakes; it's unavoidable. But the fact that they are willing to admit it was their fault shows an attitude of learning and growth, and is a welcome change from the norm, where companies sweep it under the rug and it costs people their lives.

Will they probably grow to a point where they are too big to give a shit? Probably. At least for now they are being open and honest instead of blaming the user or a third party.

We don’t live in a vacuum, the world isn’t black and white. Come live in the grey and cut people some slack.

Interns do, but should not, get the level of write access that makes a durable change impacting all customers. Deadlock a server or even wipe SQL tables: that's an outage. Break a customer's configuration or send the wrong client's paperwork: again, a small-scale problem you can deal with. Interns don't change company policy.

I think it's a more foundational architecture question: why do you push builds to all customers at once without gating on SOMETHING that positively confirms the exact OTA update package has been validated? The absolute simplest thing I can think of is pushing to one random car and waiting for the post-install self-tests to pass before pushing to everyone else - rough sketch below. Maybe there's actually no release automation?? But then you make it safe a different way.

It's just defensive coding practice. I don't even have a CS degree, but I learned on the job that something always breaks, so you generally account for the expectation that everything will fail by building in a fail-safe, just so the failure isn't spectacular. Nothing fancy, just enough mitigation to keep the fuck-up from eating into your weekend if it happens on a Friday.
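Something like this is all I'm picturing (rough sketch; the fleet API here is invented for illustration):

```python
# Staged OTA rollout as described above: push to one random car, wait
# for its post-install self-test, and only then push to the rest.
import random
import time

def staged_rollout(fleet, package, timeout_s=3600):
    canary = random.choice(fleet)
    canary.push_update(package)             # hypothetical fleet API
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = canary.self_test_status()  # "pass" / "fail" / "pending"
        if status == "pass":
            break
        if status == "fail":
            raise RuntimeError("canary self-test failed; halting rollout")
        time.sleep(60)
    else:
        raise RuntimeError("canary timed out; halting rollout")
    for car in fleet:
        if car is not canary:
            car.push_update(package)
```

Even if the real pipeline is more complicated, the principle is the same: one bricked canary beats thousands of bricked customers.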