CrowdStrike’s faulty update crashed 8.5 million Windows devices, says Microsoft

jeffw@lemmy.world to Technology@lemmy.world – 346 points
theverge.com

No validation, in the driver or the updater software.

No validation or automated testing on publish.

No staged rollouts.

Just utterly irresponsible all around.
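For what it's worth, the staged-rollout point can be sketched in a few lines. This is a minimal Python sketch under my own assumptions, not anything CrowdStrike actually runs: `staged_rollout`, the `deploy` callback, and the ring names are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy ring by ring and stop once a ring exceeds the failure budget.

    `deploy` is a hypothetical callback that returns True when the host is
    healthy after the update. Returns every host that received the update.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            healthy = deploy(host)
            updated.append(host)
            if not healthy:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt: later rings never receive the update
    return updated


# Hypothetical fleet: a small canary ring, then progressively larger rings.
rings = [
    ["canary-1", "canary-2"],
    ["early-1", "early-2", "early-3"],
    [f"prod-{i}" for i in range(100)],
]


def crashing_update(host: str) -> bool:
    return False  # simulate an update that breaks every host it touches


touched = staged_rollout(rings, crashing_update)
print(touched)  # only the canary ring was touched
```

With a bad update, the damage stops at the two canary hosts instead of reaching all 105.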

When I worked there six years ago, the company motto was "two feet on the gas pedal" because the CEO was a race car driver. I bailed after 10 months, giving up pre-IPO shares. Management for my team was nonexistent, and I was on the build and release team. People were doing releases manually. They've improved the automation some from what I hear, but it looks like the motto finally hit them.

I should also say their metrics were absolutely staggering. The log aggregator was doing something like 2 trillion requests a week. All backed by splunk. I never heard what they were paying, but it must have been fucking nuts.

Race car drivers definitely don't put both feet on the gas pedal though... Like, what?

I would've preferred Colin McRae's classic in the same spirit: "when in doubt, flat out"

The unfortunate thing is that, in the long run, that strategy will probably be super effective. Unless Europe (with the only internet regulations that actually have teeth) does something harsh enough, they will probably pay a few small fines over this at most. Cost of doing business and probably baked in already.

A coworker of mine has worked with CrowdStrike in the past; I haven't. He said that the releases he was familiar with were all staged into groups, and customers were encouraged to test internally before applying them. Not sure if this is a different product or what, but it seems like a big step backwards if what he's saying is right.

I first dealt with them at least 10 years ago, and at the time they had no ability to do staged or targeted rollouts. We got updates when they said we did, with no choice or control. We had to resort to updating our firewall to restrict the download endpoint and only open it in groups to do a phased update.

Interesting! Sounds like they may have changed things a few times, or maybe my co-worker's memory has some gaps.

Channel files are different from sensor updates. You have no version control over channel files; sensor releases you do have control over.

The idea of "security software" is ridiculous overall. You buy software to fix security problems in Windows, and it violates the original product by inserting code into the kernel. You lose support from the original product vendor. And you think you're secure, even though the whole thing makes you forget that IT should always be able to solve security and recovery problems, even when everything else fails.

No staged rollouts.

I read somewhere that CS does allow for staged rollouts but some updates deliberately ignore them.

As if the borked update wasn't bad enough, it was also forced on users that explicitly said not to install it.

CrowdStrike’s channel file updates were pushed to computers regardless of any settings meant to prevent such automatic updates

From my reading this is misleading at best and likely wrong. I don’t work with CrowdStrike Falcon but have installed and maintained very similar EDR tools in enterprise environments and the channel updates referenced are the modern version of definition updates for a classic AV engine. Being up to date is the entire point and so typically there are only global options to either grab those updates from the vendor or host them internally on a central server but you wouldn’t want to slow roll or stage those updates since that fundamentally reduces the protection from zero days and novel attacks that the product is specifically there to detect and stop. These are not engine updates in that they don’t change the code that is running, they give the code new information about what an attack will look like to allow it to detect malicious activity as soon as CrowdStrike knows what the IoCs look like.

In this case it appears that one of these updates pointed to a bad memory location which caused the engine to crash the OS, but it wasn’t a code update that did it (like a software patch). That should have been caught in QA checks prior to the channel update being pushed out, but it’s in CrowdStrike’s interest to push these updates to all of their customers’ PCs as quickly as they can to allow detection of novel attacks.
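To illustrate the kind of check that would reject a data file pointing at a bad location before the engine ever acts on it, here is a minimal Python sketch. The file layout (a `CHNL` magic, a count, then a list of table indexes) is invented for the example and is not CrowdStrike's actual channel file format; `parse_content_file` and `InvalidContentFile` are hypothetical names.

```python
import struct


class InvalidContentFile(ValueError):
    """Raised when a content file fails validation."""


MAGIC = b"CHNL"  # hypothetical magic bytes, not a real format
HEADER = struct.Struct("<4sI")  # magic + little-endian uint32 entry count


def parse_content_file(data: bytes, table_size: int) -> list[int]:
    """Parse a made-up format: header, then `count` uint32 table indexes.

    Every index must point inside a table of `table_size` entries, so the
    consumer never dereferences a location the file made up.
    """
    if len(data) < HEADER.size:
        raise InvalidContentFile("file shorter than header")
    magic, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise InvalidContentFile("bad magic")
    if len(data) != HEADER.size + 4 * count:
        raise InvalidContentFile("length does not match declared count")
    indexes = list(struct.unpack_from(f"<{count}I", data, HEADER.size))
    for index in indexes:
        if index >= table_size:
            raise InvalidContentFile(f"index {index} out of range")
    return indexes


good = struct.pack("<4sI2I", MAGIC, 2, 0, 3)
print(parse_content_file(good, table_size=4))

zeroed = b"\x00" * 16  # e.g. a file that shipped as all zero bytes
try:
    parse_content_file(zeroed, table_size=4)
except InvalidContentFile as exc:
    print("rejected:", exc)
```

The point is only that the parser fails closed: a malformed file raises an error the caller can handle instead of being trusted blindly.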

That should have been caught in QA checks prior to the channel update being pushed out...

I work in QA, and part of the job is justifying why it's necessary to keep a team of people that doesn't actually "produce" anything. Either their QA team is now in the hot seat, or CrowdStrike is now realizing why they need one.

Either way, it sounds like a basic smoke test would have uncovered the issue, and the fact that nobody found this means nobody bothered to do one of the most basic tests: turn it on and see if it "catches fire."

God, even if they didn't have QA test it, they should have had continuous integration running to test all new channel updates against all versions of their program, considering the update will affect all of them. What an epic process failure.
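A rough Python sketch of that idea: run every new channel file against every supported program version and fail the publish if any combination breaks. Everything here is hypothetical (`smoke_test_matrix`, the version strings, the file names). In a real pipeline `load` would boot a VM with that sensor version and report whether it came back up, since a kernel crash will not surface as a Python exception.

```python
from itertools import product
from typing import Callable, Iterable


def smoke_test_matrix(
    sensor_versions: Iterable[str],
    channel_files: Iterable[str],
    load: Callable[[str, str], None],
) -> list[tuple[str, str, str]]:
    """Call load(version, channel_file) for every pair and collect failures."""
    failures: list[tuple[str, str, str]] = []
    for version, channel_file in product(sensor_versions, channel_files):
        try:
            load(version, channel_file)
        except Exception as exc:  # any failure blocks the publish
            failures.append((version, channel_file, repr(exc)))
    return failures


def fake_load(version: str, channel_file: str) -> None:
    # Stand-in for "boot a test machine and apply the file".
    if channel_file == "bad.bin":
        raise RuntimeError("sensor crashed")


failures = smoke_test_matrix(["7.10", "7.11"], ["good.bin", "bad.bin"], fake_load)
for version, channel_file, error in failures:
    print(f"FAIL sensor {version} + {channel_file}: {error}")
if failures:
    print("publish blocked")
```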

Being up to date is the entire point and so typically there are only global options to either grab those updates from the vendor or host them internally on a central server but you wouldn’t want to slow roll or stage those updates since that fundamentally reduces the protection from zero days and novel attacks that the product is specifically there to detect and stop.

That's not your, or CrowdStrike's, decision to make. If organizations have applied settings to not install updates automatically then that's what they expect to happen and you need to honour it. You don't "know best". They do.

Being up to date is the entire point

No, it isn’t. The point is to keep systems safe and operational. Blindly rolling out untested updates is not a good strategy for that. I have seen entire systems shut down due to false alerts from updated antivirus software. Luckily only test environments, before these updates were rolled out to production. It does not take much to test updates like this before rolling them out to your entire organisation.

Our organization is configured to install N-1 of current release specifically to avoid this type of stuff. Does it matter? No, we got hit just like everyone else.
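For anyone unfamiliar with the term, an N-1 policy just means "run the release before the latest one". A tiny Python sketch, with made-up version strings and a hypothetical `pick_sensor_version` helper; note that it only pins the sensor release and says nothing about content files delivered separately.

```python
def pick_sensor_version(releases: list[str], lag: int = 1) -> str:
    """Return the release `lag` steps behind the newest (N-1 when lag=1)."""
    if not releases:
        raise ValueError("no releases available")
    # Sort numerically so that "7.10" comes after "7.9".
    ordered = sorted(releases, key=lambda v: tuple(int(p) for p in v.split(".")))
    return ordered[max(len(ordered) - 1 - lag, 0)]


print(pick_sensor_version(["7.9", "7.11", "7.10"]))  # 7.10
```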

I'm getting real sick of companies acting like rapists and society just accepting it, if not justifying it for them.

No means no. Plain and simple.

The distinction between that and a malicious hack consists entirely of intent.

Well that's just terrorism then

Terrorism would require a political angle.

This is malicious incompetence.

One can argue that there is a very niche political angle to this - teaching Windows users the fear of God, so that they'd see the error of their ways. But it works in our favor, so let's not concentrate attention on it.

I doubt it was that few.

For reals. Their self-reporting is just trying to mitigate damages from the mistake.

This is the best summary I could come up with:


CrowdStrike’s faulty update caused a worldwide tech disaster that affected 8.5 million Windows devices on Friday, according to Microsoft.

Microsoft says that’s “less than one percent of all Windows machines,” but it was enough to create problems for retailers, banks, airlines, and many other industries, as well as everyone who relies on them.

Separately, the technical breakdown from CrowdStrike released Friday explains more about what happened and why so many systems were affected all at once.

CrowdStrike’s breakdown explains the configuration file that was at the heart of the issue:

CrowdStrike explained that the file is not a kernel driver but is responsible for “how Falcon evaluates named pipe execution on Windows systems.” Security researcher and Objective-See founder Patrick Wardle says that the explanation aligns with the earlier analysis he and others provided about the cause of the crash, as the problem file “C-00000291-” “triggered a logic error that resulted in an OS crash” (via CSAgent.sys).

CrowdStrike’s channel file updates were pushed to computers regardless of any settings meant to prevent such automatic updates, Wardle noted.


The original article contains 193 words, the summary contains 175 words. Saved 9%. I'm a bot and I'm open source!