Not long ago, the idea that a single IT misstep could cripple companies across entire industries would have seemed like a huge overstatement. However, the July 2024 Windows outage is a stark reminder of how interconnected our world is. On July 19, 2024, a faulty update to CrowdStrike’s security software crashed Microsoft Windows systems worldwide. How could such an IT catastrophe unfold? Let’s dive in and explore the causes.
Many high-profile companies, such as JP Morgan Chase, Walmart, and Shell, use Falcon, CrowdStrike’s cybersecurity software, to protect their IT infrastructure from data breaches. In fact, it’s used by 82 percent of US state governments and 48 percent of the largest US cities.
Unlike traditional security systems that rely on bulky on-premises hardware, CrowdStrike Falcon is managed from the cloud. It works through a lightweight agent installed on user devices, be it Windows, macOS, or Linux. Once installed, this agent connects to CrowdStrike’s cloud platform and receives its updates from there.
CrowdStrike’s latest update for Windows hosts turned out to be faulty, causing a Blue Screen of Death (BSOD) at boot. Rolling the update back requires a system that boots, so affected machines were stuck in a crash loop, which is a dead-end scenario for a non-technical user.
Adding to the confusion, an outage hit Microsoft Azure services and the Microsoft 365 suite of apps in the central US earlier on Thursday. While a company spokesperson
Talk about a chaotic end to the week!
The widespread impact of this incident is staggering, considering the CrowdStrike agent is installed on millions of devices, from servers and personal computers to Internet of Things (IoT) devices. The update, intended to enhance system security, ironically caused widespread crashes across various industries, including:
Firstly, most organizations deploy software updates automatically, so the faulty update spread like wildfire. Secondly, the culprit was poorly written code, an error that CrowdStrike has since taken full responsibility for. While the exact details of this blunder remain unclear, one thing is certain: rigorous software testing could have prevented this IT disaster entirely or at least significantly reduced its impact.
Why the update might have caused issues:
Insufficient Testing: CrowdStrike’s QA process for the update might not have been thorough enough. Software as critical as Falcon should be tested on dozens of devices and hundreds of different environments.
Testing Environment Limitations: CrowdStrike’s testing environment might not have perfectly replicated real-world conditions. This could lead to issues showing up only when the update interacts with other software on user machines.
The good news is that CrowdStrike engineers shared a workaround. Here it is:
1. Boot Windows into Safe Mode or the Windows Recovery Environment.
2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
3. Locate the file matching “C-00000291*.sys”, and delete it.
4. Boot the host normally.
The bad news is that it doesn’t work for everyone. First, the steps assume a level of technical comfort that many users don’t have. Also, the fix has to be applied machine by machine: a host that crashes at boot can’t be reached by the usual remote management tools, so it typically requires physical access to each impacted device, and cloud-hosted systems need their own recovery procedures. Unfortunately, this translates to a lengthy recovery process for system administrators.
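The file-matching step of the workaround can be illustrated in a few lines of Python. This is only a minimal sketch, not CrowdStrike’s tooling: it assumes the affected volume is reachable from a working system with Python available (for example, a disk attached to a rescue machine), and the function name `remove_faulty_channel_files` and its `dry_run` parameter are hypothetical names introduced for illustration. Only the file pattern comes from the workaround above.

```python
from pathlib import Path

# File pattern quoted in the workaround above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Find files matching the pattern in driver_dir.

    With dry_run=True (the default) nothing is deleted; the matches are
    only returned so they can be reviewed first.
    """
    matches = sorted(p for p in driver_dir.glob(CHANNEL_FILE_PATTERN) if p.is_file())
    for path in matches:
        if not dry_run:
            path.unlink()  # delete the matching file
    return matches
```

Calling the function with the CrowdStrike driver directory of the offline volume and `dry_run=True` lists what would be removed. Defaulting to a dry run is a deliberate choice here, since deleting files from a system driver directory is not something to do by accident.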
So, what lessons can we learn from one of the most widespread tech meltdowns? Prevention is always better than cure. While having a detailed incident response plan is good, what’s even better is having an ongoing and well-established quality assurance process.
Unit Testing: This involves testing individual components of the update in isolation. This could have identified issues within the update code itself before it interacted with other software.
Functional Testing: This verifies if the update delivers its intended functionality without causing unintended consequences. This could have involved testing the update on various Windows configurations and with different software combinations.
Integration Testing: This focuses on how the update interacts with other software on a system. This could have revealed compatibility issues with specific Windows versions or drivers.
Regression Testing: This re-runs existing tests after a change to confirm that previously working behavior still works, so the update doesn’t introduce new crashes or vulnerabilities.
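To make the unit and regression testing ideas above concrete, here is a minimal sketch using Python’s built-in `unittest` module. Everything about the update format is an assumption made for illustration: the pipe-separated rule layout, the `EXPECTED_FIELDS` constant, and the `validate_rule` function are hypothetical and do not describe how CrowdStrike’s updates are actually structured.

```python
import unittest

# Hypothetical schema for one rule in an update: name | pattern | action
EXPECTED_FIELDS = 3


def validate_rule(line: str) -> bool:
    """Accept a rule only if it has exactly EXPECTED_FIELDS non-empty fields."""
    fields = [field.strip() for field in line.split("|")]
    return len(fields) == EXPECTED_FIELDS and all(fields)


class ValidateRuleTests(unittest.TestCase):
    # Unit tests: exercise the validator in isolation.
    def test_accepts_well_formed_rule(self):
        self.assertTrue(validate_rule("rule-1|*.tmp|deny"))

    def test_rejects_missing_field(self):
        self.assertFalse(validate_rule("rule-1|*.tmp"))

    def test_rejects_empty_field(self):
        self.assertFalse(validate_rule("rule-1||deny"))

    # Regression test: a (hypothetical) past bug where blank lines slipped
    # through. It stays in the suite so the bug cannot quietly return.
    def test_regression_blank_line_is_rejected(self):
        self.assertFalse(validate_rule(""))
```

Saved in a test file, the suite can be run with `python -m unittest`. The point of the sketch is the workflow rather than the validator itself: malformed input is rejected before it ships, and every bug found once gets a test that runs on every later change.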
Canary Releases: Deploy new updates to a small subset of users first to identify any critical issues before a full rollout.
Feature Flags: Isolate new features behind feature flags to enable quick rollback if necessary.
Incident Response Playbook: Create a detailed incident response playbook that outlines roles, responsibilities, and procedures for handling outages.
Cross-Functional Teams: Foster collaboration between development, QA, operations, and security teams to ensure a holistic approach to software development.
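The canary release practice mentioned above can be sketched as a staged rollout that halts as soon as a stage looks unhealthy. This is an illustrative sketch under stated assumptions, not a description of any vendor’s deployment system: the `staged_rollout` function, the `deploy` callback (which should return `True` when a host is healthy after the update), and the stage sizes of 1%, 10%, and 100% are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy to growing fractions of hosts, stopping at the first bad stage.

    Returns the hosts that were attempted and whether the rollout completed.
    """
    attempted: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be covered after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(attempted):target]
        if not batch:
            continue
        failures = sum(0 if deploy(host) else 1 for host in batch)
        attempted.extend(batch)
        if failures / len(batch) > max_failure_rate:
            return attempted, False  # halt: later stages never receive the update
    return attempted, True
```

With 100 hosts and an update that breaks every machine, only the single canary host is touched before the rollout stops. A real pipeline would also wait and monitor between stages, but even this simple gate limits the blast radius of a bad update to the first stage.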
Microsoft estimated that the faulty CrowdStrike update knocked out 8.5 million computers worldwide. This chaos highlights the critical need for thorough software testing. Companies must prioritize comprehensive testing and strong IT processes to prevent future disasters. Remember, in tech, testing isn't optional; it's essential.