CrowdStrike Outage

I’ll preface with this: it’s a best-effort account from my own experience and from collecting information from the community. I do not work at CrowdStrike, nor do I have any inside knowledge or information about what occurred internally or what the cause was.

If you’ve seen the news today… it was a long day for a lot of people in IT. Today I bring you the technical side of the public information about today’s fire around the world. But before I get into the angle people would expect from me, I hope this is a learning experience for a lot of reasons:

  • Whoever is kicking themselves at CrowdStrike, it happens to the best of us (maybe not Y2K level, but you know… you guys seem to be killing it in the market.)
  • Companies should be thinking about resource management and shift rotations for people in the trenches at times like these.
  • Front-line support staff, bear with employees who can’t follow command-line instructions easily. The admins of today might be bad at using Windows 18.
  • I’m sure lots of discussions around auto-update policies and supply chain attack surfaces are happening and will continue to happen as they should.
  • This should be a wake-up call that testing in production is a terrible idea, that QA is required, and that global, non-staggered background updates are not what you expect from an enterprise solution. The ability to push an update to all global customers at once with no notice is a problem.

Those conversations need to happen. But what did your computer do today…

BSOD - Blue Screen of Death

First, let’s explain why you saw the infamous Blue Screen of Death (BSOD) on screens around the world today. A BSOD is Windows telling you that something really, really important crashed and that Windows needs to stop as well. Windows has three “modes” programs can operate in: User, Kernel, and S mode.

  • User mode - The place we normally interact with. This is where programs are usually installed, and it has limited access to system resources. The browser you are reading this in is most likely running in user space right now.
  • Kernel mode - The place where special software runs that is usually required to operate your system. It is a privileged area with additional access and resources available; programs here have more direct access to the hardware. When you hear “system drivers”, they are likely running here.
  • S mode - This is a Windows exclusive, and we don’t really need to talk about it much in this context. Just know that it exists, it’s more locked down, and it’s meant for Microsoft Store apps.

Windows Modes

When a program like Microsoft Edge or Outlook crashes, it’s running in User Mode. In that context, the operating system can report that the process crashed, and as the user you can simply start it again. Hopefully it works and the issue doesn’t continue, but at least you can still use your computer.
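
To make that concrete, here’s a deliberately broken little C++ program (a made-up example, nothing to do with any real product) that dereferences a null pointer. Run it and the process dies with an access violation, but Windows itself shrugs and keeps going; you just start the program again.

```cpp
#include <iostream>

int main() {
    int* p = nullptr;  // a pointer that points at nothing

    // Dereferencing a null pointer is undefined behavior. In user mode,
    // Windows reports an access violation and terminates *this process*,
    // while the operating system and every other program keep running.
    std::cout << *p << std::endl;

    return 0;
}
```

Annoying, but contained.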

It gets tricky when a program running in Kernel Mode crashes. Nine times out of ten, if you see a BSOD, there was a crash at the kernel level. Why would you want a program to run in kernel mode at all? Well, a lot of it is software that comes with your operating system to make it function. For something like Endpoint Detection and Response (EDR), it needs the information and the additional access that the kernel provides in order to monitor system events and stop actual malware from taking action on the system.

Drivers like these, ones that need to be running, get added to the list of boot-start drivers on your system and become required for a successful boot. If one of them fails, you see a BSOD with the error information. When CrowdStrike’s kernel driver hit an error today, it led to the BSOD you saw in the news headlines, because it’s one of these required drivers.
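
To give a feel for what “running in kernel mode” means, here is a bare-bones sketch of a Windows kernel driver’s entry point, built against the Windows Driver Kit. It’s a generic illustration (not CrowdStrike’s code) of the kind of thing an EDR does: register with the kernel to be notified about system events, such as process creation. The important difference from the user-mode example above is that an unhandled fault anywhere in code like this doesn’t kill one process, it bugchecks the whole machine, and if the driver is boot-start, that happens while the system is still trying to come up.

```cpp
#include <ntddk.h>  // Windows Driver Kit, kernel-mode headers

// Callback the kernel invokes every time a process is created or exits.
// This is the kind of visibility an EDR needs and can only get in kernel mode.
VOID ProcessNotify(HANDLE ParentId, HANDLE ProcessId, BOOLEAN Create)
{
    UNREFERENCED_PARAMETER(ParentId);
    DbgPrint("Process %p %s\n", ProcessId, Create ? "created" : "exited");
}

// DriverEntry is the kernel-mode equivalent of main(). A null-pointer
// dereference here (or in any code this driver runs later) is not a
// process crash -- it's a system-wide bugcheck, i.e. a BSOD.
extern "C" NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject,
                                PUNICODE_STRING RegistryPath)
{
    UNREFERENCED_PARAMETER(DriverObject);
    UNREFERENCED_PARAMETER(RegistryPath);

    // Ask the kernel to call ProcessNotify on every process create/exit.
    return PsSetCreateProcessNotifyRoutine(ProcessNotify, FALSE);
}
```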

Ok, but what happened?

So, Kernel Mode should be treated like Spider-Man powers (“With great power comes great responsibility.”) You also saw today how many major companies around the world use CrowdStrike’s EDR software (which runs in Kernel Mode). Stores and websites were down, flights were cancelled, hospitals slowed to a stop. Last night, CrowdStrike pushed what they called a “Channel File Update” to all customer systems at once. This type of file (this one was named C-00000291-00000000-00000032.sys) bypasses customer-configured Sensor Update Policies and is a background update to the core components of all installed agents. Usually these are applied without issue and no action is ever needed by the user.

However, as you can see in the stack trace below, this one introduced a null pointer error once the kernel driver (CSagent.sys) tried to load using the file. I won’t go into what pointers are or how C++ memory management works, but understand that when you write in low-level languages you have the ability to do things that are unsafe and will cause crashes if you don’t build checks into your code. CrowdStrike is written in C++, and that leads us to this example…

Stack trace from a BSOD today

This is a stack trace from a crash today. You can see the error “Access Violation”, indicating that the crash is the result of a problem accessing memory. You can also see that the read address is 0x9c and that there is a move (mov) instruction, which is assembly for “copy data from here to there”. Unfortunately, the memory address 0x9c is not accessible because it falls in an invalid region of memory, and Windows will always fault if something tries to touch it. In C++, the programmer is supposed to check whether they are passing around or dereferencing NULL pointers and handle those cases themselves. It appears this function didn’t have the error handling needed to avoid the access violation. CrowdStrike’s driver raises the access violation and, since it’s a boot-start driver, Windows must crash the entire system when the driver crashes, leaving you with a sad face emoji on your screen. And since the network driver isn’t loaded yet (CrowdStrike is in the boot-start list, after all), your computer can’t download a working channel file from CrowdStrike. You are stuck in a boot loop until something changes.
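
Why 0x9c specifically? When you read a field out of a structure through a pointer, the address the CPU actually touches is the pointer’s value plus the field’s offset inside the structure. If the pointer is null, that math gives you a tiny address equal to the offset itself. Here’s a hypothetical sketch (the structure and field names are invented for illustration, not taken from CrowdStrike) where the field happens to sit 0x9c bytes in, and where a simple null check is the difference between an error message and a crash:

```cpp
#include <cstddef>
#include <cstdio>

// Invented layout: 39 ints of header put 'flags' at byte offset 156 (0x9c).
struct SensorRecord {
    int header[39];
    int flags;  // offsetof(SensorRecord, flags) == 0x9c
};

// Unsafe: 'rec->flags' compiles to "read 4 bytes at rec + 0x9c". If rec is
// null, that's a read from address 0x9c -- an access violation, and in a
// kernel driver, a bugcheck.
int read_flags_unchecked(const SensorRecord* rec) {
    return rec->flags;
}

// Safer: validate the pointer and let the caller handle the failure.
bool read_flags_checked(const SensorRecord* rec, int* out) {
    if (rec == nullptr || out == nullptr) {
        return false;
    }
    *out = rec->flags;
    return true;
}

int main() {
    std::printf("offset of flags: 0x%zx\n", offsetof(SensorRecord, flags));

    const SensorRecord* bad = nullptr;  // simulate data that failed to load
    int flags = 0;
    if (!read_flags_checked(bad, &flags)) {
        std::printf("record invalid, refusing to use it\n");
    }
    // read_flags_unchecked(bad) would crash right here instead.
    return 0;
}
```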

The fix (mentioned in the linked CrowdStrike article) was to remove the bad channel file via Safe Mode, reboot the system, and let the driver load normally. After loading, a new channel file is downloaded and things are back to normal. However, due to the nature of the crash, you needed hands on the keyboard for most user workstations to apply the fix; there was no network access for someone to remote in and help. Servers are their own story, as there are methods to access them remotely in cases like these, but for most teams it appeared to be a slow and grueling process.

Final Thoughts

This incident involving CrowdStrike’s EDR software serves as a stark reminder of the intricate and delicate nature of our global IT infrastructure. It underscores the need for rigorous testing, robust error handling, and thoughtful deployment practices, particularly when dealing with kernel-level drivers that have far-reaching implications.

This event should serve as a catalyst for organizations to reassess their internal processes, from their approach to software updates to their strategies for managing crises. It’s a call to action for the IT community to foster a culture of patience, understanding, and continuous learning.

While CrowdStrike has been a trusted name in the security community, this incident has undoubtedly left a mark. However, it’s important to remember that errors and mishaps are part of the journey in any field, especially one as complex and rapidly evolving as cybersecurity. It’s how we learn from these incidents and implement changes that truly defines us as organizations and professionals.

As we move forward, let’s take this as an opportunity to not only improve our technical practices but also to strengthen our resilience, our adaptability, and our commitment to making the digital world a safer place. After all, in the realm of IT, every challenge overcome is a step towards progress.