The end of last week was marked by a major outage affecting Windows PCs running CrowdStrike’s cybersecurity software. Following an investigation, CrowdStrike said the outage was caused by a bug in its testing software that prevented it from properly testing an update that was pushed out to millions of PCs on Friday.
At the same time, CrowdStrike promised to test updates to its software more thoroughly in the future and to introduce a phased rollout procedure to avoid a repeat of last week's incident. As a reminder, CrowdStrike’s Falcon application is used by companies around the world to protect against cyberattacks and is installed on millions of PCs. On Friday, the company began distributing an update to Falcon, which was supposed to collect “telemetry data on possible new methods of combating cyber threats.” Such updates are released regularly, but in this case one of them caused a large-scale failure on Windows PCs.
CrowdStrike typically releases two types of updates. Sensor Content packages update the Falcon sensor itself, which runs on the user’s device at the Windows kernel level. Rapid Response Content packages update the detection data that the Falcon sensor uses to identify malware. In this case, a tiny 40 KB Rapid Response Content file caused an outage on 8.5 million computers.
Updates to the Falcon sensor itself are not typically deployed from the cloud. They include AI and machine learning models that allow CrowdStrike to improve its malware detection capabilities over the long term. Some of these capabilities include what are called “Template Types,” which are the software code for new detections and are configured by the Rapid Response Content delivered to users’ devices.
CrowdStrike has a cloud platform that it uses to manage its products and validate the content of update packages before they are widely distributed. Last week, the company released two Rapid Response Content updates at once. Now, it has been determined that a bug in the content validation tool caused both packages to pass validation, even though one was problematic and ultimately led to a widespread outage.
While CrowdStrike performs automated and manual testing of updates before they are deployed to the public, it appears that the testing was not thorough enough in this case. After the previous deployment of Template Types, the company internally “trusted the checks performed by content inspection tools,” so it assumed that a new deployment of a similar update would not cause complications. As a result, the Falcon sensor received the problematic content along with the Rapid Response Content update, loaded it into its content interpreter, and then hit an error when it tried to read memory outside the valid address space. Falcon could not handle this error, and because the sensor runs at the kernel level, Windows crashed.
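The failure mode described above, content that tells an interpreter to read data that is not there, can be illustrated with a short Python sketch. This is not CrowdStrike's code: the names `ContentError`, `evaluate_rule` and `load_content`, and the idea of a rule as a list of field positions, are hypothetical and chosen only for illustration. The point is that an interpreter which checks every index before reading can reject a bad rule instead of crashing.

```python
class ContentError(Exception):
    """Raised when a content rule refers to data the interpreter does not have."""


def evaluate_rule(field_indexes, inputs):
    """Return the input values a rule asks for, checking every index first.

    field_indexes: positions the (hypothetical) rule wants to read.
    inputs: the values actually supplied to the interpreter.
    """
    values = []
    for index in field_indexes:
        # Bounds check: refuse the rule rather than read past the end of inputs.
        if not 0 <= index < len(inputs):
            raise ContentError(
                f"rule reads field {index}, but only {len(inputs)} were supplied"
            )
        values.append(inputs[index])
    return values


def load_content(rules, inputs):
    """Try every rule; keep the ones that evaluate and report the ones that do not."""
    accepted, rejected = [], []
    for name, field_indexes in rules.items():
        try:
            evaluate_rule(field_indexes, inputs)
            accepted.append(name)
        except ContentError:
            # The bad rule is skipped; the interpreter itself keeps running.
            rejected.append(name)
    return accepted, rejected
```

With three input values, a rule that reads positions 0 and 2 would be accepted, while one that reads position 5 would be rejected and reported rather than bringing down the whole process.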
To prevent similar incidents in the future, CrowdStrike intends to improve the Rapid Response Content testing process, including by testing on the developer’s local systems, rolling out packages in stages, and integrating the ability to roll back to a previous system state. In addition, developers will deploy additional tools on their systems to stress test updates and identify errors. The stability of update packages and the Rapid Response Content interface will be tested. CrowdStrike will also update the cloud-based update verification tool, as well as improve the error handling mechanism in the content interpreter, which is part of the Falcon sensor.
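A staged rollout with rollback, as promised above, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CrowdStrike's procedure: the function `staged_rollout`, the callbacks `deploy`, `is_healthy` and `rollback`, and the 1% / 10% / 100% stages are all hypothetical.

```python
def staged_rollout(hosts, deploy, is_healthy, rollback, stages=(1, 10, 100)):
    """Deploy to a growing percentage of hosts, rolling back on the first failure.

    stages holds cumulative percentages of the fleet (assumed values).
    Returns True if every stage succeeded, False if the update was rolled back.
    """
    updated = []
    for percent in stages:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, len(hosts) * percent // 100)
        for host in hosts[len(updated):target]:
            deploy(host)
            updated.append(host)
        # Check every host updated so far before widening the rollout.
        if not all(is_healthy(host) for host in updated):
            for host in reversed(updated):
                rollback(host)
            return False
    return True
```

The design choice here is that a faulty update is stopped at the first stage where a host reports a problem, so it reaches only a small share of machines and those machines are returned to their previous state.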