CrowdStrike has published a post-incident review (PIR) of the faulty update that crashed 8.5 million Windows machines last week. The detailed post blames a bug in its test software for failing to properly validate the content update that was pushed to millions of machines on Friday. CrowdStrike is promising to test its content updates more thoroughly, improve its error handling, and implement staggered deployments to avoid a repeat of the disaster.
CrowdStrike’s Falcon software is used by enterprises around the world to defend millions of Windows machines against malware and security breaches. On Friday, CrowdStrike released a content configuration update for its software that was designed to “gather telemetry on possible novel threat techniques.” These updates are delivered regularly, but this particular configuration update caused Windows to crash.
CrowdStrike typically issues configuration updates in two different ways. So-called Sensor Content directly updates CrowdStrike’s own Falcon sensor, which runs at the kernel level in Windows, while separate Rapid Response Content updates how that sensor behaves when detecting malware. Friday’s problem was caused by a tiny 40KB Rapid Response Content file.
Updates to the sensor itself do not come from the cloud and typically include artificial intelligence and machine learning models that allow CrowdStrike to improve its detection capabilities over the long term. Some of these capabilities include so-called Template Types, which are code that enables new detections and is configured through separate Rapid Response Content like the kind delivered on Friday.
On the cloud side, CrowdStrike operates its own system that runs validation checks on content before it is released, to prevent incidents like Friday’s. Last week CrowdStrike released two Rapid Response Content updates, which it also calls Template Instances. “Due to a bug in the content validator, one of the two template instances passed validation despite containing problematic content data,” CrowdStrike said.
While CrowdStrike performs both automated and manual testing on Sensor Content and Template Types, it does not appear to test the Rapid Response Content of the kind delivered on Friday as thoroughly. A March deployment of new Template Types provided “trust in the checks performed in the content validator,” so CrowdStrike appears to have assumed that the Rapid Response Content rollout would not cause problems.
This assumption led the sensor to load the problematic Rapid Response Content into its content interpreter, triggering an out-of-bounds memory read and an exception. “This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD),” CrowdStrike explains.
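One common way problematic content crashes an interpreter is a read past the end of the data it was given. The following is a minimal sketch of the defensive check that avoids this, not CrowdStrike’s actual code: the names `TemplateInstance`, `read_field` and `evaluate` are hypothetical, and Python is used only for illustration. The Falcon sensor runs as a Windows kernel driver, where an unchecked read of this kind can take down the whole operating system rather than raising a recoverable error.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TemplateInstance:
    """Hypothetical content record: a name plus the input fields it supplies."""
    name: str
    fields: List[str]


def read_field(instance: TemplateInstance, index: int) -> Optional[str]:
    """Return the field at `index`, or None if the content does not supply it."""
    # Check the bounds first instead of trusting that upstream validation ran.
    if 0 <= index < len(instance.fields):
        return instance.fields[index]
    return None


def evaluate(instance: TemplateInstance, expected_fields: int) -> bool:
    """Accept the content only if every field the interpreter expects is present."""
    for i in range(expected_fields):
        if read_field(instance, i) is None:
            # Reject the content rather than read past the end of its data.
            return False
    return True
```

With this guard, content that supplies fewer fields than the interpreter expects is rejected and skipped; the interpreter does not rely solely on the validator having caught the mismatch earlier.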
To prevent this from happening again, CrowdStrike is promising to improve its Rapid Response Content testing through local developer testing, content update and rollback testing, stress testing, fuzzing, and fault injection. CrowdStrike will also perform stability testing and content interface testing on Rapid Response Content.
CrowdStrike is also updating its cloud-based content validator to better check Rapid Response Content releases. “A new check is in process to guard against this type of problematic content from being deployed in the future,” CrowdStrike said.
On the driver side, CrowdStrike will “enhance existing error handling in the content interpreter,” which is part of the Falcon sensor. CrowdStrike will also implement a staggered deployment of Rapid Response Content, so that updates are rolled out gradually to larger portions of its install base rather than pushed to all systems at once. Security experts have recommended both driver improvements and staggered deployments in recent days.
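A staggered deployment can be sketched roughly as follows. This is an illustration of the general technique, not CrowdStrike’s rollout system: the stage percentages, the hash-based bucketing, and the `is_healthy` callback are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical rollout stages, as cumulative percentages of the install base.
STAGES = [1, 10, 50, 100]


def bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def hosts_for_stage(host_ids: Iterable[str], percent: int) -> List[str]:
    """Hosts whose bucket falls inside the given cumulative percentage."""
    return [h for h in host_ids if bucket(h) < percent]


def staged_rollout(
    host_ids: List[str],
    is_healthy: Callable[[Set[str]], bool],
) -> Tuple[Set[str], bool]:
    """Widen the rollout stage by stage, halting if the health check fails."""
    deployed: Set[str] = set()
    for percent in STAGES:
        deployed.update(hosts_for_stage(host_ids, percent))
        if not is_healthy(deployed):
            # Stop here: hosts in later stages never receive the update.
            return deployed, False
    return deployed, True
```

Because each host’s bucket is derived from a hash of its ID, every stage is a superset of the previous one, and a failed health check at the first stage confines the damage to roughly 1 percent of machines in this sketch.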