Dive Brief:
- CrowdStrike said the global IT crash on July 19 was caused by an undetected error in a rapid response content configuration update to its Falcon platform, according to a preliminary report issued late Tuesday.
- After CrowdStrike released an interprocess communication (IPC) template type in late February, the company stress-tested it in March and released three additional IPC template instances in April, according to the report. However, two more IPC template instances released on July 19 passed validation even though they contained defective content, CrowdStrike said.
- Going forward, CrowdStrike will add more validation testing for rapid response content and develop a staggered deployment strategy, releasing new updates in stages rather than all at once. The company will also improve live monitoring and give customers more control over how updates are delivered.
Dive Insight:
CrowdStrike's preliminary review of how a defective update to its Falcon security platform led to a historic crash of Windows systems worldwide comes as businesses continue to restore operations.
The defective update to its Falcon sensor caused about 8.5 million Windows systems across the globe to crash, leading to the blue screen of death at major companies, government agencies and other users.
The report's findings show that CrowdStrike needs to adopt dogfooding and canary deployment for future rapid response content updates, according to Allie Mellen, principal analyst at Forrester.
“That means, much like they already do with sensor content updates, deploying rapid response content updates internally at CrowdStrike first, then rolling out deployment in increments,” Mellen said via email.
In software development, dogfooding means testing a product internally before releasing it more widely. Canary deployment, named after the canaries miners once carried to detect dangerous gas in coal mines, means releasing an update to a small, controlled group first and watching for problems before shipping it to the broader customer base.
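The staged approach Mellen describes can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not CrowdStrike's actual pipeline: the stage names, the `staged_rollout` function, the `deploy` and `crash_rate` callbacks and the 1% crash-rate threshold are all assumptions made for the example. Each stage (internal dogfooding hosts first, then a small canary group, then everyone else) must pass a health check before the update moves on to the next, larger group.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class RolloutResult:
    # Names of the stages that received the update, in order.
    deployed_stages: List[str] = field(default_factory=list)
    # Name of the stage whose health check failed, or None if all passed.
    halted_at: Optional[str] = None


def staged_rollout(
    stages: Sequence[Tuple[str, Sequence[str]]],
    deploy: Callable[[str], None],
    crash_rate: Callable[[Sequence[str]], float],
    max_crash_rate: float = 0.01,  # hypothetical threshold: halt above 1%
) -> RolloutResult:
    """Hypothetical staged (canary) rollout: deploy stage by stage,
    stopping as soon as a stage's observed crash rate is too high."""
    result = RolloutResult()
    for name, hosts in stages:
        for host in hosts:
            deploy(host)
        result.deployed_stages.append(name)
        # Gate: only move on to the next, larger stage if this one is healthy.
        if crash_rate(hosts) > max_crash_rate:
            result.halted_at = name
            return result
    return result


if __name__ == "__main__":
    # Hypothetical stages: internal dogfooding, a small canary group, then everyone.
    stages = [
        ("internal", ["dogfood-1", "dogfood-2"]),
        ("canary", ["cust-1", "cust-2", "cust-3"]),
        ("general", [f"cust-{i}" for i in range(4, 100)]),
    ]
    updated: List[str] = []
    # Simulate a defective update: every host that receives it crashes.
    result = staged_rollout(stages, deploy=updated.append, crash_rate=lambda hosts: 1.0)
    print(result.halted_at, len(updated))  # internal 2
```

In this sketch, an update that crashes every machine it touches is stopped after reaching only the two internal hosts instead of the whole fleet, which is the point of putting dogfooding and canary stages in front of a general release.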
“What this means is they found a previously unknown bug in their testing infrastructure,” said Beth Linker, senior director of product management at Synopsys Software Integrity Group, via email. “The testing infrastructure is also code and can also have bugs.”
According to the CrowdStrike report, the company issues two kinds of content: sensor content, which ships with product releases, and rapid response content, which consists of interim updates designed to respond to changes in adversary behavior.
Linker explained that CrowdStrike is trying to respond to those changes in adversary behavior in real time, and that the report points to a previously unknown flaw in the company's own testing tools.
“A bug in your testing tools can mean that you miss a problem and find it in the field,” Linker said. “In this case, it was a catastrophic problem, but this kind of thing happens on a much smaller scale all the time.”