When a faulty CrowdStrike security update crashed millions of Windows-based systems globally earlier this year, organizations learned a quick and valuable lesson about IT preparedness and resiliency planning.
The July 19 incident grounded thousands of commercial flights, disrupted financial transactions and ultimately cost Fortune 500 companies more than $5 billion in estimated direct losses. The fallout forced many companies to reassess their capacity to respond to and recover from unexpected IT outages.
As organizations uncovered weaknesses, they overhauled software engineering practices and invested in people, processes and technology, according to an Adaptavist report. The technology company surveyed 400 software developers at enterprises based in the U.S., U.K. and Germany.
“If you were affected, all of your processes got stressed and tested about as much as they could be,” Jon Mort, CTO of The Adaptavist Group, said. “It was so widespread and painful for so many people that it made a theoretical risk very real.”
Nearly 9 in 10 respondents said their organizations were impacted by the outage and more than one-third experienced severe disruptions lasting longer than one day, Adaptavist found.
The repercussions reverberated beyond the walls of IT shops, impacting executives and reinforcing the need for investments in preventative measures, said Mort, who likened the incident to a “call to arms” for the software industry.
“It was so much in the public consciousness,” he said. “You had executives whose flights were delayed or cancelled and that made the business case for IT resilience.”
The airline industry was particularly hard hit, but some carriers recovered faster than others. Delta Air Lines suffered the biggest blow, cancelling more than 5,000 flights over the five-day period it took the carrier to recover.
In comparison, American Airlines and United Airlines were able to quickly mobilize IT teams and keep flight cancellations to around 800 and 1,500, respectively, in the days following the event.
“Within an hour of the outage, we assembled the right operating teams and IT experts to develop and execute a plan to get our systems back online and the aircraft moving again,” American Airlines COO David Seymour said during a July earnings call.
Rebuilding resilience
The outage storm had a silver lining. In the days and weeks that followed, organizations scrutinized recovery plans and most didn’t like what they saw, according to a Cockroach Labs survey of 1,000 cloud architects and engineering executives conducted by Wakefield Research in August and September.
More than 9 in 10 respondents identified operational weaknesses, but nearly half acknowledged they had yet to address them.
Adaptavist found a majority of organizations have begun adopting more robust software development processes and cultivating stronger security awareness among staff in response to the July crisis.
One-third of organizations have already changed their software updating procedures, according to the survey. And nearly 9 in 10 respondents said their organization plans to boost investments in cyber and incident response training related to IT outage risks.
Automated software updates are a common vendor practice and customer convenience. CrowdStrike’s Falcon sensor software configuration update took just 78 minutes to do its damage, the company said in a securities filing just three days after the incident.
CrowdStrike’s response, which gave affected customers the information they needed to recover from the incident, helped restore customer confidence, according to Mort. “The transparency prevented an erosion of trust that would have occurred otherwise,” he said.
The company retained 97% of its customers, CEO George Kurtz said during a November earnings call.
While an automated update was the Achilles’ heel that brought down enterprise Windows systems in July, it would be a mistake for enterprises to eschew updates, Mort warned.
“That would be the wrong reaction,” Mort said. “You need to be applying security patches and updates — but you need to be doing it safely.”
CIOs can steal a page from the CISO handbook and run regular incident simulations to build resilience across IT systems and processes. They should also make sure they’ve got full recovery plans in place, Mort said.
“If you’ve got backups with no restore, all you’ve got is hope, and hope’s not a strategy,” he said.
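What testing a restore might look like in practice will vary by organization. As a minimal sketch, and not something drawn from Mort or the Adaptavist report, the Python below unpacks a backup archive into a scratch directory and compares every file against the live copy by checksum. The function names `sha256_of` and `restore_drill` are hypothetical, and the sketch assumes the backup is a single archive of a directory tree that is small enough to hash in memory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(source_dir: Path, backup_archive: Path) -> bool:
    """Hypothetical restore test: unpack the archive into a scratch
    directory and check that every file in source_dir was restored
    with identical contents. Returns True only if all files match."""
    with tempfile.TemporaryDirectory() as scratch:
        # Restore somewhere disposable, never over the live data.
        shutil.unpack_archive(str(backup_archive), scratch)
        restored = Path(scratch)
        for original in source_dir.rglob("*"):
            if not original.is_file():
                continue
            copy = restored / original.relative_to(source_dir)
            # A missing or altered file means the restore cannot be trusted.
            if not copy.is_file() or sha256_of(copy) != sha256_of(original):
                return False
    return True
```

Run on a schedule, a drill like this would flag a backup that no longer matches the data it is meant to protect. A real recovery test would also need to cover databases, credentials and the time it takes to bring systems back online.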