Technology executives reassessed their IT operational resilience in the wake of a global wave of costly systems outages caused by a faulty CrowdStrike security update in July. Most were not happy with what they found, according to a survey of 1,000 senior cloud architects and engineering executives conducted by Cockroach Labs and Wakefield Research in August and September.
More than 9 in 10 respondents said they were aware of operational weaknesses within their organization that leave IT systems vulnerable to costly service interruptions. Nearly half acknowledged they hadn’t done enough to improve resilience.
Every company surveyed reported revenue losses from outages in the past year.
“IT outages are pervasive,” Spencer Kimball, CEO of Cockroach Labs, told CIO Dive. “But the CrowdStrike issue was just so blatant and so preventable that people realized they have blind spots when it comes to critical vulnerabilities.”
The CrowdStrike event caught executives by surprise. Although it was live for less than two hours, the update brought down millions of Windows-based systems, grinding operations to a near halt at major airlines and interrupting banking functions globally as technology teams scrambled to respond.
CrowdStrike’s broad reach across continents and industries amplified the disruptive impact of the outage. Images of stranded passengers staring at error messages on airport monitors drove home the cost.
“When you make things really large, whatever can go wrong does go wrong 100% of the time,” Kimball said. “You can't run something at scale and not be prepared to have machines, power systems and networking equipment fail — sometimes it’s a backhoe accidentally cutting into a fiber-optic cable that brings things down.”
Stress tests
IT snafus are endemic and persistent. Companies experience an average of 86 outages annually, and more than half report weekly service disruptions, the report found. The average recovery time was 196 minutes, or more than three hours.
“That’s a lot of lost productivity and a lot of stress on the engineers who have the pagers and have to do the postmortems,” Kimball said.
For a geographically dispersed operation, the challenges are manifold.
United Airlines dispatched teams to hundreds of airport locations to reboot more than 26,000 Windows devices in the days following the CrowdStrike outage, which hit in the early morning hours of Friday, July 19. The effort required staff to drive over the weekend to sites that lacked field support, CIO Jason Birnbaum told CIO Dive.
United’s response, which restored operations within four days but could not prevent nearly 1,500 flight cancellations, is not uncommon.
Cockroach Labs found more than 9 in 10 companies have to set aside essential work to address unplanned outages. Two-thirds of respondents reported deprioritizing everyday IT maintenance and administrative tasks as a result of disruptions, a practice that can lead to larger problems and mounting costs when future outages hit.
Lack of funding for strategic planning to prevent outages amounts to rolling the dice and puts IT teams in a tenuous position. If they can’t keep systems up and running, jobs can be on the line, Kimball said.
More than one-third of respondents said budget constraints held back preparedness initiatives, and 4 in 5 expressed concern that a significant outage or downtime event would jeopardize their jobs.
Financial repercussions
Outage costs vary, based on the scope and severity of an incident and the preparedness of the organization. The companies surveyed by Cockroach Labs reported losses ranging from $10,000 for a limited incident to over $1 million for larger disruptions.
A similar survey of 1,700 technology professionals conducted by New Relic several months prior to the CrowdStrike event found outages can cost up to $1.9 million per hour.
The same incident can have disparate impacts even within the same industry.
Delta Air Lines, one of the hardest-hit domestic carriers in the days following the July event, put the CrowdStrike price tag at $500 million. The airline is seeking to recover that amount from CrowdStrike through the courts. CrowdStrike responded with a countersuit last month, pushing the responsibility back onto Delta.
In contrast, United did not report specific losses from the July disruption. Weather and other unexpected events are common enough in the aviation industry that the company bakes operational setbacks into its quarterly guidance, CFO Mike Leskinen said during an October earnings call.
Scott Kirby, United Airlines CEO, elaborated on the company’s “no excuses” philosophy. “It’s easy to have an MBA in a cubicle somewhere come in at 9 a.m. on Monday in an air-conditioned office and calculate how much some event outside of your control cost,” he said during the earnings call. “If you have a no excuses mantra, and you don’t allow people to even go calculate those numbers, it forces people to go find innovation.”
Most companies aren’t ready to absorb the impact of a major outage, according to Cockroach Labs. Just one-fifth of its survey respondents said their organization is fully prepared for such events and only one-third have a full response plan.
“The best companies have a long-term view on a constant and really determined evolution of their IT practices and resilience,” Kimball said.