EP17 - How a Tiny Bug Took Down the World

July 25, 2024 | 6 min Read | Originally published at www.linkedin.com



🗓️ Friday 19th of July 2024 - The Wake-Up Call day

Imagine waking up to a world where a single software glitch causes economic losses of $10 billion, with $5.4 billion lost by Fortune 500 companies alone. Unfortunately, this is reality, not a nightmare. This exact scenario unfolded with the CrowdStrike outage, impacting businesses globally. This disaster isn’t just a wake-up call for IT departments but a siren for all industrial leaders. The damage is likely underestimated. IT’s influence on our global economy, political stability, and safety is colossal. We’re in the era of Industry 5.0, with many still catching up to Industry 4.0 or even 3.0 paradigms. IT forms the backbone of every industrial and civil sector.

[Unicorns' Ecosystem Component Diagram and related deployment sequence](https://knowhow.distrelec.com/manufacturing/is-your-business-ready-for-industry-5-0/)

Just consider the catastrophic potential: a bug like this could lead to planes going off course, nuclear reactors malfunctioning, or critical infrastructure collapsing. The way we create software, middleware, and hardware has never been more critical.

The CrowdStrike incident is a stark reminder that robust engineering practices, especially in software development, are no longer negotiable!

The Tech Meltdown Nobody Saw Coming

Picture this: it’s a regular Friday, and suddenly, Windows machines globally start dropping like flies. 💥 Cue the dreaded Blue Screen of Death 💀 (BSOD). The culprit? A faulty update to CrowdStrike’s Falcon Sensor. An undetected error in an InterProcessCommunication (IPC) Template Instance led to an out-of-bounds memory read, and that read crashed the system. This wasn’t just a minor oops. It was a global meltdown that required manual intervention! Millions of machines needed restarting and fixing manually, like in the old days. 🛠️ CrowdStrike updated us in real time, but as an engineer, I was disappointed with the lack of technical details. I get that they need to protect their IP, but come on, we’re talking about a global disaster here! How did it get to this point? Why wasn’t it noticed? Where’s the guilty code and the automated tests? 🤔
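
To make the failure class concrete, here is a minimal sketch of what "validate before you read" looks like. It is not CrowdStrike’s code, which has not been published. The names `TemplateInstance`, `validate`, and `evaluate` are hypothetical, and I am assuming a content record that refers to input fields by index.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical content record: each entry refers to an input field by index."""
    field_indexes: tuple[int, ...]


class ContentValidationError(ValueError):
    """Raised when content refers to data the consumer cannot supply."""


def validate(instance: TemplateInstance, supplied_inputs: int) -> None:
    """Reject the instance if any index falls outside the supplied inputs."""
    for index in instance.field_indexes:
        # The lower bound matters too: Python would silently wrap a negative index.
        if not 0 <= index < supplied_inputs:
            raise ContentValidationError(
                f"field index {index} is outside the {supplied_inputs} supplied inputs"
            )


def evaluate(instance: TemplateInstance, inputs: list[str]) -> list[str]:
    """Read the referenced fields, validating first so no read goes out of bounds."""
    validate(instance, len(inputs))
    return [inputs[i] for i in instance.field_indexes]
```

In a memory-safe language the unchecked read raises an exception. In kernel-mode C or C++ the same mistake reads whatever sits past the end of the array, which is how a content file can bring down the whole machine. The check is a few lines. The hard part is the discipline of running it, and testing it, before anything ships.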

The story and tech details we can gather seem designed to keep people less alarmed than they should be. One of the principles of science and engineering is the replicability of an experiment. Without all the technical details of such a disaster, how can we be sure the remedy was found and that nothing worse will happen in the future? We simply can’t. So, we either choose to sleep soundly until the next global disaster, or we start rolling up our sleeves and changing the industry into a more engineered sector.

We need fewer “software developers” and more “software engineers” who know both principles of engineering and craftsmanship. 🚀

The Plot Thickens - My Personal Analysis

Lately, everyone has turned into a CSI detective, pointing fingers at CrowdStrike. I won’t regurgitate the wild theories floating around. Instead, I’ll offer my perspective, built on over three decades in software engineering, including high-performance environments like Formula 1 and MotoGP.

From their public GitHub repositories and my Key Behavioral Indicators analysis, it seems that CrowdStrike’s dev team skips crucial engineering steps. Proper Test-Driven Development (TDD) is non-existent. They seem to patch code rather than write tests first, creating a house of cards. Their tests barely pass coverage checks, catching only “trivial” issues. Major engineering practices like conventional commits, semantic releases, and mutation testing, as well as fundamentals like a proper testing pyramid, are missing.
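
For readers less familiar with these practices, here is a minimal sketch of what test-first looks like, using Python’s standard `unittest`. The `in_bounds` helper is hypothetical and invented for this illustration. In a TDD workflow the test cases are written before it exists. Each boundary test is also the kind of check a mutation-testing tool looks for: flip one operator in the code and a test must fail.

```python
import unittest


def in_bounds(index: int, length: int) -> bool:
    """Return True only when index addresses an existing element."""
    return 0 <= index < length


class InBoundsTest(unittest.TestCase):
    """Boundary cases, written first in a test-driven workflow."""

    def test_first_and_last_elements_are_valid(self):
        self.assertTrue(in_bounds(0, 3))
        self.assertTrue(in_bounds(2, 3))

    def test_index_equal_to_length_is_rejected(self):
        # Fails if a mutation turns `<` into `<=`.
        self.assertFalse(in_bounds(3, 3))

    def test_negative_index_is_rejected(self):
        # Fails if a mutation drops the `0 <=` lower bound.
        self.assertFalse(in_bounds(-1, 3))

    def test_empty_sequence_has_no_valid_index(self):
        self.assertFalse(in_bounds(0, 0))
```

Saved as a module, this runs with `python -m unittest`. A suite that only exercised `in_bounds(1, 3)` would reach full line coverage and still let every one of those mutations through, which is exactly the gap between passing a coverage check and actually testing.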

Acceptance Test-Driven Development (ATDD) and Behavior-Driven Development (BDD)? Nowhere to be found. These practices ensure software meets real-world needs and behaves correctly. Without them, updates are ticking time bombs.

In high-performance environments, these practices are fundamental for reliable and high-performing products. Blaming developers for such disasters is shortsighted. The real issue is systemic—a sick software industry treating development as a cost rather than an essential investment. If this status quo persists, we risk facing catastrophes far worse than rebooting Windows machines or flight cancellations.

CrowdStrike’s code reflects a broader industry problem that needs addressing before we face even more severe consequences.

The Real Villain: Global Lack of SW Engineering Discipline

No Test-Driven Development, lipstick DevOps, no refactoring, no clean code principles. Just quick fixes piled on, leading to fragile systems ready to collapse at the first hard hit. This sloppy approach is the real problem in the majority of companies. It’s an industry-wide issue. The rush to deliver at the expense of quality leads to disasters like the one we just faced. This wasn’t just a bad day for CrowdStrike. Airlines, banks, retailers, and even law enforcement were hit hard. The bug fix, meant to be a quick patch, turned into a chain reaction of new bugs and issues. In the interconnected world of Industry 5.0, where IT integrates with human-centric approaches, this structural quality problem can lead to economic disasters worse than COVID!

Technology is evolving at hyperbolic speed. Companies, trying to keep up, take shortcuts, blind to the risks they’re inviting. My perspective aligns with that of many industry leaders. These three feel particularly close to mine:

Kathryn Guarini, former IBM CIO and colleague, in her article “A Tech Crisis,” emphasizes the importance of crisis management and robust software practices, one of the critical aspects we worked on together in the blue days. She stresses enterprise risk management and chaos engineering for resilient systems.
Gergely Orosz, a prominent tech leader and author of “The Pragmatic Engineer,” in his article “The Biggest Ever Global Outage: Lessons,” attributes the crisis to poor software engineering practices, highlighting the need for thorough testing, continuous integration, and better DevOps practices. This aligns perfectly with my point of view.
Dave Farley, co-author of “Continuous Delivery,” is a stalwart in the software engineering community. His work emphasizes the importance of robust engineering practices, continuous integration, and deployment pipelines. Farley’s insights in his video “Software’s HUGE Impact On The World | Crowdstrike Global IT Outage” underscore the need for a shift towards a more reliable and sustainable approach to software development.

This fiasco is a wake-up call. In a hyper-connected world reliant on robust IT, we can’t afford shortcuts. Proper engineering practices aren’t optional. They’re essential.

Leaders, We’ve Got a First-Aid Kit for You

Next time you hear about a tech meltdown, remember: it’s not just bad luck. It’s a sign of deeper issues in how we build and maintain our software. Learn from CrowdStrike’s crash and commit to better practices. Their stock fell 30% in the blink of an eye, even before legal trials began. Can your company survive that? Probably not. Is your risk management policy sustainable or just giving you headaches?

CRWD stock depreciated by roughly 30% after the incident.

Stay sharp, and empower your IT departments to code smart with our BriX Consulting Unicorns Ecosystem. Knowledge is power, and now you know. Your IT departments are unequipped to handle this hyper-connected IT world. Your governance and risk management aren’t ready to navigate such crises. Your products can suddenly become faulty, leading you to shut down.

Don’t be foolish enough to continue with sloppy software engineering in your company!

Jump onboard the BriX Consulting Unicorns Ecosystem, and together, we’ll evolve your organization toward digital excellence! 🚀


Michele Brissoni

Visionary Digital Evolution Strategist

Rooted in Formula 1 excellence, with over 30 years in IT starting as a child in the 1980s, …