EP18 - The Déjà Vu Disaster. When Will We Learn?

August 1, 2024 | 7 min Read | Originally published at www.linkedin.com

This post is available in audio format

Intro

Welcome back Digital Warriors! ⚔️

Buckle up, because today we’re taking another deep dive into the catastrophe that shook the cyber world to its core. The CrowdStrike Falcon outage wasn’t just a glitch; it was a colossal failure that left millions scrambling.

And guess what? The same guy at the helm, George Kurtz, was behind a similar meltdown at McAfee in 2010 😱.

Talk about not learning from past mistakes!

Let’s break it down and figure out why the industry keeps falling into the same traps. 🤦🏻

🗓️ 2010 - McAfee: A Cautionary Tale of a Repeated Disaster

Back in 2010, McAfee, under the leadership of then-CTO George Kurtz, faced a massive global outage due to a faulty antivirus update. This update wrongly identified a core Windows file as a virus, causing countless Windows XP machines worldwide to crash and enter a reboot loop. The incident led to widespread disruption across corporate and public sectors, damaging McAfee’s reputation significantly.

Despite this, Intel saw an opportunity (article from 2010 - Forbes, NY Times) amidst the chaos. Just a few months after the incident, in August 2010, Intel acquired McAfee. Before Intel swooped in with a $7.68 billion lifeline, McAfee was reeling from its disastrous 2010 outage, which tanked its reputation and left its stock struggling. Intel’s strategic buy was more than just a business deal—it was a rescue mission that likely saved McAfee from a financial nosedive, while also positioning Intel to embed security deep into their hardware as the world braced for the Internet of Things revolution.

The 2010 disaster should have been a wake-up call. Yet, here we are in 2024, facing another global outage under the same leadership. It’s alarming that the lessons from McAfee’s downfall seem to have gone unlearned.

Product Testing – there was INADEQUATE coverage of Product and Operating System combinations in the test systems used. Specifically, XP SP3 with VSE 8.7 was not included in the test configuration at the time of release.
[22 Apr 2010 - McAfee PostOutage FAQ]
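To make the coverage gap concrete, here is a minimal sketch of a release gate that refuses to ship while any product and operating system pairing is untested. The function name and the "other ..." entries are hypothetical placeholders for illustration, not McAfee’s real support matrix or tooling; only the XP SP3 with VSE 8.7 pairing comes from the FAQ quoted above.

```python
from itertools import product


def untested_combinations(products, systems, tested):
    """Return every (product, OS) pair that no test run has covered."""
    required = set(product(products, systems))
    return sorted(required - set(tested))


# Hypothetical support matrix, for illustration only.
PRODUCTS = ["VSE 8.7", "other VSE release"]
SYSTEMS = ["XP SP3", "other Windows release"]

# Pairs that were exercised before release in this made-up scenario.
TESTED = [
    ("VSE 8.7", "other Windows release"),
    ("other VSE release", "XP SP3"),
    ("other VSE release", "other Windows release"),
]

gaps = untested_combinations(PRODUCTS, SYSTEMS, TESTED)
if gaps:
    # A real pipeline would fail the build here instead of printing.
    print("Release blocked, untested combinations:", gaps)
```

The point of the design is that the required matrix is generated from the supported lists rather than maintained by hand, so a newly supported product or OS version shows up as a gap automatically.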

Despite the Herculean effort of 7,000 McAfee employees to fix the 2010 meltdown, the real culprit wasn’t a technical glitch—it was a leadership failure fueled by reckless cost-cutting and the neglect of essential software engineering practices.

This disaster is a stark reminder:

Without disciplined engineering, even a minor update can unleash global chaos. The absence of solid IT foundations isn’t just a slip-up; it’s a ticking time bomb, threatening to trigger even bigger catastrophes in the future.

A Lesson Not Learned

Let’s talk about George Kurtz. How does someone rise from CTO to CEO without mastering the basics of technical oversight? It’s alarming because it reveals a troubling disconnect between CXOs and the gritty technical details of their decisions. Sure, bridging boardroom demands with pragmatic strategies is crucial, but without a deep understanding of their company’s business domain and the technical implications in a rapidly evolving industry, leaders are bound to misstep.

In the case of McAfee and now CrowdStrike, this disconnect is glaring. Kurtz’s rise to the top seems to have bypassed the need for technical mastery—something that’s evidently lacking in both his leadership tenures. Even in CrowdStrike’s public apology, it’s clear how little understanding (or care) there is of the real-world consequences of their decisions:

“We quickly identified the issue and deployed a fix, allowing us to focus diligently on restoring customer systems as our highest priority.”

But fixing a broken OS that crashes due to a poorly managed kernel-level update is not just about diligent focus—it requires far more than a basic rollback. The reality? Sysadmin teams are working around the clock to manually restore systems, with scenes of long queues of laptops waiting for a safe mode fix flooding the web.

CrowdStrike’s report on July 29 claims 99% of machines are back online. Yet that still leaves 85,000 devices in manual recovery, a staggering task that would stack up to a 1.7 km tall pile of laptops.
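For anyone who wants to check the numbers, here is a quick back-of-the-envelope sketch. The function names are mine, and the laptop thickness is not a measured value: it is simply what the 1.7 km figure implies.

```python
def implied_total_devices(remaining, recovered_percent):
    """Total fleet size if `remaining` devices are the share not yet recovered."""
    return remaining * 100 // (100 - recovered_percent)


def implied_thickness_cm(stack_height_m, devices):
    """Average thickness per device, in centimetres, for a stack of this height."""
    return stack_height_m * 100 / devices


REMAINING = 85_000       # devices still in manual recovery
RECOVERED_PERCENT = 99   # share reported back online
STACK_HEIGHT_M = 1_700   # the 1.7 km pile

print(implied_total_devices(REMAINING, RECOVERED_PERCENT))
print(implied_thickness_cm(STACK_HEIGHT_M, REMAINING))
```

In other words, the remaining 1% points to a fleet of about 8.5 million affected devices, and the pile works out at roughly 2 cm per closed laptop.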

When DevOps Meets Deaf Ears

It’s baffling that 14 years after the McAfee debacle—and with all the advancements in DevOps—we’re still witnessing such a blatant example of technical negligence. This isn’t just a failure of technology.

It’s a failure of leadership to understand and value the critical importance of software engineering.

And this problem isn’t limited to crisis management. As AI becomes a standard tool in the workplace, we’re seeing a similar disconnect. A recent study (Forbes) shows that while 96% of C-suite leaders expect AI to boost productivity, 77% of employees report that AI has actually increased their workloads and hampered productivity. This disconnect is yet another example of leadership chasing buzzwords without understanding the real-world implications on their workforce.

This oversight is inexcusable in a world where IT isn’t just a support function but the backbone of global operations. It’s time for leaders to stop following the flow and start mastering the technical details that drive their businesses. The era of blind leadership needs to end—before it leads us into even bigger disasters.

As a CEO and executive coach, I understand your pain, my dear Digital Warriors. Mainstream media is missing the mark by focusing on the financial aftermath of the CrowdStrike outage rather than the deeper issues at play. By highlighting the $5.4 billion in losses without questioning the systemic failures in software engineering, they perpetuate a dangerous ignorance. This surface-level reporting keeps the public from understanding the critical need for robust IT practices and the real reasons behind these failures, and it blocks any meaningful discussion of how to truly prevent such disasters in the future.

The Market’s Talent Crisis

The industry is grappling with a severe shortage of true software engineers. What we have instead is a market flooded with so-called developers—often underpaid, offshored workers, who are more like keyboard jockeys than engineers, relying heavily on AI tools to patch up the gaps. This dilution of expertise is eroding the foundational understanding of software engineering principles that are critical to building robust IT systems. Without professionals who truly grasp the complexities of these systems, businesses are left dangerously exposed, relying on shallow talent pools that simply don’t have the depth needed to prevent massive failures like the one we’ve just witnessed.

We’ve fallen far from the days of NASA’s Apollo engineers, now settling for underqualified and underpaid developers all in the name of saving a few bucks. This short-sighted, cost-cutting mindset is a ticking time bomb.

I say this because, have you ever noticed how tech conferences these days look more like an elder care home than a vibrant hub of young innovation? The software engineering industry is sorely missing fresh blood. Where is the new generation of software engineers?

It’s no surprise, because in these hard times CXOs are more focused on slashing budgets than on nurturing talent. With an average training budget of just $1,207 per employee, the task of upskilling underpaid developers into true software engineers is Herculean, if not impossible. This budget is barely a drop in the ocean compared to the vast skills gap that needs to be bridged. Companies are left to grapple with the monumental challenge of elevating basic coding skills to the level of engineering expertise, and they are trying to do it all on a shoestring.

Call to Action

Leaders, it’s time to get serious. Stop treating IT as a mere cost and start seeing it as the invaluable asset it is. Invest in proper software engineering practices, hire skilled professionals, and foster a culture of continuous improvement. Let’s prevent the next big outage before it happens.

Join us: give the Unicorns Ecosystem a try, or one of the other technical excellence programs that many of my fellow IT gurus have created, and let’s evolve your organization toward digital excellence. Together, we can harness the true power of IT and software engineering. Don’t wait for the next disaster. Act now!

Stay tuned for the next episode, where we’ll delve deeper into the components of the Unicorns’ Ecosystem following up on how the organizational assessment is working. Until next time, keep evolving! 🌟

To stay in the loop with all our updates, be sure to subscribe to our newsletter 📩 and podcast channels 🎧:

📰 LinkedIn

🎥 YouTube

📻 Spotify

📻 Apple Podcast

Michele Brissoni

Visionary Digital Evolution Strategist

Rooted in Formula 1 excellence, with over 30 years in IT starting as a child in the 1980s, …