On July 19th, 2024, what came to be considered the largest IT malfunction in history brought down approximately 8.5 million Windows devices and systems running critical operations.
The CrowdStrike outage grounded thousands of flights worldwide and brought public transit systems in multiple U.S. cities to a halt.
Online healthcare systems in the UK were affected, as well as 911 emergency services in the U.S.
It impacted financial services around the world, and took multiple media and broadcast outlets off the air. The outage is also still costing the company, which posted a USD$92.3 million net loss for its fourth quarter and expects another USD$73 million in costs from “incident-related expenses.”
The incident occurred because of a flawed software configuration update that rendered millions of computer endpoints inoperable.
Overall, the CrowdStrike outage serves as a critical reminder of the importance of advanced proactive monitoring solutions, rigorous testing systems, duplication and backup networks, and ongoing contingency planning to mitigate widespread disruptions like this.
In this article, we’ll discuss in detail the top five priorities for every organization that wants to keep its IT infrastructure resilient.
Priority 1: Proactive Monitoring & Alerting
Having proactive monitoring, testing and alerting systems in place is the most effective way to help prevent IT outages.
By continuously tracking system performance, you can detect problems like hardware failures, software configuration errors, network congestion and application anomalies, and address them before they cause an outage.
Implementation
But it’s not as simple as investing in an off-the-shelf monitoring system and hoping it will meet all your needs.
The first step is to clearly define what you want to achieve and how you will measure it. Every organization has a different IT infrastructure, with different monitoring goals, so it’s important that these goals align with update objectives, system requirements and stakeholder expectations.
For example, your primary objective might be to monitor the impact of system updates on your system's functionality. Or it could be that you want to monitor performance, security, usability, or user satisfaction.
You should also define clear criteria and thresholds for alerting, as well as the roles and responsibilities for monitoring and reporting.
Once you’ve planned your monitoring strategy, prepare your system by configuring your monitoring tools, setting up parameters, and testing that they work. Monitoring should then be ongoing: before, during, and after updates.
Compare the resulting data with predefined criteria and thresholds, to identify any deviations or anomalies. This enables you to evaluate the effectiveness and efficiency of every activity and identify any gaps.
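As a minimal sketch of that comparison step, the Python below checks a set of collected metrics against predefined warning and critical thresholds. The metric names, threshold values and the `evaluate` helper are all hypothetical and chosen only for illustration; a real deployment would take them from your own monitoring goals and tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Alerting criteria for one metric (all values here are illustrative)."""
    metric: str
    warn: float
    critical: float


def evaluate(samples: dict[str, float], thresholds: list[Threshold]) -> dict[str, str]:
    """Return 'ok', 'warn', 'critical' or 'missing' for each monitored metric."""
    results: dict[str, str] = {}
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            # A metric that stops reporting is itself worth flagging.
            results[t.metric] = "missing"
        elif value >= t.critical:
            results[t.metric] = "critical"
        elif value >= t.warn:
            results[t.metric] = "warn"
        else:
            results[t.metric] = "ok"
    return results


# Hypothetical thresholds agreed during planning.
THRESHOLDS = [
    Threshold("cpu_percent", warn=75.0, critical=90.0),
    Threshold("error_rate_percent", warn=1.0, critical=5.0),
    Threshold("p95_latency_ms", warn=300.0, critical=800.0),
]

# Hypothetical readings taken before and after an update.
before_update = {"cpu_percent": 42.0, "error_rate_percent": 0.2, "p95_latency_ms": 180.0}
after_update = {"cpu_percent": 81.5, "error_rate_percent": 6.3}

status_before = evaluate(before_update, THRESHOLDS)
status_after = evaluate(after_update, THRESHOLDS)
```

Running the same check before and after an update makes deviations easy to spot. In this made-up data, the error rate crosses its critical threshold after the update and the latency metric stops reporting altogether.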
Systems change and evolve over time, so it’s vital that your monitoring solutions can adjust to future needs. By tracking and regularly analyzing the trends and patterns of your system's performance, security, usability, and user feedback, you can spot changes or opportunities for improvement.
Cloud-based tools, like IR's testing capabilities, continuously monitor and interact with your UC and contact center solutions, providing real-time data that helps you quickly identify and address problems.
Priority 2: Automated Response Protocols
Automated Response Protocols (ARPs) are another way to significantly help prevent IT outages by using data to quickly initiate corrective actions like restarting services, re-routing traffic, or notifying relevant personnel when problems are detected.
By automating repetitive tasks like system backups, configuration changes, and patch management, you reduce the risk of human error causing outages and unwanted downtime.
The problem is that even though there’s a massive volume of data generated by enterprise organizations, not all of it is actionable, and many data-driven alerts are false alarms. Purpose-built automation tools with machine learning capabilities can help by analyzing the right data patterns to predict and prevent potential issues.
How ARPs help prevent IT outages:
- ARPs can monitor system performance and identify potential issues like resource exhaustion, network anomalies, or failing hardware at their earliest stages, enabling early intervention.
- ARPs can automatically trigger pre-defined actions in case of an incident, like restarting services, re-routing traffic, or scaling resources, reducing manual intervention.
- ARPs can schedule preventative maintenance tasks like software updates or system backups to help alleviate the threat of unexpected outages.
Implementation
Start by defining clear rules that specify which conditions trigger an automated response and which action each condition should invoke. Well-defined rules are crucial to avoid unnecessary interventions.
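One way to express such a rule is sketched below in Python. The `Rule` class, the action name and the "three consecutive breaches" policy are assumptions made for illustration, not a description of any particular product. Requiring several consecutive breaches before acting is one simple way to avoid reacting to a single noisy reading.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Rule:
    """A hypothetical automated-response rule: condition -> action."""
    name: str
    condition: Callable[[dict], bool]
    action: str
    required_breaches: int = 3  # consecutive breaches needed before acting
    _streak: int = field(default=0, init=False)

    def observe(self, metrics: dict) -> Optional[str]:
        """Feed one reading; return the action name when the rule fires."""
        if self.condition(metrics):
            self._streak += 1
        else:
            self._streak = 0  # a healthy reading resets the count
        # Fire exactly once, at the moment the streak reaches the limit.
        if self._streak == self.required_breaches:
            return self.action
        return None


rule = Rule(
    name="restart-on-memory-pressure",
    condition=lambda m: m["memory_percent"] > 95,
    action="restart_service",
)

# Illustrative readings: a short spike, a recovery, then a sustained breach.
readings = [96, 97, 90, 96, 97, 98, 99]
actions = [rule.observe({"memory_percent": r}) for r in readings]
```

In this example the first spike is ignored because it does not last, and the restart is triggered once on the sixth reading, when the third consecutive breach is observed.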
Regularly testing ARP functionality and monitoring system behavior is essential to ensure that it’s working as intended and catching potential issues without raising too many false alarms.
Integrating ARPs with an incident management system reinforces the monitoring of network activity and the triggering of alerts. This is particularly important given the growth and evolution of cyber threats, and incidents like the CrowdStrike disaster.
Priority 3: Redundancy & Failover Systems
Redundancy is about maintaining backup systems and duplicating critical infrastructure components to ensure continuous operation during an outage. It involves having multiple copies of vital IT components such as servers, network connections, or data.
Failover is the process of automatically switching to a backup system when the primary system fails. This minimizes downtime, so users experience minimal interruption during an unexpected outage.
It ensures business continuity even during system failures, protecting against revenue loss and reputational damage.
Implementation
To implement redundancy and failover systems, first identify the critical components within your IT infrastructure, then replicate these components across multiple systems. Set up automated failover mechanisms to seamlessly switch to a backup system in case of a major failure.
Test these mechanisms regularly, including through drills, to ensure that everything functions in the case of a real outage.
Data replication strategies should be put in place to maintain synchronized copies of critical data across multiple systems.
Use load balancers to distribute traffic evenly across multiple servers and set up failover clusters where multiple servers work together so that they can automatically switch to backup when the primary server fails.
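The load-balancing and failover idea can be sketched in a few lines of Python. This is a simplified, in-memory illustration: the `Server` and `Pool` classes and the server names are hypothetical, and the `healthy` flag stands in for whatever health check your real load balancer performs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True  # stand-in for the result of a real health check


class Pool:
    """Round-robin over servers, skipping any that are marked unhealthy."""

    def __init__(self, servers: list[Server]) -> None:
        self.servers = list(servers)
        self._next = 0

    def pick(self) -> Server:
        # Try each server at most once per request.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server.healthy:
                return server
        raise RuntimeError("no healthy servers available")


pool = Pool([Server("app-1"), Server("app-2"), Server("app-3")])

# With all servers healthy, traffic is spread evenly.
normal = [pool.pick().name for _ in range(3)]

# Simulate a failure of app-2: requests fail over to the remaining servers.
pool.servers[1].healthy = False
degraded = [pool.pick().name for _ in range(4)]
```

When every server is down, the sketch raises an error instead of routing traffic to a failed node. In practice that is the point at which alerting and a disaster recovery plan take over.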
Ensure that there are backup power generators in place to provide power to critical systems even during a power outage.
Consider using first-hop redundancy protocols like HSRP or VRRP to maintain network connectivity even if a router fails, and multiple network connections to different providers for emergency paths.
Priority 4: Regular Software Updates & Patch Management
Keep systems up to date with the latest patches and security updates to improve overall system stability and prevent vulnerabilities that could lead to failures.
Regular software updates and patch management can prevent IT outages by addressing security vulnerabilities that hackers can exploit and fixing bugs that could cause crashes.
Updates also ensure that your software stays compatible with newer technologies and standards, curtailing compatibility issues that could cause system malfunctions.
Patches often include performance enhancements that can improve system responsiveness and reduce the risk of slowdowns and hardware strain that could potentially lead to downtime.
Implementation
Implementing regular software updates and patch management involves first creating a patch management policy and automating the process.
Prioritize critical patches based on risk level and regularly monitor patch status across all systems.
Additionally, it’s vital to test patches before deployment and to have a rollback plan in case new patches cause critical issues.
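One possible way to combine testing and rollback is a staged rollout, where a patch reaches a small group of machines first and is rolled back everywhere if that group becomes unhealthy. The Python sketch below illustrates the idea under simplifying assumptions: the host names, the version labels and the `apply_patch`, `rollback` and `is_healthy` callbacks are hypothetical stand-ins for your own deployment tooling.

```python
from __future__ import annotations

from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_patch: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[bool, list[str]]:
    """Patch one ring at a time; roll everything back if a ring is unhealthy."""
    patched: list[str] = []
    for ring in rings:
        for host in ring:
            apply_patch(host)
            patched.append(host)
        if not all(is_healthy(host) for host in ring):
            # Undo in reverse order and stop before reaching later rings.
            for host in reversed(patched):
                rollback(host)
            return False, []
    return True, patched


# Hypothetical fleet: a canary ring followed by two production rings.
rings = [["canary-1"], ["prod-1", "prod-2"], ["prod-3", "prod-4"]]
state = {host: "v1" for ring in rings for host in ring}


def apply_patch(host: str) -> None:
    state[host] = "v2"


def rollback(host: str) -> None:
    state[host] = "v1"


def is_healthy(host: str) -> bool:
    # Simulated fault: the new version breaks one specific host.
    return not (host == "prod-2" and state[host] == "v2")


succeeded, patched_hosts = staged_rollout(rings, apply_patch, rollback, is_healthy)
```

In this simulation the rollout stops at the second ring, every patched host is returned to its previous version, and the last ring is never touched.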
Generate reports detailing patch compliance and document the entire process to ensure consistency.
Priority 5: Disaster Recovery & Business Continuity Planning
Disaster Recovery and Business Continuity Planning is a set of mission-critical strategies that help ensure a business can quickly recover from disruptions.
But having a clear disaster recovery strategy in place is only the first step. Testing of your recovery procedure is essential to ensure its effectiveness.
Many organizations admit that they only perform a minimal amount of testing due to shortage of time and resources. This could severely impact the ability to recover quickly from a disaster like the CrowdStrike incident.
Implementation
The first step is to conduct a thorough risk assessment. Identify critical business functions and take into account the other four priorities outlined here to prevent outages.
Identify potential threats like natural disasters, cyber attacks, power outages and system failures, and analyze the potential impact of each risk.
Conduct a Business Impact Analysis (BIA) to determine which business functions are critical, the maximum allowable downtime for each, and the key dependencies between different business functions and systems.
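The output of a BIA can be turned into a recovery order. The Python sketch below is one illustrative approach, not a prescribed method: it restores functions in dependency order and, among those that are ready, picks the one with the shortest maximum allowable downtime first. The function names, downtime figures and the `recovery_order` helper are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from __future__ import annotations

import heapq
from graphlib import TopologicalSorter


def recovery_order(
    max_downtime_hours: dict[str, float],
    depends_on: dict[str, set[str]],
) -> list[str]:
    """Order functions so dependencies come first, most urgent first among ties."""
    sorter: TopologicalSorter = TopologicalSorter()
    for name in max_downtime_hours:
        sorter.add(name, *depends_on.get(name, set()))
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    order: list[str] = []
    ready: list[tuple[float, str]] = []
    while sorter.is_active():
        # Queue everything whose dependencies are already restored.
        for name in sorter.get_ready():
            urgency = max_downtime_hours.get(name, float("inf"))
            heapq.heappush(ready, (urgency, name))
        # Restore the function that tolerates the least downtime.
        _, name = heapq.heappop(ready)
        order.append(name)
        sorter.done(name)
    return order


# Hypothetical BIA results: maximum allowable downtime in hours.
max_downtime = {
    "payments": 1,
    "identity": 1,
    "database": 2,
    "customer_portal": 4,
    "reporting": 48,
}

# Hypothetical dependencies between functions and systems.
depends_on = {
    "payments": {"database", "identity"},
    "customer_portal": {"database", "identity"},
    "reporting": {"database"},
}

plan = recovery_order(max_downtime, depends_on)
```

Here the identity service and database are restored before payments, even though payments is just as urgent, because payments cannot run without them.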
As mentioned earlier, data backup and replication is critical, as is testing and evaluation through simulations and drills.
Once you’ve identified potential threats, create a dedicated disaster recovery team with clear roles and responsibilities and implement regular training.
Lessons learned from the CrowdStrike calamity
If there’s one thing we’ve learned from the CrowdStrike incident, it’s that even renowned cybersecurity organizations can experience destructive outages.
It highlights the need for stringent monitoring and testing practices, as well as robust incident response plans to fortify your technology stack and create strong operational resilience. By taking on board the priorities explored here, businesses can be better prepared for failures and better equipped to recover from them.