Bridgeworks features in the Digitalisation World article to discuss best practices for disaster recovery planning.
October 10, 2024
More than ever, there is a need to consider in what circumstances a disaster recovery plan needs to be put in place, and how to define one. Take, for example, security firm CrowdStrike’s update to a configuration file for its Falcon cybersecurity system on 19th July 2024, which caused a global outage of machines running the Microsoft Windows operating system, leaving them stuck on the blue screen of death.
Cyber Security News estimates that it caused a direct financial loss of $5.4bn for Fortune 500 companies, and as it was a worldwide outage, the total losses could be much higher. There was also an unconnected Microsoft Azure outage. Other estimates suggest that the true cost was at least $10bn, because the outage affected airlines, airports, banks, hotels, hospitals, manufacturing, stock markets, broadcasting, petrol stations, retail stores and government services, as well as emergency services and websites.
“The incident affected approximately 8.5 million Microsoft Windows devices worldwide and caused widespread disruptions across various industries,” the magazine reports. Despite this calamity, UK newspaper The Guardian reports that CrowdStrike is doing its best to put matters right by apologising. Adam Meyers, Senior Vice-President for Counter-Adversary Operations at CrowdStrike, testifying before U.S. Congress in September 2024, said: “I am here today because, just over two months ago, we let our customers down. On behalf of everyone at CrowdStrike, I want to apologise.”
Full systems review
Meyers also informed Congress that his company has undertaken a full review of its systems to prevent the same cascade of errors from occurring again. He explained that CrowdStrike takes full responsibility for the outage and said the errors “ultimately resulted in the Falcon sensor attempting to follow a threat-detection configuration, for which there was no corresponding definition of what to do,” write Blake Montgomery and Johana Bhuiyan in their article for The Guardian, ‘CrowdStrike apologises for global IT outage in congressional testimony’.
David Trossell, CEO and CTO of Bridgeworks, says that, in contrast to disaster recovery, this incident (albeit a disaster of CrowdStrike’s own making) was not about recovering a file. The fault occurred in the boot section, and so a software patch to CrowdStrike’s Falcon sensor software had to be handcrafted to bring the system back up. The trouble started when the update that introduced the defect was not tested properly.
He explains: “Basically they should have tested it fully, right up to the reboot. Whenever you change the software, you have to consider where it goes wrong on the test system. It just looks like they didn’t do it properly. These changes come from a third party, and therefore it must be about testing. Part of their release mechanism should be a full-system test. It’s about loading the software up, checking everything and making sure it all works – including cyber-security.”
Unpreventable outage
As this incident wasn’t about disaster recovery, in the truest sense of the phrase, he thinks there was no way disaster recovery measures could have prevented the outage, as the error occurred at the system boot level. The fault sat in software that loads early in the boot process, and so it stopped the systems from booting at all. The software update corrupted a system file, and so this episode was about the corruption of a file and not about encrypting a file for back-up. Nevertheless, if you can’t recover a file, you should find a way to allow the machine to boot.
Quite often, this would require using a USB stick or a boot disk. However, he stresses that using a USB stick is “a bit dodgy from a security perspective.” It could be lost and be misused by a bad actor.
To avoid this situation, CrowdStrike should have carried out multiple tests before installing the software in a live environment. Consequently, Microsoft is absolved of any responsibility. The fault comes down to someone at CrowdStrike failing to follow written procedures. Trossell explains: “Microsoft couldn’t have done anything, because if someone screws around with a file, it’s up to that person to fix it.”
Nothing to restore
The resolution wasn’t to restore data from a back-up, because the only way to resolve a situation where servers aren’t booting up properly is to manipulate the boot file. Once the boot file is either replaced or repaired, it’s possible to carry on. “With this one, it was not possible to have a recovery plan as there was nothing to restore as the machine would not boot,” says Trossell.
He adds: “This is such a low, fundamental issue. This is all about the low-level operating systems. Disaster recovery is where a file is backed up somewhere else and the operating system would be running. So, you would have access to the file. Basically, if a file is corrupt you aren’t going to boot. It doesn’t matter whether you have disaster recovery or not – not at that level. If a starter motor won’t turn a car engine over, you can’t start it and there is nothing you can do.”
Disaster recovery’s role
Disaster recovery comes into play when there is a ransomware attack or when files are corrupted, and a machine can still be booted to allow backed-up data to be restored, or access to that data to be recovered. This is where WAN Acceleration, which mitigates the effects of latency and packet loss, can be deployed.
However, WAN Acceleration can’t help when a boot file is corrupted. Instead, it’s necessary to find a fix such as a software patch, as CrowdStrike did. In the case of cyber-attacks, systems can be restored from disaster recovery sites with the help of WAN Acceleration. Still, testing is important even with disaster recovery.
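To illustrate why latency matters when restoring from a remote disaster recovery site, here is a rough back-of-envelope sketch. It relies on the general rule that a single TCP stream cannot move more than one window of data per round trip; the window size, round-trip time and data volume are hypothetical figures chosen for illustration, not measurements from any product or site.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def transfer_hours(data_bytes: int, throughput_bps: float) -> float:
    """Time in hours to move data_bytes at a sustained throughput."""
    return data_bytes * 8 / throughput_bps / 3600


# Hypothetical link: 64 KiB window and a 100 ms round trip to the DR site.
bound = max_tcp_throughput_bps(64 * 1024, 0.100)
print(f"Throughput bound: {bound / 1e6:.1f} Mbit/s")

# Hypothetical restore of 1 TB over that single stream.
print(f"Restore time: {transfer_hours(10**12, bound):.0f} hours")
```

With these assumed figures the bound is roughly 5 Mbit/s regardless of how much bandwidth the link offers, which is why a restore over a long-distance link can take far longer than the line speed suggests.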
It’s no good having systems that can be backed up and restored if they fail at the point of a disaster occurring. So, even with true disaster recovery, a plan, tests and procedures should be put in place to ensure that everything works as it should do. Appropriate maintenance is crucial to protect what often amounts to critical infrastructure.
As for CrowdStrike, Trossell asks: “If you have got something as critical as that, who didn’t test it, who was in charge?” Failing to ensure that the right measures are in place is, in his mind, not just a technical failure but also a leadership one. While that incident didn’t fall within the definition of disaster recovery, as it’s often considered to be, there is similarly a need to adhere to policies and procedures before any release of a software patch.
Failover machine
The only Plan B CrowdStrike could have had in place was a failover machine identical to the main one. “Crowdstrike shouldn’t have put it onto a live system; they should have made sure it was okay on a test system first before moving it over to the live system,” he remarks. As for protecting and restoring data that has been impacted by a cyber-attack, he advises having a disaster recovery machine in the same data centre or in a separate one, which can be booted up to find the issue and to recover from it.
CrowdStrike says it has learnt from its experience. It will no longer roll out its software updates globally to all customers in a single stream. Customers can also select when they receive their updates, or they can choose to hold them off. The latter, though, could make those who elect to hold off more vulnerable to security breaches, since they won’t have the most up-to-date threat assessment. That is unless they have secured their data with WAN Acceleration.
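The idea of releasing updates in stages, with customers choosing when they receive them, can be sketched in a few lines. This is only an illustration of the general technique and not CrowdStrike’s actual mechanism: the policy names ("early", "general", "hold"), the Customer class and the health check are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    policy: str  # hypothetical values: "early", "general" or "hold"


def plan_waves(customers):
    """Group customers into ordered waves; those on 'hold' are deferred."""
    order = ["early", "general"]
    waves = [[c.name for c in customers if c.policy == p] for p in order]
    deferred = [c.name for c in customers if c.policy == "hold"]
    return waves, deferred


def roll_out(waves, is_healthy):
    """Deploy wave by wave, halting at the first wave that reports a failure."""
    deployed = []
    for wave in waves:
        deployed.extend(wave)
        if not all(is_healthy(name) for name in wave):
            return deployed, False  # later waves never receive the update
    return deployed, True


customers = [
    Customer("airline", "general"),
    Customer("test-lab", "early"),
    Customer("hospital", "hold"),
]
waves, deferred = plan_waves(customers)
# Hypothetical health check: pretend the early wave fails after the update.
print(roll_out(waves, lambda name: name != "test-lab"))
print("Deferred by choice:", deferred)
```

In this sketch a fault caught in the first wave stops the update from reaching anyone else, which is the point of not shipping to all customers in a single stream.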
Assessing DR plans
As for how to know whether disaster recovery plans work, accounting and advisory firm Warren Averett advises organisations to review and update their disaster recovery plans frequently, identify critical systems and data, define testing objectives, determine the testing approach and scenarios, allocate appropriate resources, document the process, conduct the tests, analyse the results and repeat this process often. As for software, the best way to assess whether it can be released into the wild is also to test it in a formal testing environment.
Without testing, it’s impossible to know what works and what doesn’t, whether this is for the implementation of a disaster recovery plan or, as in CrowdStrike’s case, the release of software updates. Disaster recovery testing also ensures that data and applications can be restored to enable operations to continue unabated after a natural disaster, an IT failure or a cyber-attack. This is also best done by using WAN Acceleration to expedite data transfers.
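The “define testing objectives” and “analyse the results” steps in that list could be sketched as follows. This is a minimal illustration only: the system names and the restore-time targets in minutes are hypothetical, and a real plan would track far more than restore time.

```python
def analyse_dr_test(objectives, measured):
    """Compare measured restore times (minutes) against the stated targets.

    Returns a dict of systems that missed their target or were not tested.
    """
    failures = {}
    for system, target in objectives.items():
        actual = measured.get(system)
        if actual is None:
            failures[system] = "not tested"
        elif actual > target:
            failures[system] = f"restored in {actual} min, target {target} min"
    return failures


# Hypothetical objectives and results from one test run.
objectives = {"payroll": 60, "email": 120, "web": 30}
measured = {"payroll": 45, "email": 150}

for system, problem in analyse_dr_test(objectives, measured).items():
    print(f"{system}: {problem}")
```

Treating an untested system as a failure, rather than ignoring it, keeps gaps in the test coverage visible when the process is repeated.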