Disaster recovery should be in almost every company’s I.T. budget. It can seem to be a large line item, but the investment winds up paying for itself, especially as cybercriminals continue to target businesses more frequently. According to a recent report by CNBC, the average cyber-attack costs businesses of all sizes $200,000.
At Jay Peak, we really started looking at our disaster recovery plans and capabilities after a “near miss” disaster brought on by a malware attack. Obviously, prevention is the best solution to a situation like this. But after speaking with experts investigating cybercrimes at the FBI, we realized it is nearly inevitable that all companies will succumb to some sort of an attack at some point. How you recover is what dictates whether you remain a viable business or not.
The first thing to understand is that disaster recovery planning is a collaborative effort. The I.T. team cannot do it in a vacuum, nor should it. At the very least, the head of finance and the general manager need to be involved in the early development of a disaster recovery plan. The resort communications person should also be involved early on, to develop a cohesive internal and external message in the event of an actual disaster.
I.T. is responsible for costing and implementing a disaster recovery plan, but the plan itself cannot come together without input from department leaders. Some basic questions I.T. needs to ask of these leaders include:
1. In the event of an unplanned outage, how long before your department is completely paralyzed?
2. Can your department accept the risk of storing offline credit card transactions until the system is back online (the risk being that these transactions are declined)?
3. How many labor hours will it take to bring the systems back up to date for every hour of system downtime?
4. Can your systems work in standalone mode for any length of time?
Yes, I.T. may know the answer to some of these, but additional input helps.
Recovery Time Determination
The two biggest benchmarks you will hear when talking to software, hardware, and cloud vendors about any disaster recovery plan are recovery point objective (RPO) and recovery time objective (RTO). These two benchmarks probably have more impact on the cost of the final solution than any other factor.
RPO represents the maximum amount of data, measured as a span of time, that you can afford to lose in the event of a complete loss of systems. So, if a pipe bursts over your server farm at 12:30 a.m. and shorts the entire thing out, how far back do you have to go to get a working system again? The answer can range from seconds (almost no data loss) to a day or more (back to the last successful backup).
RTO represents the target time before you can be operational again, also measured in seconds, minutes, hours, or days.
The solution that offers RTOs and RPOs of seconds also costs the most to implement. Conversely, if you’re OK with falling back to the last good backup, and your backups are to tape, your RPO can be as much as 24 hours, and your RTO will be 24-48 hours or more. This is the least expensive scenario.
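To make the two benchmarks concrete, here is a minimal Python sketch of the tape-backup scenario above. The dates, the backup time, and the 36-hour restore are hypothetical numbers chosen for illustration, and the function names are our own, not part of any vendor product.

```python
from datetime import datetime, timedelta


def data_loss_window(failure_time: datetime, last_good_copy: datetime) -> timedelta:
    """Work lost in a failure: everything done since the last recoverable copy."""
    return failure_time - last_good_copy


def meets_objectives(failure_time, last_good_copy, back_online, rpo, rto):
    """Return (rpo_met, rto_met) for one outage, given the agreed objectives."""
    data_loss = data_loss_window(failure_time, last_good_copy)
    downtime = back_online - failure_time
    return data_loss <= rpo, downtime <= rto


# Hypothetical outage: the pipe bursts at 12:30 a.m., and the last good
# tape backup finished at 1:00 a.m. the previous day.
failure = datetime(2024, 1, 16, 0, 30)
last_backup = datetime(2024, 1, 15, 1, 0)
back_online = failure + timedelta(hours=36)  # assumed restore duration

loss = data_loss_window(failure, last_backup)
print(f"Data lost: {loss}, downtime: {back_online - failure}")
```

In this example the resort loses 23.5 hours of data, which satisfies a 24-hour RPO but would badly miss an RPO measured in minutes.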
If seconds count, though, you need redundant hardware either offsite (the cloud counts) or at an alternative site that you can use to get back online. This hardware/software/cloud approach carries many costs, so it’s important to understand what you are paying for. A frank discussion with resort leadership and the profit center managers is the only way to come up with numbers that are reasonable and that the business can afford.
In addition, one solution does not have to fit all. Some profit centers may be able to operate just fine without the back-end systems being fully or partially online. Others might be dead in the water without the infrastructure to support them.
Once these discussions conclude, it’s time to look for solutions that can support the required RTOs and RPOs. For our property, the lodging systems probably have the most stringent requirements for both, depending on the time of year and business levels. But our ticketing and retail point-of-sale systems can run in standalone mode happily for a significant period of time. F&B falls in the middle.
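One way to turn those conversations into a shortlist is to write down each profit center's required RPO and RTO and match them against the cheapest option that satisfies both. The sketch below is only an illustration: the three tiers, their RPO/RTO figures, the relative cost ranks, and the per-department requirements are all assumptions, not quotes from any vendor or our actual numbers.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rpo: timedelta   # worst-case data loss this tier delivers
    rto: timedelta   # worst-case time to get back online
    cost_rank: int   # 1 = cheapest; relative ordering only (assumed)


# Hypothetical tiers for illustration.
TIERS = [
    Tier("tape backup", timedelta(hours=24), timedelta(hours=48), 1),
    Tier("cloud replication", timedelta(seconds=30), timedelta(hours=4), 2),
    Tier("mirrored local hardware", timedelta(seconds=30), timedelta(seconds=30), 3),
]


def cheapest_tier(required_rpo: timedelta, required_rto: timedelta):
    """Return the lowest-cost tier meeting both objectives, or None if none does."""
    candidates = [t for t in TIERS if t.rpo <= required_rpo and t.rto <= required_rto]
    return min(candidates, key=lambda t: t.cost_rank) if candidates else None


# Hypothetical requirements gathered from department leaders: (RPO, RTO).
requirements = {
    "lodging": (timedelta(minutes=1), timedelta(hours=8)),
    "ticketing": (timedelta(hours=24), timedelta(hours=72)),
    "F&B": (timedelta(hours=1), timedelta(hours=12)),
}

for dept, (rpo, rto) in requirements.items():
    tier = cheapest_tier(rpo, rto)
    print(dept, "->", tier.name if tier else "no tier fits")
```

The point of the exercise is that different profit centers can land on different tiers, so only the systems that truly need tight objectives drive the higher costs.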
The most expensive solutions call for mirrored local hardware that can be brought online in live mode seconds after a disaster strikes the primary hardware. The costs are basically double++ your initial hardware costs.
Why the ++? Well, what if the disaster that strikes isn’t physical, but rather software, such as a crypto attack or some other similar malware attack? If all you have is a live mirror of your local data, it will also be corrupted.
The ++ represents the dollar cost of the software stack required to be able to go back in time to the moment just before the attack happened. Depending on how insidious the attacker is, the strike could come almost immediately, or it might come minutes, hours, or days after the attacker penetrates your network, making it that much more difficult to recover. Either way, the ability to go back to the moment just before the initial infection is critical to getting systems back online safely.
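The idea of going back in time can be sketched as picking the latest saved checkpoint that predates the infection. This is a simplified illustration using a plain sorted list of timestamps; it is not how any particular product stores or exposes its recovery points, and the function name is hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def last_clean_checkpoint(checkpoints, infection_time):
    """Return the latest checkpoint strictly before infection_time.

    `checkpoints` must be sorted in ascending order. Returns None when
    every saved checkpoint is at or after the infection.
    """
    # bisect_left gives the index of the first checkpoint >= infection_time,
    # so the entry just before it is the last one known to be clean.
    i = bisect_left(checkpoints, infection_time)
    return checkpoints[i - 1] if i > 0 else None


# Hypothetical checkpoints taken every five seconds.
checkpoints = [
    datetime(2024, 1, 16, 10, 0, 0),
    datetime(2024, 1, 16, 10, 0, 5),
    datetime(2024, 1, 16, 10, 0, 10),
]
infection = datetime(2024, 1, 16, 10, 0, 7)
print(last_clean_checkpoint(checkpoints, infection))
```

The hard part in practice is establishing the infection time. If the attacker waited days before striking, the clean checkpoint is days old, which is why the length of the retained history matters.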
At Jay Peak, we utilize Zerto and the Microsoft Azure cloud. Zerto allows us to replicate, in real time, our critical data center assets into the Azure cloud in a format that allows us to bring live instances online in the Azure infrastructure. We are able to save up to 30 days of server state in increments of seconds.
Just as important, we are also able to group assets together so that assets that need to be restored to the same point in time can be saved together and brought back online as a group. An asset group might be the SQL server and any middleware servers required to run your lodging infrastructure, for instance.
If the assets in a group are brought back online with different time stamps, there is a high risk of data corruption, which would delay returning to an online state. Since the instances are not being kept online, the only regular monthly costs are the software licensing fee for Zerto and the storage costs of Azure. These costs are significantly less than doubling up hardware on campus.
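The grouping requirement can be illustrated with a short sketch: given the checkpoints saved for each server in a group, pick the latest clean timestamp that every member shares. The server names and timestamps are hypothetical, and this is a conceptual model only, not a description of how Zerto implements its groups.

```python
from datetime import datetime


def group_recovery_point(checkpoints_by_server, infection_time):
    """Latest checkpoint shared by every server in the group and taken
    strictly before infection_time, or None if no such checkpoint exists."""
    if not checkpoints_by_server:
        return None
    # Only timestamps present for every member keep the group consistent.
    common = set.intersection(*(set(c) for c in checkpoints_by_server.values()))
    clean = [t for t in common if t < infection_time]
    return max(clean) if clean else None


t = lambda s: datetime(2024, 1, 16, 10, 0, s)  # helper for readable timestamps

# Hypothetical lodging group: the SQL server and one middleware server.
lodging_group = {
    "sql-server": [t(0), t(5), t(10)],
    "middleware": [t(0), t(10), t(15)],
}
print(group_recovery_point(lodging_group, infection_time=t(12)))
```

Restoring the SQL server to its 10:00:05 checkpoint while the middleware only has 10:00:00 would be exactly the mismatch described above, so the sketch falls back to a timestamp both servers have.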
Identify and Address Needs
When planning for a disaster recovery scenario, several factors must be taken into consideration in addition to RTO and RPO. First, what types of disasters are you planning for? Power failure? Hardware failure? Environmental disaster? Malware/crypto attack? Your existing assets and safeguards will inform where the greatest needs exist.
For example, Jay Peak has redundant paths for utility power onto the campus that can be switched dynamically by our utility, and we also have a data center uninterruptible power supply (UPS), plus a generator backing that up. So, power failure wasn’t a primary concern.
We had also been through a “100-year storm” that flooded much of the campus, but our data center stayed bone dry. So, flooding wasn’t a primary concern.
Instead, our primary concern was a malicious actor or some other localized software failure that caused data corruption. As mentioned earlier, we once survived a minor crypto attack, but even that minor attack cost us significant dollars in lost productivity. Although the long-term damage from this attack was minimal, it revealed that we needed a solution to get us back in business faster.
We were fortunate in that case. We chose Zerto as our disaster recovery solution because we are comfortable with the flexibility it gives us.
The RPOs with Zerto are measured in seconds, and the RTO is measured in hours for a full-scale disaster. We can do a partial restoration of services in minutes if we face something that is not a complete disaster. Zerto has also given us the ability to restore to any point in time going back 30 days. This added benefit allows us to restore work that was done since the last traditional backup was completed.
Unique to Your Operation
The solution for your resort will be unique to the workflows of the business and the requirements of each profit center. It will also depend on the size of your I.T. team. If you have a small team, definitely consider the advantages of outsourcing some or all of disaster recovery planning, implementation, and ongoing support to a vendor. It may cost more money in the short term, but you can hold a specialized vendor to a higher standard than you can hold an already strapped I.T. team.
Having the discussions upfront and agreeing to the benchmarks as a leadership team is critical to a successful project that comes in at or under budget, and is completed on time.
Finally, it’s important to remember that a disaster recovery project never truly ends. At minimum, annual testing is required to make sure what you have implemented is going to work the way that you intended it to. This is critical to a successful plan. And, of course, the cost (in man hours, cloud compute, and remediation) cannot be disregarded, but remember that it’s a worthwhile investment.