Disaster Recovery for Hitachi IOportal
Hope for the best, prepare for the worst. Although not as easy as stopping a domino, making provisions for a complete data center outage is certainly worth the effort. As such, Hitachi IOportal is now available with a complete disaster recovery solution that guarantees our customers unfettered access to the IOportal service even in a disaster scenario. This includes access to the IOportal, the capability to create performance graphics for the past 30 days, and complete access to the Dashboard, Cockpit and Capacity features. Furthermore, customers’ export files will still be received and processed by the IOportal.
A solution is born
When we set out to develop a disaster recovery solution, it quickly became clear that we first needed to define the business cases that should be protected. This exercise resulted in a detailed functionality scope that the solution had to fulfil. This first specification helped us make a coarse decision on the sizing and capabilities of the disaster recovery solution. Next, we discussed the non-functional requirements, especially those around operating the system. One such requirement, for instance, was the need to keep the code version on the production and disaster recovery systems in sync without manual intervention. Another area of intensive discussion was the need to limit day-to-day operational activities to the production system alone, instead of duplicating the workload and maintaining two systems. As the discussions proceeded, a list of tasks formed, covering the various custom implementations required to achieve our goals. Finally, a crucial part of the design process was the careful end-to-end simulation of the various steps and phases that would play out during a disaster event:
- Detecting the disaster
- Deciding whether to switch to the disaster recovery solution or not
- Operations during disaster recovery
- Recovery of primary system
- Resuming regular operations
Going through all these steps again and again while considering all system aspects, such as networking and DNS, validated our design and revealed the bits and pieces that were still missing in terms of functionality.
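To make the phases above more tangible, they can be modelled as a small state machine. The following Python sketch is purely illustrative and is not taken from the IOportal implementation: the phase names, the allowed transitions and the helper functions advance and walk are assumptions made for this example.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical phases of a disaster event, named after the list above."""
    NORMAL = auto()             # regular operations on the primary site
    DISASTER_DETECTED = auto()  # outage noticed, switch decision still pending
    DR_ACTIVE = auto()          # operating on the disaster recovery site
    PRIMARY_RECOVERY = auto()   # primary system is being restored


# Assumed transitions; a real runbook may allow more or fewer of them.
TRANSITIONS = {
    Phase.NORMAL: {Phase.DISASTER_DETECTED},
    # Either switch to disaster recovery, or decide not to and carry on.
    Phase.DISASTER_DETECTED: {Phase.DR_ACTIVE, Phase.NORMAL},
    Phase.DR_ACTIVE: {Phase.PRIMARY_RECOVERY},
    Phase.PRIMARY_RECOVERY: {Phase.NORMAL},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return the target phase, or raise ValueError if the step is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


def walk(steps, start: Phase = Phase.NORMAL) -> Phase:
    """Replay a sequence of phases and return the phase reached at the end."""
    phase = start
    for target in steps:
        phase = advance(phase, target)
    return phase
```

Replaying a complete scenario with walk, for example detection, switch, recovery and return to regular operations, is one simple way to check that no step of a simulation has been skipped.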
Disaster Recovery Architecture
In the end, we devised the following architecture:
- Offsite location for disaster recovery, in the same VPN as our data center
- Fully automatic setup of disaster recovery site via Ansible (infrastructure): this includes installing the OS, bootstrapping and installing dedicated software packages on three VMs per environment (total 6 VMs)
- Automatic application deployment (both frontend and backend) via Capistrano simultaneously on both sites (see our dedicated blog post on Capistrano)
- Batch jobs for periodically copying over the production database and required csv files from production to disaster recovery, such that we achieve our desired Recovery Point Objective (RPO)
- Activation of the disaster recovery solution by changing either the nameservers or the A records, so that users are automatically routed to the secondary site
- Automatic instantaneous duplication of export files between production and disaster recovery sites (both directions)
- Automatic heartbeat monitoring of production system from offsite location and alerting
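The heartbeat monitoring in the last item can be illustrated with a short Python sketch. It is an assumption-laden example rather than a description of our actual monitoring: the class HeartbeatMonitor, its failure_threshold parameter and the rule of alerting after several consecutive failed probes are all hypothetical, and the probe itself (for example an HTTP request from the offsite location) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical heartbeat tracker; not the IOportal monitoring code."""
    failure_threshold: int = 3     # assumed number of failed probes before alerting
    consecutive_failures: int = 0  # failed probes since the last healthy one

    def record(self, healthy: bool) -> bool:
        """Record one probe result and return True if an alert should fire now.

        The alert fires exactly once, at the moment the number of consecutive
        failures reaches the threshold, so a long outage does not cause
        repeated alerts.
        """
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failure_threshold
```

Requiring several consecutive failures is a common way to avoid alerting on a single dropped probe. The alert only informs the operators; the decision whether to switch to the disaster recovery site remains a separate step.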
Is it worth it?
Implementing the disaster recovery solution required significant effort. However, a great deal of it went into automation, which benefits the system as a whole, irrespective of the disaster recovery case. With every new environment that we set up and every new rollout we perform, we save time and gain standardised, improved quality. This is certainly a worthwhile investment that keeps paying dividends over time. And finally, it was worth it for the satisfaction of knowing that our customers can enjoy the IOportal cloud service no matter what!