Building a Disaster Recovery Solution on AWS for SaaS

Learn how we built a fully functional pilot light DR environment to protect the client’s SaaS infrastructure from downtime.

Our Customer

Pragma IT therapyBOSS is a comprehensive web and mobile SaaS platform for agencies and clinicians that allows its users to manage the administrative and clinical aspects of home health therapy (Early Intervention, Physical Therapy, Speech Therapy, Skilled Nursing, etc.). It helps healthcare providers be more efficient and compliant, saves time, cuts costs, and streamlines the treatment of patients at home.

The Obstacles They Faced

The client’s main workloads run on-premises. To meet US healthcare compliance requirements, reduce restore time along with the recovery time objective (RTO) and recovery point objective (RPO), minimize the interruption of critical processes, and safeguard business operations, they needed to build a trustworthy and sustainable disaster recovery (DR) infrastructure.

How We Helped

Romexsoft worked with the therapyBOSS on-premises environments and software components and built a fully functional pilot light DR environment that protects the client’s SaaS infrastructure from prolonged downtime and thereby safeguards vital business operations.

The challenge was to find the DR option for the SaaS that best balanced the fastest feasible restoration of the platform against the cost-effectiveness of the disaster recovery infrastructure itself.

Negative events that can affect on-premises environments include hardware or software failure, a network or power outage, physical damage caused by fire or flooding, human error, or any other significant disaster that harms business continuity.

How the application is built

The therapyBOSS application is written in Java and has a microservices-based, containerized architecture. The microservices communicate through REST APIs and event-driven messaging, with Apache Kafka serving as the distributed event streaming platform. Galera Cluster for MySQL and MongoDB are used as data storage solutions.

How the DR infrastructure is designed

After several workshops with the customer, Romexsoft suggested building the pilot light DR infrastructure in the US East (Ohio) AWS region, far from the on-premises data center. This decision was driven by the need to meet the client’s specific RTO, RPO, and TCO (total cost of ownership) requirements for their application, as well as to enable faster recovery of the critical IT systems from any event that harms the Pragma IT business.

The pilot light disaster recovery approach was delivered by configuring and running the most critical core elements of the customer’s system in AWS. When the time for recovery comes, a full-scale production environment is rapidly provisioned in AWS around this critical core.

Ensuring data relevance and synchronization

To keep the data in the DR environment current, one of the Galera read replicas always runs on an AWS EC2 instance and remains synchronized with the main cluster in the data center. A similar approach is used for the MongoDB cluster. Additional Galera and MongoDB replicas will be provisioned on EC2 instances and synchronized as well.
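As an illustration of how replica freshness relates to the RPO, the sketch below compares each replica's lag with an RPO target. It is not taken from the project: the replica names, the 5-minute RPO, and the assumption that monitoring can report the timestamp of the last change applied on each replica are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ReplicaStatus:
    name: str                  # hypothetical replica label
    last_applied_at: datetime  # time of the newest change applied on the replica


def replication_lag(status: ReplicaStatus, now: datetime) -> timedelta:
    """Return how far the replica is behind the given time, never negative."""
    return max(now - status.last_applied_at, timedelta(0))


def replicas_violating_rpo(statuses, rpo: timedelta, now: datetime):
    """Return the names of replicas whose lag exceeds the RPO target."""
    return [s.name for s in statuses if replication_lag(s, now) > rpo]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    statuses = [
        ReplicaStatus("galera-dr-replica", now - timedelta(seconds=30)),
        ReplicaStatus("mongodb-dr-replica", now - timedelta(minutes=20)),
    ]
    # Assumed RPO of 5 minutes, chosen only for the example.
    print(replicas_violating_rpo(statuses, timedelta(minutes=5), now))
```

A replica that shows up in this list would lose more data than the RPO allows if a failover happened at that moment, so it is a natural signal to alert on.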

Data synchronization between the on-premises environment and AWS is accomplished through AWS Site-to-Site VPN. All other AWS resources, such as the applications running in Fargate, Amazon MSK, the Jenkins server, and the bastion host, remain idle. At the moment of a disaster, the idle part of the AWS infrastructure will be provisioned using the infrastructure as code (IaC) approach with Terraform.
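The split between the always-on core and the idle part can be sketched as a small model. This is an illustration under stated assumptions, not the project's tooling: the component names are labels invented for the example, and `dr_active` is a hypothetical Terraform input variable that a configuration could use to switch the idle resources on. The script only builds the command line and does not run Terraform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    always_on: bool  # True for the pilot light core that runs permanently in AWS


# Hypothetical labels for the components described in the text.
COMPONENTS = [
    Component("galera-replica", always_on=True),
    Component("mongodb-replica", always_on=True),
    Component("fargate-applications", always_on=False),
    Component("kafka-cluster", always_on=False),
    Component("jenkins-server", always_on=False),
    Component("bastion-host", always_on=False),
]


def components_to_provision(components):
    """Return the idle components that must be brought up during failover."""
    return [c.name for c in components if not c.always_on]


def terraform_apply_command(dr_active: bool):
    """Build (but do not run) a Terraform command that toggles the DR stack.

    "dr_active" is an assumed variable name, not one from the project.
    """
    value = "true" if dr_active else "false"
    return ["terraform", "apply", "-auto-approve", "-var", f"dr_active={value}"]


if __name__ == "__main__":
    print("To provision:", components_to_provision(COMPONENTS))
    print("Command:", " ".join(terraform_apply_command(True)))
```

Keeping the activation behind a single variable is one possible design: the same configuration describes both the idle and the full-scale state, so a failover is a matter of re-applying it with a different input.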

How the DR infrastructure is maintained

We have agreed with the customer to perform disaster recovery exercises for the staging environment on a monthly basis. This activity ensures:

confidence that the DR infrastructure always functions properly
integrity of the DR environment as it evolves alongside the app’s development
tracking of, and compliance with, the agreed time range for restoring replicas of the on-premises infrastructure
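The last point, tracking restoration time against an agreed target, can be sketched as follows. The step names, durations, and the 60-minute RTO are hypothetical values chosen for the example, and the sketch assumes the steps run one after another.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillStep:
    name: str            # hypothetical step label
    duration: timedelta  # measured time the step took during the exercise


def total_restore_time(steps) -> timedelta:
    """Sum step durations, assuming the steps are executed sequentially."""
    return sum((s.duration for s in steps), timedelta(0))


def drill_meets_rto(steps, rto: timedelta) -> bool:
    """Return True when the whole drill finished within the RTO target."""
    return total_restore_time(steps) <= rto


if __name__ == "__main__":
    drill = [
        DrillStep("provision idle infrastructure", timedelta(minutes=25)),
        DrillStep("deploy application services", timedelta(minutes=15)),
        DrillStep("verify data and switch traffic", timedelta(minutes=10)),
    ]
    rto = timedelta(minutes=60)  # assumed target, for illustration only
    print("Total restore time:", total_restore_time(drill))
    print("Within RTO:", drill_meets_rto(drill, rto))
```

Recording the same measurements at every monthly exercise makes it possible to see whether restoration time drifts as the application grows.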