UST Global's Approach to Application Resilience
OVERVIEW OF APPLICATION RESILIENCE
In today's digital era, applications and the business services they support are expected to operate 24x7, with zero tolerance for lapses in availability or performance. Building such systems once took a tremendous amount of effort and resources. Today, building blocks such as containerization, API-based resource management, predictive analytics, and machine learning make it possible to build highly available, self-healing applications.
Application non-availability has several causes:
- Planned outages, which account for 90% of non-availability
- Unplanned outages, which account for about 9%
- Disasters, which account for about 1%
Cause of application non-availability
It is fairly easy and efficient to build in resilience when such systems are designed from the ground up. In most cases, though, legacy applications were built for a specific purpose and for the context that existed when they were developed and deployed, and they continue to provide value to the enterprise through business-critical functions.
Most enterprises are therefore looking to modernize these applications, with application resilience as one of the key factors in the transformation.
In this document we look at considerations spanning the infrastructure platform, software components, and application architecture that make applications highly resilient to failure or degradation, whether in the application itself or in the infrastructure layers it depends on.
For an application to be resilient in its normal mode of operation, the key layers it depends on are:
- Infrastructure components such as server, storage, network, power, and physical environment
- Software components such as the operating system and middleware (web server, application server, database server, and messaging server)
- Data components such as file system, shared memory, object storage, or block storage
- User connectivity and the end-user computing environment
APPLICATION RESILIENCY THROUGH INFRASTRUCTURE COMPONENTS
At the infrastructure level, resiliency is mostly achieved through redundant components. In the physical environment, server clusters are configured in either active-active or active-passive mode, with a heartbeat mechanism to ensure continuity. When a node's heartbeat fails, the remaining active node or the passive node takes over.
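The heartbeat-driven takeover in an active-passive pair can be sketched as follows. This is a minimal illustration, not a production cluster manager: the class names, the node names, and the 3-second timeout are all assumptions made for the example, and the sketch does not verify that the passive node is itself healthy before promoting it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # timestamp (seconds) of the most recent heartbeat


@dataclass
class ActivePassiveCluster:
    active: Node
    passive: Node
    timeout_s: float = 3.0  # assumed threshold; real clusters tune this value

    def record_heartbeat(self, node_name: str, now: float) -> None:
        """Store the time at which a heartbeat from node_name was received."""
        for node in (self.active, self.passive):
            if node.name == node_name:
                node.last_heartbeat = now

    def check(self, now: float) -> str:
        """Promote the passive node if the active node's heartbeat is stale."""
        if now - self.active.last_heartbeat > self.timeout_s:
            self.active, self.passive = self.passive, self.active
        return self.active.name


cluster = ActivePassiveCluster(Node("node-a", 0.0), Node("node-b", 0.0))
cluster.record_heartbeat("node-a", 1.0)
cluster.record_heartbeat("node-b", 1.0)
print(cluster.check(now=2.0))  # node-a: heartbeat is 1.0 s old, within the timeout
cluster.record_heartbeat("node-b", 5.0)
print(cluster.check(now=5.5))  # node-b: node-a has been silent for 4.5 s
```

The timeout is a trade-off: a short value detects failures quickly but risks a takeover when a heartbeat is merely delayed.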
In virtual environments, resiliency is achieved through continuous provisioning of redundant components combined with failure detection mechanisms. When a failure occurs, a replacement component is provisioned from the pool of existing components and the operational state is re-established.
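A rough sketch of this detect-and-replace loop is shown here. The function name, the instance identifiers, and the `is_healthy` callback are hypothetical; an actual platform would call its own health-check and provisioning APIs instead.

```python
from typing import Callable


def replace_failed(
    in_service: list[str],
    spare_pool: list[str],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Return the new in-service list, swapping failed instances for spares."""
    result = []
    for instance in in_service:
        if is_healthy(instance):
            result.append(instance)
        elif spare_pool:
            result.append(spare_pool.pop(0))  # take a replacement from the pool
        # If no spare is left, capacity shrinks; a real system would alert here.
    return result


failed = {"vm-2"}  # pretend the health check reports vm-2 as down
spares = ["spare-1", "spare-2"]
print(replace_failed(["vm-1", "vm-2", "vm-3"], spares, lambda i: i not in failed))
# ['vm-1', 'spare-1', 'vm-3']
```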
In virtualized environments such as private or public clouds, multiple approaches are used, but each has the key objective of identifying the single points of failure at the infrastructure level and developing a solution that tolerates those failure scenarios. When multiple failures occur at the infrastructure level, the situation can be considered a disaster, and the established disaster recovery mechanism is triggered.
Disaster Recovery RTO and RPO
Disaster recovery is implemented with a recovery time objective (RTO) and a recovery point objective (RPO). Not all applications need the same level of RTO and RPO. More stringent RTO and RPO parameters are more expensive to implement, so they should be tied to the business criticality of the application.
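As a rough illustration of tying recovery targets to a deployment choice, the sketch below maps an application's RTO and RPO to one of the three configurations this section goes on to describe. The function name and the numeric thresholds are assumptions made for the example; they are not values prescribed by this document.

```python
def choose_dr_configuration(rto_hours: float, rpo_minutes: float) -> str:
    """Pick a disaster recovery configuration from RTO/RPO targets.

    The thresholds are illustrative assumptions only; each enterprise
    would set its own based on business criticality and cost.
    """
    if rto_hours <= 0.25 and rpo_minutes <= 1:
        return "active-active"   # near-zero RTO and RPO
    if rto_hours <= 4 and rpo_minutes <= 60:
        return "warm standby"    # moderate RTO and RPO
    return "active-passive"      # less stringent targets, secondary shut down


print(choose_dr_configuration(rto_hours=0.1, rpo_minutes=0))   # active-active
print(choose_dr_configuration(rto_hours=2, rpo_minutes=30))    # warm standby
print(choose_dr_configuration(rto_hours=24, rpo_minutes=240))  # active-passive
```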
For an RTO close to 0 hours and an RPO close to 0 minutes, the infrastructure components should be deployed in an active-active configuration, using a global load balancer to distribute traffic based on response time, location affinity, application affinity, and other factors. This requires duplicating the infrastructure components (server, storage, and network), along with the entire application stack, in a highly available mode across multiple geographically separated data centers connected by high-speed, low-latency network links. The network connection between the data centers should itself be redundant to avoid a single point of failure. If one regional data center goes down, or multiple failures occur in a single data center, the other data center takes over and responds to requests, based on timeout parameters or on responses triggered by network alerts.
An example of an active-active deployment of an application environment across two data centers, with high-speed connectivity for data replication, is shown below. The two data centers are connected to a global intelligent DNS.
Note: The example below can be implemented in any public cloud today, as well as in private cloud environments.
Active – Active configuration
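The routing decision made by the global load balancer in an active-active setup can be sketched as follows. This is a simplified model under stated assumptions: the `DataCenter` fields, the site names, and the rule "prefer the client's region, then the fastest responder" are illustrative choices, and real global load balancers weigh more factors.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    region: str
    healthy: bool
    response_ms: float  # most recently measured response time


def route(client_region: str, sites: list[DataCenter]) -> DataCenter:
    """Pick a healthy site, preferring the client's region, then low latency."""
    candidates = [s for s in sites if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy data center available")
    # False sorts before True, so same-region sites win; ties go to the fastest.
    return min(candidates, key=lambda s: (s.region != client_region, s.response_ms))


sites = [
    DataCenter("dc-east", "east", healthy=True, response_ms=40.0),
    DataCenter("dc-west", "west", healthy=True, response_ms=25.0),
]
print(route("east", sites).name)  # dc-east (location affinity)
sites[0].healthy = False          # simulate an outage at dc-east
print(route("east", sites).name)  # dc-west (surviving site takes the traffic)
```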
For a moderate RTO and RPO, where there is enough time to recover the application and its associated data, the infrastructure is duplicated as in the active-active configuration, except that the secondary site generally has a scaled-down configuration and the primary site provides the application services, as depicted below.
Warm standby configuration
In the event of failure at the primary site, the secondary site will scale up in minutes and provide the full set of application services. The data is replicated between the primary and secondary data copies based on the RPO parameter values.
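One simple way to relate replication to the RPO is to monitor how far the secondary copy lags behind the primary. The check below is a minimal sketch; the function name and the 15-minute RPO in the usage lines are assumptions for illustration.

```python
from datetime import datetime, timedelta


def rpo_violated(last_replicated: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if data written since the last replication exceeds the RPO."""
    return now - last_replicated > rpo


last_sync = datetime(2024, 1, 1, 12, 0, 0)
rpo = timedelta(minutes=15)  # assumed target for this example
print(rpo_violated(last_sync, datetime(2024, 1, 1, 12, 10, 0), rpo))  # False
print(rpo_violated(last_sync, datetime(2024, 1, 1, 12, 20, 0), rpo))  # True
```

If the primary site failed while this check returned True, more data than the RPO allows would be lost, so the replication interval should be set comfortably below the RPO.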
For non-critical business applications that can tolerate less stringent RTO and RPO levels, the secondary site can be implemented in an active-passive configuration, in which the secondary site remains in a shutdown state, as illustrated in the diagram below.