DR BCP Solutions

 

Disaster often strikes without warning. Whether it is a minor accident or a catastrophic event, such a disruption can be costly for your business.

 
 Disaster Recovery (DR)

Many businesses rely on DR services to prevent either man-made or natural disasters from causing expensive service disruptions. Unfortunately, current DR services come either at a very high cost, or with only weak guarantees related to the amount of data lost or time required to restart operation after a failure.

 Traditional DR

A typical DR service works by replicating application state between two data centers. If the primary data center becomes unavailable, then the backup site can take over and will activate a new copy of the application using the most recently replicated data.

 DR Essentials

Some requirements for an effective DR service are driven by business decisions, such as the monetary cost of system downtime or data loss, while others are tied directly to application performance and correctness. The key requirements are:

  • Recovery Point Objective (RPO)

    The RPO of a DR system represents the point in time of the most recent backup prior to any failure. The necessary RPO is generally a business decision: for some applications absolutely no data can be lost (RPO = 0), which requires continuous synchronous replication, while for others the acceptable data loss could range from a few seconds to hours or even days.
  • Recovery Time Objective (RTO)

    The RTO is an orthogonal business decision that specifies a bound on how long it can take for an application to come back online after a failure occurs. This includes the time to detect the failure, prepare any required servers in the backup site (virtual or physical), initialize the failed application, and perform the network reconfiguration required to re-route requests from the original site to the backup site so the application can be used. Depending on the application type and backup technique, this may involve additional manual steps such as verifying the integrity of state or performing application-specific data restore operations, and the recovery tasks may need careful scheduling to be done efficiently. A very low RTO can enable business continuity, allowing an application to continue operating seamlessly despite a site-wide disaster.
  • Performance

    To be useful, a DR service must have minimal impact on the performance of each protected application during failure-free operation. DR can affect performance directly, as in synchronous replication, where an application write does not return until it is committed remotely, or indirectly, by consuming disk and network bandwidth that the application could otherwise use.
  • Consistency

    The DR service must ensure that after a failure occurs the application can be restored to a consistent state. This may require the DR mechanism to be application specific to ensure that all relevant state is properly replicated to the backup site. In other cases, the DR system may assume that the application will keep a consistent copy of its important state on disk, and use a disk replication scheme to create consistent copies at the backup site.
  • Geographic separation

    It is important that the primary and backup sites are geographically separated to ensure that a single disaster will not impact both sites. This geographic separation adds its own challenges, since increased distance leads to higher WAN bandwidth costs and greater network latency. Increased round-trip latency directly impacts application response time when using synchronous replication. As round-trip delays are limited by the speed of light, synchronous replication is feasible only when the backup site is within tens of kilometers of the primary. Asynchronous techniques can improve performance over longer distances, but can lead to greater data loss during a disaster. Distance can be a particular challenge in cloud-based DR services, as a business might have only coarse control over where resources will be physically located.
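To make the distance constraint concrete, the following rough sketch estimates the minimum round-trip delay that synchronous replication adds to each write. It assumes that signals travel through optical fiber at about 200,000 km/s (roughly two thirds of the speed of light in a vacuum) and it ignores switching, queuing, and remote commit time, so real delays will be higher. The function names and sample distances are illustrative only.

```python
# Lower bound on the round-trip time (RTT) between a primary and a backup site.
# Assumption: signals in optical fiber travel at roughly 200,000 km/s.
FIBER_SPEED_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case RTT in milliseconds over a fiber path of distance_km."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000.0


def max_sync_writes_per_second(distance_km: float) -> float:
    """Upper bound on strictly serialized synchronous writes per second.

    Each write must wait one full round trip before the next one starts.
    """
    rtt_ms = min_round_trip_ms(distance_km)
    return float("inf") if rtt_ms == 0 else 1000.0 / rtt_ms


if __name__ == "__main__":
    for km in (50, 500, 5000):
        print(f"{km:>5} km: RTT >= {min_round_trip_ms(km):.1f} ms, "
              f"at most {max_sync_writes_per_second(km):.0f} serialized writes/s")
```

Under these assumptions, a 50 km link adds at least 0.5 ms to every synchronous write, while a 5,000 km link adds at least 50 ms, which is why asynchronous replication is usually preferred over long distances.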




 DR Mechanisms

Disaster Recovery is primarily a form of long distance state replication combined with the ability to start up applications at the backup site after a failure is detected. The amount and type of state that is sent to the backup site can vary depending on the application's needs. State replication can be done at one of these layers:
(i) within an application,
(ii) per disk or within a file system, or
(iii) for the full system context.

  • Hot Back-Up Site

    A hot backup site typically provides a set of mirrored stand-by servers that are always available to run the application once a disaster occurs, providing minimal RTO and RPO. Hot standbys typically use synchronous replication to prevent any data loss due to a disaster. This form of backup is the most expensive, since fully powered servers must be available at all times to run the application, and extra licensing fees may apply for some applications. It can also have the largest impact on normal application performance, since network latency between the two sites increases response time.
 
  • Warm Back-Up Site

    A warm backup site may keep state up-to-date with either synchronous or asynchronous replication schemes depending on the necessary RPO. Standby servers to run the application after failure are available, but are only kept in a 'warm' state where it may take minutes to bring them online. This slows recovery but also reduces cost; the server resources to run the application need to be available at all times, but active costs such as electricity and network bandwidth are lower during normal operation.
  • Cold Back-Up Site

    In a cold backup site, data is often replicated only on a periodic basis, leading to an RPO of hours or days. In addition, servers to run the application after failure are not readily available, and there may be a delay of hours or days as hardware is brought out of storage or repurposed from test and development systems, resulting in a higher RTO. It can be difficult to support business continuity with cold backup sites, but they are a very low-cost option for applications that do not require strong protection or availability guarantees.
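The trade-off between the three site types can be sketched as a simple selection rule driven by the RPO and RTO targets. This is only an illustration: the one-minute and one-hour thresholds, the Requirements type, and the plan_backup function are assumptions made for this example, and real cut-offs depend on the business and the application.

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative thresholds only; real cut-offs are a business decision.
ONE_MINUTE = 60
ONE_HOUR = 3600


@dataclass(frozen=True)
class Requirements:
    rpo_seconds: float  # maximum tolerable data loss, expressed as time
    rto_seconds: float  # maximum tolerable time to come back online


def plan_backup(req: Requirements) -> Tuple[str, str]:
    """Return (site_type, replication_mode) for the given RPO/RTO targets."""
    if req.rpo_seconds < 0 or req.rto_seconds < 0:
        raise ValueError("RPO and RTO must be non-negative")

    # The RPO drives how state is replicated.
    if req.rpo_seconds == 0:
        replication = "synchronous"
    elif req.rpo_seconds < ONE_HOUR:
        replication = "asynchronous"
    else:
        replication = "periodic"

    # The RTO drives how ready the standby servers must be.
    if req.rto_seconds < ONE_MINUTE:
        site = "hot"
    elif req.rto_seconds < ONE_HOUR:
        site = "warm"
    else:
        site = "cold"

    # A cold site replicates only periodically, so a tight RPO needs a warm site.
    if site == "cold" and replication != "periodic":
        site = "warm"
    return site, replication
```

For example, an application that tolerates no data loss but can take ten minutes to recover would map to a warm site with synchronous replication under these assumed thresholds.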


 Failover and Failback

In addition to managing state replication, a DR solution must be able to detect when a disaster has occurred, perform a failover procedure to activate the backup site, as well as run the failback steps necessary to revert control back to the primary data center once the disaster has been dealt with. Detecting when a disaster has occurred is a challenging problem as transient failures or network segmentation can trigger false alarms. In practice, most DR techniques rely on manual detection and failover mechanisms.
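One common way to reduce false alarms from transient failures is to require several consecutive failed health probes before suspecting a disaster, and then to leave the final decision to an operator. The sketch below illustrates that idea; the DisasterDetector class, the should_fail_over function, and the default of five probes are hypothetical and not part of any particular DR product.

```python
class DisasterDetector:
    """Flags a suspected disaster only after several consecutive failed probes."""

    def __init__(self, failures_required: int = 5) -> None:
        if failures_required < 1:
            raise ValueError("failures_required must be at least 1")
        self.failures_required = failures_required
        self.consecutive_failures = 0

    def record_probe(self, primary_reachable: bool) -> bool:
        """Record one health probe; return True if a disaster is now suspected."""
        if primary_reachable:
            # Any success resets the count, so short transient blips are ignored.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.suspected

    @property
    def suspected(self) -> bool:
        return self.consecutive_failures >= self.failures_required


def should_fail_over(detector: DisasterDetector, operator_confirmed: bool) -> bool:
    """Fail over only when the detector and a human operator both agree."""
    return detector.suspected and operator_confirmed
```

Probes sent from a single location cannot distinguish a failed primary site from a broken network path to it, which is one reason the operator confirmation step is kept in this sketch.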

In most cases, a disaster will eventually pass and a business will want to return control of its applications to the original site. To do this, the DR software must support bidirectional state replication so that any new data created at the backup site during the disaster can be transferred back to the primary. However, this can be a major challenge: the primary site may have lost an arbitrary amount of data due to the disaster, so the replication software must be able to determine which new and old state must be resynchronized to the original site. In addition, the failback procedure must be scheduled and implemented so as to minimize application downtime.
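A minimal sketch of the resynchronization step, assuming state is stored as numbered blocks and that the backup site is authoritative after a failover: blocks are compared by checksum, and only those that the primary lost or that changed at the backup are copied back. The block model and the function names are assumptions for illustration; a real replication product would exchange checksums over the network rather than compare data held in the same process.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    """Content fingerprint used to compare blocks without comparing raw data."""
    return hashlib.sha256(data).hexdigest()


def blocks_to_resync(primary: Dict[int, bytes], backup: Dict[int, bytes]) -> List[int]:
    """Return sorted IDs of blocks that must be copied from backup to primary."""
    stale = []
    for block_id, data in backup.items():
        surviving = primary.get(block_id)
        # Copy blocks the primary lost, or whose content differs from the backup.
        if surviving is None or checksum(surviving) != checksum(data):
            stale.append(block_id)
    return sorted(stale)


def fail_back(primary: Dict[int, bytes], backup: Dict[int, bytes]) -> int:
    """Make the primary mirror the backup; return the number of blocks copied."""
    to_copy = blocks_to_resync(primary, backup)
    for block_id in to_copy:
        primary[block_id] = backup[block_id]
    # Assumption: the backup is authoritative, so blocks that exist only on
    # the primary (for example, writes never replicated before the disaster)
    # are discarded rather than merged.
    for block_id in set(primary) - set(backup):
        del primary[block_id]
    return len(to_copy)
```

Copying only the differing blocks keeps the transfer small, which helps shorten the failback window and therefore the application downtime.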

 