Disaster Recovery

EMC VNX – MirrorView configuration

Last week I wrote a short introduction to MirrorView. This week I would like to go a little deeper into MirrorView configuration, terminology and usage. Let’s start with the terminology.

VNX MirrorView Key Terminology

Primary Image – LUN containing production data, the contents of which are replicated to the secondary image.
Secondary Image – LUN containing a mirror of the primary image LUN, residing on a different VNX (secondary site).
Image condition – Provides additional information about the status of updates for a secondary image.
State – Remote mirror states and image states.
Consistency Group – Set of mirrors that are managed as a single entity and whose secondary images always remain in a consistent and recoverable state with respect to their primary images and each other.
Consistency Group State – Indicates the current state of the consistency group.
Fracture – Condition in which I/O is not mirrored to the secondary image. Initiated manually by the administrator, or automatically by the system when it determines that the secondary image is unreachable.
Promote – Changes an image’s role from secondary to primary.

 Basic MirrorView Configuration

MirrorView allows for a large number of topologies and configurations. The primary and secondary images must have the same server-visible capacity (user capacity), because they are allowed to reverse roles for fail-over and fail-back (see the Promote definition above). However, the primary image and the secondary image can reside on different RAID configurations. In order to use MirrorView, the software needs to be loaded on both (primary and secondary) VNX arrays. Secondary LUNs are not accessible to hosts while mirroring is active. Bi-directional mirroring (a VNX array acting as both primary and secondary site) is supported, as long as the primary and secondary images within a mirror reside on different storage systems.

Consistency Groups

Consistency Groups allow all LUNs that belong to a given application to be treated as a single entity and managed as a whole. This helps to ensure that the remote images are consistent with each other. As a result, the remote images are always re-startable copies of the local images. When a mirror is part of a Consistency Group, most operations on individual members are prohibited (for example, fracture and synchronize can only be executed for the Consistency Group as a whole).

Site Level Fan-In

MirrorView supports a 4:1 fan-in ratio. This means that one VNX array can be the destination (secondary site) for up to 4 different primary VNX arrays. It’s a common configuration when a remote VNX array is used for consolidated backups, simplified failover or consolidated remote processing activities. The 4:1 fan-in ratio applies to both MirrorView/S and MirrorView/A.

LUN Level Fan-Out

Fan-out mirroring may be used to replicate data from one primary LUN to up to two secondary LUNs residing on different arrays (MirrorView/S 1:2 fan-out ratio). This configuration enables the administrator to synchronously mirror one primary image to two different secondary images. In the case of MirrorView/A, one primary image can be mirrored to only a single secondary image (MirrorView/A 1:1 fan-out ratio).
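The fan-in and fan-out limits described above can be sketched as a small validation routine. This is only an illustration of the rules as stated in this post: the function name check_topology, the tuple layout and the array names are all made up for the example, and nothing here talks to a real VNX.

```python
from collections import defaultdict

MAX_FAN_IN = 4                    # primary arrays per secondary array (4:1)
MAX_FAN_OUT = {"S": 2, "A": 1}    # secondary images per primary LUN


def check_topology(mirrors, mode):
    """Return a list of limit violations for mirrors of one mode ("S" or "A").

    mirrors is a list of (primary_array, primary_lun, secondary_array)
    tuples; this layout is an assumption made for the sketch.
    """
    problems = []
    sources = defaultdict(set)   # secondary array -> primary arrays feeding it
    targets = defaultdict(set)   # (primary array, LUN) -> secondary arrays

    for primary, lun, secondary in mirrors:
        if primary == secondary:
            problems.append(f"{primary}/{lun}: both images on the same array")
        sources[secondary].add(primary)
        targets[(primary, lun)].add(secondary)

    for secondary, primaries in sources.items():
        if len(primaries) > MAX_FAN_IN:
            problems.append(f"{secondary}: fan-in of {len(primaries)} exceeds {MAX_FAN_IN}")

    for (primary, lun), secondaries in targets.items():
        if len(secondaries) > MAX_FAN_OUT[mode]:
            problems.append(
                f"{primary}/{lun}: fan-out of {len(secondaries)} exceeds {MAX_FAN_OUT[mode]}"
            )
    return problems


# One LUN mirrored to two arrays: fine for MirrorView/S, too many for MirrorView/A.
two_targets = [("vnx1", "lun0", "vnx8"), ("vnx1", "lun0", "vnx9")]
print(check_topology(two_targets, "S"))
print(check_topology(two_targets, "A"))
```

The sketch checks one mode at a time, because the post does not say how MirrorView/S and MirrorView/A mirrors count against each other.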

 Port configuration

MirrorView ports are automatically assigned when the system is initialized. All MirrorView traffic goes through one dedicated port of each connection type (FC or iSCSI) per Storage Processor. On VNX systems that have both FC and iSCSI ports, one FC port and one iSCSI port are available for MirrorView traffic (per SP).

A path must exist between the MirrorView ports of SP-A of the primary system and SP-A of the secondary system. The same relationship must exist for Storage Processor B. MirrorView ports may be shared with host I/O, but that might cause performance issues.

 Mirrored Image States

Once an image has been mirrored, it may be in one of three availability states:

  • Inactive – the administrator has stopped mirroring.
  • Active – the normal state, in which all I/O is allowed on the image.
  • Attention – something has happened to the mirrored image and action by an administrator is required.

In terms of the mirrored image’s consistency and its relationship with the source image, MirrorView has five data states:

  • Out-of-Sync – a full synchronization is required.
  • In-Sync – the primary and secondary contain identical data and mirroring is in progress.
  • Rolling Back – the primary is being returned to a predefined point in time.
  • Consistent – mirroring has been stopped and a write intent log or fracture log is needed to continue the mirroring.
  • Synchronizing – a synchronization operation is in progress.

MirrorView Common Operations

Synchronization

Synchronization is a copy operation that MirrorView performs for newly created mirrors or to re-establish existing mirrors after an interruption. Initial synchronization is used to create a baseline copy of the primary image on the secondary. Primary images remain online during the sync process; until the synchronization is complete, the secondary image is unusable.

Promote

A secondary image is promoted to the role of primary when it is necessary to run production applications at the disaster recovery site. A promotion can only occur if the secondary image is in the consistent or synchronized state.

Fracture

A fracture stops MirrorView replication from the primary image to the secondary mirror. Administrative fractures are usually initiated to suspend replication, as opposed to a system fracture which is initiated by the MirrorView software. A system fracture typically means a communication failure between the primary and secondary systems.

With MirrorView/S, writes continue to the primary image during a fracture but are not replicated to the secondary. Replication can resume when the user issues a synchronize command.

With MirrorView/A, the current update stops during a fracture and no further updates will start until a synchronize request is issued.
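To make the MirrorView/S fracture behaviour a bit more concrete, here is a toy model in Python. It is a deliberately simplified sketch and not how the array implements it: the class ToyMirror, its methods and the pending set (a stand-in for the fracture log mentioned earlier) are all invented for illustration.

```python
class ToyMirror:
    """Toy model of a synchronous mirror; names and structure are hypothetical."""

    def __init__(self):
        self.primary = {}      # block number -> data on the primary image
        self.secondary = {}    # block number -> data on the secondary image
        self.fractured = False
        self.pending = set()   # blocks written on the primary while fractured

    def write(self, block, data):
        self.primary[block] = data
        if self.fractured:
            # The write still succeeds, but it is only remembered for later.
            self.pending.add(block)
        else:
            # Normal operation: the write is mirrored to the secondary.
            self.secondary[block] = data

    def fracture(self):
        self.fractured = True

    def synchronize(self):
        # Copy only the blocks that changed during the fracture.
        for block in self.pending:
            self.secondary[block] = self.primary[block]
        self.pending.clear()
        self.fractured = False

    def in_sync(self):
        return self.primary == self.secondary


mirror = ToyMirror()
mirror.write(0, "a")      # mirrored
mirror.fracture()
mirror.write(1, "b")      # lands on the primary only
print(mirror.in_sync())
mirror.synchronize()
print(mirror.in_sync())
```

The point of the sketch is that a fracture does not block production writes; it only delays their arrival on the secondary until a synchronize is issued.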

 

Data protection – NetApp way

When I say data protection I mean the features to back up data and to be able to recover it when needed. Basically you need to back up data for the following reasons:

  • to protect data against accidental file deletion, application crashes, viruses, data corruption etc.
  • to archive data for future use or for legal purposes
  • to recover from a disaster

NetApp has developed many methods of protecting data. Some of them require an extra license, while others are standard features of Data ONTAP.

 aggr copy

aggr copy gives you a fast block-level copy of the data stored in an aggregate. Just a quick reminder: all data served by NetApp is located on aggregates. With aggr copy you can make an exact copy of an existing aggregate, which means that all volumes and qtrees on the source aggregate are copied as well.
You can use aggr copy to copy an aggregate within the same filer or to another filer. If the destination is on another filer, make sure that rsh authentication is enabled on both the source and the destination.
The basic example:

filerB> aggr restrict aggr_dest
filerB> aggr copy start filerA:aggr_source filerB:aggr_dest

snapshot copy

NetApp allows you to create and maintain many snapshot copies, either manually or automatically. A snapshot doesn’t copy any data when it is created; it only preserves the blocks that are later changed or deleted in the active file system, so it consumes space only as the data diverges from the snapshot. It means that if you have a snapshot made yesterday at 12:00, you can at any time recover files, or even the whole snapshot image, as they were yesterday at 12:00.
The basic example:

filerA> snap create volume_01 snapshot_0001

With snapshots and SnapRestore (an extra license is needed) you can easily recover a single file or the whole volume from a snapshot.
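The idea that a snapshot copies nothing at creation time can be illustrated with a toy model. This is only a sketch to show the principle of keeping pointers to old blocks; the class ToyVolume and its methods are invented for the example and say nothing about how Data ONTAP is actually implemented.

```python
class ToyVolume:
    """Toy pointer-based volume; all names here are hypothetical."""

    def __init__(self):
        self.blocks = {}       # block id -> data (the "disk")
        self.active = {}       # file name -> block id (active file system)
        self.snapshots = {}    # snapshot name -> frozen {file name -> block id}
        self._next_id = 0

    def write(self, name, data):
        # New data always goes to a new block; old blocks are left untouched.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.active[name] = block_id

    def snap_create(self, snap_name):
        # Only the pointer table is copied, never the data blocks.
        self.snapshots[snap_name] = dict(self.active)

    def read(self, name):
        return self.blocks[self.active[name]]

    def restore_file(self, snap_name, name):
        # Point the active file back at the block the snapshot remembers.
        self.active[name] = self.snapshots[snap_name][name]


vol = ToyVolume()
vol.write("report.txt", "version from yesterday")
vol.snap_create("snapshot_0001")
vol.write("report.txt", "version from today")
vol.restore_file("snapshot_0001", "report.txt")
print(vol.read("report.txt"))
```

In this model the snapshot costs no extra data blocks until a file is overwritten, at which point the old block stays around because the snapshot still points to it.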

SnapMirror

With SnapMirror you can replicate a whole volume or a selected qtree to another location (an extra license is needed). You can run SnapMirror in three modes: sync, async and semi-sync. You can find more about SnapMirror in this post.

SnapVault

SnapVault is a backup feature that requires an extra license. With SnapVault you can back up an entire qtree and set up a different snapshot schedule on the destination. You can find more about SnapMirror vs SnapVault in this post.

vol copy

With vol copy you can copy all data from one volume to another, either on the same system or on a different one. Similar to aggr copy, you initiate a volume copy with the vol copy start command. The result is a restricted volume containing the same data as the source volume at the time you initiated the copy operation.

filerA> vol create vol1 aggr1 50g
filerB> vol create vol1_copy aggr1 50g
filerB> vol restrict vol1_copy
filerB> vol copy start filerA:vol1 filerB:vol1_copy
 […]
filerA> vol status -b
Volume      Block Size   Vol Size   FS Size
------      ----------   --------   -------
vol1        4096         4346752    4346752
filerB> vol status -b
Volume      Block Size   Vol Size   FS Size
------      ----------   --------   -------
vol1_copy   4096         4346752    4346752

filerB> vol online vol1_copy

Of course that’s just a simple example.

SyncMirror

SyncMirror provides continuous mirroring of data to two separate aggregates. This feature allows for real-time mirroring of data to matching aggregates physically connected to the same storage system.

RPO and RTO – Understanding the difference

Understanding RPO and RTO helps you answer two questions: How much downtime are you willing to tolerate? And, in the worst-case scenario, how much data are you willing to lose?
 
What is RPO?

RPO – Recovery Point Objective  – it is the point in time to which systems and data must be recovered after an outage. It defines the amount of data loss that a business can endure.

How to understand that? Simple – if you take a nightly backup of your data, your RPO is 24 hours, which means that in the worst-case scenario you will lose 24 hours of data.

There are a few general solutions for the RPO:

  • RPO of 24 hours – backups are created at an offsite tape library every night. The corresponding recovery strategy is to restore data from the last set of backup tapes
  • RPO of 1 hour – shipping database logs to the remote site every hour.
  • RPO in order of minutes – mirroring data asynchronously to the remote site
  • Near zero RPO – mirroring data synchronously to a remote site
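The list above can be read as a lookup from protection method to worst-case data loss. Here is a small sketch of that idea. The 24-hour and 1-hour figures come straight from the list; the values used for asynchronous and synchronous mirroring are placeholders picked for illustration, and the names METHODS and methods_meeting_rpo are made up.

```python
from datetime import timedelta

# Worst-case data loss per method. The last two values are assumptions:
# the list only says "in order of minutes" and "near zero".
METHODS = {
    "nightly tape backup": timedelta(hours=24),
    "hourly log shipping": timedelta(hours=1),
    "asynchronous mirroring": timedelta(minutes=5),
    "synchronous mirroring": timedelta(0),
}


def methods_meeting_rpo(target):
    """Return the methods whose worst-case data loss fits within the target RPO."""
    return [name for name, loss in METHODS.items() if loss <= target]


# Which methods are good enough if the business can lose at most one hour of data?
print(methods_meeting_rpo(timedelta(hours=1)))
```

A tighter RPO simply shrinks the list of acceptable methods, which is usually where the cost discussion starts.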

What is RTO?
 
RTO – Recovery Time Objective – the time within which systems and applications must be recovered after an outage. It defines the amount of downtime that a business can endure and survive.

There are a few general solutions for the RTO:

  • RTO of 72 hours – restore from tapes available at a cold site
  • RTO of 12 hours – restore from tapes available at a hot site
  • RTO of a few hours – use of a data vault at a hot site
  • RTO of a few seconds – cluster production servers with bidirectional mirroring (for example NetApp MetroCluster)

Explanation of the terms:
Data vault – a repository at a remote site to which data can be copied
Hot site – a site to which an enterprise’s operations can be moved in the event of a disaster. The site has the required hardware, OS, apps and network to perform business operations, and the equipment is available and running at all times
Cold site – a site to which an enterprise’s operations can be moved in the event of a disaster, with minimum IT infrastructure and environmental facilities in place, but not activated

 RTO vs RPO

To understand the difference between the two, study this example:

When reviewing the disaster recovery plan for two data centers, you find that:

  • The copy of data at remote Site B will lag behind the production data at Site A by 5 minutes
  • It will take 2 hours after an outage at Site A to shift production to Site B. 
  • Three more hours will be needed to power up the servers, bring up the network, and redirect users to Site B.

 

What is the recovery point objective (RPO) of this plan?


What is the recovery time objective (RTO) of this plan?
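
One way to work through the two questions, using only the definitions given earlier: the RPO is how far the copy at Site B lags behind production, and the RTO is the total time until users are working again at Site B. The short calculation below is a sketch of that reasoning; the variable names are made up, and it assumes the two recovery steps happen one after the other.

```python
from datetime import timedelta

# Figures taken from the example above.
replication_lag = timedelta(minutes=5)       # Site B lags Site A by 5 minutes
recovery_steps = [
    timedelta(hours=2),                      # shift production to Site B
    timedelta(hours=3),                      # power up servers, network, redirect users
]

# Data written in the last 5 minutes may not have reached Site B yet.
rpo = replication_lag

# Assumption: the steps run sequentially, so their durations add up.
rto = sum(recovery_steps, timedelta())

print("RPO:", rpo)
print("RTO:", rto)
```

Under these assumptions the plan has an RPO of 5 minutes and an RTO of 5 hours.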