HP Serviceguard Extended Distance Cluster for Linux A.01.00 Deployment Guide
Manufacturing Part Number: T2808-90006
May 2008, Second Edition
Evaluating the Need for Disaster Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
What is a Disaster Tolerant Architecture?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Understanding Types of Disaster Tolerant Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Extended Distance Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Cluster Extension (CLX) Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Continental Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Continental Cluster With Cascading Failover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Comparison of Disaster Tolerant Solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Disaster Tolerant Architecture Guidelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Protecting Nodes through Geographic Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Protecting Data through Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Using Alternative Power Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Creating Highly Available Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Disaster Tolerant Cluster Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Managing a Disaster Tolerant Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Additional Disaster Tolerant Solutions Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Building an Extended Distance Cluster Using Serviceguard and Software RAID . . . . . . . . . . 51
Types of Data Link for Storage and Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Rules for Separate Network and Data Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Guidelines on DWDM Links for Network and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Understanding Software RAID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Installing the Extended Distance Cluster Software . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Supported Operating Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Installing XDC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Verifying the XDC Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Configuring the Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Configuring Multiple Paths to Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Setting the Value of the Link Down Timeout Parameter . . . . . . . . . . . . . . . . . . . . . . 69
Using Persistent Device Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Creating a Multiple Disk Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
To Create and Assemble an MD Device. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Creating Volume Groups and Configuring VG Exclusive Activation on the MD Mirror .
Configuring the Package Control Script and RAID Configuration File . . . . . . . . . . . . 76
Creating and Editing the Package Control Scripts. . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Editing the raid.conf File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Viewing the Status of the MD Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Stopping the MD Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Starting the MD Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Removing and Adding an MD Mirror Component Disk . . . . . . . . . . . . . . . . . . . . . . . 101
Adding a Mirror Component Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Printing History

Table 1 Editions and Releases

Printing Date    Part Number    Edition      Operating System Releases (see Note below)
November 2006    T2808-90001    Edition 1    • Red Hat 4 U3 or later
                                             • Novell SUSE Linux Enterprise Server 9 SP3 or later
                                             • Novell SUSE Linux Enterprise Server 10 or later
May 2008         T2808-90006    Edition 2    • Red Hat 4 U3 or later
                                             • Novell SUSE Linux Enterprise Server 9 SP3 or later
                                             • Novell SUSE Linux Enterprise Server 10 or later
The printing date and part number indicate the current edition. The
printing date changes when a new edition is printed. (Minor corrections
and updates which are incorporated at reprint do not cause the date to
change.) The part number changes when extensive technical changes are
incorporated. New editions of this manual will incorporate all material
updated since the previous edition.
NOTE
This document describes a group of separate software products that are
released independently of one another. Not all products described in this
document are necessarily supported on all the same operating system
releases. Consult your product’s Release Notes for information about
supported platforms.
HP Printing Division:
Business Critical Computing Business Unit
Hewlett-Packard Co.
19111 Pruneridge Ave.
Cupertino, CA 95014
Preface
This guide introduces the concept of Extended Distance Clusters (XDC).
It describes how to configure and manage HP Serviceguard Extended
Distance Clusters for Linux and the associated Software RAID
functionality.
In addition, this guide includes information on a variety of
Hewlett-Packard (HP) high availability cluster technologies that provide
disaster tolerance for your mission-critical applications. Serviceguard has supported disaster tolerant clusters on HP-UX for several years, while such support is relatively new on Linux. Features of those disaster tolerant HP-UX systems may be used as examples throughout this document.
Intended Audience
It is assumed that you are familiar with the following topics:
The chapters of this guide include:

• Chapter 1, "Disaster Tolerance and Recovery in a Serviceguard Cluster," on page 13
• Chapter 2, "Building an Extended Distance Cluster Using Serviceguard and Software RAID," on page 51
• Chapter 3, "Configuring your Environment for Software RAID," on page 61
• Chapter 4, "Disaster Scenarios and Their Handling," on page 85
Related Publications

The following documents contain additional useful information:

• Clusters for High Availability: a Primer of HP Solutions, Second Edition. Hewlett-Packard Professional Books: Prentice Hall PTR, 2001 (ISBN 0-13-089355-2)
• Designing Disaster Tolerant HA Clusters Using Metrocluster and Continentalclusters (B7660-900xx)
• HP StorageWorks Cluster Extension EVA user guide
• HP StorageWorks Cluster Extension XP for HP Serviceguard for Linux
• HP Serviceguard for Linux Version A.11.16 Release Notes
• Managing HP Serviceguard for Linux

Use the following URL to access HP's High Availability web page:

• http://www.hp.com/go/ha
Problem Reporting

If you have problems with HP software or hardware products, please contact your HP support representative.
1  Disaster Tolerance and Recovery in a Serviceguard Cluster
This chapter introduces a variety of Hewlett-Packard (HP) high availability cluster technologies that provide disaster tolerance for your mission-critical applications.
This chapter covers the following topics:

• "Evaluating the Need for Disaster Tolerance" on page 14
• "What is a Disaster Tolerant Architecture?" on page 16
• "Understanding Types of Disaster Tolerant Clusters" on page 18
• "Disaster Tolerant Architecture Guidelines" on page 37
• "Managing a Disaster Tolerant Environment" on page 48
• "Additional Disaster Tolerant Solutions Information" on page 50
Evaluating the Need for Disaster Tolerance
Disaster tolerance is the ability to restore applications and data within a reasonable period of time after a disaster. Most people think of fire, flood, and earthquake as disasters, but a disaster can be any event that unexpectedly interrupts service or corrupts data in an entire data center: the backhoe that digs too deep and severs a network connection, or an act of sabotage.
Disaster tolerant architectures protect against unplanned down time due
to disasters by geographically distributing the nodes in a cluster so that
a disaster at one site does not disable the entire cluster. To evaluate your
need for a disaster tolerant solution, you need to weigh:
• Risk of disaster. Areas prone to tornadoes, floods, or earthquakes
may require a disaster recovery solution. Some industries need to
consider risks other than natural disasters or accidents, such as
terrorist activity or sabotage.
The type of disaster to which your business is prone, whether it is
due to geographical location or the nature of the business, will
determine the type of disaster recovery you choose. For example, if
you live in a region prone to big earthquakes, you are not likely to
put your alternate or backup nodes in the same city as your primary
nodes, because that sort of disaster affects a large area.
The frequency of the disaster also plays an important role in
determining whether to invest in a rapid disaster recovery solution.
For example, you would be more likely to protect from hurricanes
that occur seasonally, rather than protecting from a dormant
volcano.
• Vulnerability of the business. How long can your business afford to be
down? Some parts of a business may be able to endure a 1 or 2 day
recovery time, while others need to recover in a matter of minutes.
Some parts of a business only need local protection from single
outages, such as a node failure. Other parts of a business may need both
local protection and protection in case of site failure.
It is important to consider the role applications play in your
business. For example, you may target the assembly line production
servers as most in need of quick recovery. But if the most likely
disaster in your area is an earthquake, it would render the assembly
line inoperable as well as the computers. In this case disaster
recovery would be moot, and local failover is probably the more
appropriate level of protection.
On the other hand, you may have an order processing center that is
prone to floods in the winter. The business loses thousands of dollars
a minute while the order processing servers are down. A disaster
tolerant architecture is appropriate protection in this situation.
Deciding to implement a disaster recovery solution really depends on the
balance between risk of disaster, and the vulnerability of your business if
a disaster occurs. The following pages give a high-level view of a variety
of disaster tolerant solutions and sketch the general guidelines that you
must follow in developing a disaster tolerant computing environment.
What is a Disaster Tolerant Architecture?
In a Serviceguard cluster configuration, high availability is achieved by
using redundant hardware to eliminate single points of failure. This
protects the cluster against hardware faults, such as the node failure in
Figure 1-1.
Figure 1-1 High Availability Architecture (node 1 fails and pkg A fails over to node 2; each package has its own disks and mirrors, and client connections follow the package)
This architecture, which is typically implemented on one site in a single
data center, is sometimes called a local cluster. For some installations,
the level of protection given by a local cluster is insufficient. Consider the
order processing center where power outages are common during harsh
weather. Or consider the systems running the stock market, where
multiple system failures, for any reason, have a significant financial
impact. For these types of installations, and many more like them, it is
important to guard not only against single points of failure, but against
multiple points of failure (MPOF), or against single massive failures
that cause many components to fail, such as the failure of a data center,
of an entire site, or of a small area. A data center, in the context of
disaster recovery, is a physically proximate collection of nodes and disks,
usually all in one room.
Guarding against such massive failures requires a different type of cluster architecture called a disaster tolerant architecture. This architecture provides you with
the ability to fail over automatically to another part of the cluster or
manually to a different cluster after certain disasters. Specifically, the
disaster tolerant cluster provides appropriate failover in the case where
a disaster causes an entire data center to fail, as in Figure 1-2.
Figure 1-2 Disaster Tolerant Architecture
Understanding Types of Disaster Tolerant Clusters
To protect against multiple points of failure, cluster components must be geographically dispersed: nodes can be put in different rooms, on different floors of a building, or even in separate buildings or separate cities. The distance between the nodes depends on the type of disaster from which you need protection, and on the technology used to replicate data. Three types of disaster tolerant clusters are described in this guide:

• Extended Distance Clusters
• Cluster Extension (CLX) Cluster
• Continental Cluster
These types differ from a simple local cluster in many ways. Extended
distance clusters and metropolitan clusters often require right-of-way
from local governments or utilities to lay network and data replication
cables or connect to DWDMs. This can complicate the design and
implementation. They also require a different kind of control mechanism
for ensuring that data integrity issues do not arise, such as a quorum
server. Typically, extended distance and metropolitan clusters use an
arbitrator site containing a computer running a “quorum” application.
Continental clusters span great distances and operate by replicating
data between two completely separate local clusters.
NOTE
Continental clusters are not supported with HP Serviceguard for Linux.
They are described here to show the range of solutions that exist.
Extended Distance Clusters

An extended distance cluster (also known as extended campus cluster) is a normal Serviceguard cluster that has alternate nodes
located in different data centers separated by some distance, with a third
location supporting the quorum service. Extended distance clusters are
connected using a high speed cable that guarantees network access
between the nodes as long as all guidelines for disaster tolerant
architecture are followed. Extended distance clusters were formerly
known as campus clusters, but that term is not always appropriate
because the supported distances have increased beyond the typical size
of a single corporate campus. The maximum distance between nodes in
an extended distance cluster is set by the limits of the data replication
technology and networking limits. An extended distance cluster is shown
in Figure 1-3.
NOTE
There are no rules or recommendations on how far the third location
must be from the two main data centers. The third location can be as
close as the room next door with its own power source, or as far away as a site across town. The distance among all three locations dictates the
level of disaster tolerance an extended distance cluster can provide.
In an extended distance cluster, for data replication, the Multiple Disk
(MD) driver is used. Using the MD kernel driver, you can configure RAID
1 (mirroring) in your cluster. In a dual data center setup, to configure
RAID 1, one LUN from a storage device in data center 1 is coupled with a
LUN from a storage device in data center 2. As a result, the data that is
written to this MD device is simultaneously written to both devices. A
package that is running on one node in one data center has access to data from both storage devices.
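For illustration only, the following is a minimal sketch of how one LUN from each data center might be mirrored into a single MD RAID 1 device using the standard Linux mdadm tool. The device paths are hypothetical placeholders, and the supported procedure for an XDC cluster (including the package control script and raid.conf settings) is described in Chapter 3.

    # Minimal sketch, not the full XDC procedure described in Chapter 3.
    # The device paths are hypothetical placeholders for one LUN
    # presented from each data center.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/disk/by-id/scsi-DC1-LUN0 /dev/disk/by-id/scsi-DC2-LUN0

    # Every write to /dev/md0 is now mirrored to both LUNs, so either
    # data center holds a complete copy of the package data.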
The two recommended configurations for the extended distance cluster
are both described below.
Figure 1-3 Extended Distance Cluster
In the above configuration the network and FC links between the data
centers are combined and sent over common DWDM links. Two DWDM
links provide redundancy. When one of them fails, the other may still be
active and may keep the two data centers connected. Using the DWDM
link, clusters can now be extended to greater distances that were not possible earlier due to limits imposed by the Fibre Channel link for storage and Ethernet for networking. Storage in both data centers is connected to both nodes via two FC switches in order to provide multiple paths. This configuration supports a distance of up to 100 km between data center 1 and data center 2.
Figure 1-4 Two Data Center Setup
Figure 1-4 shows a configuration that is supported with separate
network and FC links between the data centers. In this configuration,
the FC links and the Ethernet networks are not carried over DWDM
links. But each of these links is duplicated between the two data centers,
for redundancy. The disadvantage of having the network and the FC
links separate is that if there is a link failure between sites, the ability to
exchange heartbeats and the ability to write mirrored data will not be
lost at the same time. This configuration is supported to a distance of 10
km between data centers.
All the nodes in the extended distance cluster must be configured with
QLogic driver’s multipath feature in order to provide redundancy in
connectivity to the storage devices. Mirroring for the storage is
configured such that each half of the mirror (disk set) is physically present in a different data center. Further, from each of the nodes there are
multiple paths to both of these mirror halves.
Also note that the networking shown in this configuration is the minimum. Additional network connections for extra heartbeats are recommended.
Benefits of Extended Distance Cluster

• This configuration implements a single Serviceguard cluster across two data centers, and uses the Multiple Device (MD) driver for data replication.
• You may choose any mix of Fibre Channel-based storage supported by Serviceguard that also supports the QLogic multipath feature.
• This configuration may be the easiest to understand, as it is similar in many ways to a standard Serviceguard cluster.
• Application failover is minimized. All disks are available to all nodes, so that if a primary disk fails but the node stays up and the replica is available, there is no failover (that is, the application continues to run on the same node while accessing the replica).
• Data copies are peers, so there is no issue with reconfiguring a replica to function as a primary disk after failover.
• Writes are synchronous, so data remains current between the primary disk and its replica, unless the link or disk is down. (One way to check the state of the mirror is sketched after this list.)
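As a point of reference, the state of an MD mirror can be inspected with standard Linux tools. The following is a minimal sketch; /dev/md0 is a hypothetical device name, and the procedures actually used in an XDC cluster are described later in this guide.

    # Minimal sketch: inspect the state of an MD mirror on a cluster node.
    cat /proc/mdstat          # summary of all MD devices and any resync progress
    mdadm --detail /dev/md0   # per-device view: active, degraded, or resyncing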
Cluster Extension (CLX) Cluster

A Linux CLX cluster is similar to an HP-UX metropolitan cluster and
is a cluster that has alternate nodes located in different parts of a city or
in nearby cities. Putting nodes further apart increases the likelihood that
alternate nodes will be available for failover in the event of a disaster.
The architectural requirements are the same as for an extended distance
cluster, with the additional constraint of a third location for arbitrator
node(s) or quorum server. And as with an extended distance cluster, the
distance separating the nodes in a metropolitan cluster is limited by the
data replication and network technology available.
In addition, there is no hard requirement on how far the third location
has to be from the two main data centers. The third location can be as
close as the room next door with its own power source, or as far away as a site across town. The distance between all three locations dictates
the level of disaster tolerance a metropolitan cluster can provide.
On Linux, the metropolitan cluster is implemented using CLX:

• CLX for XP
• CLX for EVA
For HP-UX, Metropolitan cluster architecture is implemented through
the following HP products:
• Metrocluster with Continuous Access XP
• Metrocluster with Continuous Access EVA
• Metrocluster with EMC SRDF
The above HP-UX products are described in detail in Chapters 3, 4, and
5 of the Designing Disaster Tolerant HA Clusters Using Metrocluster and
Continentalclusters user’s guide. The Linux products are described in
detail in the Getting Started with MC/ServiceGuard for Linux guide. While
there are some differences between the HP-UX and the Linux versions,
the concepts are similar enough that only Cluster Extension (CLX) will
be described here.
On-line versions of the above document and other HA documentation are
available at http://docs.hp.com -> High Availability.
On-line versions of the Cluster Extension documentation are available at http://h71028.www7.hp.com/enterprise/cache/120851-0-0-225-121.html -> HP StorageWorks Cluster Extension EVA or XP.
Figure 1-5 shows a CLX for a Linux Serviceguard cluster architecture.
Figure 1-5 CLX for Linux Serviceguard Cluster
A key difference between extended distance clusters and CLX clusters is
the data replication technology used. The extended distance cluster uses
Fibre Channel and Linux MD software mirroring for data replication.
CLX clusters provide extremely robust hardware-based data replication
available with specific disk arrays based on the capabilities of the HP
StorageWorks Disk Array XP series, or the HP StorageWorks EVA disk
arrays.
Benefits of CLX

• CLX offers a more resilient solution than Extended Distance Cluster,
as it provides complete integration between Serviceguard’s
application package and the data replication subsystem. The storage
subsystem is queried to determine the state of the data on the
arrays.
CLX knows that application package data is replicated between two
data centers. It takes advantage of this knowledge to evaluate the
status of the local and remote copies of the data, including whether
the local site holds the primary copy or the secondary copy of data,
whether the local data is consistent or not and whether the local data
is current or not. Depending on the result of this evaluation, CLX
decides if it is safe to start the application package, whether a
resynchronization of data is needed before the package can start, or
whether manual intervention is required to determine the state of
the data before the application package is started.
CLX allows for customization of the startup behavior for application
packages depending on your requirements, such as data currency or
application availability. This means that by default, CLX will always
prioritize data consistency and data currency over application
availability. If, however, you choose to prioritize availability over
currency, you can configure CLX to start up even when the state of
the data cannot be determined to be fully current (but the data is
consistent).
• CLX XP supports synchronous and asynchronous replication modes,
allowing you to prioritize performance over data currency between
the data centers.
• Because data replication and resynchronization are performed by the
storage subsystem, CLX may provide significantly better
performance than Extended Distance Cluster during recovery.
Unlike Extended Distance Cluster, CLX does not require any
additional CPU time for data replication, which minimizes the
impact on the host.
• There is little or no lag time writing to the replica, so the data
remains current.
• Data can be copied in both directions, so that if the primary site fails
and the replica takes over, data can be copied back to the primary
site when it comes back up.
• Disk resynchronization is independent of CPU failure (that is, if the
hosts at the primary site fail but the disk remains up, the disk knows
it does not have to be resynchronized).
Differences Between Extended Distance Cluster and CLX
The major differences between an Extended Distance Cluster and a CLX
cluster are:
• The methods used to replicate data between the storage devices in
the two data centers. The two basic methods available for replicating
data between the data centers for Linux clusters are either
host-based or storage array-based. Extended Distance Cluster
always uses host-based replication (MD mirroring on Linux). Any
(mix of) Serviceguard supported Fibre Channel storage can be
implemented in an Extended Distance Cluster. CLX always uses
array-based replication/mirroring, and requires storage from the
same vendor in both data centers (that is, a pair of XPs with
Continuous Access, or a pair of EVAs with Continuous Access).
• Data centers in an Extended Distance Cluster can span up to 100 km,
whereas the distance between data centers in a Metrocluster is
defined by the shortest of the following distances:
— Maximum distance that guarantees a network latency of no more
than 200ms
— Maximum distance supported by the data replication link
— Maximum supported distance for DWDM as stated by the
provider
• In an Extended Distance Cluster, there is no built-in mechanism for
determining the state of the data being replicated. When an
application fails over from one data center to another, the package is
allowed to start up if the volume group(s) can be activated. A CLX
implementation provides a higher degree of data integrity; that is,
the application is only allowed to start up based on the state of the
data and the disk arrays.
It is possible for data to be updated on the disk system local to a
server running a package without remote data being updated. This
happens if the data link between sites is lost, usually as a precursor
to a site going down. If that occurs and the site with the latest data
then goes down, that data is lost. The period of time from when the link is lost to when the site goes down is called the "recovery point". An
"objective" can be set for the recovery point such that if data is
updated for a period less than the objective, automated failover can
occur and a package will start. If the time is longer than the
objective, then the package will not start. In a Linux environment, this is a user-configurable parameter, RPO_TARGET (a sketch of how it might appear in the RAID configuration file follows this list).
• Extended Distance Cluster disk reads may outperform CLX in
normal operations. On the other hand, CLX data resynchronization
and recovery performance are better than Extended Distance
Cluster.
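As an illustration only, the following is a minimal sketch of how an RPO_TARGET setting might appear in the XDC RAID configuration file (raid.conf, whose editing is covered in Chapter 3). The value, units, and exact syntax shown here are assumptions; consult that chapter and the XDC release documentation for the supported format.

    # Hypothetical raid.conf fragment; parameter syntax and units are assumptions.
    # Intent: allow automated package startup only if the surviving mirror half
    # is no more than 60 seconds behind at the time of failover.
    RPO_TARGET=60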
Continental Cluster

A continental cluster provides an alternative disaster tolerant
solution in which distinct clusters can be separated by large distances,
with wide area networking used between them. Continental cluster
architecture is implemented using the Continentalclusters product,
described fully in Chapter 2 of the Designing Disaster Tolerant HA
Clusters Using Metrocluster and Continentalclusters user’s guide. This
product is available only on HP-UX and not on Linux. The design is
implemented with two distinct Serviceguard clusters that can be located far apart from each other, in a primary and recovery configuration. In this architecture, each cluster maintains its own
quorum, so an arbitrator data center is not used for a continental cluster.
A continental cluster can use any WAN connection through a TCP/IP
protocol; however, due to data replication needs, high speed connections
such as T1 or T3/E3 leased lines or switched lines may be required. See
Figure 1-6.
NOTE
A continental cluster can also be built using multiple clusters that
communicate over shorter distances using a conventional LAN.
Figure 1-6 Continental Cluster (a Los Angeles cluster in Data Center A running pkg A and pkg B on nodes 1a and 2a, and a New York cluster in Data Center B with recovery packages pkg A_R and pkg B_R on nodes 1b and 2b, connected over a WAN by a high availability network with data replication and/or mirroring between the sites)
Continentalclusters provides the flexibility to work with any data
replication mechanism. It provides pre-integrated solutions that use HP
StorageWorks Continuous Access XP, HP StorageWorks Continuous
Access EVA, or EMC Symmetrix Remote Data Facility for data
replication.
The points to consider when configuring a continental cluster over a
WAN are:
• Inter-cluster connections are TCP/IP based.
• The physical connection is one or more leased lines managed by a
common carrier. Common carriers cannot guarantee the same
reliability that a dedicated physical cable can. The distance can
introduce a time lag for data replication, which creates an issue with
data currency. This could increase the cost by requiring higher speed
WAN connections to improve data replication performance and
reduce latency.
• Operational issues, such as working with different personnel trained
on different processes, and conducting failover rehearsals, are made
more difficult the further apart the nodes are in the cluster.
Benefits of Continentalclusters

• You can build data centers virtually anywhere and still have the data
centers provide disaster tolerance for each other. Since
Continentalclusters uses two clusters, theoretically there is no limit
to the distance between the two clusters. The distance between the
clusters is dictated by the required rate of data replication to the
remote site, level of data currency, and the quality of networking
links between the two data centers.
• In addition, inter-cluster communication can be implemented with
either a WAN or LAN topology. LAN support is advantageous when
you have data centers in close proximity to each other, but do not
want the data centers configured into a single cluster. One example
may be when you already have two Serviceguard clusters close to
each other and, for business reasons, you cannot merge these two
clusters into a single cluster. If you are concerned with one of the
centers becoming unavailable, Continentalclusters can be added to
provide disaster tolerance. Furthermore, Continentalclusters can be
implemented with an existing Serviceguard cluster architecture
while keeping both clusters running, and provide flexibility by
supporting disaster recovery failover between two clusters that are
on the same subnet or on different subnets.
• You can integrate Continentalclusters with any storage component of
choice that is supported by Serviceguard. Continentalclusters
provides a structure to work with any type of data replication
mechanism. A set of guidelines for integrating other data replication
schemes with Continentalclusters is included in the Designing
Disaster Tolerant HA Clusters Using Metrocluster and
Continentalclusters user’s guide.
• Besides selecting your own storage and data replication solution, you
can also take advantage of the following HP pre-integrated solutions:
— Storage subsystems implemented by CLX are also pre-integrated
with Continentalclusters. Continentalclusters uses the same
data replication integration module that CLX implements to
check for data status of the application package before package
start up.
— If Oracle DBMS is used and logical data replication is the
preferred method, depending on the version, either Oracle 8i
Standby or Oracle 9i Data Guard with log shipping is used to
replicate the data between two data centers. HP provides a
supported integration toolkit for Oracle 8i Standby DB in the
Enterprise Cluster Management Toolkit (ECMT).
• RAC is supported by Continentalclusters by integrating it with
SGeRAC. In this configuration, multiple nodes in a single cluster can
simultaneously access the database (that is, nodes in one data center
can access the database). If the site fails, the RAC instances can be
recovered at the second site.
• Continentalclusters supports a maximum of 4 clusters with up to 16
nodes per cluster (for a maximum of 64 nodes) supporting up to 3
primary clusters and one recovery cluster.
• Failover for Continentalclusters is semi-automatic. If a data center
fails, the administrator is advised, and is required to take action to
bring the application up on the surviving cluster.
Continental Cluster With Cascading Failover

A continental cluster with cascading failover uses three main data
centers distributed between a metropolitan cluster, which serves as a
primary cluster, and a standard cluster, which serves as a recovery
cluster.
Cascading failover means that applications are configured to fail over
from one data center to another in the primary cluster and then to a
third (recovery) cluster if the entire primary cluster fails. Data
replication also follows the cascading model. Data is replicated from the
primary disk array to the secondary disk array in the Metrocluster, then
replicated to the third disk array in the Serviceguard recovery cluster.
For more information on Cascading Failover configuration and maintenance, see the "Cascading Failover in a Continental Cluster" white paper on the high availability documentation web site at http://docs.hp.com -> High Availability -> Continentalcluster.
Comparison of Disaster Tolerant Solutions
Table 1-1 summarizes and compares the disaster tolerant solutions that
are currently available:
Table 1-1 Comparison of Disaster Tolerant Cluster Solutions

Key Benefit

Extended Distance Cluster: Excellent in "normal" operations and partial failure. Since all hosts have access to both disks, in a failure where the node is running and the application is up, but the disk becomes unavailable, no failover occurs. The node will access the remote disk to continue processing.

CLX: Two significant benefits:
• Provides maximum data protection. State of the data is determined before the application is started. If necessary, data resynchronization is performed before the application is brought up.
• Better performance than Extended Distance Cluster for resynchronization, as replication is done by the storage subsystem (no impact to host).

Continentalclusters (HP-UX only): Increased data protection by supporting unlimited distance between data centers (protects against such disasters as those caused by earthquakes or violent attacks, where an entire area can be disrupted).

Key Limitation

Extended Distance Cluster: No ability to check the state of the data before starting up the application. If the volume group (vg) can be activated, the application will be started. If mirrors are split or multiple paths to storage are down, as long as the vg can be activated, the application will be started. Data resynchronization does not have a big impact on system performance. However, the performance varies depending on the number of times data resynchronization occurs. In the case of MD, data resynchronization is done one disk at a time, using about 10% of the available CPU time and taking longer to resynchronize multiple LUNs. The amount of CPU time used is a configurable MD parameter.

CLX: Specialized storage required. Currently, XP with Continuous Access and EVA with Continuous Access are supported.

Continentalclusters (HP-UX only): No automatic failover between clusters.

Maximum Distance

Extended Distance Cluster: 100 kilometers.

CLX: Shortest of the distances between:
• Cluster network latency (not to exceed 200 ms)
• Data replication maximum distance
• DWDM provider maximum distance

Continentalclusters (HP-UX only): No distance restrictions.

Data Replication Mechanism

Extended Distance Cluster: Host-based, through MD. Replication can affect performance (writes are synchronous). Resynchronization can impact performance. (Complete resynchronization is required in many scenarios that have multiple failures.)

CLX: Array-based, through Continuous Access XP or Continuous Access EVA. Replication and resynchronization are performed by the storage subsystem, so the host does not experience a performance hit. Incremental resynchronizations are done, based on a bitmap, minimizing the need for full re-syncs.

Continentalclusters (HP-UX only): You have a choice of either selecting your own SG-supported storage and data replication mechanism, or implementing one of HP's pre-integrated solutions (including Continuous Access XP, Continuous Access EVA, and EMC SRDF for array-based replication, or Oracle 8i Standby for host-based replication). You may also choose Oracle 9i Data Guard as a host-based solution; contributed (that is, unsupported) integration templates are available for Oracle 9i.

Application Failover Type

Extended Distance Cluster: Automatic (no manual intervention required).

CLX: Automatic (no manual intervention required).

Continentalclusters (HP-UX only): Semi-automatic (user must "push the button" to initiate recovery).

Access Mode for a Package

Extended Distance Cluster: Active/Standby.

CLX: Active/Standby.

Continentalclusters (HP-UX only): Active/Standby.

Client Transparency

Extended Distance Cluster: Client detects the lost connection. You must reconnect once the application is recovered at the second site.

CLX: Client detects the lost connection. You must reconnect once the application is recovered at the second site.

Continentalclusters (HP-UX only): You must reconnect once the application is recovered at the second site.

Maximum Cluster Size Allowed

Extended Distance Cluster: 2 nodes for this release.

CLX: 2 to 16 nodes.

Continentalclusters (HP-UX only): 1 to 16 nodes in each cluster, supporting up to 3 primary clusters and one recovery cluster (maximum total of 4 clusters, 64 nodes).

Storage

Extended Distance Cluster: Identical storage is not required (replication is host-based with MD mirroring).

CLX: Identical storage is required.

Continentalclusters (HP-UX only): Identical storage is required if storage-based mirroring is used. Identical storage is not required for other data replication implementations.

Data Replication Link

Extended Distance Cluster: Dark Fiber.

CLX: Dark Fiber, Continuous Access over IP, Continuous Access over ATM.

Continentalclusters (HP-UX only): WAN, LAN, Dark Fiber (pre-integrated solution), Continuous Access over IP (pre-integrated solution), Continuous Access over ATM (pre-integrated solution).

Cluster Network

Extended Distance Cluster: Single or multiple IP subnets.

CLX: Single or multiple IP subnets.

Continentalclusters (HP-UX only): Two configurations: a single IP subnet for both clusters (LAN connection between clusters), or two IP subnets, one per cluster (WAN connection between clusters).

DTS Software/Licenses Required

Extended Distance Cluster: SGLX + XDC.

CLX: SGLX + CLX XP or CLX EVA.

Continentalclusters (HP-UX only): SG + Continentalclusters + (Metrocluster Continuous Access XP, or Metrocluster Continuous Access EVA, or Metrocluster EMC SRDF, or Enterprise Cluster Master Toolkit), or a customer-selected data replication subsystem. CC with RAC: SG + SGeRAC + Continentalclusters.
Disaster Tolerant Architecture Guidelines
Disaster tolerant architectures represent a shift away from the massive
central data centers and towards more distributed data processing
facilities. While each architecture will be different to suit specific
availability needs, there are a few basic guidelines for designing a
disaster tolerant architecture so that it protects against the loss of an
entire data center:
• Protecting nodes through geographic dispersion
• Protecting data through replication
• Using alternative power sources
• Creating highly available networks
These guidelines are in addition to the standard high-availability
guidelines of redundant components, such as multiple paths to storage, redundant network cards, and redundant power supplies.
Protecting Nodes through Geographic Dispersion
Redundant nodes in a disaster tolerant architecture must be
geographically dispersed. If they are in the same data center, it is not a
disaster tolerant architecture. Figure 1-2 on page 17 shows a cluster
architecture with nodes in two data centers: A and B. If all nodes in data
center A fail, applications can fail over to the nodes in data center B and
continue to provide clients with service.
Depending on the type of disaster you are protecting against and on the
available technology, the nodes can be as close as another room in the
same building, or as far away as another city. The minimum
recommended dispersion is a single building with redundant nodes in
different data centers using different power sources. Specific
architectures based on geographic dispersion are discussed in the
following chapter.
Protecting Data through Replication
The most significant losses during a disaster are the loss of access to
data, and the loss of data itself. You protect against this loss through
data replication, that is, creating extra copies of the data. Data
replication should:
• Ensure data consistency by replicating data in a logical order so
that it is immediately usable or recoverable. Inconsistent data is
unusable and is not recoverable for processing. Consistent data may
or may not be current.
• Ensure data currency by replicating data quickly so that a replica
of the data can be recovered to include all committed disk writes that
were applied to the local disks.
• Ensure data recoverability so that there is some action that can be
taken to make the data consistent, such as applying logs or rolling a
database.
• Minimize data loss by configuring data replication to address
consistency, currency, and recoverability.
Different data replication methods have different advantages with
regards to data consistency and currency. Your choice of which data
replication methods to use will depend on what type of disaster tolerant
architecture you require.
Off-line Data Replication
Off-line data replication is the method most commonly used today. It
involves two or more data centers that store their data on tape and either
send it to each other (through an express service, if need dictates) or
store it off-line in a vault. If a disaster occurs at one site, the off-line copy
of data is used to synchronize data and a remote site functions in place of
the failed site.
Because data is replicated using physical off-line backup, data
consistency is fairly high, barring human error or an untested corrupt
backup. However, data currency is compromised by the time delay in
sending the tape backup to a remote site.
Off-line data replication is fine for many applications for which recovery
time is not an issue critical to the business. Although data might be
replicated weekly or even daily, recovery could take from a day to a week
depending on the volume of data. Some applications, depending on the
role they play in the business, may need to have a faster recovery time,
within hours or even minutes.
On-line Data Replication
On-line data replication is a method of copying data from one site to
another across a link. It is used when very short recovery time, from
minutes to hours, is required. To be able to recover use of a system in a
short time, the data at the alternate site must be replicated in real time
on all disks.
Data can be replicated either synchronously or asynchronously.
Synchronous replication requires one disk write to be completed and
replicated before another disk write can begin. This method improves the
chances of keeping data consistent and current during replication.
However, it greatly reduces replication capacity and performance, as well
as system response time. Asynchronous replication does not require
the primary site to wait for one disk write to be replicated before
beginning another. This can be an issue with data currency, depending
on the volume of transactions. An application that has a very large
volume of transactions can get hours or days behind in replication using
asynchronous replication. If the application fails over to the remote site,
it would start up with data that is not current.
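As a purely illustrative example with assumed numbers: if an application generates 5 MB of committed writes per second but the asynchronous replication link can sustain only 4 MB per second, the replica falls behind by 1 MB every second, or roughly 3.6 GB per hour; after a failover, the recovered application would be missing that much of the most recent data.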
Currently the two ways of replicating data on-line are physical data
replication and logical data replication. Either of these can be configured
to use synchronous or asynchronous writes.
Physical Data Replication
Each physical write to disk is replicated on another disk at another site. Physical replication is not application dependent. This allows each node to run different applications under
normal circumstances. Then, if a disaster occurs, an alternate node can
take ownership of applications and data, provided the replicated data is
current and consistent.
As shown in Figure 1-7, physical replication can be done in software or
hardware.
Figure 1-7 Physical Data Replication
MD Software RAID is an example of physical replication done in the
software; a disk I/O is written to each array connected to the node,
requiring the node to make multiple disk I/Os. Continuous Access XP on
the HP StorageWorks Disk Array XP series is an example of physical
replication in hardware; a single disk I/O is replicated across the
Continuous Access link to a second XP disk array.
Advantages of physical replication in hardware are:
• There is little or no lag time writing to the replica. This means that
the data remains very current.
• Replication consumes no additional CPU.
• The hardware deals with resynchronization if the link or disk fails.
And resynchronization is independent of CPU failure; if the CPU
fails and the disk remains up, the disk knows it does not have to be
resynchronized.
• Data can be copied in both directions, so that if the primary fails and
the replica takes over, data can be copied back to the primary when it
comes back up.
Disadvantages of physical replication in hardware are:
• The logical order of data writes is not always maintained in
synchronous replication. When a replication link goes down and
transactions continue at the primary site, writes to the primary disk
are queued in a bit-map. When the link is restored, if there has been
more than one write to the primary disk, then there is no way to
determine the original order of transactions until the
resynchronization has completed successfully. This increases the risk
of data inconsistency.
Also, because the replicated data is a write operation to a physical
disk block, database corruption and human errors, such as the
accidental removal of a database table, are replicated at the remote
site.
NOTE
Configuring the disk so that it does not allow a subsequent disk write
until the current disk write is copied to the replica (synchronous
writes) can limit this risk as long as the link remains up.
Synchronous writes impact the capacity and performance of the data
replication technology.
• Redundant disk hardware and cabling are required. This, at a
minimum, doubles data storage costs, because the technology is in
the disk itself and requires specialized hardware.
• For architectures using dedicated cables, the distance between the
sites is limited by the cable interconnect technology. Different
technologies support different distances and provide different “data
through” performance.
• For architectures using common carriers, the costs can vary
dramatically, and the connection can be less reliable, depending on
the Service Level Agreement.
Advantages of physical replication in software are:
• There is little or no time lag between the initial and replicated disk
I/O, so data remains very current.
• The solution is independent of disk technology, so you can use any
supported disk technology.
• Data copies are peers, so there is no issue with reconfiguring a
replica to function as a primary disk after failover.
• Because there are multiple read devices, that is, the node has access
to both copies of data, there may be improvements in read
performance.
• Writes are synchronous unless the link or disk is down.
Disadvantages of physical replication in software are:
• As with physical replication in the hardware, the logical order of data
writes is not maintained. When the link is restored, if there has been
more than one write to the primary disk, there is no way to
determine the original order of transactions until the
resynchronization has completed successfully.
NOTE
Configuring the software so that a write to disk must be replicated
on the remote disk before a subsequent write is allowed can limit the
risk of data inconsistency while the link is up.
• Additional hardware is required for the cluster.
• Distance between sites is limited by the physical disk link
capabilities.
• Performance is affected by many factors: CPU overhead for
mirroring, double I/Os, degraded write performance, and CPU time
for resynchronization. In addition, CPU failure may cause a
resynchronization even if it is not needed, further affecting system
performance.
Logical Data Replication
Logical data replication is a method of replicating data by repeating the
sequence of transactions at the remote site. Logical replication often
must be done at both the file system level, and the database level in
order to replicate all of the data associated with an application. Most
database vendors have one or more database replication products. An
example is the Oracle Standby Database.
Logical replication can be configured to use synchronous or
asynchronous writes. Transaction processing monitors (TPMs) can also
perform logical replication.
Figure 1-8 Logical Data Replication (logical replication in software from node 1 to node 1a over the network; there is no direct access to both copies of data)
Advantages of using logical replication are:
• The distance between nodes is limited only by the networking
technology.
• There is no additional hardware needed to do logical replication,
unless you choose to boost CPU power and network bandwidth.
• Logical replication can be implemented to reduce risk of duplicating
human error. For example, if a database administrator erroneously
removes a table from the database, a physical replication method will
duplicate that error at the remote site as a raw write to disk. A
logical replication method can be implemented to delay applying the
data at a remote site, so such errors would not be replicated at the
remote site. This also means that administrative tasks, such as
adding or removing database tables, have to be repeated at each site.
• With database replication you can roll transactions forward or
backward to achieve the level of currency desired on the replica,
although this functionality is not available with file system
replication.
Disadvantages of logical replication are:
• It uses significant CPU overhead because transactions are often
replicated more than once and logged to ensure data consistency, and
all but the most simple database transactions take significant CPU.
It also uses network bandwidth, whereas most physical replication
methods use a separate data replication link. As a result, there may
be a significant lag in replicating transactions at the remote site,
which affects data currency.
• If the primary database fails and is corrupt, which results in the
replica taking over, then the process for restoring the primary
database so that it can be used as the replica is complex. This often
involves recreating the database and doing a database dump from
the replica.
• Applications often have to be modified to work in an environment
that uses a logical replication database. Logic errors in applications
or in the RDBMS code itself that cause database corruption will be
replicated to remote sites. This is also an issue with physical
replication.
• Most logical replication methods do not support personality
swapping, which is the ability after a failure to allow the secondary
site to become the primary and the original primary to become the
new secondary site. This capability can provide increased up time.
Ideal Data Replication
The ideal disaster tolerant architecture, if budgets allow, is the following
combination:
• For performance and data currency—physical data replication.
• For data consistency—either a second physical data replication as a
point-in-time snapshot or logical data replication, which would only
be used in the cases where the primary physical replica was corrupt.
Using Alternative Power Sources
In a high-availability cluster, redundancy is applied to cluster
components, such as multiple paths to storage, redundant network cards, and redundant power supplies. In a disaster tolerant architecture, an additional level of protection is required for these redundancies.
Each data center that houses part of a disaster tolerant cluster should be
supplied with power from a different circuit. In addition to a standard
UPS (uninterrupted power supply), each node in a disaster tolerant
cluster should be on a separate power circuit; see Figure 1-9.
Figure 1-9 Alternative Power Sources (nodes 1 and 2 in Data Center A on separate power circuits 1 and 2; nodes 3 and 4 in Data Center B on separate power circuits 3 and 4)
Housing remote nodes in another building often implies they are
powered by a different circuit, so it is especially important to make sure
all nodes are powered from a different source if the disaster tolerant
cluster is located in two data centers in the same building. Some disaster
tolerant designs go as far as making sure that their redundant power
source is supplied by a different power substation on the grid. This adds
protection against large-scale power failures, such as brown-outs,
sabotage, or electrical storms.
Creating Highly Available Networking
Standard high-availability guidelines require redundant networks.
Redundant networks may be highly available, but they are not disaster
tolerant if a single accident can interrupt both network connections. For
example, if you use the same trench to lay cables for both networks, you
do not have a disaster tolerant architecture because a single accident,
such as a backhoe digging in the wrong place, can sever both cables at
once, making automated failover during a disaster impossible.
In a disaster tolerant architecture, the reliability of the network is
paramount. To reduce the likelihood of a single accident causing both
networks to fail, redundant network cables should be installed so that
they use physically different routes for each network. How you route
cables will depend on the networking technology you use. Specific
guidelines for some network technologies are listed here.
Disaster Tolerant Local Area Networking
Ethernet networks can also be used to connect nodes in a disaster
tolerant architecture within the following guidelines:
•
Each node is connected to redundant switches and bridges using two
Ethernet host adapters. Bridges, repeaters, or other components that
convert from copper to fibre cable may be used to span longer
distances.
Disaster Tolerant Wide Area Networking
Disaster tolerant networking for continental clusters is directly tied to
the data replication method. In addition to the redundant lines
connecting the remote nodes, you also need to consider what bandwidth
you need to support the data replication method you have chosen. A
continental cluster that handles a high number of transactions per
minute will not only require a highly available network, but also one
with a large amount of bandwidth.
This is a brief discussion of things to consider when choosing the network
configuration for your continental cluster. Details on WAN choices and
configurations can be found in Continental Cluster documentation
available from: http://docs.hp.com -> High Availability.
•
Bandwidth affects the rate of data replication, and therefore the
currency of the data should there be the need to switch control to
another site. The greater the number of transactions you process, the
more bandwidth you will need. The following connection types offer
differing amounts of bandwidth:
— T1 and T3: low end
— ISDN and DSL: medium bandwidth
— ATM: high end
•
Reliability affects whether or not data replication happens, and
therefore the consistency of the data should you need to fail over to
the recovery cluster. Redundant leased lines should be used, and
should be from two different common carriers, if possible.
•
Cost influences both bandwidth and reliability. Higher bandwidth
and dual leased lines cost more. It is best to address data consistency
issues first by installing redundant lines, then weigh the price of
data currency and select the line speed accordingly.
Disaster Tolerant Cluster Limitations
Disaster tolerant clusters have limitations, some of which can be
mitigated by good planning. Some examples of multiple points of failure
(MPOF) that may not be covered by disaster tolerant configurations are:
•
Failure of all networks among all data centers — This can be
mitigated by using a different route for all network cables.
•
Loss of power in more than one data center — This can be mitigated
by making sure data centers are on different power circuits, and
redundant power supplies are on different circuits. If power outages
are frequent in your area, and down time is expensive, you may want
to invest in a backup generator.
•
Loss of all copies of the on-line data — This can be mitigated by
replicating data off-line (frequent backups). It can also be mitigated
by taking snapshots of consistent data and storing it on-line;
Business Copy XP and EMC Symmetrix BCV (Business Consistency
Volumes) provide this functionality and the additional benefit of
quick recovery should anything happen to both copies of on-line data.
•
A rolling disaster is a disaster that occurs before the cluster is able
to recover from a failure that is not normally considered a “disaster”.
An example is a data replication link that fails, then, as it is being
restored and data is being resynchronized, a disaster causes an
entire data center to fail. The effects of rolling disasters can be
mitigated by ensuring that a copy of the data is stored either off-line
or on a separate disk that can quickly be mounted. The trade-off is a
lack of currency for the data in the off-line copy.
Managing a Disaster Tolerant Environment
In addition to the changes in hardware and software to create a disaster
tolerant architecture, there are also changes in the way you manage the
environment. Configuration of a disaster tolerant architecture needs to
be carefully planned, implemented and maintained. There are additional
resources needed, and additional decisions to make concerning the
maintenance of a disaster tolerant architecture:
•
Manage it in-house, or hire a service?
Hiring a service can remove the burden of maintaining the capital
equipment needed to recover from a disaster. Most disaster recovery
services provide their own off-site equipment, which reduces
maintenance costs. Often the disaster recovery site and equipment
are shared by many companies, further reducing cost.
Managing disaster recovery in-house gives complete control over the
type of redundant equipment used and the methods used to recover
from disaster, giving you complete control over all means of recovery.
•
Implement automated or manual recovery?
Manual recovery costs less to implement and gives more flexibility in
making decisions while recovering from a disaster. Evaluating the
data and making decisions can add to recovery time, but it is justified
in some situations, for example if applications compete for resources
following a disaster and one of them has to be halted.
Automated recovery reduces the amount of time and in most cases
eliminates human intervention needed to recover from a disaster.
You may want to automate recovery for any number of reasons:
— Automated recovery is usually faster.
— Staff may not be available for manual recovery, as is the case
with “lights-out” data centers.
— Reduction in human intervention is also a reduction in human
error. Disasters don’t happen often, so lack of practice and the
stressfulness of the situation may increase the potential for
human error.
— Automated recovery procedures and processes can be
transparent to the clients.
Even if recovery is automated, you may choose to, or need to, recover
from some types of disasters manually. A rolling disaster, which is a
disaster that happens before the cluster has recovered from a previous
disaster, is an example of when you may want to switch over manually.
If the data link failed and then, as it was coming up and
resynchronizing data, the data center failed, you would want human
intervention to make judgment calls on which site had the most current
and consistent data before failing over.
•
Who manages the nodes in the cluster and how are they trained?
Putting a disaster tolerant architecture in place without planning for
the people aspects is a waste of money. Training and documentation
are more complex because the cluster is in multiple data centers.
Each data center often has its own operations staff with their own
processes and ways of working. These operations people will now be
required to communicate with each other and coordinate
maintenance and failover rehearsals, as well as working together to
recover from an actual disaster. If the remote nodes are placed in a
“lights-out” data center, the operations staff may want to put
additional processes or monitoring software in place to maintain the
nodes in the remote location.
Rehearsals of failover scenarios are important for staying prepared. A
written plan should outline what to do in each disaster case, with a
minimum recommended rehearsal schedule of once every 6 months, ideally
once every 3 months.
•
How is the cluster maintained?
Planned downtime and maintenance, such as backups or upgrades,
must be more carefully thought out because they may leave the
cluster vulnerable to another failure. For example, nodes need to be
brought down for maintenance in pairs: one node at each site, so that
quorum calculations do not prevent automated recovery if a disaster
occurs during planned maintenance.
Rapid detection of failures and rapid repair of hardware is essential
so that the cluster is not vulnerable to additional failures.
Testing is more complex and requires personnel in each of the data
centers. Site failure testing should be added to the current cluster
testing plans.
Additional Disaster Tolerant Solutions Information
On-line versions of HA documentation are available at
http://docs.hp.com -> High Availability -> Serviceguard for Linux.
For information on CLX for EVA and XP, see the following documents,
available at
http://h71028.www7.hp.com/enterprise/cache/120851-0-0-225-121.html
-> HP StorageWorks Cluster Extension for EVA or XP:
•
HP StorageWorks Cluster Extension EVA user guide
•
HP StorageWorks Cluster Extension XP for HP Serviceguard for Linux
2
Building an Extended Distance Cluster Using Serviceguard and Software RAID
Simple Serviceguard clusters are usually configured in a single data
center, often in a single room, to provide protection against failures in
CPUs, interface cards, and software. Extended Serviceguard clusters are
specialized cluster configurations, which allow a single cluster to extend
across two separate data centers for increased disaster tolerance.
Depending on the type of links employed, distances of up to 100 kms
between the data centers are possible. This chapter describes how to
build such a cluster using Serviceguard, Software RAID (the MD driver),
and Fibre Channel. Both two data center and three data center
architectures are covered in the following sections:
•
“Types of Data Link for Storage and Networking” on page 52
•
“Two Data Center and Quorum Service Location Architectures” on page 53
•
“Rules for Separate Network and Data Links” on page 57
•
“Guidelines on DWDM Links for Network and Data” on page 58
Types of Data Link for Storage and Networking
Fibre Channel technology lets you increase the distance between the
components in a Serviceguard cluster, thus making it possible to design
a disaster tolerant architecture. The following table shows some of the
distances possible with a few of the available technologies, including
some of the fiber optic alternatives.
Table 2-1   Link Technologies and Distances

Type of Link                                Maximum Distance Supported
Gigabit Ethernet Twisted Pair               50 meters
Short Wave Fiber                            500 meters
Long Wave Fiber                             10 kilometers
Dense Wave Division Multiplexing (DWDM)     100 kilometers
The development of DWDM technology allows designers to use dark fiber
(high speed communication lines provided by common carriers) to extend
the distances that were formerly subject to limits imposed by Fibre
Channel for storage and Ethernet for network links.
NOTE
Increased distance often means increased cost and reduced speed of
connection. Not all combinations of links are supported in all cluster
types. For a current list of supported configurations and supported
distances, see the HP Configuration Guide, available through your HP
representative.
Two Data Center and Quorum Service Location Architectures
A two data center and Quorum Service location architecture, where the
Quorum Service is at a third location, has the following configuration
requirements:
NOTE
There is no hard requirement on how far the Quorum Service location
has to be from the two main data centers. It can be as close as the room
next door with its own power source or can be as far as in another site
across town. The distance between all three locations dictates the level
of disaster tolerance a cluster can provide.
•
In these solutions, there must be an equal number of nodes in each
primary data center, and the third location (known as the arbitrator
data center) contains the Quorum Server. LockLUN is not supported
in a Disaster Tolerant configuration. In this release, only one node in
each data center is supported.
•
The Quorum Server is used as a tie-breaker to maintain cluster
quorum when all communication between the two primary data
centers is lost. The arbitrator data center must be located separately
from the primary data centers. For more information about quorum
server, see the Managing Serviceguard user’s guide and the
Serviceguard Quorum Server Release Notes.
•
A minimum of two heartbeat paths must be configured for all cluster
nodes. The preferred solution is two separate heartbeat subnets
configured in the cluster, each going over a separately routed
network path to the other data center. Alternatively, there can be a
single dedicated heartbeat subnet with a bonded pair configured for
it. Each would go over a separately routed physical network path to
the other data centers.
•
There can be separate networking and Fibre Channel links between
the data centers, or both networking and Fibre Channel can go over
DWDM links between the data centers.
•
Fibre Channel Direct Fabric Attach (DFA) is recommended over
Fibre Channel Arbitrated loop configurations, due to the superior
performance of DFA, especially as the distance increases. Therefore
Fibre Channel switches are recommended over Fibre Channel hubs.
•
Any combination of the following Fibre Channel capable disk arrays
may be used: HP StorageWorks 1000 and 1500 series Modular
Storage Arrays, HP StorageWorks Enterprise Virtual Arrays, or HP
StorageWorks Disk Array XP.
•
For disaster tolerance, application data must be mirrored between
the primary data centers. You must ensure that the mirror copies
reside in different data centers, as the software cannot determine the
locations.
NOTE
When a failure results in the mirror copies losing synchronization,
MD will perform a full resynchronization when both halves of the
mirror are available.
•
No routing is allowed for the networks between data centers. Routing
is allowed to the third data center if a Quorum Server is used in that
data center.
The following is a list of recommended arbitration methods for Extended
Distance Cluster solutions in order of preference:
•
Quorum Server running in a Serviceguard cluster
•
Quorum Server
For more information on Quorum Server, see the Serviceguard Quorum
Server Release Notes for Linux.
Figure 2-1 is an example of a two data center and third location
configuration using DWDM, with a quorum server node on the third site.
Figure 2-1   Two Data Centers and Third Location with DWDM and Quorum Server
The DWDM boxes connected between the two Primary Data Centers are
configured with redundant dark fiber links and the standby fibre feature
has been enabled.
There are no requirements for the distance between the Quorum Server
data center and the Primary Data Centers; however, you must ensure
that the Quorum Server can be contacted within a reasonable amount of
time (within the NODE_TIMEOUT period). LockLUN arbitration is not
allowed in this configuration.
Rules for Separate Network and Data Links
•
There must be less than 200 milliseconds of latency in the network
between the data centers.
•
No routing is allowed for the networks between the data centers.
Routing is allowed to the third data center if a Quorum Server is
used in that data center.
•
The maximum distance between the data centers for this type of
configuration is currently limited by the maximum distance
supported for the networking type or Fibre Channel link type being
used, whichever is shorter.
•
There can be a maximum of 500 meters between the Fibre Channel
switches in the two data centers if Short-wave ports are used. This
distance can be increased to 10 kilometers by using a Long-wave
Fibre Channel port on the switches. If DWDM links are used, the
maximum distance between the data centers is 100 kilometers. For
more information on link technologies, see Table 2-1 on page 52.
•
There must be at least two alternately routed networking links
between each primary data center to prevent the “backhoe problem”.
The “backhoe problem” can occur when all cables are routed through
a single trench and a tractor on a construction job severs all cables
and disables all communications between the data centers. It is
allowable to have only a single network link routed from each
primary data center to the third location, however in order to survive
the loss of the network link between a primary data center and the
arbitrator data center, the network routing should be configured so
that a primary data center can also reach the arbitrator via a route
which passes through the other primary data center.
•
There must be at least two alternately routed Fibre Channel Data
Replication links between each data center.
See the HP Configuration Guide (available through your HP
representative) for a list of supported Fibre Channel hardware.
Guidelines on DWDM Links for Network and Data
•
There must be less than 200 milliseconds of latency in the network
between the data centers.
•
No routing is allowed for the networks between the data centers.
Routing is allowed to the third data center if a Quorum Server is
used in that data center.
•
The maximum distance supported between the data centers for
DWDM configurations is 100 kilometers.
•
Both the networking and Fibre Channel Data Replication can go
through the same DWDM box - separate DWDM boxes are not
required.
•
Since DWDM converters are typically designed to be fault tolerant, it
is acceptable to use only one DWDM box (in each data center) for the
links between each data center. However, for the highest availability,
it is recommended to use two separate DWDM boxes (in each data
center) for the links between each data center. If using a single
DWDM box for the links between each data center the redundant
standby fibre link feature of the DWDM box must be configured. If
the DWDM box supports multiple active DWDM links, that feature
can be used instead of the redundant standby feature.
•
At least two dark fiber optic links are required between each
Primary data center, each fibre link routed differently to prevent the
“backhoe problem.” It is allowable to have only a single fibre link
routed from each Primary data center to the third location, however
in order to survive the loss of a link between a Primary data center
and the third data center, the network routing should be configured
so that a Primary data center can also reach the Arbitrator via a
route passing through the other Primary data center.
•
The network switches in the configuration must support DLPI (link
level) packets. The network switch can be 100BaseT (TX or FX),
1000BaseT (TX or FX) or FDDI. The connection between the network
switch and the DWDM box must be fiber optic.
•
Fibre Channel switches must be used in a DWDM configuration;
Fibre Channel hubs are not supported. Direct Fabric Attach mode
must be used for the ports connected to the DWDM link.
See the HP Configuration Guide, available through your HP
representative, for more information on supported devices.
3
Configuring your Environment for Software RAID
This chapter describes how to configure Software RAID in an extended
distance cluster. It discusses the procedures you must follow to deploy
Software RAID in your cluster. Following are the topics discussed in
this chapter:
•
“Understanding Software RAID” on page 62
•
“Installing the Extended Distance Cluster Software” on page 63
•
“Configuring the Environment” on page 66
Understanding Software RAID
Redundant Array of Independent Disks (RAID) is a mechanism that
provides storage fault tolerance and, occasionally, better performance.
Software RAID is designed on the concept of RAID 1. RAID 1 uses
mirroring where data is written to two disks at the same time.
The Serviceguard XDC product uses the Multiple Device (MD) driver
and its associated tool mdadm to implement Software RAID. With
Software RAID, two disks (or disk sets) are configured so that the same
data is written on both disks as one "write transaction". So if data from
one disk set is lost, or if one disk set is rendered unavailable, the data is
always available from the second disk set. As a result, high availability of
data is guaranteed. In an extended distance cluster, the two disk sets are
in two physically separated locations, so if one location becomes
unavailable, the other location still has the data.
For more information on Linux Software RAID, see The Software-RAID
HOWTO manual available at:
http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html
To enable Software RAID in your extended distance cluster, you need to
complete the following:
1. Install the extended distance cluster software.
2. Copy the files into package directories.
3. Configure packages that will use Software RAID.
The subsequent sections include information on installing Extended
Distance Cluster software, and configuring your environment for
Software RAID.
Installing the Extended Distance Cluster Software
This section discusses the supported operating systems, prerequisites
and the procedures for installing the Extended Distance Cluster
software.
Supported Operating Systems
The Extended Distance Cluster software supports the following
operating systems:
•
Red Hat 4 U3 or later
•
Novell SUSE Linux Enterprise Server 9 SP3 or later
•
Novell SUSE Linux Enterprise Server 10 or later
Prerequisites
Following are the prerequisites for installing Extended Distance Cluster
software (XDC):
•
HP Serviceguard for Linux A.11.16.07 or higher
•
Network Time Protocol (NTP) - all nodes in the cluster must point to
the same NTP server (see the sample entry after this list).
•
QLogic Driver - The version number of this driver depends on the
version of the QLogic cards in your environment. Download the
appropriate version of the driver from the following location:
http://www.hp.com -> Software and Driver Downloads
Select the Download drivers and software (and firmware) option.
Enter the HBA name and click >>. If more than one result is
displayed, download the appropriate driver for your operating
system.
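For example, a minimal way to satisfy the NTP prerequisite (the server
name below is a placeholder, and the init script name varies between
Red Hat and SUSE) is to add the same server entry to /etc/ntp.conf on
every node and restart the NTP daemon:

# grep ^server /etc/ntp.conf
server ntp.example.com
# /etc/init.d/ntpd restart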
Installing XDC
You can install XDC from the product CD. You must install the XDC
software on all nodes in a cluster to enable Software RAID.
Complete the following procedure to install XDC:
1. Insert the product CD into the drive and mount the CD.
2. Open the command line interface.
3. If you are installing XDC on Red Hat 4, run the following command:
# rpm -Uvh xdc-A.01.00-0.rhel4.noarch.rpm
4. If you are installing XDC on Novell SUSE Linux Enterprise Server 9,
run the following command:
# rpm -Uvh xdc-A.01.00-0.sles9.noarch.rpm
5. If you are installing XDC on Novell SUSE Linux Enterprise Server
10, run the following command:
# rpm -Uvh xdc-A.01.00-0.sles10.noarch.rpm
This command initializes the XDC software installation. After you
install XDC, you need to copy the raid.conf.template file into each
package directory for which you need to enable Software RAID.
6. Run the following command to copy the raid.conf.template file as the
raid.conf file into each package directory:
# cp $SGROOT/xdc/raid.conf.template \
$SGCONF/<pkgdir>/raid.conf
The file is copied into the package directory.
NOTE
Installing the Extended Distance Cluster software does not enable
Software RAID for every package in your environment. You need to
manually enable Software RAID for a package by copying the files into
the package directories. Also, if you need to enable Software RAID for
more than one package in your environment, you need to copy the files
and templates into each of those package directories. You must edit these
template files later.
Verifying the XDC Installation
After you install XDC, run the following command to ensure that the
software is installed:
# rpm -qa | grep xdc
In the output, the product name, xdc-A.01.00-0, will be listed. The
presence of this entry verifies that the installation is successful.
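For example, on a node where the software has been installed, the query
returns output similar to the following (the exact package string
depends on which distribution-specific RPM was installed):

# rpm -qa | grep xdc
xdc-A.01.00-0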
Configuring the Environment
After setting up the hardware as described in the Extended Distance
Cluster Architecture section and installing the Extended Distance
Cluster software, complete the following steps to enable Software RAID
for each package. Subsequent sections describe each of these processes in
detail.
1. Configure multipath for storage
In the Extended Distance Cluster setup described in figures 1 and 2,
a node has multiple paths to storage. With this setup, each LUN
exposed from a storage array shows up as two devices on every node:
there are two device entries in the /dev directory for the same LUN,
where each device entry pertains to a single path to that LUN. Once
multipath is configured, only one device entry will appear in their
place. This happens for devices from both the storage systems.
Creating these multiple links to the storage device ensures that each
node is not dependent only on one link to write data to that storage
device. For more information on configuring multipath, see
“Configuring Multiple Paths to Storage” on page 69.
2. Configure persistent device names for storage devices
Once multipath has been configured, you need to create persistent
device names for the storage devices, because after a failure or
reboot the devices can be renamed or reoriented. Since the MD mirror
device starts with the names of the component devices, a change in the
device name prevents the MD mirror from starting.
make the device names persistent. For more information on
configuring persistent device names, see “Using Persistent Device
Names” on page 71.
3. Create the MD mirror device
To enable Software RAID in your environment, you need to first
create the mirror setup. This implies that you specify two disks to
create a Multiple Device (MD). When configuring disks in RAID 1
level, use a disk or LUN from each datacenter as one mirror half. Be
sure to create disk sets of the same size as they need to store data
that are of identical sizes. Differences in disk set size result in a
mirror being created of a size equal to the smaller of the two disks.
Be sure to create the mirror using the persistent device names of the
component devices. For more information on creating and managing
a mirrored device, see “Creating a Multiple Disk Device” on page 72.
4. Create volume groups and logical volumes on the MD mirror device
Create volume groups and logical volumes on the mirror device and
configure the volume groups with the LVM exclusive activation feature.
This protects against a volume group which is already active on one
node being activated again (accidentally or on purpose) on any other
node in the cluster. For
more information on creating volume groups and configuring
exclusive activation, see “Creating Volume Groups and Configuring
VG Exclusive Activation on the MD Mirror” on page 74.
5. Configure the package control script and the Extended Distance
cluster configuration script
In order to let Serviceguard know of the existence of the mirror
created in the previous step and hence make use of it, it must be
configured as part of a package. This MD device must be specified in
the raid.conf file: copy the raid.conf template into the package
directory and edit it to specify the RAID configuration parameters for
this package. Using the details mentioned in this file, Serviceguard
will start, stop, and monitor this MD mirror for the
package. For details on how to configure the package control script
and raid.conf see “Configuring the Package Control Script and RAID
Configuration File” on page 76.
IMPORTANT
Every time you edit the raid.conf file, you must copy this edited file
to all nodes in the cluster.
6. Start the package
Starting a package configured for Software RAID is the same as
starting any other package in Serviceguard for Linux.
Keep in mind the following guidelines before you enable Software RAID
for a particular package:
•
Ensure that the Quorum Server link is close to the Ethernet links in
your setup, so that if all Ethernet and Fibre Channel links fail, the
nodes can still reach the Quorum Server for arbitration.
•
The Quorum Server is configured in a third location only for
arbitration. In scenarios where the link between two nodes is lost,
each node considers the other node to be dead. As a result, both
nodes will try to access the Quorum Server. The Quorum Server, as
an arbitrator, acknowledges the node that reaches it first and allows
the package to start on that node.
•
You also need to configure Network Time Protocol (NTP) in your
environment. This protocol resolves the time differences that could
occur in a network, for example, when nodes are in different time zones.
Configuring Multiple Paths to Storage
HP requires that you configure multiple paths to the storage device
using the QLogic HBA driver as it has inbuilt multipath capabilities. Use
the install script with the “-f” option to enable multipath failover mode.
For more information on installing the QLogic HBA driver, see the HP
StorageWorks Using the QLogic HBA driver for single-path or multipath
failover mode on Linux systems application notes. This document is
available at the following location:
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00169487/c00169487.pdf
NOTE
You need to register with Hewlett-Packard to access this site.
Setting the Value of the Link Down Timeout Parameter
After you install the QLogic HBA driver, you must set the Link Down
Timeout parameter of the QLogic cards to a duration equal to the cluster
reformation time. When using the default values of the heartbeat
interval and node timeout intervals of Serviceguard for Linux with a
Quorum Server, this parameter must be set to 40 seconds. Setting this
parameter to 40 seconds, which is the recommended value, prevents
further writes to the active half of the mirror disk set when the other
half fails. If this failure were to also bring down the node a few
moments later, the chance of losing these writes is eliminated.
This parameter prevents any data being written to a disk when a failure
occurs. The value of this parameter must be set such that the disks are
inaccessible for a time period which is greater than the cluster
reformation time. This parameter is important in scenarios where an
entire site is in the process of going down. By blocking further writes to
the MD device, the two disks of the MD device remain current and
synchronized. As a result, when the package fails over, it starts with a
disk that has current data. You must set a value for this parameter for
all QLogic cards.
The QLogic cards are configured to hold up any disk access and
essentially hang for a time period which is greater than the cluster
reformation time when access to a disk is lost. This is achieved by
altering the Link Down Timeout value for each port of the card. Setting a
value for the Link Down Timeout parameter for a QLogic card ensures
that the MD device hangs when access to a mirror is lost. For
configurations with multipath, the MD device hangs when one path to a
storage system is lost. However, the MD device resumes activity when
the specified hang period expires. This ensures that no data is lost.
This parameter is required to address a scenario where an entire
datacenter fails but all its components do not fail at the same time but
undergo a rolling failure. In this case, if the access to one disk is lost, the
MD layer hangs and data is no longer written to it. Within the hang
period, the node goes down and a cluster reformation takes place. When
the package fails over to another node, it starts with a disk that has
current data.
The value to be set for the Link Down Timeout parameter depends on the
heartbeat interval and the node timeout values configured for a
particular cluster. Use the SANSurfer CLI tool to set the value for this
parameter. For more information on how to set this parameter, see
http://download.qlogic.com/manual/32338/SN0054614-00B.pdf
Table 3-1 lists the cluster reformation times and the recommended Link
Down Timeout values for various heartbeat intervals.

Table 3-1   Cluster Reformation Time and Timeout Values

Heartbeat Intervals    Cluster Reformation Time    Link Down Timeout Value
1 second               38 seconds                  40 seconds
2 seconds              56 seconds                  58 seconds
5 seconds              140 seconds                 142 seconds
10 seconds             250 seconds                 255 seconds
NOTE
The values in this table are approximate values. The actual time varies
from system to system, depending on the system load.
Using Persistent Device Names
When there is a disk related failure and subsequent reboot, there is a
possibility that the devices are renamed. Linux names disks in the order
they are found. The device that was /dev/sdf may be renamed to
/dev/sde if any “lower” device has failed or been removed. As a result,
you cannot activate the MD device with the original name.
HP requires that the device names be persistent to avoid reorientation
after a failure and reboot. For more information on creating persistent
device names, see the Using udev to Simplify HP Serviceguard for Linux
Configuration white paper that is available at the following location:
http://docs.hp.com
When creating persistent device names, ensure that the same udev rules
file exists in all the nodes. This is necessary for the symlinks to appear
and point to the correct device. Use these persistent device names
wherever there is a need to specify the devices for extended cluster
configuration or during recovery process after a failure. A persistent
device created based on the instructions in the document mentioned
earlier will have a device name that starts with /dev/hpdev/.
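As an illustration only (the rule file name and WWID below are
placeholders, and the exact syntax depends on the udev version and on
the white paper referenced above), a persistent name such as
/dev/hpdev/mylink-sde can be created with a udev rule that matches the
SCSI identifier of the LUN:

# cat /etc/udev/rules.d/65-hpdev.rules
KERNEL=="sd*[!0-9]", PROGRAM="/sbin/scsi_id -g -u -s /block/%k", RESULT=="360060e80xxxxxxxx", SYMLINK+="hpdev/mylink-sde"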
NOTE
The name of the MD device must be unique across all packages in the
cluster. Also, the names of each of their component udev devices must
also be unique across all nodes in the cluster.
Creating a Multiple Disk Device
As mentioned earlier, the first step for enabling Software RAID in your
environment is to create the Multiple Disk (MD) device using two
underlying component disks. This MD device is a virtual device which
ensures that any data written to it is written to both component disks. As
a result, the data is identical on both disks that make up the MD device.
This section describes how to create an MD device. This is the only step
that you must complete before you enable Software RAID for a package.
The other RAID operations are needed only during maintenance or
during recovery process after a failure has occurred.
NOTE
For all the steps in the subsequent sections, all the persistent device
names, and not the actual device names, must be used for the two
component disks of the MD mirror.
To Create and Assemble an MD Device
This example shows how to create the MD device /dev/md0 from a LUN on
storage device 1 (/dev/hpdev/sde1) and another LUN from storage
device 2 (/dev/hpdev/sdf1).
Run the following command to create an MD device:
# mdadm --create --verbose /dev/md0 --level=1 \
--raid-devices=2 /dev/hpdev/sde1 /dev/hpdev/sdf1
This command creates the MD device.
Once the new RAID device, /dev/md0, is created on one of the cluster
nodes, you must assemble it on the nodes where the package must run.
You create an MD device only once and you can manage other functions
using the XDC scripts.
To assemble the MD device, complete the following procedure:
1. Stop the MD device on the node where you created it, by running the
following command:
# mdadm -S /dev/md0
2. Assemble the MD device on the other node by running the following
command:
# mdadm -A -R /dev/md0 /dev/hpdev/sde1 /dev/hpdev/sdf1
3. Stop the MD device on the other node by running the following
command:
# mdadm -S /dev/md0
You must stop the MD device soon after you assemble it on the
second node.
4. If you want to create volume groups, restart the MD device on the
first node by running the following command:
# mdadm -A -R /dev/md0 /dev/hpdev/sde1 /dev/hpdev/sdf1
5. After you have created the volume groups, stop the MD device by
running the following command:
# mdadm -S /dev/md0
IMPORTANT
You would need to repeat this procedure to create all MD devices that are
used in a package.
When data is written to this device, the MD driver writes to both the
underlying disks. In case of read requests, the MD reads from one device
or the other based on its algorithms. After creating this device you treat
it like any other LUN that is going to have shared data in a Serviceguard
environment and then create a logical volume and a file system on it.
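To confirm that the mirror is assembled and that both halves are active,
you can query the MD driver on the node where the device is currently
running, for example (using the device name from the example above):

# cat /proc/mdstat
# mdadm --detail /dev/md0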
Creating Volume Groups and Configuring VG Exclusive Activation on the MD Mirror
Once you create the MD mirror device, you need to create volume groups
and logical volumes on it.
NOTE
XDC A.01.00 does not support configuring multiple raid1 devices as
physical volumes in a single volume group.
For example, if you create a volume group vg01, it can have only one MD
raid1 device, /dev/md0, as its physical volume.
To configure multiple raid1 devices as physical volumes in a single
volume group, you must install the XDC A.01.02 patch. To install this
patch, you must first upgrade to HP Serviceguard A.11.18 and the latest
version of XDC. After upgrading, install the A.01.02 patch that is specific
to the operating system in your environment.
XDC A.01.02 contains the following patches for Red Hat and SuSE Linux
operating systems:
•
SGLX_00133 for Red Hat Enterprise Linux 4
•
SGLX_00134 for Red Hat Enterprise Linux 5
•
SGLX_00135 for SUSE Linux Enterprise Server 10
You can contact the HP support personnel to obtain these patches.
When you create a logical volume on an MD device, the actual physical
devices that form the MD raid1 mirror must be filtered out to avoid
receiving messages from LVM about duplicate PV entries.
For example, let us assume that /dev/sde and /dev/sdf are two
physical disks that form the MD device /dev/md0. The persistent device
names for /dev/sde and /dev/sdf are /dev/hpdev/md0_mirror0 and
/dev/hpdev/md0_mirror1 respectively. When you create a logical
volume, duplicate entries are detected for the two physical disks that
form the mirror device. As a result, the logical volume is not created and
an error message is displayed. Following is a sample of the error message
that is displayed:
Found duplicate PV 9w3TIxKZ6lFRqWUmQm9tlV5nsdUkTi4i: using
/dev/sde not /dev/sdf
With this error, you cannot create a new volume group on /dev/md0. As a
result, you must create a filter for LVM. To create a filter, add the
following line in the /etc/lvm/lvm.conf file:
filter = [ "r|/dev/cdrom|","r|/dev/hpdev/md0_mirror0|",
"r|/dev/hpdev/md0_mirror1|" ]
where /dev/hpdev/md0_mirror0 and /dev/hpdev/md0_mirror1 are the
persistent device names of the devices /dev/sde and /dev/sdf
respectively.
NOTE
When adding the filter, ensure that you use the persistent names of all
the devices used in the mirrors.
This prevents these mirror devices from being scanned or used for logical
volumes. You have to reload LVM with /etc/init.d/lvm force-reload.
Once you add the filter to the /etc/lvm/lvm.conf file, create the
logical volume infrastructure on the MD mirror device as you would on a
single disk. For more information on creating volume groups and logical
volumes, see the latest edition of Managing HP Serviceguard 11.16 for
Linux at http://docs.hp.com/en/ha.html -> Serviceguard for Linux.
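For illustration, assuming the filter above is in place and /dev/md0 is
started on the node, the logical volume infrastructure might be created
as follows (the volume group name, logical volume name, size, and file
system type are examples only):

# pvcreate /dev/md0
# vgcreate vg01 /dev/md0
# lvcreate -L 10G -n lvol1 vg01
# mkfs.ext3 /dev/vg01/lvol1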
Configuring the Package Control Script and RAID Configuration File
This section describes the package control scripts and configuration files
that you need to create and edit to enable Software RAID in your
Serviceguard environment.
Earlier versions of Serviceguard supported MD as a multipathing
software. As a result, the package control script includes certain
configuration parameters that are specific to MD. Do not use these
parameters to configure XDC in your environment. Following are the
parameters in the configuration file that you must not edit:
# MD (RAID) CONFIGURATION FILE
# Specify the configuration file that will be used to define
# the md raid devices for this package.
# NOTE: The multipath mechanisms that are supported for shared storage
# depend on the storage subsystem and the HBA driver in the
# configuration. Follow the documentation for those devices when setting
# up multipath. The MD driver was used with earlier versions of
# Serviceguard and may still be used by some storage system/HBA
# combinations. For that reason there are references to MD in the template
# files, worksheets, and other areas. Only use MD if your storage systems
# specifically calls out its use for multipath.
# If some other multipath mechanism is used (e.g. one built
# into an HBA driver), then references to MD, RAIDTAB, RAIDSTART, etc.
# should be commented out. If the references are in the comments, they
# can be ignored. References to MD devices, such as /dev/md0, should be
# replaced with the appropriate multipath device name.
# For example:
# RAIDTAB="/usr/local/cmcluster/conf/raidtab.sg"
#RAIDTAB=""
# MD (RAID) COMMANDS
# Specify the method of activation and deactivation for md.
# Leave the default (RAIDSTART="raidstart", RAIDSTOP="raidstop") if you want
# md to be started and stopped with default methods.
RAIDSTART="raidstart -c ${RAIDTAB}"
RAIDSTOP="raidstop -c ${RAIDTAB}"
Creating and Editing the Package Control Scripts
After you install the XDC software, you need to create a package control
script and add references to the XDC software to enable Software RAID.
After you create the package control script you need to complete the
following tasks:
•
Edit the value of the DATA_REP variable
•
Edit the value of the XDC_CONFIG_FILE parameter to point to the location
where the raid.conf file is placed
•
Configure the RAID monitoring service
To Create a Package Control Script
The procedure to create a package control script for XDC software is
identical to the procedure that you follow to create other package control
scripts.
To create a package control script, run the following command:
# cmmakepkg -s <package file name>.sh
For example: # cmmakepkg -s oracle_pkg.sh
An empty template file for this package is created. You will need to edit
this package control script, in order to enable Software RAID in your
environment.
To Edit the DATA_REP Variable
The DATA_REP variable defines the nature of data replication that is
used. To enable Software RAID, set the value of this variable to xdcmd.
You must set this value for every package for which you need to enable
Software RAID. When you set this parameter to xdcmd, it enables remote
data replication through Software RAID.
For example: DATA_REP="xdcmd"
To Edit the XDC_CONFIG_FILE Parameter
In addition to modifying the DATA_REP variable, you must also set
XDC_CONFIG_FILE to specify the raid.conf file for this package. This
file resides in the package directory.
For example: XDC_CONFIG_FILE="$SGCONF/oracle_pkg/raid.conf"
To Configure the RAID Monitoring Service
After you have edited the variables in the XDC configuration file
(XDC_CONFIG_FILE), you must set up RAID monitoring as a service
within Serviceguard. Following is an example of how the file content
must look:
SERVICE_NAME[0]="RAID_monitor"
SERVICE_CMD[0]="$SGSBIN/raid_monitor '${XDC_CONFIG_FILE}'"
SERVICE_RESTART[0]=""
Ensure that this service is also configured in the package configuration
file as shown below:
SERVICE_NAME raid_monitor
SERVICE_FAIL_FAST_ENABLED YES
SERVICE_HALT_TIMEOUT 300
After editing the package control script, you must edit the raid.conf
file to enable Software RAID.
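Taken together, the Software RAID related entries in the package control
script for the hypothetical oracle_pkg package used in these examples
would look similar to the following:

DATA_REP="xdcmd"
XDC_CONFIG_FILE="$SGCONF/oracle_pkg/raid.conf"
SERVICE_NAME[0]="RAID_monitor"
SERVICE_CMD[0]="$SGSBIN/raid_monitor '${XDC_CONFIG_FILE}'"
SERVICE_RESTART[0]=""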
Editing the raid.conf File
The raid.conf file describes the Software RAID environment of the
package. You must place a copy of this file in the package directory of
every package for which you have enabled Software RAID. The parameters
in this file are:
•
RPO_TARGET
Given a set of storage that is mirrored remotely, as in Figure 1-4, the
RPO_TARGET (Recovery Point Objective Target) is the maximum time
allowed between the expiration of the Link_Down_Timeout (t1 in
Figure 3-1, after the failure of the data links to the remote storage)
and the package starting up on the remote node (t4 on Figure 3-1). If
more time elapses than what is specified for RPO_TARGET, the
package is prevented from starting on the remote node (assuming
that the node still has access only to its own half of the mirror).
By default, RPO_TARGET is set to 0. Leave it at 0 to ensure the
package does not start on an adoptive node with a mirror half that is
not current. This ensures the highest degree of data currency.
If RPO_TARGET is not set to 0, the value of RAID_MONITOR_INTERVAL
should be less than the value of RPO_TARGET.
(RAID_MONITOR_INTERVAL should also be less than the value of the
Link_Down_Timeout parameter so that disk access failure can be
recognized early enough in certain failure scenarios.)
IMPORTANT
A very low value of RAID_MONITOR_INTERVAL (less than 5 seconds)
has some impact on system performance because of the high
frequency of polling.
You can also set RPO_TARGET to the special value -1 or to any positive
integer. Setting RPO_TARGET to -1 causes the RAID system to ignore
any time-window checks on the disk set. This allows the package to
start with a mirror half that is not current.
Setting RPO_TARGET to any positive integer means that the
package will start with a mirror half that is not current by any
number of seconds less than that value. For example, an RPO_TARGET
of 45 means that the package will start only if the mirror is up to
date, or out of date by less than 45 seconds.
Because some timers are affected by polling, the value of this
parameter can vary by approximately 2 seconds.
This also requires that the minimum value of this parameter is 2
seconds if a small value is necessary. Change the value of
RPO_TARGET, if necessary, after considering the cases discussed
below.
Cases to Consider when Setting RPO_TARGET
RPO_TARGET allows for certain failure conditions when data is not
synchronized between the two sites.
For example, let us assume that the data storage links in Figure 1-4
fail before the heartbeat links fail. In this case, after the time
specified by Link_Down_Timeout has elapsed, a package in
Datacenter1 (DC1) will continue updating the local storage, but not
the mirrored data in datacenter2 (DC2). While the communication
links must be designed to prevent this situation as far as possible,
this scenario could occur and may last for a while before one of the
sites fails.
NOTE
For more information on how access to disks is disabled in certain
failure scenarios, see “Setting the Value of the Link Down Timeout
Parameter” on page 69.
Let us consider a few failure scenarios and the impact of different
RPO_TARGET values. The discussion below is based on the timeline
and events shown in Figure 3-1.
Figure 3-1   RPO Target Definitions
To ensure that no data is lost when a package fails over to DC2 and
starts with only DC2's local storage, t2 must occur between t0 and t1.
Now consider an XDC configuration such as that shown in Figure 1-3
(DWDM links between data centers). If DC1 fails such that links A
and B both fail simultaneously, and DC1's connection to the Quorum
Server fails at the same time, Serviceguard ensures that DC2
survives and the package fails over and runs with DC2 local storage.
But if DC1's links A and B fail, and later DC1's link to the Quorum
Server fails, then both sets of nodes (DC1 and DC2) will try to obtain
the cluster lock from the Quorum Server. If the Quorum server
chooses DC1 (which is about to experience complete site failure),
then the entire cluster will go down.
But if the Quorum Server chooses DC2 instead, then the application
running on DC1 will not be able to write to the remote storage but
will continue to write to its local (DC1) storage until site failure
occurs (at t3). If the network is set up in such a way that the
application cannot communicate with its clients under these
circumstances, the clients will not receive any acknowledgement of
these writes. HP recommends you configure the network such that
when links between the sites fail, the communication links to the
application clients also go down.
If the network is configured to prevent the application from
communicating with its clients under these circumstances, the
clients will not receive any acknowledgement of these writes and
after the failover will re-transmit them, and the writes will be
committed and acknowledged at DC2. This is the desired outcome;
HP recommends you configure the network such that when links
between the sites fail, the communication links to the application
clients are also shut down.
In the case of an XDC configuration such as that shown in Figure
1-4, there is an additional variable in the possible failure scenarios.
Instead of a DWDM link, in this configuration there are two separate
LAN and FC links which can experience failure independent of each
other. If the network links between the sites fail within a very short
period (on the order of 1 second) after t1 (after the storage links had
failed), the XDC software on DC1 will not have time to inform the
XDC on DC2 of the failure. So DC2 assumes that there were no
updates after t1, but there may have been.
When this scenario occurs, disk writes continue on DC1 until t3. In
this case, the effective value of the RPO_TARGET parameter is greater
than the expected value of 0.
Again, if the network is set up in such a way that when the links
between the sites fail, the communication links to the application
clients are also shut down, then the unintended writes are not
acknowledged and have no long term effect.
IMPORTANT
The value you set for RPO_TARGET must be more than the value you
set for the RAID_MONITOR_INTERVAL parameter. By default, the
RAID_MONITOR_INTERVAL parameter is set to 30 seconds.
For example: RPO_TARGET=60 seconds
•
MULTIPLE_DEVICES AND COMPONENT_DEVICES
Parameter RAID_DEVICE[ ] specifies the MD devices that are used
by a package. You must begin with RAID_DEVICE[0], and increment
the list in sequence. Component device parameters DEVICE_0[ ] and
DEVICE_1[ ] specify the component devices for the MD device of the
same index.
For example, if a package uses multiple devices such as
— md0 consisting of devices /dev/hpdev/sde and /dev/hpdev/sdf
and
— md1 consisting of devices /dev/hpdev/sdg1 and
/dev/hpdev/mylink-sdh1
use
# md0
RAID_DEVICE[0]=/dev/md0;
DEVICE_0[0]="/dev/hpdev/sde";
DEVICE_1[0]="/dev/hpdev/sdf"
#md1
RAID_DEVICE[1]=/dev/md1;
DEVICE_0[1]="/dev/hpdev/sdg1";
DEVICE_1[1]="/dev/hpdev/sdh1"
The MD RAID device names and the component device names must
be unique across the packages in the entire cluster.
•
RAID_MONITOR_INTERVAL
This parameter defines the time interval, in seconds, the raid
monitor script waits between each check to verify accessibility of
both component devices of all mirror devices used by this package.
By default, this parameter is set to 30 seconds.
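Putting these parameters together, the raid.conf file for a package that
uses a single mirror might look like the following (the values are
illustrative; RPO_TARGET=0 is the default and the most conservative
setting):

RPO_TARGET=0
RAID_MONITOR_INTERVAL=30
RAID_DEVICE[0]=/dev/md0
DEVICE_0[0]="/dev/hpdev/mylink-sde"
DEVICE_1[0]="/dev/hpdev/mylink-sdf"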
IMPORTANT
After you edit the parameters, ensure that you copy the package control
script and the edited raid.conf file to all nodes in the cluster. All
the nodes in the cluster must have identical copies of these files.
After you have installed the XDC software, and completed the
configuration procedures, your environment is equipped to handle any
failure scenarios. The subsequent chapter discusses how certain disaster
scenarios are handled by the XDC software.
4
Disaster Scenarios and Their Handling
The previous chapters provided information on deploying Software RAID
in your environment. In this chapter, you will find information on how
Software RAID addresses various disaster scenarios. All the disaster
scenarios described in this section are presented using the following
three categories:
•
Disaster Scenario
Describes the type of disaster and provides details regarding the
cause and the sequence of failures leading to the disasters in the case
of multiple failures.
•
What happens when this disaster occurs
Describes how the Extended Distance Cluster software handles this
disaster.
•
Recovery Process
After the disaster strikes and necessary actions are taken by the
software to handle it, you need to ensure that your environment
recovers from the disaster. This section describes all the steps that
an administrator needs to take to repair the failures and restore the
cluster to its original state. Also, all the commands listed in this
column must be executed on a single line.
The following table lists all the disaster scenarios that are handled by
the Extended Distance Cluster software. All the scenarios assume that
the setup is the same as the one described in “Extended Distance
Clusters” on page 18 of this document.
Table 4-1 Disaster Scenarios and Their Handling

Each scenario below is presented under the three categories described
above: Disaster Scenario, What Happens When This Disaster Occurs, and
Recovery Process.
Disaster Scenario:
A package (P1) is running on a node (Node 1). Node 1 experiences a
failure.

What Happens When This Disaster Occurs:
The package (P1) fails over to another node (Node 2). This node
(Node 2) is configured to take over the package when it fails on
Node 1.

Recovery Process:
As the network and both the mirrored disk sets are accessible on
Node 2, and were also accessible when Node 1 failed, you only need to
restore Node 1. Then you must enable the package to run on Node 1
after it is repaired by running the following command:
# cmmodpkg -e P1 -n N1
Disaster Scenario:
A package (P1) is running on a node (Node 1). The package uses a
mirror (md0) that consists of two storage components - S1 (local to
Node 1 - /dev/hpdev/mylink-sde) and S2 (local to Node 2). Access to
S1 is lost from both nodes, either due to power failure to S1 or loss
of FC links to S1.

What Happens When This Disaster Occurs:
The package (P1) continues to run on Node 1 with the mirror that
consists of only S2.

Recovery Process:
Once you restore power to S1, or restore the FC links to S1, the
corresponding mirror half of S1 (/dev/hpdev/mylink-sde) is accessible
from Node 1. To make the restored mirror half part of the MD array,
complete the following procedure:

1. Run the following command to remove the mirrored half from the
array:
# mdadm --remove /dev/md0 /dev/hpdev/mylink-sde

2. Run the following command to add the mirrored half to the array:
# mdadm --add /dev/md0 /dev/hpdev/mylink-sde

The re-mirroring process is initiated. When it is complete, the
extended distance cluster detects the added mirror half and accepts
S1 as part of md0.
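To track the progress of the re-mirroring, you can check the mirror
status on the node that owns the array, as described in “Viewing the
Status of the MD Device” on page 98. For example:

# cat /proc/mdstat

The recovery line in the output shows the percentage completed and an
estimate of the time remaining.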
Disaster Scenario:
A package (P1) is running on a node (Node 1). The package uses a
mirror (md0) that consists of two storage components - S1 (local to
Node 1 - /dev/hpdev/mylink-sde) and S2 (local to Node 2). Data
center 1, which consists of Node 1 and P1, experiences a failure.
NOTE: In this example, failures in a data center are instantaneous;
for example, a power failure.

What Happens When This Disaster Occurs:
The package (P1) fails over to Node 2 and starts running with the
mirror of md0 that consists of only the storage local to Node 2 (S2).

Recovery Process:
Complete the following procedure to initiate a recovery:

1. Restore data center 1, Node 1, and storage S1. Once Node 1 is
restored, it rejoins the cluster. Once S1 is restored, it becomes
accessible from Node 2. When the package failed over and started on
Node 2, S1 was not a part of md0. As a result, you need to add S1
into md0. Run the following command to add S1 to md0:
# mdadm --add /dev/md0 /dev/hpdev/mylink-sde
The re-mirroring process is initiated. When it is complete, the
extended distance cluster detects the added mirror half and accepts
S1 as part of md0.

2. Enable P1 to run on Node 1 by running the following command:
# cmmodpkg -e P1 -n N1
Disaster Scenario:
This is a multiple failure scenario where the failures occur in a
particular sequence in the configuration that corresponds to Figure 2,
where Ethernet and FC links do not go over DWDM. The package (P1) is
running on a node (N1). P1 uses a mirror md0 consisting of S1 (local
to node N1, say /dev/hpdev/mylink-sde) and S2 (local to node N2). The
first failure occurs with all FC links between the two data centers
failing, causing N1 to lose access to S2 and N2 to lose access to S1.
After recovery for the first failure has been initiated, the second
failure occurs when re-mirroring is in progress and N1 goes down.

What Happens When This Disaster Occurs:
The package (P1) continues to run on N1 after the first failure, with
md0 consisting of only S1. After the second failure, the package (P1)
fails over to N2 and starts with S1. Since S2 is also accessible, the
extended distance cluster adds S2 and starts re-mirroring of S2.

Recovery Process:
For the first failure scenario, complete the following procedure to
initiate a recovery:

1. Restore the links in both directions between the data centers. As
a result, S2 (/dev/hpdev/mylink-sdf) is accessible from N1 and S1 is
accessible from N2.

2. Run the following commands to remove and add S2 to md0 on N1:
# mdadm --remove /dev/md0 /dev/hpdev/mylink-sdf
# mdadm --add /dev/md0 /dev/hpdev/mylink-sdf

The re-mirroring process is initiated. The re-mirroring process starts
from the beginning on N2 after the second failure. When it completes,
the extended distance cluster detects S2 and accepts it as part of md0
again.
Disaster Scenario:
This is a multiple failure scenario where the failures occur in a
particular sequence in the configuration that corresponds to Figure 2,
where Ethernet and FC links do not go over DWDM. The RPO_TARGET for
the package P1 is set to IGNORE. The package is running on Node 1. P1
uses a mirror md0 consisting of S1 (local to node N1 -
/dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure
occurs when all FC links between the two data centers fail, causing
Node 1 to lose access to S2 and Node 2 to lose access to S1. After
some time, a second failure occurs: Node 1 fails (because of a power
failure).

What Happens When This Disaster Occurs:
The package (P1) continues to run on Node 1 after the first failure,
with md0 consisting of only S1. After the second failure, the package
P1 fails over to N2 and starts with S2. Data that was written to S1
after the FC link failure is now lost because RPO_TARGET was set to
IGNORE.

Recovery Process:
In this scenario, no attempts are made to repair the first failure
until the second failure occurs. Typically, the second failure occurs
before the first failure is repaired.

1. To recover from the first failure, restore the FC links between
the data centers. As a result, S1 is accessible from N2.

2. Run the following command to add S1 to md0 on N2:
# mdadm --add /dev/md0 /dev/hpdev/mylink-sde
This command initiates the re-mirroring process. When it is complete,
the extended distance cluster detects S1 and accepts it as part of
md0.

For the second failure, restore N1. Once it is restored, it joins the
cluster and can access S1 and S2. Then run the following command to
enable P1 to run on N1:
# cmmodpkg -e P1 -n N1
Disaster Scenario:
This failure is the same as the previous failure, except that the
package (P1) is configured with RPO_TARGET set to 60 seconds. In this
case, initially the package (P1) is running on N1. P1 uses a mirror
md0 consisting of S1 (local to node N1 - /dev/hpdev/mylink-sde) and
S2 (local to node N2). The first failure occurs when all FC links
between the two data centers fail, causing N1 to lose access to S2
and N2 to lose access to S1. After the package resumes activity and
runs for 20 seconds, a second failure occurs causing N1 to fail,
perhaps due to a power failure.

What Happens When This Disaster Occurs:
Package P1 continues to run on N1 after the first failure, with md0
consisting of only S1. After the second failure, package P1 fails
over to N2 and starts with S2. This happens because the disk S2 is
non-current by less than 60 seconds; this time limit is set by the
RPO_TARGET parameter. Disk S2 has data that is older than the other
mirror half, S1. However, all data that was written to S1 after the
FC link failure is lost.

Recovery Process:
In this scenario, no attempts are made to repair the first failure
until the second failure occurs. Typically, the second failure occurs
before the first failure is repaired.

1. To recover from the first failure, restore the FC links between
the data centers. As a result, S1 (/dev/hpdev/mylink-sde) is
accessible from N2.

2. Run the following command to add S1 to md0 on N2:
# mdadm --add /dev/md0 /dev/hpdev/mylink-sde
This command initiates the re-mirroring process. When it is complete,
the extended distance cluster detects S1 and accepts it as part of
md0 again.

For the second failure, restore N1. Once it is restored, it joins the
cluster and can access S1 and S2. Then run the following command to
enable P1 to run on N1:
# cmmodpkg -e P1 -n N1
Disaster Scenario:
In this case, the package (P1) runs with RPO_TARGET set to 60
seconds. Package P1 is running on node N1. P1 uses a mirror md0
consisting of S1 (local to node N1, for example
/dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure
occurs when all FC links between the two data centers fail, causing
N1 to lose access to S2 and N2 to lose access to S1. After the
package resumes activity and runs for 90 seconds, a second failure
occurs causing node N1 to fail.

What Happens When This Disaster Occurs:
The package (P1) continues to run on N1 with md0 consisting of only
S1 after the first failure. After the second failure, the package
does not start up on N2 because when it tries to start with only S2
on N2, it detects that S2 is non-current for a time period that is
greater than the value of RPO_TARGET.

Recovery Process:
In this scenario, no attempts are made to repair the first failure
until the second failure occurs. Complete the following procedure to
initiate a recovery:

1. To recover from the first failure, restore the FC links between
the data centers. As a result, S1 is accessible from N2.

2. After the FC links are restored, and S1 is accessible from N2, run
the following command to restart the package on N2:
# cmrunpkg <package_name>

When the package starts up on N2, it automatically adds S1 back into
the array and starts re-mirroring from S1 to S2. When re-mirroring is
complete, the extended distance cluster detects and accepts S1 as
part of md0 again.

For the second failure, restore N1. Once it is restored, it joins the
cluster and can access S1 and S2. Then run the following command to
enable P1 to run on N1:
# cmmodpkg -e P1 -n N1
Disaster Scenario:
This scenario is an extension of the previous failure scenario. In
the previous scenario, when the package fails over to N2, it does not
start as the value of RPO_TARGET would have been exceeded. To
forcefully start the package P1 on N2 when the FC links are not
restored on N2, check the package log file on N2 and execute the
commands that appear in it.

What Happens When This Disaster Occurs:
If the FC links are not restored on N2, you can only start the
package forcefully. You can forcefully start a package only if it is
determined that the associated data loss is acceptable. After you
execute the force start commands, package P1 starts on N2 and runs
with md0 consisting of only S2 (/dev/hpdev/mylink-sdf).

Recovery Process:
Complete the following procedure to initiate a recovery:

1. Reconnect the FC links between the data centers. As a result, S1
(/dev/hpdev/mylink-sde) becomes accessible from N2.

2. Run the following command to add S1 to md0 on N2:
# mdadm --add /dev/md0 /dev/hpdev/mylink-sde

This command initiates the re-mirroring process from S2 to S1. When
re-mirroring is complete, the extended distance cluster detects S1
and accepts it as part of md0.
Disaster Scenario:
In this case, the package (P1) runs with RPO_TARGET set to 60
seconds. Initially, the package (P1) is running on node N1. P1 uses a
mirror md0 consisting of S1 (local to node N1, for example
/dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure
occurs when all FC links between the two data centers fail, causing
N1 to lose access to S2 and N2 to lose access to S1. Immediately
afterwards, a second failure occurs where node N1 goes down because
of a power failure. After N1 is repaired and brought back into the
cluster, package switching of P1 to N1 is enabled.
IMPORTANT: While it is not a good idea to enable package switching of
P1 to N1, it is described here to show recovery from an operator
error.
The FC links between the data centers are not repaired, and N2
becomes inaccessible because of a power failure.

What Happens When This Disaster Occurs:
When the first failure occurs, the package (P1) continues to run on
N1 with md0 consisting of only S1. When the second failure occurs,
the package fails over to N2 and starts with S2. When N2 fails, the
package does not start on node N1 because a package is allowed to
start only once with a single disk. You must repair this failure, and
both disks must be synchronized and be part of the MD array, before
another failure of the same pattern occurs. In this failure scenario,
only S1 is available to P1 on N1, as the FC links between the data
centers are not repaired. As P1 started once with S2 on N2, it cannot
start on N1 until both disks are available.

Recovery Process:
Complete the following steps to initiate a recovery:

1. Restore the FC links between the data centers. As a result, S2
(/dev/hpdev/mylink-sdf) becomes available to N1 and S1
(/dev/hpdev/mylink-sde) becomes accessible from N2.

2. To start the package P1 on N1, check the package log file in the
package directory and run the commands that appear there to force a
package start.

When the package starts up on N1, it automatically adds S2 back into
the array and the re-mirroring process is started. When re-mirroring
is complete, the extended distance cluster detects the added mirror
half and accepts it as part of md0.
Disaster Scenario:
In this case, initially the package (P1) is running on node N1. P1
uses a mirror md0 consisting of S1 (local to node N1, for example
/dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure
occurs with all Ethernet links between the two data centers failing.

What Happens When This Disaster Occurs:
With this failure, the heartbeat exchange is lost between N1 and N2.
This results in both nodes trying to get to the Quorum Server. If N1
accesses the Quorum Server first, the package continues to run on N1
with S1 and S2 while N2 is rebooted. If N2 accesses the Quorum Server
first, the package fails over to N2 and starts running with both S1
and S2, and N1 is rebooted.
NOTE: If this failure is a precursor to a site failure, and if the
Quorum Server arbitration selects the site that is likely to have a
failure, it is possible that the entire cluster will go down.

Recovery Process:
Complete the following steps to initiate a recovery:

1. You need to only restore the Ethernet links between the data
centers so that N1 and N2 can exchange heartbeats.

2. After restoring the links, you must add the node that was rebooted
back into the cluster. Run the cmrunnode command to add the node to
the cluster.
Disaster Scenario:
In this case, initially the package (P1) is running on node N1. P1
uses a mirror md0 consisting of S1 (local to node N1, say
/dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure
occurs when the Ethernet links from N1 to the Ethernet switch in data
center 1 fail.

What Happens When This Disaster Occurs:
With this failure, the heartbeat exchange between N1 and N2 is lost.
N2 accesses the Quorum Server, as it is the only node which has
access to the Quorum Server. The package fails over to N2 and starts
running with both S1 and S2 while N1 gets rebooted.

Recovery Process:
Complete the following procedure to initiate a recovery:

1. Restore the Ethernet links from N1 to the switch in data center 1.

2. After restoring the links, you must add the node that was rebooted
back into the cluster. Run the cmrunnode command to add the node to
the cluster.
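Both of the recovery procedures above end by returning the rebooted
node to the cluster with cmrunnode. For example, if N1 is the node
that was rebooted (the node name is illustrative), run the following
from a node that is up in the cluster:

# cmrunnode N1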
A Managing an MD Device

This appendix describes how to manage an MD device. For the latest
information on how to manage an MD device, see the following sections:
• “Viewing the Status of the MD Device” on page 98
• “Stopping the MD Device” on page 99
• “Starting the MD Device” on page 100
• “Removing and Adding an MD Mirror Component Disk” on page 101
Viewing the Status of the MD Device
After creating an MD device, you can view its status. By doing so, you
can see whether the device is clean and running, or whether there are
any errors.
To view the status of the MD device, run the following command on any
node:
cat /proc/mdstat
Immediately after the MD devices are created and during some recovery
processes, the devices undergo a re-mirroring process. You can view the
progress of this process in the /proc/mdstat file. Following is the output
you will see:
[root@dlhct1 ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sde[2] sdf[0]
9766784 blocks [2/1] [U_]
[=>...................] recovery = 8.9% (871232/9766784) finish=2.7min
speed=54452K/sec
unused devices: <none>
NOTE
A status report obtained using the cat /proc/mdstat command shows
the MD device name and the actual device names of the two MD
component devices. It does not show the persistent device names.
After you create an MD device, you can view the status of the device,
stop and start the device, and add or remove a mirror component from
the MD device.
Stopping the MD Device
After you create an MD device, it begins to run. You need to stop the
device and add the configuration to the raid.conf file. To stop the MD
device, run the following command:
# mdadm -S <md_device_name>
When you stop this device, all resources that were previously occupied by
this device are released. Also, the entry of this device is removed from
the /proc/mdstat file.
Example A-1 Stopping the MD Device /dev/md0
To stop the MD device /dev/md0, run the following command:
[root@dlhct1 dev]# mdadm -S /dev/md0
Once you stop the device, the entry is removed from the /proc/mdstat
file. Following is an example of what the file contents will look like:
[root@dlhct1 dev]# cat /proc/mdstat
Personalities : [raid1]
unused devices: <none>
NOTE
This command and the other commands described subsequently are
listed as they may be used during cluster development and during some
recovery operations.
Starting the MD Device
After you create an MD device, you need to stop and start the MD
device to ensure that it is active. You do not need to start the MD
device in any other scenario, as this is handled by the XDC software.
To start the MD device, run the following command:
# mdadm -A -R <md_device_name>
<md_mirror_component_persistent_name_0>
<md_mirror_component_persistent_name_1>
Example A-2 Starting the MD Device /dev/md0
To start the MD device /dev/md0, run the following command:
# mdadm -A -R /dev/md0 /dev/hpdev/sde /dev/hpdev/sdf1
Following is an example of what the /proc/mdstat file contents will look
like once the MD is started:
[root@dlhct1 dev]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sde[1] sdf[0]
9766784 blocks [2/2] [UU]
unused devices: <none>
Removing and Adding an MD Mirror Component Disk
There are certain failure scenarios, where you would need to manually
remove the mirror component of an MD device and add it again later. For
example, if links between two data centers fail, you would need to
remove and add the disks that were marked as failed disks.
When a disk within an MD device fails, the /proc/mdstat file of the MD
array displays a message. For example:
[root@dlhct1 dev]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sde[2](F) sdf[0]
9766784 blocks [2/1] [U_]
unused devices: <none>
In the message, the (F) indicates which disk has failed. In this example,
the sde[2] disk has failed.
In such a scenario, you must remove the failed disk from the MD array.
You need to determine the persistent name of the failed disk before you
remove it from the MD array. For this example, run the following
command to determine the persistent name of the disk:
# udevinfo -q symlink -n sdc1
Following is a sample output:
hpdev/mylink-sdc \
disk/by-id/scsi-3600805f3000b9510a6d7f8a6cdb70054-part1 \
disk/by-path/pci-0000:06:01.0-scsi-0:0:1:30-part1
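Alternatively, assuming your persistent device names are created under
/dev/hpdev as in these examples, you can list the persistent links and the
devices they point to directly:

# ls -l /dev/hpdev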
Run the following command to remove a failed component device from
the MD array:
# mdadm --remove <md_device_name>
<md_mirror_component_persistent_name>
In this example:
# mdadm --remove /dev/md0 /dev/hpdev/mylink-sdc1
This command removes the failed mirrored disk from the array.
Example A-3 Removing a failed MD component disk from the /dev/md0 array
To remove a failed MD component disk from /dev/md0, run the following
command:
# mdadm --remove /dev/md0 /dev/hpdev/sde
Following is an example of the status message that is displayed when a
failed component is removed from the MD array:
[root@dlhct1 dev]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdf[0]
9766784 blocks [2/1] [U_]
unused devices: <none>
Adding a Mirror Component Device
As mentioned earlier, in certain failure scenarios, you need to remove a
failed mirror disk component, repair it and then add it back into an MD
array. Run the following command to add a mirror component back into
the MD array:
# mdadm --add <md_device_name>
<md_mirror_component_persistent_name>
Example A-4 Adding a new disk as an MD component to the /dev/md0 array
To add a new disk to the /dev/md0 array, run the following command:
# mdadm --add /dev/md0 /dev/hpdev/sde
Following is an example of the status message displayed in the
/proc/mdstat file once the disk is added:
Personalities : [raid1]
md0 : active raid1 sde[2] sdf[0]
9766784 blocks [2/1] [U_]
[=>...................] recovery = 8.9% (871232/9766784) finish=2.7min
speed=54452K/sec
unused devices: <none>
Index

A

C
cluster
    extended distance, 22
    FibreChannel, 52
    metropolitan, 23
    wide area, 27
cluster maintenance, 49
configuring, 46
consistency of data, 38
continental cluster, 27
currency of data, 38

D
data center, 17
data replication, 38
    FibreChannel, 52
    ideal, 44
    logical, 42
    off-line, 38
    online, 39
    physical, 39
DATA_REP Variable
    edit, 77
disaster recovery
    automating, 48
    services, 48
disaster tolerance
    evaluating need, 14
disaster tolerant
    architecture, 17
    cluster types, 18
    Continentalclusters WAN, 46
    definition, 16
    FibreChannel cluster, 52
    guidelines for architecture, 37
    limitations, 47
    managing, 48
disaster tolerant Ethernet networks, 46

M
MD Device
MD mirror device
    definition, 23
metropolitan cluster rules, 23
MULTIPLE_DEVICES AND COMPONENT_DEVICES, 82

O
operations staff
    staffing and training, 49

P
package control script
    configure, 67
Persistent Device Names
    Using, 71
physical data replication, 39
power sources
    redundant, 44

Q
QLogic cards, 70

R
RAID Monitoring Service
    Configure, 78
raid.conf file
    Edit, 78
redundant power sources, 44
replicating data, 38
    off-line, 38
    online, 39
rolling disaster, 49
rolling disasters, 47
RPO_TARGET, 78

S
Serviceguard, 16
Software RAID
    guidelines, 67
    understanding, 62
synchronous data replication, 39

V
Volume Groups
    Creating, 74

W
WAN configuration, 46
wide area cluster, 27

X
XDC_CONFIG FILE, 78