This page discusses IBM's SSA technology from a high availability perspective.
Please refer to SSA Basics for a more introductory
discussion of SSA. This page is a long way from being a complete description
of the topic (please refer to the various IBM SSA manuals and redbooks
for details).
The page is too short. Why don't you help fix that problem?
This page is part of the Matilda Team's HACMP Resources
Collection. The home page of the collection is located here.
IMPORTANT: read the disclaimer BEFORE
you use any information provided in this collection.
Configuration Rules
When planning an SSA configuration, it is important to make
sure that you follow all of the SSA
configuration rules. Failure to follow one or more of the rules can
result in a configuration which:
- fails completely (i.e. just doesn't work at all)
- fails intermittently
- doesn't provide the level of availability that you are trying to achieve
Be VERY careful!
Example Configurations
A very minimal highly available configuration
Although taken somewhat to extremes, the following configuration represents
a minimal yet highly available configuration: two SSA loops, each containing
a single disk drive, in a single 7133:

This configuration has:
- two SSA loops:
- The first SSA loop is shown in red and contains both
SSA adapters in each node (via the "A" pair of connectors
for the sake of cleanliness) and disk drive modules 1 through
4. Disk drive module 1 is an actual SSA disk drive and
disk drive modules 2 through 4 are blanks.
- The second SSA loop is shown in blue and contains both
SSA adapters in each node (via the "B" pair of connectors)
and disk drive modules 9 through 12. Disk drive module
9 is an actual SSA disk drive and disk drive modules 10
through 12 are blanks.
- all bypass cards configured in forced-inline mode.
- a single shared volume group containing mirrored logical volumes
with each logical partition having one physical partition on
the red SSA loop and the other physical partition on the blue
SSA loop.
- disk drive modules 5 through 8 and 13 through 16 left unused
(i.e. they are available for use by other applications).
A few points to consider:
- This configuration can be easily expanded to two loops with
four drives per loop by replacing the blanks in disk drive slots
2 through 4 and 10 through 12 with actual SSA disk drives.
- the loss of either node won't cause the other node to lose
access to any SSA disk drives as each node has direct access
to one end of each quad containing actual SSA disk drives.
- the loss of any single SSA controller in a node won't prevent
the node from accessing all of the SSA disk drives since the
node's other SSA controller has either direct or indirect (via
the other node's SSA adapters) access to all of the disks.
- since every node has both direct and indirect access to every
quad of disks, the failure or removal of any single disk drive
module or SSA cable won't prevent either node from accessing
all of the remaining disks.
- The decision to use quads 1-4 and 9-12 is somewhat arbitrary.
They were chosen for this example because it is relatively easy
to grow this configuration to two loops of five to eight drives
per loop. Other choices (e.g. using quads 1-4 and 5-8) result
in configurations which are somewhat more difficult to grow to
larger configurations.
With the exception of the 7133 itself (see discussion below on this
topic), this configuration has no single points of failure from the
shared disk hardware perspective.
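The mirroring requirement above (one copy of every logical partition on each loop) is easy to get wrong when disks are added later. The following is a minimal sketch of a check for it, not a real LVM tool: the hdisk names, the `loop_of` table and the `copies` map are all hypothetical and would have to be filled in by hand from your own configuration.

```python
def partitions_not_split_across_loops(copies, loop_of):
    """Return the logical partitions whose copies all sit on one SSA loop.

    copies  -- hypothetical map: logical partition number -> list of disks
               holding a physical partition for it
    loop_of -- hypothetical map: disk name -> loop name ("red" or "blue")
    """
    bad = []
    for lp, disks in sorted(copies.items()):
        loops = {loop_of[d] for d in disks}
        if len(loops) < 2:          # every copy is on the same loop
            bad.append(lp)
    return bad


# Hypothetical layout: hdisk2 is module 1 (red), hdisk3 is module 9 (blue).
loop_of = {"hdisk2": "red", "hdisk3": "blue"}
copies = {
    1: ["hdisk2", "hdisk3"],   # mirrored across both loops: fine
    2: ["hdisk2", "hdisk2"],   # both copies on the red loop: not protected
}
print(partitions_not_split_across_loops(copies, loop_of))   # prints [2]
```

Any logical partition reported by the check would lose all of its copies if its one loop became unavailable.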
The next step upwards
Here's a highly available SSA configuration with a pair of
eight disk SSA loops in a single 7133 shared between two nodes:

This example illustrates the result of increasing the previous
example beyond 4 SSA disk drives per loop. This configuration has:
- two SSA loops:
- The first SSA loop is shown in red and contains both
SSA adapters in each node (via the "A" pair of connectors)
and disk drive modules 1 through 8.
- The second SSA loop is shown in blue and contains both
SSA adapters in each node (via the "B" pair of connectors)
and disk drive modules 9 through 16.
- All disk drive modules in both loops are actual SSA disk
drives.
- bypass cards 4/5 and 12/13 are configured in bypass mode and
bypass cards 8/9 and 1/16 are configured in forced-inline mode.
- a single shared volume group containing mirrored logical volumes
with each logical partition having one physical partition on
the red SSA loop and the other physical partition on the blue
SSA loop.
A few points to consider:
- the loss of either node won't cause the other node to lose
access to any disks as each node has direct access to one end
of each quad of disks.
- the loss of any single SSA controller in a node won't prevent
the node from accessing all of the disks since the node's other
SSA controller has either direct or indirect (via the other node's
SSA adapters) access to all of the quads of disks.
- since every node has both direct and indirect access to every
quad of disks, the failure or removal of any single disk drive
module or SSA cable won't prevent either node from accessing
all of the remaining disks.
- slightly smaller loops are possible within this general framework.
For example, a pair of six-disk loops could be achieved by replacing
disk modules 4, 5, 12 and 13 with blanks.
Care must be taken to ensure that four or more blanks never appear contiguously
in a single loop. For example, one might be tempted to turn this configuration
into a pair of four disk loops with lots of room for expansion by replacing
disk drive modules 3, 4, 5, 6 and 11, 12, 13 and 14 with blanks.
This is a legal configuration if all of the hardware is operating. Unfortunately,
the "no four or more blanks in a row" rule is violated if the left node
fails since disk drive module 4 will become directly connected to disk
drive module 5 resulting in four blanks in a row (3 through 6). Similarly,
if the right node fails then the result is four blanks in a row in 11 through
14. The result of this particular configuration error isn't catastrophic
since four or more blanks in a row are equivalent to a break in the loop
and all disk drives remain accessible.
A better approach would be to use the odd slots for disk drive modules
and the even slots for blanks since such an approach can't result in four
blanks in a row.
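The "four blanks in a row" reasoning can be checked mechanically. The sketch below models the red loop as a ring and is only an illustration: the ring order (left node on the 4/5 connector pair, right node on connectors 1 and 8) is an assumption inferred from the description above, and the function names are made up for this example.

```python
def red_loop(blank_slots):
    """Red loop as a ring: left node, slots 4..1, right node, slots 8..5.

    The cabling order is an assumption inferred from the text above.
    """
    def slot(n):
        return "blank" if n in blank_slots else "disk"
    return (["left"] + [slot(n) for n in (4, 3, 2, 1)] +
            ["right"] + [slot(n) for n in (8, 7, 6, 5)])


def ring_after_failure(ring, bypassed_nodes):
    """Drop failed nodes whose connector pair has switched to bypass state."""
    return [item for item in ring if item not in bypassed_nodes]


def longest_blank_run(ring):
    """Longest run of consecutive blanks, treating the list as circular."""
    n = len(ring)
    if n == 0:
        return 0
    if all(item == "blank" for item in ring):
        return n
    # Start counting just after some non-blank entry so runs can wrap.
    start = next(i for i, item in enumerate(ring) if item != "blank")
    best = run = 0
    for k in range(1, n + 1):
        if ring[(start + k) % n] == "blank":
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


tempting = red_loop({3, 4, 5, 6})        # blanks in slots 3 through 6
alternating = red_loop({2, 4, 6, 8})     # disks in odd slots only

print(longest_blank_run(tempting))                                   # 2
print(longest_blank_run(ring_after_failure(tempting, {"left"})))     # 4
print(longest_blank_run(ring_after_failure(alternating, {"left"})))  # 1
```

With everything running, the tempting layout has at most two blanks in a row, but once the left node is bypassed the run grows to four. The alternating layout never exceeds one.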
What about the solitary 7133 in the above examples?
One problem with both of the previous configurations is that
the 7133 itself is a single point of failure. While it is true that
an entire 7133 unit rarely fails, it is possible (burst water pipes
come to mind). If the availability requirements of the application
are sufficiently strict to make such a failure unacceptable then dual
7133s will have to be configured. In the above configurations, this
is as easy as splitting the two loops across two 7133s. For example,
the following configuration is logically equivalent to the last example
while eliminating the solitary 7133 single point of failure:

What about bigger clusters?
Clusters with more than two nodes pose new challenges.
For example, here's a configuration with four nodes sharing access
to a sixteen disk drive module SSA loop in a single 7133:

The correct use of the 7133 bypass mode feature becomes very important
in a cluster of this size. Each pair of SSA connectors must be configured
in bypass mode. This ensures that all of the disk drive modules in the
loop are visible to running nodes if two or more of the nodes are
down. For example, if beta and gamma are down, the 8/9 and 1/16 SSA
connector pairs will switch bypass state which allows alpha to continue
to access disk drive modules 9 through 16 and delta to continue to
access disk drive modules 1 through 8. On the other hand, if the
SSA connector pairs are configured in forced inline mode then disk
drive modules 1 through 8 become invisible to delta and disk drive
modules 9 through 16 become invisible to alpha if beta and gamma
go down.
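The effect of bypass mode versus forced inline mode can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not a model of real SSA behaviour: the placement of the nodes on the loop (gamma on 1/16, alpha on 4/5, beta on 8/9, delta on 12/13) is an assumption chosen to agree with the example above, and `visible_disks` is a name invented here.

```python
# Assumed ring order: gamma (1/16), disks 1-4, alpha (4/5), disks 5-8,
# beta (8/9), disks 9-12, delta (12/13), disks 13-16, back to gamma.
RING = ["gamma", 1, 2, 3, 4,
        "alpha", 5, 6, 7, 8,
        "beta", 9, 10, 11, 12,
        "delta", 13, 14, 15, 16]


def visible_disks(ring, node, down, bypass_nodes):
    """Disk modules reachable from `node` while the nodes in `down` are off.

    bypass_nodes -- nodes whose connector pair is configured in bypass mode;
                    a powered-off node on such a pair is simply skipped.
    A powered-off node on a forced inline pair breaks the loop at that point.
    """
    n = len(ring)
    start = ring.index(node)
    seen = set()
    for step in (1, -1):                 # walk the loop in both directions
        i = (start + step) % n
        while i != start:
            item = ring[i]
            if isinstance(item, int):
                seen.add(item)
            elif item in down and item not in bypass_nodes:
                break                    # loop is open here
            i = (i + step) % n
    return seen


down = {"beta", "gamma"}
all_bypass = {"alpha", "beta", "gamma", "delta"}

print(sorted(visible_disks(RING, "alpha", down, all_bypass)))  # 1 through 16
print(sorted(visible_disks(RING, "alpha", down, set())))       # 1 through 8
print(sorted(visible_disks(RING, "delta", down, set())))       # 9 through 16
```

With bypass mode configured everywhere, alpha still reaches all sixteen modules. With forced inline mode, alpha is cut back to modules 1 through 8 and delta to modules 9 through 16, matching the description above.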
In fact, bypass mode is important for an even more fundamental reason
- when powering up the cluster from a completely powered down state,
bypass mode ensures that the cluster nodes that come up first see
all of the 7133 disk drives during device configuration. For example,
if bypass mode is not configured on any of the 7133 SSA connector
pairs and alpha performs device configuration during bootup before
beta and gamma are powered on then alpha won't configure disk drive
modules 9 through 16. This will result in missing pdisk and missing
hdisk errors when alpha tries to online the shared volume groups
that use disk drive modules 9 through 16.
The bottom line is very simple - SSA loops involving more than three
nodes always require the correct configuration of bypass mode.
One final point is probably worth making. Note that each pair of
7133 SSA connectors is connected to exactly one node's SSA adapters.
This is essential to the proper functioning of bypass mode since
the pair of 7133 SSA connectors only switches to bypass state if both SSA
connectors detect a loss of power. For example, if 7133 SSA connector
4 is connected to alpha and 7133 SSA connector 5 is connected to
beta then the 4/5 pair of SSA connectors won't switch to bypass state
until both alpha and beta have been lost. Hence the loss of gamma
and either but not both of alpha and beta would cause some disks
to become lost to delta.
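The "both connectors must lose power" rule can be written down as a one-line predicate. The sketch below is only an illustration of the rule as described above; the function name and the way a connector pair is represented (a tuple of the node names cabled to its two connectors) are assumptions made for this example.

```python
def pair_switches_to_bypass(attached, down, mode="bypass"):
    """True if a connector pair ends up in bypass state.

    attached -- names of the nodes cabled to the pair's two connectors
    down     -- set of nodes that are currently powered off
    mode     -- "bypass" or "forced-inline" (how the pair is configured)
    """
    if mode != "bypass":
        return False                      # forced inline never bypasses
    return all(node in down for node in attached)


# Correct cabling: both connectors of the 4/5 pair go to alpha.
print(pair_switches_to_bypass(("alpha", "alpha"), {"alpha"}))           # True

# Split cabling: connector 4 to alpha, connector 5 to beta.
print(pair_switches_to_bypass(("alpha", "beta"), {"alpha", "gamma"}))   # False
print(pair_switches_to_bypass(("alpha", "beta"), {"alpha", "beta"}))    # True
```

In the split cabling case the pair stays inline until both alpha and beta are lost, which is why each pair should be cabled to exactly one node.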
The configuration as shown above isn't complete. A second 7133 with
equivalent loops but using the B connectors on each adapter card
would be required to eliminate the first 7133 as a single point of
failure.
IMPORTANT: If you lack the appropriate skills, experience and/or
competency, are unwilling to take responsibility for your actions,
or if you don't like these disclaimers then
don't use this information.