Part III: Designing and Planning a Shared Root Cluster
General setup and design recommendations for a Shared Root Cluster
Avoiding Single Points of Failure (SPoFs)
Because a Shared Root Cluster is used for mission-critical applications, the recommended design includes high availability measures on several layers:
At the top there are usually more than two cluster nodes to allow application failover and server load balancing, although it is also possible to construct single-node Diskless Shared Root Clusters for various scenarios (e.g. a single-node staging cluster).
As the nodes of a Diskless Shared Root Cluster use local hard disks only for volatile information (e.g. /tmp or swap), the next layer connects the servers with their shared storage system. The disk space is usually provided by a SAN and only in a few cases by an NFS server. Therefore you will need a Fibre Channel network or a dedicated iSCSI Ethernet network between the SAN and the cluster nodes. For the majority of clusters the storage network must be fully redundant. That means that the cluster nodes are usually equipped with dual Fibre Channel HBAs (Host Bus Adapters) and every node is connected to the storage array via multiple paths provided by a redundant switched fabric. If one link goes down, I/O can continue over the remaining path.
The redundancy continues in the storage array. As the shared root partition is a crucial part of the Diskless Shared Root Cluster, its data is usually stored on a fully redundant RAID 1 (mirroring) volume. It is also important that the speed of the disk array is sufficient for the number of cluster nodes and clients to serve. You can influence the performance through the number of disks in the disk group that forms the RAID array. However, the storage controllers and the hard drive technology are limiting factors.
More redundancy is needed for the network that connects the cluster nodes with each other and for the network that clients use to access the cluster resources. The network bandwidth should be as high as possible and you should use different carriers. A minimum of two bonded network interface cards (NICs) is needed for redundant connections. If different kinds of traffic share these interfaces, it is mandatory to use Virtual LANs (VLANs) together with Quality of Service (QoS) to separate the network packets and allow traffic shaping, so that the Inter Cluster Communication (ICC) is delivered with high priority. Ideally you would use separate NICs for ICC, for the public connection of the clients and for your management connections to the cluster nodes.
You also need protection against power failure. Your servers should include redundant power supplies and your racks should be connected to at least two power sources. Ideally one power source is backed by both batteries and diesel generators.
This setup allows you to achieve scalable performance and high availability, and enables you to create disaster-tolerant infrastructures. To do so, put your cluster equipment in different fire compartments and replicate your data to a similar structure in another data center.
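The effect of the layered redundancy described above can be estimated with simple availability arithmetic. The following Python sketch is only an illustration: all availability figures are hypothetical, and it assumes that components fail independently of each other, which real hardware does not fully satisfy.

```python
from math import prod


def parallel(availabilities):
    """Availability of redundant components: the layer fails only if all fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def series(availabilities):
    """Availability of a chain of layers: it works only if every layer works."""
    return prod(availabilities)


# Hypothetical per-component availabilities, chosen for illustration only.
SINGLE_PATH = 0.99
layers = {
    "storage paths (2 HBAs)": parallel([SINGLE_PATH, SINGLE_PATH]),
    "bonded NICs (2 links)": parallel([0.99, 0.99]),
    "power sources (2 feeds)": parallel([0.995, 0.995]),
}
total = series(layers.values())

if __name__ == "__main__":
    for name, value in layers.items():
        print(f"{name}: {value:.6f}")
    print(f"all layers combined: {total:.6f}")
```

The sketch shows why every layer needs its own redundancy: the combined figure is a product, so a single non-redundant layer caps the availability of the whole stack.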
Recommended Hardware
A major advantage of the Diskless Shared Root Cluster is that you may use off-the-shelf GNU/Linux compatible hardware from any vendor you prefer. However, it is recommended to use reliable components that have been proven and tested in enterprise environments.
You may also choose between different architectures, for example Intel Itanium, Intel Xeon or AMD Opteron. The 64-bit AMD Opteron microprocessor is particularly suitable for database applications and for Tru64 UNIX migrations because of its similarity to the former Alpha platform.
The Diskless Shared Root Cluster is highly scalable. Therefore blade technology is a good fit when you plan to pack as much processing power as possible into one rack.
For example, the Hewlett Packard C-class server blades (e.g. BL465c half size blades - up to 128 CPUs/42U rack) are highly recommended and require much less cabling than the rack mount alternatives.
However, the HP DL385 and DL585 are well suited if you need multi-core cluster nodes with lots of RAM for your enterprise applications.
If you plan to set up a versatile platform for Xen virtualisation, have a look at models with CPUs that support Pacifica (AMD-V) or Vanderpool (Intel VT).
As a storage array you may use an Infortrend EonStor or HP MSA 1500 as an entry-level system. If you need more performance and have the budget, have a look at the HP Enterprise Virtual Array series. The controllers of this enterprise product also include advanced features (e.g. volume clones) and offer an easier-to-use web-based interface, in contrast to the telnet-based interfaces of entry-level storage systems.
Failover Domains
In a cluster you have the option to define failover domains. A failover domain is a subset of your cluster nodes that are eligible to run a specific service in case of a system failure. That means you can specify the nodes where your service is allowed to run.
A failover domain may be configured with the following options:
- Unrestricted - The specified nodes are preferred, but the service assigned to this domain may run on any available cluster node.
- Restricted - The service may run only on the specified nodes. If none of these nodes is available, the service cannot be started.
- Unordered - The service starts on any node within the failover domain without any preference or priority order.
- Ordered - Lets you specify a preference order for the nodes in the failover domain; the service runs on the most preferred node that is available.
Failover domains are usually unrestricted and unordered.
Tip: To implement the concept of a preferred member, create an unrestricted failover domain comprised of only one cluster member. By doing this, a service runs on the preferred member; in the event of a failure, the service fails over to any of the other members.
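The interplay of the four options can be illustrated with a small Python model. This is a simplified sketch of the behaviour described above, not the cluster manager's actual placement algorithm; the function name pick_node and the node names are hypothetical.

```python
def pick_node(domain_nodes, online_nodes, restricted=False, ordered=False):
    """Return the node that should run the service, or None if none is allowed.

    domain_nodes: members of the failover domain; for an ordered domain the
                  list is in priority order, most preferred node first.
    online_nodes: nodes that are currently cluster members.
    """
    if ordered:
        # Ordered: walk the domain in priority order, take the first online node.
        candidates = [n for n in domain_nodes if n in online_nodes]
    else:
        # Unordered: any online domain member will do; this model simply
        # takes them in the order in which the cluster reports them.
        candidates = [n for n in online_nodes if n in domain_nodes]
    if candidates:
        return candidates[0]
    if restricted:
        # Restricted: no domain member is available, the service cannot start.
        return None
    # Unrestricted: fall back to any other available cluster node.
    return online_nodes[0] if online_nodes else None


if __name__ == "__main__":
    # Preferred member as in the tip: unrestricted domain with one node.
    print(pick_node(["node1"], ["node2", "node1"]))                   # node1
    print(pick_node(["node1"], ["node2", "node3"]))                   # node2
    print(pick_node(["node1"], ["node2", "node3"], restricted=True))  # None
```

The second call shows the preferred-member behaviour from the tip: once node1 is gone, the unrestricted domain lets the service fail over to another member.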
Power down after failure
Most companies run their critical services on huge cluster grids. In such large scenarios the administrators are usually informed only of critical events so that they are not alarmed falsely. However, the staff may then miss that a cluster node is fenced and rejoins the cluster a few minutes later. In isolated cases this is acceptable, but what if there is a hardware malfunction and the same node gets fenced frequently? Such hardware must be identified, and it may be appropriate to power down the failed cluster node instead of letting it rejoin. The monitoring software or a Grayhead will notice this and report that the node is down. The administrator may then decide to bring the node online again or, if the node is affected regularly, to inspect it further.
Since only servers in good shape form the cluster, the quality of the cluster increases. However, this method should only be used in bigger clusters. In smaller ones the cluster services are usually better protected by bringing all nodes online again as fast as possible.
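The decision whether a node has been fenced too often can be sketched as a sliding-window counter. The Python example below is a minimal illustration; the class name FenceTracker and the thresholds (three fence events within one hour) are assumptions and not part of any cluster software.

```python
from collections import deque


class FenceTracker:
    """Counts fence events per node within a sliding time window."""

    def __init__(self, max_fences=3, window_seconds=3600):
        # Hypothetical thresholds; tune them to the size of your cluster.
        self.max_fences = max_fences
        self.window_seconds = window_seconds
        self._events = {}  # node name -> deque of fence timestamps (seconds)

    def record_fence(self, node, timestamp):
        """Record a fence event.

        Returns True if the node has reached the fence limit within the
        window and should stay powered down for inspection.
        """
        events = self._events.setdefault(node, deque())
        events.append(timestamp)
        # Drop events that are older than the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.max_fences


if __name__ == "__main__":
    tracker = FenceTracker()
    for t in (0, 600, 1200):
        keep_down = tracker.record_fence("node7", t)
        print(f"t={t}s keep node7 powered down: {keep_down}")
```

A node that is fenced only occasionally never reaches the limit, because old events fall out of the window before new ones arrive.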
Heartbeat Network
It is usually wise to use a dedicated heartbeat interface for all inter cluster communication. Most servers have heavy traffic on the public network interface, and therefore cluster communication should be on a separate NIC.
In a Shared Root Cluster the basic network configuration is done within the com_info section of /etc/cluster/cluster.conf. The IP address chosen for the cluster node should be placed in the /etc/hosts file together with the proper node name. From then on the inter cluster communication is done over the specified NIC.
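Whether every node name really maps to an address on the heartbeat network can be checked with a few lines of Python. The sketch below is an illustration only: the node names, the sample hosts content and the 10.0.0.0/24 heartbeat subnet are hypothetical, and the function names are not part of any cluster tool.

```python
import ipaddress


def parse_hosts(text):
    """Map each host name in /etc/hosts-style text to its first address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping.setdefault(name, address)
    return mapping


def nodes_outside_heartbeat_net(hosts_text, node_names, heartbeat_net):
    """Return the nodes that are missing or not on the heartbeat network."""
    network = ipaddress.ip_network(heartbeat_net)
    hosts = parse_hosts(hosts_text)
    problems = []
    for node in node_names:
        address = hosts.get(node)
        if address is None or ipaddress.ip_address(address) not in network:
            problems.append(node)
    return problems


# Hypothetical hosts file content, for illustration only.
SAMPLE_HOSTS = """
127.0.0.1    localhost
10.0.0.1     node1    # heartbeat address
10.0.0.2     node2
192.168.1.3  node3
"""

if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    print(nodes_outside_heartbeat_net(SAMPLE_HOSTS, nodes, "10.0.0.0/24"))
```

On a real node you would read the text from /etc/hosts instead of the sample string and pass the node names used in your cluster configuration.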
You should also consider using bonding interfaces so that the heartbeat network is carried by different network interfaces and separate network switches. In case of a network issue the alternative network path provides seamless failover and continued cluster services.