Part X: Troubleshooting Guide

Troubleshooting guide for emergency situations. Don't panic!

TROUBLESHOOTING GUIDE

System Recovery Process

Mark Hlawatschek

<hlawatschek (at) atix.de>

Reiner Rottmann

<rottmann (at) atix.de>

2007-04-18
┌─────────────────────────────────────────────────────────────────────────────┐
│Revision History                                                             │
├────────────────────────────────────────┬─────────────────────────────┬──────┤
│Revision 0.9.2                          │2007-06-21                   │RR    │
├────────────────────────────────────────┴─────────────────────────────┴──────┤
│revised document                                                             │
├────────────────────────────────────────┬─────────────────────────────┬──────┤
│Revision 0.9.1                          │2007-05-28                   │RR    │
├────────────────────────────────────────┴─────────────────────────────┴──────┤
│document rewritten                                                           │
├────────────────────────────────────────┬─────────────────────────────┬──────┤
│Revision 0.0.2                          │2006-04-19                   │MH    │
├────────────────────────────────────────┴─────────────────────────────┴──────┤
│revised document                                                             │
├────────────────────────────────────────┬─────────────────────────────┬──────┤
│Revision 0.0.1                          │2006-04-18                   │MH    │
├────────────────────────────────────────┴─────────────────────────────┴──────┤
│first draft version                                                          │
└─────────────────────────────────────────────────────────────────────────────┘

Abstract

The system recovery process guide contains the following information:

  • Task list for manual system recovery
  • Definition of all needed debugging information for problem management

Scope

Dealing with cluster problems is a demanding task even for an experienced administrator. When a tough problem arises, it is difficult to find the right course of action.

For such cases we distilled our years of experience in troubleshooting shared-root cluster infrastructures into one document that assists in each phase of the recovery process.

This guide describes a simple method for incident classification and provides in-depth incident descriptions with accurate recovery procedures, data assessments and preventive measures.

However, the troubleshooting principles and techniques presented here are specific to ATIX com.oonics shared-root clusters.

Process Overview

The system recovery process can be split up into several sub processes.

Each cluster handling process begins with the creation of a cluster problem report. A troubleshooting ticket is opened and cluster health checks are performed to identify the actual cause of the incident. A checklist ensures that all tests are executed in the right order and only if they make sense. See the section called “Incident Classification” for more information.

With each check passed, the number of possible incidents decreases. In the end, the results of these checks point to the most probable predefined failure situation. Then the appropriate recovery process can be selected and executed. If necessary, the whole process can be repeated. The ticket is updated at each step of the cluster handling process. At the end, the whole incident is reported. See the section called “Reporting Processes” for a detailed description of reporting.

The whole process is designed with two major goals in mind: expandability and automation. New check items can be introduced on demand at any time, and in the future the whole process can be implemented in a computer program.

Incident Classification

This section describes the incident classification process used to determine the exact nature and actual cause of your problem.

The starting point for your analysis is the section called “Incident Classification Checklist”.

This checklist includes easy tests to eliminate possible failure situations. After all checks have been done, the results will point to the single predefined failure situation with the highest probability. A reference to the incidents section in this document will then guide you to the particular recovery procedure.

Incident Classification Checklist

  • Check the power state of each cluster node. See the section called “Check Node Power State”. If nodes are not up, it is most likely that they got fenced and remained powered down. See the section called “Node got fenced”.
  • Check the public network interfaces of each cluster node. See the section called “Check Public Network Interfaces”. If this check fails, follow the instructions for a network error. See the section called “Network Interface Card Error”.
  • Check whether the SSH port is open on each cluster node. See the section called “Check SSH Port”. If the SSH port is not reachable, a kernel panic is very likely. See the section called “Kernel Panic”.
  • Check SSH login with a local user account. See the section called “Check SSH Login with Local User”. If an SSH login fails, there is usually an operating system error involved. See the section called “SSH Error”.
  • Check SSH login with an LDAP user account. See the section called “Check SSH Login with LDAP User”. If an LDAP user cannot log in, there is usually an LDAP error. See the section called “LDAP Error”.
  • The following check should only be performed if the SSH login failed:

      • Check console login with a local user account. See the section called “Check Console Login with Local User”.

      • Further checks should only be performed if the console login failed:

          • Check the fence acknowledgement server. See the section called “Check Fence Acknowledgement Server”. In the unlikely event that there is no fence acknowledgement server, the cluster is frozen because of an as yet unknown cause. See the section called “Unknown Cluster Freeze”.
          • Check whether manual fencing occurred. See the section called “Check Fence Manual”. If manual fencing occurred, it has to be recovered by hand. See the section called “Fence Manual Occured”.
          • Check whether fencing errors occurred. See the section called “Check for Fencing Errors”. If fencing errors are found, the cluster is usually frozen. See the section called “Fencing does not resolve”.
          • Check whether storage errors occurred. See the section called “Check for Storage Errors”. If storage errors are found, they have to be treated with great care. See the section called “Storage Error”.

The following checks are performed to check the overall cluster status:

  • Check private NICs and bonding configuration. See the section called “Check Private Network Interfaces And Bonding”. If this check fails, follow the instructions for a network error. See the section called “Network Interface Card Error”.
  • Check whether clustat can be executed. See the section called “Check HA Manager Status”. If clustat does not return, the HA-Manager hangs. See the section called “HA-Manager Hangs”.
  • Check whether the cluster daemons are running. See the section called “Check Cluster Daemons”. If the HA-Manager is not running, it has to be restarted. See the section called “HA-Manager Stopped”.
  • Check whether the clustered services are running. See the section called “Check Clustered Services”. If clustered services are down, they have to be restarted. See the section called “Clustered Services Stopped”.

A sample checklist could be as follows:

Table 1. Incident Classification Checklist

┌─────────────────────────────────────────────────┬───────┬───────┬───┬───────┐
│                    Checkitem                    │node-1 │node-2 │...│node-n │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.1 Check node power state                       │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.2 Check public NICs                            │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.4 Check SSH port                               │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.5 Check SSH login with local user              │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.6 Check SSH login with LDAP user               │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│   5.7 Check console login with local user       │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│      5.8 Check fence acknowledgement server     │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│      5.9 Check if fence manual occurred         │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│      5.11 Check if fencing errors occurred      │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│      5.10 Check if storage errors occurred      │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.3 Check private NICs and Bonding               │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.14 Check HA-Manager status                     │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.15 Check cluster daemons                       │       │       │   │       │
├─────────────────────────────────────────────────┼───────┼───────┼───┼───────┤
│5.16 Check clustered services                    │       │       │   │       │
└─────────────────────────────────────────────────┴───────┴───────┴───┴───────┘
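
The first five checklist items can be read as a first-failure-wins decision. The following is a minimal sketch under that assumption, not a replacement for the flowchart: the function name classify_node and its argument order are hypothetical, and each argument is a recorded check result, 1 (passed) or 0 (failed).

```shell
# classify_node POWER PUBNIC SSHPORT SSHLOCAL SSHLDAP
# Hypothetical helper: maps the results of the first five checklist
# items (1 = passed, 0 = failed) to the incident named in the checklist.
# The first failed check decides the result.
classify_node() {
    local power=$1 pubnic=$2 sshport=$3 sshlocal=$4 sshldap=$5

    if [ "$power" != 1 ]; then
        echo "Node got fenced"
    elif [ "$pubnic" != 1 ]; then
        echo "Network Interface Card Error"
    elif [ "$sshport" != 1 ]; then
        echo "Kernel Panic"
    elif [ "$sshlocal" != 1 ]; then
        echo "SSH Error"
    elif [ "$sshldap" != 1 ]; then
        echo "LDAP Error"
    else
        echo "no incident found by the basic checks"
    fi
}
```

For example, classify_node 1 1 0 0 0 prints "Kernel Panic", because the SSH port check is the first one that failed. The nested console and fencing checks are deliberately left out of this sketch.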

Incident Classification Flowchart

After the checklist has been completed, you can use the incident classification flowchart to determine the exact cause for the issue.

Figure 1. Incident Classification Flowchart

If you look at the flowchart, you will see that all the basic checks are arranged as a stairway from top left to bottom right.

If a check is successful, you are guided to the next column. If a check fails, you are asked for the results of the checks that would otherwise be skipped.

After sufficient checks have been evaluated, you arrive at a predefined failure situation.

This failure situation is covered in the section called “List of incidents”. There you will find precise guidelines on how to recover from the incident.

If the incident is unknown, more checks need to be defined that further characterise the issue.

The complete incident classification process is illustrated in Figure 2.

Figure 2. Overview of the incident classification process

List of incidents

The incidents differ in their impact on the cluster. While some may only affect single nodes, others may affect a minority or majority of the cluster or, in the worst case, the whole cluster. The system recovery process must accommodate this, and the recovery procedure may differ slightly for each case.

Hardware Issues

SAN has no power

Created: 07/05/15 Last Review: 07/05/15 Revision: 1.0

Synopsis

The SAN components have no power. You have to perform a power check.

Description

Data centers are equipped with redundant electrical power supplies, battery backup and generator backup, so it is very unlikely that the whole data center loses power. In the few cases where single devices fail because of a loss of power, you have to check the devices within your responsibility. Since the SAN is one of the most important parts of your cluster, it is very well protected against power failures. Usually you are able to fix any issues while the system is up and running stably on power from the backup systems.

Impact

DISASTROUS: If the SAN has no power, the cluster nodes cannot access the shared root device. The kernel will panic and the whole cluster collapses instantly. The nodes are not able to reboot from the SAN and the cluster will stay down.

Data Pre-Recovery Assessment

Not covered in this guide. See SAN documentation.

Correction

Not covered in this guide. See SAN documentation.

Data Post-Recovery Assessment

Not covered in this guide. See SAN documentation.

Prevention

Not covered in this guide. See SAN documentation.

Reporting

Not covered in this guide. See SAN documentation.

Applies to

Cluster infrastructure

Keywords

SAN, no power, power failure

Node has no power

Created: 07/05/15 Last Review: 07/05/15 Revision: 1.0

Synopsis

A node has no power. You have to perform a power check.

Description

Servers are equipped with redundant electrical power supplies, battery backup and generator backup, so it is very unlikely that a power failure occurs. In the few cases where a single server fails because of a loss of power, you have to check the devices within your responsibility.

Impact

MARGINAL-CRITICAL: If a single cluster node has no power, the cluster can usually compensate. If more nodes fail, the impact can be critical.

Data Pre-Recovery Assessment

Not covered in this guide. See hardware documentation.

Correction

Not covered in this guide. See hardware documentation.

Data Post-Recovery Assessment

Not covered in this guide. See hardware documentation.

Prevention

Not covered in this guide. See hardware documentation.

Reporting

Not covered in this guide. See hardware documentation.

Applies to

Cluster infrastructure

Keywords

Server, powered off

Network Interface Card Error

Created: 07/05/25 Last Review: 07/05/25 Revision: 1.0

Synopsis

A network interface card or bonding interface cannot be reached and needs to be repaired.

Description

A network interface card or bonding interface cannot be reached and needs to be repaired.

Impact

MARGINAL-CRITICAL: If bonding is used, a single NIC can fail without interrupting the connection; the impact is then marginal. If the public IP is not reachable, the impact depends on whether the clustered services are load-balanced. In such environments the public IP address is constantly monitored. If a failure is detected, the node is removed from load balancing and the cluster should be able to compensate. However, if the applications are not load-balanced or the error affects the cluster intercommunication network, the failure could be critical.

Data Pre-Recovery Assessment

Not covered in this document. Ask hardware supplier.

Please note that MAC addresses need to be changed if network interface cards are replaced. This can be done by editing the MAC settings in /etc/cluster/cluster.conf, increasing the version number and updating the initial ramdisk with com-ec /etc/comoonics/enterprisecopy/updateinitrd.xml.
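
Before editing the file it can help to know which version number to write. The sketch below is an illustration only: it assumes the version is stored in the config_version attribute of the <cluster> element, as is usual for Red Hat Cluster Suite configuration files, and the function name next_config_version is hypothetical. It only prints the next number and does not modify the file.

```shell
# next_config_version FILE
# Hypothetical helper: prints the current config_version + 1.
# Assumes an element of the form <cluster config_version="N" ...>.
next_config_version() {
    local conf=$1 current

    # Extract the number from the first <cluster ... config_version="N"> tag.
    current=$(sed -n 's/.*<cluster[^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$conf" | head -n 1)

    # Fail if no version attribute was found.
    [ -n "$current" ] || return 1

    echo $((current + 1))
}
```

A call such as next_config_version /etc/cluster/cluster.conf then shows the value to enter when editing the file by hand.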

Correction

Not covered in this document. Ask hardware supplier.

Data Post-Recovery Assessment

Not covered in this document. Ask hardware supplier.

Prevention

To prevent this issue the infrastructure should be redundant.

Reporting

Not covered in this document. Ask hardware supplier.

Applies to

Network infrastructure

Keywords

NIC, network, interface, ICMP error, ping error, bonding error

SSH Error

Created: 07/05/25 Last Review: 07/05/25 Revision: 1.0

Synopsis

An SSH error causes SSH logins to fail or hang.

Description

Most commonly, a user login fails because of an incorrect combination of username and password. Usually messages indicate what is going wrong with the login. But there are also a few errors that cause the SSH login to fail. A cluster node may fail after the comoonics boot process because of an external error. It is also possible that the cluster node pauses the boot process, for example if a cluster administrator needs to acknowledge the state of an inquorate cluster. Moreover, the SSH service may simply be misconfigured, or there may be other operating system errors.

Impact

MINOR-DISASTROUS: Depending on the actual error. Failed or, even worse, hanging SSH logins are usually bad signs for the cluster. Usually the inaccessible nodes have to be restarted.

Data Pre-Recovery Assessment

To document the issue, gather data according to the section called “Pre-Recovery Data Assessment 002 (emergency)”.

Correction

To correct this issue, the affected cluster node should be rebooted. If the whole cluster is affected, the cluster has to be shut down completely. This incident needs to be reported to the support personnel of the operating system to determine the exact cause of this error.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

To prevent this issue update all packages to the latest stable releases if possible. Use quality assurance processes to verify the stability of the software.

Reporting

See the section called “Reporting Process 001”

Applies to

Operating System

Keywords

Operating system error, misconfiguration, SSH login fails

Kernel Panic

Created: 07/05/25 Last Review: 07/05/25 Revision: 1.0

Synopsis

A critical error caused the kernel to panic.

Description

A kernel panic message is displayed by an operating system upon detecting an internal system error from which it cannot recover. A panic usually occurs as a result of a hardware failure or a bug in the operating system. Debug information is usually dumped to disk and should be analyzed. Although the provided information is not always useful, sometimes valuable information can be derived that should be used for troubleshooting by system developers or tech support personnel. Depending on the number of nodes affected, the reboot of single nodes may suffice to restore the cluster.

Impact

DISASTROUS: Single cluster nodes should be fenced automatically. If a kernel panic occurs, it usually affects all nodes. Then the cluster has to be shut down completely and started again to recover.

Data Pre-Recovery Assessment

To document the issue, gather data according to the section called “Pre-Recovery Data Assessment 002 (emergency)”.

Correction

To correct this issue, the affected cluster node should be rebooted. If the whole cluster is affected, the cluster has to be shut down completely.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

To prevent this issue update all packages to the latest stable releases if possible. Use quality assurance processes to verify the stability of the software.

Reporting

See the section called “Reporting Process 001”

Applies to

Operating System

Keywords

Kernel panic, kernel dump, critical bug, cluster freeze, fencing, SSH port not reachable

LDAP Error

Created: 07/05/27 Last Review: 07/05/27 Revision: 1.0

Synopsis

An LDAP error causes logins to fail.

Description

If a login with local users is possible and only LDAP users are not authorized to log in, there is either a misconfiguration in the LDAP system or the LDAP server is not reachable.

Impact

MINOR-CRITICAL: Depending on the applications that are operated on the cluster, the impact varies. If only LDAP users are locked out, the cluster is still running. The problem can be fixed without rebooting the node.

Data Pre-Recovery Assessment

To document the issue, gather data according to the section called “Pre-Recovery Data Assessment 001”.

Correction

To correct this issue, the LDAP configuration needs to be repaired.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

To prevent this issue, double check the LDAP configuration and the availability of the LDAP servers.

Reporting

See the section called “Reporting Process 001”

Applies to

Operating System

Keywords

Operating system error, LDAP misconfiguration, SSH login fails

Node got fenced

Created: 07/05/25 Last Review: 07/05/25 Revision: 1.0

Synopsis

A cluster node was fenced and powered down. The node should be powered up and rejoin the cluster.

Description

After successful fencing, the cluster node remains powered off so that the failed server can be identified quickly. Once the incident is correctly diagnosed and reported, the node should boot up and rejoin the cluster. There are various reasons for a node to be fenced out of the cluster. The main reasons are system failures or hardware faults.

Impact

VARIABLE: If a single cluster node was fenced, the cluster can usually compensate. If more nodes are fenced, the impact can be critical.

Data Pre-Recovery Assessment

No data assessment needed.

Correction

To correct this issue, the node must be powered up again. It should rejoin the cluster. Check for any errors during the boot process.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

With more cluster nodes joined together, the impact decreases. The actual cause for the fencing should be eliminated.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster infrastructure

Keywords

Node, server, fenced, fencing

Fencing does not resolve

Created: 07/05/15 Last Review: 07/05/15 Revision: 1.0

Synopsis

The fencing agent command (e.g. fence_ilo) is not returning. The fencing process cannot be completed. The cluster will be in a stalled state.

Description

If a node in the cluster fails, the cluster's internal recovery process removes the failed node from the cluster. This procedure is called "fencing". To do this, the fencing daemon (fenced) executes a fencing agent. If that fails, the fencing process is not resolved and the cluster will be in a stalled state, i.e. all file systems mounted on the failed node will be frozen. The cluster is not accessible during this time.

Impact

DISASTROUS: If fencing cannot resolve, the cluster freezes. The nodes need to be restarted.

Data Pre-Recovery Assessment

The incident should be documented with the section called “Pre-Recovery Data Assessment 002 (emergency)”.

Correction

Note

This error should be reported and escalated to ensure proper recovery!

If that does not help, you have to restart the affected cluster. You should identify why the fencing failed. To do so, collect all log files and report them.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

To prevent these issues you should monitor the functionality of the fencing devices.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster infrastructure

Keywords

fencing, cluster freeze

Unknown Cluster Freeze

Created: 07/05/27 Last Review: 07/05/27 Revision: 1.0

Synopsis

An error caused the cluster to freeze up.

Description

This is a rather uncommon error constellation. The services that form the cluster infrastructure are affected by an unknown bug.

Impact

DISASTROUS: The whole cluster is frozen. Recovery is not possible without a complete shutdown of the system and re-forming the cluster.

Data Pre-Recovery Assessment

The incident should be documented with the section called “Pre-Recovery Data Assessment 002 (emergency)”.

Correction

To correct this issue, the affected cluster nodes should be fenced and rebooted. If the whole cluster is affected, the cluster has to be shut down completely.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

This is a very rare and unlikely error. No prevention method is known.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster Infrastructure

Keywords

Unknown cluster freeze, rare critical condition

Storage Error

Created: 07/05/27 Last Review: 07/05/27 Revision: 1.0

Synopsis

A storage error caused the cluster to freeze up.

Description

This is a rare error. The storage system is not accessible and the cluster freezes. The only recovery option is a complete restart of the cluster, but that should only be done after analysing the storage system error. Therefore the issue should be escalated to ATIX.

Impact

DISASTROUS: If the SAN is not accessible or there are critical errors, the cluster nodes cannot access the shared root partition. The kernel may panic and the whole cluster collapses instantly. The nodes are not able to reboot from the SAN and the cluster will stay down.

Data Pre-Recovery Assessment

The incident should be documented with the section called “Pre-Recovery Data Assessment 002 (emergency)”. Depending on the storage error, this may be impossible.

Correction

Caution

This cluster state is very critical. It is advised to escalate this issue immediately!

This step has to be escalated to ATIX and the storage administrator! To correct this issue, the cluster should be shut down completely. After the storage system and the data path to the storage device have been repaired, the cluster can be rebuilt.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

This is a very rare and unlikely error. No prevention method is known.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster Infrastructure

Keywords

Cluster freeze, storage system error, rare critical condition

Fence Manual Occured

Created: 07/05/27 Last Review: 07/05/27 Revision: 1.0

Synopsis

An error caused the cluster to fence a node manually.

Description

Usually there are multiple fencing devices configured. As a last resort, if all the other fencing methods fail, the manual fencing process is started, i.e. somebody has to switch off the failed node and tell the running fencing agent that this has been done. While the fencing process is failing, or while the manual fencing process has been started but not completed, the cluster is in a stalled state.

Impact

CRITICAL: If a manual fencing occurs, the cluster freezes. Usually the cluster is recoverable after a manual fencing procedure.

Data Pre-Recovery Assessment

The incident should be documented with the section called “Pre-Recovery Data Assessment 002 (emergency)”.

Correction

The node that should have been fenced has to be shut down. Only after this has been verified may the administrator log into the fence acknowledgement server of the cluster node that initiated the fencing process and acknowledge the manual fencing:

  1. Shut down the manually fenced node.

  2. Log in to the fence acknowledgement server.

    bash$ telnet node01 12242
    
  3. Execute ackmanual to acknowledge manual fencing.

Example:

telnet node01 12242
Trying 192.123.123.123...
Connected to node01 (192.123.123.123).
Escape character is '^]'.
Username: root
Password: password
(Cmd) ackmanual
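
The rule that acknowledgement is only allowed once the affected nodes are verified as powered off can be made explicit with a small guard. This is a sketch under assumptions: the function name may_ack_manual_fence is hypothetical, and its arguments are power states recorded with the "1" (on) and "0" (off) convention of the section called “Check Node Power State”.

```shell
# may_ack_manual_fence STATE...
# Hypothetical guard: succeeds only if at least one power state was
# given and every given state is "0" (node verified as powered off).
may_ack_manual_fence() {
    local state

    # Without any verified state, acknowledgement is refused.
    [ "$#" -gt 0 ] || return 1

    for state in "$@"; do
        [ "$state" = "0" ] || return 1
    done
    return 0
}
```

A call such as may_ack_manual_fence 0 0 succeeds, while may_ack_manual_fence 0 1 fails because one node is still reported as powered on. The acknowledgement itself is still done by hand as shown in the example.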

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

Add additional fencing devices.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster Infrastructure

Keywords

Cluster freeze, fence manual, fence acknowledgement

HA-Manager Hangs

Created: 07/05/27 Last Review: 07/05/27 Revision: 1.0

Synopsis

The High Availability (HA) Manager hangs.

Description

The High Availability (HA) Manager monitors the system services. This process-monitoring system can be affected by an unknown bug that causes the HA-Manager to hang. In that case it has to be restarted.

Impact

CRITICAL: The HA-Manager is a critical component and must run at all times.

Data Pre-Recovery Assessment

The incident should be documented with the section called “Pre-Recovery Data Assessment 001”.

Correction

All Correction steps need to be coordinated with the application owners!

  1. Log into the affected node.

  2. Kill the hanging process and restart the HA-Manager on all nodes.

    bash# killall -9 clurgmgrd
    bash# service rgmanager start
    
  3. Test if rgmanager runs correctly with clustat.
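
Because a hanging HA-Manager shows itself by clustat not returning, it is safer to run the test with a time limit. The following sketch assumes that the timeout command from GNU coreutils is available on the node; the function name check_returns and the limit of 30 seconds in the usage note are hypothetical choices.

```shell
# check_returns SECONDS COMMAND [ARGS...]
# Hypothetical helper: prints "1" if COMMAND returns within SECONDS,
# "0" if it had to be stopped because it did not return in time.
check_returns() {
    local seconds=$1 rc
    shift

    # GNU timeout exits with status 124 when the time limit is hit.
    timeout "$seconds" "$@" > /dev/null 2>&1
    rc=$?

    if [ "$rc" -eq 124 ]; then
        echo 0
    else
        echo 1
    fi
}
```

For example, check_returns 30 clustat prints "0" if clustat is still running after 30 seconds. A result of "1" only means that clustat returned; its output still has to be read to confirm that rgmanager runs correctly.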

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

Use the latest stable release of Red Hat's cluster suite.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster Infrastructure

Keywords

HA-Manager, rgmanager hangs

HA-Manager Stopped

Created: 07/05/27 Last Review: 07/05/27 Revision: 1.0

Synopsis

The HA-Manager is stopped.

Description

The High Availability (HA) Manager monitors the system services. This process-monitoring system can be affected by an unknown bug that causes the HA-Manager to stop. In that case it has to be restarted.

Impact

CRITICAL: The HA-Manager is a critical component and must run at all times.

Data Pre-Recovery Assessment

The incident should be documented with the section called “Pre-Recovery Data Assessment 001”.

Correction

All Correction steps need to be coordinated with the application owners!

  1. Log into the affected node.

  2. Restart the HA-Manager.

    bash# service rgmanager start
    
  3. Test if rgmanager runs correctly with clustat.

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

Use the latest stable release of Red Hat's cluster suite.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster Infrastructure

Keywords

HA-Manager, rgmanager stopped

Clustered Services Stopped

Created: 07/05/27 Last Review: 07/05/27 Revision: 1.0

Synopsis

Some or all of the clustered services are stopped.

Description

Usually the High Availability (HA) Manager monitors the clustered services. The HA-Manager can reconstruct the state of the monitored services, but in a few cases this may fail. Then the services need to be restarted.

Impact

CRITICAL: The clustered services should run at all times.

Data Pre-Recovery Assessment

The incident should be documented with the section called “Pre-Recovery Data Assessment 001”.

Correction

All Correction steps need to be coordinated with the application owners!

  1. Log into the affected node.

  2. Kill any hanging process and restart the affected services.

    bash$ clusvcadm -d [servicename]
    bash$ clusvcadm -e [servicename] -m [nodename]
    
  3. Test if the services run correctly with clustat.
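
The disable and enable commands from step 2 can be wrapped so that a service is only re-enabled if disabling it succeeded. This is a sketch, not part of the cluster suite: the function name restart_clustered_service and the CLUSVCADM override variable are hypothetical, and the override exists only so that the sequence can be rehearsed with a stand-in command.

```shell
# restart_clustered_service SERVICE NODE
# Hypothetical wrapper around the two clusvcadm calls of step 2.
# The command can be replaced via the CLUSVCADM variable for a dry run.
restart_clustered_service() {
    local service=$1 node=$2
    local cmd=${CLUSVCADM:-clusvcadm}

    # Disable the service first; stop here if that fails.
    "$cmd" -d "$service" || return 1

    # Enable the service again on the given cluster node.
    "$cmd" -e "$service" -m "$node"
}
```

As in the manual procedure, the result should afterwards be verified with clustat.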

Data Post-Recovery Assessment

To document the issue, gather data according to the section called “Post-Recovery Data Assessment 001”.

Prevention

Use the latest stable release of the clustered services.

Reporting

See the section called “Reporting Process 001”

Applies to

Cluster Infrastructure

Keywords

Clustered services stopped

Checklists

This section lists all sub-checklists that are used in this document.

Check Node Power State

The power state of the cluster nodes can be checked with the server's Integrated Lights Out (ILO) module:

  1. Open an SSH connection to the ILO interface.

    bash$ ssh user@ilo.ip.addr.ess
    
  2. Check the power state with the power command.

  3. Read power state from output.

Possible Results:

"1": "power: server power is currently: On"

"0": "power: server power is currently: Off"

Example:

bash$ ssh atix@realserver10.ilo.cc.atix

hpiLO-> power

power: server power is currently: On

Result would be: "1", the server is powered up.
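
The mapping from the power command output to the checklist result can be sketched as a small shell helper. This is only an illustration: the function name power_result and the file name ilo-session.txt are hypothetical, and the helper assumes the output wording shown in the example above.

```shell
# Hypothetical helper: map the iLO "power" output (read from stdin) to the
# checklist result. Prints "1" if the server is reported as On, else "0".
power_result() {
    if grep -q 'server power is currently: On'; then
        echo 1
    else
        echo 0
    fi
}

# Possible usage with a captured session log (file name is an assumption):
#   power_result < ilo-session.txt
```

Anything other than the "On" message, including empty input, is treated as "0" in this sketch.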

Check Public Network Interfaces

The public network interfaces can be checked with the ping command.

  1. Ping network interface
    bash$ ping nic.ip.addr.ess
    
  2. Read percentage of packet loss.

Possible Results:

"1": "0% packet loss"

"0": "100% packet loss"

Examples:

bash$ ping node01.pub

PING node01.pub (10.19.41.1) 56(84) bytes of data.
64 bytes from node01.pub (10.19.41.1): icmp_seq=0 ttl=64 time=1.96 ms
64 bytes from node01.pub (10.19.41.1): icmp_seq=1 ttl=64 time=0.098 ms
64 bytes from node01.pub (10.19.41.1): icmp_seq=2 ttl=64 time=0.118 ms
64 bytes from node01.pub (10.19.41.1): icmp_seq=3 ttl=64 time=0.101 ms
64 bytes from node01.pub (10.19.41.1): icmp_seq=4 ttl=64 time=0.106 ms

--- node01.pub ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.098/0.477/1.965/0.744 ms, pipe 2

Result would be: "1", network interface is reachable.

bash$ ping node01.pub

PING node01.pub (10.19.41.1) 56(84) bytes of data.
64 bytes from node01.pub (10.19.41.1): icmp_seq=1 Destination Host Unreachable
64 bytes from node01.pub (10.19.41.1): icmp_seq=1 Destination Host Unreachable
64 bytes from node01.pub (10.19.41.1): icmp_seq=2 Destination Host Unreachable
64 bytes from node01.pub (10.19.41.1): icmp_seq=3 Destination Host Unreachable
64 bytes from node01.pub (10.19.41.1): icmp_seq=4 Destination Host Unreachable

--- node01.pub ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4001ms

Result would be: "0", network interface is not reachable.
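
Reading the packet loss from the statistics line can also be scripted. The following is a minimal sketch, assuming a Linux ping that prints a "% packet loss" summary as in the examples above; the function name ping_result is hypothetical. Partial packet loss is not covered by the result definitions above, so this sketch reports it as "0".

```shell
# Hypothetical helper: classify ping output read from stdin.
# Prints "1" for 0% packet loss and "0" for anything else.
ping_result() {
    # Extract the number in front of "% packet loss" from the statistics line.
    loss=$(sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p')
    if [ "$loss" = "0" ]; then
        echo 1
    else
        echo 0
    fi
}

# Possible usage (assumes ping supports -c for a fixed packet count):
#   ping -c 5 node01.pub | ping_result
```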

Check Private Network Interfaces And Bonding

Private network and bonding interfaces can be checked with the ip command after a successful login.

  1. Show the network interfaces

    bash$ ip addr
    
  2. Read status.

Possible Results:

"1": All interfaces up.

"0": Some or all interfaces down.

Examples:

bash$ ip addr

1: lo: [LOOPBACK,UP] mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
4: eth0: [BROADCAST,MULTICAST,SLAVE,UP] mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:18:fe:87:81:b2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::218:feff:fe87:81b2/64 scope link
       valid_lft forever preferred_lft forever
5: eth1: [BROADCAST,MULTICAST,SLAVE,UP] mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:18:fe:87:81:b2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::218:feff:fe87:81b2/64 scope link
       valid_lft forever preferred_lft forever
6: bond0: [BROADCAST,MULTICAST,MASTER,UP] mtu 1500 qdisc noqueue
    link/ether 00:18:fe:87:81:b2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::218:feff:fe87:81b2/64 scope link
       valid_lft forever preferred_lft forever

Result would be: "1", network interfaces are up.

bash$ ip addr

1: lo: [LOOPBACK,UP] mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
4: eth0: [BROADCAST,MULTICAST,SLAVE] mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:18:fe:87:81:b2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::218:feff:fe87:81b2/64 scope link
       valid_lft forever preferred_lft forever
5: eth1: [BROADCAST,MULTICAST,SLAVE,UP] mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:18:fe:87:81:b2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::218:feff:fe87:81b2/64 scope link
       valid_lft forever preferred_lft forever
6: bond0: [BROADCAST,MULTICAST,MASTER,UP] mtu 1500 qdisc noqueue
    link/ether 00:18:fe:87:81:b2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::218:feff:fe87:81b2/64 scope link
       valid_lft forever preferred_lft forever

Result would be: "0", eth0 network interface is not up.
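
Spotting a missing UP flag by eye is error-prone on nodes with many interfaces. As a rough sketch, the flag lists could be scanned with awk; the function name interfaces_down is hypothetical, and the helper assumes that every interface header line starts with its index number and carries a comma-separated flag list in square or angle brackets.

```shell
# Hypothetical helper: print the names of all interfaces whose flag list
# does not contain the exact flag "UP". Reads "ip addr" output from stdin.
interfaces_down() {
    awk '
        /^[0-9]+: / {
            name = $2
            sub(/:$/, "", name)            # "eth0:" becomes "eth0"
            flags = $3
            gsub(/[<>]/, "", flags)        # strip angle brackets
            gsub(/\[|\]/, "", flags)       # strip square brackets
            n = split(flags, f, ",")
            up = 0
            for (i = 1; i <= n; i++)
                if (f[i] == "UP") up = 1
            if (!up) print name
        }'
}

# Possible usage:
#   ip addr | interfaces_down
```

Empty output corresponds to result "1"; any printed interface name corresponds to result "0". The flags are compared as whole words so that a flag such as LOWER_UP is not mistaken for UP.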

Check SSH Port

The SSH port of a node can be checked with the telnet command.

  1. Telnet to the SSH port of the cluster node

    bash$ telnet node.ip.addr.ess 22
    
  2. Read SSH version string to identify if SSH is running.

Possible Results:

"1": "Connected (...) SSH-1.99-OpenSSH_3.9p1"

"0": "Trying (...) Unable to connect to remote host: Connection refused"

Examples:

bash$ telnet node01 22

Trying 192.123.123.123...
Connected to node01 (192.123.123.123).
Escape character is '^]'.
SSH-1.99-OpenSSH_3.9p1

Connection closed by foreign host.

Result would be: "1", SSH port is reachable.

bash$ telnet node01 22

Trying 192.123.123.123...
telnet: connect to address 192.123.123.123: Connection refused
telnet: Unable to connect to remote host: Connection refused

Result would be: "0", SSH port is not reachable.

Check SSH Login with Local User

The SSH login with a local user can be checked with the ssh command.

  1. Make an SSH connection to the cluster node with a local user

    bash$ ssh luser@node.ip.addr.ess
    
  2. Check for your login prompt.

Possible Results:

"1": "Last login: Thu May 24 00:33:44 2007 from 192.123.123.123 bash$"

"0": Login hangs.

Examples:

bash$ ssh luser@node01

luser@node01's password:
Last login: Thu May 24 00:33:44 2007 from 192.123.123.123
bash$

Result would be: "1", SSH login with local user is possible.

bash$ ssh luser@node01

hangs...

Result would be: "0", SSH login with local user is not possible.

Check SSH Login with LDAP User

The SSH login with an LDAP user can be checked with the ssh command.

  1. Make an SSH connection to the cluster node with an LDAP user

    bash$ ssh ldapuser@node.ip.addr.ess
    
  2. Check for your login prompt.

Possible Results:

"1": "Last login: Thu May 24 00:33:44 2007 from 192.123.123.123 bash$"

"0": Login hangs or "Permission denied, please try again."

Examples:

bash$ ssh ldapuser@node01

ldapuser@node01's password:
Last login: Thu May 24 00:33:44 2007 from 192.123.123.123
bash$

Result would be: "1", SSH login with LDAP user is possible.

bash$ ssh ldapuser@node01

Permission denied, please try again.

Result would be: "0", SSH login with LDAP user is not possible.

Check Console Login with Local User

The console login with a local user can be checked by logging in at the console login shell.

  1. Make a connection to the ILO interface of the cluster node or use console to login with a local user account.
  2. Check for your login prompt.

Possible Results:

"1": "Last login: Thu May 24 00:33:44 2007 from 192.123.123.123 bash$"

"0": Login hangs or no login prompt

Examples:

login: luser

luser@node01's password:
Last login: Thu May 24 00:33:44 2007 from 192.123.123.123
bash$

Result would be: "1", console login with local user is possible.

login: luser

hangs...

Result would be: "0", console login with local user is not possible.

Check Fence Acknowledgement Server

The fence acknowledgement server can be checked by logging in with telnet.

  1. Telnet to the fenceack_server port of the cluster node

    bash$ telnet node.ip.addr.ess 12242
    
  2. Check for fence acknowledgement server prompt.

Possible Results:

"1": "(Cmd)"

"0": "Connection refused"

Examples:

bash$ telnet node01 12242

Trying 192.123.123.123...
Connected to node01 (192.123.123.123).
Escape character is '^]'.
Username: root
Password: password
(Cmd)

Result would be: "1", fence acknowledgement server is available.

bash$ telnet node01.pub 12242

Trying 192.123.123.123...
telnet: connect to address 192.123.123.123: Connection refused
telnet: Unable to connect to remote host: Connection refused

Result would be: "0", fence acknowledgement server is not available.

Check Fence Manual

By logging in to the fence acknowledgement server, it can be checked whether manual fencing (fence manual) has occurred.

  1. Telnet to the fenceack_server port of the cluster node

    bash$ telnet node.ip.addr.ess 12242
    
  2. Check for the fence manual message after the login. (In older versions you may need to execute a command like help to see the message.)

Possible Results:

"1": "Fence manual is in progress. Please make sure the fenced node is
powercylcled and execute ackmanual here."

"0": No message after login to the fence acknowledgement server.

Examples:

bash$ telnet node01 12242

Trying 192.123.123.123...
Connected to node01 (192.123.123.123).
Escape character is '^]'.
Username: root
Password: password
Fence manual is in progress. Please make sure the fenced node is powercylcled
and execute ackmanual here.
(Cmd)

Result would be: "1", fence manual is in progress.

bash$ telnet node01 12242

Trying 192.123.123.123...
Connected to node01 (192.123.123.123).
Escape character is '^]'.
Username: root
Password: password
(Cmd)

Result would be: "0", no manual fencing is to be resolved.

Check for Storage Errors

Storage errors can be found by applying regular expressions to the log files.

  1. Log in to the logging server.
  2. Grep for SCSI errors in /var/log/messages.

Possible Results:

"1": "SCSI error"

"0": No errors found.

Examples:

bash# grep "SCSI error" /var/log/messages

/var/log/messages.15.gz:May 10 14:23:37 node627a kernel: SCSI error :
[1 0 0 26] return code = 0x20000
/var/log/messages.15.gz:May 10 14:23:37 node627a kernel: SCSI error :
[1 0 0 26] return code = 0x20000
/var/log/messages.15.gz:May 10 14:23:37 node627a kernel: SCSI error :
[1 0 1 28] return code = 0x20000
/var/log/messages.15.gz:May 10 14:23:37 node627a kernel: SCSI error :
[1 0 1 28] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 21] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000
/var/log/messages.15.gz:May 10 14:29:14 node627a kernel: SCSI error :
[1 0 0 2] return code = 0x20000

Result would be: "1", SCSI errors found.

bash$ grep "SCSI error" /var/log/messages

nothing found

Result would be: "0", no SCSI errors found.

Check for Fencing Errors

Fencing errors can be found by applying regular expressions to the log files.

  1. Log in to the logging server.
  2. Grep for fenced in /var/log/messages and watch out for errors.

Possible Results:

"1": Fencing error found, no manual fencing occurred.

"0": No fencing errors found, or manual fencing occurred.

Examples:

bash# grep "fenced" /var/log/messages

May 23 14:05:57 node202b fenced[21743]: fencing node "node202a-ics0"
May 23 14:06:00 node202b fenced[21743]: agent "/opt/atix/comoonics-
fencing/fence_ilo" reports: Traceback (most recent call last):
   File "/opt/atix/comoonics-fencing/fence_ilo", l
ine 343, in main     ilo=FenceIlo(Config)   File "/opt/atix/comoonics-
fencing/fence_ilo", line 110, in __init__     self.connect()
   File "/opt/atix/comoonics-fencing/fenc
May 23 14:06:00 node202b fenced[21743]: agent "/opt/atix/comoonics-
fencing/fence_ilo" reports: e_ilo", line 128, in connect
     self.socket.connect((self.address, self.port))
     File "[string]", line 1, in connect error: (111, 'Connection
      refused') read login=power read
 passwd=somepassword read action=hardoff Python socket client.
  Connecting to: 1
May 23 14:06:00 node202b fenced[21743]: agent "/opt/atix/comoonics-
fencing/fence_ilo" reports: 0.226.10.49:443

Result would be: "1", node is fenced but an error occurred and manual fencing messages cannot be found.

bash# grep "fenced" /var/log/messages

May 23 14:05:57 node202b fenced[21743]: fencing node "node202a-ics0"
May 23 14:06:00 node202b fenced[21743]: agent "/opt/atix/comoonics-
fencing/fence_ilo" reports: Traceback (most recent call last):
File "/opt/atix/comoonics-fencing/fence_ilo", l
ine 343, in main     ilo=FenceIlo(Config)   File "/opt/atix/comoonics-
fencing/fence_ilo", line 110, in __init__     self.connect()
File "/opt/atix/comoonics-fencing/fenc
May 23 14:06:00 node202b fenced[21743]: agent "/opt/atix/comoonics-
fencing/fence_ilo" reports: e_ilo", line 128, in connect
self.socket.connect((self.address, self.port))
File "[string]", line 1, in connect error: (111, 'Connection
refused') read login=power read
passwd=somepassword read action=hardoff Python socket client.
Connecting to: 1
May 23 14:06:00 node202b fenced[21743]: agent "/opt/atix/comoonics-
fencing/fence_ilo" reports: 0.226.10.49:443
May 23 14:06:00 node202b fence_manual: Node node202a-ics0 needs to
be reset before recovery can procede.  Waiting for node202a-ics0
to rejoin the cluster or for manual acknowledgement that it has
been reset (i.e. fence_ack_manual -n node202a-ics0)

Result would be: "0", node is fenced and the agent reports errors, but fence_manual occurred.
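
The decision between "1" and "0" depends on two kinds of log entries: agent error reports from fenced and messages from fence_manual. A minimal sketch of that decision follows. The function name fencing_result is hypothetical, and the patterns are assumptions derived from the log excerpts above (each syslog entry on a single line, agent failures logged as agent "..." reports:).

```shell
# Hypothetical helper: classify fencing-related log lines read from stdin.
# Prints "1" if a fence agent reported an error and no fence_manual message
# is present, otherwise "0".
fencing_result() {
    awk '
        /fenced\[[0-9]+\]: agent .* reports:/ { agent_error = 1 }
        / fence_manual: /                      { manual = 1 }
        END {
            if (agent_error && !manual) print 1
            else print 0
        }'
}

# Possible usage:
#   grep -e fenced -e fence_manual /var/log/messages | fencing_result
```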

Check Cluster Status

The cluster status can be checked with cman_tool.

  1. Log in on a cluster node

  2. Display the local view of the cluster status

    bash$ cman_tool status
    
  3. Check if config version number is correct.

    This check will be implemented in com-ec soon.

    If you choose to compare it manually, you have to mount the boot volume. Double-check that you are solely accessing the boot volume! The boot volume is not formatted with a cluster filesystem!

    # mount /boot
    # mkdir -p /tmp/initrd-tmp
    # cd /tmp/initrd-tmp
    # gunzip -c /boot/initrd_sr-2.6.9-34.0.1.ELsmp.img | cpio -ivd
    # diff /tmp/initrd-tmp/etc/cluster/cluster.conf /etc/cluster/cluster.conf
    

    The two files should be identical, in which case you can safely read the version information from /etc/cluster/cluster.conf. If they differ, compare the version and the changes in both files. The cluster.conf that is included in the initrd will be active after reboot!

  4. Check if cluster status is correct.

Possible Results:

"1": Cluster status is correct.

"0": Cluster status is incorrect.

Examples:

bash$ cman_tool status

Protocol version: 5.0.1
Config version: 12
Cluster name: cluster01
Cluster ID: 28712
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 6
Expected_votes: 6
Total_votes: 6
Quorum: 4
Active subsystems: 20
Node name: node01
Node ID: 1
Node addresses: 192.123.123.123

Result would be: "1", the node is a valid cluster member and the version of the cluster.conf is 12 as it should be in this example.

bash$ cman_tool status

Protocol version: 5.0.1
Config version: 11
Cluster name: cluster01
Cluster ID: 28712
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 4
Expected_votes: 6
Total_votes: 4
Quorum: 4
Active subsystems: 20
Node name: node01
Node ID: 1
Node addresses: 192.123.123.123

Result would be: "0", the total votes (4) are below the expected votes (6), and the config version (11) is outdated if 12 is the newest version of cluster.conf.
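
The criteria used in the two examples (cluster membership, config version, and total versus expected votes) can be evaluated mechanically. The sketch below is one possible reading of "cluster status is correct", not an official definition; the function name cluster_status_result is hypothetical, and the expected config version has to be supplied by the operator as the first argument.

```shell
# Hypothetical helper: evaluate "cman_tool status" output read from stdin.
# $1 is the config version the operator expects.
# Prints "1" if the node is a cluster member, the config version matches
# and the total votes reach the expected votes; otherwise "0".
cluster_status_result() {
    awk -F': *' -v want="$1" '
        $1 == "Config version" { version  = $2 }
        $1 == "Expected_votes" { expected = $2 }
        $1 == "Total_votes"    { total    = $2 }
        $1 == "Cluster Member" { member   = $2 }
        END {
            ok = (member == "Yes" && version == want && total + 0 >= expected + 0)
            print (ok ? 1 : 0)
        }'
}

# Possible usage:
#   cman_tool status | cluster_status_result 12
```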

Check Nodes Status

The node status can be checked with cman_tool.

  1. Log in on a cluster node

  2. Display the local view of the cluster nodes

    bash$ cman_tool nodes
    
  3. Check if all cluster nodes are visible.

Possible Results:

"1": All cluster nodes are there.

"0": Some cluster nodes are missing.

Examples:

bash$ cman_tool nodes

Node  Votes Exp Sts  Name
1    1    6   M   node01
2    1    6   M   node02
3    1    1   M   node03
4    1    6   M   node04
5    1    1   M   node05
6    1    1   M   node06

Result would be: "1", if there are 6 nodes in the cluster.

bash$ cman_tool nodes

Node  Votes Exp Sts  Name
1    1    6   M   node01
2    1    6   X   node02
3    1    1   M   node03
4    1    6   X   node04
5    1    1   X   node05
6    1    1   M   node06

Result would be: "0", if there are 6 nodes in the cluster: node02, node04 and node05 are not listed as members (status X instead of M).
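
Counting the member nodes can be scripted as well. This is a minimal sketch that assumes the column layout shown above, with the status in the fourth column and M marking a member; the function name nodes_result is hypothetical, and the number of nodes the cluster should have is passed as the first argument.

```shell
# Hypothetical helper: evaluate "cman_tool nodes" output read from stdin.
# $1 is the number of nodes that should be cluster members.
# Prints "1" if exactly that many nodes have status "M", otherwise "0".
nodes_result() {
    awk -v want="$1" '
        # Data rows start with the numeric node ID; this skips the header.
        $1 ~ /^[0-9]+$/ && $4 == "M" { members++ }
        END {
            ok = (want + 0 > 0 && members + 0 == want + 0)
            print (ok ? 1 : 0)
        }'
}

# Possible usage:
#   cman_tool nodes | nodes_result 6
```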

Check HA Manager Status

The status of the HA Manager can be checked with clustat.

The clustat command displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services.

  1. Log in on a cluster node

  2. Evaluate the status of the cluster's HA-Manager

    bash$ clustat
    
  3. Check if the tool can be executed correctly.

Possible Results:

"1": clustat can be executed correctly.

"0": The clustat tool hangs after execution.

Examples:

bash$ clustat

Member Status: Quorate

Member Name                              Status
------ ----                              ------
node01                                   Online, Local, rgmanager
node02                                   Online, rgmanager
node03                                   Online, rgmanager
node04                                   Online, rgmanager
node05                                   Online, rgmanager
node06                                   Online, rgmanager

Service Name         Owner (Last)                   State
------- ----         ----- ------                   -----
cms                  node01                         started
live                 node04                         started
multi-pub            node06                         started
multi-int            node05                         started

Result would be: "1", the tool executes correctly. The displayed status is not important at this step.

bash$ clustat

Hangs.

Result would be: "0", the tool hangs.

Check Cluster Daemons

Check whether the cluster daemons are running.

The clustat command displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services.

  1. Log in on a cluster node

  2. Evaluate the status of the cluster daemons

    bash$ clustat
    
  3. Check if the rgmanager runs on the cluster nodes and if services are listed.

Possible Results:

"1": rgmanager runs on all nodes and services are listed.

"0": rgmanager does not run on all nodes and services are not listed.

Examples:

bash$ clustat

Member Status: Quorate

Member Name                              Status
------ ----                              ------
node01                                   Online, Local, rgmanager
node02                                   Online, rgmanager
node03                                   Online, rgmanager
node04                                   Online, rgmanager
node05                                   Online, rgmanager
node06                                   Online, rgmanager

Service Name         Owner (Last)                   State
------- ----         ----- ------                   -----
cms                  node01                         started
live                 node04                         started
multi-pub            node06                         started
multi-int            node05                         started

Result would be: "1", rgmanager runs on all nodes and the services are listed.

bash$ clustat

Member Status: Quorate

Resource Group Manager not running; no service information available.

Member Name                              Status
------ ----                              ------
node01                                   Online, Local
node02                                   Online
node03                                   Online
node04                                   Online
node05                                   Online
node06                                   Online

Service Name         Owner (Last)                   State
------- ----         ----- ------                   -----

Result would be: "0", the "Resource Group Manager not running" message is shown and no services are displayed.
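
To see at a glance which members lack the rgmanager flag, the member table can be filtered. The following sketch assumes the clustat layout shown above (a "Member Name" table followed by a "Service Name" table); the function name members_without_rgmanager is hypothetical.

```shell
# Hypothetical helper: print every member from the clustat member table
# whose status column does not include "rgmanager". Reads clustat output
# from stdin.
members_without_rgmanager() {
    awk '
        /^ *Member Name/  { in_members = 1; next }   # member table starts
        /^ *Service Name/ { in_members = 0 }         # service table starts
        !in_members || /^ *-/ || NF == 0 { next }    # skip rulers and blanks
        {
            has = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^rgmanager,?$/) has = 1
            if (!has) print $1
        }'
}

# Possible usage:
#   clustat | members_without_rgmanager
```

Empty output means that rgmanager is reported for every listed member.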

Check Clustered Services

Check if Clustered Services are running

The clustat command displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services.

  1. Log in on a cluster node

  2. Evaluate the status of the clustered services

    bash$ clustat
    
  3. Check if the listed services match the services that should run on the clustered node.

Possible Results:

"1": All services are running.

"0": Some or all services are missing.

Examples:

bash$ clustat

Member Status: Quorate

Member Name                              Status
------ ----                              ------
node01                                   Online, Local, rgmanager
node02                                   Online, rgmanager
node03                                   Online, rgmanager
node04                                   Online, rgmanager
node05                                   Online, rgmanager
node06                                   Online, rgmanager

Service Name         Owner (Last)                   State
------- ----         ----- ------                   -----
cms                  node01                         started
live                 node04                         started
multi-pub            node06                         started
multi-int            node05                         started

Result would be: "1", all 4 services are started on the nodes.

bash$ clustat

Member Name                              Status
------ ----                              ------
node01                                   Online, Local
node02                                   Online, rgmanager
node03                                   Online, rgmanager
node04                                   Online, rgmanager
node05                                   Online, rgmanager
node06                                   Online, rgmanager

Service Name         Owner (Last)                   State
------- ----         ----- ------                   -----
cms                  node01                         started
live                 none (node04)                  stopped
multi-pub            node06                         started
multi-int            none (node05)                  stopped

Result would be: "0", some services are stopped.
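
Listing the services that are not in the started state can be sketched in the same way. The helper below assumes the clustat layout shown above, with the state in the last column of the service table; the function name stopped_services is hypothetical. It does not detect services that are missing from the table entirely, so the list still has to be compared with the services that should run.

```shell
# Hypothetical helper: print the name of every service in the clustat
# service table whose state is not "started". Reads clustat output
# from stdin.
stopped_services() {
    awk '
        /^ *Service Name/ { in_services = 1; next }  # service table starts
        !in_services      { next }
        /^ *-/ || NF == 0 { next }                   # skip ruler and blanks
        $NF != "started"  { print $1 }'
}

# Possible usage:
#   clustat | stopped_services
```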

Power Checklist

Checking the power means checking both the unit and the unit's source of power.

  1. Make sure all power cords are firmly in place.

    Power cords can fall out of ports or sockets, so check to make sure they are firmly plugged in.

    If the device experiencing problems has a removable power cord, make sure that the cord is plugged firmly into both the device and the electrical outlet. Also check the power cord to make sure that it is not damaged. Sometimes power cords can get severed or partially cut.

  2. Make sure that all equipment power lights are on.

    A device may be firmly plugged in, yet not have power.

    Most devices found in data centers have power indicator lights on the front of the unit. Make sure that each piece of equipment that is plugged in has its power indicator light on. If the light is not on, then either the electrical outlet is not delivering any power, the power cord is damaged, or the unit cannot receive power and therefore needs to be replaced.

  3. Test the equipment and electrical outlet.

    Unit Test: Plug the unit into another electrical outlet that you have confirmed is working by conducting an electrical outlet test (see below). If the unit works, then the original outlet should be serviced. If, however, the unit still doesn't work after being plugged into the new outlet, then report the problem to the hardware vendor because the unit may need service.

    Electrical Outlet Test: Plug another test device that works into the electrical outlet in question. If it works, then the electrical outlet is working. If it fails then the electrical outlet may be in need of repair and appropriate actions should be taken.

  4. Power Cord Test: If the power cord is removable, you can try replacing it with another removable power cord of the same type and brand. If the unit works with a new power cord, your original power cord is damaged and should be replaced.

Data Assessments

To be able to analyze incidents and solve problems, data from the affected system needs to be collected. Some of the data must be collected before the system recovery process can be started. Data that does not depend on the actual system state can be collected after a system recovery.

Pre-Recovery Data Assessments

This section describes the procedures for collecting data for further analysis before the cluster has been recovered.

Pre-Recovery Data Assessment 001

This section defines a general pre-recovery data assessment process.

Red Hat Enterprise Linux has a built-in tool, sysreport, to gather specific system information for troubleshooting.

This utility will go through and collect some detailed information about the hardware and setup of your Red Hat Linux system. This information then can be used to diagnose cluster problems.

Note

No changes will be made to your system during this process.

To use this tool, perform the following steps:

  1. Choose the cluster node you want to collect data from.

  2. Log into a shell that has privileges to access the configuration files.

  3. Execute sysreport and be patient. This process may take a while.

  4. Transmit the resulting file together with a detailed incident description to ATIX. See the section called “Reporting Processes”.

Example:

bash# sysreport

This utility will go through and collect some detailed information
about the hardware and setup of your Red Hat Linux system.
This information will be used to diagnose problems with your system
and will be considered confidential information.  Red Hat will use
this information for diagnostic purposes ONLY.

Please wait while we collect information about your system.

This process may take a while to complete....
No changes will be made to your system during this process.

NOTE: You can safely ignore a failed message. This only means a file
we were checking for did not exist.

If your system hangs while gathering rpm information, please abort
the script with CTRL-C and run it again after adding -norpm to the
sysreport command line

Press ENTER to continue, or CTRL-C to quit.


Getting system configuration information.

Determining Red Hat Linux version:                         [  OK  ]
Determinding your current hostname:                        [  OK  ]
Getting the date:                                          [  OK  ]
Checking your systems current uptime and load average:     [  OK  ]

Checking current process tree:                             [  OK  ]
Collecting information about ld.so.conf:                   [  OK  ]

Collecting information about system authentication (pam):  [  OK  ]

Checking module information fglrx:                         [  OK  ]

Getting disk and filesystem information.

Collecting information from /etc/fstab:                    [  OK  ]

Collecting global devices list (lshal):                    [  OK  ]

collecting information about commonly used network services

Collecting information about system services (xinetd.conf) [  OK  ]

Getting information about CUPS (/etc/cups/snmp.conf)       [  OK  ]

Gathering information from system logs

Collecting information from dmesg:                         [  OK  ]

Collecting log files from Apache                           [  OK  ]

Getting information about RHN


Gathering information on SELinux setup

Collecting log files from RHN                              [  OK  ]

Please enter your case number (if you have one):

Please send /root/hostname.domain.2007052553430.tar.bz2 to your support
representative.

Pre-Recovery Data Assessment 002 (emergency)

This section defines a general pre-recovery data assessment process.

The fence acknowledgement server has the ability to spawn an emergency shell. From there, comhf-sysreport can be used to gather specific system information for troubleshooting.

This utility will go through and collect some detailed information about the hardware and setup of your Red Hat Linux system. This information then can be used to diagnose cluster problems.

Note

No changes will be made to your system during this process.

To use this tool, perform the following steps:

  1. Choose the cluster node you want to collect data from.

  2. Telnet to the acknowledgement server

    bash$ telnet node01 12242
    
  3. Execute shell to spawn an emergency shell

  4. Execute comhf-sysreport and be patient. This process may take a while.

  5. Transmit the resulting file together with a detailed incident description to ATIX. See the section called “Reporting Processes”.

Example:

bash# comhf-sysreport -v

(Cmd) shell
bash# comhf-sysreport -v
preparing chroot environment
- mounting /proc
starting com-sysinfo
Gathering procfs cluster information (/proc/cluster):
copy /proc/cluster
Gathering procfs cluster information (/proc/cluster):[ OK ]
Gathering dlm_locks (/proc/cluster/dlm_locks)
Gathering dlm_locks (/proc/cluster/dlm_locks)
Getting locks for lt_sharedroot
Getting locks for clvmd
Getting locks for Magma
Gathering slabinfo (/proc/slabinfo)
copy /proc/slabinfo
Gathering slabinfo (/proc/slabinfo)[ OK ]
/bin/dmesg
Gathering dmesg output[ OK ]
creating tar.gz file
/
cleaning up
- umounting /proc
DONE.
INFO: The sysreport can be found at /tmp/fence_tool//var/com-sysinfo/2007-05-25-075154/node202a_sysreport_2007-05-25.tgz

Post-Recovery Data Assessments

This section describes the procedure for collecting data for further analysis after the cluster has been recovered.

Post-Recovery Data Assessment 001

This section defines a general post-recovery data assessment process.

Red Hat Enterprise Linux has a built-in tool, sysreport, to gather specific system information for troubleshooting.

This utility will go through and collect some detailed information about the hardware and setup of your Red Hat Linux system. This information then can be used to diagnose cluster problems.

Note

No changes will be made to your system during this process.

To use this tool, perform the following steps:

  1. Choose the cluster node you want to collect data from.
  2. Log into a shell that has privileges to access the configuration files.
  3. Execute sysreport and be patient. This process may take a while.
  4. Also get the kernel dumps at /var/crash/.
  5. Transmit the resulting files together with a detailed incident description to ATIX. See the section called “Reporting Processes”.

Example:

bash# sysreport

This utility will go through and collect some detailed information
about the hardware and setup of your Red Hat Linux system.
This information will be used to diagnose problems with your system
and will be considered confidential information.  Red Hat will use
this information for diagnostic purposes ONLY.

Please wait while we collect information about your system.

This process may take a while to complete....
No changes will be made to your system during this process.

NOTE: You can safely ignore a failed message. This only means a file
we were checking for did not exist.

If your system hangs while gathering rpm information, please abort
the script with CTRL-C and run it again after adding -norpm to the
sysreport command line

Press ENTER to continue, or CTRL-C to quit.


Getting system configuration information.

Determining Red Hat Linux version:                         [  OK  ]
Determinding your current hostname:                        [  OK  ]
Getting the date:                                          [  OK  ]
Checking your systems current uptime and load average:     [  OK  ]

Checking current process tree:                             [  OK  ]
Collecting information about ld.so.conf:                   [  OK  ]

Collecting information about system authentication (pam):  [  OK  ]

Checking module information fglrx:                         [  OK  ]

Getting disk and filesystem information.

Collecting information from /etc/fstab:                    [  OK  ]

Collecting global devices list (lshal):                    [  OK  ]

collecting information about commonly used network services

Collecting information about system services (xinetd.conf) [  OK  ]

Getting information about CUPS (/etc/cups/snmp.conf)       [  OK  ]

Gathering information from system logs

Collecting information from dmesg:                         [  OK  ]

Collecting log files from Apache                           [  OK  ]

Getting information about RHN


Gathering information on SELinux setup

Collecting log files from RHN                              [  OK  ]

Please enter your case number (if you have one):

Please send /root/hostname.domain.2007052553430.tar.bz2 to your support
representative.
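
The collection steps above can be sketched as a small shell helper. This is a minimal sketch and not part of sysreport: the function name bundle_incident_files, its argument order, and the output file name are assumptions made for illustration. Because sysreport is interactive, it is run by hand first; the helper only packs the resulting archive together with the crash dump directory.

```shell
#!/bin/bash
# Hypothetical helper: pack a sysreport archive and a crash dump directory
# into one tarball for the support team.
# Usage: bundle_incident_files <sysreport-archive> <crash-dir> <output.tar>
bundle_incident_files() {
    local report="$1" crashdir="$2" out="$3"

    # Refuse to continue if the sysreport archive is missing.
    if [ ! -f "$report" ]; then
        echo "sysreport archive not found: $report" >&2
        return 1
    fi

    # The crash directory is optional: include it only if it exists.
    if [ -d "$crashdir" ]; then
        tar -cf "$out" "$report" "$crashdir"
    else
        tar -cf "$out" "$report"
    fi
}
```

With the archive name from the example output, a call might look like bundle_incident_files /root/hostname.domain.2007052553430.tar.bz2 /var/crash /root/incident.tar.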

Reporting Processes

This section contains the reporting processes for contacting external support.

Reporting Process 001

To resolve cluster incidents quickly, technical issues need to be documented prior to contacting technical support.

Besides the information that is acquired with the data assessment process, you need the details of your support entitlement when you discuss the issue with the support team.

The following steps illustrate the requirements to report an incident:

  1. Define the Problem

    Defining the problem and its symptoms before contacting ATIX or other support technicians will speed up the recovery process.

    The support personnel want to make sure that the problem solving process is accurate. Therefore the incident report needs to be detailed.

  2. Gather Background Information

    To solve an issue efficiently, the support team needs all relevant information to understand and reproduce the failure. What steps led to the failure? Can it be recreated? Are there any error messages? You should be able to answer such questions.

  3. Gather Diagnostic Information

    You should follow the data assessment procedure as far as possible. The more information provided the better.

  4. Determine Severity Level

    Please let us know your rating of the Impact.

  5. Contact support

    Depending on the type of error, its severity and your service level agreement, contact the support personnel by phone or via the web.

    For example, the ATIX ticket system is reachable under this URL: http://troubleticket.atix.de.
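
The reporting steps above map onto a plain-text report skeleton that can be filled in before contacting support. The following is a minimal sketch: the function name incident_report_template and the field labels are assumptions for illustration and are not a format prescribed by ATIX.

```shell
#!/bin/bash
# Hypothetical helper: print an incident report skeleton whose sections
# follow the reporting steps. Field names are illustrative only.
# Usage: incident_report_template <node> <severity>
incident_report_template() {
    local node="$1" severity="$2"
    cat <<EOF
Incident report
Node:               $node
Severity (impact):  $severity

1. Problem definition and symptoms:

2. Background (steps that led to the failure, reproducible?, error messages):

3. Diagnostic data attached (sysreport archive, kernel dumps):

4. Support entitlement details:
EOF
}
```

Redirecting the output to a file, for example incident_report_template node1 high > incident.txt, gives a starting point that can be completed by hand.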

Cluster Maintenance

This section describes how to perform cluster maintenance.

Cluster Shutdown

Sometimes a cluster needs a complete shutdown for maintenance, for example after a kernel upgrade. A cluster should not be shut down in haste, however. System administrators need to know how the cluster is being used and should actively involve all users in scheduling downtime. This procedure defines a clean way to shut down the cluster nodes. If it is impossible to follow every step, follow the procedure as closely as possible. If you have any doubts about shutting down the cluster, ask colleagues who share responsibility for it for a second opinion.

  1. Stop all services on the cluster

    Services can be stopped with clusvcadm -d $service where $service is the name of the service. All services are listed in /etc/init.d/.

  2. Optional: Unmount unnecessary filesystems

    The shutdown command unmounts all filesystems automatically, but you may want to control the order. You can unmount a filesystem with umount $mountpoint, where $mountpoint is the directory the volume is currently mounted on. You can review which filesystems are mounted with mount. All mounted filesystems are also recorded in /etc/mtab.

  3. Increase the number of expected votes

    Data integrity in a cluster is ensured by a quorum mechanism. Quorum depends on several factors, such as expected votes over a period of time and quorum disk votes. If quorum is lost, the cluster stops all activity. So if you power down node after node, you will eventually freeze the cluster. To temporarily bypass this safety mechanism, adjust the number of votes of the node you wish to power down last.

    This can be done with cman_tool votes N+1.

    N+1 means you should use at least one vote more than the total number of required votes.

  4. Optional: Turn Resource Group Manager off

    If you don't want the Resource Group Manager to start the cluster services automatically, you may want to disable it temporarily. In most cases the Resource Group Manager is configured to restart a cluster service as soon as enough cluster nodes have joined, even though the node the service should run on may still be booting. Disabling it saves you from having to relocate the cluster services later, which would induce a failover.

    To disable rgmanager you can use the command chkconfig rgmanager off.

    Make sure that you enable it again after the cluster is up! You can do this by typing chkconfig rgmanager on. You also need to start the service manually with clusvcadm -e $servicename -m $clusternode. You can review the names of the cluster nodes with clustat.

  5. Shut down all but one node

    You can either log in to each node separately, use a distributed shell, or use the following script to perform the shutdown:

    # for node in "node1" "node2" "node-n"
    do
      ssh "$node" 'shutdown -h +5 "Cluster will be shutdown because of..."'
    done

    This command uses ssh to connect to each listed node and executes a complete shutdown with a five-minute delay and a message explaining why the cluster needs to be shut down. ssh must be configured for RSA or DSA authentication.

    During the delay you can cancel the shutdown with shutdown -c.

  6. Shut down the last node

    After all the other nodes are shut down you may execute the last step for the remaining node.
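
The reasoning behind step 3 can be checked with a little arithmetic before any node is powered down. The following is a minimal sketch under stated assumptions: it uses the commonly documented CMAN rule that quorum is half of the expected votes plus one (integer division), it assumes that the expected votes equal the sum of all node votes, and it ignores quorum disks. The function names quorum_for and last_node_quorate are hypothetical.

```shell
#!/bin/bash
# Quorum threshold, assuming the commonly documented CMAN rule:
# quorum = expected_votes / 2 + 1 (integer division).
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if the last remaining node would be quorate on its own.
# Assumption: expected votes equal the sum of all node votes, no quorum disk.
# Usage: last_node_quorate <votes-of-all-other-nodes> <votes-of-last-node>
last_node_quorate() {
    local others="$1" last="$2"
    local quorum
    quorum=$(quorum_for $(( others + last )))
    [ "$last" -ge "$quorum" ]
}
```

For three nodes with one vote each, last_node_quorate 2 1 fails, because a single vote is below the quorum of two. After raising the last node to three votes, last_node_quorate 2 3 succeeds. Verify the actual vote counts of your own cluster before relying on such a calculation.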

Start Cluster

This section describes how a cluster can be started.

Boot the cluster nodes

You should not boot all nodes at once, so that the power supply is not overloaded. However, it is usually no problem to boot more than one node at a time.

Watch for critical errors during the boot process.

After sshd has been started, you should be able to log in remotely. You should still check that the login prompt is reached to make sure that the boot process has completed.

Usually the cluster is configured to resume its activity automatically. Sometimes, however, you need to enable the clustered services manually or relocate them. First start rgmanager with service rgmanager start if it is not started by default. Then you can manage the clustered services with clusvcadm -e $servicename -m $clusternode. You can review the names of the cluster nodes with clustat.
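
When several services have to be enabled on specific nodes, it can help to write the plan down before executing it. The following is a minimal sketch that only prints the clusvcadm calls described above and does not run them. The function name plan_service_start and the service:node argument format are assumptions for illustration; the service and node names in the usage example are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print (not execute) one clusvcadm call per
# "service:node" pair so the plan can be reviewed before running it.
# Usage: plan_service_start service1:node1 service2:node2 ...
plan_service_start() {
    local pair service node
    for pair in "$@"; do
        service="${pair%%:*}"   # text before the first colon
        node="${pair#*:}"       # text after the first colon
        echo "clusvcadm -e $service -m $node"
    done
}
```

Calling plan_service_start web:node1 db:node2 prints one clusvcadm -e line per service. Compare the node names with the output of clustat, then run the printed commands by hand.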

Perform a full cluster health check as described in the section called “Incident Classification Checklist”.

