Description of the open sharedroot Boot Process
One of the crucial parts of a diskless cluster is the boot sequence, or boot process. This document therefore describes all steps involved in booting a server until the normal init process is executed. It starts with the relevant start sequence of the BIOS (see the section called “The BIOS boot sequence and SANbased diskless servers”), which hands over to the boot loader GRUB (see the section called “The GRUB boot loader and open sharedroot cluster bootimage”), which in turn gives control to the “Linux Kernel” and the “Initial RAMDisk” (see the section called “The Linux Kernel and InitRD and the open sharedroot cluster bootimage”).
Table of Contents
- The BIOS boot sequence and SANbased diskless servers
- The GRUB boot loader and open sharedroot cluster bootimage
- The Linux Kernel and InitRD and the open sharedroot cluster bootimage
- The Comoonics open sharedroot initscripts
The system BIOS, located in an EPROM chip on the motherboard, is what starts the computer running when you turn it on. The following are the steps of a typical boot sequence when configuring diskless clients based on shared storage via SCSI. Of course this may vary with the manufacturer of your hardware, the BIOS, and especially the peripherals in the machine. Here is what generally happens when you press the power button of your Intel-based server.
Shortly after the power button has switched on the power supply, the processor starts up. When the processor first starts up, it is suffering from amnesia; there is nothing at all in the memory to execute. Of course processor makers know this will happen, so they pre-program the processor to always look at the same place in the system BIOS ROM for the start of the BIOS boot program. This is normally location “FFFF0h”, right at the end of the first megabyte of the address space. They put it there so that the size of the ROM can be changed without creating compatibility problems. Since there are only 16 bytes left from there to the end of that first megabyte, this location just contains a "jump" instruction telling the processor where to go to find the real BIOS startup program.
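The "16 bytes left" figure can be checked with a little arithmetic. The following is only a worked example, assuming the classic 1 MiB real-mode address space; the constant names are made up for illustration.

```python
# Assumption: 20-bit real-mode addressing, so the address space ends at 1 MiB.
ADDRESS_SPACE_END = 0x100000  # first address past the 1 MiB range
RESET_VECTOR = 0xFFFF0        # location named in the text above

# Room between the reset vector and the end of the address space.
bytes_left = ADDRESS_SPACE_END - RESET_VECTOR
print(f"reset vector {RESET_VECTOR:#x}, bytes left: {bytes_left}")
```

That is too little room for a startup program, which is why only a jump instruction is stored there.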
The BIOS performs the power-on self test (POST). If there are any fatal errors, the boot process stops.
After initializing the graphics card the BIOS looks for other devices' ROMs to see if any of them have BIOSes. Normally, the IDE/ATA hard disk BIOS will be found at C8000h and executed. If any other device BIOSes are found, they are executed as well.
Off-chipset controllers, like SCSI cards, PCI IDE cards or their on-board counterparts, are searched afterwards. Within a controller, the hard disks are searched in an order determined by the controller's own BIOS. This order may be altered through that BIOS's setup utility. For example, some BIOSes let you set the boot order to "SCSI, C, A", which actually means that the search starts on the first SCSI controller found (confusingly, this drive becomes disk 1, and then C:). Other BIOSes have clearer terminology for pretty much the same thing. Afterwards, the BIOS enumerates all the hard disks, starting with the chosen boot drive and continuing with the rest of them in the order described above. This means that to the BIOS, the hard disks become "disk 1", "disk 2"... (the actual numbers are 0x80, 0x81...).
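The numbering rule described above can be sketched as follows. This is an illustration only: the function name and the disk labels are hypothetical, and a real BIOS applies its own vendor-specific ordering.

```python
def bios_disk_numbers(disks, boot_disk):
    """Map BIOS drive numbers (0x80, 0x81, ...) to disks.

    The chosen boot disk comes first; the remaining disks keep
    their detection order.
    """
    ordered = [boot_disk] + [d for d in disks if d != boot_disk]
    return {0x80 + index: disk for index, disk in enumerate(ordered)}


# Hypothetical machine: one IDE disk and two SCSI disks, booting from SCSI.
numbering = bios_disk_numbers(["ide0", "scsi0", "scsi1"], boot_disk="scsi0")
for number, disk in numbering.items():
    print(f"{number:#x} -> {disk}")
```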
If the BIOS finds what it is looking for, it starts the process of booting the operating system, using the information in the boot sector. The BIOS basically reads the first sector of the disk, the so-called “Master Boot Record” (MBR, see [WPMBR]), and executes the “code area” in that sector. At this point, the code in the boot sector, the “boot loader”, takes over from the BIOS.
GNU GRUB (“GRand Unified Bootloader”) is a program that installs a boot loader to the MBR. It allows you to place specific instructions in the MBR that load a GRUB menu or command environment, permitting you to start the operating system of your choice, pass special instructions to kernels when they boot, or discover system parameters (such as available RAM) before booting.
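To make the structure of that first sector concrete, here is a minimal sketch that splits a 512-byte sector into its parts. It assumes the classic MBR layout (partition table at offset 446, boot signature in the last two bytes); the function name is made up and the sample sector is synthetic, not read from a disk.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446  # assumed classic layout: four 16-byte entries
SIGNATURE_OFFSET = 510        # last two bytes hold the boot signature


def split_mbr(sector: bytes):
    """Return (code_area, partition_table) of a classic MBR sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly one 512-byte sector")
    # The bytes 0x55 0xAA read as a little-endian 16-bit value give 0xAA55.
    (signature,) = struct.unpack_from("<H", sector, SIGNATURE_OFFSET)
    if signature != 0xAA55:
        raise ValueError("missing 0x55 0xAA boot signature")
    code_area = sector[:PARTITION_TABLE_OFFSET]
    partition_table = sector[PARTITION_TABLE_OFFSET:SIGNATURE_OFFSET]
    return code_area, partition_table


# Synthetic, empty sector that carries only the boot signature.
sample = bytes(510) + b"\x55\xaa"
code_area, partition_table = split_mbr(sample)
print(len(code_area), len(partition_table))
```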
The process of loading GRUB, and then the operating system, involves several stages:
- 1. Loading the primary boot loader, commonly called Stage 1.
The primary boot loader must fit into the very small space allocated for the MBR, which is less than 512 bytes (Address “0x0000-0x018A”). Because there is not enough space in the MBR for anything else, the only thing the primary boot loader does is load the secondary boot loader.
- 2. Loading the secondary boot loader, commonly called Stage 2.
The secondary boot loader actually brings up the advanced functionality that allows you to load a specific operating system. With GRUB, this is the code that allows you to display a menu or type commands.
- 3. Loading the operating system, such as the Linux kernel, on a specified partition.
Once GRUB has received the correct instructions for the operating system to start, either from its command line or configuration file, it finds the necessary boot file and hands off control of the machine to that operating system.
Some filesystems, as well as filesystem configurations, may require a Stage 1.5 file that essentially bridges the gap between the primary and secondary boot loaders.
For example, if your Stage 2 boot loader file is on a partition using a filesystem that the Stage 1 boot loader cannot access, it is possible to direct the Stage 1 boot loader to load additional instructions from the Stage 1.5 file that allows it to read the Stage 2 boot loader file. For more information, consult the GRUB info pages.
Everything needed for booting is found on the so-called boot partition, from which GRUB reads, among other things, the kernel. GRUB therefore has to understand the filesystem of the boot partition, which is most often EXT2/3; other filesystems such as “iso9660, xfs, fat ...” are supported as well. All GRUB-relevant files reside in the directory “grub”, called “GRUB's root filesystem”. The files for the different stages can also be found there; they are copied in binary form to the specified regions on the disk by grub. The configuration files found in the directory “grub” are “grub.conf” and “devices.lst”.
Next, the kernel command is executed with the location of the kernel file as an option. Once the Linux kernel boots, it sets up the root file system that Linux users are familiar with. The original GRUB root file system and its mounts are forgotten; they only existed to boot the kernel file.
Boot parameters can be passed to GRUB either in its configuration file grub/grub.conf or by editing them in the GRUB user interface shown at boot time. For a description of the user interface and the boot parameters take a look at http://www.gnu.org/software/grub/manual/html_node/index.html.
The “kernel” boot parameter of GRUB allows you to specify a string that is passed directly to the booted kernel. Under Linux this string can be found in the file /proc/cmdline. Most of the parameters accepted by kernel modules and the kernel itself are specified in the kernel source. The file Documentation/kernel-parameters.txt gives a list of most parameters.
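As an illustration of how such a string can be turned into parameters, the sketch below splits a command line into a dictionary. It is not the comoonics implementation: the function name is hypothetical, and the sample string (including the device path and the way com-debug and rootsource are written) is an assumption made up for this example. On a running system the string would be read from /proc/cmdline.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Parameters without "=" are treated as switches and mapped to True.
    """
    params = {}
    for token in cmdline.split():
        key, separator, value = token.partition("=")
        params[key] = value if separator else True
    return params


# Hypothetical command line; real values depend on the installation.
sample = "ro root=/dev/vg_example/lv_root com-debug rootsource=scsi"
params = parse_cmdline(sample)
print(params)
```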
Because the initial ramdisk of the comoonics open sharedroot bootimage performs a fairly complex task, several boot parameters influence how it boots. These parameters are listed here.
Switches on the debug mode. Every command and its output is echoed on the console and to the syslog server. Default is unset.
Switches on the step mode. Execution halts at predefined breakpoints and waits for user input. At a breakpoint, “q” quits the process and drops the user into a shell, “c” switches step mode off, “b” forks a new bash that can be left with “exit”, after which the boot process proceeds, and any other key steps over the breakpoint.
These options are fully supported only if a “comoonics-bootimage” version greater than “1.0-75” is used. Default is unset.
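The key handling at a breakpoint can be summarised as a small dispatch function. This is only a sketch of the behaviour described above; the function name and the returned action labels are invented for illustration and do not come from the bootimage scripts.

```python
def breakpoint_action(key: str) -> str:
    """Return the action step mode takes for a key pressed at a breakpoint."""
    if key == "q":
        return "quit-to-shell"        # leave the boot process, end up in a shell
    if key == "c":
        return "continue-unstepped"   # switch step mode off and carry on
    if key == "b":
        return "fork-bash"            # new bash; boot proceeds after "exit"
    return "step-over"                # any other key


for key in ("q", "c", "b", "x"):
    print(key, "->", breakpoint_action(key))
```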
The “scsifailover” option switches between SCSI failover and multipathing within the driver (the default setting) or in the “devicemapper” (“mapper”).
Whichever of the two is requested for multipathing, the driver also needs to support it.
Additional options used for mounting the root filesystem. Default is (defaults).
Specifies the source the root filesystem comes from. Until now only the default, “scsi”, is supported; “iscsi”, “gnbd” and “nfs” will be added.
Overrides the root device to mount from. This can be either a valid device or, in the future, some kind of URL such as “iscsi://sourceserver/export”. Default is unset.
Disables the manual acknowledgement of a cluster booted without having quorum.
When the kernel is executed by the boot loader, it loads its image into memory and starts executing. The Linux kernel is a monolithic kernel but also supports loading modules, so modern kernels are mostly kept small and everything else, such as drivers for special features, is loaded as modules. Normally the Unix kernel's responsibility is to mount the root filesystem and to execute the “init” process. If modules are needed to mount the root filesystem, however, a special concept comes into play: the so-called “initial ramdisk”. If an initial ramdisk is given, the kernel loads that image into memory and uncompresses it if need be. The initial ramdisk can be either a loop filesystem or a cpio image. The cpio image is used most often with modern kernels, so it is silently assumed here, although nearly everything also applies to loop-filesystem initial ramdisks. As already said, the kernel unpacks and mounts the initrd. Next it searches for the file /linuxrc and normally starts it with pid 1. If the initrd exits, which is often the case, the kernel tries to mount the given filesystem (parameter “root”) and executes /sbin/init on this filesystem (again as pid 1). If the initrd does not exit, which is the case in an open sharedroot cluster, the initrd itself takes control of mounting the root filesystem and executing /sbin/init as pid 1. Both ways end up with the same result.
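The two paths described above differ only in who mounts the root filesystem and starts /sbin/init. A deliberately tiny sketch of that decision, with a made-up function name, looks like this:

```python
def init_starter(linuxrc_exits: bool) -> str:
    """Return which component ends up starting /sbin/init as pid 1."""
    if linuxrc_exits:
        # /linuxrc returned: the kernel mounts the filesystem named by
        # the "root" parameter and executes /sbin/init itself.
        return "kernel"
    # /linuxrc never returns (the open sharedroot case): the initrd
    # mounts the root filesystem and starts /sbin/init itself.
    return "initrd"


print(init_starter(True), init_starter(False))
```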
The kernel, and implicitly the “initrd”, is responsible for preparing or mounting the root filesystem and executing the “init” process. In “open sharedroot” clusters this task is quite complex, so it is described in more detail in the next section.
A short overview of the sequence of steps done in the “comoonics bootimage” is shown in
Figure1, “The open sharedroot initrd process”. Everything starts when the kernel executes the file
First, all parameters given by GRUB to the kernel are read from /proc/cmdline and stored in their counterpart variables. Global variables that affect the whole run, such as com-debug, are set up.
Next the hardware detection takes place. This needs to happen because nodes can have different hardware in terms of network interfaces and SCSI controllers.
After the relevant hardware has been detected, the corresponding modules are loaded.
Now all storage devices including the “device mapper” and LVM are set up correctly.
The network interfaces also need to be up and running, so their modules are loaded first.
At this stage all requirements to build up the cluster are met, the cluster configuration is read and all devices are set up according to it. This means the cluster-relevant network interfaces are set up and the network is tested for functionality.
If need be, in case the “rootsource” comes from the network, storage devices are imported from the network and set up correctly.
Next all services and modules relevant to the cluster and the cluster filesystem are loaded and started. In case of “GFS” and “Redhat Clustersuite” this means the “Cluster Configuration Daemon” has to be started first. Then this node can join or build a new cluster and join or build a new “fence domain”. This example uses “GFS” and “Redhat Cluster suite”, but any other cluster implementation follows the same steps.
At this point the cluster is set up correctly and the node can mount the root filesystem and prepare some cluster-dependent settings like “cluster dependent symbolic links” and the like.
Again there might be some cluster-dependent settings or steps which will be done at this stage. In case of “GFS” and “Redhat Clustersuite” we first need to build a “tmp filesystem” change root environment where the CCSD and “Fenced” will run. With GFS there are some services which cannot run on GFS as root filesystem; because of this they need their own root, which is not part of the cluster filesystem. Both CCSD and “Fenced”, or more generally all cluster-relevant services, will now be restarted in the newly built change root.
Last but not least, some configuration files are updated and the init process is executed in place of the running “/linuxrc” process, so that init gets pid 1. At this point the Linux init process takes over.
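The order of the steps described in this section can be written down as a simple list with a runner that walks through it. This is a sketch of the ordering only, not the bootimage code: the stage names paraphrase the text, and the runner and its callback are hypothetical.

```python
# Stage names paraphrase the steps described above; they are not script names.
STAGES = [
    "read boot parameters from /proc/cmdline",
    "detect hardware",
    "load modules for detected hardware",
    "set up storage devices (device mapper, LVM)",
    "bring up network interfaces",
    "read cluster configuration and set up cluster devices",
    "import storage from the network if required",
    "start cluster services, join cluster and fence domain",
    "mount root filesystem and apply cluster-dependent settings",
    "build change root and restart cluster services in it",
    "update configuration files and execute init",
]


def run_stages(stages, on_stage=None):
    """Walk through the stages in order, reporting each one to a callback."""
    completed = []
    for number, stage in enumerate(stages, start=1):
        if on_stage is not None:
            on_stage(number, stage)
        completed.append(stage)
    return completed


completed = run_stages(STAGES, on_stage=lambda n, s: print(f"{n:2d}. {s}"))
```

A callback like `on_stage` is also a natural place for the breakpoints that step mode relies on.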
In order to run an “open sharedroot” cluster properly, some init processes need to be started. The first and most important dependency is probably a new changeroot for the CCSD, “fenced” and optionally the “fenceackserver”. This changeroot can be built on any directory specified in /etc/sysconfig/cluster in the variable FENCED_CHROOT. If this is a directory on a local filesystem, which is highly recommended, all files are copied there; if not, a loop filesystem is mounted under this directory (these services must not run on GFS). The changeroot is built by the initscript
This script basically starts the CCSD in the changeroot specified in
/etc/init.d/ccsd is disabled.
This script starts the “fenced” in the changeroot specified in
/etc/init.d/fenced is disabled.
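Since both scripts depend on the FENCED_CHROOT variable from /etc/sysconfig/cluster, the following sketch shows one way such a shell-style variable file could be read. It is an assumption-laden illustration: the function name is made up, the sample content and its path are hypothetical, and the real initscripts source the file from the shell rather than parse it.

```python
import shlex


def read_sysconfig(text: str) -> dict:
    """Parse simple NAME=value lines of a shell-style configuration file."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex removes surrounding quotes and trailing comments.
        parts = shlex.split(raw, comments=True)
        values[key.strip()] = parts[0] if parts else ""
    return values


# Hypothetical file content; the directory is an example, not a default.
sample = '# chroot for ccsd and fenced\nFENCED_CHROOT="/var/example/chroot"\n'
config = read_sysconfig(sample)
print(config["FENCED_CHROOT"])
```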
The “fenceacksv” is a server program that runs in the changeroot. It is a maintenance tool for cases when the root filesystem is frozen but remote access to the still living parts of the system is needed.
With it, either the services running in that changeroot can be restarted or manual fencing that is in progress can be manually acknowledged.
The service runs on port “12242”, with or without SSL, and is configured via /etc/cluster/cluster.conf. For more information see the appropriate