Blog on GFS

Thoughts on GFS

GFS Performance Tests
To reproduce a cluster freeze that seemed to occur only on clusters with huge amounts of memory, we implemented a stress test that works as follows (background: see below).

WORDS OF WARNING

These tests fill the filesystem with huge amounts of lock I/O and files. So don't do this on any filesystem that:

  1. Should not run full
  2. Is in use by production!!!

Background:

One tool (create-tree.py) creates a tree structure under a given directory. This script can be started on one or more nodes, in different directories.
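The real argument semantics of create-tree.py are not documented here, so the following is only a rough Python sketch of what such a tree-building tool does; the function name and parameters are illustrative:

```python
import os
import tempfile

def create_tree(base, depth, fanout, files_per_dir):
    """Recursively create a directory tree with small files in each
    directory, roughly what a tool like create-tree.py does. The real
    script's argument semantics differ; these parameters are illustrative."""
    if depth == 0:
        return
    for i in range(fanout):
        child = os.path.join(base, "child%d" % i)
        os.makedirs(child, exist_ok=True)
        for f in range(files_per_dir):
            # every file created costs GFS an inode glock plus an iopen glock
            with open(os.path.join(child, "file%d" % f), "w") as fh:
                fh.write("x")
        create_tree(child, depth - 1, fanout, files_per_dir)

# tiny demo run in a temporary directory (the real test runs on a GFS mount)
demo = tempfile.mkdtemp()
create_tree(demo, depth=2, fanout=2, files_per_dir=1)
```

Even a small depth and fanout multiply quickly; the point of the stress test is exactly this multiplication of inodes and hence glocks.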

After this has run for some time (30min-1h), we also start a huge number of finds on all the nodes not doing the creation. See below for details.

This scenario creates loads of locks with GFS. With RHEL4 U4 and RHEL4 U5 without lock purging we see loads of locks (some millions) and can reproduce an unstable cluster.

When using RHEL4 U5 (the latest version as of July 2007) and switching on lock purging, the lock count stays roughly constant (between 50k and 100k).

Prerequisites:

  1. create-tree.py, see createtree .
  2. A cluster with multiple nodes, at least two

Example:

The filesystem under /scratch is to be tested on the given cluster with four nodes. Node1 creates its tree in /scratch/perftest/subdir3 and node2 in /scratch/perftest/subdir4. On the two remaining nodes the finds are started.

Be sure to start these tests from a connection that cannot be terminated (e.g. a system console), or fork them into the background.

Time: T0

shell(node*)> gfs_tool settune /scratch glock_purge 50

Time: T1

shell(node1)>  python ~grimmmar/create-tree.py /scratch/perftest/subdir3 5 10000 10 1
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child1
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child2
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child3
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child4
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child5
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child6
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child7
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child8
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir3/child0/child0/child0/child0/child9
shell(node2)> python ~grimmmar/create-tree.py /scratch/perftest/subdir4 5 10000 10 1
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child0
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child1
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child2
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child3
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child4
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child5
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child6
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child7
DEBUG:comoonics_benchmarking.create-tree:child: /scratch/perftest/subdir4/child0/child0/child0/child0/child8

Time T1+60min:

shell(node3)> for i in $(seq 0 100); do (find /scratch/perftest -printf "%p %k %i %c %a\n" 2>&1 > /dev/null &); done 

Time T1+60min:

shell(node4)> for i in $(seq 0 100); do (find /scratch/perftest -printf "%p %k %i %c %a\n" 2>&1 > /dev/null &); done

Analysis

One way to check what the filesystem is doing is the command gfs_tool counters /scratch; look out for "locks" and "locks held". Those values should not grow too much. If they do, tune the filesystem with gfs_tool settune <mount_point> glock_purge <percentage>.
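A small watcher can automate this check. The sketch below assumes `gfs_tool counters` prints one counter per line, name first and a numeric value last; the exact layout varies between releases, so treat this as an illustration:

```python
def parse_counters(text):
    """Parse `gfs_tool counters <mountpoint>` output into a dict.
    Assumes one counter per line: a name (possibly multi-word)
    followed by a numeric value."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[-1].isdigit():
            counters[" ".join(parts[:-1])] = int(parts[-1])
    return counters

def locks_growing(samples, threshold=1000000):
    """Flag a run where the lock count climbs monotonically past a
    threshold - the symptom that calls for tuning glock_purge."""
    return samples[-1] > threshold and samples == sorted(samples)

# example: two snapshots taken a few minutes apart
snapshot = parse_counters("  locks 75000\n  locks held 1200\n")
```

Sampling the counters every few minutes and feeding the "locks" values into `locks_growing()` distinguishes the healthy steady state (50k-100k) from the runaway accumulation seen without purging.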

Glock Trimming Patch
Description of GLock Trimming Patch introduced in RHEL4 U5

wcheng@redhat.com Jan. 25, 2007

Installation and Run Script

Install: Test RPMs are available on request. For a quick test to see whether this patch solves your issue:

shell> umount /mnt/your_gfs_partition
shell> rmmod gfs
shell> insmod /this_new_ko/gfs.ko
shell> mount /mnt/your_gfs_partition

Tunable setup: There are two tunables to play around with:

  1. glock_purge

    After the gfs.ko is loaded and filesystem mounted, issue:

    shell> gfs_tool settune <mount_point> glock_purge <percentage>
    (e.g. "gfs_tool settune /mnt/gfs1 glock_purge 50")
    

    This tells GFS to trim roughly 50% of unused glocks every 5 seconds. The default is 0 percent (no trimming). The operation can be dynamically turned off by explicitly setting the percentage back to 0.

  2. demote_secs

    This tunable is already in RHEL4 gfs.

    shell> gfs_tool settune <mount_point> demote_secs <seconds>
    (e.g. "gfs_tool settune /mnt/gfs1 demote_secs 200")
    

    This demotes GFS write locks into less restricted states and subsequently flushes the cached data to disk. A shorter demote time can be used to keep GFS from accumulating so much cached data that flushing happens in bursts or another node's lock access is prolonged. The default is 300 seconds. This command can be issued dynamically, but only after mount time.

The following are some gory details, if you care to read on.

The Original Base Kernel Patch

Other than relying on VM flush daemons and/or application-specific APIs or commands, GFS also flushes its data to storage during glock state transitions - that is, whenever an inode glock is moved from an exclusive state (write) into a less restricted state (e.g. shared), the memory-cached write data is synced to disk based on a set of criteria. As disk writes are generally expensive, a few policies are implemented to retain glocks in their current state as much as possible.

As reported via bugzilla 214239 (and several others), we've found GFS needs to fine-tune its current retain policy to meet the requirements of latency-sensitive applications. Two particular issues we've found via profiling data (collected from several customers' runtime environments) are:

  • Glocks stay in exclusive state so long that flushing happens in burst mode (along with other memory/IO issues), which can push file access times out of bounds for latency-sensitive applications.
  • The system can easily spend half of its CPU cycles in lock hash search calls due to the large number of accumulated glocks (ref: "214239":https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=214239#c1).

We have been passing a few VM tuning tips, together with a shorter (tunable) demote_secs, to customers, and find they do relieve the problem#1 symptoms greatly. Note that demote_secs is the time interval used by the existing glock scan daemon to move unused locks into less restricted states. This implies that on an idle system all locks will eventually be moved into the "unlocked" state. Unfortunately, "unlocked" does not mean these glocks are removed from the system. They actually stay there forever until:

  1. the inode (file) is explicitly deleted (on-disk deletion), or
  2. VM issues prune_icache() call due to memory pressure, or
  3. Umount command kicks in, or
  4. Lock manager issues an LM_CB_DROPLOCKS callback.

When problem#2 first popped up in the RHEL3 time frame, we naturally went through the above four routes to look for a solution. I forget under what conditions the lock manager can issue the DROPLOCKS callback. In reality, however, (3) and (4) share one and the same exported VFS-layer call to do the core job: invalidate_inodes(). This VFS call walks through four (global VFS) inode lists to find the entries that belong to the particular filesystem; each entry found is removed. The operation, interestingly, overlaps with (2) (the VM prune_icache() call). The difference is that prune_icache() scans only one list (inode_unused) and selectively purges inodes, instead of all of them.

As the in-memory inodes are purged, the GFS logic embedded in the inode deallocation code removes the corresponding glocks accordingly. Only then can the glocks disappear.

So here came the original base kernel patch. Since this is a latency issue, we didn't want to disturb the painstaking glock-retention efforts of GFS's original author(s). We ended up exporting a modified prune_icache() that can function like the invalidate_inodes() logic if asked: it walks through the inode_unused list looking for entries belonging to the given mount point and purges a fixed percentage of them. In short, we created a new call with the logic needed for glock trimming without massively cut-and-pasting the code from the existing prune_icache base kernel call.
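The purge-a-percentage idea can be sketched in a few lines of Python (all names are illustrative; the real code lives in the kernel's inode cache):

```python
from collections import namedtuple

# a stand-in for a vfs inode: its number and the filesystem it belongs to
Inode = namedtuple("Inode", ["number", "sb"])

def prune_unused(inode_unused, sb, percent):
    """Walk the global unused-inode list and purge a fixed percentage
    of the entries belonging to one filesystem, mirroring the modified
    prune_icache() described above."""
    mine = [i for i in inode_unused if i.sb == sb]
    quota = len(mine) * percent // 100
    purged = set(mine[:quota])
    # purging an in-memory inode is what lets GFS drop the matching glocks
    inode_unused[:] = [i for i in inode_unused if i not in purged]
    return len(purged)
```

The key property, as in the kernel patch, is selectivity: only the subject mount point's inodes are touched, and only a tunable fraction of them per pass.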

GFS-only Patch

GFS already has a glock scan daemon that wakes up on a tunable interval to do glock demote work. It scans the glock hash table, examining the entries one by one. If the reference count and several other criteria meet the requirements, it demotes the lock into a less restricted state. Removable glocks are transferred to a reclaim list, and another daemon (reclaimd) eventually purges them from the system. One of the criteria identifying a removable glock is a zero inode reference count. Unfortunately, as long as the glock is tied to the vfs inode, the reference count never goes down unless the vfs inode is purged (and it never is, unless the VM thinks it is under memory pressure).

For lock-trimming purposes, it took several tries to get the GFS-only patch to work. The following is the logic that seems to work at this moment:

Each vfs inode is tied to a pair of glocks - an iopen glock (LM_TYPE_IOPEN) and an inode glock (LM_TYPE_INODE). The inode glock normally has frequent state transitions, depending on how and when the file is accessed (read, write, delete, etc.), but the iopen glock stays mostly in the SHARED state during its life cycle until either:

  1. The GFS inode is removed (gfs_inode_destroy), or
  2. Some logic (that didn't exist before this patch) kicks off gfs_iopen_go_callback() to explicitly change its state (presumably by the lock manager).

Since these two glocks have been the major contributors to the glock accumulation issues, they are the glocks we target for trimming. Without disturbing the existing GFS code, we piggy-back the logic onto the gfs_scand daemon, which wakes up every 5 seconds to scan the glock hash table. If an iopen glock is found, we follow the pointer to obtain the inode glock's state. If it is unlocked, we demote the iopen glock (from shared to unlocked). This triggers the gfs_try_toss_vnode() logic to prune the associated dentries and subsequently delete the vfs inode; it then follows the very same purging logic as the base kernel approach. If the inode glock is found first (I haven't implemented this yet), we check its lock state; if unlocked, we follow the pointer to find its iopen glock and demote it, which triggers the same gfs_try_toss_vnode() clean-up sequence as described above.
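The piggy-backed scan logic described above can be sketched like this (a pure-Python illustration of the idea, not kernel code):

```python
UNLOCKED, SHARED, EXCLUSIVE = "unlocked", "shared", "exclusive"
LM_TYPE_INODE, LM_TYPE_IOPEN = "inode", "iopen"

class Glock:
    def __init__(self, gtype, state, partner=None):
        # partner links an iopen glock to its paired inode glock
        self.type, self.state, self.partner = gtype, state, partner

def scan_and_demote(hash_table):
    """For every iopen glock whose paired inode glock is already
    unlocked, demote the iopen glock from shared to unlocked so the
    vfs inode (and with it both glocks) can be purged."""
    demoted = 0
    for gl in hash_table:
        if gl.type == LM_TYPE_IOPEN and gl.state == SHARED:
            inode_gl = gl.partner
            if inode_gl is not None and inode_gl.state == UNLOCKED:
                gl.state = UNLOCKED  # in GFS this triggers gfs_try_toss_vnode()
                demoted += 1
    return demoted
```

Pairs whose inode glock is still in use (shared or exclusive) are left alone, which is what keeps the trimming from disturbing the retention policy for active files.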

A few to-do items

  1. The current CVS check-in only looks for iopen glocks. We should add the inode-glock path described above to shorten the search process.
  2. Have another version of the patch that trims a lock if it has been idle (unlocked) longer than a tunable timeout. The current CVS check-in is based on a tunable percentage count; trimming stops when either the max count is reached or we reach the end of the table.
  3. Glocks are now trimmed (and the gfs lock dump shows the correct result) - I'm not sure how the DLM side makes these locks disappear from its hash table (?).
GLock Description
What are the glocks acquired by GFS and what information do they provide?

Motivation:

What are the glocks acquired by GFS and what information do they provide?

From gfs/incore.h, base gfs-kernel/src:

Glock Structure

One for each inter-node lock held by this node.

A glock is a local representation/abstraction of an inter-node lock. Inter-node locks are managed by a "lock module" (LM) which plugs in to the lock harness / glock interface (see gfs-kernel/harness). Different lock modules support different lock protocols (e.g. GULM, GDLM, no_lock). A glock may have one or more holders within a node. See gfs_holder above. Glocks are managed within a hash table hosted by the in-core superblock. After all holders have released a glock, it will stay in the hash table cache for a time (depending on lock type), during which the inter-node lock will not be released unless another node needs the lock (lock manager requests this via callback to GFS through LM on this node). This provides better performance in case this node needs the glock again soon.

See comments for meta_go_demote_ok, glops.c.

Each glock has an associated vector of lock-type-specific glops functions which are called at important times during the life of a glock, and which define the type of lock (e.g. dinode, rgrp, non-disk, etc). See gfs_glock_operations above. A glock, at inter-node scope, is identified by the following dimensions:

  1. lock number (usually a block # for on-disk protected entities, or a fixed assigned number for non-disk locks, e.g. MOUNT).
  2. lock type (actually, the type of entity protected by the lock).
  3. lock namespace, to support multiple GFS filesystems simultaneously. Namespace (usually cluster:filesystem) is specified when mounting.

See man page for gfs_mount.
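The three dimensions above make a natural hashable key for looking a glock up cluster-wide; a minimal sketch (field names illustrative):

```python
def lock_name(namespace, lock_type, number):
    """A glock's cluster-wide identity: (namespace, lock type, lock
    number), mirroring the three dimensions described above."""
    return (namespace, lock_type, number)

# two nodes referring to block 4711 of the same filesystem agree on the key
a = lock_name("mycluster:gfs1", "inode", 4711)
b = lock_name("mycluster:gfs1", "inode", 4711)
```

Because the key is value-based, any two nodes computing it for the same protected entity arrive at the same inter-node lock, while the same block number in another filesystem's namespace yields a different lock.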

Glocks require support of Lock Value Blocks (LVBs) by the inter-node lock manager. LVBs are small (32-byte) chunks of data associated with a given lock, that can be quickly shared between cluster nodes. Used for certain purposes such as sharing an rgroup's block usage statistics without requiring the overhead of:

  • sync-to-disk by one node, then a
  • read from disk by another node.

Type definition

struct gfs_glock {
      struct list_head gl_list;    /* Link to hb_list in one of superblock's
                                    * sd_gl_hash glock hash table buckets */
      unsigned long gl_flags;      /* GLF_... see above */
      struct lm_lockname gl_name;  /* Lock number and lock type */
      atomic_t gl_count;           /* Usage count */
      spinlock_t gl_spin;          /* Protects some members of this struct */
      /* Lock state reflects inter-node manager's lock state */
      unsigned int gl_state;       /* LM_ST_... see harness/lm_interface.h */
      /* Lists of gfs_holders */
      struct list_head gl_holders;  /* all current holders of the glock */
      struct list_head gl_waiters1; /* HIF_MUTEX */
      struct list_head gl_waiters2; /* HIF_DEMOTE, HIF_GREEDY */
      struct list_head gl_waiters3; /* HIF_PROMOTE */
      struct gfs_glock_operations *gl_ops; /* function vector, defines type */
      /* State to remember for async lock requests */
      struct gfs_holder *gl_req_gh; /* Holder for request being serviced */
      gfs_glop_bh_t gl_req_bh;  /* The bottom half to execute */
      lm_lock_t *gl_lock;       /* Lock module's private lock data */
      char *gl_lvb;             /* Lock Value Block */
      atomic_t gl_lvb_count;    /* LVB recursive usage (hold/unhold) count */
      uint64_t gl_vn;           /* Incremented when protected data changes */
      unsigned long gl_stamp;   /* Glock cache retention timer */
      void *gl_object;          /* The protected entity (e.g. a dinode) */
      /* Incore transaction stuff */
      /* Log elements map us to a particular set of log operations functions,
         and to a particular transaction */
      struct gfs_log_element gl_new_le;     /* New, incomplete transaction */
      struct gfs_log_element gl_incore_le;  /* Complete (committed) trans */ 
      struct gfs_gl_hash_bucket *gl_bucket; /* Our bucket in sd_gl_hash */
      struct list_head gl_reclaim;          /* Link to sd_reclaim_list */
      struct gfs_sbd *gl_sbd;               /* Superblock (FS instance) */
      struct inode *gl_aspace;              /* The buffers protected by this lock */
      struct list_head gl_ail_bufs;         /* AIL buffers protected by us */
};

gl_flags (harness/lm_interface.h):

  • GLF_PLUG(0): Dummy
  • GLF_LOCK(1): Exclusive (local) access to glock structure
  • GLF_STICKY(2): Don't release this inter-node lock unless another node explicitly asks
  • GLF_PREFETCH(3): This lock has been (speculatively) prefetched, demote if not used soon
  • GLF_SYNC(4): Sync lock's protected data as soon as there are no more holders
  • GLF_DIRTY(5): There is dirty data for this lock, sync before releasing inter-node
  • GLF_SKIP_WAITERS2(6): Make run_queue() ignore gl_waiters2 (demote/greedy) holders
  • GLF_GREEDY(7): This lock is ignoring callbacks (requests from other nodes) for now

Lock types, from gl_name (harness/lm_interface.h) - note that gl_state itself holds the LM_ST_... states:

  • LM_TYPE_RESERVED(0x00)
  • LM_TYPE_NONDISK(0x01): Non-disk cluster-wide, e.g. TRANS
  • LM_TYPE_INODE(0x02): Inode, e.g. files
  • LM_TYPE_RGRP(0x03): Resource Group (block allocation)
  • LM_TYPE_META(0x04): Metadata, e.g. superblock, journals
  • LM_TYPE_IOPEN(0x05): the "iopen" lock paired with each inode (not documented in the header)
  • LM_TYPE_FLOCK(0x06): Linux file lock
  • LM_TYPE_PLOCK(0x07): POSIX file lock
  • LM_TYPE_QUOTA(0x08): User or group block usage quota

gfs_holder (gfs/incore.h):

Glock holder structure

One for each holder of a glock. These coordinate the use, within this node, of an acquired inter-node glock. Once a node has acquired a glock, it may be shared within that node by several processes, or even by several recursive requests from the same process. Each is a separate "holder". Different holders may co-exist having requested different lock states, as long as the node holds the lock in a state that is compatible. A hold requestor may select, via flags, the rules by which sharing within the node is granted:

  • LM_FLAG_ANY: Grant if glock state is any other than UNLOCKED.
  • GL_EXACT: Grant only if glock state is exactly the requested state.
  • GL_LOCAL_EXCL: Grant only one holder at a time within this node.

With no flags, a hold will be granted to a SHARED request even if the node holds the glock in EXCLUSIVE mode. See relaxed_state_ok(). When a process needs to manipulate a lock, it requests it via one of these holder structures. If the request cannot be satisfied immediately, the holder structure gets queued on one of these lists in gfs_glock:

  1. waiters1, for gaining exclusive access to the (local) glock structure.
  2. waiters2, for demoting a lock (unlocking a glock, or changing its state to be less restrictive) or relinquishing "greedy" status.
  3. waiters3, for promoting (locking a new glock, or changing a glock state to be more restrictive).

While holding a lock, the gfs_holder struct stays on the glock's holder list. See gfs-kernel/src/harness/lm_interface.h for the gh_state (LM_ST_...) and gh_flags (LM_FLAG_...) fields. Also see glock.h for the gh_flags (GL_...) flags.
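The sharing rules above can be sketched as a small predicate (an illustration of the idea behind relaxed_state_ok(), not the kernel's exact logic; GL_LOCAL_EXCL is omitted because it depends on the current holder count, not the lock state):

```python
UNLOCKED, SHARED, EXCLUSIVE = "unlocked", "shared", "exclusive"
LM_FLAG_ANY, GL_EXACT = "any", "exact"

def hold_ok(held, requested, flags=()):
    """Decide whether a hold request can be granted, given the state
    this node already holds the glock in."""
    if LM_FLAG_ANY in flags:
        return held != UNLOCKED          # anything but unlocked will do
    if GL_EXACT in flags:
        return held == requested         # exact state match required
    # default: an EXCLUSIVE holding also satisfies a SHARED request
    return held == requested or (held == EXCLUSIVE and requested == SHARED)
```

The default branch captures the point made above: a node holding the glock EXCLUSIVE can hand out SHARED holds locally without any inter-node traffic.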

Structure:

struct gfs_holder {
         struct list_head gh_list;      /* Link to one of glock's holder lists */
         struct gfs_glock *gh_gl;       /* Glock that we're holding */
         struct task_struct *gh_owner;  /* Linux process that is the holder */
         /* request to change lock state */
         unsigned int gh_state;         /* LM_ST_... requested lock state */
         int gh_flags;                  /* GL_... or LM_FLAG_... req modifiers */
         int gh_error;                  /* GLR_... CANCELLED/TRYFAILED/-errno */
         unsigned long gh_iflags;       /* HIF_... holder state, see above */
         struct completion gh_wait;     /* Wait for completion of ... */
};

Action requests

  • HIF_MUTEX(0): Exclusive (local) access to glock struct
  • HIF_PROMOTE(1): Change lock to more restrictive state
  • HIF_DEMOTE(2): Change lock to less restrictive state
  • HIF_GREEDY(3): Wait for the glock to be unlocked

States:

  • HIF_ALLOCED(4): Holder structure is or was in use
  • HIF_DEALLOC(5): Toss holder struct as soon as queued request is satisfied
  • HIF_HOLDER(6): We have been granted a hold on the lock
  • HIF_FIRST(7): We are first holder to get the lock
  • HIF_RECURSE(8): >1 hold requests on same glock by same process
  • HIF_ABORTED(9): Aborted before being submitted
gfs mountoptions
A list of all mount options interpreted by GFS itself.

Mount options of GFS (RHEL4/U5), from gfs-kernel/ioctl.c:

  • version
  • lockproto: args->ar_lockproto
  • locktable: args->ar_locktable
  • hostdata: args->ar_hostdata
  • ignore_local_fs: args->ar_ignore_local_fs
  • localcaching: args->ar_localcaching
  • localflocks: args->ar_localflocks
  • oopses_ok: args->ar_oopses_ok
  • upgrade: args->ar_upgrade
  • num_glockd: args->ar_num_glockd
  • posix_acls: args->ar_posix_acls
  • suiddir: args->ar_suiddir
gfs_scand
What does gfs_scand do, and why does it cause such heavy load?

Does anybody know what exactly the task of gfs_scand is? We often see it using a lot of CPU time (e.g. the system is up for 40h and gfs_scand has 4h of CPU time).

This is a complicated subject. So please bear with me and see whether the following description helps:

Gfs_scand scans GFS locks (glock) hash table to find:

  1. whether a glock can be downgraded into a less restricted state (say from shared to unlocked) - dirty data flushing is embedded in the glock transition code;
  2. whether a glock has been idle in the unlocked state for too long, in which case it is reclaimed.

Whenever GFS needs a lock, it creates a glock and subsequently asks the lock manager for a corresponding lock. In the DLM case, there is a one-to-one correspondence between glocks and DLM locks.

Now if gfs_scand has used too much CPU time, it may mean the system has accumulated too many locks, as described in: readme.gfs_glock_trimming.R4

Unfortunately the lock trimming patch added in RHEL 4.5 is too mild (i.e. not aggressive enough, see Red Hat bugzilla 245776). We'll try to correct the issue as soon as the next errata is available. In short, if the daemon hogs too much CPU time without any sign of slowing down whenever it wakes up, you can try to make it run less often with:

    shell> gfs_tool settune <mount_point> scand_secs <x>
          # the default x is 5 seconds

The side effect of a longer scand_secs is that if you have a large amount of file write and/or delete activity, dirty data will stay in the buffer cache longer and the lock count will go up considerably.

And can you track down which scand is responsible for what filesystem?

BTW: I'm talking about RHEL4U4.

Answer (Wendy Cheng):

Relevant Bugzilla:

Red Hat Advisories:

GFS Daemons and gfs_scand
What GFS daemons are running, and what does gfs_scand do?

gfs_scand

Looks for cached glocks and inodes to toss from memory; see gfs_glockd().

gfs_scand_internal - Look for glocks and inodes to toss from memory

sdp: the filesystem

Invokes scan_glock() for each glock in each cache bucket. Steps of reclaiming a glock:

  • scan_glock() places eligible glocks on filesystem's reclaim list.
  • gfs_reclaim_glock() processes list members, attaches demotion requests to wait queues of glocks still locked at inter-node scope.
  • Demote to UNLOCKED state (if not already unlocked).
  • gfs_reclaim_lock() cleans up glock structure.
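
The reclaim steps above can be sketched as a two-stage pipeline (illustrative Python, not kernel code):

```python
UNLOCKED = "unlocked"

def scan_glocks(buckets, reclaim_list):
    """Step 1: like scan_glock(), move glocks with no holders onto the
    filesystem's reclaim list."""
    for bucket in buckets:
        for gl in bucket:
            if not gl["holders"]:
                reclaim_list.append(gl)

def reclaim(reclaim_list):
    """Steps 2-4: like gfs_reclaim_glock(), demote each listed glock to
    UNLOCKED and drop its structure."""
    freed = 0
    while reclaim_list:
        gl = reclaim_list.pop()
        gl["state"] = UNLOCKED  # demote if still locked at inter-node scope
        freed += 1
    return freed
```

Splitting scanning from reclaiming is what lets gfs_scand stay cheap per pass while one or more gfs_glockd daemons drain the reclaim list concurrently.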

gfs_glockd

Reclaims unused glock structures.

sdp: Pointer to GFS superblock

One or more of these daemons run, reclaiming glocks on sd_reclaim_list. sd_glockd_num says how many daemons are currently running. The number of daemons can be set by the user with the num_glockd mount option. See gfs_scand().

gfs_recoverd

Recover dead machine's journals

sdp: Pointer to GFS superblock

gfs_logd

Update log tail as Active Items get flushed to in-place blocks

sdp: Pointer to GFS superblock

Also, periodically check to make sure that we're using the most recent journal index.

gfs_quotad

Write cached quota changes into the quota file

sdp: Pointer to GFS superblock

gfs_inoded

Deallocate unlinked inodes

sdp: Pointer to GFS superblock
