Glock Trimming Patch
email@example.com Jan. 25, 2007
Install: Test RPMs are available on request. For a quick test to see whether this patch solves your issue:
shell> umount /mnt/your_gfs_partition
shell> rmmod gfs
shell> insmod /this_new_ko/gfs.ko
shell> mount /mnt/your_gfs_partition
Tunable setup: There are two tunables to play around with:
After gfs.ko is loaded and the filesystem is mounted, issue:
shell> gfs_tool settune <mount_point> glock_purge <percentage> (e.g. "gfs_tool settune /mnt/gfs1 glock_purge 50")
This tells GFS to trim roughly 50% of unused glocks every 5 seconds. The default is 0 percent (no trimming). Trimming can be turned off dynamically by explicitly setting the percentage to 0.
This tunable is already in RHEL4 gfs.
shell> gfs_tool settune <mount_point> demote_secs <seconds> (e.g. "gfs_tool settune /mnt/gfs1 demote_secs 200")
This demotes GFS write locks into less restricted states and subsequently flushes the cached data to disk. A shorter demote_secs value can be used to keep GFS from accumulating so much cached data that it causes bursts of flushing activity or prolongs other nodes' lock access. The default is 300 seconds. This command can be issued dynamically but has to be done after mount time.
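If both tunables need to be applied from a script, the two commands above can be wrapped as follows. This is only a minimal sketch: the helper names `settune_argv` and `apply_tunables`, the `dry_run` switch, and the range checks are assumptions of the sketch, not part of gfs_tool. It builds exactly the `gfs_tool settune <mount_point> <tunable> <value>` command lines shown above.

```python
import subprocess

def settune_argv(mount_point, tunable, value):
    """Build the argv for: gfs_tool settune <mount_point> <tunable> <value>."""
    if tunable not in ("glock_purge", "demote_secs"):
        raise ValueError("this sketch only covers the two tunables in this note")
    value = int(value)
    # Assumption of this sketch: reject values that make no sense here.
    if tunable == "glock_purge" and not 0 <= value <= 100:
        raise ValueError("glock_purge is a percentage (0 disables trimming)")
    if tunable == "demote_secs" and value <= 0:
        raise ValueError("demote_secs must be a positive number of seconds")
    return ["gfs_tool", "settune", mount_point, tunable, str(value)]

def apply_tunables(mount_point, glock_purge=50, demote_secs=200, dry_run=True):
    """Return the commands; run them only when dry_run is False."""
    commands = [
        settune_argv(mount_point, "glock_purge", glock_purge),
        settune_argv(mount_point, "demote_secs", demote_secs),
    ]
    if not dry_run:
        for argv in commands:
            # Must be run after the filesystem is mounted.
            subprocess.run(argv, check=True)
    return commands
```

With the defaults, `apply_tunables("/mnt/gfs1")` only returns the two command lines used as examples above, without running anything.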
The following are some gory details if you care to read.
Other than relying on VM flush daemons and/or application-specific APIs or commands, GFS also flushes its data to storage during glock state transitions - that is, whenever an inode glock is moved from an exclusive state (write) into a less restricted state (e.g. shared state), the write data cached in memory is synced to disk based on a set of criteria. As the disk write operation is generally expensive, a few policies are implemented to retain the glocks in their current state as much as possible.
As reported via bugzilla 214239 (and several others), we've found that GFS needs to fine-tune its current retention policy to meet the requirements of latency-sensitive applications. Two particular issues we've found via the profiling data (collected from several customers' run-time environments) are:
- Glocks stay in exclusive state for so long that they end up causing burst-mode flushing activity (and other memory/IO issues), which can subsequently push file access times out of bounds for latency-sensitive applications.
- The system could easily spend half of its CPU cycles in lock hash search calls due to the large number of accumulated glocks (ref: "214239":https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=214239#c1).
We have been passing along a few VM tuning tips, together with a shorter (tunable) demote_secs, to customers and found that they greatly relieve the symptoms of problem #1 (glocks held in exclusive state for too long). Note that "demote_secs" is the time interval used by the existing glock scan daemon to move unused locks into less restricted states. This implies that on an idle system, all locks will eventually be moved into the "unlocked" state. Unfortunately, "unlocked" does not imply that these glocks will be removed from the system. They actually stay there forever until:
- the inode (file) is explicitly deleted (on-disk deletion), or
- the VM issues a prune_icache() call due to memory pressure, or
- the umount command kicks in, or
- the lock manager issues an LM_CB_DROPLOCKS callback.
When problem #2 (CPU cycles spent in lock hash searches) first popped up in the RHEL3 time frame, we naturally went through the four routes above to look for a solution. I forgot under what conditions the Lock Manager could issue the DROPLOCKS callback. However, in reality, the third and fourth routes (umount and the DROPLOCKS callback) share the very same exported VFS layer call to do their core job - that is, invalidate_inodes(). This VFS call walks through four (global VFS) inode lists to find the entries that belong to this particular filesystem. Each entry found is removed. The operation, interestingly, overlaps with the second route (the VM prune_icache() call). The difference is that prune_icache() scans only one list (inode_unused) and selectively purges inodes, instead of all of them.
As the in-memory inodes are purged, the GFS logic embedded in the inode deallocation code removes the corresponding glocks accordingly. Only then can the glock disappear.
So here came the original base kernel patch. As this was a latency issue, we didn't want to disturb the painstaking efforts of retaining the glocks done by GFS's original author(s). We ended up exporting a modified prune_icache() that could function like the invalidate_inodes() logic if asked. It walked through the inode_unused list to find entries matching the mount point, and purged a fixed percentage of inodes from that list if the entry belonged to the subject mount point. In short, we created a new call that had the logic needed for glock trimming without massively cutting and pasting the code segment from the existing prune_icache() base kernel call.
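The behavior of that modified call can be pictured with a small user-space simulation. This is a sketch only, not the kernel patch: the `Inode` class, the `prune_for_mount` name, and the round-down rule for the percentage are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Inode:
    ino: int
    mount: str  # which filesystem (mount point) the inode belongs to

def prune_for_mount(inode_unused, mount, percent):
    """Purge up to `percent` of the unused inodes that belong to `mount`.

    Returns (kept, purged). Inodes of other mounts are never touched.
    """
    matching = sum(1 for inode in inode_unused if inode.mount == mount)
    quota = matching * percent // 100  # assumption: round down
    kept, purged = [], []
    for inode in inode_unused:  # a single walk over the unused list
        if inode.mount == mount and len(purged) < quota:
            purged.append(inode)
        else:
            kept.append(inode)
    return kept, purged
```

With `percent=100` the sketch removes every unused inode of the given mount point; smaller values leave part of the cache (and therefore part of the glocks) in place.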
GFS already has a glock scan daemon that wakes up on a tunable interval to do glock demote work. It scans the glock hash table and examines the entries one by one. If the reference count and several other criteria meet the requirements, it demotes the lock into a less restricted state. Removable glocks are transferred onto a reclaim list, and another daemon (reclaimd) eventually purges them from the system. One of the criteria for identifying a removable glock is a zero inode reference count. Unfortunately, as long as a glock is tied to the VFS inode, the reference count never goes down unless the VFS inode is purged (and it never is unless the VM thinks it is under memory pressure).
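The following simulation illustrates why demoted glocks still pile up. It is a simplified sketch, not GFS code: the `Glock` fields, the `scan_pass` name, demoting straight to the unlocked state in one step, and using a `has_vfs_inode` flag in place of the real reference counting are all assumptions made for illustration.

```python
from dataclasses import dataclass

UNLOCKED, SHARED, EXCLUSIVE = "unlocked", "shared", "exclusive"

@dataclass
class Glock:
    name: str
    state: str
    last_used: float            # seconds, same clock as `now`
    holders: int = 0            # active holders of the lock
    has_vfs_inode: bool = True  # True while a VFS inode pins the glock

def scan_pass(table, now, demote_secs=300):
    """One pass of a scand-like daemon; returns glocks moved to the reclaim list."""
    reclaim = []
    for gl in list(table):  # iterate over a copy, since we remove entries
        if gl.holders:
            continue  # in use: leave it alone
        if gl.state != UNLOCKED and now - gl.last_used >= demote_secs:
            gl.state = UNLOCKED  # simplification: demote in a single step
        if gl.state == UNLOCKED and not gl.has_vfs_inode:
            table.remove(gl)  # removable: hand it over to reclaimd
            reclaim.append(gl)
    return reclaim
```

In this model an idle glock that is still pinned by a VFS inode is demoted but never reaches the reclaim list, which is the accumulation described above.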
For lock trimming purposes, it took several tries to get the gfs-only patch to work. The following is the logic that seems to work at this moment:
Each VFS inode is tied to a pair of glocks - an iopen glock (LM_TYPE_IOPEN) and an inode glock (LM_TYPE_INODE). The inode glock normally has frequent state transitions, depending on how and when the file is accessed (read, write, delete, etc.), but the iopen glock mostly stays in SHARED state during its life cycle until either:
- The GFS inode is removed (gfs_inode_destroy), or
- Some logic (that didn't exist before this patch) kicks off gfs_iopen_go_callback() to explicitly change its state (presumably by the Lock Manager).
Since these two glocks have been the major contributors to the glock accumulation issues, they are the glocks we target for trimming. Without disturbing the existing GFS code, we piggy-back the logic onto the gfs_scand daemon, which wakes up every 5 seconds to scan the glock hash table. If an iopen glock is found, we follow the pointer to obtain the inode glock state. If it is in the unlocked state, we demote the iopen glock (from shared to unlocked). This triggers the gfs_try_toss_vnode() logic to prune the associated dentries and subsequently delete the VFS inode. It then follows the very same purging logic as the base kernel approach. If the inode glock is found first (I haven't implemented this yet), we check its lock state. If it is unlocked, we follow the pointer to find its iopen lock and then demote it. That will then trigger the gfs_try_toss_vnode() logic, which generates the same sequence of clean-up events as described above.
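The iopen-first path described above can be sketched as a user-space simulation. Everything here is illustrative: the `Glock` class, the `partner` pointer, and the `link_pair` and `trim_pass` helpers are hypothetical names, and appending to `tossed` merely stands in for what gfs_try_toss_vnode() does in the real code.

```python
from dataclasses import dataclass
from typing import Optional

UNLOCKED, SHARED, EXCLUSIVE = "unlocked", "shared", "exclusive"
LM_TYPE_INODE, LM_TYPE_IOPEN = "inode", "iopen"

@dataclass
class Glock:
    number: int
    type: str
    state: str
    partner: Optional["Glock"] = None  # iopen glock -> its inode glock

def link_pair(number, inode_state):
    """Create the (inode glock, iopen glock) pair tied to one VFS inode."""
    inode_gl = Glock(number, LM_TYPE_INODE, inode_state)
    iopen_gl = Glock(number, LM_TYPE_IOPEN, SHARED, partner=inode_gl)
    return inode_gl, iopen_gl

def trim_pass(table):
    """Demote iopen glocks whose inode glock is unlocked.

    Returns the inode numbers whose VFS inode would be tossed.
    """
    tossed = []
    for gl in table:
        if gl.type != LM_TYPE_IOPEN or gl.state != SHARED:
            continue  # only shared iopen glocks are candidates
        if gl.partner is not None and gl.partner.state == UNLOCKED:
            gl.state = UNLOCKED       # shared -> unlocked
            tossed.append(gl.number)  # stand-in for gfs_try_toss_vnode()
    return tossed
```

A file whose inode glock is still held in shared or exclusive state keeps its iopen glock untouched, so only idle files are affected by a pass.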
- The current CVS check-in only looks for iopen glocks. We should add the inode-glock path described above to shorten the search process.
- There is another version of the patch that trims a lock if it has been in the idle (unlocked) state longer than a tunable timeout value. The CVS check-in is based on a tunable percentage count: the trimming action stops when either the max count is reached or we reach the end of the table.
- Now that glocks are trimmed (and the gfs lock dump shows the correct result), I'm not sure how the DLM side makes these locks disappear from its hash table (?).
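The two trimming policies mentioned in the notes above (percentage count versus idle timeout) differ only in their stopping rule. The sketch below contrasts them on a flat list standing in for the hash table. It is an illustration under assumptions: the function names, the list representation, and computing the max count as a rounded-down percentage of the trimmable entries are not taken from the patch.

```python
def trim_by_percentage(trimmable, purge_percent):
    """Percentage-count policy.

    `trimmable` holds one bool per table entry (True = idle, can be trimmed).
    Returns the indices trimmed in this pass.
    """
    max_count = sum(trimmable) * purge_percent // 100  # assumption: round down
    trimmed = []
    for idx, idle in enumerate(trimmable):
        if len(trimmed) >= max_count:
            break  # max count reached
        if idle:
            trimmed.append(idx)
    return trimmed  # otherwise the loop ends at the end of the table

def trim_by_timeout(idle_since, now, timeout):
    """Timeout policy.

    `idle_since` holds the time each entry became unlocked, or None if the
    entry is not idle. Returns the indices idle for longer than `timeout`.
    """
    return [
        idx
        for idx, since in enumerate(idle_since)
        if since is not None and now - since > timeout
    ]
```

In this model a percentage of 0 trims nothing, matching the "no trimming" default of glock_purge, while the timeout variant never touches a glock that only recently went idle.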