[XEN] Xen and clock drift, or why the clock stops after live migration
This document describes the problem of the DomU clock stopping after live migration. It applies only to XEN virtualization.
The clock of the DomU stops after live migration
On a XEN cluster with live migration you cannot keep all nodes in exact time sync. You should keep them as closely synchronized as possible (use ntpd or the like), but one node will always be slightly ahead of another. This seems to cause the DomU clock to stop when a DomU is migrated from a Dom0 whose clock is ahead of the destination Dom0's clock. In that state, time in the DomU stands still. This is a known bug and filed here:
A workaround is to make the DomU time independent of the Dom0 time and to run ntp on the DomU as well to keep its time in sync. To do this, add the line xen.independent_wallclock=1 to /etc/sysctl.conf and then run sysctl -p on all Dom0s and DomUs.
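The workaround can be scripted so that the line is added only once, even if the script is run repeatedly. This is a minimal sketch, not a tested procedure: the function name ensure_independent_wallclock is made up for illustration, and the file path is a parameter so the function can be tried on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): append the setting to a
# sysctl config file unless an identical line is already present.
ensure_independent_wallclock() {
    conf="${1:-/etc/sysctl.conf}"
    setting="xen.independent_wallclock=1"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$setting" "$conf" 2>/dev/null; then
        printf '%s\n' "$setting" >> "$conf"
    fi
}

# Intended use, as root, on every Dom0 and DomU:
#   ensure_independent_wallclock /etc/sysctl.conf && sysctl -p
```

Note that this only detects an exact copy of the line; a differently formatted entry (for example with spaces around the equals sign) would not be recognized and should be checked by hand.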
Problems resulting from this:
- Time stops
- Top shows wrong or no load values
- Applications might crash
- Ping stops
- Etc.
About XEN clock drift
Consider a CentOS-5.1 Xen server (2.6.18-53.1.4.el5xen) hosting two domains running CentOS-5.1 (2.6.18-53.1.4.el5). One domain has a fairly accurate clock, the other domain has a clock that gains ungodly amounts of time, roughly one minute every two or three minutes. For a fix, one suggestion is to run this command in DomU:
echo 1 > /proc/sys/xen/independent_wallclock
This didn't change anything. As an experiment, I wrote a script to call ntpd -q, sleep 60, and repeat indefinitely. Here are a couple of snippets of output:
goodclock# ksh ./xenclockdrift
ntpd: time slew +0.001211s
ntpd: time slew +0.001200s
ntpd: time slew +0.001855s
ntpd: time slew +0.001532s
ntpd: time slew +0.001603s
ntpd: time slew +0.001320s
ntpd: time slew +0.001931s
badclock# ksh ./xenclockdrift
ntpd: time slew -0.000193s
ntpd: time set -57.356377s
ntpd: time slew +0.002352s
ntpd: time slew +0.003018s
ntpd: time set -57.417488s
ntpd: time slew +0.012089s
ntpd: time slew -0.000985s
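The script itself was not posted. A loop like the one described (call ntpd -q, sleep 60, repeat) might look roughly like this. The function body and its three optional arguments are assumptions, added so the loop can be bounded and tried with a harmless command before running it for real:

```shell
#!/bin/sh
# Hypothetical reconstruction of the described loop; not the poster's script.
# Arguments (all assumptions, for illustration):
#   $1  number of iterations; 0 means repeat indefinitely (default 0)
#   $2  seconds to sleep between runs (default 60)
#   $3  command to run each time (default "ntpd -q")
xenclockdrift() {
    count="${1:-0}"
    interval="${2:-60}"
    cmd="${3:-ntpd -q}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # Left unquoted on purpose so "ntpd -q" splits into command and option
        $cmd
        sleep "$interval"
        i=$((i + 1))
    done
}

# As described in the post: run forever, once a minute
#   xenclockdrift
# Dry run: three iterations, no delay, harmless command
#   xenclockdrift 3 0 "echo tick"
```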
These domains are fully virtualized and set up identically, except that "badclock" is allocated two processors versus one processor for "goodclock". Dom0's clock is running normally.
Anyone know what's going on or how to fix it?
This is a known issue that has come up on this list a lot.
For C5.1 see the first known issue
Please note that the clock rate issue in that description applies to non-xen kernels. xen kernels are set to 250Hz by default.
>>>With this option you can reduce the clock rate from the default of 1000HZ to 100HZ which is desirable in a virtual machine.
If it does not apply to xen then this should be made more clear.
I think it is a good idea to add a note about xen kernels. But this is noted in the upstream Release Notes where the tick divider option is mentioned:
>>>Note that the virtualized kernel does not support multiple timer rates on guests. dom0 uses a fixed timing rate set across all guests; this reduces the load that multiple tick rates could cause.
See also here (scroll down to tick_divider in the Feature Updates section)
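To see which timer frequency a given kernel was actually built with, one option is to read CONFIG_HZ from the kernel's build configuration. This is a minimal sketch, assuming the distribution installs that configuration as /boot/config-<kernel release>; the function name kernel_hz is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): print the CONFIG_HZ value
# from a kernel build config. Defaults to the running kernel's config,
# assuming it lives at /boot/config-$(uname -r).
kernel_hz() {
    config="${1:-/boot/config-$(uname -r)}"
    # Match only the plain CONFIG_HZ=<number> line, not CONFIG_HZ_250=y etc.
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "$config"
}

# Usage:
#   kernel_hz                      # running kernel
#   kernel_hz /boot/config-<release>   # some other installed kernel
```

Running this on both a standard kernel and a xen kernel is a quick way to compare the two rates discussed above.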
A note by Johnny Hughes in Comment 6644:
>>>As a side note ... if the clock GAINS time (runs too fast) you should be able to fix it with this:
VMWare KB#1591 (by setting the correct host.cpukHz) and vmware tools should adjust a clock that is too slow.
Also see this blog entry concerning host.cpukHz: Blog: vmware-guest-clock-runs-fast
Taking the advice from a few people and web pages, I made this change to grub.conf:
kernel /xen.gz-2.6.18-53.1.4.el5 divider=10 clock=pit
The clock is better but still not fixed. It jumps forward less often.
Despite the subject, the contents of this thread have drifted to vmware. No, you cannot use the divider= option for the xen kernel. This is for the standard kernel only.