Does anybody really know what time it is?

Well, if you’re running VMware ESX, your virtual machines might not.  (If you’re just looking for a quick answer on why your clock drags ass, remove your virtual floppy drive.)

Allow me to elaborate.  I am not slagging on VMware.  They make some incredible stuff.  But they are trying to simulate a system that contains thirty-year-old kludges.  Some of those kludges cause very weird things to happen.  One of those things involves time.

VMware has an excellent paper on timekeeping in virtual machines.  The specific problem that I ran into is a side effect of what they refer to as “tick-based timekeeping.”

Dig this: Running the Universe database in a Windows Server 2008 virtual machine, we noticed that scheduled tasks were being missed.  “We” here is really me, but “we” sounds better since this was for one of my clients.  Anyhow, the task scheduler is something I wrote.  It runs through its internal task list at the top of the minute, and runs whatever is appropriate for that minute.  Works a lot like good old cron.  Specifically, it uses the Universe Basic SLEEP command to sleep for 10 seconds, then checks whether it’s at the top of the minute.  The reasons for this weirdness are something you’ll understand if you work with Universe.
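To see why a scheduler like that is so sensitive to a bad SLEEP, here is a rough sketch in Python rather than Universe Basic.  It is not my actual scheduler.  The function name minutes_fired, the start time, and the rule that “top of the minute” means “woke up in the first 10 seconds of the minute” are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

def minutes_fired(wake_times):
    """Return the minutes for which a 10-second poller would run its jobs.

    Assumption: a wake-up counts as "top of the minute" when it lands in
    the first 10 seconds of that minute.
    """
    fired = []
    for now in wake_times:
        minute = now.replace(second=0, microsecond=0)
        if now.second < 10 and (not fired or fired[-1] != minute):
            fired.append(minute)
    return fired

start = datetime(2024, 1, 1, 12, 0, 5)  # arbitrary illustrative start time

# Healthy clock: every sleep really takes 10 seconds.
healthy = [start + timedelta(seconds=10 * i) for i in range(18)]

# Dragging clock: the third sleep takes 90 seconds instead of 10.
offsets = [0, 10, 100, 110, 120]
dragging = [start + timedelta(seconds=s) for s in offsets]

print([m.strftime("%H:%M") for m in minutes_fired(healthy)])
print([m.strftime("%H:%M") for m in minutes_fired(dragging)])
```

With the healthy wake-ups, 12:00, 12:01 and 12:02 all fire.  With the one overlong sleep, 12:01 never gets its turn, which is the kind of missed job we were seeing.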

So, it’s missing jobs.  No clue why.  The clock seems to drag a bit, and then catches up.  I wrote a test program that would SLEEP for thirty seconds, and then print the time.  And here’s what I wound up with:  Sometimes SLEEPing for thirty seconds would take a minute and a half!  Essentially, it would occasionally SLEEP through its alarm call.
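If you want to try the same experiment without Universe, something along these lines should do it.  This is a Python stand-in for my test program, not the original; timed_sleep and run_trials are names I made up.  One caveat: it measures with the guest’s own monotonic clock, so if that clock is also misbehaving, compare against a clock outside the VM.

```python
import time

def timed_sleep(seconds):
    """Sleep for `seconds` and return how long the sleep actually took."""
    started = time.monotonic()
    time.sleep(seconds)
    return time.monotonic() - started

def run_trials(seconds, trials):
    """Run several timed sleeps and return how far each one overran."""
    overruns = []
    for _ in range(trials):
        elapsed = timed_sleep(seconds)
        overruns.append(elapsed - seconds)
        print(f"asked for {seconds:.2f}s, got {elapsed:.2f}s")
    return overruns

# To mimic the test described above, try: run_trials(30, 20)
```

On a healthy machine the overrun should be a small fraction of a second.  An overrun of a full minute is the symptom described above.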

So now you’re wondering, “Hey, how is that possible?”  Remember that white paper I linked that you didn’t read?  It’s all about the ticks.  I don’t know how, precisely, the Universe SLEEP command works, but I’ll bet real money it counts ticks.  And when the tick counter goes wacky, so does SLEEP.

In the process of googling, I found all sorts of ways that time could drag, but none of them were my problem.  So I did what I usually do when I can’t find an answer.  I went Easter egging.

In my lab, I created a new VM with only one CPU, just enough RAM and disk, and a single NIC.  I installed the OS and Universe.  Then I ran my little test program.  And SLEEP took precisely the right number of seconds every time.  The only differences between this VM and the first one were the number of CPUs and a floppy drive.  So I shut down my other Universe-hosting VM, and removed the floppy (more on why that became the obvious choice in a moment).  Restarted it, and SLEEP worked perfectly.

So now, you’re scratching your head.  How can a floppy drive, a VIRTUAL one, no less, cause clock problems?  We go back to the white paper.  VMware detects if something needs ticks, and provides ticks to the virtual machine.  But they acknowledge that the ticks don’t necessarily come in a continuous stream, and some things might not handle getting a whole stream of ticks at once when ESX tries to catch up.  This is why the clock bounces around so much.  It also explains why Universe freaks out.
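Here is a toy model of that catch-up behavior, in Python.  It is only a sketch of the idea, not how ESX actually schedules ticks.  The 100 ticks-per-second rate, the delivery pattern, and the names TICK_HZ and guest_clock are all invented for illustration.

```python
TICK_HZ = 100  # assumed guest timer rate: 100 ticks per second

def guest_clock(delivered_per_second):
    """Return the guest's idea of elapsed seconds after each real second.

    delivered_per_second[i] is how many ticks the host delivered during
    real second i.  The guest only knows time by counting ticks.
    """
    ticks = 0
    readings = []
    for delivered in delivered_per_second:
        ticks += delivered
        readings.append(ticks / TICK_HZ)
    return readings

# Two normal seconds, two starved seconds, a catch-up burst, then normal.
delivered = [100, 100, 0, 0, 300, 100]
for real, guest in enumerate(guest_clock(delivered), start=1):
    print(f"real {real}s  guest {guest:.1f}s  lag {real - guest:.1f}s")
```

The guest clock stalls at 2 seconds while real time moves on to 4, then jumps to 5 in a single burst.  Anything counting ticks through that stretch sees time stand still and then lurch forward.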

In the final analysis, VMware sees the floppy drive, and decides that the VM needs ticks.  Take away the floppy drive, and it uses one of the more accurate tickless timers.  This goes all the way back to Windows 98, when the guys at Microsoft figured out how to make a floppy drive know a disk had been inserted or changed without actually reading the disk.  But the floppy drive generates interrupts, and the routines for dealing with it use, you guessed it, ticks.

So, the lesson learned is this – simplicity good.  If you don’t need it, then don’t attach it to your VM.  And with the hardware set that VMware provides, you probably don’t ever need a floppy drive to install your favorite operating system.