Virtualization: WTF

For reasons that don’t need exploring at this juncture, I decided to start reading through a bunch of papers on virtualization, and I thought I’d force myself to actually do it by publicly committing to blogging about them.

First on deck is Disco: Running Commodity Operating Systems on Scalable Multiprocessors, a paper from 1997 that itself “brings back an idea popular in the 1970s” – run a small virtualization layer between hardware and multiple virtual machines (referred to in the paper as a virtual machine monitor; “hypervisor” in more modern parlance). Disco was aimed at allowing software to take advantage of new hardware innovations without requiring huge changes in the operating system. I can speculate on a few reasons this paper’s first in the list:

  • if you have a systems background, most of it is intelligible with some brow-furrowing
  • it goes into a useful level of detail on the actual work of intercepting, rewriting, and optimizing guest operating systems’ access to hardware resources
  • the authors went on to found VMware, a massively successful virtualization company

I read the paper intending to summarize it for this blog, but I got completely distracted by the paper’s motivation, which I found both interesting and unexpected.

What Problem Is Disco Solving?

The motivation for this paper is an attempt to use a cache-coherent, non-uniform memory access (cc-NUMA) system with multiple processors efficiently, without requiring drastic operating system modifications. To describe cc-NUMA with a bit less jargon: a system can have multiple processors, each with a local bank of memory that it can access more quickly than the banks near other processors, and the hardware makes sure this doesn’t result in stale caches on any processor. It took me a pass and a half through this paper (and a quick glance at the SGI Altix user’s guide) before I realized why this architecture in particular was so problematic that it merited a revival of the virtual machine monitor. I doubt I understand the whole scope of the problem, but here’s an attempted summary of what I did gather:

  • accessing remote memory carries a uniformly higher cost than accessing local memory. In other words, processor 1 trying to access memory from a bank near processor 128 takes longer than processor 1 accessing memory from its own bank. The ratio of time-to-access-remote-memory to time-to-access-local-memory is called the NUMA ratio; for reference, the NUMA ratio on the SGI Altix varies from 1.9 to 3.5, neither of which is insignificant. (Altix numbers come from Linux Scalability for Large NUMA Systems, 2003.)
  • naively allocating memory in big chunks by physical address number results in memory being concentrated in banks near certain processors. If we allocate a contiguous 512 kilobytes of memory because we intend to use it for kernel data structures, it’ll all be in one memory bank near one processor.
  • there are some blocks of memory that all processors will need to reference frequently, such as kernel code for system calls. (Processor memory caches can mitigate this, but cache memory is precious, and cache validation is no joke on multiprocessor systems.)

All of this combined means that unlucky processors can take performance hits fairly frequently. Even when each processor can cache frequently-referenced data from remote memory, consider the case where processor 1 writes to a block of memory that processors 2 through 128 all have cached locally. Processors 2 through 128 must each be informed (via the hardware’s cache coherency mechanism) that their cached copies of that block are invalid, and that they need to fetch it again before reading the data. If the block in question is, say, a lock for a frequently contended resource, this can happen constantly, with ugly implications for performance.
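To make the arithmetic above concrete, here’s a toy cost model. None of this is from the paper – the cycle counts, the NUMA ratio of 3.0, and the per-invalidation cost are all made-up illustrative numbers – but it shows why a write to a widely-cached block scales badly with the number of caching processors:

```python
# Toy cc-NUMA cost model (illustrative numbers, not from the Disco paper).

LOCAL_ACCESS_CYCLES = 100   # hypothetical cost of reading the local memory bank
NUMA_RATIO = 3.0            # remote/local cost ratio; the Altix range is 1.9-3.5
REMOTE_ACCESS_CYCLES = int(LOCAL_ACCESS_CYCLES * NUMA_RATIO)

def write_shared_line(num_other_cpus_caching: int, invalidate_cost: int = 50) -> int:
    """Cycles for one write to a cache line that other CPUs hold copies of:
    every other CPU caching the line must receive an invalidation message."""
    return LOCAL_ACCESS_CYCLES + num_other_cpus_caching * invalidate_cost

# One write to a hot lock cached by 127 other processors costs vastly more
# than the same write when nobody else has the line cached.
hot = write_shared_line(127)
cold = write_shared_line(0)
print(REMOTE_ACCESS_CYCLES, hot, cold)
```

The point isn’t the specific numbers; it’s that the cost of a single write grows linearly with the number of processors caching the line, which is exactly the pattern a contended lock produces.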

OK, How Does This Solve It?

Rather than rewrite thousands of lines of IRIX (and Linux and Solaris and AIX and Windows NT and HP-UX and BSD and…) operating system source code, the authors insert a virtualization layer between the non-NUMA-aware operating system and the cc-NUMA machine. (Well, a simulator, because the actual machine motivating this paper didn’t exist when they were writing it.) The virtualization layer (or virtual machine monitor) intermediates operating system access to processors, memory, and I/O – a tall order that the authors managed to pull off in under 100 KB of resident memory.

The monitor deals with cc-NUMA by taking control of the physical layout of memory away from the resident virtual machines. Instead of mapping virtual memory addresses to physical addresses directly, an operating system running under the monitor maps virtual addresses onto a fictitious “physical” address space maintained by the monitor; the monitor intercepts accesses to those physical addresses and rewrites them to the real machine addresses of the memory.
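A minimal sketch of that extra level of indirection – the class name, method names, and page numbers here are all my invention, not the paper’s actual data structures:

```python
# Sketch of the monitor's physical-to-machine page mapping (assumed
# interface; the real Disco structures and MIPS TLB handling are far
# more involved).
class Monitor:
    def __init__(self):
        # pmap[vm_id][physical_page] -> machine_page
        self.pmap = {}

    def map_page(self, vm_id: int, phys_page: int, machine_page: int) -> None:
        """Back one of a VM's 'physical' pages with a real machine page."""
        self.pmap.setdefault(vm_id, {})[phys_page] = machine_page

    def translate(self, vm_id: int, phys_page: int) -> int:
        """Rewrite a guest 'physical' page to the machine page behind it."""
        return self.pmap[vm_id][phys_page]

mon = Monitor()
mon.map_page(vm_id=0, phys_page=0x12, machine_page=0x9A)
# The guest believes it owns physical page 0x12; the data actually
# lives at machine page 0x9A, and the guest never needs to know.
print(hex(mon.translate(0, 0x12)))
```

The crucial property is that the monitor can change the right-hand side of that mapping at any time – which is what makes the page replication and migration tricks below possible without the guest OS noticing.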

Disco has access to information from the OS about whether memory is writable, in addition to information on access patterns from the running virtual machines. It can combine these two pieces of information to intelligently map memory in one-to-many or many-to-one configurations. For example, a page of memory containing read-only kernel code, which may be needed by all processors, can be replicated across multiple memory banks; each processor, when reading that physical address, is served from the copy closest to itself. By keeping track of which virtual processors request which pages, and adjusting mappings or moving pages accordingly, Disco can provide local memory accesses much more consistently than the native, non-NUMA-aware OS.

It Does Some Other Cool Stuff Too

Disco also does some similarly clever things with virtualized I/O and disk access. Imagine having one hypervisor running eight virtual machines, all of them copies of the same Linux installation. On receiving a read request from disk for the kernel executable from processor 1, the hypervisor can read the data from disk, put it into memory, and then service subsequent read requests from that location in memory – even if the request comes from processor 4 or processor 120.
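A sketch of that sharing trick – the cache interface below is my assumption of the shape of the idea, not Disco’s actual buffer management:

```python
# Hypothetical sketch: the monitor caches disk blocks in machine memory,
# so identical reads from different VMs are served from one shared copy.
class DiskCache:
    def __init__(self, read_from_disk):
        self.read_from_disk = read_from_disk  # callable: block_number -> bytes
        self.cache = {}                       # block_number -> in-memory copy
        self.disk_reads = 0

    def read_block(self, vm_id: int, block_number: int) -> bytes:
        if block_number not in self.cache:
            self.cache[block_number] = self.read_from_disk(block_number)
            self.disk_reads += 1
        # Every VM gets (a mapping to) the same in-memory copy; in Disco
        # the shared page stays read-only and is copy-on-write if written.
        return self.cache[block_number]

cache = DiskCache(lambda n: b"kernel-block-%d" % n)
a = cache.read_block(vm_id=1, block_number=7)   # goes to disk
b = cache.read_block(vm_id=4, block_number=7)   # served from memory
assert a is b and cache.disk_reads == 1
```

With eight identical Linux guests, this means the kernel image is read off disk once and occupies machine memory once, instead of eight times.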

Shameless Pandering To My Established Audience

I would be remiss not to mention that the authors also show some impressive performance results for a library OS. The commodity OS used for most testing in the paper, IRIX, performed extremely badly under certain conditions because of choices made by its virtual memory manager. SPLASHOS, a library OS, ran three times faster under the test workload by completely abdicating its responsibility for page faults to the hypervisor.

Oh Yeah, Also It Worked

People don’t recommend 17-year-old papers that prove the null hypothesis. Disco got very good performance results compared to the commodity OS, IRIX, running directly on the hardware. According to lazy web searches, VMware was indeed built on Disco, and they seem to be doing okay.