Intel reveals next steps in virtualisation
Vanderpool two and three en route
By Charlie Demerjian in San Francisco: Monday 22 August 2005, 21:05
THIS IS THE LATEST article in a series about hardware virtualisation. The first set is on Vanderpool, Intel's version of the concept. If you are unfamiliar with the concept, please read Vanderpool Parts 1, 2, 3 and 4, and AMD's Pacifica parts 1, 2 and 3.
Before you can even get Intel chips with VT in them, Intel is touting VT2 and VT3. Don't take this as a reason not to buy chips with vanilla VT1 in them; you will have a long wait if you do. Think of it instead as a statement of future direction. VT-enabled chips will be on the market in a few months, but the war of words has already begun.
Intel announced VT over a year ago and released the specs about 6 months ago. Not to be outdone, AMD announced Pacifica, which should be out early next year. Now Intel is talking about the next generation shortly before the first one is out, which means AMD will follow with Pacifica2 before Pacifica the elder hits the market. Before you conclude that this is all a marketing game, just remember, there is a substantial amount of good tech here, and VT2 will do a lot of good for VMM developers.
Think of this as where Intel will go once the above listed stuff is purchasable silicon. This time, it is not just limited to the CPU, it pulls in the Northbridge, memory controllers, buses and peripherals. It is a lot more of an attempt to virtualise the system rather than only the processor.
The first and probably the most important addition is memory virtualisation - not memory controller virtualisation. While this may seem like a rather odd distinction, it actually makes a lot of sense, and with the Intel implementation there may never be a need to virtualise the memory controller. This is called Extended Page Tables, or EPT in Intel parlance.
Finding the right address is a time-consuming and recursive process which can be three or four levels deep, with a lot to keep track of. As with AMD's Nested Page Tables, you could add another level of recursion, or, as Intel does, add an offset to the PTW, the Page Table Walker. What this in effect does is figure out the offset to where the VM thinks a page is located and add it to the PTW calculations.
The PTW hardware is designed to figure out what an address should be after a TLB miss. To add virtualisation you need to know the offset between the real memory address and what the guest thinks it is.
When the calculations are done by the PTW, the result is then passed to the memory controller in a 'pre-virtualised' manner; it is already correct. The memory controller does not need to figure out anything, nor does it need to be aware that the OS calling for data is virtualised. In effect, you are adding the intelligence before the memory controller rather than on it.
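The idea can be sketched as a two-stage lookup: the guest's own page tables map a guest-virtual address to what the guest thinks is physical memory, and the EPT-style table fixes that up to a real host address before the memory controller ever sees it. The table layouts and names below are a toy model for illustration, not Intel's actual structures.

```python
# Guest page tables: guest-virtual page -> guest-physical page (guest owns this)
guest_page_table = {0x1000: 0x4000, 0x2000: 0x5000}

# EPT-style table: guest-physical page -> host-physical page (VMM owns this)
ept = {0x4000: 0x9000, 0x5000: 0xA000}

def translate(guest_virtual):
    """Walk both tables in one pass, as the extended page table walker would."""
    page, offset = guest_virtual & ~0xFFF, guest_virtual & 0xFFF
    guest_physical = guest_page_table[page]   # stage 1: the guest's normal walk
    host_physical = ept[guest_physical]       # stage 2: the VMM's fix-up
    return host_physical | offset             # already correct for the memory controller

print(hex(translate(0x1234)))  # guest address 0x1234 lands at host 0x9234
```

The point of the sketch is where the work happens: the memory controller only ever receives the final address, so it needs no knowledge of virtualisation at all.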
One nice thing that EPT does is to make any memory controller able to support hardware virtualisation with little or no changes. When CPUs that support it come out, motherboard support should be there as well; there's no need to design a new northbridge for virtualisation.
EPT will mean no more dropping in and out of the VM every time there is a certain class of memory accesses, and a whole lot fewer interrupts to trap. Since this is one of the largest costs in virtualisation it should speed things up dramatically. On the surface, it appears to be quite a different method of achieving the same goal that Pacifica gets to by virtualising the memory controller. It will be interesting to see which method provides the lowest overhead, but both will be vastly better than the current software method. EPT will catch Intel up to Pacifica.
Once you've caught up, the next thing to do is move ahead, and that's what DMA Remapping does. If you recall, Pacifica can block DMA at the HT to ccHT border, providing a yes/no ability for the VM to see a particular piece of hardware. This is not exactly a hugely granular solution, but it does the intended job fairly well.
DMA Remapping remaps the DMA request to the correct guest OS, so in cases where Pacifica might deny something, DMA Remapping points it to the correct spot. This is the first step to virtualising peripherals, but more on that later.
DMA Remapping in software is hugely expensive; it can bring a fast machine to its knees if done improperly. Remapping in hardware is orders of magnitude faster, but still has enough overhead that it isn't something you would want to do on every interrupt. That is where Intel adds the IOTLB. Like the TTLB from Pacifica, or the plain old TLB in non-virtualised chips, it caches the DMA remappings so they only have to be calculated once.
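The caching argument is easy to see in miniature: pay for the full remap once per page, then serve every later DMA to that page from the cache. The table contents and function names here are invented for illustration.

```python
remap_table = {0x4000: 0x9000}   # device-visible page -> host-physical page
iotlb = {}                       # the cache of completed remappings
walks = 0                        # how many full (expensive) lookups we paid for

def remap_dma(addr):
    """Remap one DMA address, consulting the IOTLB-style cache first."""
    global walks
    page, offset = addr & ~0xFFF, addr & 0xFFF
    if page not in iotlb:        # miss: do the expensive walk exactly once
        walks += 1
        iotlb[page] = remap_table[page]
    return iotlb[page] | offset  # every later access to this page is a cheap hit

remap_dma(0x4010)
remap_dma(0x4020)
remap_dma(0x4FF0)
print(walks)  # 1 -- only the first access paid for the full remap
```

Three DMA accesses to the same page cost one walk; that is the whole reason the IOTLB exists.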
EPT and DMA Remapping work in tandem - they virtualise two of the biggest holes that VT left open. With these two things running in VT2/3, or whatever marketing name is thought up by then, there is very little left to do on the CPU to provide a completely virtualised environment.
The catch here is 'on the CPU.' The next phase of VT is to pull the platform and peripherals into the picture. This involves just about every vendor out there, and Intel has to ride them while cracking the whip. They all have to dance to the same standard or you end up with an ugly mess instead of a virtualised machine.
To do this, you start with the PCI-SIG because all of these things that you want dancing to the same beat are all plugged into PCIe. Work here is well under way. There are virtualisation working groups for PCIe 2.0 that are in the early stages of bitter argument. It may be a while, but with any luck, PCIe 2.0 will come out with some form of virtualisation support built in.
This will allow the peripheral vendors to make individual devices virtualisable, or at least be VM aware enough to dodge the uglier bullets. With a full set of virtualisable hardware on a PCIe 2.0 bus, and a VTx CPU running it all, you have pretty much a completely virtualised system. That is the goal, and it looks like the roadmap is being made known.
A cute trick that this, once fully implemented, will allow is for each VM to run its own driver set in guest OS space. No more massive jumping back and forth to the VMM and trapping every call under the sun. For the user, you can play a lot of tricks, and also run games and other demanding apps in a VM. For devs, it could allow driver debugging on a whole new level. Have five revisions of a driver, and want to test them all on the same box? Not a problem. Things like this make devs smile.
Another little trick they've added is a preemption timer. This doesn't really virtualise anything specifically, but it allows for different ways to pop in and out of the VMM. It is a timer that says run VM 1 for X milliseconds, then drop out and run VM 2 for Y. Preemption timers have some very interesting implications for the embedded world and other similar applications, but for desktops they are not all that useful.
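The behaviour is essentially a hardware-enforced round-robin: the timer counts down, forces an exit to the VMM when the slice expires, and the VMM hands the CPU to the next guest. The slice lengths and VM names below are invented; this is a scheduling sketch, not the actual VMX mechanism.

```python
slices_ms = {"VM1": 30, "VM2": 10}   # run VM1 for X ms, then VM2 for Y ms

def schedule(total_ms):
    """Round-robin the guests until total_ms of CPU time has been handed out."""
    log, clock, i = [], 0, 0
    vms = list(slices_ms)
    while clock < total_ms:
        vm = vms[i % len(vms)]
        run = min(slices_ms[vm], total_ms - clock)
        log.append((vm, run))        # timer fires -> VM exit -> VMM picks the next guest
        clock += run
        i += 1
    return log

print(schedule(80))  # [('VM1', 30), ('VM2', 10), ('VM1', 30), ('VM2', 10)]
```

For telecom-style workloads the appeal is exactly this determinism: each guest gets its slice on schedule, with no cooperation required from the guest itself.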
It can help a lot when you need to switch tasks now, or you must allocate a certain amount of CPU power to a task. For telecom and networking applications, it makes virtualisation a useful tool and possibly a must have feature. On the other end of the spectrum, it can help for media applications like media PCs and Tivo-type devices. For the business world, it doesn't buy you all that much.
So, with all the tech, what is the nutshell story, and more importantly, when? VT is launching in a few months, definitely in 2005 for desktops. For servers and mobile, Intel is only saying 'after 2005', but you can narrow that down with a little guesswork. I would look for the memory virtualisation on the Merom cores along with DMA remapping. If Intel follows its normal way of doing things, both techs will probably not make the first spin of the cores, but will follow in the next revision.
The platform-level work is a bigger open question. The PCIe 2.0 virtualisation should proceed in the same orderly cat-herding fashion that any standards-setting body goes through. It is needed, and I think everyone agrees that it should be done, but how, when, and most importantly whose methodology wins will be contentious issues. It will be worked out, and you will see virtualised PCIe eventually.
Then comes a task that makes the previous cat-herding look easy. Imagine cat herding with a firehose and firecrackers. That is notably easier than getting all the peripheral makers to play along. This part will also come in time, starting with the enterprise level hardware, and moving down the food chain to the more reputable peripheral makers, and then eventually to everyone. Think hardware compatibility lists, and lots of them.
Once VT2 and VT3 are out, there won't be much left to do. The entire computer will be virtualisable with very little overhead, and the dreaded software faking of any part of the system should be banished to memories of the bad old days. That is the point of all of this, and now we have a rough roadmap on how to get there. µ