
@sboeuf
Last active April 25, 2023 18:53
Raw notes about virtualization, firecracker and crosvm

Global concepts

Addressable space

Depends on the processor capabilities, which can be around 36~40 bits for recent Intel processors. Taking 39 bits as an example, the total addressable space will be 512 GiB of available addresses. The guest RAM is part of this addressable space, same as devices, PCI holes, ...

BAR (Base Address Register)

A base address register is part of the PCI configuration space of each PCI device, and it declares an extra memory region related to the device that can be found at this specific base address.

I/O BAR (deprecated): Accessed through port I/O, hence causing a VM exit every time the guest accesses this region.

Memory BAR (standard): Supposed to be backed by a memory mapped region from the host, and declared to KVM, so that accesses do not create any VM exit.

Firecracker

It maps, via mmap(), a certain amount of memory into the hypervisor virtual address space, in order to represent guest RAM. This guest RAM is split into 2 distinct regions if the user asks for more RAM than the base address of the hole defined by Firecracker between 3.25 GiB and 4 GiB. This hole is dedicated to device configuration through MMIO: the available base address is dynamically incremented every time a new device is added, and the guest accesses those device configurations through this memory mapped I/O region. Only guest RAM is passed by Firecracker to KVM through ioctl(KVM_SET_USER_MEMORY_REGION), since this is the kind of memory that we don't want to trap on. Informing KVM updates the EPT tables in the MMU, so no VM exit occurs when the guest accesses an address within the RAM range.

PIO (Port IO) and MMIO (Memory Mapped IO) are typically used to communicate device configuration, and have to be handled differently. When the guest wants to read/write an MMIO address, this causes a VM exit with the flag KVM_EXIT_MMIO which needs to be handled by the hypervisor. The point is to let the hypervisor do some sanity checks about the address being accessed and return the value accordingly if the address falls in an expected range. In the case of a virtio device, this is likely to reach the virtio device backend that corresponds to this address, and the appropriate configuration information will be returned as the value at this address. The same thing happens with PIO, except that the flag returned with the VM exit is KVM_EXIT_IO.

At the time a device is created by the hypervisor, the address range it occupies is inserted into the corresponding bus (io_bus or mmio_bus). By inserting the range, the hypervisor ensures that whenever a VM exit happens on a specific memory read/write, it will have the ability to detect whether this is an expected memory access or not.

Virtqueues are allocated by the guest drivers into the guest RAM, and they are a powerful way to read/write large amounts of data between host and guest without going through a bunch of VM exits.

Not in Firecracker but worth mentioning

Any additional memory needed by a device, such as a 3D buffer for a GPU device, will need to be mapped into the hypervisor virtual address space to reserve the right amount of memory. Once the mapping is done, this memory needs to be declared to KVM the same way guest RAM had been previously declared. This allows the guest to access this shared memory region directly without causing further VM exits, which is the expected behavior when huge amounts of data need to be shared between host and guest. Just a quick note that in this case, the virtqueues allocated in RAM are not a good fit. Because the virtqueues point to the guest RAM, we would need to extend the guest RAM in order to allow huge regions to be allocated in guest memory. One more reason: the virtqueues still need to notify the host through the ioeventfd mechanism, which needs to be handled by KVM. This is not extremely expensive, but still involves more steps than simply accessing a fully shared memory region handled through the EPT/MMU only.

Crosvm

It is more advanced but also more complex than Firecracker, since the memory model accepts more use cases such as sharing memory with the guest. This is due to the number of supported devices required by the Android use case.

The hypervisor has the ability to create some memory mapping for extra device memory (virtio-gpu and virtio-wayland). Once the mapping is done, it uses add_device_memory() to notify KVM about this region. Note that the GPU shared region is added by default from run_config().

virtio-wayland (can be found in devices/src/virtio/wl.rs) can call into VmRequest::RegisterMemory to trigger the function register_memory(). This is a generic function responsible for doing the memory mapping backed by a file descriptor and notifying KVM about the new region through KVM_SET_USER_MEMORY_REGION. The last interesting part in this function is the glue/component that takes care of finding the available addresses for all those device memory regions. This component is called the SystemAllocator, and is responsible for avoiding any overlapping memory regions.

To summarize, the mmap() happens, then the hypervisor needs to find an address range in guest physical memory, and once the SystemAllocator returns the first available one, the hypervisor can tell KVM about it.

The way PCI is handled regarding BARs is first by assuming the PCI hole is going to be the entire guest physical address space [0x0 - 0xffffffffff], because no specific indication about holes is given through ACPI since there is no ACPI emulated at all. Then, it assumes the kernel will not try to reprogram the BARs, which lets the hypervisor fully rely on the SystemAllocator to decide the entire VM memory topology.

Except for the virtio-gpu device, which allocates 4 GiB of shared memory upfront and then declares an 8 GiB BAR through its PCI configuration, no other virtio device uses BARs to provide the guest with the address of a shared memory region. Interestingly, the virtio-wayland device behaves slightly differently. It does not create any shared memory before the guest has started running; later, it receives some requests (WlOp::NewAlloc) through the virtqueues, because the guest needs to allocate more shared regions. Upon receiving such a request, the whole shared memory creation happens (mmap() + find guest address range + inform KVM), and the virtio-wayland device returns the guest pfn (page frame number) corresponding to the base of the new memory region, along with the size. So here, it is interesting to see that virtqueues are used as a simple control plane to trigger the creation of more shared memory regions. Because those operations are dynamic, using the BARs to declare a fixed memory region would not work here.
