Kexec Handover and Live Update
Live update is NOT live patching or live migration.
Updates the kernel or hypervisor with minimal disruption to the workloads
running on top of it.
Most commonly used for hypervisors.
Can also be used by other workloads to reduce kernel patching downtime.
Multiple cloud providers working together to upstream it.
The system is in normal state.
The system software starts the live update process.
Serializes state while keeping VMs active, but with limited capabilities.
Pauses VMs and does final serialization.
Loads the next kernel and hands over the serialized data.
Next kernel deserializes the data.
Resumes the VMs, returning to normal operation.
In the “serialization” part, mention the role of system software and kernel.
Note the similarities to live migration.
VM metadata
VM memory
Passthrough devices
IOMMU mappings
Complex feature, not easy to do in one go.
Upstreaming as a set of building blocks instead.
Creates a mechanism for kernel-to-kernel communication.
Provides a mechanism to mark memory as preserved.
Makes sure preserved memory does not get used by the next kernel.
Passes this information over kexec.
Explain that it is not possible to preserve user memory using this.
But it can be used for non-live-update cases as well, such as reserve_mem.
4.3 KHO: Memory preservation
int kho_preserve_folio(struct folio *folio);
int kho_unpreserve_folio(struct folio *folio);
struct folio *kho_restore_folio(phys_addr_t phys);
4.4 KHO: Memory preservation
Before the system is ready for kexec, KHO must be notified so it can prepare.
On this notification, KHO serializes the preserved memory into bitmaps.
Mention that the finalization hook is going away.
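As a rough illustration of what "serializing preserved memory to bitmaps" can mean, the sketch below tracks preserved page frame numbers in a flat bitmap. It is a userspace toy, not the kernel's data structure: `toy_bitmap`, `toy_bitmap_init`, `toy_preserve_range`, and `toy_is_preserved` are hypothetical names, and the real KHO code may lay out its bitmaps differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_PFN   1024u   /* toy address space: 1024 page frames */
#define TOY_WORD_BITS 64u

/* One bit per page frame; a set bit means "preserve across kexec". */
struct toy_bitmap {
    uint64_t words[TOY_MAX_PFN / TOY_WORD_BITS];
};

static void toy_bitmap_init(struct toy_bitmap *bm)
{
    memset(bm->words, 0, sizeof(bm->words));
}

/* Mark nr_pages frames starting at pfn. Returns 0, or -1 if out of range. */
static int toy_preserve_range(struct toy_bitmap *bm, uint32_t pfn,
                              uint32_t nr_pages)
{
    if (pfn >= TOY_MAX_PFN || nr_pages > TOY_MAX_PFN - pfn)
        return -1;
    for (uint32_t i = pfn; i < pfn + nr_pages; i++)
        bm->words[i / TOY_WORD_BITS] |= UINT64_C(1) << (i % TOY_WORD_BITS);
    return 0;
}

/* The next kernel would consult this before handing a frame to an allocator. */
static bool toy_is_preserved(const struct toy_bitmap *bm, uint32_t pfn)
{
    assert(pfn < TOY_MAX_PFN);
    return (bm->words[pfn / TOY_WORD_BITS] >> (pfn % TOY_WORD_BITS)) & 1;
}
```

A bitmap keeps the handed-over metadata compact: one bit per page frame, regardless of how many separate preservation calls were made.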
Pre-reserved scratch area for early boot.
Passing KHO metadata: setup data on x86, chosen node in FDT on arm64.
struct kho_data {
    __u64 fdt_addr;
    __u64 fdt_size;
    __u64 scratch_addr;
    __u64 scratch_size;
} __attribute__((packed));
chosen {
    linux,kho-fdt = <...>;
    linux,kho-scratch = <...>;
};
Mention that kexec image and all early boot allocations go in scratch.
Mention that chosen node gets set at kexec load time.
On early boot, only allocate from scratch.
enum memblock_flags choose_memblock_flags(void)
{
    if (kho_scratch_only)
        return MEMBLOCK_KHO_SCRATCH;
    [...]
}
After early boot, mark preserved pages as reserved and turn off scratch-only
mode.
Reserved pages don’t get released to buddy allocator.
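The effect of scratch-only mode can be sketched with a toy bump allocator over a few memory regions. This is a simplified userspace model, not memblock itself: `toy_region`, `toy_scratch_only`, `toy_choose_flags`, and `toy_alloc` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_flags { TOY_NONE = 0, TOY_KHO_SCRATCH = 1 };

struct toy_region {
    uint64_t base, size, used;
    enum toy_flags flags;
};

/* True during early boot, before preserved pages have been marked reserved. */
static bool toy_scratch_only = true;

static enum toy_flags toy_choose_flags(void)
{
    return toy_scratch_only ? TOY_KHO_SCRATCH : TOY_NONE;
}

/*
 * Bump-allocate size bytes from the first eligible region.
 * In scratch-only mode, regions without the scratch flag are skipped,
 * so nothing can land on memory the previous kernel asked to preserve.
 * Returns the allocated address, or 0 on failure.
 */
static uint64_t toy_alloc(struct toy_region *r, size_t n, uint64_t size)
{
    enum toy_flags want = toy_choose_flags();

    for (size_t i = 0; i < n; i++) {
        if (want == TOY_KHO_SCRATCH && r[i].flags != TOY_KHO_SCRATCH)
            continue;
        if (r[i].size - r[i].used >= size) {
            uint64_t addr = r[i].base + r[i].used;
            r[i].used += size;
            return addr;
        }
    }
    return 0;
}
```

Once scratch-only mode is turned off, the same allocator is free to use any region, which is why the preserved pages must already be marked reserved by then.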
4.8 Live Update Orchestrator (LUO)
LUO provides a way for userspace to control the live update process.
Allows marking which resources to preserve.
Provides a state machine to co-ordinate all the components.
API is exposed through a set of IOCTLs.
Can’t preserve everything since there is too much state.
Mention that this is the next layer since it lets userspace actually do stuff.
Maybe mention that /dev/liveupdate can only be opened once and that luod
must control it?
Normal: No live update in progress.
Prepared: Kernel is prepared to do a live update. Devices and resources
operate in limited capacity.
Frozen: The final reboot event has been sent. Last chance for the kernel to
serialize.
Updated: System has rebooted into next kernel and can start deserializing
devices and resources.
Normal: The system is back to normal functionality.
struct liveupdate_ioctl_set_event {
    __u32 size;
    __u32 event;
};
LIVEUPDATE_PREPARE: Normal -> Prepared
LIVEUPDATE_FREEZE: Prepared -> Frozen
LIVEUPDATE_FINISH: Updated -> Normal
LIVEUPDATE_CANCEL: Prepared -> Normal
Explain all the states.
FREEZE: Sent from reboot(2).
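The transitions listed above can be modeled as a small state machine. The sketch below is a userspace toy for reasoning about which events are legal in which state; the `TOY_*` names, `toy_transition`, and `toy_reboot` are hypothetical and are not the LUO implementation. The Frozen to Updated step is modeled as a separate function because it happens across the kexec reboot rather than through an event.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_NORMAL, TOY_PREPARED, TOY_FROZEN, TOY_UPDATED };
enum toy_event { TOY_EV_PREPARE, TOY_EV_FREEZE, TOY_EV_FINISH, TOY_EV_CANCEL };

/* Apply ev to *state. Returns true and updates *state only if allowed. */
static bool toy_transition(enum toy_state *state, enum toy_event ev)
{
    switch (ev) {
    case TOY_EV_PREPARE:            /* Normal -> Prepared */
        if (*state != TOY_NORMAL)
            return false;
        *state = TOY_PREPARED;
        return true;
    case TOY_EV_FREEZE:             /* Prepared -> Frozen */
        if (*state != TOY_PREPARED)
            return false;
        *state = TOY_FROZEN;
        return true;
    case TOY_EV_FINISH:             /* Updated -> Normal */
        if (*state != TOY_UPDATED)
            return false;
        *state = TOY_NORMAL;
        return true;
    case TOY_EV_CANCEL:             /* Prepared -> Normal */
        if (*state != TOY_PREPARED)
            return false;
        *state = TOY_NORMAL;
        return true;
    }
    return false;
}

/* Models the kexec itself: a Frozen system comes up as Updated. */
static bool toy_reboot(enum toy_state *state)
{
    if (*state != TOY_FROZEN)
        return false;
    *state = TOY_UPDATED;
    return true;
}
```

Note that in this model a cancel is only possible from Prepared; once the system is Frozen, the only way forward is the reboot.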
4.11 LUO: File Descriptors
Userspace can pass in supported file descriptors to LUO to mark them for
preservation.
Not any arbitrary FD, only FDs for supported file types.
struct liveupdate_ioctl_fd_preserve {
    __u32 size;
    __s32 fd;
    __aligned_u64 token;
};
Give some examples of FDs in Linux: memfd, sockets, VFIO, IOMMUFD, KVM, etc.
Mention some properties that can change with restored FDs, taking memfd as
an example.
Mention that the token can be used to identify the FD after reboot.
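To illustrate why a token is needed, here is a toy registry that records preserved FDs by token and looks them up again "after reboot", when the old FD numbers no longer mean anything. Everything here is hypothetical: `toy_fd_preserve` mirrors the struct above with `<stdint.h>` types, the `size = sizeof(...)` convention is an assumption, and rejecting duplicate tokens is a choice made for this sketch, not documented LUO behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the ioctl argument, using fixed-width types. */
struct toy_fd_preserve {
    uint32_t size;
    int32_t fd;
    uint64_t token __attribute__((aligned(8)));
};

#define TOY_MAX_PRESERVED 8

struct toy_registry {
    struct toy_fd_preserve entries[TOY_MAX_PRESERVED];
    size_t count;
};

/* Look up a preserved entry by token. Returns its index, or -1. */
static int toy_find_by_token(const struct toy_registry *reg, uint64_t token)
{
    for (size_t i = 0; i < reg->count; i++)
        if (reg->entries[i].token == token)
            return (int)i;
    return -1;
}

/* Record fd under token. Returns 0, or -1 if full or token already used. */
static int toy_preserve_fd(struct toy_registry *reg, int32_t fd,
                           uint64_t token)
{
    struct toy_fd_preserve *e;

    if (reg->count == TOY_MAX_PRESERVED ||
        toy_find_by_token(reg, token) >= 0)
        return -1;
    e = &reg->entries[reg->count++];
    e->size = sizeof(*e);   /* assumed convention: size of the struct */
    e->fd = fd;
    e->token = token;
    return 0;
}
```

The point of the model is that the lookup key is chosen by userspace and survives the reboot, while the FD number is only meaningful in the process that registered it.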
For things that can’t be described by an FD.
Examples: PCI, NVME, ftrace, etc.
Mention that not much work has been done on this, so use cases and the usage
model are still unclear.
4.13 Memory File Descriptor (memfd)
memfd attaches a file descriptor to anonymous memory.
State preserved: memory contents, size and position.
After preservation, pages cannot be added to or removed from the memfd.
Limitations: no sparseness, no swap.
Mention that memfd is the first user of LUO.
Mention that pages are pinned and holes are filled.
4.14 memfd: preservation format
/ {
    pos = <0x...>;
    size = <0x...>;
    folios = [array of memfd_luo_preserved_folio]
};
struct memfd_luo_preserved_folio {
    u64 foliodesc;
    u64 index;
};
Foliodesc: bottom 12 bits for flags, rest for PFN.
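The foliodesc layout described above (flags in the bottom 12 bits, PFN in the rest) can be packed and unpacked as follows. This is a minimal userspace sketch; the `toy_foliodesc_*` helper names and the `TOY_*` macros are hypothetical and do not come from the patches.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_FLAGS_BITS 12u
#define TOY_FLAGS_MASK ((UINT64_C(1) << TOY_FLAGS_BITS) - 1)

/* Combine a page frame number and flag bits into one 64-bit descriptor. */
static uint64_t toy_foliodesc_pack(uint64_t pfn, uint64_t flags)
{
    assert((flags & ~TOY_FLAGS_MASK) == 0);          /* flags fit in 12 bits */
    assert(pfn <= (UINT64_MAX >> TOY_FLAGS_BITS));   /* PFN fits in the rest */
    return (pfn << TOY_FLAGS_BITS) | flags;
}

/* Extract the page frame number (upper 52 bits). */
static uint64_t toy_foliodesc_pfn(uint64_t desc)
{
    return desc >> TOY_FLAGS_BITS;
}

/* Extract the flag bits (lower 12 bits). */
static uint64_t toy_foliodesc_flags(uint64_t desc)
{
    return desc & TOY_FLAGS_MASK;
}
```

With 4 KiB pages, shifting the PFN left by 12 yields the page-aligned physical address, whose low 12 bits are always zero; that is presumably why those bits are available for flags.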
4.15 VFIO, PCI, IOMMU, etc…
KHO is in mainline. See kernel/kexec_handover.c and
include/linux/kexec_handover.h.
LUO v4 was sent out a few days ago (see the patch posting). It is
starting to stabilize and is on the path to upstream soon.
memfd support will get merged with the LUO patches.
RFCs for PCI, VFIO, and IOMMU are out.
Supporting more subsystems: huge pages, VFIO, IOMMU, PCI, etc.
Implementing luod.
Improving performance for reboots.
Defining a mechanism for kernels to negotiate versions to enable rollback and
roll forward to a wider set of kernels.
Testing and validation.