The current vmm trace ring buffer implementation is in vmm_sol_dev.c. There is one global ring buffer with a maximum size of 1 MB. Each message in the ring buffer is 280 bytes, allowing 3744 messages in the buffer.
> ::sizeof vmm_trace_dmsg_t |=D
280
> ::print -t vmm_trace_dmsg_t
vmm_trace_dmsg_t {
    timespec_t timestamp {
        time_t tv_sec
        long tv_nsec
    }
    char [256] buf
    struct vmm_trace_dmsg *next
}
Questionable implementation decisions include:
- There is one global ring buffer, and messages are not tagged with the vmm instance that logged them.
- Each node in the buffer is dynamically allocated until the 1 MB limit is reached, leading to rather complicated allocation logic.
- The ring buffer memory is not freed when vmm unloads.
Currently there are no callers of vmm_trace_log(), which means what is present
is dead code.
The tracing facility needs to fulfill the following requirements:
- Each call must be relatively light-weight.
- Each message needs to be attributable to a particular instantiation of a virtual machine.
- Each message needs a time stamp.
- Each message needs a severity.
- Messages must survive the death of the VM and the non-global zone that contains the VM.
- Messages must be easily accessible through mdb.
- Messages must be retrievable via vmm_ioctl() or similar.
There will be a single global ring buffer that is allocated and initialized as the first virtual machine starts. The number of entries in the ring buffer will be tunable up until the first VM is started.
Trace messages are inserted into the global ring buffer with:
void vmm_trace_log(struct vm *vm, vmm_trace_flags_t severity,
const char *fmt, ...);
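For illustration only (there are no callers today), a call site might look like the following; the message text and the vcpuid and port variables are hypothetical:
    vmm_trace_log(vm, VTF_DEBUG, "vcpu %d: unhandled ioport 0x%x read",
        vcpuid, port);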
The number of trace messages kept is tunable until the first VM is started.
int vmm_trace_log_count = 5000; /* Arbitrary, about 1.5 MB */
The following symbols will be helpful for debugging:
- vmm_trace_node_t *vmm_debug_rbuf is the ring buffer.
- int vmm_debug_rbuf_entries is the number of elements in vmm_debug_rbuf.
- list_t vmm_trace_heads keeps track of per-VM vmm_trace_head_ts. This helps you find trace messages for VMs that no longer exist but still have messages in the ring buffer.
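As a rough sketch, under the assumptions that allocation happens from a vmm_trace_init() helper and that vmm_trace_head_t gains a vth_linkage list node (neither of which is specified above), the first-VM-start allocation could look like:

#include <sys/kmem.h>
#include <sys/list.h>
#include <sys/sysmacros.h>

int vmm_trace_log_count = 5000;     /* Tunable until the first VM starts */
vmm_trace_node_t *vmm_debug_rbuf;
int vmm_debug_rbuf_entries;
list_t vmm_trace_heads;

/* Hypothetical helper, called with the global trace lock held. */
static void
vmm_trace_init(void)
{
    /* Freeze the tunable and size the ring buffer from it. */
    vmm_debug_rbuf_entries = vmm_trace_log_count;
    vmm_debug_rbuf = kmem_zalloc(
        vmm_debug_rbuf_entries * sizeof (vmm_trace_node_t), KM_SLEEP);

    /* Assumes a vth_linkage list_node_t is added to vmm_trace_head_t. */
    list_create(&vmm_trace_heads, sizeof (vmm_trace_head_t),
        offsetof(vmm_trace_head_t, vth_linkage));
}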
Messages can be read by user space with:
int vm_get_next_log(struct vmctx *ctx, vmm_trace_flags_t severities,
vmm_trace_dmsg_t *msg);
Important declarations shared between user space and kernel space are:
typedef enum {
    /* logging levels */
    VTF_DEBUG = 0x01,
    VTF_ERROR = 0x02,
} vmm_trace_flags_t;
typedef struct vmm_trace_dmsg {
    timespec_t vtd_timestamp;
    char vtd_buf[256];              /* Always nul terminated */
    vmm_trace_flags_t vtd_flags;
} vmm_trace_dmsg_t;
Each ring buffer entry will be:
typedef struct vmm_trace_node {
    vmm_trace_dmsg_t vtn_msg;
    vmm_trace_head_t *vtd_vm_head;  /* Which VM was this from? */
    list_node_t vtd_vm_linkage;     /* Linkage on vtd_vm_head's message list */
} vmm_trace_node_t;
Each struct vm will reference a newly allocated vmm_trace_head_t.
vmm_trace_log() will update vtd_vm_head to reference the per-VM
vmm_trace_head_t and insert this vmm_trace_dmsg_t at the tail of a list in
vmm_trace_head_t.
typedef struct vmm_trace_head {
    vmm_t *vth_vm;                  /* If NULL, candidate for destruction */
    char vth_name[32];              /* VM name */
    zoneid_t vth_zoneid;
    zoneid_t vth_zonedid;           /* zone debug ID */
    list_head_t vth_msgs;
    kmutex_t vth_lock;
    kcondvar_t vth_cv;              /* To signal vm_get_next_log() */
    vmm_trace_dmsg_t *vth_nextmsg;  /* To help vm_get_next_log() */
    /* XXX included for discussion. Not sure how useful they are. */
    struct timeval vth_create_time;
    struct timeval vth_destroy_time;
} vmm_trace_head_t;
The vtd_flags field will be used to track logging levels and perhaps other
things. vmm_trace_log()'s severity argument will be masked to only accept
the logging severity bits.
Each vmm_trace_head_t is a member of the vmm_trace_heads linked list. As
the corresponding bhyve instance is being destroyed, the vmm_trace_head_t
becomes a candidate for destruction. It will only be destroyed when it no
longer references any messages.
There will be a global lock that is held during the following operations:
- During allocation of the ring buffer.
- As vmm_trace_log() selects the next vmm_trace_dmsg_t to use. It is not held as the vmm_trace_dmsg_t is updated (see the sketch after this list).
- As nodes are added to or removed from vmm_trace_heads.
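Continuing in the same kernel context, a sketch of vmm_trace_log() under those rules follows. The vmm_trace_lock and vmm_debug_rbuf_next names, the vmm_th_add_node()/vmm_th_remove_node() signatures, and the choice to fill in the message without the global lock held are assumptions, not settled interface:

static kmutex_t vmm_trace_lock;     /* Hypothetical global trace lock */
static int vmm_debug_rbuf_next;     /* Hypothetical next-slot index */

void
vmm_trace_log(struct vm *vm, vmm_trace_flags_t severity, const char *fmt, ...)
{
    vmm_trace_node_t *node;
    va_list ap;

    /* Mask off everything but the logging severity bits. */
    severity &= (VTF_DEBUG | VTF_ERROR);

    mutex_enter(&vmm_trace_lock);
    if (vmm_debug_rbuf == NULL)
        vmm_trace_init();
    /* Claim the next slot, wrapping when the end of the buffer is hit. */
    node = &vmm_debug_rbuf[vmm_debug_rbuf_next];
    vmm_debug_rbuf_next = (vmm_debug_rbuf_next + 1) % vmm_debug_rbuf_entries;
    mutex_exit(&vmm_trace_lock);

    /* A recycled slot is first unlinked from the VM that last used it. */
    if (node->vtd_vm_head != NULL)
        vmm_th_remove_node(node);

    /* The message itself is filled in without the global lock held. */
    gethrestime(&node->vtn_msg.vtd_timestamp);
    node->vtn_msg.vtd_flags = severity;
    va_start(ap, fmt);
    (void) vsnprintf(node->vtn_msg.vtd_buf,
        sizeof (node->vtn_msg.vtd_buf), fmt, ap);
    va_end(ap);

    /* Link the node onto this VM's vmm_trace_head_t under vth_lock. */
    vmm_th_add_node(vm, node);
}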
The vth_lock in a vmm_trace_head_t is held during the following operations:
- While the vm_get_next_log() ioctl is reading vth_nextmsg.
- While the vm_get_next_log() ioctl is copying a message.
- While a vmm_trace_node_t is being removed from or added to its vth_msgs list (e.g. via vmm_th_remove_node() and vmm_th_add_node()).
The vth_cv is signaled by vmm_th_add_node() when it changes vth_nextmsg
from NULL to a valid message. It will also be signaled as the VM is destroyed so that the ioctl
may return an error.
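A sketch of vmm_th_add_node() consistent with that is below; the function signature and the vm_trace_head() accessor are assumptions:

void
vmm_th_add_node(struct vm *vm, vmm_trace_node_t *node)
{
    /* vm_trace_head() is a hypothetical accessor for the per-VM head. */
    vmm_trace_head_t *vth = vm_trace_head(vm);

    mutex_enter(&vth->vth_lock);
    node->vtd_vm_head = vth;
    list_insert_tail(&vth->vth_msgs, node);
    /* Wake a vm_get_next_log() caller that had run out of messages. */
    if (vth->vth_nextmsg == NULL) {
        vth->vth_nextmsg = &node->vtn_msg;
        cv_signal(&vth->vth_cv);
    }
    mutex_exit(&vth->vth_lock);
}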
As vmm_th_remove_node() is removing a node, it will check to see if the
vmm_trace_head_t should be destroyed. It should be destroyed if vth_vm is
NULL and the vth_msgs list is empty.
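A matching sketch of vmm_th_remove_node(), including the destruction check, might look like this (advancing vth_nextmsg when its node is recycled is omitted for brevity, and the details beyond the names given above are assumptions):

void
vmm_th_remove_node(vmm_trace_node_t *node)
{
    vmm_trace_head_t *vth = node->vtd_vm_head;
    boolean_t destroy;

    mutex_enter(&vth->vth_lock);
    list_remove(&vth->vth_msgs, node);
    node->vtd_vm_head = NULL;
    /* Destroy the head once its VM is gone and no messages remain. */
    destroy = (vth->vth_vm == NULL && list_is_empty(&vth->vth_msgs));
    mutex_exit(&vth->vth_lock);

    if (destroy) {
        /* Unlinking from vmm_trace_heads requires the global lock. */
        mutex_enter(&vmm_trace_lock);
        list_remove(&vmm_trace_heads, vth);
        mutex_exit(&vmm_trace_lock);
        cv_destroy(&vth->vth_cv);
        mutex_destroy(&vth->vth_lock);
        kmem_free(vth, sizeof (*vth));
    }
}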
zhyve will gain a logging thread that will loop on calls to
vm_get_next_log(), writing the received messages to stderr. Per OS-6831,
these messages will find their way into the platform log at
<zonepath>/logs/platform.log.
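A sketch of that logging thread is below, assuming vm_get_next_log() blocks until a message is available and returns 0 on success and non-zero once the VM has been destroyed:

#include <stdio.h>

/* Hypothetical zhyve logging thread body; ctx is the bhyve vmctx handle. */
static void *
zhyve_log_thread(void *arg)
{
    struct vmctx *ctx = arg;
    vmm_trace_dmsg_t msg;

    /* Ask for both severities; finer filtering could be done here. */
    while (vm_get_next_log(ctx, VTF_DEBUG | VTF_ERROR, &msg) == 0) {
        (void) fprintf(stderr, "[%lld.%09ld] %s%s\n",
            (long long)msg.vtd_timestamp.tv_sec,
            (long)msg.vtd_timestamp.tv_nsec,
            (msg.vtd_flags & VTF_ERROR) ? "error: " : "",
            msg.vtd_buf);
    }
    return (NULL);
}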