@ddebroy
Created June 2, 2022 18:33
CRUST KEP

KEP-2857: Runtime Assisted Mounting of CSI Volumes

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Certain container runtime handlers (e.g. Hypervisor based runtimes like Kata) may prefer to manage the file system mount process associated with persistent volumes consumed by containers in a pod. Deferring the file system mount to a container runtime handler - in coordination with a CSI plugin - is desirable in scenarios where strong isolation between pods is critical but using Raw Block mode and changing existing workloads to perform the filesystem mount is burdensome. This KEP proposes a set of enhancements to enable coordination around mounting and management of the file system on persistent volumes between a CSI plugin and a container runtime handler.

Terminology

The following terms are used throughout the KEP. This section clarifies what they refer to, in order to avoid repeating the definitions later.

  • container runtime handler: the "low level" container runtime that a CRI runtime like containerd or CRI-O invokes. Examples: runc, runsc (gVisor), kata. A container runtime handler would typically not interact with the Kubelet directly using CRI. Instead, it consumes the OCI spec generated by a CRI runtime to create a pod sandbox and launch containers in that sandbox.
  • mounting a file system: invocation of the mount system call with a file system type and a set of options.
  • post mount configuration: in the scope of Kubernetes, this involves actions like: application of fsGroup GID ownership on files based on fsGroupChangePolicy, selinux relabelling and safely surfacing a subpath within the volume to bind mount into a container.
  • management of file system: in the scope of the current CSI spec, this involves: [1] retrieving filesystem stats and conditions associated with a mounted volume and [2] online expansion of the filesystem associated with a mounted volume. Later, as CSI evolves, this may involve other actions too (e.g. quiescing the file system for snapshots).
  • CRUST Interface: a new API/interface detailed in this KEP that allows Kubelet to invoke storage management APIs on a container runtime handler.

Motivation

This KEP is inspired by design proposals in the Kata community to avoid mounting the file system of a PV in the host while bringing up a pod (in a guest sandbox) that mounts the PV. The existing approach (without any changes to Kubelet and CSI) has the following drawbacks:

  • A CSI plugin has to directly invoke Kata specific interfaces and populate Kata specific configuration files. Thus, a CSI plugin has to be tightly integrated with a specific container runtime handler.
  • The overall mechanism does not support application of fsGroup, subpaths and specified selinux labels (in the pod spec) after the file system is mounted by the container runtime handler. Supporting these requires requesting pod details during CSI node publish, performing a pod spec lookup from the CSI plugin and passing the relevant fields to the container runtime handler.

A set of enhancements to CSI along with a new API between Kubelet and container runtime handlers (like Kata) overcomes the above drawbacks. These enable a CSI plugin and a container runtime handler to coordinate mounting, application of post mount configuration and management operations on the file system on persistent volumes in a generic fashion (without requiring runtime handler specific logic in CSI plugins or look-up of pod specs in CSI plugins).

Goals

  • Enable a CSI plugin and a container runtime handler to coordinate publishing of persistent storage backed by a block device to pods by deferring the file system mount and application of post mount configuration to the container runtime.
  • Enable a CSI plugin and a container runtime to coordinate management operations on the file system (e.g. collection of file system stats and expanding the file system while mounted)
  • Defer the file system mount and any post mount configuration specified in the pod spec only if the container runtime handler is capable of handling the specified file system and applying all specified post mount configurations.
  • Allow a CSI plugin, kubelet and CRI runtime to fall back to regular CSI node publish and application of post mount configuration when the container runtime handler cannot handle the file system or post mount configuration.
  • Avoid changes to CRI, OCI and CRI container runtimes like CRIO/containerd.

Non-Goals

  • Enable a CSI node plugin to be launched within a special runtime class (e.g. Kata). It is expected that CSI node plugin pods launch using a “default” runtime (like runc) and are not restricted from performing privileged operations on the host OS.
  • Falling back to the CSI plugin based node publish flow if runtime assisted paths for mounting and managing the file system fail.

Existing Solutions and Gaps

  • A pod using a microvm runtime (like Kata) can mount PVs backed by block devices as a raw block device to ensure the file system mount is managed within the pod sandbox environment. However this approach has the following shortcomings:
    • Every workload pod that mounts PVs backed by block devices needs to mount the file system (on the raw block PV). This requires modifications to the workload container logic. Therefore this is not a general purpose solution.
    • File system management operations like online expansion of the file system and reporting of FS stats cannot be performed at an infrastructural level if the file system mounts are managed by workload pods.
  • A pod using a microvm runtime (like Kata) may use a filesystem (e.g. virtiofs) to project the file-system on-disk (and mounted on the host) to the guest environment (as detailed in the diagram below). While this solves several use-cases, it is not desired due to the following factors:
    • Native file-system features: The workload may wish to use specific features and system calls supported by an on-disk file system (e.g. ext4/xfs) mounted on the host. However, plumbing some of these features/system calls/file controls through the intermediate “projection” file system (e.g. virtio-fs) and its dependent framework (e.g. FUSE) may require extra work, and they are therefore not supported until implemented and tested. Some examples of this incompatibility are: open_by_handle_at, xattr, F_SETLKW and fscrypt.
    • Safety and isolation: A compromised or malicious workload may try to exploit vulnerabilities in the file system of persistent volumes. Due to the inherently complex nature of file system modules running in kernel mode, they present a larger surface of attack relative to simpler/lower level modules in the disk/block stack. Therefore, it is safer to mount the file system within an isolated sandbox environment (like guest VM) rather than in the host environment as the blast radius of a file system kernel mode panic/DoS will be isolated to the sandbox rather than affecting the entire host. A recent example of a vulnerability in virtio-fs: https://nvd.nist.gov/vuln/detail/CVE-2020-10717
    • Performance: As pointed out here [slide 6], in a microvm environment, a block interface (such as virtio-blk) provides a faster path to sharing data between a guest and host relative to a file-system interface.

Current Mount Workflow

Proposal

Coordination between a CSI plugin and a container runtime handler around mounting and management of the file system on persistent volumes can be accomplished in various ways. This section provides an overview of the primary enhancements detailed in this KEP. Alternatives are listed in the Alternatives section, and the new APIs outlined here are specified fully in the Design Details section.

Runtime Assisted Mount Workflow

Container RUntime STorage (CRUST) Interface and support in Container Runtime Handlers

A new API: Container RUntime STorage (CRUST) Interface is proposed that roughly mirrors CSI node plugin APIs. The CRUST Interface will be used by the Kubelet to invoke file system mount and management operations on a container runtime handler that implements the interface. The range of operations will be similar to those invoked by Kubelet on CSI node plugins but the set of parameters will differ. Since CSI is implemented exclusively by storage plugins while the CRUST Interface is expected to be implemented by container runtime handlers, the APIs are kept independent in spite of similarities. Generally, the CRUST Interface is expected to "shadow" the CSI node plugin service APIs. The main operations that will be initially supported are: querying capabilities, mounting a file system and applying post mount configurations, querying file system stats and online expansion of the file system.

The CRUST Interface is not a part of CRI as the overall structure and target scenarios of CRI are different from those of the storage management focussed CRUST Interface. The CRUST Interface is generally expected to be implemented by a container runtime handler (like Kata) rather than a CRI capable runtime. If a CRI runtime does not depend on a lower level container runtime handler and wishes to support runtime assisted mounting and management of file systems on PVs, it may implement the CRUST Interface.

Container runtime handlers (like Kata) need to implement CRUST to deliver performance and security benefits associated with runtime assisted mounts of file systems on PVs. Specific enhancements to apply post mount configurations will be necessary if the container runtime handler wishes to support them. These include: checking fsGroup ownership based on a policy and applying fsGroup ownership when needed, surfacing subpaths from a mounted volume and relabelling the file system with specified selinux labels.
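As a rough sketch, a CRUST service shadowing the CSI node plugin service could take the following shape. The RPC names below beyond those already mentioned in this KEP (notably RuntimeExpandVolume), and all request/response message names, are illustrative assumptions rather than a finalized definition:

```protobuf
service Runtime {
  // Query the management capabilities of the runtime handler
  // (e.g. fsGroup application, subpath handling, stats, expansion).
  rpc RuntimeGetCapabilities(RuntimeGetCapabilitiesRequest)
      returns (RuntimeGetCapabilitiesResponse) {}

  // Query the file systems the runtime handler can mount in the sandbox.
  rpc RuntimeGetSupportedFileSystems(RuntimeGetSupportedFileSystemsRequest)
      returns (RuntimeGetSupportedFileSystemsResponse) {}

  // Mount a file system in the sandbox and apply post mount configuration.
  rpc RuntimePublishVolume(RuntimePublishVolumeRequest)
      returns (RuntimePublishVolumeResponse) {}

  // Retrieve stats and condition of a file system mounted in the sandbox.
  rpc RuntimeGetVolumeStats(RuntimeGetVolumeStatsRequest)
      returns (RuntimeGetVolumeStatsResponse) {}

  // Expand a file system mounted in the sandbox.
  rpc RuntimeExpandVolume(RuntimeExpandVolumeRequest)
      returns (RuntimeExpandVolumeResponse) {}
}
```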

API enhancements in Storage Class and Runtime Class to opt-in to the feature

A new optional field in StorageClass is proposed to specify whether kubelet will indicate to the CSI plugin that it can defer file system mount and management operations to the container runtime. If this field is not explicitly enabled, runtime assisted mounting will not take place on PVs associated with the storage class even if the CSI plugin is capable of deferring to a container runtime handler. This field should not be enabled for a storage class associated with a CSI plugin that is not capable of deferring to a container runtime handler.

A new optional field in RuntimeClass is proposed to specify a unix domain socket path on the host surfaced by a container runtime handler over which Kubelet may invoke CRUST Interface calls. If this field is empty or not specified, runtime assisted mounting will not take place for pods (mounting PVs) associated with the RuntimeClass.

Enhancements to CSI Node APIs and CSI Node Plugins

A CSI node plugin capable of deferring file system mounts should not perform any mount operations during NodeStageVolume; instead, it should perform the mount during NodePublishVolume, depending on whether mounts are deferred to a container runtime handler. Such plugins also need to implement support for the CSI API enhancements below.

A new capability and set of new fields are proposed in CSI Node API Request and Response messages to support deferral of mounting and management operations:

  • A new CSI node plugin capability will be introduced to allow a CSI node plugin to indicate to the cluster orchestrator (Kubelet) that it supports deferring mount and management operations to a container runtime handler.

  • New fields are proposed in CSI NodePublishVolumeRequest and NodePublishVolumeResponse. The new fields in NodePublishVolumeRequest are expected to be populated by the container orchestrator (Kubelet) and inspected by a CSI node plugin to determine whether it can defer file system mount to a container runtime handler. If a CSI node plugin wishes to defer file system mount to a container runtime handler, it is expected to populate new fields in CSI NodePublishVolumeResponse to indicate the file system type and mount options to use for mounting the file system.

  • New fields are proposed in CSI NodeGetVolumeStatsRequest and NodeExpandVolumeRequest for the Kubelet to indicate to a CSI plugin that the pod (that a PV is published to) is associated with a container runtime handler that supports file system management operations.

  • New fields are proposed in CSI NodeGetVolumeStatsResponse and NodeExpandVolumeResponse for a CSI plugin to indicate to the Kubelet that it should pass the request to the container runtime handler by specifying a source field corresponding to the volume. In the context of this KEP, this should always be the host path for a block device.

Enhancements to Kubelet

Kubelet will need enhancements to:

  • Inspect capabilities of container runtime handlers (through CRUST) and CSI node plugins (through CSI Node APIs) as well as new fields in RuntimeClass and StorageClass with respect to runtime assisted file system mount and management operations and perform actions based on those capabilities.
  • Invoke file system mount and management operations on container runtime handlers through the CRUST Interface when a CSI plugin indicates deferral of file system mount and management operations.
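The decision Kubelet makes before attempting a runtime assisted mount can be sketched as follows. This is a simplified, illustrative consolidation of the checks above; all type and field names are assumptions, and in the actual proposal the file system compatibility check is performed by the CSI plugin (which receives the supported list via NodePublishVolumeRequest) rather than by Kubelet:

```go
package main

import "fmt"

// deferralInputs gathers the signals Kubelet would consult; all names here
// are illustrative assumptions, not proposed API surface.
type deferralInputs struct {
	runtimeCRUSTSocket    string   // from RuntimeClass (empty => no CRUST support)
	runtimeFilesystems    []string // from CRUST RuntimeGetSupportedFileSystems
	runtimeHandlesFSGroup bool     // from CRUST RuntimeGetCapabilities
	scAllowsDeferral      bool     // from the StorageClass opt-in field
	csiSupportsDeferral   bool     // from CSI node plugin capabilities
	podUsesFSGroup        bool     // pod spec specifies a fsGroup
	volumeFilesystem      string   // file system the PV is formatted with
}

// canDeferMount returns true only if every condition for deferral holds;
// otherwise Kubelet falls back to the regular CSI managed mount path.
func canDeferMount(in deferralInputs) bool {
	if in.runtimeCRUSTSocket == "" || !in.scAllowsDeferral || !in.csiSupportsDeferral {
		return false
	}
	if in.podUsesFSGroup && !in.runtimeHandlesFSGroup {
		return false
	}
	for _, fs := range in.runtimeFilesystems {
		if fs == in.volumeFilesystem {
			return true
		}
	}
	return false
}

func main() {
	in := deferralInputs{
		runtimeCRUSTSocket:    "/run/kata/crust.sock",
		runtimeFilesystems:    []string{"ext4", "xfs"},
		runtimeHandlesFSGroup: true,
		scAllowsDeferral:      true,
		csiSupportsDeferral:   true,
		podUsesFSGroup:        true,
		volumeFilesystem:      "xfs",
	}
	fmt.Println(canDeferMount(in))
}
```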

User Stories (Optional)

In all the stories below, we consider a pod specifying a microvm runtimeclass, a fsGroup, a subpath (in container mounts) and a PVC. The pod will get scheduled on a node and Kubelet on that node will prepare the volume mounts. The PVC is already bound to a persistent volume backed by a block device and managed by a CSI plugin. The first scenario details a situation (and interactions) where all conditions are satisfied for the CSI plugin to defer file system mount and management operations to the container runtime handler. Next, a set of scenarios each describe the absence of one condition required for deferral of the file system mount, resulting in a fallback to CSI plugin managed file system mounts. Finally, a scenario describes lack of support for file system management operations in the container runtime handler, which results in those operations failing.

Story 1: Defer file system mount and management operations to a container runtime handler fully capable of handling mounting, post mount configuration and file system management operations

The CSI plugin will support deferral of file system mount and management operations to a container runtime handler. The storage class for the PV will have deferral of file system mounting and management operations enabled. The runtime handler in the runtime class will support the CRUST Interface and surface a UDS path over which to invoke the CRUST APIs. The UDS path will be specified in the runtimeclass. Following are the sequence of steps that will take place in the context of runtime assisted file system mounting and management operations:

  • Kubelet will analyze the runtimeclass specified by the pod, query the capabilities of the container runtime handler around supported file systems (through CRUST RuntimeGetSupportedFileSystems) and management operations (through CRUST RuntimeGetCapabilities) and retrieve the set of file systems and post mount configurations supported by the container runtime handler.
  • Kubelet will analyze the pod spec and determine that the container runtime handler can support application of all post mount configurations specified in the pod spec (checking and application of fsGroup and surfacing of Subpath).
  • Kubelet will analyze the storage class of the PV bound to the PVC and note the enablement of deferral of file system mounting and management operations. Based on this, Kubelet will query the capabilities of the associated CSI node plugin and note support for deferral of file system mounting and management operations by the CSI node plugin and get a positive response.
  • Kubelet will invoke CSI NodeStageVolume on the CSI node plugin.
  • The CSI plugin will stage the volume (ensuring it is formatted) but not mount the filesystem (associated with the PV) in the node host OS environment since it may need to defer filesystem mounts to the container runtime.
  • Kubelet will invoke CSI NodePublishVolume on the CSI node plugin. Through the new fields in the request, Kubelet will indicate to the CSI plugin that it can defer mounting of the file system to the container runtime handler, along with the set of file systems supported by the container runtime handler (retrieved earlier).
  • The CSI plugin will note that the container runtime handler supports the file system the volume is formatted with. It will pass the block device path (rather than a file system mount point) on the host along with file-system type and mount options to the Kubelet in response to NodePublishVolume.
  • Kubelet will invoke CRI RunSandbox on the CRI runtime.
  • The CRI runtime and the container runtime handler together will create the pod sandbox.
  • Kubelet will invoke CRUST RuntimePublishVolumeRequest on the container runtime handler. The parameters will include: the block device path in the host, the target publish path in the host, file system type (to mount), mount options, fsGroup GID and fsGroup change policy.
  • The container runtime handler will attach the block device to the sandbox environment and mount the specified file system along with the mount options on the virtual block device (that corresponds to the specified host device) in the sandbox environment. Next, the container runtime handler will check and apply the fsGroup ownership based on the specified fsGroupChangePolicy. Finally, the mapping between the target publish path in the host, block device path on the host and the virtual block device path in the sandbox environment will be saved by the container runtime handler for handling container mounts and file system management operations later.
  • Kubelet will invoke CRI CreateContainer on the CRI runtime and pass the paths to mount into the container without any security probes or validation of the subPath (as no filesystem is mounted on the host).
  • The CRI runtime will work with the container runtime handler to create the container. The container runtime handler will extract the subpath by matching the prefix of the source of the OCI mount spec to the target publish path of the volume saved off when handling RuntimePublishVolumeRequest. Next, the subpath will be probed and checked against symlink based escapes from the mount source (similar to the logic in Kubelet today). Finally, the evaluated bind mounts will be prepared and the container started.

Handling FileSystem Stats and Conditions while the pod runs:

  • Kubelet fsResourceAnalyzer will examine the runtimeclass specified by the configured pod, query the capabilities of the container runtime handler around file system management operations through CRUST RuntimeGetCapabilities and determine that the pod's container runtime handler supports querying filesystem stats and condition.
  • Kubelet will invoke CSI NodeGetVolumeStats on the CSI node plugin. Through the new fields in the request, Kubelet will indicate to the CSI plugin that it can defer stats and conditions for the file system to the container runtime handler.
  • The CSI plugin will pass the block device path on the host (corresponding to the volume) to the Kubelet in response.
  • Kubelet will invoke CRUST RuntimeGetVolumeStats on the container runtime handler with the block device path on the host as a parameter to identify the target.
  • The container runtime handler will map the host block device paths to the corresponding virtual device in the sandbox environment, query the file system stats and condition and populate the response.
  • Kubelet fsResourceAnalyzer will parse and store the file system stats and conditions from the container runtime handler (instead of CSI plugin) and publish events related to volume health if the condition is abnormal.

Handling FileSystem Expansion while the pod runs:

  • Kubelet will handle online file system expansion in a way identical to retrieval of file system stats described above. Handling file system expansion within the pod sandbox safely requires that the volume is mounted by a single pod on the node (using PVC access mode ReadWriteOncePod as described later).

When the pod terminates:

  • Kubelet will invoke CRI StopPodSandbox on the CRI runtime which will work with the container runtime handler to bring down the sandbox in preparation for removal.
  • The container runtime handler will dismount filesystems it has mounted in the sandbox environment and detach the block device from the sandbox environment.
  • Kubelet will invoke CSI NodeUnpublishVolume and NodeUnstageVolume on the CSI plugin.
  • The node CSI plugin will perform any clean up of state but skip dismounting the file system on the PV.

Story 2: Fallback to CSI plugin based mounting of file system due to lack of support for post mount configuration operations in container runtime handler

When Kubelet queries the capabilities of the container runtime handler around file system mount and management operations through CRUST RuntimeGetCapabilities, the container runtime handler reports that it does not support checking and application of fsGroup ownership or subPath handling. Since the pod spec specifies a fsGroup and subPath for container mounts, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.

Story 3: Fallback to CSI plugin based mounting of file system due to absence of runtimeclass field specifying UDS for CRUST interface

When Kubelet analyzes the runtimeclass specified by the pod, it finds that no UDS is specified over which to invoke the CRUST Interface on the container runtime handler. Therefore, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.

Story 4: Fallback to CSI plugin based mounting of file system due to lack of support for deferral of file system mount in CSI node plugin

When Kubelet queries the capabilities of the CSI node plugin, it finds that the plugin does not support deferral of file system mounting and management operations. Therefore, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.

Story 5: Fallback to CSI plugin based mounting of file system due to lack of support for a specific file system in container runtime handler

When Kubelet passes the list of file systems that the container runtime handler supports mounting (retrieved through CRUST RuntimeGetSupportedFileSystems) as part of CSI NodePublishVolume, the CSI plugin determines that the file system the volume is formatted with is not supported by the container runtime. Therefore, in response to NodePublishVolume, the CSI plugin will mount the file system (in the regular fashion) and not specify any runtime_mount_info. As a result, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.

Story 6: Failure to expand file system due to lack of support for file system resizing in container runtime handler

The CSI plugin will support deferral of file system mount and management operations to a container runtime handler. The storage class for the PV will have deferral of file system mounting and management operations enabled. The runtime handler in the runtime class will support the CRUST Interface and surface a UDS path over which to invoke the CRUST APIs. The UDS path will be specified in the runtimeclass. Mount and post mount operations will be deferred by the CSI plugin to the container runtime handler as described earlier and this will succeed resulting in successful pod startup.

While the pod runs, the PV is expanded.

  • Kubelet through CRUST RuntimeGetCapabilities will determine that the container runtime handler does not support file system expansion.
  • Kubelet will invoke CSI NodeExpandVolume indicating that the container runtime handler does not support file system expansion.
  • The CSI plugin will determine that the file system mount was deferred to the container runtime handler and respond with a failure indicating a failed precondition.
  • Kubelet will note the failed precondition and report an overall failure for the operation and not retry it.

Notes/Constraints/Caveats (Optional)

Workloads specifying runtimes capable of handling filesystem mounts should use PVCs with ReadWriteOncePod access mode (rather than ReadWriteOnce) if they are expected to write to the PV. If two pods use the same PVC with ReadWriteOnce and get scheduled on the same node, there is a risk of data corruption because regular block file systems like XFS or ext4 do not support parallel mounts. This does not apply if the filesystem to be mounted can support parallel mounts or the mounts are read-only.
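For illustration, such a workload's PVC might look like the following (the claim name and storage class name are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: microvm-data                       # illustrative name
spec:
  accessModes:
    - ReadWriteOncePod                     # only a single pod may mount the PV
  resources:
    requests:
      storage: 10Gi
  storageClassName: runtime-assisted-sc    # illustrative storage class
```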

A cluster admin can configure a webhook/OPA policy to restrict the set of access modes that can be specified on PVCs that are referred by pods associated with a microvm runtime capable of mounting and managing the filesystem on PVs.

A pod typically maps to the isolated sandbox environment in the context of microvm runtimes. Individual containers within the sandbox are not expected to be isolated from one another with the same guarantees that exist across pods. To align with these isolation goals, restrictions on multiple containers within a pod mounting the same PV are not necessary and are considered beyond the scope of this KEP.

A container runtime handler may declare support for a post mount configuration if application of the configuration is a no-op in the sandbox environment. For example, Kata guest kernels do not enforce selinux; therefore application of selinux labels specified in the pod spec is not necessary. So the Kata runtime may declare support for selinux relabelling and perform runtime assisted mounting of file systems on PVs referred by pods that specify explicit selinux labels or if the host environment has selinux enabled in enforcing mode.

Risks and Mitigations

Design Details

Enhancements to existing APIs

As summarized in the Proposal section above, coordination of mounting and management of the filesystem between a CSI plugin and a container runtime handler requires enhancements to multiple APIs and components of a Kubernetes cluster. This section delves into the details of each enhancement or addition.

Enhancements in RuntimeClass

A new field is necessary in RuntimeClass to specify a domain socket path (surfaced by the runtime handler) which Kubelet can use to invoke the CRUST APIs on the runtime handler. If the field is not specified, Kubelet will consider the container runtime handler unable to support runtime assisted mount and management of file systems.

type RuntimeClass struct {
	metav1.TypeMeta `json:",inline"`
	...
	// FSMgmtSocket specifies an absolute path on the host to a UNIX socket
	// surfaced by the Handler over which FileSystem Management APIs can be
	// invoked. Absence of this implies Handler will not support CRUST Interface
	// +optional
	FSMgmtSocket *string `json:"fsMgmtSocket,omitempty" protobuf:"bytes,5,opt,name=fsMgmtSocket"`
}
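For illustration, a RuntimeClass using the proposed field might look like this (the handler name and socket path are placeholders):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
fsMgmtSocket: /run/kata/crust.sock   # proposed field; illustrative path
```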

Enhancements in StorageClass

A new field is necessary in StorageClass to specify whether Kubelet will attempt to initiate runtime assisted mount with the CSI plugin associated with the storage class. This allows disabling runtime assisted mount and management of PVs associated with the StorageClass even if the CSI plugin is capable of supporting runtime assisted mount and management of volumes.

type StorageClass struct {
	metav1.TypeMeta `json:",inline"`
	...
	// AllowRuntimeAssistedMount specifies whether the storage class allows
	// a CSI plugin to defer file system mount and management to a container
	// runtime handler
	// +optional
	AllowRuntimeAssistedMount *bool `json:"allowRuntimeAssistedMount,omitempty" protobuf:"varint,9,opt,name=allowRuntimeAssistedMount"`
}
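For illustration, a StorageClass opting in to runtime assisted mounts might look like this (the provisioner name is a placeholder):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: runtime-assisted-sc
provisioner: example.csi.vendor.com   # illustrative CSI driver name
allowRuntimeAssistedMount: true       # proposed field
```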

Enhancements in CSI API Requests

New fields are necessary in the CSI node API requests: NodePublishVolumeRequest, NodeGetVolumeStatsRequest and NodeExpandVolumeRequest for the Kubelet (or another Container Orchestrator from the orchestrator agnostic CSI perspective) to indicate to a CSI plugin that the pod (that a PV is to be published to) is associated with a container runtime handler that supports mounting and management of the filesystem associated with the PV. Based on these fields, a CSI plugin can decide whether to defer handling of the operation to the container runtime (and populate the appropriate fields in the corresponding CSI API responses if it decides to defer).

Enhancements for NodePublishVolumeRequest: new field runtime_supported_filesystems to indicate the list of file systems the container runtime can support mounting in the sandbox environment.

message NodePublishVolumeRequest {
  // The ID of the volume to publish. This field is REQUIRED.
  string volume_id = 1;
  ...
  // Indicates file systems supported by the Container Runtime
  // Handler associated with the containers.
  // This field is OPTIONAL.
  repeated string runtime_supported_filesystems = 6 [(alpha_field) = true];
}

Enhancements for NodeGetVolumeStatsRequest: new field runtime_supported_stats

message NodeGetVolumeStatsRequest {
  // The ID of the volume. This field is REQUIRED.
  string volume_id = 1;
  ...
  // Indicates Container Runtime supports reporting stats and
  // condition of the file system on the volume
  // This field is OPTIONAL.
  bool runtime_supported_stats = 4 [(alpha_field) = true];
}

Enhancements for NodeExpandVolumeRequest: new field runtime_supports_expand

message NodeExpandVolumeRequest {
  // The ID of the volume to publish. This field is REQUIRED.
  string volume_id = 1;
  ...
  // Indicates Container Runtime supports expanding the file system
  // on the volume
  // This field is OPTIONAL.
  bool runtime_supports_expand = 7 [(alpha_field) = true];
}

Enhancements in CSI API Responses

Corresponding to the above enhancements in CSI API Requests, new fields are necessary in the CSI node API responses: NodePublishVolumeResponse, NodeGetVolumeStatsResponse and NodeVolumeExpandResponse. The CSI plugin needs to indicate to the Kubelet (or another Container Orchestrator from the orchestrator agnostic CSI perspective) that the runtime (associated with the pod mounting the PV) should be involved in processing of the API along with relevant parameters to identify the volume and process the API call.

Enhancements for NodePublishVolumeResponse: new optional field runtime_mount_info specifying details of how and what to mount on a block device or network share.

message FileSystemMountInfo {
  // Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
  // is deferred to a container runtime that is capable of executing the mount.
  // This field is REQUIRED
  string source = 1 [(alpha_field) = true];
  // Type of the filesystem to mount (e.g. xfs, ext4, nfs, ntfs) on the specified
  // source.
  // This field is REQUIRED
  string type = 2 [(alpha_field) = true];
  // Mount options supported by the filesystem to be used for the specified source.
  // This field is OPTIONAL.
  map<string, string> options = 3 [(alpha_field) = true];
}

message NodePublishVolumeResponse {
  // Specifies details of how to mount a file system on a source device or
  // network share when SP defers file system mounts to a container runtime.
  // A SP MUST populate this if runtime_supported_filesystems was set in
  // NodePublishVolumeRequest and SP is capable of deferring filesystem mount to
  // the container runtime.
  // This field is OPTIONAL
  FileSystemMountInfo runtime_mount_info = 1 [(alpha_field) = true];
}

In case of a block device backed PV, fields in NodePublishVolumeResponse would contain:

  • source set to the name of the block device that the runtime should mount. Example: /dev/sdf
  • options set to filesystem specific options to be passed to mount. Example: nobarrier
  • type set to the filesystem to use for mounting. Example: xfs or ext4 for Linux, ntfs for Windows

In case of a NFS backed PV, fields in NodePublishVolumeResponse would contain:

  • source set to the network share (NFS export path) that the runtime should mount. Example: srv1.net:/exported/path
  • options set to the NFS client options that need to be passed to mount.nfs
  • type set to nfs
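The two cases above share the same response shape. The sketch below is illustrative only: the struct mirrors the proposed FileSystemMountInfo message (the real Go type would be generated from the CSI protobuf), and the values are the examples from the bullets above.

```go
package main

import "fmt"

// FileSystemMountInfo mirrors the proposed CSI message for illustration;
// a real plugin would use the type generated from the CSI protobuf.
type FileSystemMountInfo struct {
	Source  string
	Type    string
	Options map[string]string
}

// deferredMountInfo sketches how a plugin might populate
// NodePublishVolumeResponse when deferring the mount to the runtime.
func deferredMountInfo(source, fsType string, options map[string]string) FileSystemMountInfo {
	return FileSystemMountInfo{Source: source, Type: fsType, Options: options}
}

func main() {
	block := deferredMountInfo("/dev/sdf", "xfs", map[string]string{"nobarrier": ""})
	nfs := deferredMountInfo("srv1.net:/exported/path", "nfs", map[string]string{"vers": "4.1"})
	fmt.Println(block.Source, block.Type)
	fmt.Println(nfs.Source, nfs.Type)
}
```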

Enhancements for NodeGetVolumeStatsResponse: new optional field source which, if populated, will be passed by the Kubelet (or another Container Orchestrator) to the container runtime to retrieve stats associated with the file system and its condition.

message NodeGetVolumeStatsResponse {
  ...
  // Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
  // is deferred to a container runtime that is capable of reporting stats for
  // filesystems mounted by it.
  // This field is OPTIONAL.
  // It SHOULD be populated by a SP if runtime_supported_stats was set in
  // NodeGetVolumeStatsRequest and SP is capable of deferring filesystem mount
  // and stats requests to the container runtime.
  // SP MUST NOT populate `usage` and `volume_conditions` fields if source is
  // specified indicating deferral of stats to the container runtime.
  string source = 3 [(alpha_field) = true];
}

Enhancements for NodeExpandVolumeResponse: new optional field source which, if populated, along with the existing field capacity_bytes, will be passed by Kubelet (or another Container Orchestrator) to the container runtime to expand the file system.

message NodeExpandVolumeResponse {
  ...
  // Source device (e.g. /dev/sdX) whose mount is deferred to a container runtime
  // that is capable of expanding the filesystems mounted by it.
  // This field is OPTIONAL.
  // It SHOULD be populated by a SP if runtime_supports_expand was set in
  // NodeExpandVolumeRequest and SP is capable of deferring filesystem mount
  // and expand requests to the container runtime.
  // SP MUST populate `capacity_bytes` field with the desired capacity if source
  // is specified indicating deferral of expansion to the container runtime.
  string source = 2 [(alpha_field) = true];
}

CRUST Interface: Container RUntime STorage Interface

A new API: Container RUntime STorage (CRUST) Interface is proposed that roughly mirrors CSI node plugin APIs. The CRUST Interface will be used by the Kubelet to invoke file system mount and management operations on a container runtime handler that implements the interface. Kubelet will use the UDS path specified in the FsMgmtSocket field mentioned above in RuntimeClass to invoke the CRUST APIs.

Implementing the CRUST API and surfacing the unix domain socket for invocation of the CRUST API is expected to be the responsibility of the OCI runtime handler, with no involvement expected from the CRI runtime.

service RuntimeAssistedStorageManagement {
  rpc RuntimeGetCapabilities(RuntimeGetCapabilitiesRequest)
    returns (RuntimeGetCapabilitiesResponse) {}
  rpc RuntimeGetSupportedFileSystems(RuntimeGetSupportedFileSystemsRequest)
    returns (RuntimeGetSupportedFileSystemsResponse) {}
  rpc RuntimePublishVolume (RuntimePublishVolumeRequest)
    returns (RuntimePublishVolumeResponse) {}
  rpc RuntimeGetVolumeStats (RuntimeGetVolumeStatsRequest)
    returns (RuntimeGetVolumeStatsResponse) {}
  rpc RuntimeExpandVolume (RuntimeExpandVolumeRequest)
    returns (RuntimeExpandVolumeResponse) {}
}

message RuntimeGetCapabilitiesRequest {
  // Intentionally empty.
}

message RuntimeGetCapabilitiesResponse {
  // All the storage management capabilities that the runtime handler supports.
  // This field is OPTIONAL.
  repeated RuntimeCapability capabilities = 1;
}

// Specifies a capability of the runtime handler around storage management
// Similar to https://github.com/container-storage-interface/spec/blob/5b0d4540158a260cb3347ef1c87ede8600afb9bf/csi.proto#L1493
message RuntimeCapability {
  message RPC {
    enum Type {
      UNKNOWN = 0;
      // Performs a full recursive check of FsGroup GID ownership of files
      // and applies FsGroup GID ownership to all files
      FS_GROUP_CHANGE_POLICY_ALWAYS = 2;
      // Performs a quick check of FsGroup GID ownership of root of the file
      // system and skips recursive FsGroup GID ownership application if a
      // match is found
      FS_GROUP_CHANGE_POLICY_ROOT_MISMATCH = 3;
      // Supports surfacing subpaths to containers from a mounted file system
      SUBPATH = 4;
      // Supports handling of file system with selinux labels and recursive
      // relabelling of files if necessary
      SELINUX_RELABEL = 5;
      // Supports handling of file system with selinux labels and passing the
      // selinux label as a mount option
      SELINUX_RELABEL_ON_MOUNT = 6;
      // Supports retrieval of volume stats from the file system
      VOLUME_STATS = 7;
      // Supports online resizing (expansion) of the file system
      VOLUME_RESIZE = 8;
    }
    Type type = 1;
  }
  oneof type {
    // RPC that the runtime supports.
    RPC rpc = 1;
  }
}

message RuntimeGetSupportedFileSystemsRequest {
  // Intentionally empty.
}

message RuntimeGetSupportedFileSystemsResponse {
  // A list of file systems that the runtime handler supports mounting
  // A CSI plugin will usually match the entries in the list to the
  // file system a volume is formatted with (potentially obtained using blkid
  // as in https://github.com/kubernetes/utils/blob/2afb4311ab10fb869c5d8c260b5e92b4bbde7f80/mount/mount_linux.go#L403 and
  // implemented in https://github.com/util-linux/util-linux/tree/441f9b9303d015f1777aec7168807d58feacca31/libblkid/src/superblocks)
  // Examples: ext3, ext4, xfs
  // A CSI plugin may also support a special/custom file system. If the runtime
  // handler wishes to support mounting that, the cluster administrator will
  // need to ensure the container runtime environment has the appropriate
  // software and configuration for the special file system and reports support
  // for it through this API.
  repeated string file_systems = 1;
}

message RuntimePublishVolumeRequest {
  // An ID assigned by the container runtime to the sandbox where the volume
  // needs to be published
  string sandbox_id = 1;

  // An ID to uniquely identify the volume in the host environment. In case of
  // a block device, this will be a path to the block device in the host
  // (e.g. /dev/sdx)
  string host_volume_id = 2;

  // The path on the host where the volume would have been mounted (if runtime
  // assisted mount was not enabled). This path should be used to map either
  // [1] source field in mounts structure in OCI config spec OR
  // [2] host_path fields in Mounts in ContainerConfig in CRI CreateContainer
  // to host_volume_id whose file system is mounted by the container runtime
  string host_target_path = 3;

  // The type of file system to be mounted on the volume
  string file_system = 4;

  // The mount options to use when mounting the file system
  repeated string mount_options = 5;

  // The supplemental gid that should own the file system so it can be shared
  // between different containers potentially running with different uids.
  int32 fsgroup_gid = 6;

  // The policy to use when checking and applying fsgroup ownership of the
  // mounted file system
  string fsgroup_policy = 7;
}
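Inside the sandbox, a runtime handler ultimately has to translate these request fields into a filesystem mount. The sketch below is a hypothetical, simplified translation of RuntimePublishVolumeRequest fields into mount(8)-style arguments; how a given runtime (e.g. Kata) actually performs the mount in its guest is implementation-specific.

```go
package main

import (
	"fmt"
	"strings"
)

// mountArgs sketches how a runtime handler might translate the fields of a
// RuntimePublishVolumeRequest into a mount invocation inside the sandbox:
// host_volume_id becomes the mount source, file_system the -t argument, and
// mount_options are joined into a single -o argument.
func mountArgs(hostVolumeID, fileSystem string, mountOptions []string, target string) []string {
	args := []string{"-t", fileSystem}
	if len(mountOptions) > 0 {
		args = append(args, "-o", strings.Join(mountOptions, ","))
	}
	return append(args, hostVolumeID, target)
}

func main() {
	fmt.Println(mountArgs("/dev/sdx", "xfs", []string{"nobarrier"}, "/mnt/vol"))
}
```

A real handler would additionally apply fsgroup_gid ownership according to fsgroup_policy after the mount succeeds.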

message RuntimePublishVolumeResponse {
  // Intentionally empty.
}

message RuntimeGetVolumeStatsRequest {
  // An ID assigned by the container runtime to the sandbox where the volume has
  // been published/mounted
  string sandbox_id = 1;

  // An ID to uniquely identify the volume in the host environment. In case of
  // a block device, this will be a path to the block device in the host
  // (e.g. /dev/sdx)
  string host_volume_id = 2;
}

message RuntimeGetVolumeStatsResponse {
  // Contents of this message should be aligned to CSI NodeGetVolumeStatsResponse
  repeated VolumeUsage usage = 1;

  // Information about the current condition of the volume.
  VolumeCondition volume_condition = 2;
}

message VolumeUsage {
  enum Unit {
    UNKNOWN = 0;
    BYTES = 1;
    INODES = 2;
  }
  // The available capacity in specified Unit.
  // The value of this field MUST NOT be negative.
  int64 available = 1;

  // The total capacity in specified Unit.
  // The value of this field MUST NOT be negative.
  int64 total = 2;

  // The used capacity in specified Unit.
  // The value of this field MUST NOT be negative.
  int64 used = 3;

  // Units by which values are measured.
  Unit unit = 4;
}
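As one way to fill this message, a runtime handler on Linux could derive the BYTES usage entry from statfs(2) on the mount point inside the sandbox. The sketch below is illustrative and Linux-specific; the struct mirrors the proposed VolumeUsage message rather than any generated type.

```go
package main

import (
	"fmt"
	"syscall"
)

// VolumeUsage mirrors the proposed CRUST message for illustration.
type VolumeUsage struct {
	Available int64
	Total     int64
	Used      int64
	Unit      string
}

// bytesUsage sketches how a Linux runtime handler might compute the BYTES
// entry of RuntimeGetVolumeStatsResponse via statfs(2) on the mount path:
// total from Blocks, used from Blocks-Bfree, available from Bavail.
func bytesUsage(path string) (VolumeUsage, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return VolumeUsage{}, err
	}
	bsize := int64(st.Bsize)
	return VolumeUsage{
		Available: int64(st.Bavail) * bsize,
		Total:     int64(st.Blocks) * bsize,
		Used:      int64(st.Blocks-st.Bfree) * bsize,
		Unit:      "BYTES",
	}, nil
}

func main() {
	u, err := bytesUsage("/")
	if err != nil {
		panic(err)
	}
	fmt.Printf("total=%d used=%d available=%d\n", u.Total, u.Used, u.Available)
}
```

An analogous call with the Files/Ffree fields of statfs would populate the INODES entry.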

// VolumeCondition represents the current condition of a volume.
message VolumeCondition {

  // Normal volumes are available for use and operating optimally.
  // An abnormal volume does not meet these criteria.
  bool abnormal = 1;

  // The message describing the condition of the volume.
  string message = 2;
}

message RuntimeExpandVolumeRequest {
  // An ID assigned by the container runtime to the sandbox where the volume has
  // been published/mounted
  string sandbox_id = 1;

  // An ID to uniquely identify the volume in the host environment. In case of
  // a block device, this will be a path to the block device in the host
  // (e.g. /dev/sdx)
  string host_volume_id = 2;

  // The desired size of the expanded file system.
  int64 required_bytes = 3;
}

message RuntimeExpandVolumeResponse {
  // The capacity of the volume in bytes after file system expansion.
  int64 capacity_bytes = 1;
}
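A runtime handler implementing RuntimeExpandVolume needs to grow the filesystem online with the appropriate userspace tool. The sketch below is a hypothetical dispatcher over conventional Linux utilities (resize2fs for ext2/3/4, xfs_growfs for xfs); a real handler would also verify the tools are present in the sandbox image and check the resulting size against required_bytes.

```go
package main

import (
	"fmt"
	"strings"
)

// resizeCommand sketches how a runtime handler might choose the command for
// online filesystem expansion based on the filesystem type. Note that
// xfs_growfs operates on the mount point while resize2fs takes the device.
func resizeCommand(fsType, device, mountPath string) ([]string, error) {
	switch {
	case strings.HasPrefix(fsType, "ext"):
		return []string{"resize2fs", device}, nil
	case fsType == "xfs":
		return []string{"xfs_growfs", mountPath}, nil
	default:
		return nil, fmt.Errorf("online expansion not supported for %q", fsType)
	}
}

func main() {
	cmd, _ := resizeCommand("ext4", "/dev/sdx", "/mnt/vol")
	fmt.Println(cmd)
}
```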

Limitations of the Design

In case of certain shared FS scenarios (like SMB), secrets associated with mounting the FS may need to be passed to the OCI runtime to enable it to authenticate. However, the configuration containing the OCI mount options may be persisted on the host file system by the CRI runtime (as described here in case of containerd). Depending on the security posture of the host, enabling runtime assisted mounts may not be recommended where persisting secrets on the host file system is undesirable. Future enhancements in container runtimes to pass the OCI spec in memory will address this limitation.

Test Plan

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
Does enabling the feature change any default behavior?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
What happens if we reenable the feature if it was previously rolled back?
Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Alternative 1 - Skipping Kubelet involvement during mount deferral through metadata files

A CSI plugin can skip mounting of the Filesystem and pass the mount info to the runtime through a specially named metadata file placed in the publish path that gets passed to the runtime today as mount source. This approach does not require any changes to Kubelet and CSI spec. However, this approach does not provide a definitive way for a runtime to determine the authenticity of the metadata file. A legitimate CSI plugin can be made to surface a malicious metadata file for the container runtime to consume. Through this, an unauthorized user may get access to an arbitrary block device on the host or NFS share that the container runtime can mount based on the information in the malicious metadata file.

This approach was the starting point of discussions in the Kata community (kata-containers/kata-containers#1568) that led to this KEP.

Alternative 2 - Integrate CRUST Interface APIs into CRI and Shim APIs

The CRUST Interface APIs could be integrated with the CRI and Shim APIs. However, this would result in a major expansion of the scope of CRI: CRI runtimes like containerd and CRI-O would need to be made aware of the operations in the CRUST Interface, which is unnecessary.

Infrastructure Needed (Optional)
