- Release Signoff Checklist
- Summary
- Terminology
- Motivation
- Proposal
- Container RUntime STorage (CRUST) Interface and support in Container Runtime Handlers
- API enhancements in Storage Class and Runtime Class to opt-in to the feature
- Enhancements to CSI Node APIs and CSI Node Plugins
- Enhancements to Kubelet
- User Stories (Optional)
- Story 1: Defer file system mount and management operations to a container runtime handler fully capable of handling mounting, post mount configuration and file system management operations
- Story 2: Fallback to CSI plugin based mounting of file system due to lack of support for post mount configuration operations in container runtime handler
- Story 3: Fallback to CSI plugin based mounting of file system due to absence of runtimeclass field specifying UDS for CRUST interface
- Story 4: Fallback to CSI plugin based mounting of file system due to lack of support for deferral of file system mount in CSI node plugin
- Story 5: Fallback to CSI plugin based mounting of file system due to lack of support for a specific file system in container runtime handler
- Story 6: Failure to expand file system due to lack of support for file system resizing in container runtime handler
- Notes/Constraints/Caveats (Optional)
- Risks and Mitigations
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Certain container runtime handlers (e.g. Hypervisor based runtimes like Kata) may prefer to manage the file system mount process associated with persistent volumes consumed by containers in a pod. Deferring the file system mount to a container runtime handler - in coordination with a CSI plugin - is desirable in scenarios where strong isolation between pods is critical but using Raw Block mode and changing existing workloads to perform the filesystem mount is burdensome. This KEP proposes a set of enhancements to enable coordination around mounting and management of the file system on persistent volumes between a CSI plugin and a container runtime handler.
The following terms are used throughout the KEP. This section clarifies what they refer to in order to avoid repetition.
- container runtime handler: the "low level" container runtime that a CRI runtime like containerd or CRIO invokes. Examples: runc, runsc (gVisor), kata. A container runtime handler typically does not directly interact with the Kubelet using CRI. Instead, it consumes the OCI spec generated by a CRI runtime to create a pod sandbox and launch containers in that sandbox.
- mounting a file system: invocation of the mount system call with a file system type and a set of options.
- post mount configuration: in the scope of Kubernetes, this involves actions like: application of fsGroup GID ownership on files based on fsGroupChangePolicy, selinux relabelling and safely surfacing a subpath within the volume to bind mount into a container.
- management of file system: in the scope of the current CSI spec, this involves: [1] retrieving filesystem stats and conditions associated with a mounted volume and [2] online expansion of the filesystem associated with a mounted volume. Later, as CSI evolves, this may involve other actions too (e.g. quiescing the file system for snapshots).
- CRUST Interface: a new API/interface detailed in this KEP that allows Kubelet to invoke storage management APIs on a container runtime handler.
This KEP is inspired by design proposals in the Kata community to avoid mounting the file system of a PV in the host while bringing up a pod (in a guest sandbox) that mounts the PV. The existing approach (without any changes to Kubelet and CSI) has the following drawbacks:
- A CSI plugin has to directly invoke Kata specific interfaces and populate Kata specific configuration files. Thus, a CSI plugin has to be tightly integrated with a specific container runtime handler.
- The overall mechanism does not support application of fsGroup, subpaths and specified selinux labels (in the pod spec) after the file system is mounted by the container runtime handler. Supporting these requires requesting pod details during CSI node publish, performing a pod spec lookup from the CSI plugin and passing the relevant fields to the container runtime handler.
A set of enhancements to CSI along with a new API between Kubelet and container runtime handlers (like Kata) overcomes the above drawbacks. These enable a CSI plugin and a container runtime handler to coordinate mounting, application of post mount configuration and management operations on the file system on persistent volumes in a generic fashion (without requiring runtime handler specific logic in CSI plugins or look-up of pod specs in CSI plugins).
- Enable a CSI plugin and a container runtime handler to coordinate publishing of persistent storage backed by a block device to pods by deferring the file system mount and application of post mount configuration to the container runtime.
- Enable a CSI plugin and a container runtime to coordinate management operations on the file system (e.g. collection of file system stats and expanding the file system while mounted)
- Defer the file system mount and any post mount configuration specified in the pod spec only if the container runtime handler is capable of handling the specified file system and applying all specified post mount configurations.
- Allow a CSI plugin, kubelet and CRI runtime to fall back to regular CSI node publish and application of post mount configuration when the container runtime handler cannot handle the file system or post mount configuration.
- Avoid changes to CRI, OCI and CRI container runtimes like CRIO/containerd.
- Enable a CSI node plugin to be launched within a special runtime class (e.g. Kata). It is expected that CSI node plugin pods launch using a "default" runtime (like runc) and are not restricted from performing privileged operations on the host OS.
- Falling back to the CSI plugin based node publish flow if runtime assisted paths for mounting and managing the file system fail.
- A pod using a microvm runtime (like Kata) can mount PVs backed by block devices as a raw block device to ensure the file system mount is managed within the pod sandbox environment. However, this approach has the following shortcomings:
- Every workload pod that mounts PVs backed by block devices needs to mount the file system (on the raw block PV). This requires modifications to the workload container logic. Therefore this is not a general purpose solution.
- File system management operations like online expansion of the file system and reporting of FS stats cannot be performed at an infrastructural level if the file system mounts are managed by workload pods.
- A pod using a microvm runtime (like Kata) may use a filesystem (e.g. virtiofs) to project the file-system on-disk (and mounted on the host) to the guest environment (as detailed in the diagram below). While this solves several use-cases, it is not desired due to the following factors:
- Native file-system features: The workload may wish to use specific features and system calls supported by an on-disk file system (e.g. ext4/xfs) mounted on the host. However, plumbing some of these features/system calls/file controls through the intermediate "projection" file system (e.g. virtio-fs) and its dependent framework (e.g. FUSE) may require extra work and is therefore not supported until implemented and tested. Some examples of this incompatibility are: open_by_handle_at, xattr, F_SETLKW and fscrypt.
- Safety and isolation: A compromised or malicious workload may try to exploit vulnerabilities in the file system of persistent volumes. Due to the inherently complex nature of file system modules running in kernel mode, they present a larger surface of attack relative to simpler/lower level modules in the disk/block stack. Therefore, it is safer to mount the file system within an isolated sandbox environment (like guest VM) rather than in the host environment as the blast radius of a file system kernel mode panic/DoS will be isolated to the sandbox rather than affecting the entire host. A recent example of a vulnerability in virtio-fs: https://nvd.nist.gov/vuln/detail/CVE-2020-10717
- Performance: As pointed out here [slide 6], in a microvm environment, a block interface (such as virtio-blk) provides a faster path to sharing data between a guest and host relative to a file-system interface.
Coordination between a CSI Plugin and a container runtime handler around mounting and management of the file system on persistent volumes can be accomplished in various ways. This section provides an overview of the primary enhancements detailed in this KEP. Various alternatives are listed further below in the Alternatives section. Details of the new APIs outlined below are specified in the Design Details section down below.
A new API: Container RUntime STorage (CRUST) Interface is proposed that roughly mirrors CSI node plugin APIs. The CRUST Interface will be used by the Kubelet to invoke file system mount and management operations on a container runtime handler that implements the interface. The range of operations will be similar to those invoked by Kubelet on CSI node plugins but the set of parameters will differ. Since CSI is implemented exclusively by storage plugins while the CRUST Interface is expected to be implemented by container runtime handlers, the APIs are kept independent in spite of similarities. Generally, the CRUST Interface is expected to "shadow" the CSI node plugin service APIs. The main operations that will be initially supported are: querying capabilities, mounting a file system and applying post mount configurations, querying file system stats and online expansion of the file system.
The CRUST Interface is not a part of CRI, as the overall structure and target scenarios of CRI are different from those of the storage management focused CRUST Interface. The CRUST Interface is generally expected to be implemented by a container runtime handler (like Kata) rather than a CRI capable runtime. If a CRI runtime does not depend on a lower level container runtime handler and wishes to support runtime assisted mounting and management of file systems on PVs, it may implement the CRUST Interface.
Container runtime handlers (like Kata) need to implement CRUST to deliver performance and security benefits associated with runtime assisted mounts of file systems on PVs. Specific enhancements to apply post mount configurations will be necessary if the container runtime handler wishes to support them. These include: checking fsGroup ownership based on a policy and applying fsGroup ownership when needed, surfacing subpaths from a mounted volume and relabelling the file system with specified selinux labels.
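The shape of the CRUST service is easiest to see as a Go interface. The sketch below is purely illustrative: the method names follow the CRUST calls referenced later in this KEP, but the capability constants, parameter lists and the `fakeRuntime` stand-in are assumptions of this sketch, not the final protobuf contract.

```go
package main

import "fmt"

// Capability is a hypothetical CRUST capability flag; the real set would be
// defined by the CRUST protobuf.
type Capability string

const (
	CapFSGroup   Capability = "FSGROUP"
	CapSubPath   Capability = "SUBPATH"
	CapStats     Capability = "VOLUME_STATS"
	CapExpansion Capability = "EXPAND_VOLUME"
)

// CRUSTRuntime shadows the CSI node service in spirit: capability queries,
// mount with post mount configuration, stats and online expansion.
type CRUSTRuntime interface {
	RuntimeGetCapabilities() []Capability
	RuntimeGetSupportedFileSystems() []string
	RuntimePublishVolume(devicePath, targetPath, fsType string, mountOptions []string) error
	RuntimeGetVolumeStats(devicePath string) (usedBytes, totalBytes int64, err error)
	RuntimeExpandVolume(devicePath string) error
}

// fakeRuntime is a stand-in for a handler like Kata, used only to show how
// Kubelet-side code would consume the interface.
type fakeRuntime struct{}

func (fakeRuntime) RuntimeGetCapabilities() []Capability {
	return []Capability{CapFSGroup, CapSubPath, CapStats}
}
func (fakeRuntime) RuntimeGetSupportedFileSystems() []string { return []string{"ext4", "xfs"} }
func (fakeRuntime) RuntimePublishVolume(device, target, fsType string, opts []string) error {
	return nil
}
func (fakeRuntime) RuntimeGetVolumeStats(device string) (int64, int64, error) { return 0, 1 << 30, nil }
func (fakeRuntime) RuntimeExpandVolume(device string) error {
	return fmt.Errorf("expansion not supported")
}

// supportsAll reports whether the handler advertises every required capability.
func supportsAll(r CRUSTRuntime, required []Capability) bool {
	have := map[Capability]bool{}
	for _, c := range r.RuntimeGetCapabilities() {
		have[c] = true
	}
	for _, c := range required {
		if !have[c] {
			return false
		}
	}
	return true
}

func main() {
	r := fakeRuntime{}
	fmt.Println(supportsAll(r, []Capability{CapFSGroup, CapSubPath})) // true
	fmt.Println(supportsAll(r, []Capability{CapExpansion}))           // false
}
```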
A new optional field in StorageClass is proposed to specify whether kubelet will indicate to the CSI plugin that it can defer file system mount and management operations to the container runtime. If this field is not explicitly enabled, runtime assisted mounting will not take place on PVs associated with the storage class even if the CSI plugin is capable of deferring to a container runtime handler. This field should not be enabled for a storage class associated with a CSI plugin that is not capable of deferring to a container runtime handler.
A new optional field in RuntimeClass is proposed to specify a unix domain socket path on the host surfaced by a container runtime handler over which Kubelet may invoke CRUST Interface calls. If this field is empty or not specified, runtime assisted mounting will not take place for pods (mounting PVs) associated with the RuntimeClass.
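Taken together, the two new fields gate the feature from both the storage side and the runtime side. The following is a minimal sketch of the check Kubelet could perform; the field names come from the Design Details section of this KEP, while the helper name `runtimeAssistedMountPossible` and the socket path in the example are hypothetical.

```go
package main

import "fmt"

// Minimal stand-ins for the API objects; only the fields relevant to this
// KEP are shown.
type RuntimeClass struct {
	// FSMgmtSocket is the UDS path over which Kubelet invokes CRUST calls.
	FSMgmtSocket *string
}

type StorageClass struct {
	// AllowRuntimeAssistedMount opts PVs of this class into deferral.
	AllowRuntimeAssistedMount *bool
}

// runtimeAssistedMountPossible reports whether Kubelet should even attempt
// the deferral handshake: both the RuntimeClass and the StorageClass must
// opt in. Absence of either field disables the feature.
func runtimeAssistedMountPossible(rc *RuntimeClass, sc *StorageClass) bool {
	if rc == nil || rc.FSMgmtSocket == nil || *rc.FSMgmtSocket == "" {
		return false
	}
	if sc == nil || sc.AllowRuntimeAssistedMount == nil || !*sc.AllowRuntimeAssistedMount {
		return false
	}
	return true
}

func main() {
	sock := "/run/kata/fs-mgmt.sock" // hypothetical socket path
	yes := true
	fmt.Println(runtimeAssistedMountPossible(
		&RuntimeClass{FSMgmtSocket: &sock},
		&StorageClass{AllowRuntimeAssistedMount: &yes})) // true
	fmt.Println(runtimeAssistedMountPossible(
		&RuntimeClass{},
		&StorageClass{AllowRuntimeAssistedMount: &yes})) // false: no UDS
}
```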
A CSI node plugin capable of deferring file system mounts should not perform any mount operations during `NodeStageVolume` and should instead perform them during `NodePublishVolume` based on whether to defer mounts to a container runtime handler. Such plugins also need to implement support for the CSI API enhancements below.
A new capability and set of new fields are proposed in CSI Node API Request and Response messages to support deferral of mounting and management operations:
- A new CSI node plugin capability will be introduced to allow a CSI node plugin to indicate to the cluster orchestrator (Kubelet) that it supports deferring mount and management operations to a container runtime handler.
- New fields are proposed in CSI `NodePublishVolumeRequest` and `NodePublishVolumeResponse`. The new fields in `NodePublishVolumeRequest` are expected to be populated by the container orchestrator (Kubelet) and inspected by a CSI node plugin to determine whether it can defer file system mount to a container runtime handler. If a CSI node plugin wishes to defer file system mount to a container runtime handler, it is expected to populate new fields in CSI `NodePublishVolumeResponse` to indicate the file system type and mount options to use for mounting the file system.
- New fields are proposed in CSI `NodeGetVolumeStatsRequest` and `NodeExpandVolumeRequest` for the Kubelet to indicate to a CSI plugin that the pod (that a PV is published to) is associated with a container runtime handler that supports file system management operations.
- New fields are proposed in CSI `NodeGetVolumeStatsResponse` and `NodeExpandVolumeResponse` for a CSI plugin to indicate to the Kubelet that it should pass the request to the container runtime handler by specifying a `source` field corresponding to the volume. In the context of this KEP, this should always be the host path for a block device.
Besides implementing support for the above, a CSI node plugin capable of deferring file system mounts should not perform any mount operations during `NodeStageVolume` and should instead perform them during `NodePublishVolume` based on whether to defer mounts to a container runtime handler.
Kubelet will need enhancements to:
- Inspect capabilities of container runtime handlers (through CRUST) and CSI node plugins (through CSI Node APIs) as well as new fields in RuntimeClass and StorageClass with respect to runtime assisted file system mount and management operations and perform actions based on those capabilities.
- Invoke file system mount and management operations on container runtime handlers through the CRUST Interface when a CSI plugin indicates deferral of file system mount and management operations.
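The first of these checks amounts to a set intersection: every post mount configuration the pod spec requires must appear in the set the handler reports. A sketch, assuming the requirements are reduced to string flags (the flag names and helper functions here are illustrative, not part of the CRUST protobuf):

```go
package main

import "fmt"

// requiredPostMountConfigs derives, from a simplified view of the pod spec,
// the post mount configurations the runtime handler must support before
// Kubelet may defer the mount.
func requiredPostMountConfigs(hasFSGroup, hasSubPath, hasSELinuxLabels bool) []string {
	var req []string
	if hasFSGroup {
		req = append(req, "FSGROUP")
	}
	if hasSubPath {
		req = append(req, "SUBPATH")
	}
	if hasSELinuxLabels {
		req = append(req, "SELINUX_RELABEL")
	}
	return req
}

// canDefer reports whether every required configuration is in the
// handler-reported supported set; otherwise Kubelet falls back to the
// regular CSI plugin managed mount.
func canDefer(required, supported []string) bool {
	have := map[string]bool{}
	for _, s := range supported {
		have[s] = true
	}
	for _, r := range required {
		if !have[r] {
			return false
		}
	}
	return true
}

func main() {
	req := requiredPostMountConfigs(true, true, false) // pod uses fsGroup and a subPath
	fmt.Println(canDefer(req, []string{"FSGROUP", "SUBPATH", "SELINUX_RELABEL"})) // true
	fmt.Println(canDefer(req, []string{"FSGROUP"}))                               // false
}
```

When `canDefer` returns false, Kubelet proceeds with the regular CSI node publish flow, as described in the fallback user stories below.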
In all the stories below, we consider a pod specifying a microvm runtimeclass, a fsGroup, a subpath (in container mounts) and a PVC. The pod will get scheduled on a node and Kubelet on that node will prepare the volume mounts. The PVC is already bound to a persistent volume backed by a block device and managed by a CSI plugin. The first scenario details a situation (and interactions) where all conditions are satisfied for the CSI plugin to defer file system mount and management operations to the container runtime handler. Next, a set of scenarios describes how the absence of a specific condition required for deferral of the file system mount to the container runtime handler results in falling back to a CSI plugin managed file system mount. Finally, a scenario describes lack of support for file system management operations in the container runtime handler, which will result in those operations failing.
Story 1: Defer file system mount and management operations to a container runtime handler fully capable of handling mounting, post mount configuration and file system management operations
The CSI plugin will support deferral of file system mount and management operations to a container runtime handler. The storage class for the PV will have deferral of file system mounting and management operations enabled. The runtime handler in the runtime class will support the CRUST Interface and surface a UDS path over which to invoke the CRUST APIs. The UDS path will be specified in the runtimeclass. The following sequence of steps will take place in the context of runtime assisted file system mounting and management operations:
- Kubelet will analyze the runtimeclass specified by the pod, query the capabilities of the container runtime handler around supported file systems (through CRUST `RuntimeGetSupportedFileSystems`) and management operations (through CRUST `RuntimeGetCapabilities`) and retrieve the set of file systems and post mount configurations supported by the container runtime handler.
- Kubelet will analyze the pod spec and determine that the container runtime handler can support application of all post mount configurations specified in the pod spec (checking and application of fsGroup and surfacing of Subpath).
- Kubelet will analyze the storage class of the PV bound to the PVC and note the enablement of deferral of file system mounting and management operations. Based on this, Kubelet will query the capabilities of the associated CSI node plugin, note support for deferral of file system mounting and management operations by the CSI node plugin and get a positive response.
- Kubelet will invoke CSI `NodeStageVolume` on the CSI node plugin.
- The CSI plugin will stage the volume (ensuring it is formatted) but not mount the filesystem (associated with the PV) in the node host OS environment since it may need to defer filesystem mounts to the container runtime.
- Kubelet will invoke CSI `NodePublishVolume` on the CSI node plugin. Through the new fields in the request, Kubelet will indicate to the CSI plugin that it can defer mounting of the file system to the container runtime handler, along with the set of file systems supported by the container runtime handler (retrieved earlier).
- The CSI plugin will note that the container runtime handler supports the file system the volume is formatted with. It will pass the block device path (rather than a file system mount point) on the host along with the file-system type and mount options to the Kubelet in response to `NodePublishVolume`.
- Kubelet will invoke CRI `RunPodSandbox` on the CRI runtime.
- The CRI runtime and the container runtime handler together will create the pod sandbox.
- Kubelet will invoke CRUST `RuntimePublishVolumeRequest` on the container runtime handler. The parameters will include: the block device path in the host, the target publish path in the host, the file system type (to mount), mount options, fsGroup GID and fsGroup change policy.
- The container runtime handler will attach the block device to the sandbox environment and mount the specified file system along with the mount options on the virtual block device (that corresponds to the specified host device) in the sandbox environment. Next, the container runtime handler will check and apply the fsGroup ownership based on the specified fsGroupChangePolicy. Finally, the mapping between the target publish path in the host, the block device path on the host and the virtual block device path in the sandbox environment will be saved by the container runtime handler for handling container mounts and file system management operations later.
- Kubelet will invoke CRI `CreateContainer` on the CRI runtime and pass the paths to mount into the container without any security probes or validation of the subPath (as no filesystem is mounted on the host).
- The CRI runtime will work with the container runtime handler to create the container. The container runtime handler will extract the subpath by matching the prefix of the source of the OCI mount spec to the target publish path of the volume saved off when handling `RuntimePublishVolumeRequest`. Next, the subpath will be probed and checked against symlink based escapes from the mount source (similar to the logic in Kubelet today). Finally, the evaluated bind mounts will be prepared and the container started.
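The prefix matching in the last step can be sketched as follows. This is an illustration of the described matching only; the paths used are hypothetical, and the symlink probing performed inside the sandbox is out of scope for the sketch.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extractSubPath returns the volume-relative subpath of an OCI mount source
// whose prefix matches the saved target publish path. It returns false when
// the source does not fall under the target publish path at all.
func extractSubPath(ociMountSource, targetPublishPath string) (string, bool) {
	cleanSrc := filepath.Clean(ociMountSource)
	cleanTarget := filepath.Clean(targetPublishPath)
	if cleanSrc == cleanTarget {
		return ".", true // the whole volume is mounted, no subpath
	}
	prefix := cleanTarget + string(filepath.Separator)
	if !strings.HasPrefix(cleanSrc, prefix) {
		return "", false
	}
	return strings.TrimPrefix(cleanSrc, prefix), true
}

func main() {
	sub, ok := extractSubPath(
		"/var/lib/kubelet/pods/uid/volumes/kubernetes.io~csi/pv1/mount/data/logs",
		"/var/lib/kubelet/pods/uid/volumes/kubernetes.io~csi/pv1/mount")
	fmt.Println(sub, ok) // data/logs true
}
```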
Handling FileSystem Stats and Conditions while the pod runs:
- Kubelet fsResourceAnalyzer will examine the runtimeclass specified by the configured pod, query the capabilities of the container runtime handler around file system management operations through CRUST `RuntimeGetCapabilities` and determine that the pod's container runtime handler supports querying filesystem stats and condition.
- Kubelet will invoke CSI `NodeGetVolumeStats` on the CSI node plugin. Through the new fields in the request, Kubelet will indicate to the CSI plugin that it can defer stats and conditions for the file system to the container runtime handler.
- The CSI plugin will pass the block device path on the host (corresponding to the volume) to the Kubelet in response.
- Kubelet will invoke CRUST `RuntimeGetVolumeStats` on the container runtime handler with the block device path on the host as a parameter to identify the target.
- The container runtime handler will map the host block device path to the corresponding virtual device in the sandbox environment, query the file system stats and condition and populate the response.
- Kubelet fsResourceAnalyzer will parse and store the file system stats and conditions from the container runtime handler (instead of the CSI plugin) and publish events related to volume health if the condition is abnormal.
Handling FileSystem Expansion while the pod runs:
- Kubelet will handle online file system expansion in a way identical to retrieval of file system stats described above. Handling file system expansion within the pod sandbox safely requires that the volume is mounted by a single pod on the node (using PVC access mode ReadWriteOncePod as described later).
When the pod terminates:
- Kubelet will invoke CRI `StopPodSandbox` on the CRI runtime, which will work with the container runtime handler to bring down the sandbox in preparation for removal.
- The container runtime handler will dismount filesystems it has mounted in the sandbox environment and detach the block device from the sandbox environment.
- Kubelet will invoke CSI `NodeUnpublishVolume` and `NodeUnstageVolume` on the CSI plugin.
- The node CSI plugin will perform any clean up of state but skip dismounting the file system on the PV.
Story 2: Fallback to CSI plugin based mounting of file system due to lack of support for post mount configuration operations in container runtime handler
When Kubelet queries the capabilities of the container runtime handler around file system mount and management operations through CRUST `RuntimeGetCapabilities`, the container runtime handler will report that it does not support checking and application of fsGroup ownership or subPath handling. Since the pod spec specifies a fsGroup and subPath for container mounts, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.
Story 3: Fallback to CSI plugin based mounting of file system due to absence of runtimeclass field specifying UDS for CRUST interface
When Kubelet analyzes the runtimeclass specified by the pod, it will find that no UDS is specified over which to invoke the CRUST Interface on the container runtime handler. Therefore, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.
Story 4: Fallback to CSI plugin based mounting of file system due to lack of support for deferral of file system mount in CSI node plugin
When Kubelet queries the capabilities of the CSI node plugin, it will find that the plugin does not report support for deferral of file system mounting and management operations. Therefore, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.
Story 5: Fallback to CSI plugin based mounting of file system due to lack of support for a specific file system in container runtime handler
When Kubelet passes the list of file systems that the container runtime handler supports mounting (retrieved through CRUST `RuntimeGetSupportedFileSystems`) as part of CSI `NodePublishVolume`, the CSI plugin will determine that the file system the volume is formatted with is not supported by the container runtime. Therefore, in response to `NodePublishVolume`, the CSI plugin will mount the file system (in regular fashion) and not specify any `runtime_mount_info`. As a result, Kubelet will fall back to regular CSI plugin managed file system mounts without any container runtime handler involvement.
Story 6: Failure to expand file system due to lack of support for file system resizing in container runtime handler
The CSI plugin will support deferral of file system mount and management operations to a container runtime handler. The storage class for the PV will have deferral of file system mounting and management operations enabled. The runtime handler in the runtime class will support the CRUST Interface and surface a UDS path over which to invoke the CRUST APIs. The UDS path will be specified in the runtimeclass. Mount and post mount operations will be deferred by the CSI plugin to the container runtime handler as described earlier and this will succeed resulting in successful pod startup.
While the pod runs, the PV is expanded.
- Kubelet, through CRUST `RuntimeGetCapabilities`, will determine that the container runtime handler does not support file system expansion.
- Kubelet will invoke CSI `NodeExpandVolume`, indicating that the container runtime handler does not support file system expansion.
- The CSI plugin will determine that the file system mount was deferred to the container runtime handler and respond with a failure indicating a failed precondition.
- Kubelet will note the failed precondition, report an overall failure for the operation and not retry it.
Workloads specifying runtimes capable of handling filesystem mounts should use PVCs with ReadWriteOncePod access mode (rather than ReadWriteOnce) if they are expected to write to the PV. If two pods use the same PVC with ReadWriteOnce and get scheduled on the same node, there is a risk of data corruption because typical block file systems like XFS or ext4 do not support parallel mounts. This does not apply if the filesystem to be mounted supports parallel mounts or the mounts are read-only.
A cluster admin can configure a webhook/OPA policy to restrict the set of access modes that can be specified on PVCs that are referred by pods associated with a microvm runtime capable of mounting and managing the filesystem on PVs.
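The core predicate of such a policy is small. A sketch under the assumptions that the runtime's filesystem-management capability and the PVC's access mode have already been resolved by the webhook (the function and parameter names are hypothetical):

```go
package main

import "fmt"

// allowPVCForRuntime implements the admission rule described above: for pods
// using a runtime class that mounts and manages filesystems on PVs, writable
// PVCs must use ReadWriteOncePod. Read-only mounts and ordinary runtimes are
// always allowed.
func allowPVCForRuntime(runtimeManagesFS bool, accessMode string, readOnly bool) bool {
	if !runtimeManagesFS || readOnly {
		return true
	}
	return accessMode == "ReadWriteOncePod"
}

func main() {
	fmt.Println(allowPVCForRuntime(true, "ReadWriteOnce", false))    // false: rejected
	fmt.Println(allowPVCForRuntime(true, "ReadWriteOncePod", false)) // true
	fmt.Println(allowPVCForRuntime(false, "ReadWriteOnce", false))   // true
}
```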
A pod typically maps to the isolated sandbox environment in the context of microvm runtimes. Individual containers in the pod within the sandbox are not expected to be isolated with the same guarantees that exist across pods. To align with these isolation goals, restrictions on multiple containers within a pod mounting the same PV are not necessary and are considered beyond the scope of this KEP.
A container runtime handler may declare support for a post mount configuration if application of the configuration is a no-op in the sandbox environment. For example, Kata guest kernels do not enforce selinux; therefore, application of selinux labels specified in the pod spec is not necessary. So the Kata runtime may declare support for selinux relabelling and perform runtime assisted mounting of file systems on PVs referred to by pods that specify explicit selinux labels, or when the host environment has selinux enabled in enforcing mode.
As summarized in the Proposal section above, coordination of mounting and management of the filesystem between a CSI plugin and a container runtime handler requires enhancements to multiple APIs and components of a Kubernetes cluster. This section delves into the details of each enhancement or addition.
A new field is necessary in RuntimeClass to specify a domain socket path (surfaced by the runtime handler) which Kubelet can use to invoke the CRUST APIs on the runtime handler. If the field is not specified, Kubelet will consider the container runtime handler unable to support runtime assisted mount and management of file systems.
type RuntimeClass struct {
metav1.TypeMeta `json:",inline"`
...
// FSMgmtSocket specifies an absolute path on the host to a UNIX socket
// surfaced by the Handler over which FileSystem Management APIs can be
// invoked. Absence of this implies Handler will not support CRUST Interface
// +optional
FSMgmtSocket *string `json:"fsMgmtSocket,omitempty" protobuf:"bytes,5,opt,name=fsMgmtSocket"`
}
A new field is necessary in StorageClass to specify whether Kubelet will attempt to initiate runtime assisted mount with the CSI plugin associated with the storage class. This allows disabling runtime assisted mount and management of PVs associated with the StorageClass even if the CSI plugin is capable of supporting runtime assisted mount and management of volumes.
type StorageClass struct {
metav1.TypeMeta `json:",inline"`
...
// AllowRuntimeAssistedMount specifies whether the storage class allows
// a CSI plugin to defer file system mount and management to a container
// runtime handler
// +optional
AllowRuntimeAssistedMount *bool `json:"allowRuntimeAssistedMount,omitempty" protobuf:"bytes,9,opt,name=allowRuntimeAssistedMount"`
}
New fields are necessary in the CSI node API requests: `NodePublishVolumeRequest`, `NodeGetVolumeStatsRequest` and `NodeExpandVolumeRequest` for the Kubelet (or another Container Orchestrator from the orchestrator agnostic CSI perspective) to indicate to a CSI plugin that the pod (that a PV is to be published to) is associated with a container runtime handler that supports mounting and management of the filesystem associated with the PV. Based on these fields, a CSI plugin can decide whether to defer handling of the operation to the container runtime (and populate the appropriate fields in the corresponding CSI API responses if it decides to defer).
Enhancements for `NodePublishVolumeRequest`: a new field `runtime_supported_filesystems` to indicate the list of file systems the container runtime can support mounting in the sandbox environment.
message NodePublishVolumeRequest {
// The ID of the volume to publish. This field is REQUIRED.
string volume_id = 1;
...
// Indicates file systems supported by the Container Runtime
// Handler associated with the containers.
// This field is OPTIONAL.
repeated string runtime_supported_filesystems = 6 [(alpha_field) = true];
}
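Given this field, the plugin-side decision reduces to a membership check. A sketch (the function name is hypothetical; the field name matches the proto above):

```go
package main

import "fmt"

// shouldDeferMount sketches the plugin-side decision: defer only when the
// filesystem the volume is formatted with appears in the
// runtime_supported_filesystems list sent by Kubelet. An empty list means
// the runtime cannot mount anything, so the plugin mounts as usual.
func shouldDeferMount(volumeFSType string, runtimeSupportedFilesystems []string) bool {
	for _, fs := range runtimeSupportedFilesystems {
		if fs == volumeFSType {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldDeferMount("ext4", []string{"ext4", "xfs"}))  // true
	fmt.Println(shouldDeferMount("btrfs", []string{"ext4", "xfs"})) // false: plugin mounts itself
	fmt.Println(shouldDeferMount("ext4", nil))                      // false
}
```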
Enhancements for `NodeGetVolumeStatsRequest`: a new field `runtime_supported_stats`
message NodeGetVolumeStatsRequest {
// The ID of the volume. This field is REQUIRED.
string volume_id = 1;
...
// Indicates Container Runtime supports reporting stats and
// condition of the file system on the volume
// This field is OPTIONAL.
bool runtime_supported_stats = 4 [(alpha_field) = true];
}
Enhancements for `NodeExpandVolumeRequest`: a new field `runtime_supports_expand`
message NodeExpandVolumeRequest {
// The ID of the volume to publish. This field is REQUIRED.
string volume_id = 1;
...
// Indicates Container Runtime supports expanding the file system
// on the volume
// This field is OPTIONAL.
bool runtime_supports_expand = 7 [(alpha_field) = true];
}
Corresponding to the above enhancements in CSI API Requests, new fields are necessary in the CSI node API responses: `NodePublishVolumeResponse`, `NodeGetVolumeStatsResponse` and `NodeExpandVolumeResponse`. The CSI plugin needs to indicate to the Kubelet (or another Container Orchestrator from the orchestrator agnostic CSI perspective) that the runtime (associated with the pod mounting the PV) should be involved in processing of the API, along with relevant parameters to identify the volume and process the API call.
Enhancements for `NodePublishVolumeResponse`: a new optional field `runtime_mount_info` specifying details of how and what to mount on a block device or network share.

```protobuf
message FileSystemMountInfo {
  // Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
  // is deferred to a container runtime that is capable of executing the mount.
  // This field is REQUIRED.
  string source = 1 [(alpha_field) = true];
  // Type of the filesystem to mount (e.g. xfs, ext4, nfs, ntfs) on the specified
  // source.
  // This field is REQUIRED.
  string type = 2 [(alpha_field) = true];
  // Mount options supported by the filesystem to be used for the specified source.
  // This field is OPTIONAL.
  map<string, string> options = 3 [(alpha_field) = true];
}

message NodePublishVolumeResponse {
  // Specifies details of how to mount a file system on a source device or
  // network share when SP defers file system mounts to a container runtime.
  // A SP MUST populate this if runtime_supported_filesystems was set in
  // NodePublishVolumeRequest and SP is capable of deferring filesystem mount to
  // the container runtime.
  // This field is OPTIONAL.
  FileSystemMountInfo runtime_mount_info = 1 [(alpha_field) = true];
}
```
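To illustrate the intended interplay between `runtime_supported_filesystems` in the request and `runtime_mount_info` in the response, the deferral decision a CSI node plugin might make can be sketched in Go. The `FileSystemMountInfo` struct and the `decidePublish` helper below are hypothetical stand-ins for the generated protobuf types, not part of the proposal:

```go
package main

import "fmt"

// FileSystemMountInfo mirrors the proposed CSI response field (a sketch;
// real code would use the generated protobuf types).
type FileSystemMountInfo struct {
	Source  string
	Type    string
	Options map[string]string
}

// decidePublish is a hypothetical helper: if the volume's filesystem type
// appears in runtime_supported_filesystems from NodePublishVolumeRequest,
// the plugin defers the mount and returns the info the runtime needs;
// otherwise it returns nil and the plugin mounts the filesystem itself.
func decidePublish(fsType, device string, runtimeSupported []string) *FileSystemMountInfo {
	for _, fs := range runtimeSupported {
		if fs == fsType {
			return &FileSystemMountInfo{
				Source:  device,
				Type:    fsType,
				Options: map[string]string{"nobarrier": ""}, // example option only
			}
		}
	}
	return nil // fall back to plugin-side mount
}

func main() {
	// Runtime supports ext4 and xfs; volume is formatted with xfs: defer.
	info := decidePublish("xfs", "/dev/sdf", []string{"ext4", "xfs"})
	fmt.Println(info != nil, info.Source)

	// Runtime does not support ntfs: plugin performs the mount itself.
	fallback := decidePublish("ntfs", "/dev/sdf", []string{"ext4", "xfs"})
	fmt.Println(fallback == nil)
}
```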
In case of a block device backed PV, fields in `NodePublishVolumeResponse` would contain:
- `source` set to the name of the block device that the runtime should mount. Example: `/dev/sdf`
- `options` set to filesystem specific options to be passed to mount. Example: `nobarrier`
- `type` set to the filesystem to use for mounting. Example: `xfs` or `ext4` for Linux, `ntfs` for Windows

In case of a NFS backed PV, fields in `NodePublishVolumeResponse` would contain:
- `source` set to the network share that the runtime should mount. Example: `srv1.net:/exported/path`
- `options` set to the NFS client options that need to be passed to mount.
- `type` set to `nfs`
Enhancements for `NodeGetVolumeStatsResponse`: a new optional field `source` which, if populated, will be passed by the Kubelet (or another Container Orchestrator) to the container runtime to retrieve stats associated with the file system and its condition.
```protobuf
message NodeGetVolumeStatsResponse {
  ...
  // Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
  // is deferred to a container runtime that is capable of reporting stats for
  // filesystems mounted by it.
  // This field is OPTIONAL.
  // It SHOULD be populated by a SP if runtime_supported_stats was set in
  // NodeGetVolumeStatsRequest and SP is capable of deferring filesystem mount
  // and stats requests to the container runtime.
  // SP MUST NOT populate `usage` and `volume_condition` fields if source is
  // specified indicating deferral of stats to the container runtime.
  string source = 3 [(alpha_field) = true];
}
```
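The mutual-exclusion rule above (either defer via `source` or report stats directly, never both) lends itself to a simple validation check. The sketch below uses hypothetical trimmed-down Go structs in place of the generated protobuf types:

```go
package main

import (
	"errors"
	"fmt"
)

// statsResponse is a trimmed, hypothetical stand-in for the CSI
// NodeGetVolumeStatsResponse (real code would use generated types).
type statsResponse struct {
	Source          string
	Usage           []string // placeholder for VolumeUsage entries
	VolumeCondition *string
}

// validateStatsDeferral enforces the rule in the field comment: when source
// is set (stats deferred to the runtime), usage and volume_condition MUST
// NOT be populated.
func validateStatsDeferral(r statsResponse) error {
	if r.Source != "" && (len(r.Usage) > 0 || r.VolumeCondition != nil) {
		return errors.New("source set together with usage/volume_condition")
	}
	return nil
}

func main() {
	// Deferral-only response: valid.
	fmt.Println(validateStatsDeferral(statsResponse{Source: "/dev/sdf"}))

	// Mixed response: invalid.
	cond := "ok"
	fmt.Println(validateStatsDeferral(statsResponse{Source: "/dev/sdf", VolumeCondition: &cond}))
}
```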
Enhancements for `NodeExpandVolumeResponse`: a new optional field `source` which, if populated, along with the existing field `capacity_bytes`, will be passed by the Kubelet (or another Container Orchestrator) to the container runtime to expand the file system.
```protobuf
message NodeExpandVolumeResponse {
  ...
  // Source device (e.g. /dev/sdX) whose mount is deferred to a container runtime
  // that is capable of expanding the filesystems mounted by it.
  // This field is OPTIONAL.
  // It SHOULD be populated by a SP if runtime_supports_expand was set in
  // NodeExpandVolumeRequest and SP is capable of deferring filesystem mount
  // and expand requests to the container runtime.
  // SP MUST populate `capacity_bytes` field with the desired capacity if source
  // is specified indicating deferral of expansion to the container runtime.
  string source = 2 [(alpha_field) = true];
}
```
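The Kubelet-side handling of this response can be sketched as follows: when the SP sets `source`, the Kubelet forwards the expansion to the runtime via the CRUST interface; otherwise the expansion is already complete. The struct and helper names below are hypothetical:

```go
package main

import "fmt"

// expandResponse is a trimmed, hypothetical stand-in for the CSI
// NodeExpandVolumeResponse.
type expandResponse struct {
	CapacityBytes int64
	Source        string
}

// nextExpandStep sketches the Kubelet decision: when the SP sets source,
// the Kubelet must ask the runtime (via CRUST RuntimeExpandVolume) to grow
// the filesystem on that source to CapacityBytes; otherwise the SP already
// expanded the filesystem and nothing more is needed.
func nextExpandStep(r expandResponse) string {
	if r.Source != "" {
		return fmt.Sprintf("RuntimeExpandVolume(host_volume_id=%s, required_bytes=%d)",
			r.Source, r.CapacityBytes)
	}
	return "done"
}

func main() {
	fmt.Println(nextExpandStep(expandResponse{Source: "/dev/sdf", CapacityBytes: 10 << 30}))
	fmt.Println(nextExpandStep(expandResponse{CapacityBytes: 10 << 30}))
}
```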
A new API, the Container RUntime STorage (CRUST) Interface, is proposed that roughly mirrors the CSI node plugin APIs. The CRUST Interface will be used by the Kubelet to invoke file system mount and management operations on a container runtime handler that implements the interface. The Kubelet will use the UDS path specified in the `FsMgmtSocket` field mentioned above in RuntimeClass to invoke the CRUST APIs.
Implementing the CRUST API and surfacing the unix domain socket for invocation of the CRUST API is expected to be the responsibility of the OCI runtime handler, with no involvement expected from the CRI runtime.
```protobuf
service RuntimeAssistedStorageManagement {
  rpc RuntimeGetCapabilities(RuntimeGetCapabilitiesRequest)
      returns (RuntimeGetCapabilitiesResponse) {}
  rpc RuntimeGetSupportedFileSystems(RuntimeGetSupportedFileSystemsRequest)
      returns (RuntimeGetSupportedFileSystemsResponse) {}
  rpc RuntimePublishVolume (RuntimePublishVolumeRequest)
      returns (RuntimePublishVolumeResponse) {}
  rpc RuntimeGetVolumeStats (RuntimeGetVolumeStatsRequest)
      returns (RuntimeGetVolumeStatsResponse) {}
  rpc RuntimeExpandVolume (RuntimeExpandVolumeRequest)
      returns (RuntimeExpandVolumeResponse) {}
}
```
```protobuf
message RuntimeGetCapabilitiesRequest {
  // Intentionally empty.
}

message RuntimeGetCapabilitiesResponse {
  // All the storage management capabilities that the runtime handler supports.
  // This field is OPTIONAL.
  repeated RuntimeCapability capabilities = 1;
}

// Specifies a capability of the runtime handler around storage management.
// Similar to https://github.com/container-storage-interface/spec/blob/5b0d4540158a260cb3347ef1c87ede8600afb9bf/csi.proto#L1493
message RuntimeCapability {
  message RPC {
    enum Type {
      UNKNOWN = 0;
      // Performs a full recursive check of FsGroup GID ownership of files
      // and applies FsGroup GID ownership to all files
      FS_GROUP_CHANGE_POLICY_ALWAYS = 2;
      // Performs a quick check of FsGroup GID ownership of root of the file
      // system and skips recursive FsGroup GID ownership application if a
      // match is found
      FS_GROUP_CHANGE_POLICY_ROOT_MISMATCH = 3;
      // Supports surfacing subpaths to containers from a mounted file system
      SUBPATH = 4;
      // Supports handling of file system with selinux labels and recursive
      // relabelling of files if necessary
      SELINUX_RELABEL = 5;
      // Supports handling of file system with selinux labels and passing the
      // selinux label as a mount option
      SELINUX_RELABEL_ON_MOUNT = 6;
      // Supports retrieval of volume stats from the file system
      VOLUME_STATS = 7;
      // Supports online resizing (expansion) of the file system
      VOLUME_RESIZE = 8;
    }
    Type type = 1;
  }
  oneof type {
    // RPC that the runtime supports.
    RPC rpc = 1;
  }
}
```
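The Kubelet is expected to gate CRUST calls on the capabilities reported here, e.g. only invoking `RuntimeExpandVolume` when `VOLUME_RESIZE` was advertised. A hypothetical Go sketch of that gating (the enum constants mirror the proto numbering above, including the gap at value 1; real code would use generated types):

```go
package main

import "fmt"

// RuntimeCapabilityType mirrors the RuntimeCapability RPC.Type enum above
// (a sketch; real code would use the generated protobuf enum).
type RuntimeCapabilityType int

const (
	CapUnknown RuntimeCapabilityType = iota
	_                                // value 1 is unused, matching the proto numbering
	CapFsGroupChangePolicyAlways
	CapFsGroupChangePolicyRootMismatch
	CapSubpath
	CapSelinuxRelabel
	CapSelinuxRelabelOnMount
	CapVolumeStats
	CapVolumeResize
)

// hasCapability is how the Kubelet might gate a CRUST call: e.g. only invoke
// RuntimeExpandVolume when RuntimeGetCapabilities reported VOLUME_RESIZE.
func hasCapability(caps []RuntimeCapabilityType, want RuntimeCapabilityType) bool {
	for _, c := range caps {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	// A runtime handler that reported SUBPATH and VOLUME_STATS only.
	caps := []RuntimeCapabilityType{CapSubpath, CapVolumeStats}
	fmt.Println(hasCapability(caps, CapVolumeStats))  // stats calls allowed
	fmt.Println(hasCapability(caps, CapVolumeResize)) // expansion must not be deferred
}
```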
```protobuf
message RuntimeGetSupportedFileSystemsRequest {
  // Intentionally empty.
}

message RuntimeGetSupportedFileSystemsResponse {
  // A list of file systems that the runtime handler supports mounting.
  // A CSI plugin will usually match the entries in the list to the
  // file system a volume is formatted with (potentially obtained using blkid
  // as in https://github.com/kubernetes/utils/blob/2afb4311ab10fb869c5d8c260b5e92b4bbde7f80/mount/mount_linux.go#L403 and
  // implemented in https://github.com/util-linux/util-linux/tree/441f9b9303d015f1777aec7168807d58feacca31/libblkid/src/superblocks).
  // Examples: ext3, ext4, xfs
  // A CSI plugin may also support a special/custom file system. If the runtime
  // handler wishes to support mounting that, the cluster administrator will
  // need to ensure the container runtime environment has the appropriate
  // software and configuration for the special file system and reports support
  // for it through this API.
  repeated string file_systems = 1;
}
```
```protobuf
message RuntimePublishVolumeRequest {
  // An ID assigned by the container runtime to the sandbox where the volume
  // needs to be published.
  string sandbox_id = 1;
  // An ID to uniquely identify the volume in the host environment. In case of
  // a block device, this will be a path to the block device in the host
  // (e.g. /dev/sdx).
  string host_volume_id = 2;
  // The path on the host where the volume would have been mounted (if runtime
  // assisted mount was not enabled). This path should be used to map either
  // [1] source field in mounts structure in OCI config spec OR
  // [2] host_path fields in Mounts in ContainerConfig in CRI CreateContainer
  // to host_volume_id whose file system is mounted by the container runtime.
  string host_target_path = 3;
  // The type of file system to be mounted on the volume.
  string file_system = 4;
  // The mount options to use when mounting the file system.
  repeated string mount_options = 5;
  // The supplemental gid that should own the file system so it can be shared
  // between different containers potentially running with different uids.
  int32 fsgroup_gid = 6;
  // The policy to use when checking and applying fsgroup ownership of the
  // mounted file system.
  string fsgroup_policy = 7;
}

message RuntimePublishVolumeResponse {
  // Intentionally empty.
}
```
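The Kubelet would assemble this request largely from the `FileSystemMountInfo` returned by the CSI plugin. A hypothetical Go sketch of that translation (trimmed structs stand in for the generated protobuf types, and the `key=value` flattening of the options map is an assumption, not specified by this KEP):

```go
package main

import "fmt"

// Trimmed, hypothetical stand-ins for FileSystemMountInfo (CSI) and
// RuntimePublishVolumeRequest (CRUST).
type fileSystemMountInfo struct {
	Source  string
	Type    string
	Options map[string]string
}

type runtimePublishVolumeRequest struct {
	SandboxID      string
	HostVolumeID   string
	HostTargetPath string
	FileSystem     string
	MountOptions   []string
}

// toRuntimePublish sketches how the Kubelet could translate the mount info
// returned by the CSI plugin into a CRUST RuntimePublishVolumeRequest.
// Flattening map options into "key" / "key=value" strings is an assumption.
func toRuntimePublish(sandboxID, targetPath string, info fileSystemMountInfo) runtimePublishVolumeRequest {
	req := runtimePublishVolumeRequest{
		SandboxID:      sandboxID,
		HostVolumeID:   info.Source,
		HostTargetPath: targetPath,
		FileSystem:     info.Type,
	}
	for k, v := range info.Options {
		if v == "" {
			req.MountOptions = append(req.MountOptions, k)
		} else {
			req.MountOptions = append(req.MountOptions, k+"="+v)
		}
	}
	return req
}

func main() {
	// Sandbox ID and target path below are illustrative placeholders.
	req := toRuntimePublish("sandbox-1", "/var/lib/kubelet/pods/pod1/volumes/vol1/mount",
		fileSystemMountInfo{Source: "/dev/sdf", Type: "xfs", Options: map[string]string{"nobarrier": ""}})
	fmt.Printf("%+v\n", req)
}
```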
```protobuf
message RuntimeGetVolumeStatsRequest {
  // An ID assigned by the container runtime to the sandbox where the volume has
  // been published/mounted.
  string sandbox_id = 1;
  // An ID to uniquely identify the volume in the host environment. In case of
  // a block device, this will be a path to the block device in the host
  // (e.g. /dev/sdx).
  string host_volume_id = 2;
}

message RuntimeGetVolumeStatsResponse {
  // Contents of this message should be aligned to CSI NodeGetVolumeStatsResponse.
  repeated VolumeUsage usage = 1;
  // Information about the current condition of the volume.
  VolumeCondition volume_condition = 2;
}
```
```protobuf
message VolumeUsage {
  enum Unit {
    UNKNOWN = 0;
    BYTES = 1;
    INODES = 2;
  }
  // The available capacity in specified Unit.
  // The value of this field MUST NOT be negative.
  int64 available = 1;
  // The total capacity in specified Unit.
  // The value of this field MUST NOT be negative.
  int64 total = 2;
  // The used capacity in specified Unit.
  // The value of this field MUST NOT be negative.
  int64 used = 3;
  // Units by which values are measured.
  Unit unit = 4;
}

// VolumeCondition represents the current condition of a volume.
message VolumeCondition {
  // Normal volumes are available for use and operating optimally.
  // An abnormal volume does not meet these criteria.
  bool abnormal = 1;
  // The message describing the condition of the volume.
  string message = 2;
}
```
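The non-negativity constraints on `VolumeUsage` can be checked by the Kubelet before surfacing runtime-reported stats. A minimal Go sketch, assuming trimmed stand-in types; the `used + available <= total` invariant is an extra sanity check an implementation might add, not one mandated by the message:

```go
package main

import (
	"errors"
	"fmt"
)

// volumeUsage mirrors the VolumeUsage message above (sketch; single unit).
type volumeUsage struct {
	Available, Total, Used int64
}

// validateUsage checks the constraints stated in the message comments (no
// negative values) plus an assumed consistency check that used plus
// available does not exceed total when total is reported.
func validateUsage(u volumeUsage) error {
	if u.Available < 0 || u.Total < 0 || u.Used < 0 {
		return errors.New("values MUST NOT be negative")
	}
	if u.Total > 0 && u.Used+u.Available > u.Total {
		return errors.New("used+available exceeds total")
	}
	return nil
}

func main() {
	// A consistent report: 6 GiB used, 4 GiB available, 10 GiB total.
	fmt.Println(validateUsage(volumeUsage{Available: 4 << 30, Total: 10 << 30, Used: 6 << 30}))

	// A report violating the MUST NOT be negative constraint.
	fmt.Println(validateUsage(volumeUsage{Available: -1}))
}
```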
```protobuf
message RuntimeExpandVolumeRequest {
  // An ID assigned by the container runtime to the sandbox where the volume has
  // been published/mounted.
  string sandbox_id = 1;
  // An ID to uniquely identify the volume in the host environment. In case of
  // a block device, this will be a path to the block device in the host
  // (e.g. /dev/sdx).
  string host_volume_id = 2;
  // The desired size of the expanded file system.
  int64 required_bytes = 3;
}

message RuntimeExpandVolumeResponse {
  // The capacity of the volume in bytes after file system expansion.
  int64 capacity_bytes = 1;
}
```
In certain shared FS scenarios (like SMB), secrets associated with mounting the FS may need to be passed to the OCI runtime to enable it to authenticate. However, the configuration containing the OCI mount options may be persisted on the host file system by the CRI runtime (as described here in the case of containerd). Depending on the security posture of the host, enabling runtime assisted mounts may not be recommended if persisting secrets on the host file system is undesirable. Future enhancements in container runtimes to pass the OCI spec in memory will address this limitation.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name:
  - Components depending on the feature gate:
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled.)
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
A CSI plugin can skip mounting of the filesystem and pass the mount info to the runtime through a specially named metadata file placed in the publish path, which gets passed to the runtime today as the mount source. This approach does not require any changes to the Kubelet or the CSI spec. However, it does not provide a definitive way for a runtime to determine the authenticity of the metadata file: a legitimate CSI plugin could be made to surface a malicious metadata file for the container runtime to consume. Through this, an unauthorized user may gain access to an arbitrary block device on the host, or an NFS share, that the container runtime can mount based on the information in the malicious metadata file.
This approach was the starting point of the discussion in the Kata community (kata-containers/kata-containers#1568) that led to this KEP.
The CRUST Interface APIs could instead be integrated with the CRI and shim APIs. However, this would result in a major expansion of the scope of CRI: CRI runtimes like containerd and CRI-O would need to be made aware of the operations in the CRUST Interface, which is unnecessary.