@emaxerrno
Forked from chetan/mesos_isolators.md
Created July 28, 2017 12:37
Description of available Apache Mesos isolators

List of Isolators

Side note: all available resource metrics are documented here:

https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/include/mesos/mesos.proto#L1015

Filesystem Isolators

These isolate a task's files on disk from both the host system and other running tasks.

filesystem/posix

Generic POSIX-compatible file isolation. Essentially creates a folder which is owned by the task user/group.

filesystem/windows

// TODO(hausdorff): (MESOS-5462) For now the Windows isolators are essentially
// direct copies of their POSIX counterparts. In the future, we expect to
// refactor the POSIX classes into platform-independent base class, with
// Windows and POSIX implementations. For now, we leave the Windows
// implementations as inheriting from the POSIX implementations.

filesystem/linux

Linux-specific isolation using mount namespaces.

filesystem/shared

// This isolator is to be used when all containers share the host's
// filesystem.  It supports creating mounting "volumes" from the host
// into each container's mount namespace. In particular, this can be
// used to give each container a "private" system directory, such as
// /tmp and /var/tmp.

Deprecated in favor of filesystem/linux.

Runtime Isolators

These isolators are used to ensure that a task behaves well at runtime and also provide runtime usage metrics for the given resource.

posix/cpu

No actual resource isolation but does support returning usage metrics.

Metrics: CPU user time and system time. See: https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/src/usage/usage.cpp#L35
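To make the two CPU metrics concrete, here is a minimal sketch that reads user and system CPU time for the current process. It is only an illustration of what the numbers mean, not how Mesos collects them (Mesos gathers usage for the task's process tree in usage.cpp). The function name `cpu_times_secs` is hypothetical.

```python
import os


def cpu_times_secs():
    """Return (user, system) CPU seconds consumed so far by this process."""
    t = os.times()
    return t.user, t.system


user_secs, system_secs = cpu_times_secs()
# The labels mirror the metric names used elsewhere in this document.
print(f"cpus_user_time_secs={user_secs:.2f} cpus_system_time_secs={system_secs:.2f}")
```

Both values only ever grow over the lifetime of a process, which is why they are reported as cumulative seconds rather than as a percentage.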

posix/mem

No actual resource isolation but does support returning usage metrics.

Metrics: mem_rss_bytes. See: https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/src/usage/usage.cpp#L35

posix/disk

Uses du -k -s to ensure tasks stay within disk usage limits.

Can Kill Tasks? Yes

Metrics: disk_limit_bytes, disk_used_bytes

// This isolator monitors the disk usage for containers, and reports
// ContainerLimitation when a container exceeds its disk quota. This
// leverages the DiskUsageCollector to ensure that we don't induce too
// much CPU usage and disk caching effects from running 'du' too
// often.
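The check described above can be sketched roughly as follows: run `du -k -s` on the sandbox, convert kilobytes to bytes, and compare against the limit. This is a simplified illustration, not the Mesos DiskUsageCollector; the function names and the sandbox path are hypothetical, and it does none of the rate limiting the comment above describes.

```python
import subprocess


def parse_du_kilobytes(du_output: str) -> int:
    """Parse the size field of `du -k -s` output ("<KiB><tab><path>")."""
    return int(du_output.split()[0])


def disk_used_bytes(path: str) -> int:
    """Run `du -k -s <path>` and return the reported usage in bytes."""
    result = subprocess.run(
        ["du", "-k", "-s", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_du_kilobytes(result.stdout) * 1024


def exceeds_quota(used_bytes: int, limit_bytes: int) -> bool:
    """True when usage is strictly above the limit."""
    return used_bytes > limit_bytes


if __name__ == "__main__":
    sandbox = "."  # hypothetical sandbox directory
    used = disk_used_bytes(sandbox)
    print(f"disk_used_bytes={used}")
```

Because `du` walks the whole directory tree, each call costs CPU and I/O in proportion to the number of files, which is the cost that disk/xfs avoids.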

disk/du

Alias for posix/disk

Can Kill Tasks? Yes

disk/xfs

The XFS Disk isolator uses XFS project quotas to track the disk space used by each container sandbox and to enforce the corresponding disk space allocation. Write operations performed by tasks exceeding their disk allocation will fail with an EDQUOT error. The task will not be terminated by the containerizer.

The XFS disk isolator is functionally similar to the POSIX disk isolator but avoids the cost of repeatedly running du. Although the two will not interfere with each other, using them together is not recommended.

Metrics: disk_limit_bytes, disk_used_bytes
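Since the task is not killed under disk/xfs, it has to cope with failed writes itself. A minimal sketch of how a task might detect the EDQUOT condition is shown below; the helper names are hypothetical, and `errno.EDQUOT` is assumed to be available (it is on Linux).

```python
import errno


def is_quota_error(exc: OSError) -> bool:
    """True if the OS error is "disk quota exceeded" (EDQUOT)."""
    return exc.errno == errno.EDQUOT


def write_checked(path: str, data: bytes) -> bool:
    """Write data to path; return False if the write hit the quota."""
    try:
        with open(path, "wb") as f:
            f.write(data)
        return True
    except OSError as exc:
        if is_quota_error(exc):
            return False
        raise  # Any other error is not quota related.
```

A task could use the `False` result to clean up temporary files or to fail gracefully instead of crashing with an unhandled exception.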

windows/cpu

// A basic MesosIsolatorProcess that keeps track of the pid but
// doesn't do any resource isolation. Subclasses must implement
// usage() for their appropriate resource(s).
//
// TODO(hausdorff): (MESOS-5462) For now the Windows isolators are essentially
// direct copies of their POSIX counterparts. In the future, we expect to
// refactor the POSIX classes into platform-independent base class, with
// Windows and POSIX implementations. For now, we leave the Windows
// implementations as inheriting from the POSIX implementations.

cgroups/cpu

Uses Cgroups cpu and cpuacct subsystems:

cpu
       Cgroups can be guaranteed a minimum number of "CPU shares"
       when a system is busy.  This does not limit a cgroup's CPU
       usage if the CPUs are not busy.

       Further information can be found in the kernel source file
       Documentation/scheduler/sched-bwc.txt.

cpuacct
       This provides accounting for CPU usage by groups of tasks.

       Further information can be found in the kernel source file
       Documentation/cgroup-v1/cpuacct.txt.

(from cgroups(7) man page)

// Use the Linux cpu cgroup controller for cpu isolation which uses the
// Completely Fair Scheduler (CFS).
// - cpushare implements proportionally weighted scheduling.
// - cfs implements hard quota based scheduling.

Metrics: processes, threads, cpus_user_time_secs, cpus_system_time_secs

Additional metrics when using CFS: cpus_nr_periods, cpus_nr_throttled, cpus_throttled_time_secs

https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/src/slave/containerizer/mesos/isolators/cgroups/cpushare.cpp#L446
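The three CFS metrics correspond to the fields of the cgroup v1 `cpu.stat` file (`nr_periods`, `nr_throttled`, and `throttled_time`, the last in nanoseconds). The sketch below parses that file's text into the metric names listed above. The function name and the exact mapping are illustrative assumptions, not a copy of the Mesos code.

```python
def parse_cpu_stat(text: str) -> dict:
    """Map cgroup v1 cpu.stat contents to the CFS metric names above."""
    raw = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            raw[parts[0]] = int(parts[1])
    return {
        "cpus_nr_periods": raw.get("nr_periods", 0),
        "cpus_nr_throttled": raw.get("nr_throttled", 0),
        # throttled_time is reported in nanoseconds; convert to seconds.
        "cpus_throttled_time_secs": raw.get("throttled_time", 0) / 1e9,
    }


sample = "nr_periods 100\nnr_throttled 25\nthrottled_time 1500000000\n"
print(parse_cpu_stat(sample))
```

A high ratio of `cpus_nr_throttled` to `cpus_nr_periods` suggests the task regularly hits its hard quota.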

cgroups/devices

// This isolator uses the cgroups devices subsystem to
// restrict access to devices in `/dev`. A small set of
// default devices are whitelisted upon container creation,
// and access to all other devices is restricted. It is
// assumed that other isolators will be used to allow / deny
// access to devices outside the default whitelist.

Whitelist

 devices
        This supports controlling which tasks may create (mknod)
        devices as well as open them for reading or writing.  The
        policies may be specified as whitelists and blacklists.
        Hierarchy is enforced, so new rules must not violate existing
        rules for the target or ancestor cgroups.

        Further information can be found in the kernel source file
        Documentation/cgroup-v1/devices.txt.

(from cgroups(7) man page)

Metrics: none
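Whitelist entries in the devices subsystem take the form `<type> <major>:<minor> <access>`, where type is `a`, `b`, or `c` and access is some combination of `r`, `w`, and `m` (mknod). The sketch below builds such entries; the helper name is hypothetical, and the devices shown are examples rather than the actual Mesos default whitelist.

```python
def device_entry(dev_type, major, minor, access):
    """Format a cgroups devices entry such as "c 1:3 rwm".

    Pass None for major or minor to produce the "*" wildcard.
    """
    if dev_type not in ("a", "b", "c"):
        raise ValueError(f"unknown device type: {dev_type!r}")
    if not access or set(access) - set("rwm"):
        raise ValueError(f"invalid access string: {access!r}")
    major_s = "*" if major is None else str(major)
    minor_s = "*" if minor is None else str(minor)
    return f"{dev_type} {major_s}:{minor_s} {access}"


# /dev/null is character device 1:3 on Linux.
print(device_entry("c", 1, 3, "rwm"))
```

Writing a string like this to a cgroup's `devices.allow` file grants access, and writing it to `devices.deny` revokes it.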

cgroups/mem

Cgroups memory subsystem:

 memory
        The memory controller supports reporting and limiting of
        process memory, kernel memory, and swap used by cgroups.

        Further information can be found in the kernel source file
        Documentation/cgroup-v1/memory.txt.

Can Kill Tasks? Yes

Metrics:

mem_total_bytes

// Total memory + swap usage. This is set if swap is enabled.
mem_total_memsw_bytes

// Hard memory limit for a container.
mem_limit_bytes

// Soft memory limit for a container.
mem_soft_limit_bytes

// Broken out memory usage information: pagecache, rss (anonymous),
// mmaped files and swap.

// TODO(chzhcn) mem_file_bytes and mem_anon_bytes are deprecated in
// 0.23.0 and will be removed in 0.24.0.
mem_file_bytes
mem_anon_bytes

// mem_cache_bytes is added in 0.23.0 to represent page cache usage.
mem_cache_bytes

// Since 0.23.0, mem_rss_bytes is changed to represent only
// anonymous memory usage. Note that neither its requiredness, type,
// name nor numeric tag has been changed.
mem_rss_bytes

mem_mapped_file_bytes
// This is only set if swap is enabled.
mem_swap_bytes
mem_unevictable_bytes
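Several of these metrics map naturally onto fields of the cgroup v1 `memory.stat` file. The sketch below shows one plausible mapping. Which fields Mesos actually reads (for example, the hierarchical `total_*` variants) is not stated here, so treat the mapping and the function name as assumptions.

```python
# Assumed mapping from memory.stat field names to the metric names above.
FIELD_TO_METRIC = {
    "cache": "mem_cache_bytes",
    "rss": "mem_rss_bytes",
    "mapped_file": "mem_mapped_file_bytes",
    "swap": "mem_swap_bytes",
    "unevictable": "mem_unevictable_bytes",
}


def parse_memory_stat(text: str) -> dict:
    """Extract the mapped metrics from memory.stat contents."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in FIELD_TO_METRIC:
            metrics[FIELD_TO_METRIC[parts[0]]] = int(parts[1])
    return metrics


sample = "cache 4096\nrss 8192\nmapped_file 1024\nunevictable 0\n"
print(parse_memory_stat(sample))
```

The `swap` field only appears when swap accounting is enabled, which matches the note above that mem_swap_bytes is only set in that case.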

cgroups/net_cls

The cgroups/net_cls isolator allows operators to provide network performance isolation and network segmentation for containers within a Mesos cluster.

Read more

Metrics: none
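The net_cls subsystem tags packets with a 32-bit class ID of the form 0xAAAABBBB, where AAAA is the major handle and BBBB is the minor handle. The sketch below encodes and decodes such IDs; the function names are hypothetical, and how Mesos allocates handles to containers is not covered.

```python
def encode_classid(major: int, minor: int) -> int:
    """Pack a major:minor handle pair into a net_cls class ID."""
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor


def decode_classid(classid: int) -> tuple:
    """Split a net_cls class ID back into (major, minor)."""
    return classid >> 16, classid & 0xFFFF


classid = encode_classid(0x10, 0x1)
print(hex(classid))
```

Tools such as tc or iptables can then match on the class ID to shape or filter a container's traffic.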

cgroups/perf_event

TODO

appc/runtime

Same concept as docker/runtime (described below), but for appc images.

Metrics: none

docker/runtime

The Docker Runtime isolator supports runtime configuration taken from the Docker image (e.g., Entrypoint/Cmd, Env, etc.). It is tied to --image_providers=docker: if --image_providers contains docker, this isolator must be enabled, otherwise the agent will refuse to start.

To enable the Docker Runtime isolator, append docker/runtime to the --isolation flag when starting the agent.

Currently, the Docker image defaults for Entrypoint, Cmd, Env, and WorkingDir are supported by the Docker Runtime isolator. Users can specify CommandInfo to override the default Entrypoint and Cmd in the image (see below for details). The CommandInfo should be inside either TaskInfo or ExecutorInfo (depending on whether the task is a command task or uses a custom executor, respectively).

Read more

// The docker runtime isolator is responsible for preparing mesos
// container by merging runtime configuration specified by user
// and docker image default configuration.

Metrics: none
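The merging idea can be illustrated with a deliberately simplified sketch: image defaults are used unless the user supplies an override, and the final command line is the entrypoint followed by the cmd arguments. This does not model the actual Mesos rules for CommandInfo (its shell, value, and arguments fields), and the function and parameter names are hypothetical.

```python
def effective_argv(image_entrypoint, image_cmd,
                   override_entrypoint=None, override_cmd=None):
    """Merge image defaults with optional user overrides (simplified)."""
    entrypoint = image_entrypoint if override_entrypoint is None else override_entrypoint
    cmd = image_cmd if override_cmd is None else override_cmd
    return list(entrypoint) + list(cmd)


# Image defaults only.
print(effective_argv(["/bin/server"], ["--port", "80"]))
# User overrides the default arguments but keeps the entrypoint.
print(effective_argv(["/bin/server"], ["--port", "80"], override_cmd=["--port", "8080"]))
```

Env and WorkingDir would be merged in a similar spirit, with user-specified values taking precedence over image defaults under this simplified model.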

docker/volume

Allows tasks to use Docker volumes within Mesos.

Metrics: none

volume/image

TODO

gpu/nvidia

TODO

namespaces/pid

PID namespaces isolate the process ID number space, meaning that
processes in different PID namespaces can have the same PID.  PID
namespaces allow containers to provide functionality such as
suspending/resuming the set of processes in the container and
migrating the container to a new host while the processes inside the
container maintain the same PIDs.

PIDs in a new PID namespace start at 1, somewhat like a standalone
system, and calls to fork(2), vfork(2), or clone(2) will produce
processes with PIDs that are unique within the namespace.

(from pid_namespaces(7) man page)

Metrics: none
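On Linux 4.1 and later, the `NSpid` line in `/proc/<pid>/status` shows a process's PID in each nested PID namespace, outermost first. The sketch below parses that line to show how one process can be PID 4321 on the host and PID 1 inside its container. The function name and sample values are illustrative.

```python
def parse_nspid(status_text: str) -> list:
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # Older kernels do not report NSpid.


sample = "Name:\tsleep\nPid:\t4321\nNSpid:\t4321\t1\n"
print(parse_nspid(sample))
```

On a live system the same function could be fed the contents of `/proc/self/status`; a process that is not in a nested PID namespace has a single entry.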

network/cni

TODO

network/port_mapping

TODO
