Last active
December 24, 2015 17:49
Recent systemd changes for cgroups and namespaces.
systemd 208 released [LWN.net] http://lwn.net/Articles/569235/
* This release removes high-level support for the
MemorySoftLimit= cgroup setting. The underlying kernel
cgroup attribute memory.soft_limit_in_bytes is currently badly
designed and likely to be removed from the kernel API in its
current form, hence we shouldn't expose it for now.
* The memory.use_hierarchy cgroup attribute is now enabled for
all cgroups systemd creates in the memory cgroup
hierarchy. This option is likely to become the built-in
default in the kernel anyway, and the non-hierarchical mode
never made much sense in the intrinsically hierarchical
cgroup system.
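
To check whether hierarchical accounting is on for a given memory cgroup, one can read the attribute file directly. The following is only a minimal sketch: it assumes a cgroup v1 memory controller, and the helper name and the example path are hypothetical (the mount point and cgroup names vary by distribution and systemd version).

```python
from pathlib import Path


def uses_hierarchy(cgroup_dir):
    """Return True if memory.use_hierarchy reads 1 in the given cgroup directory."""
    value = (Path(cgroup_dir) / "memory.use_hierarchy").read_text().strip()
    return value == "1"


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the memory controller is mounted.
    example = Path("/sys/fs/cgroup/memory/system")
    if (example / "memory.use_hierarchy").exists():
        print(example, uses_hierarchy(example))
```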
[systemd-devel] [ANNOUNCE] systemd 207 http://lists.freedesktop.org/archives/systemd-devel/2013-September/013189.html
[systemd-devel] [ANNOUNCE] systemd 206 http://lists.freedesktop.org/archives/systemd-devel/2013-July/012236.html
* Creation of "dead" device nodes has been moved from udev
into kmod and tmpfiles. Previously, udev would read the kmod
databases to pre-generate dead device nodes based on meta
information contained in kernel modules, so that these would
be auto-loaded on access rather than at boot. As this
doesn't really have much to do with exposing actual
kernel devices to userspace, this has always been slightly
alien in the udev codebase. Following the new scheme, kmod
will now generate a runtime snippet for tmpfiles from the
module meta information, and it is now tmpfiles' job to
create the nodes. This also allows overriding access and
other parameters for the nodes using the usual tmpfiles
facilities. As a side effect, this allows us to remove the
CAP_SYS_MKNOD capability bit from udevd entirely.
[systemd-devel] [ANNOUNCE] systemd 205 http://lists.freedesktop.org/archives/systemd-devel/2013-July/011679.html
* Two new unit types have been introduced:
Scope units are very similar to service units, but are
created out of pre-existing processes instead of having PID 1
fork off the processes. By using scope units it is
possible for system services and applications to group their
own child processes (worker processes) in a powerful way,
which may then be used to organize them, or kill them
together, or apply resource limits on them.
Slice units may be used to partition system resources in a
hierarchical fashion and then assign other units to them. By
default there are now three slices: system.slice (for all
system services), user.slice (for all user sessions),
machine.slice (for VMs and containers).
Slices and scopes have been introduced primarily in the
context of the work to move cgroup handling to a
single-writer scheme, where only PID 1
creates/removes/manages cgroups.
* A new mini-daemon "systemd-machined" has been added which
may be used by virtualization managers to register local
VMs/containers. nspawn has been updated accordingly, and
libvirt will be updated shortly. machined will collect a bit
of meta information about the VMs/containers, and assign
them their own scope unit (see above). The collected
meta-data is then made available via the "machinectl" tool,
and exposed in "ps" and similar tools. machined/machinectl
is compile-time optional.
* As discussed earlier, the low-level cgroup configuration
options ControlGroup=, ControlGroupModify=,
ControlGroupPersistent=, ControlGroupAttribute= have been
removed. Please use high-level attribute settings instead as
well as slice units.
* A new bus call SetUnitProperties() has been added to alter
various runtime parameters of a unit. This is primarily
useful to alter cgroup parameters dynamically in a nice way,
but will be extended later on to make more properties
modifiable at runtime. systemctl gained a new set-property
command that wraps this call.
* nspawn will now inform the user explicitly that kernels with
audit enabled break containers, and suggest that the user turn
off audit.
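
As an illustration of how slices nest and how the new runtime property interface is driven from the command line, here is a rough Python sketch. It rests on assumptions the release note does not state: the dash-nesting rule for slice names (as I understand it from systemd.slice(5)), the command spelling `systemctl set-property` with its `--runtime` switch (from systemctl(1)), and `CPUShares=` as an example property. Both helper names are hypothetical.

```python
def slice_to_cgroup_path(slice_name):
    """Map a slice unit name to its relative cgroup path.

    Assumes dashes encode nesting, so "machine-web.slice" lives
    inside "machine.slice". The root slice "-.slice" maps to "".
    """
    suffix = ".slice"
    if not slice_name.endswith(suffix):
        raise ValueError("not a slice unit name: %r" % slice_name)
    stem = slice_name[: -len(suffix)]
    if stem == "-":
        return ""
    parts = stem.split("-")
    # Each prefix of the dash-separated name is one level of the tree.
    levels = ["-".join(parts[: i + 1]) + suffix for i in range(len(parts))]
    return "/".join(levels)


def set_property_argv(unit, assignments, runtime=False):
    """Build the argv for a systemctl set-property call (does not run it)."""
    argv = ["systemctl"]
    if runtime:
        # --runtime makes the change non-persistent across reboots.
        argv.append("--runtime")
    argv += ["set-property", unit]
    argv += ["%s=%s" % (key, value) for key, value in assignments.items()]
    return argv


if __name__ == "__main__":
    print(slice_to_cgroup_path("machine-web.slice"))
    print(" ".join(set_property_argv("foobar.service", {"CPUShares": 2000})))
```

The argv builder deliberately stops short of executing anything, so it can be inspected or logged before being handed to a process runner.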
[systemd-devel] [ANNOUNCE] systemd 204 http://lists.freedesktop.org/archives/systemd-devel/2013-May/010950.html
[systemd-devel] [ANNOUNCE] systemd 203 http://lists.freedesktop.org/archives/systemd-devel/2013-May/010907.html
* systemd-nspawn will now store meta information about a
container on the container's cgroup as extended attribute
fields, including the root directory.
* The cgroup hierarchy has been reworked in many ways. All
objects that any of systemd's components create in the cgroup
tree are now suffixed. More specifically, user sessions are
now placed in cgroups suffixed with ".session", users in
cgroups suffixed with ".user", and nspawn containers in
cgroups suffixed with ".nspawn". Furthermore, all cgroup
names are now escaped in a simple scheme to avoid collision
of userspace object names with kernel filenames. This work
is preparation for making these objects relocatable in the
cgroup tree, in order to allow easy resource partitioning of
these objects without causing naming conflicts.
* libsystemd-logind.so gained a new call
sd_get_machine_names() to enumerate running containers and
VMs (currently only supported by very new libvirt and
nspawn). sd_login_monitor can now be used to watch
VMs/containers coming and going.
* systemd will no longer allow manipulating service paths in
the name=systemd:/system cgroup tree using ControlGroup= in
units. (It is still fine with it in all other directories.)
* There's a new systemd-nspawn@.service service file that may
be used to easily run nspawn containers as system
services. With the container's root directory in
/var/lib/container/foobar it is now sufficient to run
"systemctl start systemd-nspawn@foobar.service" to boot it.
* systemd-cgls gained a new parameter "--machine" to list only
the processes within a certain container.
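
The release note only says that cgroup names are escaped "in a simple scheme". The sketch below shows one way such a scheme can work: prefix an underscore to any name that could be mistaken for a kernel-provided file. It is a simplified approximation based on my understanding of systemd's approach, not the actual implementation; the function names and the controller list are assumptions.

```python
# Files the kernel creates in every cgroup v1 directory.
KERNEL_FILES = {"notify_on_release", "release_agent", "tasks"}

# Assumed sample of controller names; attribute files look like "cpu.shares".
CONTROLLERS = ("cpu", "cpuacct", "cpuset", "memory", "blkio", "devices", "freezer")


def cg_escape(name, controllers=CONTROLLERS):
    """Prefix '_' if the name could collide with a kernel file name."""
    needs_escape = (
        name == ""
        or name[0] in "_."
        or name in KERNEL_FILES
        or name.startswith("cgroup.")
        or any(name.startswith(c + ".") for c in controllers)
    )
    return "_" + name if needs_escape else name


def cg_unescape(name):
    """Undo cg_escape by dropping a single leading underscore."""
    return name[1:] if name.startswith("_") else name


if __name__ == "__main__":
    for sample in ("foobar.nspawn", "tasks", "cpu.shares", "_private"):
        print(sample, "->", cg_escape(sample))
```

Names that already start with an underscore are escaped too, which keeps unescaping unambiguous.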
[systemd-devel] [ANNOUNCE] systemd 202 http://lists.freedesktop.org/archives/systemd-devel/2013-April/010623.html
* systemd-nspawn now places all containers in the new /machine
top-level cgroup directory in the name=systemd
hierarchy. libvirt will soon do the same, so that we get a
uniform separation of /system, /user and /machine for system
services, user processes and containers/virtual
machines. This new cgroup hierarchy is also useful to stick
stable names to specific container instances, which can be
recognized later this way (this name may be controlled
via systemd-nspawn's new -M switch). libsystemd-login also
gained a new call sd_pid_get_machine_name() to retrieve the
name of the container/VM a specific process belongs to.
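
To see which of the /system, /user and /machine branches a process sits in, one can parse /proc/PID/cgroup, whose lines have the form "ID:controllers:path". The following is a small sketch with hypothetical helper names; it is not how sd_pid_get_machine_name() is implemented, and the sample path in the demo reflects the layout described for this release only.

```python
def systemd_cgroup_path(proc_cgroup_text):
    """Return the path in the name=systemd hierarchy, or None if absent."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[1] == "name=systemd":
            return parts[2]
    return None


def top_level_group(path):
    """Return the first path component, e.g. 'machine' for '/machine/foobar'."""
    components = [c for c in path.split("/") if c]
    return components[0] if components else None


if __name__ == "__main__":
    sample = "4:memory:/\n1:name=systemd:/machine/foobar\n"
    path = systemd_cgroup_path(sample)
    print(path, top_level_group(path))
```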
[systemd-devel] [ANNOUNCE] systemd 201 http://lists.freedesktop.org/archives/systemd-devel/2013-April/010274.html
* systemd-cgtop now optionally shows summed up CPU times of
cgroups. Press '%' while running cgtop to switch between
percentage and absolute mode. This is useful to determine
which cgroups use up the most CPU time over the entire
runtime of the system. systemd-cgtop has also been updated
to be 'pipeable' for processing with further shell tools.
[systemd-devel] [ANNOUNCE] systemd 200 http://lists.freedesktop.org/archives/systemd-devel/2013-March/009999.html
[systemd-devel] [ANNOUNCE] systemd 199 http://lists.freedesktop.org/archives/systemd-devel/2013-March/009933.html
[systemd-devel] [ANNOUNCE] systemd 198 http://lists.freedesktop.org/archives/systemd-devel/2013-March/009496.html
* Resource limits (as exposed by the various control group
controllers) can now be controlled dynamically at runtime
for all units. More specifically, you can now use a command
like "systemctl set-cgroup-attr foobar.service cpu.shares
2000" to alter the CPU shares a specific service gets. These
settings are stored persistently on disk, and thus allow the
administrator to easily adjust the resource usage of
services with a few simple commands. This dynamic resource
management logic is also available to other programs via the
bus. Almost any kernel cgroup attribute and controller is
supported.
* nspawn will now implicitly add the CAP_AUDIT_WRITE and
CAP_AUDIT_CONTROL capabilities to the capabilities set for
the container. This makes it easier to boot unmodified
Fedora systems in a container, which however still requires
audit=0 to be passed on the kernel command line. Auditing in
kernel and userspace is unfortunately still too broken in
context of containers, hence we recommend compiling it out
of the kernel or using audit=0. Hopefully this will be fixed
one day for good in the kernel.
* nspawn gained the new --bind= and --bind-ro= parameters to
bind mount specific directories from the host into the
container.
* nspawn will now mount its own devpts file system instance
into the container, in order not to leak pty devices from
the host into the container.
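
To put the "cpu.shares 2000" example from the set-cgroup-attr item in perspective: cpu.shares is a relative weight, so what a service actually gets depends on the weights of the sibling cgroups competing with it. The arithmetic can be sketched as follows, assuming full CPU contention and the kernel's default weight of 1024 for the other group. The function name and the second service are hypothetical.

```python
def cpu_fractions(shares):
    """Turn cpu.shares weights of sibling cgroups into CPU time fractions.

    Only meaningful when all listed groups are competing for CPU at once;
    an idle group's share is redistributed by the scheduler.
    """
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


if __name__ == "__main__":
    # foobar.service raised to 2000, a hypothetical sibling left at 1024.
    fractions = cpu_fractions({"foobar.service": 2000, "other.service": 1024})
    for name, fraction in fractions.items():
        print("%s: %.1f%%" % (name, fraction * 100))
```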
[systemd-devel] [ANNOUNCE] systemd 197 http://lists.freedesktop.org/archives/systemd-devel/2013-January/008048.html
* nspawn may now be invoked without a controlling TTY. This
makes it suitable for invocation as its own service. This
may be used to set up a simple containerized server system
using only core OS tools.
* systemd and nspawn can now accept socket file descriptors
when they are started for socket activation. This enables
implementation of socket activated nspawn
containers, i.e. think about autospawning an entire OS image
when the first SSH or HTTP connection is received. We expect
that similar functionality will also be added to libvirt-lxc
eventually.
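
For readers unfamiliar with how a socket-activated process finds the sockets passed to it, here is a minimal Python sketch of the receiving side of the protocol documented in sd_listen_fds(3): the LISTEN_PID and LISTEN_FDS environment variables, with descriptors starting at fd 3. The function name and its testing-oriented parameters are my own, not part of any systemd API.

```python
import os

# First passed file descriptor, per the sd_listen_fds(3) convention.
SD_LISTEN_FDS_START = 3


def listen_fds(environ=None, pid=None):
    """Return the list of inherited socket fds, or [] if none were passed.

    environ and pid default to the current process; they are parameters
    only so the logic can be exercised without real socket activation.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        # LISTEN_PID guards against fds leaking to unrelated child processes.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))


if __name__ == "__main__":
    print(listen_fds())
```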
[systemd-devel] [ANNOUNCE] systemd v196 http://lists.freedesktop.org/archives/systemd-devel/2012-November/007504.html
[systemd-devel] [ANNOUNCE] systemd 195 http://lists.freedesktop.org/archives/systemd-devel/2012-October/007048.html
Oh, and one more thing. In Fedora I added
"cap_dac_override,cap_sys_ptrace+ep" as file capabilities to
/usr/bin/systemd-detect-virt, so that this useful tool works for
unprivileged users too. (Yeah, cap_sys_ptrace sounds crazy, but Linux
sucks, it's required to read a few things off /proc/1/). The systemd
makefile will do the same, but if you package systemd for your distro
with RPM or suchlike you probably need to declare this explicitly in
your spec file. Note that not adding these caps is not a problem, you'll
just get a clean permission error if you run it as a non-privileged
user. Also, nothing that I am aware of depends on this being run as an
unprivileged user, so this is really just about making a useful tool more
widely available, and not really a dependency for anything.
[systemd-devel] [RELEASE] systemd 194 http://lists.freedesktop.org/archives/systemd-devel/2012-October/006817.html
[systemd-devel] [ANNOUNCE] systemd 193 http://lists.freedesktop.org/archives/systemd-devel/2012-September/006738.html
[systemd-devel] [ANNOUNCE] systemd 192 http://lists.freedesktop.org/archives/systemd-devel/2012-September/006710.html
* We don't mount the "cpuset" controller anymore together with
"cpu" and "cpuacct", as "cpuset" groups generally cannot be
started if no parameters are assigned to it. "cpuset" hence
broke code that assumed it could create "cpu" groups and
just start them.
[systemd-devel] [ANNOUNCE] systemd 191 http://lists.freedesktop.org/archives/systemd-devel/2012-September/006645.html
* nspawn will now create a symlink /etc/localtime in the
container environment, copying the host's timezone
setting. Previously this has been done via a bind mount, but
since symlinks cannot be bind mounted this has now been
changed to create/update the appropriate symlink.
[systemd-devel] [ANNOUNCE] systemd 190 http://lists.freedesktop.org/archives/systemd-devel/2012-September/006625.html
* We will now mount the cgroup controllers cpu, cpuacct,
cpuset and the controllers net_cls, net_prio together by
default.
* nspawn containers will now have a virtualized boot
ID (i.e. /proc/sys/kernel/random/boot_id is now mounted
over with a randomized ID at container initialization). This
has the effect of making "journalctl -b" do the right thing
in a container.
* We now support virtualized reboot() in containers, as
supported by newer kernels. We will fall back to exit() if
CAP_SYS_REBOOT is not available to the container. Also,
nspawn makes use of this now and will actually reboot the
container if the containerized OS asks for that.
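
The boot ID file mentioned above holds a UUID with dashes, while systemd's 128-bit IDs are conventionally written as 32 hexadecimal characters without separators. A small sketch of reading and normalizing it follows; the helper name is hypothetical, and comparing the result against journal output is left as an exercise rather than something this snippet verifies.

```python
import uuid
from pathlib import Path

BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def normalize_boot_id(text):
    """Convert a dashed UUID string into 32 lowercase hex characters."""
    return uuid.UUID(text.strip()).hex


if __name__ == "__main__":
    if BOOT_ID_PATH.exists():
        # Inside an nspawn container this reads the randomized, mounted-over ID.
        print(normalize_boot_id(BOOT_ID_PATH.read_text()))
```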
[systemd-devel] [ANNOUNCE] systemd v189 http://lists.freedesktop.org/archives/systemd-devel/2012-August/006343.html
* The logic for file system namespace (ReadOnlyDirectories=,
ReadWriteDirectories=, PrivateTmp=) has been reworked not to
require pivot_root() anymore. This means fewer temporary
directories are created below /tmp for this feature.
* nspawn containers will now see and receive all submounts
made on the host OS below the root file system of the
container.
* nspawn containers will now be run with /dev/stdin, /dev/fd/
and similar symlinks pre-created. This makes running shells
as container init process a lot more fun.
[systemd-devel] [ANNOUNCE] systemd 188 http://lists.freedesktop.org/archives/systemd-devel/2012-August/006212.html
* cgtop gained a new -n switch (similar to top), to configure
the maximum number of iterations to run for. It also gained
-b, to run in batch mode (accepting no input).
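
The two switches combine naturally for scripted use: batch mode with a fixed iteration count yields a snapshot that another program can capture. A possible Python wrapper is sketched below; the function names are hypothetical, and the output format of systemd-cgtop is not parsed here because it is not described in the release note.

```python
import subprocess


def cgtop_argv(iterations=1):
    """Build a non-interactive systemd-cgtop command line using -b and -n."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return ["systemd-cgtop", "-b", "-n", str(iterations)]


def cgtop_snapshot(iterations=1):
    """Run systemd-cgtop and return its raw text output.

    Requires systemd-cgtop on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        cgtop_argv(iterations), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(cgtop_argv(1)))
```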
[systemd-devel] [ANNOUNCE] systemd 187 http://lists.freedesktop.org/archives/systemd-devel/2012-July/005978.html
* nspawn gained a new --link-journal= switch (and quicker: -j)
to link the container journal with the host. This makes it
very easy to centralize log viewing on the host for all
guests while still keeping the journal files separated.
[systemd-devel] [ANNOUNCE] systemd 186 http://lists.freedesktop.org/archives/systemd-devel/2012-July/005781.html
* systemd-nspawn gained a new --capability= switch to pass
additional capabilities to the container.
* The notify socket is in the abstract namespace again, in
order to support daemons which chroot() at start-up.
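
An abstract-namespace socket has no file system path, which is why it stays reachable after a daemon calls chroot(). On the client side, the NOTIFY_SOCKET variable marks such an address with a leading "@", which must be turned into a NUL byte before use, as described in sd_notify(3). A minimal Python sketch of a notification client, with hypothetical function names:

```python
import os
import socket


def notify_address(raw):
    """Translate a NOTIFY_SOCKET value into an AF_UNIX address.

    A leading '@' denotes the abstract namespace, which Linux encodes
    as a leading NUL byte; plain paths are returned unchanged.
    """
    return "\0" + raw[1:] if raw.startswith("@") else raw


def sd_notify(state, environ=None):
    """Send a state string such as 'READY=1'; return False if not supervised."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NOTIFY_SOCKET")
    if not raw:
        return False
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), notify_address(raw))
    return True


if __name__ == "__main__":
    print(sd_notify("READY=1"))
```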