Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save ravageralpha/5e9cebb85bd26771754e to your computer and use it in GitHub Desktop.
Save ravageralpha/5e9cebb85bd26771754e to your computer and use it in GitHub Desktop.
fixing shitty systemd update
From fa647ab1c8fb6580813973976a4fa646623b769d Mon Sep 17 00:00:00 2001
From: Markus Rathgeb <[email protected]>
Date: Thu, 5 Jun 2014 11:10:44 +0200
Subject: [PATCH] cgroup: add xattr support
Squashed commit of the following:
commit 83ba09a9229baee9af79724ab3ef0840c97b3b74
Author: Aristeu Rozanski <[email protected]>
Date: Thu Aug 23 16:53:30 2012 -0400
cgroup: add xattr support
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Lennart Poettering <[email protected]>
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Aristeu Rozanski <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
commit cac4522c594319e40a39b5427623996761b24b81
Author: Aristeu Rozanski <[email protected]>
Date: Thu Aug 23 16:53:28 2012 -0400
xattr: extract simple_xattr code from tmpfs
Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.
$ size vmlinux.o
text data bss dec hex filename
4658782 880729 5195032 10734543 a3cbcf vmlinux.o
$ size vmlinux.o
text data bss dec hex filename
4658957 880729 5195032 10734718 a3cc7e vmlinux.o
v7:
- checkpatch warnings fixed
- Implement the changes requested by Hugh Dickins:
- make simple_xattrs_init and simple_xattrs_free inline
- get rid of locking and list reinitialization in simple_xattrs_free,
they're not needed
v6:
- no changes
v5:
- no changes
v4:
- move simple_xattrs_free() to fs/xattr.c
v3:
- in kmem_xattrs_free(), reinitialize the list
- use simple_xattr_* prefix
- introduce simple_xattr_add() to prevent direct list usage
Original-patch-by: Li Zefan <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Lennart Poettering <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Aristeu Rozanski <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Conflicts:
mm/shmem.c
commit cd8f6a121c4e0e49e9c15f4f1ab1e2c534a65d3b
Author: Aristeu Rozanski <[email protected]>
Date: Thu Aug 23 16:53:29 2012 -0400
cgroup: revise how we re-populate root directory
When remounting cgroupfs with some subsystems added to it and some
removed, cgroup will remove all the files in root directory and then
re-popluate it.
What I'm doing here is, only remove files which belong to subsystems that
are to be unbinded, and only create files for newly-added subsystems.
The purpose is to have all other files untouched.
This is a preparation for cgroup xattr support.
v7:
- checkpatch warnings fixed
v6:
- no changes
v5:
- no changes
v4:
- refactored cgroup_clear_directory() to not use cgroup_rm_file()
- instead of going thru the list of files, get the file list using the
subsystems
- use 'subsys_mask' instead of {added,removed}_bits and made
cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
v3:
- refresh patches after recent refactoring
Original-patch-by: Li Zefan <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Lennart Poettering <[email protected]>
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Aristeu Rozanski <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
commit fa68f525a488e74d35f44e4cf5b1dbbc83ad8da9
Author: Tejun Heo <[email protected]>
Date: Tue Jul 3 10:38:06 2012 -0700
cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
While refactoring cgroup file removal path, 05ef1d7c4a "cgroup:
introduce struct cfent" incorrectly changed the @dir argument of
simple_unlink() to the inode of the file being deleted instead of that
of the containing directory.
The effect of this bug is minor - ctime and mtime of the parent
weren't properly updated on file deletion.
Fix it by using @cgrp->dentry->d_inode instead.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Al Viro <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: [email protected]
commit 8ca67b989e17a377f6aee5e251b0fa1b10ccd70f
Author: Tejun Heo <[email protected]>
Date: Sat Jul 7 16:08:18 2012 -0700
cgroup: fix cgroup hierarchy umount race
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.
Destroying a superblock which has dentries with positive refcnts is a
critical bug and triggers BUG() in vfs code. As each cgroup dentry
holds an s_active reference, any lingering cgroup has both its dentry
and the superblock pinned and thus preventing premature release of
superblock.
Unfortunately, after 48ddbe1946, there's a small window while
releasing a cgroup which is directly under the root of the hierarchy.
When a cgroup directory is released, vfs layer first deletes the
corresponding dentry and then invokes dput() on the parent, which may
recurse further, so when a cgroup directly below root cgroup is
released, the cgroup is first destroyed - which releases the s_active
it was holding - and then the dentry for the root cgroup is dput().
This creates a window where the root dentry's refcnt isn't zero but
superblock's s_active is. If umount happens before or during this
window, vfs will see the root dentry with non-zero refcnt and trigger
BUG().
Before 48ddbe1946, this problem didn't exist because the last dentry
reference was guaranteed to be put synchronously from rmdir(2)
invocation which holds s_active around the whole process.
Fix it by holding an extra superblock->s_active reference across
dput() from css release, which is the dput() path added by 48ddbe1946
and the only one which doesn't hold an extra s_active ref across the
final cgroup dput().
Signed-off-by: Tejun Heo <[email protected]>
LKML-Reference: <[email protected]>
Reported-by: shyju pv <[email protected]>
Tested-by: shyju pv <[email protected]>
Cc: Sasha Levin <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit 7db74fed5e5137076ce2e2ad8de8bcff08fbaca2
Author: Tejun Heo <[email protected]>
Date: Sat Jul 7 15:55:47 2012 -0700
Revert "cgroup: superblock can't be released with active dentries"
This reverts commit fa980ca87d15bb8a1317853f257a505990f3ffde. The
commit was an attempt to fix a race condition where a cgroup hierarchy
may be unmounted with positive dentry reference on root cgroup. While
the commit made the race condition slightly more difficult to trigger,
the race was still there and could be reliably triggered using a
different test case.
Revert the incorrect fix. The next commit will describe the race and
fix it correctly.
Signed-off-by: Tejun Heo <[email protected]>
LKML-Reference: <[email protected]>
Reported-by: shyju pv <[email protected]>
Cc: Sasha Levin <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit 8a817aa20830078f4027fe9c23ba30d79b1d9d06
Author: Salman Qazi <[email protected]>
Date: Thu Jun 14 14:55:30 2012 -0700
cgroups: Account for CSS_DEACT_BIAS in __css_put
When we fixed the race between atomic_dec and css_refcnt, we missed
the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
the actual reference count. This can potentially cause a refcount leak
if __css_put races with cgroup_clear_css_refs.
Signed-off-by: Salman Qazi <[email protected]>
Acked-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
commit 8a0b43b1785f1eaf533693c65a5880a2951dbd1d
Author: Li Zefan <[email protected]>
Date: Wed Jun 6 19:12:30 2012 -0700
cgroup: remove hierarchy_mutex
It was introduced for memcg to iterate cgroup hierarchy without
holding cgroup_mutex, but soon after that it was replaced with
a lockless way in memcg.
No one used hierarchy_mutex since that, so remove it.
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
commit 2d111e16082c524a12fcb86b7a7d09e46767d7e2
Author: Salman Qazi <[email protected]>
Date: Wed Jun 6 18:51:35 2012 -0700
cgroup: make sure that decisions in __css_put are atomic
__css_put is using atomic_dec on the ref count, and then
looking at the ref count to make decisions. This is prone
to races, as someone else may decrement ref count between
our decrement and our decision. Instead, we should base our
decisions on the value that we decremented the ref count to.
(This results in an actual race on Google's kernel which I
haven't been able to reproduce on the upstream kernel. Having
said that, it's still incorrect by inspection).
Signed-off-by: Salman Qazi <[email protected]>
Acked-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]
commit e742b3c8b306ebc06a1321b1437de20b935ac892
Author: Johannes Weiner <[email protected]>
Date: Tue May 29 15:06:24 2012 -0700
kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite
Library functions should not grab locks when the callsites can do it,
even if the lock nests like the rcu read-side lock does.
Push the rcu_read_lock() from css_is_ancestor() to its single user,
mem_cgroup_same_or_subtree() in preparation for another user that may
already hold the rcu read-side lock.
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
commit 4851db2641ccb6cdb2f32d9507505c52813562b3
Author: Tejun Heo <[email protected]>
Date: Thu May 24 08:24:39 2012 -0700
cgroup: superblock can't be released with active dentries
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.
cgroup_create() does grab an active reference on the superblock to
prevent it from going away while there are !root cgroups; however, the
reference is put from cgroup_diput() which is invoked on cgroup
removal, so cgroup dentries which are removed but persisting due to
lingering csses already have released their superblock active refs
allowing superblock to be killed while those dentries are around.
Given the right condition, this makes cgroup_kill_sb() call
kill_litter_super() with dentries with non-zero d_count leading to
BUG() in shrink_dcache_for_umount_subtree().
Fix it by adding cgroup_dops->d_release() operation and moving
deactivate_super() to it. cgroup_diput() now marks dentry->d_fsdata
with itself if superblock should be deactivated and cgroup_d_release()
deactivates the superblock on dentry release.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Tested-by: Sasha Levin <[email protected]>
LKML-Reference: <CA+1xoqe5hMuxzCRhMy7J0XchDk2ZnuxOHJKikROk1-ReAzcT6g@mail.gmail.com>
Acked-by: Li Zefan <[email protected]>
commit ee7067598065cd269825ac9812040c9e2760d4b4
Author: Eric W. Biederman <[email protected]>
Date: Thu Nov 17 10:23:55 2011 -0800
userns: Add a Kconfig option to enforce strict kuid and kgid type checks
Make it possible to easily switch between strong mandatory
type checks and relaxed type checks so that the code can
easily be tested with the type checks and then built
with the strong type checks disabled so the resulting
code can be used.
Require strong mandatory type checks when enabling the user namespace.
It is very simple to make a typo and use the wrong type allowing
conversions to/from userspace values to be bypassed by accident,
the strong type checks prevent this.
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
commit c5601b73aab8bac90be8b69968bd99e76c0e516d
Author: Eric W. Biederman <[email protected]>
Date: Mon Nov 14 14:29:51 2011 -0800
userns: Add kuid_t and kgid_t and associated infrastructure in uidgid.h
Start distinguishing between internal kernel uids and gids and
values that userspace can use. This is done by introducing two
new types: kuid_t and kgid_t. These types and their associated
functions are infrastructure are declared in the new header
uidgid.h.
Ultimately there will be a different implementation of the mapping
functions for use with user namespaces. But to keep it simple
we introduce the mapping functions first to separate the meat
from the mechanical code conversions.
Export overflowuid and overflowgid so we can use from_kuid_munged
and from_kgid_munged in modular code.
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
commit 531ca1b1f0f5a873d89950d9683f44d9f7001cd3
Author: Mike Galbraith <[email protected]>
Date: Sat Apr 21 09:13:46 2012 +0200
cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
Allowing kthreadd to be moved to a non-root group makes no sense, it being
a global resource, and needlessly leads unsuspecting users toward trouble.
1. An RT workqueue worker thread spawned in a task group with no rt_runtime
allocated is not schedulable. Simple user error, but harmful to the box.
2. A worker thread which acquires PF_THREAD_BOUND can never leave a cpuset,
rendering the cpuset immortal.
Save the user some unexpected trouble, just say no.
Signed-off-by: Mike Galbraith <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
commit 5e4cba06cbe7fcd2b3ffee5c62f79acf4d126e92
Author: Tejun Heo <[email protected]>
Date: Tue Apr 10 10:16:36 2012 -0700
cgroup: remove cgroup_subsys->populate()
With memcg converted, cgroup_subsys->populate() doesn't have any user
left. Remove it.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit a7a79b552a3bf97788d8d22f2888a4f3310c128e
Author: Glauber Costa <[email protected]>
Date: Mon Apr 9 19:36:34 2012 -0300
cgroup: get rid of populate for memcg
The last man standing justifying the need for populate() is the
sock memcg initialization functions. Now that we are able to pass
a struct mem_cgroup instead of a struct cgroup to the socket
initialization, there is nothing that stops us from initializing
everything in create().
Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
CC: Li Zefan <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Michal Hocko <[email protected]>
Conflicts:
mm/memcontrol.c
commit bed6daf96599cbfdec259507eaccd1d1967f2a9a
Author: Glauber Costa <[email protected]>
Date: Mon Apr 9 19:36:33 2012 -0300
cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
The only reason cgroup was used, was to be consistent with the populate()
interface. Now that we're getting rid of it, not only we no longer need
it, but we also *can't* call it this way.
Since we will no longer rely on populate(), this will be called from
create(). During create, the association between struct mem_cgroup
and struct cgroup does not yet exist, since cgroup internals hasn't
yet initialized its bookkeeping. This means we would not be able
to draw the memcg pointer from the cgroup pointer in these
functions, which is highly undesirable.
Signed-off-by: Glauber Costa <[email protected]>
Acked-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
CC: Li Zefan <[email protected]>
CC: Johannes Weiner <[email protected]>
CC: Michal Hocko <[email protected]>
commit 9a98fea2f9879cc84238cfb57fccf0956627cea1
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:56 2012 -0700
cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Conflicts:
kernel/cgroup.c
commit 37e6593be2a784f62e3b7c350a0b778dbc62f922
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:56 2012 -0700
cgroup: use negative bias on css->refcnt to block css_tryget()
When a cgroup is about to be removed, cgroup_clear_css_refs() is
called to check and ensure that there are no active css references.
This is currently achieved by dropping the refcnt to zero iff it has
only the base ref. If all css refs could be dropped to zero, ref
clearing is successful and CSS_REMOVED is set on all css. If not, the
base ref is restored. While css ref is zero w/o CSS_REMOVED set, any
css_tryget() attempt on it busy loops so that they are atomic
w.r.t. the whole css ref clearing.
This does work but dropping and re-instating the base ref is somewhat
hairy and makes it difficult to add more logic to the put path as
there are two of them - the regular css_put() and the reversible base
ref clearing.
This patch updates css ref clearing such that blocking new
css_tryget() and putting the base ref are separate operations.
CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
css_tryget() busy loops while refcnt is negative. After all css refs
are deactivated, if they were all one, ref clearing succeeded and
CSS_REMOVED is set and the base ref is put using the regular
css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
and the original postive values are restored.
css_refcnt() accessor which always returns the unbiased positive
reference counts is added and used to simplify refcnt usages. While
at it, relocate and reformat comments in cgroup_has_css_refs().
This separates css->refcnt deactivation and putting the base ref,
which enables the next patch to make ref clearing optional.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
Conflicts:
kernel/cgroup.c
commit 6ae80cc8464ab30d769e70241d61ddd45865f3bd
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:56 2012 -0700
cgroup: implement cgroup_rm_cftypes()
Implement cgroup_rm_cftypes() which removes an array of cftypes from a
subsystem. It can be called whether the target subsys is attached or
not. cgroup core will remove the specified file from all existing
cgroups.
This will be used to improve sub-subsys modularity and will be helpful
for unified hierarchy.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit d73a46c95efe8514724ad48fcac7dfb29bcac923
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:56 2012 -0700
cgroup: introduce struct cfent
This patch adds cfent (cgroup file entry) which is the association
between a cgroup and a file. This is in-cgroup representation of
files under a cgroup directory. This simplifies walking walking
cgroup files and thus cgroup_clear_directory(), which is now
implemented in two parts - cgroup_rm_file() and a loop around it.
cgroup_rm_file() will be used to implement cftype removal and cfent is
scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
"sever" semantics).
v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
if the file had openers at the time of removal. Moved to
cgroup_diput().
- cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
wasn't empty after removing all files. This triggered
spuriously if some files were open during directory clearing.
Removed.
v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
spuriously triggered for root cgroups because they don't go
through cgroup_clear_directory() on unmount. Don't trigger WARN
for root cgroups.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: Glauber Costa <[email protected]>
commit 82408a55564678646baf265768ce3d33b064e25c
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
cgroup: relocate __d_cgrp() and __d_cft()
Move the two macros upwards as they'll be used earlier in the file.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit 9fbb0bb71a3bbf8b30278100e7804a52e9b25e62
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
cgroup: remove cgroup_add_file[s]()
No controller is using cgroup_add_files[s](). Unexport them, and
convert cgroup_add_files() to handle NULL entry terminated array
instead of taking count explicitly and continue creation on failure
for internal use.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit 9e8ad7804f72e8b6eff7d959390f0a2ce369fdd9
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
cgroup: convert memcg controller to the new cftype interface
Convert memcg to use the new cftype based interface. kmem support
abuses ->populate() for mem_cgroup_sockets_init() so it can't be
removed at the moment.
tcp_memcontrol is updated so that tcp_files[] is registered via a
__initcall. This change also allows removing the forward declaration
of tcp_files[]. Removed.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Greg Thelen <[email protected]>
commit b32023e8c10f97dbc094c7030f5b9d9f760e4935
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
Instead of conditioning creation of memsw files on do_swap_account,
always create the files if compiled-in and fail read/write attempts
with -EOPNOTSUPP if !do_swap_account.
This is suggested by KAMEZAWA to simplify memcg file creation so that
it can use cgroup->subsys_cftypes.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit 304613589145ed4689da8a882d9b6652b93f0b1e
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
cgroup: convert all non-memcg controllers to the new cftype interface
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.
This is functionally identical transformation. There shouldn't be any
visible behavior change.
memcg is rather special and will be converted separately.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Vivek Goyal <[email protected]>
commit 45da257f6dcf9fdfab5a7594f271397cbc731f2b
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
cgroup: relocate cftype and cgroup_subsys definitions in controllers
blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol
unnecessarily define cftype array and cgroup_subsys structures at the
top of the file, which is unconventional and necessiates forward
declaration of methods.
This patch relocates those below the definitions of the methods and
removes the forward declarations. Note that forward declaration of
tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup(). This
will be removed soon by another patch.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit 9799ae980cdd1e311eba6b7e7f9f13aa3528a42d
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
cgroup: merge cft_release_agent cftype array into the base files array
Now that cftype can express whether a file should only be on root,
cft_release_agent can be merged into the base files cftypes array.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit bd6bb9b83926409b7ee13f03def2161266ca6d30
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:55 2012 -0700
cgroup: implement cgroup_add_cftypes() and friends
Currently, cgroup directories are populated by subsys->populate()
callback explicitly creating files on each cgroup creation. This
level of flexibility isn't needed or desirable. It provides largely
unused flexibility which call for abuses while severely limiting what
the core layer can do through the lack of structure and conventions.
Per each cgroup file type, the only distinction that cgroup users is
making is whether a cgroup is root or not, which can easily be
expressed with flags.
This patch introduces cgroup_add_cftypes(). These deal with cftypes
instead of individual files - controllers indicate that certain types
of files exist for certain subsystem. Newly added CFTYPE_*_ON_ROOT
flags indicate whether a cftype should be excluded or created only on
the root cgroup.
cgroup_add_cftypes() can be called any time whether the target
subsystem is currently attached or not. cgroup core will create files
on the existing cgroups as necessary.
Also, cgroup_subsys->base_cftypes is added to ease registration of the
base files for the subsystem. If non-NULL on subsys init, the cftypes
pointed to by ->base_cftypes are automatically registered on subsys
init / load.
Further patches will convert the existing users and remove the file
based interface. Note that this interface allows dynamic addition of
files to an active controller. This will be used for sub-controller
modularity and unified hierarchy in the longer term.
This patch implements the new mechanism but doesn't apply it to any
user.
v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
cgroup_subsys->base_cftypes, which works better for cgroup_subsys
which is loaded as module.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit f7c3e8e02ead0cffbbb1e80a20dab96317950429
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:54 2012 -0700
cgroup: build list of all cgroups under a given cgroupfs_root
Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
going through cgroup->allcg_node. The list is protected by
cgroup_mutex and will be used to improve cgroup file handling.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit 5b4f71cda30e97cdc874ca8a98992135de189d0b
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:54 2012 -0700
cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
cgroup_populate_dir() currently clears all files and then repopulate
the directory; however, the clearing part is only useful when it's
called from cgroup_remount(). Relocate the invocation to
cgroup_remount().
This is to prepare for further cgroup file handling updates.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
commit d8d6a3fa984898b1256b9c126c31acf85eb02d72
Author: Tejun Heo <[email protected]>
Date: Sun Apr 1 12:09:54 2012 -0700
cgroup: deprecate remount option changes
This patch marks the following features for deprecation.
* Rebinding subsys by remount: Never reached useful state - only works
on empty hierarchies.
* release_agent update by remount: release_agent itself will be
replaced with conventional fsnotify notification.
v2: Lennart pointed out that "name=" is necessary for mounts w/o any
controller attached. Drop "name=" deprecation.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: Lennart Poettering <[email protected]>
Conflicts:
Documentation/feature-removal-schedule.txt
commit 4e5190f3c4ca056efdead0f98d081fc5f8c7aa11
Author: Daisuke Nishimura <[email protected]>
Date: Wed Mar 10 15:22:05 2010 -0800
cgroup: introduce coalesce css_get() and css_put()
Current css_get() and css_put() increment/decrement css->refcnt one by
one.
This patch add a new function __css_get(), which takes "count" as a arg
and increment the css->refcnt by "count". And this patch also add a new
arg("count") to __css_put() and change the function to decrement the
css->refcnt by "count".
These coalesce version of __css_get()/__css_put() will be used to improve
performance of memcg's moving charge feature later, where instead of
calling css_get()/css_put() repeatedly, these new functions will be used.
No change is needed for current users of css_get()/css_put().
Signed-off-by: Daisuke Nishimura <[email protected]>
Acked-by: Paul Menage <[email protected]>
Cc: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Conflicts:
kernel/cgroup.c
---
Documentation/cgroups/cgroups.txt | 2 +-
Documentation/feature-removal-schedule.txt | 21 +-
block/blk-cgroup.c | 45 +-
fs/xattr.c | 166 +++++++
include/linux/cgroup.h | 107 ++--
include/linux/shmem_fs.h | 3 +-
include/linux/uidgid.h | 176 +++++++
include/linux/xattr.h | 48 ++
include/net/sock.h | 12 +-
include/net/tcp_memcontrol.h | 4 +-
init/Kconfig | 12 +-
kernel/cgroup.c | 774 +++++++++++++++++++++--------
kernel/cgroup_freezer.c | 11 +-
kernel/cpuset.c | 31 +-
kernel/sched/core.c | 16 +-
kernel/sys.c | 2 -
mm/memcontrol.c | 130 +++--
mm/shmem.c | 181 +------
net/core/netprio_cgroup.c | 30 +-
net/core/sock.c | 10 +-
net/ipv4/tcp_memcontrol.c | 77 ++-
net/sched/cls_cgroup.c | 31 +-
security/device_cgroup.c | 10 +-
23 files changed, 1241 insertions(+), 658 deletions(-)
create mode 100644 include/linux/uidgid.h
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 594ff17..66ad6bd 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -665,7 +665,7 @@ example in cpusets, no task may attach before 'cpus' and 'mems' are set
up.
void bind(struct cgroup *root)
-(cgroup_mutex and ss->hierarchy_mutex held by caller)
+(cgroup_mutex held by caller)
Called when a cgroup subsystem is rebound to a different hierarchy
and root cgroup. Currently this will only involve movement between
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index e4b5775..0eb7118 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -534,18 +534,9 @@ Who: Kees Cook <[email protected]>
----------------------------
-What: setitimer accepts user NULL pointer (value)
-When: 3.6
-Why: setitimer is not returning -EFAULT if user pointer is NULL. This
- violates the spec.
-Who: Sasikantha Babu <[email protected]>
-
-----------------------------
-
-What: V4L2_CID_HCENTER, V4L2_CID_VCENTER V4L2 controls
-When: 3.7
-Why: The V4L2_CID_VCENTER, V4L2_CID_HCENTER controls have been deprecated
- for about 4 years and they are not used by any mainline driver.
- There are newer controls (V4L2_CID_PAN*, V4L2_CID_TILT*) that provide
- similar functionality.
-Who: Sylwester Nawrocki <[email protected]>
+What: cgroup option updates via remount
+When: March 2013
+Why: Remount currently allows changing bound subsystems and
+ release_agent. Rebinding is hardly useful as it only works
+ when the hierarchy is empty and release_agent itself should be
+ replaced with conventional fsnotify.
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ea84a23..126c341 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -28,34 +28,12 @@ static LIST_HEAD(blkio_list);
struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
EXPORT_SYMBOL_GPL(blkio_root_cgroup);
-static struct cgroup_subsys_state *blkiocg_create(struct cgroup *);
-static int blkiocg_can_attach(struct cgroup *, struct cgroup_taskset *);
-static void blkiocg_attach(struct cgroup *, struct cgroup_taskset *);
-static void blkiocg_destroy(struct cgroup *);
-static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
-
/* for encoding cft->private value on file */
#define BLKIOFILE_PRIVATE(x, val) (((x) << 16) | (val))
/* What policy owns the file, proportional or throttle */
#define BLKIOFILE_POLICY(val) (((val) >> 16) & 0xffff)
#define BLKIOFILE_ATTR(val) ((val) & 0xffff)
-struct cgroup_subsys blkio_subsys = {
- .name = "blkio",
- .create = blkiocg_create,
- .can_attach = blkiocg_can_attach,
- .attach = blkiocg_attach,
- .destroy = blkiocg_destroy,
- .populate = blkiocg_populate,
-#ifdef CONFIG_BLK_CGROUP
- /* note: blkio_subsys_id is otherwise defined in blk-cgroup.h */
- .subsys_id = blkio_subsys_id,
-#endif
- .use_id = 1,
- .module = THIS_MODULE,
-};
-EXPORT_SYMBOL_GPL(blkio_subsys);
-
static inline void blkio_policy_insert_node(struct blkio_cgroup *blkcg,
struct blkio_policy_node *pn)
{
@@ -1537,14 +1515,9 @@ struct cftype blkio_files[] = {
.read_map = blkiocg_file_read_map,
},
#endif
+ { } /* terminate */
};
-static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
- return cgroup_add_files(cgroup, subsys, blkio_files,
- ARRAY_SIZE(blkio_files));
-}
-
static void blkiocg_destroy(struct cgroup *cgroup)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
@@ -1658,6 +1631,22 @@ static void blkiocg_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
}
}
+struct cgroup_subsys blkio_subsys = {
+ .name = "blkio",
+ .create = blkiocg_create,
+ .can_attach = blkiocg_can_attach,
+ .attach = blkiocg_attach,
+ .destroy = blkiocg_destroy,
+#ifdef CONFIG_BLK_CGROUP
+ /* note: blkio_subsys_id is otherwise defined in blk-cgroup.h */
+ .subsys_id = blkio_subsys_id,
+#endif
+ .base_cftypes = blkio_files,
+ .use_id = 1,
+ .module = THIS_MODULE,
+};
+EXPORT_SYMBOL_GPL(blkio_subsys);
+
void blkio_policy_register(struct blkio_policy_type *blkiop)
{
spin_lock(&blkio_list_lock);
diff --git a/fs/xattr.c b/fs/xattr.c
index 3c8c1cc..050f63a 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -779,3 +779,169 @@ EXPORT_SYMBOL(generic_getxattr);
EXPORT_SYMBOL(generic_listxattr);
EXPORT_SYMBOL(generic_setxattr);
EXPORT_SYMBOL(generic_removexattr);
+
+/*
+ * Allocate new xattr and copy in the value; but leave the name to callers.
+ */
+struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
+{
+ struct simple_xattr *new_xattr;
+ size_t len;
+
+ /* wrap around? */
+ len = sizeof(*new_xattr) + size;
+ if (len <= sizeof(*new_xattr))
+ return NULL;
+
+ new_xattr = kmalloc(len, GFP_KERNEL);
+ if (!new_xattr)
+ return NULL;
+
+ new_xattr->size = size;
+ memcpy(new_xattr->value, value, size);
+ return new_xattr;
+}
+
+/*
+ * xattr GET operation for in-memory/pseudo filesystems
+ */
+int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
+ void *buffer, size_t size)
+{
+ struct simple_xattr *xattr;
+ int ret = -ENODATA;
+
+ spin_lock(&xattrs->lock);
+ list_for_each_entry(xattr, &xattrs->head, list) {
+ if (strcmp(name, xattr->name))
+ continue;
+
+ ret = xattr->size;
+ if (buffer) {
+ if (size < xattr->size)
+ ret = -ERANGE;
+ else
+ memcpy(buffer, xattr->value, xattr->size);
+ }
+ break;
+ }
+ spin_unlock(&xattrs->lock);
+ return ret;
+}
+
+static int __simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
+ const void *value, size_t size, int flags)
+{
+ struct simple_xattr *xattr;
+ struct simple_xattr *new_xattr = NULL;
+ int err = 0;
+
+ /* value == NULL means remove */
+ if (value) {
+ new_xattr = simple_xattr_alloc(value, size);
+ if (!new_xattr)
+ return -ENOMEM;
+
+ new_xattr->name = kstrdup(name, GFP_KERNEL);
+ if (!new_xattr->name) {
+ kfree(new_xattr);
+ return -ENOMEM;
+ }
+ }
+
+ spin_lock(&xattrs->lock);
+ list_for_each_entry(xattr, &xattrs->head, list) {
+ if (!strcmp(name, xattr->name)) {
+ if (flags & XATTR_CREATE) {
+ xattr = new_xattr;
+ err = -EEXIST;
+ } else if (new_xattr) {
+ list_replace(&xattr->list, &new_xattr->list);
+ } else {
+ list_del(&xattr->list);
+ }
+ goto out;
+ }
+ }
+ if (flags & XATTR_REPLACE) {
+ xattr = new_xattr;
+ err = -ENODATA;
+ } else {
+ list_add(&new_xattr->list, &xattrs->head);
+ xattr = NULL;
+ }
+out:
+ spin_unlock(&xattrs->lock);
+ if (xattr) {
+ kfree(xattr->name);
+ kfree(xattr);
+ }
+ return err;
+
+}
+
+/*
+ * xattr SET operation for in-memory/pseudo filesystems
+ */
+int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
+ const void *value, size_t size, int flags)
+{
+ if (size == 0)
+ value = ""; /* empty EA, do not remove */
+ return __simple_xattr_set(xattrs, name, value, size, flags);
+}
+
+/*
+ * xattr REMOVE operation for in-memory/pseudo filesystems
+ */
+int simple_xattr_remove(struct simple_xattrs *xattrs, const char *name)
+{
+ return __simple_xattr_set(xattrs, name, NULL, 0, XATTR_REPLACE);
+}
+
+static bool xattr_is_trusted(const char *name)
+{
+ return !strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN);
+}
+
+/*
+ * xattr LIST operation for in-memory/pseudo filesystems
+ */
+ssize_t simple_xattr_list(struct simple_xattrs *xattrs, char *buffer,
+ size_t size)
+{
+ bool trusted = capable(CAP_SYS_ADMIN);
+ struct simple_xattr *xattr;
+ size_t used = 0;
+
+ spin_lock(&xattrs->lock);
+ list_for_each_entry(xattr, &xattrs->head, list) {
+ size_t len;
+
+ /* skip "trusted." attributes for unprivileged callers */
+ if (!trusted && xattr_is_trusted(xattr->name))
+ continue;
+
+ len = strlen(xattr->name) + 1;
+ used += len;
+ if (buffer) {
+ if (size < used) {
+ used = -ERANGE;
+ break;
+ }
+ memcpy(buffer, xattr->name, len);
+ buffer += len;
+ }
+ }
+ spin_unlock(&xattrs->lock);
+
+ return used;
+}
+
+void simple_xattr_list_add(struct simple_xattrs *xattrs,
+ struct simple_xattr *new_xattr)
+{
+ spin_lock(&xattrs->lock);
+ list_add(&new_xattr->list, &xattrs->head);
+ spin_unlock(&xattrs->lock);
+}
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6719089..d752505 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -16,6 +16,8 @@
#include <linux/prio_heap.h>
#include <linux/rwsem.h>
#include <linux/idr.h>
+#include <linux/workqueue.h>
+#include <linux/xattr.h>
#ifdef CONFIG_CGROUPS
@@ -75,14 +77,24 @@ struct cgroup_subsys_state {
unsigned long flags;
/* ID for this css, if possible */
struct css_id __rcu *id;
+
+ /* Used to put @cgroup->dentry on the last css_put() */
+ struct work_struct dput_work;
};
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
CSS_REMOVED, /* This CSS is dead */
+ CSS_CLEAR_CSS_REFS, /* @ss->__DEPRECATED_clear_css_refs */
};
+/* Caller must verify that the css is not for root cgroup */
+static inline void __css_get(struct cgroup_subsys_state *css, int count)
+{
+ atomic_add(count, &css->refcnt);
+}
+
/*
* Call css_get() to hold a reference on the css; it can be used
* for a reference obtained via:
@@ -109,16 +121,12 @@ static inline bool css_is_removed(struct cgroup_subsys_state *css)
* the css has been destroyed.
*/
+extern bool __css_tryget(struct cgroup_subsys_state *css);
static inline bool css_tryget(struct cgroup_subsys_state *css)
{
if (test_bit(CSS_ROOT, &css->flags))
return true;
- while (!atomic_inc_not_zero(&css->refcnt)) {
- if (test_bit(CSS_REMOVED, &css->flags))
- return false;
- cpu_relax();
- }
- return true;
+ return __css_tryget(css);
}
/*
@@ -126,11 +134,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
* css_get() or css_tryget()
*/
-extern void __css_put(struct cgroup_subsys_state *css, int count);
+extern void __css_put(struct cgroup_subsys_state *css);
static inline void css_put(struct cgroup_subsys_state *css)
{
if (!test_bit(CSS_ROOT, &css->flags))
- __css_put(css, 1);
+ __css_put(css);
}
/* bits in struct cgroup flags field */
@@ -166,6 +174,7 @@ struct cgroup {
*/
struct list_head sibling; /* my parent's children */
struct list_head children; /* my children */
+ struct list_head files; /* my files */
struct cgroup *parent; /* my parent */
struct dentry __rcu *dentry; /* cgroup fs entry, RCU protected */
@@ -182,6 +191,9 @@ struct cgroup {
*/
struct list_head css_sets;
+ struct list_head allcg_node; /* cgroupfs_root->allcg_list */
+ struct list_head cft_q_node; /* used during cftype add/rm */
+
/*
* Linked list running through all cgroups that can
* potentially be reaped by the release agent. Protected by
@@ -202,6 +214,9 @@ struct cgroup {
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
+
+ /* directory xattrs */
+ struct simple_xattrs xattrs;
};
/*
@@ -267,11 +282,17 @@ struct cgroup_map_cb {
* - the 'cftype' of the file is file->f_dentry->d_fsdata
*/
-#define MAX_CFTYPE_NAME 64
+/* cftype->flags */
+#define CFTYPE_ONLY_ON_ROOT (1U << 0) /* only create on root cg */
+#define CFTYPE_NOT_ON_ROOT (1U << 1) /* don't create onp root cg */
+
+#define MAX_CFTYPE_NAME 64
+
struct cftype {
/*
* By convention, the name should begin with the name of the
- * subsystem, followed by a period
+ * subsystem, followed by a period. Zero length string indicates
+ * end of cftype array.
*/
char name[MAX_CFTYPE_NAME];
int private;
@@ -287,6 +308,12 @@ struct cftype {
*/
size_t max_write_len;
+ /* CFTYPE_* flags */
+ unsigned int flags;
+
+ /* file xattrs */
+ struct simple_xattrs xattrs;
+
int (*open)(struct inode *inode, struct file *file);
ssize_t (*read)(struct cgroup *cgrp, struct cftype *cft,
struct file *file,
@@ -365,6 +392,16 @@ struct cftype {
struct eventfd_ctx *eventfd);
};
+/*
+ * cftype_sets describe cftypes belonging to a subsystem and are chained at
+ * cgroup_subsys->cftsets. Each cftset points to an array of cftypes
+ * terminated by zero length name.
+ */
+struct cftype_set {
+ struct list_head node; /* chained at subsys->cftsets */
+ struct cftype *cfts;
+};
+
struct cgroup_scanner {
struct cgroup *cg;
int (*test_task)(struct task_struct *p, struct cgroup_scanner *scan);
@@ -374,21 +411,8 @@ struct cgroup_scanner {
void *data;
};
-/*
- * Add a new file to the given cgroup directory. Should only be
- * called by subsystems from within a populate() method
- */
-int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,
- const struct cftype *cft);
-
-/*
- * Add a set of new files to the given cgroup directory. Should
- * only be called by subsystems from within a populate() method
- */
-int cgroup_add_files(struct cgroup *cgrp,
- struct cgroup_subsys *subsys,
- const struct cftype cft[],
- int count);
+int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
+int cgroup_rm_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
int cgroup_is_removed(const struct cgroup *cgrp);
@@ -454,7 +478,6 @@ struct cgroup_subsys {
void (*fork)(struct task_struct *task);
void (*exit)(struct cgroup *cgrp, struct cgroup *old_cgrp,
struct task_struct *task);
- int (*populate)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*post_clone)(struct cgroup *cgrp);
void (*bind)(struct cgroup *root);
@@ -467,25 +490,24 @@ struct cgroup_subsys {
* (not available in early_init time.)
*/
bool use_id;
-#define MAX_CGROUP_TYPE_NAMELEN 32
- const char *name;
/*
- * Protects sibling/children links of cgroups in this
- * hierarchy, plus protects which hierarchy (or none) the
- * subsystem is a part of (i.e. root/sibling). To avoid
- * potential deadlocks, the following operations should not be
- * undertaken while holding any hierarchy_mutex:
+ * If %true, cgroup removal will try to clear css refs by retrying
+ * ss->pre_destroy() until there's no css ref left. This behavior
+ * is strictly for backward compatibility and will be removed as
+ * soon as the current user (memcg) is updated.
*
- * - allocating memory
- * - initiating hotplug events
+ * If %false, ss->pre_destroy() can't fail and cgroup removal won't
+ * wait for css refs to drop to zero before proceeding.
*/
- struct mutex hierarchy_mutex;
- struct lock_class_key subsys_key;
+ bool __DEPRECATED_clear_css_refs;
+
+#define MAX_CGROUP_TYPE_NAMELEN 32
+ const char *name;
/*
* Link to parent, and list entry in parent's children.
- * Protected by this->hierarchy_mutex and cgroup_lock()
+ * Protected by cgroup_lock()
*/
struct cgroupfs_root *root;
struct list_head sibling;
@@ -493,6 +515,13 @@ struct cgroup_subsys {
struct idr idr;
spinlock_t id_lock;
+ /* list of cftype_sets */
+ struct list_head cftsets;
+
+ /* base cftypes, automatically [de]registered with subsys itself */
+ struct cftype *base_cftypes;
+ struct cftype_set base_cftset;
+
/* should be defined only by modular subsystems */
struct module *module;
};
@@ -604,7 +633,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
* the lifetime of cgroup_subsys_state is subsys's matter.
*
* Looking up and scanning function should be called under rcu_read_lock().
- * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for following calls.
+ * Taking cgroup_mutex is not necessary for following calls.
* But the css returned by this routine can be "not populated yet" or "being
* destroyed". The caller should check css and cgroup's status.
*/
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 79ab255..72263bd 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -5,6 +5,7 @@
#include <linux/mempolicy.h>
#include <linux/pagemap.h>
#include <linux/percpu_counter.h>
+#include <linux/xattr.h>
/* inode in-kernel data */
@@ -18,7 +19,7 @@ struct shmem_inode_info {
};
struct shared_policy policy; /* NUMA memory alloc policy */
struct list_head swaplist; /* chain of maybes on swap */
- struct list_head xattr_list; /* list of shmem_xattr */
+ struct simple_xattrs xattrs; /* list of xattrs */
struct inode vfs_inode;
};
diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h
new file mode 100644
index 0000000..5398568
--- /dev/null
+++ b/include/linux/uidgid.h
@@ -0,0 +1,176 @@
+#ifndef _LINUX_UIDGID_H
+#define _LINUX_UIDGID_H
+
+/*
+ * A set of types for the internal kernel types representing uids and gids.
+ *
+ * The types defined in this header allow distinguishing which uids and gids in
+ * the kernel are values used by userspace and which uid and gid values are
+ * the internal kernel values. With the addition of user namespaces the values
+ * can be different. Using the type system makes it possible for the compiler
+ * to detect when we overlook these differences.
+ *
+ */
+#include <linux/types.h>
+#include <linux/highuid.h>
+
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+
+#ifdef CONFIG_UIDGID_STRICT_TYPE_CHECKS
+
+typedef struct {
+ uid_t val;
+} kuid_t;
+
+
+typedef struct {
+ gid_t val;
+} kgid_t;
+
+#define KUIDT_INIT(value) (kuid_t){ value }
+#define KGIDT_INIT(value) (kgid_t){ value }
+
+static inline uid_t __kuid_val(kuid_t uid)
+{
+ return uid.val;
+}
+
+static inline gid_t __kgid_val(kgid_t gid)
+{
+ return gid.val;
+}
+
+#else
+
+typedef uid_t kuid_t;
+typedef gid_t kgid_t;
+
+static inline uid_t __kuid_val(kuid_t uid)
+{
+ return uid;
+}
+
+static inline gid_t __kgid_val(kgid_t gid)
+{
+ return gid;
+}
+
+#define KUIDT_INIT(value) ((kuid_t) value )
+#define KGIDT_INIT(value) ((kgid_t) value )
+
+#endif
+
+#define GLOBAL_ROOT_UID KUIDT_INIT(0)
+#define GLOBAL_ROOT_GID KGIDT_INIT(0)
+
+#define INVALID_UID KUIDT_INIT(-1)
+#define INVALID_GID KGIDT_INIT(-1)
+
+static inline bool uid_eq(kuid_t left, kuid_t right)
+{
+ return __kuid_val(left) == __kuid_val(right);
+}
+
+static inline bool gid_eq(kgid_t left, kgid_t right)
+{
+ return __kgid_val(left) == __kgid_val(right);
+}
+
+static inline bool uid_gt(kuid_t left, kuid_t right)
+{
+ return __kuid_val(left) > __kuid_val(right);
+}
+
+static inline bool gid_gt(kgid_t left, kgid_t right)
+{
+ return __kgid_val(left) > __kgid_val(right);
+}
+
+static inline bool uid_gte(kuid_t left, kuid_t right)
+{
+ return __kuid_val(left) >= __kuid_val(right);
+}
+
+static inline bool gid_gte(kgid_t left, kgid_t right)
+{
+ return __kgid_val(left) >= __kgid_val(right);
+}
+
+static inline bool uid_lt(kuid_t left, kuid_t right)
+{
+ return __kuid_val(left) < __kuid_val(right);
+}
+
+static inline bool gid_lt(kgid_t left, kgid_t right)
+{
+ return __kgid_val(left) < __kgid_val(right);
+}
+
+static inline bool uid_lte(kuid_t left, kuid_t right)
+{
+ return __kuid_val(left) <= __kuid_val(right);
+}
+
+static inline bool gid_lte(kgid_t left, kgid_t right)
+{
+ return __kgid_val(left) <= __kgid_val(right);
+}
+
+static inline bool uid_valid(kuid_t uid)
+{
+ return !uid_eq(uid, INVALID_UID);
+}
+
+static inline bool gid_valid(kgid_t gid)
+{
+ return !gid_eq(gid, INVALID_GID);
+}
+
+static inline kuid_t make_kuid(struct user_namespace *from, uid_t uid)
+{
+ return KUIDT_INIT(uid);
+}
+
+static inline kgid_t make_kgid(struct user_namespace *from, gid_t gid)
+{
+ return KGIDT_INIT(gid);
+}
+
+static inline uid_t from_kuid(struct user_namespace *to, kuid_t kuid)
+{
+ return __kuid_val(kuid);
+}
+
+static inline gid_t from_kgid(struct user_namespace *to, kgid_t kgid)
+{
+ return __kgid_val(kgid);
+}
+
+static inline uid_t from_kuid_munged(struct user_namespace *to, kuid_t kuid)
+{
+ uid_t uid = from_kuid(to, kuid);
+ if (uid == (uid_t)-1)
+ uid = overflowuid;
+ return uid;
+}
+
+static inline gid_t from_kgid_munged(struct user_namespace *to, kgid_t kgid)
+{
+ gid_t gid = from_kgid(to, kgid);
+ if (gid == (gid_t)-1)
+ gid = overflowgid;
+ return gid;
+}
+
+static inline bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid)
+{
+ return true;
+}
+
+static inline bool kgid_has_mapping(struct user_namespace *ns, kgid_t gid)
+{
+ return true;
+}
+
+#endif /* _LINUX_UIDGID_H */
diff --git a/include/linux/xattr.h b/include/linux/xattr.h
index e5d1220..2ace7a6 100644
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -59,7 +59,9 @@
#ifdef __KERNEL__
+#include <linux/slab.h>
#include <linux/types.h>
+#include <linux/spinlock.h>
struct inode;
struct dentry;
@@ -96,6 +98,52 @@ ssize_t vfs_getxattr_alloc(struct dentry *dentry, const char *name,
char **xattr_value, size_t size, gfp_t flags);
int vfs_xattr_cmp(struct dentry *dentry, const char *xattr_name,
const char *value, size_t size, gfp_t flags);
+
+struct simple_xattrs {
+ struct list_head head;
+ spinlock_t lock;
+};
+
+struct simple_xattr {
+ struct list_head list;
+ char *name;
+ size_t size;
+ char value[0];
+};
+
+/*
+ * initialize the simple_xattrs structure
+ */
+static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
+{
+ INIT_LIST_HEAD(&xattrs->head);
+ spin_lock_init(&xattrs->lock);
+}
+
+/*
+ * free all the xattrs
+ */
+static inline void simple_xattrs_free(struct simple_xattrs *xattrs)
+{
+ struct simple_xattr *xattr, *node;
+
+ list_for_each_entry_safe(xattr, node, &xattrs->head, list) {
+ kfree(xattr->name);
+ kfree(xattr);
+ }
+}
+
+struct simple_xattr *simple_xattr_alloc(const void *value, size_t size);
+int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
+ void *buffer, size_t size);
+int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
+ const void *value, size_t size, int flags);
+int simple_xattr_remove(struct simple_xattrs *xattrs, const char *name);
+ssize_t simple_xattr_list(struct simple_xattrs *xattrs, char *buffer,
+ size_t size);
+void simple_xattr_list_add(struct simple_xattrs *xattrs,
+ struct simple_xattr *new_xattr);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_XATTR_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index f673ba5..e3ab749 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -70,16 +70,16 @@
struct cgroup;
struct cgroup_subsys;
#ifdef CONFIG_NET
-int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss);
-void mem_cgroup_sockets_destroy(struct cgroup *cgrp);
+int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
+void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
#else
static inline
-int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss)
+int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
return 0;
}
static inline
-void mem_cgroup_sockets_destroy(struct cgroup *cgrp)
+void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
{
}
#endif
@@ -915,9 +915,9 @@ struct proto {
* This function has to setup any files the protocol want to
* appear in the kmem cgroup filesystem.
*/
- int (*init_cgroup)(struct cgroup *cgrp,
+ int (*init_cgroup)(struct mem_cgroup *memcg,
struct cgroup_subsys *ss);
- void (*destroy_cgroup)(struct cgroup *cgrp);
+ void (*destroy_cgroup)(struct mem_cgroup *memcg);
struct cg_proto *(*proto_cgroup)(struct mem_cgroup *memcg);
#endif
};
diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
index 48410ff..7df18bc 100644
--- a/include/net/tcp_memcontrol.h
+++ b/include/net/tcp_memcontrol.h
@@ -12,8 +12,8 @@ struct tcp_memcontrol {
};
struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
-int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
-void tcp_destroy_cgroup(struct cgroup *cgrp);
+int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
+void tcp_destroy_cgroup(struct mem_cgroup *memcg);
unsigned long long tcp_max_memory(const struct mem_cgroup *memcg);
void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx);
#endif /* _TCP_MEMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index a7cffc8..931d5b5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -828,7 +828,8 @@ config IPC_NS
config USER_NS
bool "User namespace (EXPERIMENTAL)"
depends on EXPERIMENTAL
- default y
+ select UIDGID_STRICT_TYPE_CHECKS
+ default n
help
This allows containers, i.e. vservers, to use user namespaces
to provide different user info for different servers.
@@ -852,6 +853,15 @@ config NET_NS
endif # NAMESPACES
+config UIDGID_STRICT_TYPE_CHECKS
+ bool "Require conversions between uid/gids and their internal representation"
+ default n
+ help
+ While the nececessary conversions are being added to all subsystems this option allows
+ the code to continue to build for unconverted subsystems.
+
+ Say Y here if you want the strict type checking enabled
+
config SCHED_AUTOGROUP
bool "Automatic process group scheduling"
select EVENTFD
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 714ac5d..5bdbcdf 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -60,9 +60,13 @@
#include <linux/eventfd.h>
#include <linux/poll.h>
#include <linux/flex_array.h> /* used in cgroup_attach_proc */
+#include <linux/kthread.h>
#include <linux/atomic.h>
+/* css deactivation bias, makes css->refcnt negative to deny new trygets */
+#define CSS_DEACT_BIAS INT_MIN
+
/*
* cgroup_mutex is the master lock. Any modification to cgroup or its
* hierarchy must be performed while holding it.
@@ -127,6 +131,9 @@ struct cgroupfs_root {
/* A list running through the active hierarchies */
struct list_head root_list;
+ /* All cgroups on this root, cgroup_mutex protected */
+ struct list_head allcg_list;
+
/* Hierarchy-specific flags */
unsigned long flags;
@@ -145,6 +152,15 @@ struct cgroupfs_root {
static struct cgroupfs_root rootnode;
/*
+ * cgroupfs file entry, pointed to from leaf dentry->d_fsdata.
+ */
+struct cfent {
+ struct list_head node;
+ struct dentry *dentry;
+ struct cftype *type;
+};
+
+/*
* CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
* cgroup_subsys->use_id != 0.
*/
@@ -239,6 +255,19 @@ int cgroup_lock_is_held(void)
EXPORT_SYMBOL_GPL(cgroup_lock_is_held);
+static int css_unbias_refcnt(int refcnt)
+{
+ return refcnt >= 0 ? refcnt : refcnt - CSS_DEACT_BIAS;
+}
+
+/* the current nr of refs, always >= 0 whether @css is deactivated or not */
+static int css_refcnt(struct cgroup_subsys_state *css)
+{
+ int v = atomic_read(&css->refcnt);
+
+ return css_unbias_refcnt(v);
+}
+
/* convenient tests for these bits */
inline int cgroup_is_removed(const struct cgroup *cgrp)
{
@@ -247,7 +276,8 @@ inline int cgroup_is_removed(const struct cgroup *cgrp)
/* bits in struct cgroupfs_root flags field */
enum {
- ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
+ ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
+ ROOT_XATTR, /* supports extended attributes */
};
static int cgroup_is_releasable(const struct cgroup *cgrp)
@@ -279,6 +309,21 @@ list_for_each_entry(_ss, &_root->subsys_list, sibling)
#define for_each_active_root(_root) \
list_for_each_entry(_root, &roots, root_list)
+static inline struct cgroup *__d_cgrp(struct dentry *dentry)
+{
+ return dentry->d_fsdata;
+}
+
+static inline struct cfent *__d_cfe(struct dentry *dentry)
+{
+ return dentry->d_fsdata;
+}
+
+static inline struct cftype *__d_cft(struct dentry *dentry)
+{
+ return __d_cfe(dentry)->type;
+}
+
/* the list of cgroups eligible for automatic release. Protected by
* release_list_lock */
static LIST_HEAD(release_list);
@@ -818,7 +863,8 @@ EXPORT_SYMBOL_GPL(cgroup_unlock);
static int cgroup_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode);
static struct dentry *cgroup_lookup(struct inode *, struct dentry *, struct nameidata *);
static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry);
-static int cgroup_populate_dir(struct cgroup *cgrp);
+static int cgroup_populate_dir(struct cgroup *cgrp, bool base_files,
+ unsigned long subsys_mask);
static const struct inode_operations cgroup_dir_inode_operations;
static const struct file_operations proc_cgroupstats_operations;
@@ -854,12 +900,17 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp)
struct cgroup_subsys *ss;
int ret = 0;
- for_each_subsys(cgrp->root, ss)
- if (ss->pre_destroy) {
- ret = ss->pre_destroy(cgrp);
- if (ret)
- break;
+ for_each_subsys(cgrp->root, ss) {
+ if (!ss->pre_destroy)
+ continue;
+
+ ret = ss->pre_destroy(cgrp);
+ if (ret) {
+ /* ->pre_destroy() failure is being deprecated */
+ WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs);
+ break;
}
+ }
return ret;
}
@@ -901,7 +952,19 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
*/
BUG_ON(!list_empty(&cgrp->pidlists));
+ simple_xattrs_free(&cgrp->xattrs);
+
kfree_rcu(cgrp, rcu_head);
+ } else {
+ struct cfent *cfe = __d_cfe(dentry);
+ struct cgroup *cgrp = dentry->d_parent->d_fsdata;
+ struct cftype *cft = cfe->type;
+
+ WARN_ONCE(!list_empty(&cfe->node) &&
+ cgrp != &cgrp->root->top_cgroup,
+ "cfe still linked for %s\n", cfe->type->name);
+ kfree(cfe);
+ simple_xattrs_free(&cft->xattrs);
}
iput(inode);
}
@@ -920,34 +983,53 @@ static void remove_dir(struct dentry *d)
dput(parent);
}
-static void cgroup_clear_directory(struct dentry *dentry)
-{
- struct list_head *node;
-
- BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
- spin_lock(&dentry->d_lock);
- node = dentry->d_subdirs.next;
- while (node != &dentry->d_subdirs) {
- struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
-
- spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
- list_del_init(node);
- if (d->d_inode) {
- /* This should never be called on a cgroup
- * directory with child cgroups */
- BUG_ON(d->d_inode->i_mode & S_IFDIR);
- dget_dlock(d);
- spin_unlock(&d->d_lock);
- spin_unlock(&dentry->d_lock);
- d_delete(d);
- simple_unlink(dentry->d_inode, d);
- dput(d);
- spin_lock(&dentry->d_lock);
- } else
- spin_unlock(&d->d_lock);
- node = dentry->d_subdirs.next;
+static int cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft)
+{
+ struct cfent *cfe;
+
+ lockdep_assert_held(&cgrp->dentry->d_inode->i_mutex);
+ lockdep_assert_held(&cgroup_mutex);
+
+ list_for_each_entry(cfe, &cgrp->files, node) {
+ struct dentry *d = cfe->dentry;
+
+ if (cft && cfe->type != cft)
+ continue;
+
+ dget(d);
+ d_delete(d);
+ simple_unlink(cgrp->dentry->d_inode, d);
+ list_del_init(&cfe->node);
+ dput(d);
+
+ return 0;
+ }
+ return -ENOENT;
+}
+
+/**
+ * cgroup_clear_directory - selective removal of base and subsystem files
+ * @dir: directory containing the files
+ * @base_files: true if the base files should be removed
+ * @subsys_mask: mask of the subsystem ids whose files should be removed
+ */
+static void cgroup_clear_directory(struct dentry *dir, bool base_files,
+ unsigned long subsys_mask)
+{
+ struct cgroup *cgrp = __d_cgrp(dir);
+ struct cgroup_subsys *ss;
+
+ for_each_subsys(cgrp->root, ss) {
+ struct cftype_set *set;
+ if (!test_bit(ss->subsys_id, &subsys_mask))
+ continue;
+ list_for_each_entry(set, &ss->cftsets, node)
+ cgroup_rm_file(cgrp, set->cfts);
+ }
+ if (base_files) {
+ while (!list_empty(&cgrp->files))
+ cgroup_rm_file(cgrp, NULL);
}
- spin_unlock(&dentry->d_lock);
}
/*
@@ -956,8 +1038,9 @@ static void cgroup_clear_directory(struct dentry *dentry)
static void cgroup_d_remove_dir(struct dentry *dentry)
{
struct dentry *parent;
+ struct cgroupfs_root *root = dentry->d_sb->s_fs_info;
- cgroup_clear_directory(dentry);
+ cgroup_clear_directory(dentry, true, root->subsys_bits);
parent = dentry->d_parent;
spin_lock(&parent->d_lock);
@@ -1020,28 +1103,24 @@ static int rebind_subsystems(struct cgroupfs_root *root,
BUG_ON(cgrp->subsys[i]);
BUG_ON(!dummytop->subsys[i]);
BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
- mutex_lock(&ss->hierarchy_mutex);
cgrp->subsys[i] = dummytop->subsys[i];
cgrp->subsys[i]->cgroup = cgrp;
list_move(&ss->sibling, &root->subsys_list);
ss->root = root;
if (ss->bind)
ss->bind(cgrp);
- mutex_unlock(&ss->hierarchy_mutex);
/* refcount was already taken, and we're keeping it */
} else if (bit & removed_bits) {
/* We're removing this subsystem */
BUG_ON(ss == NULL);
BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
- mutex_lock(&ss->hierarchy_mutex);
if (ss->bind)
ss->bind(dummytop);
dummytop->subsys[i]->cgroup = dummytop;
cgrp->subsys[i] = NULL;
subsys[i]->root = &rootnode;
list_move(&ss->sibling, &rootnode.subsys_list);
- mutex_unlock(&ss->hierarchy_mutex);
/* subsystem is now free - drop reference on module */
module_put(ss->module);
} else if (bit & final_bits) {
@@ -1077,6 +1156,8 @@ static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry)
seq_printf(seq, ",%s", ss->name);
if (test_bit(ROOT_NOPREFIX, &root->flags))
seq_puts(seq, ",noprefix");
+ if (test_bit(ROOT_XATTR, &root->flags))
+ seq_puts(seq, ",xattr");
if (strlen(root->release_agent_path))
seq_printf(seq, ",release_agent=%s", root->release_agent_path);
if (clone_children(&root->top_cgroup))
@@ -1145,6 +1226,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
opts->clone_children = true;
continue;
}
+ if (!strcmp(token, "xattr")) {
+ set_bit(ROOT_XATTR, &opts->flags);
+ continue;
+ }
if (!strncmp(token, "release_agent=", 14)) {
/* Specifying two release agents is forbidden */
if (opts->release_agent)
@@ -1295,6 +1380,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
struct cgroupfs_root *root = sb->s_fs_info;
struct cgroup *cgrp = &root->top_cgroup;
struct cgroup_sb_opts opts;
+ unsigned long added_bits, removed_bits;
mutex_lock(&cgrp->dentry->d_inode->i_mutex);
mutex_lock(&cgroup_mutex);
@@ -1305,6 +1391,14 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
if (ret)
goto out_unlock;
+ /* See feature-removal-schedule.txt */
+ if (opts.subsys_bits != root->actual_subsys_bits || opts.release_agent)
+ pr_warning("cgroup: option changes via remount are deprecated (pid=%d comm=%s)\n",
+ task_tgid_nr(current), current->comm);
+
+ added_bits = opts.subsys_bits & ~root->subsys_bits;
+ removed_bits = root->subsys_bits & ~opts.subsys_bits;
+
/* Don't allow flags or name to change at remount */
if (opts.flags != root->flags ||
(opts.name && strcmp(opts.name, root->name))) {
@@ -1319,8 +1413,10 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
goto out_unlock;
}
- /* (re)populate subsystem files */
- cgroup_populate_dir(cgrp);
+ /* clear out any existing files and repopulate subsystem files */
+ cgroup_clear_directory(cgrp->dentry, false, removed_bits);
+ /* re-populate subsystem files */
+ cgroup_populate_dir(cgrp, false, added_bits);
if (opts.release_agent)
strcpy(root->release_agent_path, opts.release_agent);
@@ -1344,22 +1440,27 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
{
INIT_LIST_HEAD(&cgrp->sibling);
INIT_LIST_HEAD(&cgrp->children);
+ INIT_LIST_HEAD(&cgrp->files);
INIT_LIST_HEAD(&cgrp->css_sets);
INIT_LIST_HEAD(&cgrp->release_list);
INIT_LIST_HEAD(&cgrp->pidlists);
mutex_init(&cgrp->pidlist_mutex);
INIT_LIST_HEAD(&cgrp->event_list);
spin_lock_init(&cgrp->event_list_lock);
+ simple_xattrs_init(&cgrp->xattrs);
}
static void init_cgroup_root(struct cgroupfs_root *root)
{
struct cgroup *cgrp = &root->top_cgroup;
+
INIT_LIST_HEAD(&root->subsys_list);
INIT_LIST_HEAD(&root->root_list);
+ INIT_LIST_HEAD(&root->allcg_list);
root->number_of_cgroups = 1;
cgrp->root = root;
cgrp->top_cgroup = cgrp;
+ list_add_tail(&cgrp->allcg_node, &root->allcg_list);
init_cgroup_housekeeping(cgrp);
}
@@ -1615,7 +1716,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
BUG_ON(root->number_of_cgroups != 1);
cred = override_creds(&init_cred);
- cgroup_populate_dir(root_cgrp);
+ cgroup_populate_dir(root_cgrp, true, root->subsys_bits);
revert_creds(cred);
mutex_unlock(&cgroup_root_mutex);
mutex_unlock(&cgroup_mutex);
@@ -1691,6 +1792,8 @@ static void cgroup_kill_sb(struct super_block *sb) {
mutex_unlock(&cgroup_root_mutex);
mutex_unlock(&cgroup_mutex);
+ simple_xattrs_free(&cgrp->xattrs);
+
kill_litter_super(sb);
cgroup_drop_root(root);
}
@@ -1703,16 +1806,6 @@ static struct file_system_type cgroup_fs_type = {
static struct kobject *cgroup_kobj;
-static inline struct cgroup *__d_cgrp(struct dentry *dentry)
-{
- return dentry->d_fsdata;
-}
-
-static inline struct cftype *__d_cft(struct dentry *dentry)
-{
- return dentry->d_fsdata;
-}
-
/**
* cgroup_path - generate the path of a cgroup
* @cgrp: the cgroup in question
@@ -2216,6 +2309,18 @@ retry_find_task:
if (threadgroup)
tsk = tsk->group_leader;
+
+ /*
+ * Workqueue threads may acquire PF_THREAD_BOUND and become
+ * trapped in a cpuset, or RT worker may be born in a cgroup
+ * with no rt_runtime allocated. Just say no.
+ */
+ if (tsk == kthreadd_task || (tsk->flags & PF_THREAD_BOUND)) {
+ ret = -EINVAL;
+ rcu_read_unlock();
+ goto out_unlock_cgroup;
+ }
+
get_task_struct(tsk);
rcu_read_unlock();
@@ -2528,6 +2633,64 @@ static int cgroup_rename(struct inode *old_dir, struct dentry *old_dentry,
return simple_rename(old_dir, old_dentry, new_dir, new_dentry);
}
+static struct simple_xattrs *__d_xattrs(struct dentry *dentry)
+{
+ if (S_ISDIR(dentry->d_inode->i_mode))
+ return &__d_cgrp(dentry)->xattrs;
+ else
+ return &__d_cft(dentry)->xattrs;
+}
+
+static inline int xattr_enabled(struct dentry *dentry)
+{
+ struct cgroupfs_root *root = dentry->d_sb->s_fs_info;
+ return test_bit(ROOT_XATTR, &root->flags);
+}
+
+static bool is_valid_xattr(const char *name)
+{
+ if (!strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN) ||
+ !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN))
+ return true;
+ return false;
+}
+
+static int cgroup_setxattr(struct dentry *dentry, const char *name,
+ const void *val, size_t size, int flags)
+{
+ if (!xattr_enabled(dentry))
+ return -EOPNOTSUPP;
+ if (!is_valid_xattr(name))
+ return -EINVAL;
+ return simple_xattr_set(__d_xattrs(dentry), name, val, size, flags);
+}
+
+static int cgroup_removexattr(struct dentry *dentry, const char *name)
+{
+ if (!xattr_enabled(dentry))
+ return -EOPNOTSUPP;
+ if (!is_valid_xattr(name))
+ return -EINVAL;
+ return simple_xattr_remove(__d_xattrs(dentry), name);
+}
+
+static ssize_t cgroup_getxattr(struct dentry *dentry, const char *name,
+ void *buf, size_t size)
+{
+ if (!xattr_enabled(dentry))
+ return -EOPNOTSUPP;
+ if (!is_valid_xattr(name))
+ return -EINVAL;
+ return simple_xattr_get(__d_xattrs(dentry), name, buf, size);
+}
+
+static ssize_t cgroup_listxattr(struct dentry *dentry, char *buf, size_t size)
+{
+ if (!xattr_enabled(dentry))
+ return -EOPNOTSUPP;
+ return simple_xattr_list(__d_xattrs(dentry), buf, size);
+}
+
static const struct file_operations cgroup_file_operations = {
.read = cgroup_file_read,
.write = cgroup_file_write,
@@ -2536,11 +2699,22 @@ static const struct file_operations cgroup_file_operations = {
.release = cgroup_file_release,
};
+static const struct inode_operations cgroup_file_inode_operations = {
+ .setxattr = cgroup_setxattr,
+ .getxattr = cgroup_getxattr,
+ .listxattr = cgroup_listxattr,
+ .removexattr = cgroup_removexattr,
+};
+
static const struct inode_operations cgroup_dir_inode_operations = {
.lookup = cgroup_lookup,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.rename = cgroup_rename,
+ .setxattr = cgroup_setxattr,
+ .getxattr = cgroup_getxattr,
+ .listxattr = cgroup_listxattr,
+ .removexattr = cgroup_removexattr,
};
static struct dentry *cgroup_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
@@ -2588,6 +2762,7 @@ static int cgroup_create_file(struct dentry *dentry, umode_t mode,
} else if (S_ISREG(mode)) {
inode->i_size = 0;
inode->i_fop = &cgroup_file_operations;
+ inode->i_op = &cgroup_file_inode_operations;
}
d_instantiate(dentry, inode);
dget(dentry); /* Extra count - pin the dentry in core */
@@ -2645,50 +2820,193 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
}
-int cgroup_add_file(struct cgroup *cgrp,
- struct cgroup_subsys *subsys,
- const struct cftype *cft)
+static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,
+ struct cftype *cft)
{
struct dentry *dir = cgrp->dentry;
+ struct cgroup *parent = __d_cgrp(dir);
struct dentry *dentry;
+ struct cfent *cfe;
int error;
umode_t mode;
-
char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 };
+
+ simple_xattrs_init(&cft->xattrs);
+
+ /* does @cft->flags tell us to skip creation on @cgrp? */
+ if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgrp->parent)
+ return 0;
+ if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgrp->parent)
+ return 0;
+
if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
strcpy(name, subsys->name);
strcat(name, ".");
}
strcat(name, cft->name);
+
BUG_ON(!mutex_is_locked(&dir->d_inode->i_mutex));
+
+ cfe = kzalloc(sizeof(*cfe), GFP_KERNEL);
+ if (!cfe)
+ return -ENOMEM;
+
dentry = lookup_one_len(name, dir, strlen(name));
- if (!IS_ERR(dentry)) {
- mode = cgroup_file_mode(cft);
- error = cgroup_create_file(dentry, mode | S_IFREG,
- cgrp->root->sb);
- if (!error)
- dentry->d_fsdata = (void *)cft;
- dput(dentry);
- } else
+ if (IS_ERR(dentry)) {
error = PTR_ERR(dentry);
+ goto out;
+ }
+
+ mode = cgroup_file_mode(cft);
+ error = cgroup_create_file(dentry, mode | S_IFREG, cgrp->root->sb);
+ if (!error) {
+ cfe->type = (void *)cft;
+ cfe->dentry = dentry;
+ dentry->d_fsdata = cfe;
+ list_add_tail(&cfe->node, &parent->files);
+ cfe = NULL;
+ }
+ dput(dentry);
+out:
+ kfree(cfe);
return error;
}
-EXPORT_SYMBOL_GPL(cgroup_add_file);
-int cgroup_add_files(struct cgroup *cgrp,
- struct cgroup_subsys *subsys,
- const struct cftype cft[],
- int count)
+static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
+ struct cftype cfts[], bool is_add)
{
- int i, err;
- for (i = 0; i < count; i++) {
- err = cgroup_add_file(cgrp, subsys, &cft[i]);
- if (err)
- return err;
+ struct cftype *cft;
+ int err, ret = 0;
+
+ for (cft = cfts; cft->name[0] != '\0'; cft++) {
+ if (is_add)
+ err = cgroup_add_file(cgrp, subsys, cft);
+ else
+ err = cgroup_rm_file(cgrp, cft);
+ if (err) {
+ pr_warning("cgroup_addrm_files: failed to %s %s, err=%d\n",
+ is_add ? "add" : "remove", cft->name, err);
+ ret = err;
+ }
+ }
+ return ret;
+}
+
+static DEFINE_MUTEX(cgroup_cft_mutex);
+
+static void cgroup_cfts_prepare(void)
+ __acquires(&cgroup_cft_mutex) __acquires(&cgroup_mutex)
+{
+ /*
+ * Thanks to the entanglement with vfs inode locking, we can't walk
+ * the existing cgroups under cgroup_mutex and create files.
+ * Instead, we increment reference on all cgroups and build list of
+ * them using @cgrp->cft_q_node. Grab cgroup_cft_mutex to ensure
+ * exclusive access to the field.
+ */
+ mutex_lock(&cgroup_cft_mutex);
+ mutex_lock(&cgroup_mutex);
+}
+
+static void cgroup_cfts_commit(struct cgroup_subsys *ss,
+ struct cftype *cfts, bool is_add)
+ __releases(&cgroup_mutex) __releases(&cgroup_cft_mutex)
+{
+ LIST_HEAD(pending);
+ struct cgroup *cgrp, *n;
+
+ /* %NULL @cfts indicates abort and don't bother if @ss isn't attached */
+ if (cfts && ss->root != &rootnode) {
+ list_for_each_entry(cgrp, &ss->root->allcg_list, allcg_node) {
+ dget(cgrp->dentry);
+ list_add_tail(&cgrp->cft_q_node, &pending);
+ }
+ }
+
+ mutex_unlock(&cgroup_mutex);
+
+ /*
+ * All new cgroups will see @cfts update on @ss->cftsets. Add/rm
+ * files for all cgroups which were created before.
+ */
+ list_for_each_entry_safe(cgrp, n, &pending, cft_q_node) {
+ struct inode *inode = cgrp->dentry->d_inode;
+
+ mutex_lock(&inode->i_mutex);
+ mutex_lock(&cgroup_mutex);
+ if (!cgroup_is_removed(cgrp))
+ cgroup_addrm_files(cgrp, ss, cfts, is_add);
+ mutex_unlock(&cgroup_mutex);
+ mutex_unlock(&inode->i_mutex);
+
+ list_del_init(&cgrp->cft_q_node);
+ dput(cgrp->dentry);
}
+
+ mutex_unlock(&cgroup_cft_mutex);
+}
+
+/**
+ * cgroup_add_cftypes - add an array of cftypes to a subsystem
+ * @ss: target cgroup subsystem
+ * @cfts: zero-length name terminated array of cftypes
+ *
+ * Register @cfts to @ss. Files described by @cfts are created for all
+ * existing cgroups to which @ss is attached and all future cgroups will
+ * have them too. This function can be called anytime whether @ss is
+ * attached or not.
+ *
+ * Returns 0 on successful registration, -errno on failure. Note that this
+ * function currently returns 0 as long as @cfts registration is successful
+ * even if some file creation attempts on existing cgroups fail.
+ */
+int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
+{
+ struct cftype_set *set;
+
+ set = kzalloc(sizeof(*set), GFP_KERNEL);
+ if (!set)
+ return -ENOMEM;
+
+ cgroup_cfts_prepare();
+ set->cfts = cfts;
+ list_add_tail(&set->node, &ss->cftsets);
+ cgroup_cfts_commit(ss, cfts, true);
+
return 0;
}
-EXPORT_SYMBOL_GPL(cgroup_add_files);
+EXPORT_SYMBOL_GPL(cgroup_add_cftypes);
+
+/**
+ * cgroup_rm_cftypes - remove an array of cftypes from a subsystem
+ * @ss: target cgroup subsystem
+ * @cfts: zero-length name terminated array of cftypes
+ *
+ * Unregister @cfts from @ss. Files described by @cfts are removed from
+ * all existing cgroups to which @ss is attached and all future cgroups
+ * won't have them either. This function can be called anytime whether @ss
+ * is attached or not.
+ *
+ * Returns 0 on successful unregistration, -ENOENT if @cfts is not
+ * registered with @ss.
+ */
+int cgroup_rm_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
+{
+ struct cftype_set *set;
+
+ cgroup_cfts_prepare();
+
+ list_for_each_entry(set, &ss->cftsets, node) {
+ if (set->cfts == cfts) {
+ list_del_init(&set->node);
+ cgroup_cfts_commit(ss, cfts, false);
+ return 0;
+ }
+ }
+
+ cgroup_cfts_commit(ss, NULL, false);
+ return -ENOENT;
+}
/**
* cgroup_task_count - count the number of tasks in a cgroup.
@@ -3678,36 +3996,44 @@ static struct cftype files[] = {
.read_u64 = cgroup_clone_children_read,
.write_u64 = cgroup_clone_children_write,
},
+ {
+ .name = "release_agent",
+ .flags = CFTYPE_ONLY_ON_ROOT,
+ .read_seq_string = cgroup_release_agent_show,
+ .write_string = cgroup_release_agent_write,
+ .max_write_len = PATH_MAX,
+ },
+ { } /* terminate */
};
-static struct cftype cft_release_agent = {
- .name = "release_agent",
- .read_seq_string = cgroup_release_agent_show,
- .write_string = cgroup_release_agent_write,
- .max_write_len = PATH_MAX,
-};
-
-static int cgroup_populate_dir(struct cgroup *cgrp)
+/**
+ * cgroup_populate_dir - selectively creation of files in a directory
+ * @cgrp: target cgroup
+ * @base_files: true if the base files should be added
+ * @subsys_mask: mask of the subsystem ids whose files should be added
+ */
+static int cgroup_populate_dir(struct cgroup *cgrp, bool base_files,
+ unsigned long subsys_mask)
{
int err;
struct cgroup_subsys *ss;
- /* First clear out any existing files */
- cgroup_clear_directory(cgrp->dentry);
-
- err = cgroup_add_files(cgrp, NULL, files, ARRAY_SIZE(files));
- if (err < 0)
- return err;
-
- if (cgrp == cgrp->top_cgroup) {
- if ((err = cgroup_add_file(cgrp, NULL, &cft_release_agent)) < 0)
+ if (base_files) {
+ err = cgroup_addrm_files(cgrp, NULL, files, true);
+ if (err < 0)
return err;
}
+ /* process cftsets of each subsystem */
for_each_subsys(cgrp->root, ss) {
- if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
- return err;
+ struct cftype_set *set;
+ if (!test_bit(ss->subsys_id, &subsys_mask))
+ continue;
+
+ list_for_each_entry(set, &ss->cftsets, node)
+ cgroup_addrm_files(cgrp, ss, set->cfts, true);
}
+
/* This cgroup is ready now */
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -3723,6 +4049,18 @@ static int cgroup_populate_dir(struct cgroup *cgrp)
return 0;
}
+static void css_dput_fn(struct work_struct *work)
+{
+ struct cgroup_subsys_state *css =
+ container_of(work, struct cgroup_subsys_state, dput_work);
+ struct dentry *dentry = css->cgroup->dentry;
+ struct super_block *sb = dentry->d_sb;
+
+ atomic_inc(&sb->s_active);
+ dput(dentry);
+ deactivate_super(sb);
+}
+
static void init_cgroup_css(struct cgroup_subsys_state *css,
struct cgroup_subsys *ss,
struct cgroup *cgrp)
@@ -3735,37 +4073,16 @@ static void init_cgroup_css(struct cgroup_subsys_state *css,
set_bit(CSS_ROOT, &css->flags);
BUG_ON(cgrp->subsys[ss->subsys_id]);
cgrp->subsys[ss->subsys_id] = css;
-}
-
-static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
-{
- /* We need to take each hierarchy_mutex in a consistent order */
- int i;
/*
- * No worry about a race with rebind_subsystems that might mess up the
- * locking order, since both parties are under cgroup_mutex.
+ * If !clear_css_refs, css holds an extra ref to @cgrp->dentry
+ * which is put on the last css_put(). dput() requires process
+ * context, which css_put() may be called without. @css->dput_work
+ * will be used to invoke dput() asynchronously from css_put().
*/
- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
- struct cgroup_subsys *ss = subsys[i];
- if (ss == NULL)
- continue;
- if (ss->root == root)
- mutex_lock(&ss->hierarchy_mutex);
- }
-}
-
-static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
-{
- int i;
-
- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
- struct cgroup_subsys *ss = subsys[i];
- if (ss == NULL)
- continue;
- if (ss->root == root)
- mutex_unlock(&ss->hierarchy_mutex);
- }
+ INIT_WORK(&css->dput_work, css_dput_fn);
+ if (ss->__DEPRECATED_clear_css_refs)
+ set_bit(CSS_CLEAR_CSS_REFS, &css->flags);
}
/*
@@ -3828,21 +4145,24 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
ss->post_clone(cgrp);
}
- cgroup_lock_hierarchy(root);
list_add(&cgrp->sibling, &cgrp->parent->children);
- cgroup_unlock_hierarchy(root);
root->number_of_cgroups++;
err = cgroup_create_dir(cgrp, dentry, mode);
if (err < 0)
goto err_remove;
- set_bit(CGRP_RELEASABLE, &parent->flags);
+ /* If !clear_css_refs, each css holds a ref to the cgroup's dentry */
+ for_each_subsys(root, ss)
+ if (!ss->__DEPRECATED_clear_css_refs)
+ dget(dentry);
/* The cgroup directory was pre-locked for us */
BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
- err = cgroup_populate_dir(cgrp);
+ list_add_tail(&cgrp->allcg_node, &root->allcg_list);
+
+ err = cgroup_populate_dir(cgrp, true, root->subsys_bits);
/* If err < 0, we have a half-filled directory - oh well ;) */
mutex_unlock(&cgroup_mutex);
@@ -3852,9 +4172,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
err_remove:
- cgroup_lock_hierarchy(root);
list_del(&cgrp->sibling);
- cgroup_unlock_hierarchy(root);
root->number_of_cgroups--;
err_destroy:
@@ -3881,18 +4199,19 @@ static int cgroup_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
return cgroup_create(c_parent, dentry, mode | S_IFDIR);
}
+/*
+ * Check the reference count on each subsystem. Since we already
+ * established that there are no tasks in the cgroup, if the css refcount
+ * is also 1, then there should be no outstanding references, so the
+ * subsystem is safe to destroy. We scan across all subsystems rather than
+ * using the per-hierarchy linked list of mounted subsystems since we can
+ * be called via check_for_release() with no synchronization other than
+ * RCU, and the subsystem linked list isn't RCU-safe.
+ */
static int cgroup_has_css_refs(struct cgroup *cgrp)
{
- /* Check the reference count on each subsystem. Since we
- * already established that there are no tasks in the
- * cgroup, if the css refcount is also 1, then there should
- * be no outstanding references, so the subsystem is safe to
- * destroy. We scan across all subsystems rather than using
- * the per-hierarchy linked list of mounted subsystems since
- * we can be called via check_for_release() with no
- * synchronization other than RCU, and the subsystem linked
- * list isn't RCU-safe */
int i;
+
/*
* We won't need to lock the subsys array, because the subsystems
* we're concerned about aren't going anywhere since our cgroup root
@@ -3901,17 +4220,21 @@ static int cgroup_has_css_refs(struct cgroup *cgrp)
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
struct cgroup_subsys_state *css;
+
/* Skip subsystems not present or not in this hierarchy */
if (ss == NULL || ss->root != cgrp->root)
continue;
+
css = cgrp->subsys[ss->subsys_id];
- /* When called from check_for_release() it's possible
+ /*
+ * When called from check_for_release() it's possible
* that by this point the cgroup has been removed
* and the css deleted. But a false-positive doesn't
* matter, since it can only happen if the cgroup
* has been deleted and hence no longer needs the
- * release agent to be called anyway. */
- if (css && (atomic_read(&css->refcnt) > 1))
+ * release agent to be called anyway.
+ */
+ if (css && css_refcnt(css) > 1)
return 1;
}
return 0;
@@ -3921,51 +4244,63 @@ static int cgroup_has_css_refs(struct cgroup *cgrp)
* Atomically mark all (or else none) of the cgroup's CSS objects as
* CSS_REMOVED. Return true on success, or false if the cgroup has
* busy subsystems. Call with cgroup_mutex held
+ *
+ * Depending on whether a subsys has __DEPRECATED_clear_css_refs set or
+ * not, cgroup removal behaves differently.
+ *
+ * If clear is set, css refcnt for the subsystem should be zero before
+ * cgroup removal can be committed. This is implemented by
+ * CGRP_WAIT_ON_RMDIR and retry logic around ->pre_destroy(), which may be
+ * called multiple times until all css refcnts reach zero and is allowed to
+ * veto removal on any invocation. This behavior is deprecated and will be
+ * removed as soon as the existing user (memcg) is updated.
+ *
+ * If clear is not set, each css holds an extra reference to the cgroup's
+ * dentry and cgroup removal proceeds regardless of css refs.
+ * ->pre_destroy() will be called at least once and is not allowed to fail.
+ * On the last put of each css, whenever that may be, the extra dentry ref
+ * is put so that dentry destruction happens only after all css's are
+ * released.
*/
-
static int cgroup_clear_css_refs(struct cgroup *cgrp)
{
struct cgroup_subsys *ss;
unsigned long flags;
bool failed = false;
+
local_irq_save(flags);
+
+ /*
+ * Block new css_tryget() by deactivating refcnt. If all refcnts
+ * for subsystems w/ clear_css_refs set were 1 at the moment of
+ * deactivation, we succeeded.
+ */
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
- int refcnt;
- while (1) {
- /* We can only remove a CSS with a refcnt==1 */
- refcnt = atomic_read(&css->refcnt);
- if (refcnt > 1) {
- failed = true;
- goto done;
- }
- BUG_ON(!refcnt);
- /*
- * Drop the refcnt to 0 while we check other
- * subsystems. This will cause any racing
- * css_tryget() to spin until we set the
- * CSS_REMOVED bits or abort
- */
- if (atomic_cmpxchg(&css->refcnt, refcnt, 0) == refcnt)
- break;
- cpu_relax();
- }
+
+ WARN_ON(atomic_read(&css->refcnt) < 0);
+ atomic_add(CSS_DEACT_BIAS, &css->refcnt);
+
+ if (ss->__DEPRECATED_clear_css_refs)
+ failed |= css_refcnt(css) != 1;
}
- done:
+
+ /*
+ * If succeeded, set REMOVED and put all the base refs; otherwise,
+ * restore refcnts to positive values. Either way, all in-progress
+ * css_tryget() will be released.
+ */
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
- if (failed) {
- /*
- * Restore old refcnt if we previously managed
- * to clear it from 1 to 0
- */
- if (!atomic_read(&css->refcnt))
- atomic_set(&css->refcnt, 1);
- } else {
- /* Commit the fact that the CSS is removed */
+
+ if (!failed) {
set_bit(CSS_REMOVED, &css->flags);
+ css_put(css);
+ } else {
+ atomic_sub(CSS_DEACT_BIAS, &css->refcnt);
}
}
+
local_irq_restore(flags);
return !failed;
}
@@ -4069,10 +4404,10 @@ again:
list_del_init(&cgrp->release_list);
raw_spin_unlock(&release_list_lock);
- cgroup_lock_hierarchy(cgrp->root);
/* delete this cgroup from parent->children */
list_del_init(&cgrp->sibling);
- cgroup_unlock_hierarchy(cgrp->root);
+
+ list_del_init(&cgrp->allcg_node);
d = dget(cgrp->dentry);
@@ -4099,12 +4434,29 @@ again:
return 0;
}
+static void __init_or_module cgroup_init_cftsets(struct cgroup_subsys *ss)
+{
+ INIT_LIST_HEAD(&ss->cftsets);
+
+ /*
+ * base_cftset is embedded in subsys itself, no need to worry about
+ * deregistration.
+ */
+ if (ss->base_cftypes) {
+ ss->base_cftset.cfts = ss->base_cftypes;
+ list_add_tail(&ss->base_cftset.node, &ss->cftsets);
+ }
+}
+
static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
{
struct cgroup_subsys_state *css;
printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);
+ /* init base cftset */
+ cgroup_init_cftsets(ss);
+
/* Create the top cgroup state for this subsystem */
list_add(&ss->sibling, &rootnode.subsys_list);
ss->root = &rootnode;
@@ -4126,8 +4478,6 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
* need to invoke fork callbacks here. */
BUG_ON(!list_empty(&init_task.tasks));
- mutex_init(&ss->hierarchy_mutex);
- lockdep_set_class(&ss->hierarchy_mutex, &ss->subsys_key);
ss->active = 1;
/* this function shouldn't be used with modular subsystems, since they
@@ -4174,6 +4524,9 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss)
return 0;
}
+ /* init base cftset */
+ cgroup_init_cftsets(ss);
+
/*
* need to register a subsys id before anything else - for example,
* init_cgroup_css needs it.
@@ -4251,8 +4604,6 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss)
}
write_unlock(&css_set_lock);
- mutex_init(&ss->hierarchy_mutex);
- lockdep_set_class(&ss->hierarchy_mutex, &ss->subsys_key);
ss->active = 1;
/* success! */
@@ -4735,26 +5086,43 @@ static void check_for_release(struct cgroup *cgrp)
}
/* Caller must verify that the css is not for root cgroup */
-void __css_get(struct cgroup_subsys_state *css, int count)
+bool __css_tryget(struct cgroup_subsys_state *css)
{
- atomic_add(count, &css->refcnt);
- set_bit(CGRP_RELEASABLE, &css->cgroup->flags);
+ do {
+ int v = css_refcnt(css);
+
+ if (atomic_cmpxchg(&css->refcnt, v, v + 1) == v)
+ return true;
+ cpu_relax();
+ } while (!test_bit(CSS_REMOVED, &css->flags));
+
+ return false;
}
-EXPORT_SYMBOL_GPL(__css_get);
+EXPORT_SYMBOL_GPL(__css_tryget);
/* Caller must verify that the css is not for root cgroup */
-void __css_put(struct cgroup_subsys_state *css, int count)
+void __css_put(struct cgroup_subsys_state *css)
{
struct cgroup *cgrp = css->cgroup;
- int val;
+ int v;
+
rcu_read_lock();
- val = atomic_sub_return(count, &css->refcnt);
- if (val == 1) {
- check_for_release(cgrp);
+ v = css_unbias_refcnt(atomic_dec_return(&css->refcnt));
+
+ switch (v) {
+ case 1:
+ if (notify_on_release(cgrp)) {
+ set_bit(CGRP_RELEASABLE, &cgrp->flags);
+ check_for_release(cgrp);
+ }
cgroup_wakeup_rmdir_waiter(cgrp);
+ break;
+ case 0:
+ if (!test_bit(CSS_CLEAR_CSS_REFS, &css->flags))
+ schedule_work(&css->dput_work);
+ break;
}
rcu_read_unlock();
- WARN_ON_ONCE(val < 1);
}
EXPORT_SYMBOL_GPL(__css_put);
@@ -4873,7 +5241,7 @@ unsigned short css_id(struct cgroup_subsys_state *css)
* on this or this is under rcu_read_lock(). Once css->id is allocated,
* it's unchanged until freed.
*/
- cssid = rcu_dereference_check(css->id, atomic_read(&css->refcnt));
+ cssid = rcu_dereference_check(css->id, css_refcnt(css));
if (cssid)
return cssid->id;
@@ -4885,7 +5253,7 @@ unsigned short css_depth(struct cgroup_subsys_state *css)
{
struct css_id *cssid;
- cssid = rcu_dereference_check(css->id, atomic_read(&css->refcnt));
+ cssid = rcu_dereference_check(css->id, css_refcnt(css));
if (cssid)
return cssid->depth;
@@ -4899,7 +5267,7 @@ EXPORT_SYMBOL_GPL(css_depth);
* @root: the css supporsed to be an ancestor of the child.
*
* Returns true if "root" is an ancestor of "child" in its hierarchy. Because
- * this function reads css->id, this use rcu_dereference() and rcu_read_lock().
+ * this function reads css->id, the caller must hold rcu_read_lock().
* But, considering usual usage, the csses should be valid objects after test.
* Assuming that the caller will do some action to the child if this returns
* returns true, the caller must take "child";s reference count.
@@ -4911,18 +5279,18 @@ bool css_is_ancestor(struct cgroup_subsys_state *child,
{
struct css_id *child_id;
struct css_id *root_id;
- bool ret = true;
- rcu_read_lock();
child_id = rcu_dereference(child->id);
+ if (!child_id)
+ return false;
root_id = rcu_dereference(root->id);
- if (!child_id
- || !root_id
- || (child_id->depth < root_id->depth)
- || (child_id->stack[root_id->depth] != root_id->id))
- ret = false;
- rcu_read_unlock();
- return ret;
+ if (!root_id)
+ return false;
+ if (child_id->depth < root_id->depth)
+ return false;
+ if (child_id->stack[root_id->depth] != root_id->id)
+ return false;
+ return true;
}
void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
@@ -5266,19 +5634,15 @@ static struct cftype debug_files[] = {
.name = "releasable",
.read_u64 = releasable_read,
},
-};
-static int debug_populate(struct cgroup_subsys *ss, struct cgroup *cont)
-{
- return cgroup_add_files(cont, ss, debug_files,
- ARRAY_SIZE(debug_files));
-}
+ { } /* terminate */
+};
struct cgroup_subsys debug_subsys = {
.name = "debug",
.create = debug_create,
.destroy = debug_destroy,
- .populate = debug_populate,
.subsys_id = debug_subsys_id,
+ .base_cftypes = debug_files,
};
#endif /* CONFIG_CGROUP_DEBUG */
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index a902e2a..f990569 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -353,24 +353,19 @@ static int freezer_write(struct cgroup *cgroup,
static struct cftype files[] = {
{
.name = "state",
+ .flags = CFTYPE_NOT_ON_ROOT,
.read_seq_string = freezer_read,
.write_string = freezer_write,
},
+ { } /* terminate */
};
-static int freezer_populate(struct cgroup_subsys *ss, struct cgroup *cgroup)
-{
- if (!cgroup->parent)
- return 0;
- return cgroup_add_files(cgroup, ss, files, ARRAY_SIZE(files));
-}
-
struct cgroup_subsys freezer_subsys = {
.name = "freezer",
.create = freezer_create,
.destroy = freezer_destroy,
- .populate = freezer_populate,
.subsys_id = freezer_subsys_id,
.can_attach = freezer_can_attach,
.fork = freezer_fork,
+ .base_cftypes = files,
};
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4b843ac..f1dec4b 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1769,28 +1769,17 @@ static struct cftype files[] = {
.write_u64 = cpuset_write_u64,
.private = FILE_SPREAD_SLAB,
},
-};
-
-static struct cftype cft_memory_pressure_enabled = {
- .name = "memory_pressure_enabled",
- .read_u64 = cpuset_read_u64,
- .write_u64 = cpuset_write_u64,
- .private = FILE_MEMORY_PRESSURE_ENABLED,
-};
-static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
-{
- int err;
+ {
+ .name = "memory_pressure_enabled",
+ .flags = CFTYPE_ONLY_ON_ROOT,
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_MEMORY_PRESSURE_ENABLED,
+ },
- err = cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
- if (err)
- return err;
- /* memory_pressure_enabled is in root cpuset only */
- if (!cont->parent)
- err = cgroup_add_file(cont, ss,
- &cft_memory_pressure_enabled);
- return err;
-}
+ { } /* terminate */
+};
/*
* post_clone() is called during cgroup_create() when the
@@ -1891,9 +1880,9 @@ struct cgroup_subsys cpuset_subsys = {
.destroy = cpuset_destroy,
.can_attach = cpuset_can_attach,
.attach = cpuset_attach,
- .populate = cpuset_populate,
.post_clone = cpuset_post_clone,
.subsys_id = cpuset_subsys_id,
+ .base_cftypes = files,
.early_init = 1,
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2ba28e..dac1d43 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8249,13 +8249,9 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+ { } /* terminate */
};
-static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
-{
- return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
-}
-
struct cgroup_subsys cpu_cgroup_subsys = {
.name = "cpu",
.create = cpu_cgroup_create,
@@ -8264,8 +8260,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
.attach = cpu_cgroup_attach,
.allow_attach = cpu_cgroup_allow_attach,
.exit = cpu_cgroup_exit,
- .populate = cpu_cgroup_populate,
.subsys_id = cpu_cgroup_subsys_id,
+ .base_cftypes = cpu_files,
.early_init = 1,
};
@@ -8450,13 +8446,9 @@ static struct cftype files[] = {
.name = "stat",
.read_map = cpuacct_stats_show,
},
+ { } /* terminate */
};
-static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
-}
-
/*
* charge this task's execution time to its accounting group.
*
@@ -8488,7 +8480,7 @@ struct cgroup_subsys cpuacct_subsys = {
.name = "cpuacct",
.create = cpuacct_create,
.destroy = cpuacct_destroy,
- .populate = cpuacct_populate,
.subsys_id = cpuacct_subsys_id,
+ .base_cftypes = files,
};
#endif /* CONFIG_CGROUP_CPUACCT */
diff --git a/kernel/sys.c b/kernel/sys.c
index b3d4f92..7163847 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -96,10 +96,8 @@
int overflowuid = DEFAULT_OVERFLOWUID;
int overflowgid = DEFAULT_OVERFLOWGID;
-#ifdef CONFIG_UID16
EXPORT_SYMBOL(overflowuid);
EXPORT_SYMBOL(overflowgid);
-#endif
/*
* the same as above, but for filesystems which can only store a 16-bit
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9db1557..08d6852 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1175,12 +1175,16 @@ struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
struct mem_cgroup *memcg)
{
- if (root_memcg != memcg) {
- return (root_memcg->use_hierarchy &&
- css_is_ancestor(&memcg->css, &root_memcg->css));
- }
+ bool ret;
- return true;
+ if (root_memcg == memcg)
+ return true;
+ if (!root_memcg->use_hierarchy)
+ return false;
+ rcu_read_lock();
+ ret = css_is_ancestor(&memcg->css, &root_memcg->css);
+ rcu_read_unlock();
+ return ret;
}
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg)
@@ -3905,14 +3909,21 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
return val << PAGE_SHIFT;
}
-static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
+static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ char str[64];
u64 val;
- int type, name;
+ int type, name, len;
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
+
+ if (!do_swap_account && type == _MEMSWAP)
+ return -EOPNOTSUPP;
+
switch (type) {
case _MEM:
if (name == RES_USAGE)
@@ -3929,7 +3940,9 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
default:
BUG();
}
- return val;
+
+ len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
+ return simple_read_from_buffer(buf, nbytes, ppos, str, len);
}
/*
* The user of this function is...
@@ -3945,6 +3958,10 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
+
+ if (!do_swap_account && type == _MEMSWAP)
+ return -EOPNOTSUPP;
+
switch (name) {
case RES_LIMIT:
if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
@@ -4010,12 +4027,15 @@ out:
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
int type, name;
- memcg = mem_cgroup_from_cont(cont);
type = MEMFILE_TYPE(event);
name = MEMFILE_ATTR(event);
+
+ if (!do_swap_account && type == _MEMSWAP)
+ return -EOPNOTSUPP;
+
switch (name) {
case RES_MAX_USAGE:
if (type == _MEM)
@@ -4662,29 +4682,22 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
#endif /* CONFIG_NUMA */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
-static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
- /*
- * Part of this would be better living in a separate allocation
- * function, leaving us with just the cgroup tree population work.
- * We, however, depend on state such as network's proto_list that
- * is only initialized after cgroup creation. I found the less
- * cumbersome way to deal with it to defer it all to populate time
- */
- return mem_cgroup_sockets_init(cont, ss);
+ return mem_cgroup_sockets_init(memcg, ss);
};
-static void kmem_cgroup_destroy(struct cgroup *cont)
+static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
{
- mem_cgroup_sockets_destroy(cont);
+ mem_cgroup_sockets_destroy(memcg);
}
#else
-static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
return 0;
}
-static void kmem_cgroup_destroy(struct cgroup *cont)
+static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
{
}
#endif
@@ -4693,7 +4706,7 @@ static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
.register_event = mem_cgroup_usage_register_event,
.unregister_event = mem_cgroup_usage_unregister_event,
},
@@ -4701,25 +4714,25 @@ static struct cftype mem_cgroup_files[] = {
.name = "max_usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
.trigger = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
},
{
.name = "limit_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
.write_string = mem_cgroup_write,
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
},
{
.name = "soft_limit_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
.write_string = mem_cgroup_write,
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
},
{
.name = "failcnt",
.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
},
{
.name = "stat",
@@ -4764,14 +4777,11 @@ static struct cftype mem_cgroup_files[] = {
.mode = S_IRUGO,
},
#endif
-};
-
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-static struct cftype memsw_cgroup_files[] = {
{
.name = "memsw.usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
.register_event = mem_cgroup_usage_register_event,
.unregister_event = mem_cgroup_usage_unregister_event,
},
@@ -4779,35 +4789,23 @@ static struct cftype memsw_cgroup_files[] = {
.name = "memsw.max_usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
.trigger = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
},
{
.name = "memsw.limit_in_bytes",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
.write_string = mem_cgroup_write,
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
},
{
.name = "memsw.failcnt",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
.trigger = mem_cgroup_reset,
- .read_u64 = mem_cgroup_read,
+ .read = mem_cgroup_read,
},
-};
-
-static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
-{
- if (!do_swap_account)
- return 0;
- return cgroup_add_files(cont, ss, memsw_cgroup_files,
- ARRAY_SIZE(memsw_cgroup_files));
-};
-#else
-static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
-{
- return 0;
-}
#endif
+ { }, /* terminate */
+};
static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
{
@@ -5059,7 +5057,17 @@ mem_cgroup_create(struct cgroup *cont)
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
- vmpressure_init(&memcg->vmpressure);
+
+ error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
+ if (error) {
+ /*
+ * We call put now because our (and parent's) refcnts
+ * are already in place. mem_cgroup_put() will internally
+ * call __mem_cgroup_free, so return directly
+ */
+ mem_cgroup_put(memcg);
+ return ERR_PTR(error);
+ }
return &memcg->css;
free_out:
__mem_cgroup_free(memcg);
@@ -5077,28 +5085,11 @@ static void mem_cgroup_destroy(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- kmem_cgroup_destroy(cont);
+ kmem_cgroup_destroy(memcg);
mem_cgroup_put(memcg);
}
-static int mem_cgroup_populate(struct cgroup_subsys *ss,
- struct cgroup *cont)
-{
- int ret;
-
- ret = cgroup_add_files(cont, ss, mem_cgroup_files,
- ARRAY_SIZE(mem_cgroup_files));
-
- if (!ret)
- ret = register_memsw_files(cont, ss);
-
- if (!ret)
- ret = register_kmem_files(cont, ss);
-
- return ret;
-}
-
#ifdef CONFIG_MMU
/* Handlers for move charge at task migration. */
#define PRECHARGE_COUNT_AT_ONCE 256
@@ -5682,12 +5673,13 @@ struct cgroup_subsys mem_cgroup_subsys = {
.create = mem_cgroup_create,
.pre_destroy = mem_cgroup_pre_destroy,
.destroy = mem_cgroup_destroy,
- .populate = mem_cgroup_populate,
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
+ .base_cftypes = mem_cgroup_files,
.early_init = 0,
.use_id = 1,
+ .__DEPRECATED_clear_css_refs = true,
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
diff --git a/mm/shmem.c b/mm/shmem.c
index 4489566..0fa474e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -76,11 +76,16 @@ static struct vfsmount *shm_mnt;
/* Symlink up to this size is kmalloc'ed instead of using a swappable page */
#define SHORT_SYMLINK_LEN 128
-struct shmem_xattr {
- struct list_head list; /* anchored by shmem_inode_info->xattr_list */
- char *name; /* xattr name */
- size_t size;
- char value[0];
+/*
+ * shmem_fallocate and shmem_writepage communicate via inode->i_private
+ * (with i_mutex making sure that it has only one user at a time):
+ * we would prefer not to enlarge the shmem inode just for that.
+ */
+struct shmem_falloc {
+ pgoff_t start; /* start of range currently being fallocated */
+ pgoff_t next; /* the next page offset to be fallocated */
+ pgoff_t nr_falloced; /* how many new pages have been fallocated */
+ pgoff_t nr_unswapped; /* how often writepage refused to swap out */
};
/* Flag allocation requirements to shmem_getpage */
@@ -577,7 +582,6 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
static void shmem_evict_inode(struct inode *inode)
{
struct shmem_inode_info *info = SHMEM_I(inode);
- struct shmem_xattr *xattr, *nxattr;
if (inode->i_mapping->a_ops == &shmem_aops) {
shmem_unacct_size(info->flags, inode->i_size);
@@ -591,11 +595,8 @@ static void shmem_evict_inode(struct inode *inode)
} else
kfree(info->symlink);
- list_for_each_entry_safe(xattr, nxattr, &info->xattr_list, list) {
- kfree(xattr->name);
- kfree(xattr);
- }
- WARN_ON(inode->i_blocks);
+ simple_xattrs_free(&info->xattrs);
+ BUG_ON(inode->i_blocks);
shmem_free_inode(inode->i_sb);
end_writeback(inode);
}
@@ -1145,7 +1146,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
spin_lock_init(&info->lock);
info->flags = flags & VM_NORESERVE;
INIT_LIST_HEAD(&info->swaplist);
- INIT_LIST_HEAD(&info->xattr_list);
+ simple_xattrs_init(&info->xattrs);
cache_no_acl(inode);
switch (mode & S_IFMT) {
@@ -1718,28 +1719,6 @@ static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *co
*/
/*
- * Allocate new xattr and copy in the value; but leave the name to callers.
- */
-static struct shmem_xattr *shmem_xattr_alloc(const void *value, size_t size)
-{
- struct shmem_xattr *new_xattr;
- size_t len;
-
- /* wrap around? */
- len = sizeof(*new_xattr) + size;
- if (len <= sizeof(*new_xattr))
- return NULL;
-
- new_xattr = kmalloc(len, GFP_KERNEL);
- if (!new_xattr)
- return NULL;
-
- new_xattr->size = size;
- memcpy(new_xattr->value, value, size);
- return new_xattr;
-}
-
-/*
* Callback for security_inode_init_security() for acquiring xattrs.
*/
static int shmem_initxattrs(struct inode *inode,
@@ -1748,11 +1727,11 @@ static int shmem_initxattrs(struct inode *inode,
{
struct shmem_inode_info *info = SHMEM_I(inode);
const struct xattr *xattr;
- struct shmem_xattr *new_xattr;
+ struct simple_xattr *new_xattr;
size_t len;
for (xattr = xattr_array; xattr->name != NULL; xattr++) {
- new_xattr = shmem_xattr_alloc(xattr->value, xattr->value_len);
+ new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len);
if (!new_xattr)
return -ENOMEM;
@@ -1769,91 +1748,12 @@ static int shmem_initxattrs(struct inode *inode,
memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
xattr->name, len);
- spin_lock(&info->lock);
- list_add(&new_xattr->list, &info->xattr_list);
- spin_unlock(&info->lock);
+ simple_xattr_list_add(&info->xattrs, new_xattr);
}
return 0;
}
-static int shmem_xattr_get(struct dentry *dentry, const char *name,
- void *buffer, size_t size)
-{
- struct shmem_inode_info *info;
- struct shmem_xattr *xattr;
- int ret = -ENODATA;
-
- info = SHMEM_I(dentry->d_inode);
-
- spin_lock(&info->lock);
- list_for_each_entry(xattr, &info->xattr_list, list) {
- if (strcmp(name, xattr->name))
- continue;
-
- ret = xattr->size;
- if (buffer) {
- if (size < xattr->size)
- ret = -ERANGE;
- else
- memcpy(buffer, xattr->value, xattr->size);
- }
- break;
- }
- spin_unlock(&info->lock);
- return ret;
-}
-
-static int shmem_xattr_set(struct inode *inode, const char *name,
- const void *value, size_t size, int flags)
-{
- struct shmem_inode_info *info = SHMEM_I(inode);
- struct shmem_xattr *xattr;
- struct shmem_xattr *new_xattr = NULL;
- int err = 0;
-
- /* value == NULL means remove */
- if (value) {
- new_xattr = shmem_xattr_alloc(value, size);
- if (!new_xattr)
- return -ENOMEM;
-
- new_xattr->name = kstrdup(name, GFP_KERNEL);
- if (!new_xattr->name) {
- kfree(new_xattr);
- return -ENOMEM;
- }
- }
-
- spin_lock(&info->lock);
- list_for_each_entry(xattr, &info->xattr_list, list) {
- if (!strcmp(name, xattr->name)) {
- if (flags & XATTR_CREATE) {
- xattr = new_xattr;
- err = -EEXIST;
- } else if (new_xattr) {
- list_replace(&xattr->list, &new_xattr->list);
- } else {
- list_del(&xattr->list);
- }
- goto out;
- }
- }
- if (flags & XATTR_REPLACE) {
- xattr = new_xattr;
- err = -ENODATA;
- } else {
- list_add(&new_xattr->list, &info->xattr_list);
- xattr = NULL;
- }
-out:
- spin_unlock(&info->lock);
- if (xattr)
- kfree(xattr->name);
- kfree(xattr);
- return err;
-}
-
static const struct xattr_handler *shmem_xattr_handlers[] = {
#ifdef CONFIG_TMPFS_POSIX_ACL
&generic_acl_access_handler,
@@ -1884,6 +1784,7 @@ static int shmem_xattr_validate(const char *name)
static ssize_t shmem_getxattr(struct dentry *dentry, const char *name,
void *buffer, size_t size)
{
+ struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
int err;
/*
@@ -1898,12 +1799,13 @@ static ssize_t shmem_getxattr(struct dentry *dentry, const char *name,
if (err)
return err;
- return shmem_xattr_get(dentry, name, buffer, size);
+ return simple_xattr_get(&info->xattrs, name, buffer, size);
}
static int shmem_setxattr(struct dentry *dentry, const char *name,
const void *value, size_t size, int flags)
{
+ struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
int err;
/*
@@ -1918,15 +1820,12 @@ static int shmem_setxattr(struct dentry *dentry, const char *name,
if (err)
return err;
- if (size == 0)
- value = ""; /* empty EA, do not remove */
-
- return shmem_xattr_set(dentry->d_inode, name, value, size, flags);
-
+ return simple_xattr_set(&info->xattrs, name, value, size, flags);
}
static int shmem_removexattr(struct dentry *dentry, const char *name)
{
+ struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
int err;
/*
@@ -1941,45 +1840,13 @@ static int shmem_removexattr(struct dentry *dentry, const char *name)
if (err)
return err;
- return shmem_xattr_set(dentry->d_inode, name, NULL, 0, XATTR_REPLACE);
-}
-
-static bool xattr_is_trusted(const char *name)
-{
- return !strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN);
+ return simple_xattr_remove(&info->xattrs, name);
}
static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size)
{
- bool trusted = capable(CAP_SYS_ADMIN);
- struct shmem_xattr *xattr;
- struct shmem_inode_info *info;
- size_t used = 0;
-
- info = SHMEM_I(dentry->d_inode);
-
- spin_lock(&info->lock);
- list_for_each_entry(xattr, &info->xattr_list, list) {
- size_t len;
-
- /* skip "trusted." attributes for unprivileged callers */
- if (!trusted && xattr_is_trusted(xattr->name))
- continue;
-
- len = strlen(xattr->name) + 1;
- used += len;
- if (buffer) {
- if (size < used) {
- used = -ERANGE;
- break;
- }
- memcpy(buffer, xattr->name, len);
- buffer += len;
- }
- }
- spin_unlock(&info->lock);
-
- return used;
+ struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
+ return simple_xattr_list(&info->xattrs, buffer, size);
}
#endif /* CONFIG_TMPFS_XATTR */
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index ba6900f..b2a5986 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -23,21 +23,6 @@
#include <net/sock.h>
#include <net/netprio_cgroup.h>
-static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp);
-static void cgrp_destroy(struct cgroup *cgrp);
-static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
-
-struct cgroup_subsys net_prio_subsys = {
- .name = "net_prio",
- .create = cgrp_create,
- .destroy = cgrp_destroy,
- .populate = cgrp_populate,
-#ifdef CONFIG_NETPRIO_CGROUP
- .subsys_id = net_prio_subsys_id,
-#endif
- .module = THIS_MODULE
-};
-
#define PRIOIDX_SZ 128
static unsigned long prioidx_map[PRIOIDX_SZ];
@@ -257,12 +242,19 @@ static struct cftype ss_files[] = {
.read_map = read_priomap,
.write_string = write_priomap,
},
+ { } /* terminate */
};
-static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
-}
+struct cgroup_subsys net_prio_subsys = {
+ .name = "net_prio",
+ .create = cgrp_create,
+ .destroy = cgrp_destroy,
+#ifdef CONFIG_NETPRIO_CGROUP
+ .subsys_id = net_prio_subsys_id,
+#endif
+ .base_cftypes = ss_files,
+ .module = THIS_MODULE
+};
static int netprio_device_event(struct notifier_block *unused,
unsigned long event, void *ptr)
diff --git a/net/core/sock.c b/net/core/sock.c
index 832cf04..f409f8d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -140,7 +140,7 @@ static DEFINE_MUTEX(proto_list_mutex);
static LIST_HEAD(proto_list);
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
-int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss)
+int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
struct proto *proto;
int ret = 0;
@@ -148,7 +148,7 @@ int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss)
mutex_lock(&proto_list_mutex);
list_for_each_entry(proto, &proto_list, node) {
if (proto->init_cgroup) {
- ret = proto->init_cgroup(cgrp, ss);
+ ret = proto->init_cgroup(memcg, ss);
if (ret)
goto out;
}
@@ -159,19 +159,19 @@ int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss)
out:
list_for_each_entry_continue_reverse(proto, &proto_list, node)
if (proto->destroy_cgroup)
- proto->destroy_cgroup(cgrp);
+ proto->destroy_cgroup(memcg);
mutex_unlock(&proto_list_mutex);
return ret;
}
-void mem_cgroup_sockets_destroy(struct cgroup *cgrp)
+void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
{
struct proto *proto;
mutex_lock(&proto_list_mutex);
list_for_each_entry_reverse(proto, &proto_list, node)
if (proto->destroy_cgroup)
- proto->destroy_cgroup(cgrp);
+ proto->destroy_cgroup(memcg);
mutex_unlock(&proto_list_mutex);
}
#endif
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index e795272..1517037 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -6,37 +6,6 @@
#include <linux/memcontrol.h>
#include <linux/module.h>
-static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft);
-static int tcp_cgroup_write(struct cgroup *cont, struct cftype *cft,
- const char *buffer);
-static int tcp_cgroup_reset(struct cgroup *cont, unsigned int event);
-
-static struct cftype tcp_files[] = {
- {
- .name = "kmem.tcp.limit_in_bytes",
- .write_string = tcp_cgroup_write,
- .read_u64 = tcp_cgroup_read,
- .private = RES_LIMIT,
- },
- {
- .name = "kmem.tcp.usage_in_bytes",
- .read_u64 = tcp_cgroup_read,
- .private = RES_USAGE,
- },
- {
- .name = "kmem.tcp.failcnt",
- .private = RES_FAILCNT,
- .trigger = tcp_cgroup_reset,
- .read_u64 = tcp_cgroup_read,
- },
- {
- .name = "kmem.tcp.max_usage_in_bytes",
- .private = RES_MAX_USAGE,
- .trigger = tcp_cgroup_reset,
- .read_u64 = tcp_cgroup_read,
- },
-};
-
static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto)
{
return container_of(cg_proto, struct tcp_memcontrol, cg_proto);
@@ -49,7 +18,7 @@ static void memcg_tcp_enter_memory_pressure(struct sock *sk)
}
EXPORT_SYMBOL(memcg_tcp_enter_memory_pressure);
-int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
+int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
/*
* The root cgroup does not use res_counters, but rather,
@@ -59,13 +28,12 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
struct res_counter *res_parent = NULL;
struct cg_proto *cg_proto, *parent_cg;
struct tcp_memcontrol *tcp;
- struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct net *net = current->nsproxy->net_ns;
cg_proto = tcp_prot.proto_cgroup(memcg);
if (!cg_proto)
- goto create_files;
+ return 0;
tcp = tcp_from_cgproto(cg_proto);
@@ -88,15 +56,12 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
cg_proto->sockets_allocated = &tcp->tcp_sockets_allocated;
cg_proto->memcg = memcg;
-create_files:
- return cgroup_add_files(cgrp, ss, tcp_files,
- ARRAY_SIZE(tcp_files));
+ return 0;
}
EXPORT_SYMBOL(tcp_init_cgroup);
-void tcp_destroy_cgroup(struct cgroup *cgrp)
+void tcp_destroy_cgroup(struct mem_cgroup *memcg)
{
- struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct cg_proto *cg_proto;
struct tcp_memcontrol *tcp;
u64 val;
@@ -270,3 +235,37 @@ void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx)
tcp->tcp_prot_mem[idx] = val;
}
+
+static struct cftype tcp_files[] = {
+ {
+ .name = "kmem.tcp.limit_in_bytes",
+ .write_string = tcp_cgroup_write,
+ .read_u64 = tcp_cgroup_read,
+ .private = RES_LIMIT,
+ },
+ {
+ .name = "kmem.tcp.usage_in_bytes",
+ .read_u64 = tcp_cgroup_read,
+ .private = RES_USAGE,
+ },
+ {
+ .name = "kmem.tcp.failcnt",
+ .private = RES_FAILCNT,
+ .trigger = tcp_cgroup_reset,
+ .read_u64 = tcp_cgroup_read,
+ },
+ {
+ .name = "kmem.tcp.max_usage_in_bytes",
+ .private = RES_MAX_USAGE,
+ .trigger = tcp_cgroup_reset,
+ .read_u64 = tcp_cgroup_read,
+ },
+ { } /* terminate */
+};
+
+static int __init tcp_memcontrol_init(void)
+{
+ WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys, tcp_files));
+ return 0;
+}
+__initcall(tcp_memcontrol_init);
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 1afaa28..7743ea8 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -22,22 +22,6 @@
#include <net/sock.h>
#include <net/cls_cgroup.h>
-static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp);
-static void cgrp_destroy(struct cgroup *cgrp);
-static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
-
-struct cgroup_subsys net_cls_subsys = {
- .name = "net_cls",
- .create = cgrp_create,
- .destroy = cgrp_destroy,
- .populate = cgrp_populate,
-#ifdef CONFIG_NET_CLS_CGROUP
- .subsys_id = net_cls_subsys_id,
-#endif
- .module = THIS_MODULE,
-};
-
-
static inline struct cgroup_cls_state *cgrp_cls_state(struct cgroup *cgrp)
{
return container_of(cgroup_subsys_state(cgrp, net_cls_subsys_id),
@@ -86,12 +70,19 @@ static struct cftype ss_files[] = {
.read_u64 = read_classid,
.write_u64 = write_classid,
},
+ { } /* terminate */
};
-static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
-{
- return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
-}
+struct cgroup_subsys net_cls_subsys = {
+ .name = "net_cls",
+ .create = cgrp_create,
+ .destroy = cgrp_destroy,
+#ifdef CONFIG_NET_CLS_CGROUP
+ .subsys_id = net_cls_subsys_id,
+#endif
+ .base_cftypes = ss_files,
+ .module = THIS_MODULE,
+};
struct cls_cgroup_head {
u32 handle;
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index c43a332..442204c 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -447,22 +447,16 @@ static struct cftype dev_cgroup_files[] = {
.read_seq_string = devcgroup_seq_read,
.private = DEVCG_LIST,
},
+ { } /* terminate */
};
-static int devcgroup_populate(struct cgroup_subsys *ss,
- struct cgroup *cgroup)
-{
- return cgroup_add_files(cgroup, ss, dev_cgroup_files,
- ARRAY_SIZE(dev_cgroup_files));
-}
-
struct cgroup_subsys devices_subsys = {
.name = "devices",
.can_attach = devcgroup_can_attach,
.create = devcgroup_create,
.destroy = devcgroup_destroy,
- .populate = devcgroup_populate,
.subsys_id = devices_subsys_id,
+ .base_cftypes = dev_cgroup_files,
};
int __devcgroup_inode_permission(struct inode *inode, int mask)
--
2.0.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment