These are my notes from reading "Docker Storage Drivers".
With O_WRONLY or O_RDWR (write access): look it up in the top branch; if it's found there, open it;
otherwise, look it up in the other branches; if it's found, copy it to the read-write (top) branch, then open the copy
That "copy-up" operation can take a while if the file is big!
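The lookup-then-copy-up behavior described above can be sketched in plain shell. This is purely illustrative: the directory names (`top`, `ro1`, `ro2`) and the `open_for_write` helper are made up for the sketch, not anything aufs actually exposes.

```shell
#!/bin/sh
# Simulate a union mount's open-for-write: search the branches top-down,
# and copy the file up to the writable top branch before opening it.
set -e
mkdir -p top ro1 ro2                  # top = read-write branch, ro* = read-only image layers
echo "from image layer" > ro2/app.conf

open_for_write() {                    # $1 = path relative to the union mount
    if [ ! -e "top/$1" ]; then        # not yet in the top branch
        for br in ro1 ro2; do         # search lower branches in order
            if [ -e "$br/$1" ]; then
                cp "$br/$1" "top/$1"  # the "copy-up": costs O(size of file)
                break
            fi
        done
    fi
    # real code would now open top/$1 with O_WRONLY or O_RDWR
}

open_for_write app.conf
cat top/app.conf                      # → from image layer
```

The `cp` step is exactly why the first write to a big file is slow: the whole file is copied before a single byte is written.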
The AUFS mountpoint for a container is /var/lib/docker/aufs/mnt/$CONTAINER_ID/
It is only mounted when the container is running
The AUFS branches (read-only and read-write) are in /var/lib/docker/aufs/diff/$CONTAINER_OR_IMAGE_ID/
All writes go to /var/lib/docker
To see details about an AUFS mount:
look for its internal ID in /proc/mounts
look in /sys/fs/aufs/si_.../br*
each branch (except the two top ones) translates to an image
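The two inspection steps above can be chained together. The `/proc/mounts` line below is a hypothetical sample (the `si=` value differs on every mount); only the extraction of the id is actually run here, and the sysfs lookup is shown as a comment since it needs a real aufs host.

```shell
# Extract the aufs "si" id from a /proc/mounts entry, then (on a real
# aufs host) list the branches under /sys/fs/aufs/.
line='none /var/lib/docker/aufs/mnt/abc123 aufs rw,relatime,si=f6b3bfb0a29d3f7f 0 0'
si=$(printf '%s\n' "$line" | sed -n 's/.*si=\([0-9a-f]*\).*/\1/p')
echo "$si"                            # → f6b3bfb0a29d3f7f
# On an aufs host you would then run:
#   cat /sys/fs/aufs/si_$si/br[0-9]*  # one line per branch, suffixed =rw or =ro
```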
I hadn't looked at sysfs back when I was using aufs, so this is good to know.
Read/write access has native speeds
But the initial open() is expensive in two scenarios: when writing big files (log files, databases ...), and with many layers + many directories in PATH (dynamic loading, anyone?)
When starting the same container 1000x, the data is loaded only once from disk, and cached only once in memory (but dentries will be duplicated)
So aufs is only a performance problem at open() time, it seems.
The caching behavior is fine; the duplicated dentries are worth keeping in the back of my mind.
The mountpoint for a container is /var/lib/docker/devicemapper/mnt/$CONTAINER_ID/
It is only mounted when the container is running
The data is stored in two files, "data" and "metadata" (More on this later)
Since we are working on the block level, there is not much visibility on the diffs between images and containers
docker info will tell you about the state of the pool (used/available space)
List devices with dmsetup ls
Device names are prefixed with docker-MAJ:MIN-INO
MAJ, MIN, and INO are derived from the block major, block minor, and inode number where the Docker data is located (to avoid conflicts when running multiple Docker instances, e.g. with Docker-in-Docker)
Get more info about them with dmsetup info and dmsetup status (you shouldn't need this, unless the system is badly borked)
Snapshots have an internal numeric ID
/var/lib/docker/devicemapper/metadata/$CONTAINER_OR_IMAGE_ID is a small JSON file tracking the snapshot ID and its size
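Given the naming scheme above, the MAJ/MIN/INO parts are easy to split out of a `dmsetup ls` name. The sample name below (major 8, minor 1, inode 1050966, `-pool` suffix) is made up for illustration.

```shell
# Split a devicemapper device name of the form docker-MAJ:MIN-INO[-suffix]
# into its parts (sample name is hypothetical).
name='docker-8:1-1050966-pool'
maj=$(echo "$name" | cut -d- -f2 | cut -d: -f1)   # block major
min=$(echo "$name" | cut -d- -f2 | cut -d: -f2)   # block minor
ino=$(echo "$name" | cut -d- -f3)                 # inode of the Docker data
echo "major=$maj minor=$min inode=$ino"           # → major=8 minor=1 inode=1050966
```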
Check this later.
When there are no more blocks in the pool, attempts to write will stall until the pool is increased (or the write operation aborted)
Is this the cause of the stalls people report on CentOS and the like? (I've never hit it with aufs/overlay)
and sparse file performance isn't great anyway
Hmm. I thought dm was somewhat better, but given the capacity issue and the fact that it's a sparse file, that's probably where the various bottlenecks come from.
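A loopback-backed pool is just a sparse file: a huge apparent size with very little actually allocated. A quick demonstration with `truncate(1)` and `stat(1)` (the 100G figure is only an example size, not Docker's pool file itself):

```shell
# Create a sparse file and compare its apparent size with the number of
# blocks actually allocated on disk.
truncate -s 100G data                 # apparent size: 100 GiB, writes nothing
apparent=$(stat -c %s data)           # bytes the file claims to be
blocks=$(stat -c %b data)             # 512-byte blocks actually allocated
echo "apparent=$apparent allocated_blocks=$blocks"
rm data
```

All reads of unallocated ranges return zeros; writes force real allocation, which is part of why performance on sparse loopback files is poor.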
docker -d --storage-opt dm.datadev=/dev/sdb1 --storage-opt dm.metadatadev=/dev/sdc1
Not an option for me.
Pre-reading note: CoreOS is apparently dropping it too, so I won't expect much performance-wise. Skimming this part.
BTRFS integrates the snapshot and block pool management features at the filesystem level, instead of the block device level
Does that give it an edge over dm?
Data is not written directly; it goes to the journal first (in some circumstances, this will affect performance)
The performance will be half of the "native" performance
Didn't know that. One for the back of my mind.
# btrfs filesys balance start -dusage=1 /var/lib/docker
About chunks, and how to work around "No space left on device" errors.
I don't know btrfs well, so I don't really follow this part.
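As I understand it, the "No space left on device" situation happens when all raw space has been allocated to chunks even though the chunks themselves still have room; comparing the per-device "size" and "used" figures in `btrfs filesystem show` output reveals it. A parsing sketch over a hypothetical sample line (a real system would pipe in the command's actual output):

```shell
# Detect "all chunks allocated" from a (hypothetical) devid line of
# btrfs filesystem show: when "used" is close to "size", new chunk
# allocation fails and writes get ENOSPC even while df shows free space.
sample='	devid    1 size 20.00GiB used 19.95GiB path /dev/sdb1'
size=$(echo "$sample" | awk '{print $4}')
used=$(echo "$sample" | awk '{print $6}')
echo "size=$size used=$used"
# Recovery (needs root and a real btrfs volume), as quoted above:
#   btrfs filesys balance start -dusage=1 /var/lib/docker
```

The `-dusage=1` balance rewrites data chunks that are at most 1% used, freeing them back to the pool.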
Not much to tune.
# btrfs filesys show
Pre-reading note: probably the main contender. Watch out for the difference between Overlay (3.18+) and Overlayfs (up to 3.17, the custom patches carried by Ubuntu and others); only the former is usable.
Images and containers are materialized under /var/lib/docker/overlay/$ID_OF_CONTAINER_OR_IMAGE
Images just have a root subdirectory (containing the root FS)
Containers have:
lower-id → file containing the ID of the image
merged/ → mount point for the container (when running)
upper/ → read-write layer for the container
work/ → temporary space used for atomic copy-up
identical files are hardlinked between images
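The container layout above can be recreated by hand to make the pieces concrete. The image/container IDs below are made up, and the actual `mount -t overlay` call is left as a comment because it needs root and an overlay-capable kernel; the hardlink trick for identical files is demonstrated with `ln`.

```shell
#!/bin/sh
# Recreate the on-disk layout Docker uses for overlay (IDs are fake).
set -e
img=image1111; ctr=container2222
mkdir -p "$img/root/etc" "$ctr/upper" "$ctr/work" "$ctr/merged"
echo "$img" > "$ctr/lower-id"         # container points at its image
echo hello > "$img/root/etc/motd"

# "identical files are hardlinked between images": link count becomes 2
mkdir -p image3333/root/etc
ln "$img/root/etc/motd" image3333/root/etc/motd
stat -c %h "$img/root/etc/motd"       # → 2

# The mount Docker performs when the container starts (root required):
#   mount -t overlay overlay \
#     -o lowerdir=$img/root,upperdir=$ctr/upper,workdir=$ctr/work "$ctr/merged"
```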
Not much to tune at this point
Performance should be slightly better than AUFS: no stat() explosion, good memory use; copy-up is still slow (nobody's perfect)
Hmm. Is it really faster than aufs?
No copy on write. Docker does a full copy each time!
Space inefficient, slow
Not an option. Does anyone actually use this?
Might be useful for production setups
I can sort of see it, but in that case I feel like I wouldn't be using docker in the first place.
About discard and TRIM.
Very general TRIM background.
Also meaningful on copy-on-write storage (if/when every snapshot has trimmed a block, it can be freed)
With CoW, blocks aren't overwritten in place, so TRIM seems like a good fit.
discard = a filesystem (mount-time) option
First I've heard of the fstrim command.
discard works on Device Mapper + loopback devices
... but is particularly slow on loopback devices (the loopback file needs to be "re-sparsified" after container or image deletion, and this is a slow operation)
You can turn it on or off depending on your preference
Check later whether it's enabled for dm.
EOF