@DBarney
Last active May 3, 2020 17:35
Overview of Illumos Zone operation bottlenecks.

We constantly have around 1000 running zones and around 1000 stopped zones on a SmartOS machine. Having that many zones on the machine causes zone operations to take around 5 - 15 minutes apiece. That is a huge amount of time. For example, a few months ago we had a machine reboot and it took about 4 hours for all of the zones to transition to running.

4 hours. That sucked.

Before that we had been digging into speeding things up, but 4 hours made it a major priority.

These files are a short description of what we found and did to speed things up.

All flame graphs were generated using Brendan Gregg's FlameGraph.

In vplat.c the zfs zoned property is set if it hasn't been set yet. The issue is that setting this property causes the dataset to unmount and then remount, which takes time.

You can see this in this flame graph if you follow the stack:

zfs_prop_set
validate_datasets <- where the property is set
vplat_bringup
zone_ready
server
__door_return

(click on the image to interact with the svg) SVG for zone property

The fix is setting this property when creating the dataset.
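
To illustrate the idea (a hedged sketch using libzfs, not the actual patch, and `create_zone_dataset` is a made-up name), the `zoned` property can be supplied in the property list when the dataset is created, so nothing ever has to flip it afterwards and trigger the unmount/remount:

```c
#include <stdio.h>
#include <libzfs.h>
#include <libnvpair.h>

/*
 * Hypothetical sketch: create a zone's dataset with zoned=on already in
 * the creation properties, so zoneadmd never has to call zfs_prop_set()
 * on it later (which forces an unmount/remount of the dataset).
 */
int
create_zone_dataset(libzfs_handle_t *lzh, const char *dsname)
{
	nvlist_t *props;
	int err;

	if (nvlist_alloc(&props, NV_UNIQUE_NAME, 0) != 0)
		return (-1);

	/* set zoned (and any other properties we need) at creation time */
	if (nvlist_add_string(props, "zoned", "on") != 0) {
		nvlist_free(props);
		return (-1);
	}

	err = zfs_create(lzh, dsname, ZFS_TYPE_FILESYSTEM, props);
	nvlist_free(props);
	return (err);
}
```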

Another issue we were seeing is that the mnttab operations are slow. We have over 10,000 filesystems mounted on our production servers, and interacting with the mnttab was adding seconds onto zone operations.

You can see this in the following flame graph. Look at these two stacks:

getmntent
build_mnttable
lofs_read_mnttab
resolve_lofs
vplat_bringup
zone_ready
server
__door_return

and

getmntent
build_mnttable
lofs_read_mnttab
duplicate_reachable_path
vplat_create
zone_ready
server
__door_return

mnttab is slow

The fix was to not read the mnttab. These two code paths check for things that we will never do at Pagodabox, so we commented them out.
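
For context, this is roughly the shape of the work those paths do (a simplified sketch, not the illumos source): open /etc/mnttab and walk every entry with getmntent(), so the cost grows with the number of mounted filesystems:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mnttab.h>

/*
 * Simplified sketch of what a build_mnttable()-style helper does: scan
 * the entire mnttab looking for lofs mounts. With 10,000+ filesystems
 * mounted, every zone operation that does this pays for a full scan.
 */
static int
count_lofs_mounts(void)
{
	FILE *fp;
	struct mnttab mt;
	int count = 0;

	if ((fp = fopen(MNTTAB, "r")) == NULL)
		return (-1);

	/* on illumos, getmntent() returns 0 for each entry it reads */
	while (getmntent(fp, &mt) == 0) {
		if (strcmp(mt.mnt_fstype, "lofs") == 0)
			count++;
	}

	(void) fclose(fp);
	return (count);
}
```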

The dnlc is fast for lookups, but clearing it out is an O(n) operation, as every entry needs to be compared against the vnode being cleared. With millions of entries we were seeing this take 50+ seconds on our production SmartOS machines. That is 50 seconds on CPU.

You can see this with these two flame graphs. The first is of zoneadmd, the second is of the kernel.

umount2 <- syscall to unmount a filesystem
zfs_unmount
changelist_prefix
zfs_prop_set
validate_datasets
vplat_bringup
zone_ready
server
__door_return

(click on the image to interact with the svg) Userland flamegraph

dnlc_purge_fsvp
dounmount
umount2_engine
umount2
sys_syscall

(click on the image to interact with the svg) Kernel flamegraph

The fix is to implement a reverse lookup hash table so that purges are no longer an O(n) operation. The reverse lookup hash maps every vnode in the cache back to its cache entries, so a purge only has to visit those entries.
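
A minimal sketch of that idea, with made-up names and locking elided (this is not the actual patch): a second hash keyed on the vnode pointer chains the cache entries that reference it, so a purge walks only that chain instead of the whole cache:

```c
#include <sys/types.h>
#include <sys/vnode.h>
#include <sys/kmem.h>

/*
 * Hypothetical reverse-lookup hash for the DNLC (locking elided).
 * Each cache entry is also linked into a bucket keyed by its vnode, so
 * purging a vnode only walks that bucket's chain instead of comparing
 * every entry in the cache against the vnode.
 */
#define	RHASH_SIZE	1024

struct rentry {
	vnode_t		*re_vp;		/* vnode this cache entry references */
	void		*re_ncache;	/* the dnlc entry itself */
	struct rentry	*re_next;	/* chain within the bucket */
};

static struct rentry *rhash[RHASH_SIZE];

static int
rhash_bucket(vnode_t *vp)
{
	return (((uintptr_t)vp >> 4) % RHASH_SIZE);
}

/* purge: visit only the entries that reference vp, not the whole cache */
static void
rhash_purge_vp(vnode_t *vp)
{
	struct rentry **rpp = &rhash[rhash_bucket(vp)];
	struct rentry *rp;

	while ((rp = *rpp) != NULL) {
		if (rp->re_vp == vp) {
			*rpp = rp->re_next;
			/* here the real code would drop re_ncache from the DNLC */
			kmem_free(rp, sizeof (*rp));
		} else {
			rpp = &rp->re_next;
		}
	}
}
```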

The last thing we changed for on-CPU time was how devfsadm checked device drivers. It used regexes to compare device names, and we were seeing this take up noticeable CPU time.

We don't have a flame graph for this one; I lost it.

The fix was to change as many of the regexes to plain string comparisons as we could.
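
The change was roughly of this shape (illustrative only, with a hypothetical helper name): when a device-name "pattern" contains no regex metacharacters, compare it as a plain string instead of compiling and running a regex:

```c
#include <regex.h>
#include <string.h>

/*
 * Illustrative sketch: many of the patterns were literal device names,
 * so a plain strcmp() is enough and avoids regcomp()/regexec() on every
 * lookup.
 */
static int
name_matches(const char *pattern, const char *name)
{
	regex_t re;
	int match;

	/* no regex metacharacters? treat the pattern as a literal name */
	if (strpbrk(pattern, ".*[]^$+?()|\\") == NULL)
		return (strcmp(pattern, name) == 0);

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (0);
	match = (regexec(&re, name, 0, NULL, 0) == 0);
	regfree(&re);
	return (match);
}
```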

The next change we made was to zconsole creation, which was a serial operation. We found that not only was it taking 5 seconds to create a zconsole, but that as part of the creation it was also probing all of our scsi devices every time.

You can see the scsi device probing in this stack:

scsi_hba_bus_config
devi_config_common
mt_config_thread
thread_start

scsi devices being probed

And in this flame graph you can see that creating a zconsole takes a few seconds. This flame graph is of time spent off CPU (not running or pushed off CPU), and the counts are in nanoseconds.

init_console_dev
init_console
main
_start

Waiting on the zconsole

We made two changes to make zconsole creation faster. The first was to not probe our scsi devices on every zconsole creation. The second was to move the creation of a zconsole outside of a mutex.
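
As a simplified illustration of the second change (hypothetical helper names, not the actual zoneadmd diff), the slow console device creation is moved out of the critical section so concurrent zone boots no longer serialize behind it:

```c
#include <pthread.h>

/* hypothetical helpers standing in for the real zoneadmd code */
extern void create_console_device(void *zone);	/* slow: ~5s, probes devices */
extern void register_console(void *zone);	/* cheap bookkeeping */

static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

/* before: the slow device creation ran inside the mutex */
static void
init_console_dev_serial(void *zone)
{
	(void) pthread_mutex_lock(&console_lock);
	create_console_device(zone);
	register_console(zone);
	(void) pthread_mutex_unlock(&console_lock);
}

/* after: only the short bookkeeping step is serialized */
static void
init_console_dev_concurrent(void *zone)
{
	create_console_device(zone);

	(void) pthread_mutex_lock(&console_lock);
	register_console(zone);
	(void) pthread_mutex_unlock(&console_lock);
}
```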

The last change we made addressed SmartOS taking 15-20 minutes to boot. It would get to the splash screen and then just sit there.

After using mdb we discovered that it was validating the datasets in the zpool "zones" and then mounting them.

With 2000 datasets on a machine this takes about 15-20 minutes.

We did some searching to see if anyone had already written code to fix this issue, and we found this in the illumos-zfs lists. Someone had run into the same issue we had and created a potential patch, but it was never merged into illumos. We contacted the author and started the ball rolling to test it out.

Another thing that we fixed is that we no longer mount the zfs datasets for zones on boot, or when we are in the global zone. We have a custom utility that brings zones online after boot, and we extended it to ensure that the dataset is mounted for zones that are starting.
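
A minimal sketch of that lazy-mount step, assuming libzfs and a made-up helper name: when our utility starts a zone it checks whether the zone's dataset is mounted and mounts it on demand, rather than mounting everything at boot:

```c
#include <libzfs.h>

/*
 * Hypothetical helper: mount a zone's dataset on demand when the zone is
 * started, instead of mounting every dataset in the "zones" pool at boot.
 */
int
ensure_dataset_mounted(libzfs_handle_t *lzh, const char *dsname)
{
	zfs_handle_t *zhp;
	int err = 0;

	if ((zhp = zfs_open(lzh, dsname, ZFS_TYPE_FILESYSTEM)) == NULL)
		return (-1);

	if (!zfs_is_mounted(zhp, NULL))
		err = zfs_mount(zhp, NULL, 0);

	zfs_close(zhp);
	return (err);
}
```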

So before these changes were made, it would take 5 minutes (if you were lucky) to create a new zone after 500 were on the machine. And that same machine would take 20 minutes to reboot and 4 hours for all the zones to come online.

After all of these changes, it takes 15 seconds to create a zone after 1000 have already been created, and booting takes 40 seconds, with all the zones coming back online after 15 minutes.

Kind of amazing what you can do with dtrace and mdb. Especially considering we are dealing with a codebase larger than any one person can understand.

@tylerflint

It's worth noting that the very first thing we did was actually roll out the vminfod project. This effort has been coordinated with Josh Wilsdon. While it is not quite complete and some scenarios don't quite work yet, we needed a way to stop the bleeding, and the current implementation handles all of the scenarios that Pagoda Box requires. When vminfod was implemented, zone operations dropped from ~5m to ~2m. While that is a significant drop, it certainly wasn't enough. The changes cataloged here brought zone operations from ~2m to ~10s or less.
