We constantly have around 1000 running zones and around 1000 stopped zones on a SmartOS machine. Having that many zones on the machine causes zone operations to take around 5 to 15 minutes apiece. That is a huge amount of time. For example, a few months ago we had a machine reboot, and it took about 4 hours for all of the zones to transition to running.
4 hours. That sucked.
Before that we had been digging into speeding things up, but 4 hours made it a major priority.
These files are a short description of what we found and did to speed things up.
All flame graphs were generated using Brendan Gregg's FlameGraph.
It's worth noting that the very first thing we did was roll out the vminfod project. This effort has been coordinated with Josh Wilsdon. While it is not quite complete and some scenarios don't quite work yet, we needed a way to stop the bleeding. The current implementation handles all of the scenarios that Pagoda Box requires. Once vminfod was in place, zone operations dropped from ~5m to ~2m. While that is a significant drop, it certainly wasn't enough. The changes cataloged here brought zone operations from ~2m to ~10s or less.