@zh
Forked from octplane/polycorn.md
Created October 22, 2013 02:48

Life and death of Unicorns

The introduction of our Unicorn management tool, Polycorn.

jump into our train!

Photo by Protohiro from Flickr

At Fotopedia, we use Unicorn to serve our main Rails application. Every day, we restart our application several times, spawning and killing hundreds of Unicorns. Managing graceful restarts is a complex task that requires careful monitoring and control. This article introduces Polycorn, our Unicorn management program.


Working with Unicorn

Unicorn is a Ruby Rack HTTP server designed for fast clients and Unix. With simplicity in mind, it aims to provide a comfortable and powerful home for your application while remaining easy to integrate into your application stack.

At Fotopedia, we started using ModPassenger quite early in the initial development, but we had several problems with it. We regularly managed to crash or hang some workers, slowly but surely depleting the pool of available workers and leaving the application unresponsive until we restarted the whole Apache. At that time, our code contained several memory leaks (graciously provided by rmagick) and probably other issues, but the whole setup was very unstable and we stopped using ModPassenger around the 20th of June 2011.

Unicorn was quite young back then but already promising, and we quickly decided to use it in our infrastructure: it would fit nicely behind our Nginx, and it was quite easy to add some code to the stack to kill long-running processes.

This way, if a worker had not taken any request in the last 60s, we would know for sure it was stuck and we could safely send it ad patres and the master process would start another worker quickly.

What was wrong?

Deployments. Often.

Because the dev team at Fotopedia is very agile, anybody can work on almost any part of the code, commit a fix and ask for a deploy at any time (including Fridays at 5pm, yes, and even on weekends). As a result, we sometimes have to deploy things very quickly: sometimes because we want to iterate on a new feature, other times (much less frequently) because something we just deployed made all hell break loose. We end up deploying between 0 and 20 times a day. There is no rule.

When you update the application code, there are many ways to notify Unicorn of that update. The simplest of all is to just restart the daemon. Provided your stack is fault-tolerant, new incoming requests will end up on another backend and will be served anyway (after a variable but incompressible delay if your stack does not actively take the restarting backend out of rotation). This is a rather extreme way of life and we never actually did it that way.

Unicorn also supports various signals that tell it how to behave, and we started by sending HUP after a deploy. HUP "reloads" the whole Unicorn, including its configuration, and restarts all the workers. At that time, we had a small number of backends to serve all the requests (well, this is Ruby), and restarting a whole Unicorn (which happened to run 8 workers) cut our capacity a lot. Of course, the restarts were staggered at 60s intervals to ensure each newly started Unicorn had picked up where the old one left off, but still, there was some lag and the whole stack was slower for a few minutes.
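As a rough illustration, sending a signal to the Unicorn master from a deploy script could look like the sketch below. The helper name `signal_unicorn` is an assumption made for this example; only `Process.kill` and the pid file written by Unicorn come from the setup described here.

```ruby
# Hypothetical helper (not part of Unicorn or Polycorn): read the master PID
# from a pid file and send it a signal such as "HUP".
def signal_unicorn(pidfile, sig)
  pid = File.read(pidfile).to_i
  # Guard against an empty or garbled pid file: to_i would return 0, and
  # signalling PID 0 targets the whole process group of the caller.
  raise ArgumentError, "no valid PID in #{pidfile}" unless pid > 0
  Process.kill(sig, pid)
  pid
end
```

A call such as `signal_unicorn("/path/to/unicorn/pids/folder/unicorn.pid", "HUP")` would then ask the master to reload, assuming the pid file lives in that folder.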

Not cool.

So how can we fix that?

In the Unicorn documentation, there is a mention of the USR2 signal, which re-executes the running binary. This feature is very useful: it spawns a new Unicorn master, which in turn spawns its own workers. The new processes boot your application and start serving requests as soon as possible. At that point, you are running two versions of the software at the same time and things get more interesting...

However, you must be careful: if your application is broken or some dependencies are missing, the workers are likely to die very quickly. So you have to watch them and make sure they are "stable" and not dying too often. This matters all the more because the Unicorn master will try to keep n workers running at all times, so a quick glance at the process list will not immediately show you that something is wrong.

Once the situation has stabilised, you can kill the old master; it will leave the socket open for the new master, and your new code is now online. Hurray!
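One way to decide that the new workers are "stable" is to compare the worker PIDs between successive polls and to require a quiet period with no deaths. The sketch below is a simplified illustration of that idea, not the Polycorn code itself: the class name `StabilityTracker` is an assumption, and the 40-second quiet period mirrors the value hard-coded in the script further down.

```ruby
# Hypothetical helper: remembers the worker PIDs seen at the last poll and
# the time of the most recent worker death.
class StabilityTracker
  def initialize(quiet_period, now)
    @quiet_period = quiet_period # seconds without any death required
    @last_death = now
    @known = []
  end

  # pids: worker PIDs seen in this poll; now: current time in seconds.
  # Returns the PIDs that disappeared since the previous poll.
  def observe(pids, now)
    died = @known - pids
    @last_death = now unless died.empty? # a death restarts the quiet period
    @known = pids
    died
  end

  def stable?(now)
    now - @last_death >= @quiet_period
  end
end

tracker = StabilityTracker.new(40, 0)
tracker.observe([101, 102], 0)
tracker.observe([101, 103], 10) # 102 died: the quiet period restarts at t=10
tracker.stable?(45) # => false (only 35s since the last death)
tracker.stable?(50) # => true
```

Because the master immediately replaces dead workers, counting processes is not enough; tracking which PIDs vanish is what reveals a crash loop.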

Enter Polycorn

Polycorn was born from our need to automate these restarts without giving up on the overall monitoring we want. Polycorn is a Unicorn manager (hence its very original name). It serves 4 goals:

  • Start, Stop and Reload Unicorn gracefully
  • Detect always-dying Unicorns
  • Detect leaking Unicorns
  • Notify the outside world about what is going on

Because we wanted a robust design without too much overhead, Polycorn dies a lot. Every time its internal state changes, it commits suicide and expects to be restarted. We use runit to supervise Polycorn, and runit always ensures that an instance of Polycorn is running.

Because we work with Unicorn, we do not need to keep any internal state (except for short transient states):

  • Unicorn maintains the pid of the main Unicorn master
  • It also maintains the pid of the new Unicorn master, when transitioning via USR2
  • The process table can be inspected for the Unicorn master's children
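Since everything lives in the pid files, the current situation can be recomputed from scratch at every start. The sketch below illustrates this under a few assumptions: the function names and the state symbols are invented for the example, while the file names `unicorn.pid` and `unicorn.pid.oldbin` are the ones the script further down looks for. Unlike the real script, it does not check that the PIDs still belong to live Unicorn processes.

```ruby
# Hypothetical helper: returns the PID stored in a pid file, or nil.
def read_pid(path)
  File.exist?(path) ? File.read(path).to_i : nil
end

# Derive the situation from the two pid files found in pid_folder.
def polycorn_state(pid_folder)
  current = read_pid(File.join(pid_folder, "unicorn.pid"))
  old     = read_pid(File.join(pid_folder, "unicorn.pid.oldbin"))

  if current && old
    :transitioning    # a USR2 restart is in progress
  elsif old
    :old_master_only  # something went wrong during a restart
  elsif current
    :running          # nothing to declare
  else
    :not_started      # Unicorn has to be launched
  end
end
```

Each of the four outcomes corresponds to one branch of the main `if` in the script, which is why Polycorn can afford to exit after every transition.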

Installation

Get it from https://gist.github.com/octplane/7039960.

Polycorn has no dependency other than sys-proctable, a small library for process table introspection. The main script is the only thing you will need.

Invocation

Polycorn invocation consists of 2 parameters:

polycorn /path/to/unicorn/pids/folder "unicorn path and options"

So a simple invocation will look like:

/usr/local/bin/polycorn /path/to/unicorn/pids/folder unicorn -E production /ftn/apps/our/testing/current/config.ru -D

While a more complex use case could be (this is more or less what we use at Fotopedia):

exec /usr/local/bin/polycorn /ftn/apps/our/shared/pids "export RBENV_ROOT=/opt/rbenv; \
export PATH=/opt/rbenv/shims:/opt/rbenv/bin:/usr/local/bin/bin:/usr/local/bin:/usr/bin:/bin; \
unset GEM_HOME; unset GEM_PATH; export RUBYOPT=W0; unset RBENV_DIR; unset BUNDLE_BIN_PATH; \
unset RBENV_HOOK_PATH; unset BUNDLE_GEMFILE; unset RBENV_VERSION; cd /tmp/; \
RBENV_VERSION=$(cat /ftn/apps/our/testing/current/.rbenv-version) \
BUNDLE_GEMFILE=/ftn/apps/our/testing/current/Gemfile \
chpst -u apps:apps bundle exec unicorn -E development -c /etc/our.unicorn.conf.rb /ftn/apps/our/testing/current/config.ru -D"
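Both the quoted and the unquoted form work because the script takes the first argument as the pid folder and joins everything after it into a single command string. A minimal sketch of that parsing follows; the function name `parse_args` and the argument-count guard are additions for this example and are not in the original script.

```ruby
# Hypothetical helper mirroring how the script reads ARGV.
def parse_args(argv)
  # Guard added for the example: the original script does not validate ARGV.
  raise ArgumentError, "usage: polycorn PID_FOLDER COMMAND..." if argv.length < 2
  pid_folder = argv[0]
  command = argv[1..-1].join(" ") # remaining arguments form one command line
  [pid_folder, command]
end

quoted   = parse_args(["/pids", "unicorn -E production config.ru -D"])
unquoted = parse_args(["/pids", "unicorn", "-E", "production", "config.ru", "-D"])
quoted == unquoted # => true
```

Quoting becomes necessary as soon as the command contains shell syntax such as `;` or `$(...)`, since otherwise the calling shell would interpret it before Polycorn sees it.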

Configuration

Depending on your needs, you might want to customize Polycorn a bit further. To do so, create a file /etc/polycorn.conf.rb and put your extra configuration there.

The default configuration is:

# Maximum time to wait before declaring an emergency in Polycorn state processing
@max_wait = 60
# Maximum RSS a Unicorn can use before being considered as too fat.
@max_rss = 1_400_000_000
# Called when something has to be told to the outer world
def alert(message)
end
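To make the role of @max_rss concrete, here is a small sketch of the check it drives: every worker whose resident set size reaches the limit is selected for a graceful QUIT. The function names are assumptions for this example, and the 4096-byte page size is the value hard-coded in the script further down rather than something queried from the system.

```ruby
PAGE_SIZE = 4096 # assumption: 4 KiB pages, as hard-coded in the Polycorn script

# Convert an RSS expressed in pages into bytes.
def rss_bytes(rss_pages)
  rss_pages * PAGE_SIZE
end

# rss_by_pid maps worker PID => RSS in bytes.
# Returns the sorted PIDs of the workers at or above max_rss.
def workers_over_limit(rss_by_pid, max_rss)
  rss_by_pid.select { |_pid, rss| rss >= max_rss }.keys.sort
end

workers = { 10 => 1_500_000_000, 11 => 900_000_000, 12 => rss_bytes(350_000) }
workers_over_limit(workers, 1_400_000_000) # => [10, 12]
```

The master replaces a worker that quits, so recycling a bloated worker this way costs one worker's capacity for the time it takes to boot a new one.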

The configuration file can also be used if you use Polycorn in a Bundler environment, or if you require some other library for your alert processing.

For example, our configuration file looks like this:

# Generated by Chef
ENV['BUNDLE_GEMFILE']='/etc/bundler.chef/Gemfile'
require 'rubygems'
require 'bundler/setup'

require 'fwissr'
GRID=Fwissr['/grid']
CHANNEL = case GRID
when 'prod'
  "#unicorn"
else # e.g. 'testing'
  "#unicorn-testing"
end

def alert(message)
  # log message
  irc_report(CHANNEL, message)
end

@max_wait = 60
@max_rss = 1_400_000_000

Polycorn was written by Oct, our Server Architect. It was written a long time ago in our Chef repository. It's a Ruby script.

#!/usr/bin/env ruby
# :vi:syntax=ruby:

# Default alert hook (no-op); /etc/polycorn.conf.rb may redefine it.
def alert(message)
end

@max_wait = 60
@max_rss = 1_400_000_000

# Log to stderr and forward the same message to the alert hook.
def log(s)
  msg = "[polycorn:#{Process.pid}] #{s}"
  $stderr.puts(msg)
  alert(msg)
end

if File.exist?("/etc/polycorn.conf.rb")
  log("Loading configuration")
  require '/etc/polycorn.conf.rb'
end

require 'sys/proctable'
include Sys

# Human-readable size: megabytes from 1 MB upwards, kilobytes below.
def format_memory(mem)
  if mem >= 1_000_000
    "%3.3fM" % (mem / 1_000_000.0)
  else
    "%.fk" % (mem / 1_000.0)
  end
end
# Check that the PID belongs to an actual unicorn
def is_unicorn(pid)
  process = ProcTable.ps(pid)
  process && process.cmdline =~ /unicorn/
end

# Wraps a block to run once a wait has lasted too long.
class EmergencyAction
  def initialize(&block)
    @action = block
  end

  def run
    @action.call
  end
end

class MessageEmergencyAction < EmergencyAction
  # Build an action that only logs the given message.
  def self.message(msg)
    EmergencyAction.new do
      log(msg)
    end
  end
end

class MessageAndKillAction < EmergencyAction
  def initialize(pid)
    @pid = pid
    @counter = 0
    super() {
      @counter += 1
      log("Unicorn #{@pid} won't respond to USR2. Help me !(#{@counter})")
      # Suicide after a while
      if @counter > 3
        Process.kill("QUIT", @pid)
        exit 0
      end
    }
  end
end

# Poll the block until it returns true. Once @max_wait seconds have passed,
# run the emergency action on every further iteration.
def wait_until(emergency_action, short_message, &block)
  if emergency_action.is_a?(String)
    emergency_action = MessageEmergencyAction.message(emergency_action)
  end
  sleep_interval = 2
  wait = 0
  while !block.call
    log short_message
    wait += sleep_interval
    sleep(sleep_interval)
    if wait > @max_wait
      # Something is very wrong, unicorn won't stop. Call for help
      sleep_interval = 20
      emergency_action.run
    end
  end
end

def terminate
  log("QUITting #{@unicorn}")
  Process.kill("QUIT", @unicorn)
  wait_until "Unicorn #{@unicorn} won't respond to QUIT. Help me !",
             "Waiting for termination..." do
    ProcTable.ps(@unicorn).nil?
  end
  # Yay !
  log("Finished. Good bye !")
  exit 0
end

# Forward a signal to the current Unicorn master.
def signal(sig)
  log("#{sig}ing #{@unicorn}")
  Process.kill(sig, @unicorn)
end

# Map each child PID of the given master to its RSS in bytes.
# ProcTable reports RSS in pages; 4096 bytes per page is assumed here.
def get_children_and_rss(pid)
  current = {}
  ProcTable.ps { |process|
    current[process.pid] = process.rss * 4096 if process.ppid == pid
  }
  current
end

def reload
  log("Reloading unicorn(#{@unicorn}) (USR2)")
  Process.kill("USR2", @unicorn)
  # Give some time to unicorn to start restarting
  # FIXME monitor creation of more unicorns
  old_unicorn_pidfile = File.join(@pid_folder, "unicorn.pid.oldbin")
  wait_until MessageAndKillAction.new(@unicorn),
             "Waiting for new unicorn to be started..." do
    File.exist?(old_unicorn_pidfile)
  end
  log("See you soon")
  exit(0)
end

@pid_folder = ARGV[0] #'/ftn/apps/our/testing/shared/pids'
application = ARGV[1..-1].join(" ") # unset RBENV_VERSION; /opt/rbenv/shims/unicorn -E production -c /etc/our.unicorn.conf.rb -E development /ftn/apps/our/testing/current/config.ru -D'
old_unicorn_pidfile = File.join(@pid_folder, "unicorn.pid.oldbin")
unicorn_pidfile = File.join(@pid_folder, "unicorn.pid")
@old_unicorn = File.exist?(old_unicorn_pidfile) && File.read(old_unicorn_pidfile).to_i
@unicorn = File.exist?(unicorn_pidfile) && File.read(unicorn_pidfile).to_i

# Drop stale pid files whose PID no longer belongs to a unicorn
if @unicorn && !is_unicorn(@unicorn)
  @unicorn = nil
  File.unlink(unicorn_pidfile)
end
if @old_unicorn && !is_unicorn(@old_unicorn)
  @old_unicorn = nil
  File.unlink(old_unicorn_pidfile)
end

log("Polycorn (ruby #{RUBY_VERSION}) starting.")

if @old_unicorn && @unicorn
  # We are transitioning
  log("Waiting for the new unicorn to be ready")
  # Every process that enters this list must never leave
  current = {}
  now = Time.now.to_i
  wait_until "Unicorn is not starting correctly. Help me !",
             "Waiting for startup..." do
    old = current
    current = get_children_and_rss(@unicorn)
    died_processes = old.keys - current.keys
    if died_processes.length > 0
      log("[warn] #{died_processes.join(", ")} died since last check.")
      # Reset timeout
      now = Time.now.to_i
      false
    else
      # Ready once no worker has died for 40 seconds
      Time.now.to_i - now >= 40
    end
  end
  log("New Unicorn is ready, stopping old Unicorn")
  Process.kill("WINCH", @old_unicorn)
  Process.kill("QUIT", @old_unicorn)
  wait_until "Old Unicorn #{@old_unicorn} won't respond to QUIT. Help me !",
             "Waiting for the old unicorn to stop..." do
    ProcTable.ps(@old_unicorn).nil?
  end
  log("Old Unicorn has stopped. Exiting.")
  exit 0
elsif @old_unicorn && !@unicorn
  # We have an old unicorn but no current. Something went wrong
  log("We have an old unicorn but no current. Something went wrong. Help me :'(")
  exit 0
elsif @unicorn && !@old_unicorn
  # nothing to declare. Wait until signal
  log("Unicorn is running at #{@unicorn}.")
  log("Waiting until we receive a signal.")
  ["INT", "TERM", "QUIT"].each do |sig|
    Signal.trap(sig) do
      terminate
    end
  end
  Signal.trap("USR2") do
    reload
  end
  # HUP and USR1 are forwarded to the master as-is
  ["HUP", "USR1"].each do |sig|
    Signal.trap(sig) do
      signal(sig)
    end
  end
  current = {}
  survivors = nil
  counter = 0
  while true do
    # Check memory
    current = get_children_and_rss(@unicorn)
    if counter % 24 == 0
      log("[info] mem:" + current.keys.sort.map { |pid| "#{pid} (#{format_memory(current[pid])})" }.join(" "))
    end
    counter += 1
    current.each do |pid, rss|
      if rss.to_i >= @max_rss
        Process.kill("QUIT", pid)
        log("[warn] QUITted #{pid}: was using #{rss}")
      end
    end
    # Check running PIDs: survivors are the workers seen in every poll so far
    if survivors == nil
      survivors = current.keys
    else
      survivors = survivors & current.keys
    end
    if survivors && survivors.length == 0
      # Everybody has died, it sucks
      log("All unicorn have died, killing master.")
      Process.kill("TERM", @unicorn)
    end
    sleep(5)
  end
else
  # none of the two
  # Start regular
  log("Starting unicorn...")
  if defined?(Bundler)
    # Recent Bundler releases deprecate with_clean_env in favour of with_unbundled_env
    Bundler.with_clean_env do
      fork { exec(application) }
    end
  else
    fork { exec(application) }
  end
  # Ohai, kill myself
  exit 0
end