julik · April 8, 2013 08:07
diff --git a/fork_and_forget.rb b/fork_and_forget.rb
 #! /usr/bin/env ruby
 # "Fork and Forget"
 # Don't wait if you don't have to:  A mini-tutorial about concurrency
 # mechanisms in Ruby and basic Unix systems programming, and how you can use
 # them to avoid waiting.
 #
 # I have heard that people are occasionally unfamiliar with this strategy.
 # It's a common idiom, regardless of language, and it is also essentially built
 # into Erlang (and Termite Scheme, etc.).
 #
 # If you have a thing that takes forever and your program doesn't care so much
 # about its output (or prefers to collect it later through some other means,
 # like a pipe/file/DB/etc.), then this is probably the thing you want to do,
 # rather than stopping execution to wait.
 #
 # The relevant methods (fork, wait, waitpid, etc.) are essentially just
 # wrappers around the standard Unix system calls of the same name, and have
 # almost the same semantics, although they're often nicer to use in Ruby than
 # in C.
 #
 # Ruby:
 # Process.waitpid(fork {
 #       do_a_thing!
 # })
 #
 # C:
 # if((pid = fork()) == 0) {
 #       do_a_thing();
 #       _exit(0);
 # } else {
 #       waitpid(pid, NULL, 0);
 # }
 #
 # And so, without further ado, let's get started.
 
 require 'open-uri'
 
 # A quick utility.
 class File
        def self.append fn, str
                 open(fn, 'a') { |f| f << str }
        end
 end
 
 # Fibonacci.  The naive, recursive version is great for pegging the CPU and
 # simple to implement.
 def fib n
        return 1 if n < 2
        fib(n - 1) + fib(n - 2)
 end
 
 # We stuff the results into a file.  IO.popen is another good way to grab the
 # results if the current process wants to handle the data itself.
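 
 # A minimal sketch of the pipe approach, for when the parent does want the
 # result back.  compute_in_child is a hypothetical helper (nothing else in
 # this program calls it), and it uses IO.pipe plus fork directly rather than
 # IO.popen, so treat it as an illustration rather than the one true way.
 def compute_in_child(&work)
         reader, writer = IO.pipe
         pid = fork {
                 reader.close            # The child only writes.
                 writer.write work.call.to_s
                 writer.close
         }
         writer.close                    # The parent only reads.
         result = reader.read            # Blocks until the child closes its end.
         reader.close
         Process.wait pid                # Reap the child so it doesn't linger.
         result
 end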
 
 # Contrived.
 def io_bound url, output
         # URI.open is provided by open-uri; a plain open() no longer accepts URLs
         # as of Ruby 3.0.
         File.append output, URI.open(url).read
 end
 
 # Contrived, way contrived.
 def cpu_bound n, output
        File.append output, "#{fib(n)}\n"
 end
 
 # Now we get to the interesting bits about Unix processes, Ruby threads, and
 # concurrency.
 
 # Where do processes go when they die?  They become zombies.  (Unix terminology
 # is best terminology.)  The status will show up as a "Z" in the output of
 # top(1) or ps(1).  The main reason they hang out like this is to allow the
 # parent access to their status on exit.  If the parent exits before they do,
 # though, the processes become children of the init(8) process.  init
 # automatically reaps all of its children that become zombies.
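 
 # A small sketch of the parent collecting that exit status (child_exit_status
 # is a hypothetical helper, unused elsewhere in this program).  If the child
 # exits before the parent gets around to calling Process.wait2, it sits in the
 # process table as a zombie until that call is made.
 def child_exit_status(code)
         pid = fork { exit! code }
         _, status = Process.wait2 pid   # Returns [pid, Process::Status].
         status.exitstatus
 end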
 
 # This is a fairly simple loop that blocks until every child process has
 # exited, reaping each one as it finishes.  The children all run in parallel,
 # and we wait until they're done before we continue about our own business.
 def wait_for_children
        # Note that, unless you pass WNOHANG (see below), Process.wait will wait
        # for the processes (of course).
        true while((Process.wait rescue nil))
 end
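 
 # Ruby also ships this loop as Process.waitall, which blocks until every child
 # has exited and returns a [pid, status] pair for each one.  A sketch, with
 # run_all_and_wait as a hypothetical helper that this program doesn't call:
 def run_all_and_wait(jobs)
         jobs.each { |job| fork(&job) }
         # One true/false per child, in the order the children were reaped.
         Process.waitall.map { |pid, status| status.success? }
 end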
 
 # This, like the above, will reap zombies.  Every time we get a SIGCHLD
 # (SIGCLD is the older System V name for the same signal), we try to reap all
 # of the zombies we have.  Signal handlers are, as far as your program is
 # concerned, asynchronous (although this is a simplification), and when a
 # child process dies, a SIGCHLD is sent to the parent.  By default, this
 # signal is ignored (see the signal(7) man page), but if we set up a handler
 # for it, we can reap processes when they come back.
 def do_not_wait_for_children
        trap('CLD') {
                true while((Process.wait(-1, Process::WNOHANG) rescue nil))
        }
 end
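 
 # The same handler, sketched so that it records what it reaped and tells the
 # two "nothing to do" cases apart:  with WNOHANG, Process.wait returns nil when
 # children exist but none has exited yet, and raises Errno::ECHILD when there
 # are no children at all.  reap_into is a hypothetical helper, not called by
 # this program.
 def reap_into(reaped)
         trap('CHLD') {
                 begin
                         while (pid = Process.wait(-1, Process::WNOHANG))
                                 reaped << pid
                         end
                 rescue Errno::ECHILD
                         # No children left, so there is nothing to reap.
                 end
         }
 end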
 
 # Since Ruby threads are cheap and (mostly) non-blocking, you can do something
 # like this instead of a plain fork().  It spawns a thread that spawns a
 # process and waits to collect that process when it exits.
 def autoreap_fork(&b)
        Thread.new { Process.wait(fork(&b)) }
 end
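 
 # Ruby has this built in as Process.detach, which starts the reaping thread
 # for you and returns it.  A sketch (detached_fork is a hypothetical name, and
 # nothing else here calls it):
 def detached_fork(&b)
         Process.detach(fork(&b))        # The thread's value is a Process::Status.
 end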
 
 # Of course, there is another option, which is to do nothing.  You don't want
 # to fill the process table with zombies if you are, say, a long-running
 # process, but as noted above, if the parent exits, its children are adopted by
 # the init process, which will reap them when they become zombies.  So if
 # the output of these children is collected elsewhere later, you can just exit
 # after spawning them and let your OS do the rest.  That's the approach we take
 # here:  we spawn all of the children and forget about them.  When this process
 # dies, our children are handled by init.
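 
 # A long-running parent can get the same hand-off without exiting itself by
 # forking twice:  the intermediate child exits right away, so the grandchild is
 # orphaned and adopted immediately.  A sketch of that idiom (fork_and_forget is
 # a hypothetical helper that this program doesn't use):
 def fork_and_forget(&b)
         pid = fork {
                 fork(&b)                # The grandchild does the real work.
                 exit! 0                 # The intermediate child is done already.
         }
         Process.wait pid                # Reaped at once, so no zombie is left over.
 end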
 
 # We do some expensive calculations here.  Note that this will spawn several
 # processes, and fib(n) can be expensive to calculate (with the algorithm we
 # use) depending on your hardware.  You may want to adjust the numbers here.
 # To watch output as it arrives, you can run the following in another shell
 # before you start this program:
 #       touch /tmp/fibonacci_numbers ; tail -f /tmp/fibonacci_numbers
 (1..42).each { |n|
         # Note that these numbers don't necessarily arrive in order!  (Although, in
         # our case, it is likely that they will, since fib(n+1) will take longer to
         # calculate than fib(n).)
        fork {
                # This will make the output of top/ps a little more friendly, if you
                # want to watch the processes while they run.
                $0 = "fib[#{n}]"
                cpu_bound n, '/tmp/fibonacci_numbers'
        }
 }
 puts "Running a few calculations in the 'background'."
 
 %w(
        http://ruby-doc.org/core/classes/Process.html
        http://debu.gs/
        http://reverso.be/
        http://asdf.com/
        http://bigempire.com/filthy
        http://gist.github.com/
        http://localhost/
        http://code.google.com/p/termite/
 ).each { |url|
         # For Ruby, threads are cheaper than processes.  In MRI 1.8 they are
         # "green" threads; in MRI 1.9 and later they are OS-level threads, but a
         # global interpreter lock lets only one of them run Ruby code at a time
         # (JRuby is an exception here, but JRuby threads aren't as cheap).
         # Either way, CPU-bound work doesn't really speed up when parallelized
         # with threads.  Threading can also lead to odd bugs if the threads
         # touch any sort of shared resource:  unlike processes created by
         # forking, which each get their own copy of the parent's memory, threads
         # share memory, file descriptors, sockets, and other process-level
         # resources.  Using a Thread on a problem that is not CPU-bound (like
         # fetching a website from the internet, which is I/O-bound), though, will
         # let all of the I/O run almost in parallel while we wait.  Of course,
         # forking will work here, too.
        Thread.new {
                io_bound url, "/tmp/#{url.gsub(/[^a-z0-9\._]+/, '_')}.html"
        }
 }
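 
 # To see the overlap for yourself, here is a sketch that uses sleep as a
 # stand-in for a slow network read (an assumption made for illustration;
 # overlapped_waits is a hypothetical helper, not called by this program).
 # With the defaults, the four waits should overlap instead of running back to
 # back, so the elapsed time should be much closer to 0.2 seconds than to 0.8.
 def overlapped_waits(count = 4, seconds = 0.2)
         started = Time.now
         count.times.map { Thread.new { sleep seconds } }.each(&:join)
         Time.now - started              # Elapsed wall-clock time in seconds.
 end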
 
 puts "Downloading some web pages!"
 # You can use Thread#join to wait for a thread to finish.  I'm going to do
 # this the lazy way by, instead of keeping references to the Thread objects,
 # just asking Ruby for all of the threads except the current one, and joining
 # those.  By the time we arrive here, some of them may already be done;
 # Thread.list only returns threads that are still alive, so the finished ones
 # won't show up in it.
 (Thread.list - [Thread.current]).each { |thread|
         # We wrap it in an exception-eater because joining a thread that was
         # killed by an exception re-raises that exception here.  (Thread#join
         # returns the thread itself; Thread#value is what gives you the block's
         # return value.)  Since we don't especially care what the thread did or
         # even if it finished its mission, but we *do* want to wait until all of
         # the threads are finished before we exit (an arbitrary restriction; we
         # only care for the purposes of illustrating what to do when you care),
         # we join all of them but ignore any mishaps they may have encountered.
        begin
                thread.join
         rescue StandardError
        end
 }
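 
 # If you do care about the results, keep the Thread objects around and ask
 # each one for its value instead.  A sketch (thread_results is a hypothetical
 # helper, not called by this program) that returns the block's value for each
 # thread, or the exception that killed it:
 def thread_results(threads)
         threads.map { |thread|
                 begin
                         thread.value    # Joins, then returns the block's value.
                 rescue StandardError => e
                         e               # The exception re-raised by the join.
                 end
         }
 end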
 
 puts "Downloaded all of them!  (Or the threads crashed, or any combination.)"
 # At this point, the fibonacci processes may or may not have finished
 # (depending on your CPU speed versus the speed of your net connection), but
 # they're no longer our problem, since this program is over.