Skip to content

Instantly share code, notes, and snippets.

@justinstoller
Last active October 11, 2018 20:40
Show Gist options
  • Save justinstoller/4aa80598c10d391f8aae97573be15087 to your computer and use it in GitHub Desktop.
Save justinstoller/4aa80598c10d391f8aae97573be15087 to your computer and use it in GitHub Desktop.
# to reliably reproduce run with: ruby repro.rb 10000
require 'fcntl'
class Pump
attr_reader :string
def initialize(io)
@io = io
@thread = nil
@string = ''
@run = true
@min_delay = 0.05
@max_delay = 1.0
end
def start
@thread = Thread.new { pump }
end
# Block until the pumping thread reaches EOF on the IO object.
def wait
@thread.join
end
private
def pump
backoff = @min_delay
while @run
begin
@string << @io.read_nonblock(4096)
backoff /= 2 if backoff > @min_delay
rescue Errno::EWOULDBLOCK, Errno::EAGAIN
backoff *= 2 if backoff < @max_delay
IO.select([@io], [], [], backoff)
rescue EOFError
@run = false
end
end
end
end
class Runner
def pipe
::IO.pipe.tap do |pair|
pair.each {|p| p.fcntl(Fcntl::F_SETFD, Fcntl::FD_CLOEXEC) }
end
end
def run
stdout_r, stdout_w = pipe
stderr_r, stderr_w = pipe
exec_r, exec_w = pipe
stdout_pump = Pump.new(stdout_r)
stderr_pump = Pump.new(stderr_r)
stdout_pump.start
stderr_pump.start
pid = fork do
exec_r.close
begin
Process.setsid
# Reopen file descriptors
STDOUT.reopen(stdout_w)
STDERR.reopen(stderr_w)
executable = 'ls'
exec([executable, executable], '.')
rescue SystemCallError => e
exec_w.write("#{e.class}: #{e.message}")
exit(254)
end
end
exec_w.close
stdout_w.close
stderr_w.close
stdout = ''
stderr = ''
status = nil
if !exec_r.eof?
stderr = exec_r.read || "exec() failed"
_, status = Process.waitpid2(pid)
else
_, status = Process.waitpid2(pid)
stdout_pump.wait
stderr_pump.wait
stdout = stdout_pump.string
stderr = stderr_pump.string
end
exec_r.close
stdout_r.close
stderr_r.close
return status, stdout, stderr
end
end
iterations = ARGV[0].to_i
iterations.times do
Runner.new.run
end
@justinstoller
Copy link
Author

Notes/draft commit message:
Previously on Bionic we would see issues where the Ruby interpreter
would die with an "ASYNC BUG EPIPE" message and no meaningful
stacktrace. It appears that this comes from rb_async_bug_errno in Ruby's
error.c:608.

When attempting to strace, when registering listeners for various
syscalls with sysdig, or when attaching to a debug build of Ruby with
gdb this issue no longer presents itself.

This seems like a race condition and when it happens seems relatively
normally distributed form the high level of running a Ruby script to
reproduce. ie. a ruby script that mimics r10k's subcommand runner and
is repeatedly called in a loop will sometime fail immediately, and
sometimes not until tens of thousands of iterations but most often fails
in the 1k-10k range of iterations.

I did not capture system metrics but failure frequencies did not seem to
change depending on the usage of the VM itself or what else was going on
in our VM infrastructure.

When using Ruby 2.3.7 the issue seems less likely to happen. It seemed
to happen a similar amount with 2.6.0preview2. It would happen with
2.5.1 on Ubuntu 16.04 (Xenial, with kernel 4.4 vs Bionics 4.15 kernel).
I could not reproduce it on Redhat 7 (Ruby 2.5.1, Linux kernel
3.something).

I was able to isolate this issue down to a race condition when forking
during the while loop of R10K::Util::Subprocess::Pump#pump through
process of elimination.

This commit moves starting the pipe output pumping until after the
subprocess fork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment