@rwstauner
Created December 18, 2025 19:32

GRPC 1.76.0 segfault

Use-After-Free in LockfreeEvent::SetReady (Epoll1Poller)

Summary

A use-after-free bug in the GRPC PosixEventEngine causes crashes when LockfreeEvent::SetReady() dereferences pointers read from an Epoll1EventHandle whose memory has already been freed and reused.

Affected Version: GRPC 1.76.0 (Ruby bindings)

Impact: Multiple crashes observed in production Ruby applications.

Backtrace

#0  absl::lts_20250512::Status::operator= (this=0x10000000501013c)
    at third_party/abseil-cpp/absl/status/status.h:779
#1  grpc_event_engine::experimental::PosixEngineClosure::SetStatus (this=0x100000005010104)
    at ./src/core/lib/event_engine/posix_engine/posix_engine_closure.h:41
#2  grpc_event_engine::experimental::LockfreeEvent::SetReady (this=0x7bd5354fa988)
    at src/core/lib/event_engine/posix_engine/lockfree_event.cc:236
#3  grpc_event_engine::experimental::Epoll1EventHandle::ExecutePendingActions (this=0x7bd5354fa960)
    at src/core/lib/event_engine/posix_engine/ev_epoll1_linux.cc:122
#4  grpc_event_engine::experimental::Epoll1Poller::Work(...)
    at src/core/lib/event_engine/posix_engine/ev_epoll1_linux.cc:445
#5  grpc_event_engine::experimental::PosixEventEngine::PollingCycle::PollerWorkInternal
#6+ ... WorkStealingThreadPool -> pthread_create

Evidence of Use-After-Free

1. Handle Memory Reused for User-Agent String

In one crash, the Epoll1EventHandle memory was reused to store a user-agent string:

(gdb) x/16s 0x7bd67bf628a0
0x7bd67bf628b0: "gl-ruby/"
0x7bd67bf628bb: ".7 gccl/2.11.1 gax/1.1.0 gapic/1.3.0 grpc/1.7"

The corrupted closure pointer 0x332f627020302e36, read as little-endian bytes, is the ASCII text "6.0 pb/3" - a fragment of the user-agent string that overwrote the handle memory.
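
The decoding is easy to verify. A standalone sketch (not gRPC code) that reinterprets the bogus pointer value as bytes:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // The garbage "closure" pointer value quoted above.
  const uint64_t bogus_pointer = 0x332f627020302e36;
  // On a little-endian machine (x86_64) the low byte is stored first, so copying
  // the raw bytes into a char buffer recovers the text that overwrote the handle.
  char text[sizeof(bogus_pointer) + 1] = {0};
  std::memcpy(text, &bogus_pointer, sizeof(bogus_pointer));
  std::printf("%s\n", text);  // prints: 6.0 pb/3
  return 0;
}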

2. VTable Points to malloc's main_arena

In another crash, the handle's vtable pointer points to glibc's malloc arena:

(gdb) print *this   # Epoll1EventHandle
$3 = {
  _vptr.EventHandle = 0x7f6105a08b60 <main_arena+160>,  # VTABLE POINTS TO MALLOC ARENA!
  poller_ = 0xc6,                          # garbage
  read_closure_ = {
    state_ = 0,
    thread_pool_ = 0x97                    # garbage - CRASH CAUSE
  },
  write_closure_ = { thread_pool_ = 0x50 },
  error_closure_ = { thread_pool_ = 0x100000001 }
}

This proves:

  1. The Epoll1EventHandle was deleted
  2. The memory was returned to malloc's free list
  3. malloc wrote its free-list metadata over the object - the fd/bk links of a freed chunk point at a bin head inside main_arena, which is exactly the shape of the bogus vtable value (illustrated below)
  4. A worker thread then crashed dereferencing the stale handle pointer

Root Cause Analysis

The Bug Location

In ev_epoll1_linux.cc, Epoll1Poller::Work() (lines 418-448):

Poller::WorkResult Epoll1Poller::Work(...) {
  Events pending_events;
  {
    grpc_core::MutexLock lock(&mu_);
    ProcessEpollEvents(..., pending_events);  // Collects handle pointers
  }  // MUTEX RELEASED HERE

  schedule_poll_again();

  for (auto& it : pending_events) {
    it->ExecutePendingActions();  // CRASH: handle may be freed
  }
}

The Race Window

  1. ProcessEpollEvents() adds Epoll1EventHandle* raw pointers to pending_events
  2. Mutex is released
  3. Another thread orphans the handle (via OrphanHandle() -> freelist -> Close() -> delete)
  4. Memory is reallocated for other purposes
  5. Worker thread calls ExecutePendingActions() on freed memory
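
The shape of the race, as a minimal self-contained sketch (hypothetical Handle/Poller stand-ins, not the real gRPC classes):

#include <mutex>
#include <vector>

struct Handle {
  void ExecutePendingActions() { /* dereferences members of *this */ }
};

struct Poller {
  std::mutex mu;
  std::vector<Handle*> ready;  // handles the epoll loop found readable

  void Work() {
    std::vector<Handle*> pending_events;
    {
      std::lock_guard<std::mutex> lock(mu);
      pending_events = ready;            // (1) raw pointers collected under the lock
    }                                    // (2) lock released here
    // (3)/(4) another thread may now orphan and delete any of these handles,
    //         and malloc may hand the memory out again
    for (Handle* h : pending_events) {
      h->ExecutePendingActions();        // (5) use-after-free if h was deleted
    }
  }
};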

Crash Location in lockfree_event.cc

// lockfree_event.cc:235-237
auto closure = reinterpret_cast<PosixEngineClosure*>(curr);
closure->SetStatus(absl::OkStatus());  // CRASH: closure is garbage
thread_pool_->Run(closure);            // CRASH: thread_pool_ is garbage

Ruby-Specific Trigger

The crash occurs during normal operation (not shutdown) because Ruby GC destroys GRPC channels while EventEngine workers still hold handle pointers.

Ruby GC Integration

// rb_channel.c (simplified)
static void grpc_rb_channel_free(void* p) {
  grpc_rb_channel* wrapper = (grpc_rb_channel*)p;
  grpc_channel_destroy(wrapper->channel);  // Destroys C-core channel!
  xfree(p);
}

static rb_data_type_t grpc_channel_data_type = {
    "grpc_channel",
    {grpc_rb_channel_mark, grpc_rb_channel_free, ...},
    RUBY_TYPED_FREE_IMMEDIATELY  // <-- Freed during GC, not deferred
};

Global EventEngine

The EventEngine is a global singleton that outlives individual channels:

// default_event_engine.cc
std::shared_ptr<EventEngine> GetDefaultEventEngine() {
  // Returns global engine - ONE instance shared by all channels
}

Race Sequence

T1 (Ruby Main Thread - GC):
  1. Channel has no more Ruby references
  2. Ruby GC collects channel
  3. grpc_rb_channel_free() -> grpc_channel_destroy()
  4. Channel's Epoll1EventHandle goes to freelist
  5. Handle deleted or memory reused

T2 (EventEngine Worker Thread):
  - Still has handle pointer in pending_events
  - Calls ExecutePendingActions() on freed handle
  - CRASH!

Related Issue

This appears related to grpc/grpc#19195. The test call_credentials_timeout_test.rb exercises a similar race condition.

Suggested Fixes

  1. Extend mutex scope: Hold the mutex through the pending_events iteration (see the sketch after this list)

  2. Use shared_ptr for handles: Convert pending_events to hold std::shared_ptr<Epoll1EventHandle> instead of raw pointers

  3. Reference counting: Prevent handles from being freed while referenced in pending_events

  4. Additional vulnerability: PosixEndpointImpl stores a raw poller_ pointer (posix_endpoint.h:593) with no lifetime guarantee. If the poller is destroyed while endpoints exist, OrphanHandle() will access freed memory at poller_->posix_interface().
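
A sketch of option 1, applied to the simplified Poller shape shown earlier (in the real Epoll1Poller this assumes handle destruction also acquires mu_ and that the pending callbacks never re-enter the poller while it is held; both would need to be verified). The Work() method from that sketch becomes:

  void Work() {
    std::lock_guard<std::mutex> lock(mu);        // held across the whole cycle
    std::vector<Handle*> pending_events = ready;
    for (Handle* h : pending_events) {
      h->ExecutePendingActions();                // deletion is excluded while mu is held
    }
  }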

Files Involved

File                Lines     Issue
ev_epoll1_linux.cc  440-445   Mutex released before iterating pending_events
ev_epoll1_linux.cc  270-275   Freelist allows handles to be deleted while referenced
lockfree_event.cc   235-237   No validation before dereferencing closure pointer
posix_endpoint.h    593       Raw poller_ pointer with no lifetime guarantee
rb_channel.c        68-78     Ruby GC frees channel immediately via RUBY_TYPED_FREE_IMMEDIATELY
repro.rb

#!/usr/bin/env ruby
# Set GRPC_REPO=/path/to/repo or put this script at the root of that repo
# $ GRPC_VERSION=1.75.0 ./repro.rb # works every time
# $ GRPC_VERSION=1.76.0 ./repro.rb # SEGV 40+% of the time
require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem 'grpc_fork_safety'
  gem 'google-protobuf', '~> 4.31.1'
  gem 'grpc', ENV.fetch('GRPC_VERSION', '1.76.0')
end

puts "GRPC #{$".detect { |l| l =~ /grpc_c/ }}"

GRPC_REPO = ENV.fetch("GRPC_REPO") { Dir.pwd }.tap do |repo|
  "#{repo}/examples/ruby/lib".tap do |lib|
    raise "Set GRPC_REPO to path to grpc/grpc repo" unless File.directory?(lib)
    $: << lib
  end
end

require 'helloworld_pb'
require 'helloworld_services_pb'
require "socket"

def log(msg)
  STDERR.puts "#{Time.now.strftime("%H:%M:%S.%3N")} #{$0} [#{$$}] #{msg}"
end

# GrpcForkSafety.before_disable { log "grpc_fork_safety before_disable" }
# GrpcForkSafety.after_enable { log "grpc_fork_safety after_enable" }

def title(t)
  $orig_title ||= $0
  $0 = "#{$orig_title} #{t}"
end

def wait(who, pid)
  log "wait #{pid.inspect} #{who}"
  _, status = Process.waitpid2(pid)
  log "waited for #{who}: #{status.inspect}"
end

def kill(sig, pid)
  log "kill #{sig} => #{pid}"
  Process.kill(sig, pid)
end

def health_check
  pid = Process.fork { title "health check"; Process.exit!(0) }
  _, status = Process.waitpid2(pid)
  unless status&.success?
    $stderr.puts("after_mold_fork the mold seems unhealthy (fork+exit), aborting")
    Process.exit!(1)
  end
  # LibC getaddrinfo uses a mutex; if a native thread was holding it when we forked
  # then `getaddrinfo` may no longer work in the child.
  Socket.getaddrinfo("example.com", nil)
end

# process management {{{
def process_wait_with_timeout(pid, timeout)
  (timeout * 50).times do
    _, status = Process.waitpid2(pid, Process::WNOHANG)
    return status if status
    sleep 0.02 # 50 * 20ms => 1s
  end
  log "process #{pid} timed out, killing"
  # The process didn't exit in the allotted time, so we kill it.
  Process.kill('INT', pid) rescue nil
  Process.kill('TERM', pid) rescue nil
  Process.kill('KILL', pid) rescue nil
  _, status = Process.waitpid2(pid)
  status
end

def double_fork
  log "double-forking"
  r, w = IO.pipe
  if middle_pid = fork
    log "parent forked #{middle_pid}"
    w.close
    process_wait_with_timeout(middle_pid, 60)
    str_pid = r.gets
    r.close
    raise "child failed to register grandchild: #{str_pid.inspect}" unless str_pid
    log "got #{str_pid}"
    return str_pid.to_i
  else
    title "middle"
    log "middle child forked"
    r.close
    pid = fork do
      begin
        title "inner"
        log "inner child forked"
        w.close
        yield
        # NOTE: if the parent exits without waiting on this
        # we won't know what happened so we log any errors explicitly.
        log "inner child succeeded"
      rescue StandardError
        log "inner child ERROR #{$!.message}"
        exit(1)
      end
    end
    log "middle child exiting. inner child pid #{pid}"
    begin
      w.puts(pid)
      w.close
    rescue Errno::EPIPE
      log $!.message
    end
    log "middle child exit"
    wait("inner child", pid) # often a SEGV
    exit!
  end
end

def wait_for_unowned_process_to_finish(pid)
  while true
    begin
      Process.kill(0, pid)
    rescue Errno::ESRCH
      break
    end
    # One extra sleep for output to flush.
    sleep 1
  end
end
# }}}

# rpc server {{{
class GreeterServer < Helloworld::Greeter::Service
  def say_hello(hello_req, _unused_call)
    Helloworld::HelloReply.new(message: "Hello #{hello_req.name}")
  end

  def say_hello_stream_reply(hello_req, _unused_call)
    hello_req.name.to_i.times.lazy.map do |i|
      sleep 1
      Helloworld::HelloReply.new(message: "Hello #{hello_req.name} #{i.succ}")
    end
  end
end

CERT_DIR = File.join(GRPC_REPO, "src/ruby/spec/testdata")
CA_PEM = File.read(File.join(CERT_DIR, "ca.pem"))

def create_server_credentials
  GRPC::Core::ServerCredentials.new(
    CA_PEM,
    [{
      private_key: File.read(File.join(CERT_DIR, "server1.key")),
      cert_chain: File.read(File.join(CERT_DIR, "server1.pem")),
    }],
    true
  )
end

def run_server
  server = GRPC::RpcServer.new
  bind = '0.0.0.0:50051'
  server.add_http2_port(bind, create_server_credentials)
  server.handle(GreeterServer)
  log "gRPC server listening on #{bind}"
  server.run_till_terminated_or_interrupted([1, 'int', 'SIGTERM'])
end
# }}}

# rpc client {{{
def ssl_creds
  GRPC::Core::ChannelCredentials.new(
    CA_PEM,
    File.read(File.join(CERT_DIR, './client.key')),
    File.read(File.join(CERT_DIR, './client.pem')),
  )
end

def run_client(requests = 5)
  stub = Helloworld::Greeter::Stub.new(
    'localhost:50051',
    ssl_creds,
    channel_args: { GRPC::Core::Channel::SSL_TARGET => 'foo.test.google.fr' }
  )
  request = Helloworld::HelloRequest.new(name: requests.to_s)
  stub.say_hello_stream_reply(request, deadline: Time.now + requests * 2).each do |reply|
    log "Received: #{reply.message}"
  end
end
# }}}

server_pid = fork do
  title "server"
  run_server
end
log "server pid #{server_pid}"

client_pid = fork do
  title "client"
  ENV['GRPC_TRACE'] = 'tsi'
  # Always create this on main thread https://github.com/grpc/grpc/pull/39208
  GRPC::Core::ChannelCredentials.new
  # First worker: Use GRPC to process requests
  run_client(2)
  # Stop handling requests and create new mold from self.
  pid = double_fork do
    log "double forked"
    health_check
    log "run double-forked client"
    run_client
  end
  wait_for_unowned_process_to_finish(pid)
end

wait("client wrapper", client_pid) # sometimes a SEGV
kill('TERM', server_pid)
wait("server", server_pid)
Output

16:58:24.048 ./repro.rb [244305] grpc_fork_safety before_disable
GRPC /data/rbenv/versions/3.4.5/lib/ruby/gems/3.4.0/gems/grpc-1.76.0-x86_64-linux-gnu/src/ruby/lib/grpc/3.4/grpc_c.so
16:58:24.050 ./repro.rb [244305] grpc_fork_safety after_enable
16:58:24.050 ./repro.rb [244305] server pid 244325
16:58:24.050 ./repro.rb [244305] grpc_fork_safety before_disable
16:58:24.050 ./repro.rb [244325] grpc_fork_safety after_enable
16:58:24.051 ./repro.rb [244305] grpc_fork_safety after_enable
16:58:24.051 ./repro.rb [244305] wait 244332 client wrapper
16:58:24.051 ./repro.rb [244332] grpc_fork_safety after_enable
16:58:24.053 ./repro.rb server [244325] gRPC server listening on 0.0.0.0:50051
16:58:25.072 ./repro.rb client [244332] Received: Hello 2 1
16:58:26.072 ./repro.rb client [244332] Received: Hello 2 2
16:58:26.073 ./repro.rb client [244332] double-forking
16:58:26.073 ./repro.rb client [244332] grpc_fork_safety before_disable
16:58:26.078 ./repro.rb client [244332] grpc_fork_safety after_enable
16:58:26.078 ./repro.rb client [244332] parent forked 244414
16:58:26.078 ./repro.rb client [244414] grpc_fork_safety after_enable
16:58:26.078 ./repro.rb middle [244414] middle child forked
16:58:26.078 ./repro.rb middle [244414] grpc_fork_safety before_disable
16:58:26.704 ./repro.rb [244305] waited for client wrapper: #<Process::Status: pid 244332 SIGSEGV (signal 11) (core dumped)>
16:58:26.704 ./repro.rb [244305] kill TERM => 244325
16:58:26.704 ./repro.rb [244305] wait 244325 server
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1766015906.911231 244354 ssl_transport_security_utils.cc:114] Corruption detected.
E0000 00:00:1766015906.911301 244354 ssl_transport_security_utils.cc:71] error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT
E0000 00:00:1766015906.911308 244354 ssl_transport_security_utils.cc:71] error:1000008b:SSL routines:OPENSSL_internal:DECRYPTION_FAILED_OR_BAD_RECORD_MAC
16:58:26.911 ./repro.rb middle [244414] grpc_fork_safety after_enable
E0000 00:00:1766015906.911316 244354 secure_endpoint.cc:244] Decryption error: TSI_DATA_CORRUPTED
16:58:26.911 ./repro.rb middle [244414] middle child exiting. inner child pid 244467
16:58:26.911 ./repro.rb middle [244414] Broken pipe
16:58:26.911 ./repro.rb middle [244414] middle child exit
16:58:26.911 ./repro.rb middle [244467] grpc_fork_safety after_enable
16:58:26.912 ./repro.rb inner [244467] inner child forked
16:58:26.912 ./repro.rb inner [244467] double forked
16:58:26.912 ./repro.rb inner [244467] grpc_fork_safety before_disable
16:58:26.914 ./repro.rb inner [244467] grpc_fork_safety after_enable
16:58:26.914 ./repro.rb inner [244509] grpc_fork_safety after_enable
16:58:28.920 ./repro.rb [244305] waited for server: #<Process::Status: pid 244325 exit 0>
16:58:28.926 ./repro.rb inner [244467] run double-forked client
16:58:28.929 ./repro.rb inner [244467] ERROR 14:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051: Failed to connect to remote host: Connection refused. debug_error_string:{UNKNOWN:Error received from peer {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051: Failed to connect to remote host: Connection refused"}}