This is a write-up of debugging a zebrad mainnet initial sync that wedged repeatedly at ~42% on otherwise perfectly-spec'd hardware. The summary up front: it's a real upstream bug (ZcashFoundation/zebra#5709), and the fix that actually moves the needle is the ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT knob, not the two knobs whose names suggest they'd help.
If you're hitting the same symptom, scroll to "What actually worked." If you want the diagnostic walk, read on.
zebrad on a clean, well-provisioned host stalls during initial sync. The log pattern is identical across restarts:
- Process starts, connects to ~30 peers, kicks off extending tips with in_flight climbing toward lookahead_limit=1000.
- Within 30-90 seconds, in_flight saturates at 999.
- Sync goes completely silent: no warnings, no retries, no log lines from the syncer task. The estimated progress task continues emitting once per minute, showing time_since_last_state_block climbing.
- After ~8 minutes zebrad's internal verifier timeout fires and the cycle restarts. ~1.6K-8K new blocks land in the burst. Stall again. Repeat.
Steady-state effective rate: ~14K blocks/hr. Burst rate during the active ~30-90s window: 70K-150K blocks/hr.
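The gap between the burst rate and the steady-state rate is just the duty cycle: a short burst followed by the ~8-minute stall. A minimal sketch of that arithmetic, where the function name is ours and the inputs are illustrative values picked from inside the ranges above, not a measured pair:

```python
def effective_blocks_per_hour(blocks_per_burst, burst_sec, stall_sec):
    """Average sync rate over one full burst-then-stall cycle."""
    cycle_sec = burst_sec + stall_sec
    return blocks_per_burst * 3600 / cycle_sec


# Illustrative inputs (assumed, not measured together): a 60 s burst that
# lands 2,000 blocks, followed by the 8-minute (480 s) verifier timeout.
print(round(effective_blocks_per_hour(2_000, 60, 480)))  # 13333
```

With those inputs the burst itself runs at 120K blocks/hr, but the average over the whole cycle lands near the ~14K blocks/hr we observed.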
It looks for all the world like a peer or resource problem. It is neither.
We ruled all of these out empirically. Save yourself the time:
- CPU bound. Container CPU sits at 0.01-0.07% across 24 cores during the stall. We resized c3-standard-8 → c4-standard-16 → c4-standard-24. No change in stall behavior.
- Disk bound. iostat -x 5 shows the disk at <1% utilization with queue depth ~0. We migrated pd-ssd → hyperdisk-balanced (250GB, 15K provisioned IOPS, 600 MB/s throughput). The 15K IOPS sat completely unused. (Note for cost-conscious operators: don't burn money on hyperdisk-extreme for this workload until you've ruled out the actual bug. IOPS is not the bottleneck.)
- Peer count. ss -tn inside the container's netns shows 30 ESTAB connections to peers on :8233. Recv-Q/Send-Q are all zero: peers are alive but no data is flowing. Bandwidth drops to keepalive-only: ~700 B/s in, ~2 KB/s out.
- Stale peer cache. Wiping /var/lib/zebrad-cache/network/mainnet.peers produces another burst, then re-stalls in the same pattern.
- The two knobs whose names suggest they'd help. Lowering both ZEBRA_SYNC__DOWNLOAD_CONCURRENCY_LIMIT (50 → 25) and ZEBRA_NETWORK__PEERSET_INITIAL_TARGET_SIZE (50 → 25) had no effect. in_flight still saturated at 996-999. These knobs do not bound in_flight.
- Zebrad version. We went 4.4.1 → 4.5.1 hoping the 4.5.0 security fix for "peer inventory registry poisoning on sync restart" (GHSA-rj6c-83wx-jxf2) would address it. Same stall pattern on 4.5.1.
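The peer-count check is easy to script if you want to repeat it. A minimal sketch, assuming the usual ss -tn column order (State, Recv-Q, Send-Q, local address, peer address). The function name is ours and the sample text uses documentation-range addresses; it is not captured output:

```python
def summarize_peer_sockets(ss_output, port=8233):
    """Return (established, idle) counts for connections to peers on `port`.

    "idle" means both Recv-Q and Send-Q are zero: the socket is up but
    nothing is queued in either direction.
    """
    established = idle = 0
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue  # header line or a socket in another state
        recv_q, send_q, peer = fields[1], fields[2], fields[4]
        if not peer.endswith(f":{port}"):
            continue
        established += 1
        if recv_q == "0" and send_q == "0":
            idle += 1
    return established, idle


# Made-up sample in the assumed column layout.
SAMPLE = """\
State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB  0       0       172.17.0.2:51234     203.0.113.5:8233
ESTAB  0       0       172.17.0.2:51236     198.51.100.9:8233
ESTAB  0       512     172.17.0.2:51240     192.0.2.77:8233
ESTAB  0       0       172.17.0.2:40000     192.0.2.10:443
"""
print(summarize_peer_sockets(SAMPLE))  # (3, 2)
```

During the stall we would expect both numbers to be equal (every peer connection established and idle), which is what the manual check showed.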
Per issue #5709: zebrad downloads blocks out of height order from
peers, but the checkpoint verifier needs them strictly contiguous. When
one block in the lowest checkpoint range is late or missing, every
already-downloaded block above it parks in the verifier holding a queue
slot. in_flight pins near the configured ceiling. The syncer stops
requesting anything new and goes idle until the verifier's 8-minute
internal timeout fires.
The ceiling on in_flight is checkpoint_verify_concurrency_limit
(default 1000), not the two knobs whose names sound relevant. The
verifier is what holds the slots.
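The mechanism is plain head-of-line blocking, and a toy model reproduces the numbers. The sketch below is our own simplification, not zebrad's code: it assumes the syncer requests heights in order while a slot is free, every block except one "late" height arrives, and the verifier only commits contiguous heights.

```python
def simulate_stall(slot_limit, late_height, total_heights):
    """Toy model of the wedge. Returns (committed, in_flight, parked) at the
    point where neither the syncer nor the verifier can make progress."""
    next_to_commit = 0    # lowest height the verifier still needs
    next_to_request = 0   # next height the syncer would ask for
    arrived = set()       # downloaded, waiting for the verifier
    while True:
        progressed = False
        # Syncer: keep requesting while fewer than slot_limit are outstanding.
        while (next_to_request < total_heights
               and next_to_request - next_to_commit < slot_limit):
            if next_to_request != late_height:
                arrived.add(next_to_request)  # all blocks but the late one arrive
            next_to_request += 1
            progressed = True
        # Verifier: commit only strictly contiguous heights.
        while next_to_commit in arrived:
            arrived.remove(next_to_commit)
            next_to_commit += 1
            progressed = True
        if not progressed:
            in_flight = next_to_request - next_to_commit
            return next_to_commit, in_flight, len(arrived)


print(simulate_stall(1000, late_height=5, total_heights=10_000))  # (5, 1000, 999)
print(simulate_stall(400, late_height=5, total_heights=10_000))   # (5, 400, 399)
```

One late block is enough: every slot but one fills with parked blocks above the gap, and nothing moves until the missing height shows up or a timeout clears the queue.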
Three changes, in order of impact:
ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT=400
This caps the blast radius of each stall to ~1 checkpoint range instead
of ~2.5. After applying, in_flight saturates at 399 instead of 999.
We immediately saw sync rates of 12-18K blocks/minute (720K-1M blocks/hr
instantaneous), with time_since_last_state_block=0s continuously and
CPU jumping to ~90%. The runbook says you can also test 500 and compare.
The kernel's net.ipv4.tcp_slow_start_after_idle=1 default resets the
congestion window after every idle interval. zebrad fetches one block
per peer with idle gaps; every fetch starts cold on long-haul links.
On the host:
echo 'net.ipv4.tcp_slow_start_after_idle=0' > /etc/sysctl.d/99-zebra.conf
sysctl --system
Or inside the container's netns via Docker:
--sysctl net.ipv4.tcp_slow_start_after_idle=0
Gotcha: sysctl --system reapplies all /etc/sysctl.d/*.conf files,
including any system defaults that set net.ipv4.ip_forward=0. Docker
sets ip_forward=1 at daemon start to enable container outbound traffic.
If sysctl --system reverts it, every container loses external
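connectivity and DNS stops resolving. We hit this exact regression
mid-debug. Add net.ipv4.ip_forward=1 to your config file alongside the
slow-start setting: files in /etc/sysctl.d are applied in lexical order,
so a 99- file is applied after the lower-numbered defaults there and its
value is the one that sticks.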
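After any sysctl --system run it is worth confirming that both values actually stuck. A minimal sketch that reads the live values from /proc/sys; the function name and the proc_sys parameter (there only so the check can be pointed at a test directory) are ours:

```python
from pathlib import Path

# The two settings from this section and the values we want live.
EXPECTED = {
    "net.ipv4.tcp_slow_start_after_idle": "0",
    "net.ipv4.ip_forward": "1",
}


def check_sysctls(expected=EXPECTED, proc_sys=Path("/proc/sys")):
    """Return {key: live_value} for every key whose live value differs."""
    mismatches = {}
    for key, want in expected.items():
        # net.ipv4.ip_forward -> /proc/sys/net/ipv4/ip_forward
        path = proc_sys.joinpath(*key.split("."))
        try:
            actual = path.read_text().strip()
        except OSError:
            actual = None  # missing or unreadable counts as a mismatch
        if actual != want:
            mismatches[key] = actual
    return mismatches


if __name__ == "__main__":
    for key, actual in check_sysctls().items():
        print(f"MISMATCH {key}: live value is {actual!r}")
```

Run it on the host after reloading sysctls; no output means both settings are in place. Inside a container the slow-start value is per network namespace, so check it there as well.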
ZEBRA_NETWORK__EXTERNAL_ADDR=<your-ip>:8233
Without this, zebrad advertises [::]:8233, which other nodes drop from
their peer pools. You see fewer responsive peers and more stragglers.
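A typo or an unfilled placeholder in this value fails quietly, so a pre-flight check can help. A minimal sketch using only the standard library; the function name is ours, and it deliberately rejects hostnames because we only ever set a literal IP here:

```python
import ipaddress


def is_advertisable(external_addr):
    """True if "host:port" names a concrete IP (not unspecified, not loopback)."""
    host, sep, port = external_addr.rpartition(":")
    if not sep or not port.isdigit():
        return False
    host = host.strip("[]")  # IPv6 literals are written as [addr]:port
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames, placeholders, and typos are rejected here
    return not (ip.is_unspecified or ip.is_loopback)


print(is_advertisable("[::]:8233"))         # False
print(is_advertisable("203.0.113.7:8233"))  # True
```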
As a safety net we also run a small Python systemd service that polls
zebrad's estimated progress log line every 30s and runs
systemctl restart zebrad-docker whenever time_since_last_state_block
reaches 90s. With the three changes above in place this should fire
rarely; without them it'll keep sync inching forward at a respectable
~30-90K blocks/hr average even while the underlying bug persists.
Source at the end of this gist.
What not to do:
- Bump download_concurrency_limit or peerset_initial_target_size (wrong layer; we tested both directions, no effect).
- Upgrade 4.4.1 → 4.5.1 expecting a sync fix (the sync/verify code is byte-identical between them; upgrade anyway for the security fixes, but not for this bug).
- Buy bigger instances or faster disks. The stall is not resource-bound. We went all the way to c4-standard-24 with 15K IOPS hyperdisk-balanced and saw zero impact until we set checkpoint_verify_concurrency_limit=400.
- Touch max_connections_per_ip unless you have confirmed evidence your peers are sharing IPs.
Setting network.initial_mainnet_peers through an environment variable doesn't work. The figment env-var deserializer in zebrad 4.4.x / 4.5.x rejects both CSV and JSON-array values:
invalid type: string ..., expected a set for key network.initial_mainnet_peers
This is zebra#10658. To pin known-good peers you have to mount a TOML config file into the container. We deferred this; the other changes were enough to get healthy sync.
The single most expensive thing about this debug was that the syncer
task emits nothing during the stall. No WARN, no peer-eviction event,
no "X requests in flight for Y seconds." If in_flight is saturated and
state hasn't advanced for >N seconds, zebrad should log at WARN level
with the peer ID and block hash of the oldest in-flight request. That
would have collapsed the diagnostic time from hours to minutes.
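To make the request concrete, here is the kind of check we have in mind, as a language-neutral sketch in Python. Everything in it is hypothetical: the names, the 60-second threshold, and the "limit minus one" saturation test (chosen because we saw in_flight park at 999 against a limit of 1000). It is not a description of zebrad internals.

```python
from dataclasses import dataclass


@dataclass
class InFlightRequest:
    peer: str          # hypothetical peer identifier
    block_hash: str
    started_at: float  # seconds, same clock as `now`


def check_for_stall(in_flight, limit, last_state_advance, now, threshold_sec=60.0):
    """Return a WARN-level message if the pipeline looks wedged, else None."""
    if not in_flight:
        return None
    saturated = len(in_flight) >= limit - 1
    stalled_for = now - last_state_advance
    if not saturated or stalled_for <= threshold_sec:
        return None
    # The oldest outstanding request is the most likely head-of-line blocker.
    oldest = min(in_flight, key=lambda r: r.started_at)
    return (
        f"sync stalled: in_flight={len(in_flight)}/{limit}, "
        f"no state advance for {stalled_for:.0f}s; oldest request "
        f"{oldest.block_hash} from peer {oldest.peer} "
        f"in flight for {now - oldest.started_at:.0f}s"
    )
```

A single line like that, emitted once per stall, names the block and peer to look at instead of leaving the operator to infer the wedge from silence.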
A documented note that lookahead_limit (gated by
checkpoint_verify_concurrency_limit) is the actual ceiling on
in_flight — not download_concurrency_limit — would also have saved
considerable time.
/etc/bedrock/zebra.env:
ZEBRA_DOCKER_IMAGE=zfnd/zebra:4.5.1
ZEBRA_SYNC__FULL_VERIFY_CONCURRENCY_LIMIT=40
ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT=400
ZEBRA_NETWORK__EXTERNAL_ADDR=<public-ip>:8233
/etc/sysctl.d/99-zebra.conf:
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.ip_forward=1
systemd unit excerpt (the env-var pass-through is required because
docker run only inherits vars you explicitly -e):
ExecStart=/usr/bin/docker run --name zebra --rm \
-p 8233:8233 \
-p 8232:8232 \
-v /var/lib/zebrad-cache:/home/zebra/.cache/zebra \
--dns=8.8.8.8 --dns=1.1.1.1 \
-e ZEBRA_NETWORK__NETWORK=Mainnet \
-e ZEBRA_RPC__LISTEN_ADDR=0.0.0.0:8232 \
-e ZEBRA_RPC__ENABLE_COOKIE_AUTH=false \
-e ZEBRA_SYNC__FULL_VERIFY_CONCURRENCY_LIMIT \
-e ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT \
-e ZEBRA_NETWORK__EXTERNAL_ADDR \
${ZEBRA_DOCKER_IMAGE}
#!/usr/bin/env python3
"""Zebrad sync stall watchdog. Polls progress logs; restarts on stall."""
import logging
import re
import signal
import subprocess
import sys
import time

STALL_THRESHOLD_SEC = 90   # restart once state has been idle this long
POLL_INTERVAL_SEC = 30
SETTLE_SEC = 75            # grace period after a restart before polling again
SERVICE = "zebrad-docker"
# Accepts values such as "8m 12s", "8m" or "45s".
DURATION_RE = re.compile(r"time_since_last_state_block=(?:(\d+)m)?(?:\s*(\d+)s)?")

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")


def parse_duration_sec(line):
    m = DURATION_RE.search(line)
    # Both groups are optional, so treat "neither matched" as unparseable
    # rather than silently reporting a stall age of 0s.
    if not m or (m.group(1) is None and m.group(2) is None):
        return None
    return int(m.group(1) or 0) * 60 + int(m.group(2) or 0)


def latest_progress_line():
    try:
        out = subprocess.check_output(
            ["docker", "logs", "zebra", "--since", "5m", "--tail", "20"],
            stderr=subprocess.STDOUT, timeout=15
        ).decode(errors="replace")
    except subprocess.SubprocessError as e:
        logging.warning("docker logs failed: %s", e)
        return None
    # Scan from the newest line backwards.
    for line in reversed(out.splitlines()):
        if "estimated progress" in line:
            return line
    return None


def restart_zebrad():
    logging.warning("Restarting %s", SERVICE)
    subprocess.run(["systemctl", "restart", SERVICE], check=False, timeout=180)


def handle_term(*_):
    sys.exit(0)


def main():
    signal.signal(signal.SIGTERM, handle_term)
    signal.signal(signal.SIGINT, handle_term)
    logging.info("watchdog started threshold=%ds", STALL_THRESHOLD_SEC)
    while True:
        line = latest_progress_line()
        if line is None:
            time.sleep(POLL_INTERVAL_SEC)
            continue
        secs = parse_duration_sec(line)
        if secs is None:
            time.sleep(POLL_INTERVAL_SEC)
            continue
        if secs >= STALL_THRESHOLD_SEC:
            logging.warning("stall age=%ds, restarting", secs)
            restart_zebrad()
            time.sleep(SETTLE_SEC)
            continue
        logging.info("ok: stall age=%ds", secs)
        time.sleep(POLL_INTERVAL_SEC)


if __name__ == "__main__":
    main()

systemd unit:
[Unit]
Description=Zebrad sync stall watchdog
After=zebrad-docker.service
Wants=zebrad-docker.service
[Service]
Type=simple
ExecStart=/usr/bin/python3 /usr/local/bin/zebra-watchdog.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

The diagnostic call-out to issue #5709 and the
checkpoint_verify_concurrency_limit=400 recommendation came from a
second agent who'd already worked the problem. Without that pointer this
would have taken many more hours and possibly a fresh sync from genesis
to get past.