@zmanian
Created May 30, 2026 04:43
Getting zebrad off a wedged initial sync — symptoms, mis-diagnoses, and the checkpoint_verify_concurrency_limit fix (May 2026)

Getting zebrad off a wedged initial sync — a 2026-05-29 field report

This is a write-up of debugging a zebrad mainnet initial sync that wedged repeatedly at ~42% on hardware that was more than adequate for the job. The summary up front: it's a real upstream bug (ZcashFoundation/zebra#5709), and the fix that actually moves the needle is the ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT knob, not the two knobs whose names suggest they'd help.

If you're hitting the same symptom, scroll to "What actually worked." If you want the diagnostic walk, read on.

Symptom

zebrad on a clean, well-provisioned host stalls during initial sync. The log pattern is identical across restarts:

  1. Process starts, connects to ~30 peers, kicks off extending tips with in_flight climbing toward lookahead_limit=1000.
  2. Within 30-90 seconds, in_flight saturates at 999.
  3. Sync goes completely silent — no warnings, no retries, no log lines from the syncer task. The estimated progress task continues emitting once per minute showing time_since_last_state_block climbing.
  4. After ~8 minutes zebrad's internal verifier timeout fires and the cycle restarts. ~1.6K-8K new blocks land in the burst. Stall again. Repeat.

Steady-state effective rate: ~14K blocks/hr. Burst rate during the active ~30-90s window: 70K-150K blocks/hr.
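
The steady-state figure is consistent with the cycle shape described in the list above. A back-of-the-envelope sketch in Python, assuming each cycle is one burst window followed by the ~8-minute stall (the function name and this two-phase model are illustrative, not something zebrad reports):

```python
def effective_blocks_per_hour(blocks_per_cycle, burst_sec, stall_sec=8 * 60):
    """Average rate over one burst-then-stall cycle (illustrative model only)."""
    cycle_sec = burst_sec + stall_sec
    return blocks_per_cycle * 3600 / cycle_sec

# Bounds taken from the symptom list: 1.6K-8K blocks per 30-90 s burst.
low = effective_blocks_per_hour(1_600, burst_sec=30)
high = effective_blocks_per_hour(8_000, burst_sec=90)
print(f"{low:,.0f} to {high:,.0f} blocks/hr")
```

Under those assumptions the model gives roughly 11K to 51K blocks/hr, so the observed ~14K blocks/hr corresponds to cycles near the small end of the burst range.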

It looks for all the world like a peer or resource problem. It is neither.

What it is NOT

We ruled all of these out empirically. Save yourself the time:

  • CPU bound. Container CPU sits at 0.01-0.07% across 24 cores during the stall. We resized c3-standard-8 → c4-standard-16 → c4-standard-24. No change in stall behavior.
  • Disk bound. iostat -x 5 shows the disk at <1% utilization with queue depth ~0. We migrated pd-ssd → hyperdisk-balanced (250GB, 15K provisioned IOPS, 600 MB/s throughput). The 15K IOPS sat completely unused. (Note for cost-conscious operators: don't burn money on hyperdisk-extreme for this workload until you've ruled out the actual bug — IOPS is not the bottleneck.)
  • Peer count. ss -tn inside the container's netns shows 30 ESTAB connections to peers on :8233. Recv-Q/Send-Q are all zero — peers are alive but no data is flowing. Bandwidth drops to keepalive-only: ~700 B/s in, ~2 KB/s out.
  • Stale peer cache. Wiping /var/lib/zebrad-cache/network/mainnet.peers produces another burst, then re-stalls in the same pattern.
  • The two knobs whose names suggest they'd help. Lowering both ZEBRA_SYNC__DOWNLOAD_CONCURRENCY_LIMIT (50 → 25) and ZEBRA_NETWORK__PEERSET_INITIAL_TARGET_SIZE (50 → 25) had no effect. in_flight still saturated at 996-999. These knobs do not bound in_flight.
  • Zebrad version. We went 4.4.1 → 4.5.1 hoping the 4.5.0 security fix for "peer inventory registry poisoning on sync restart" (GHSA-rj6c-83wx-jxf2) would address it. Same stall pattern on 4.5.1.
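
The peer-count check above can be scripted. A minimal sketch, assuming the usual `ss -tn` column order (State, Recv-Q, Send-Q, local address, peer address); the function name is hypothetical and the sample addresses are placeholders:

```python
def summarize_peer_sockets(ss_output, port=8233):
    """Return (established, busy): peer connections on `port`, and how many
    of them have bytes sitting in Recv-Q or Send-Q."""
    established = busy = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        if not fields[4].endswith(f":{port}"):
            continue
        established += 1
        if int(fields[1]) or int(fields[2]):
            busy += 1
    return established, busy

SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB  0      0      172.17.0.2:41234     198.51.100.7:8233
ESTAB  0      0      172.17.0.2:41236     198.51.100.9:8233
ESTAB  0      0      172.17.0.2:50022     198.51.100.3:443
"""
print(summarize_peer_sockets(SAMPLE))
```

Feed it the text of `ss -tn` captured inside the container's netns. Many established connections with zero busy ones, while sync is stalled, matches the "alive but idle" picture described above.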

What's actually happening

Per issue #5709: zebrad downloads blocks out of height order from peers, but the checkpoint verifier needs them strictly contiguous. When one block in the lowest checkpoint range is late or missing, every already-downloaded block above it parks in the verifier holding a queue slot. in_flight pins near the configured ceiling. The syncer stops requesting anything new and goes idle until the verifier's 8-minute internal timeout fires.

The ceiling on in_flight is checkpoint_verify_concurrency_limit (default 1000), not the two knobs whose names sound relevant. The verifier is what holds the slots.
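
To make the head-of-line blocking concrete, here is a toy model in Python. It is a sketch under strong assumptions, not zebrad's actual scheduling: the syncer keeps up to `limit` heights outstanding, every block arrives instantly except one that never does, and the verifier commits strictly in height order. All names are hypothetical.

```python
def simulate_stall(limit, missing_height):
    """Return (committed_height, parked_blocks) once the toy sync wedges."""
    committed = 0  # highest contiguous height the verifier has accepted
    while True:
        # Syncer: fill every free slot with the next heights.
        next_request = committed + limit + 1
        # Verifier: commit contiguously until it reaches the block that
        # never arrived.
        new_committed = committed
        while (new_committed + 1 < next_request
               and new_committed + 1 != missing_height):
            new_committed += 1
        if new_committed == committed:
            break  # no progress and no free slots: wedged
        committed = new_committed
    # Every outstanding slot except the missing block holds a parked block.
    parked = (next_request - 1 - committed) - 1
    return committed, parked

print(simulate_stall(limit=1000, missing_height=2500))
print(simulate_stall(limit=400, missing_height=2500))
```

In this model the verifier stops at height 2499 in both cases, with 999 parked blocks at a limit of 1000 and 399 at a limit of 400. That mirrors the saturation values observed in the logs, though the real accounting inside zebrad may differ.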

What actually worked

Three changes, in order of impact, plus a safety net:

1. Lower checkpoint_verify_concurrency_limit to its minimum (400)

ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT=400

This caps the blast radius of each stall to ~1 checkpoint range instead of ~2.5. After applying, in_flight saturates at 399 instead of 999. We immediately saw sync rates of 12-18K blocks/minute (720K-1M blocks/hr instantaneous), with time_since_last_state_block=0s continuously and CPU jumping to ~90%. The runbook says you can also test 500 and compare.
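
The arithmetic behind those numbers, as a short Python sketch. It assumes a checkpoint range of 400 blocks, which is inferred from the "~1 range at 400, ~2.5 at 1000" figures above rather than read out of zebrad's source; the helper names are illustrative.

```python
CHECKPOINT_RANGE = 400  # assumption: inferred from the figures in this write-up

def ranges_held(limit, range_size=CHECKPOINT_RANGE):
    """How many checkpoint ranges' worth of blocks one stall can park."""
    return limit / range_size

def per_hour(blocks_per_minute):
    """Convert an instantaneous per-minute rate to a per-hour rate."""
    return blocks_per_minute * 60

print(ranges_held(1000), ranges_held(400))
print(per_hour(12_000), per_hour(18_000))
```

This reproduces the 2.5 versus 1 comparison, and puts the 12-18K blocks/minute rate at 720K to 1.08M blocks/hr.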

2. Disable TCP slow-start-after-idle

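With the kernel default net.ipv4.tcp_slow_start_after_idle=1, a connection that sits idle for longer than its retransmission timeout has its congestion window reset, so the next transfer goes through slow start again. zebrad fetches one block per peer with idle gaps between requests, so on long-haul links nearly every fetch starts cold.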

On the host:

echo 'net.ipv4.tcp_slow_start_after_idle=0' > /etc/sysctl.d/99-zebra.conf
sysctl --system

Or inside the container's netns via Docker:

--sysctl net.ipv4.tcp_slow_start_after_idle=0

Gotcha: sysctl --system reapplies all /etc/sysctl.d/*.conf files, including any system defaults that set net.ipv4.ip_forward=0. Docker sets ip_forward=1 at daemon start to enable container outbound traffic. If sysctl --system reverts it, every container loses external connectivity and DNS stops resolving. We hit this exact regression mid-debug. Add net.ipv4.ip_forward=1 to your config file alongside the slow-start setting, so that the late-sorting 99- file overrides any earlier file in /etc/sysctl.d that turns forwarding off.
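
A quick way to confirm both settings survived is to read them back from /proc/sys after running sysctl --system. A minimal sketch in Python; the function name is hypothetical, and the `proc_sys` parameter exists only so the check can be pointed at a test directory:

```python
from pathlib import Path

EXPECTED = {
    "net.ipv4.tcp_slow_start_after_idle": "0",
    "net.ipv4.ip_forward": "1",
}

def check_sysctls(expected=EXPECTED, proc_sys=Path("/proc/sys")):
    """Return {key: live_value} for every key that differs from `expected`.
    A value of None means the file could not be read."""
    wrong = {}
    for key, want in expected.items():
        # net.ipv4.ip_forward lives at /proc/sys/net/ipv4/ip_forward
        path = proc_sys / key.replace(".", "/")
        try:
            actual = path.read_text().strip()
        except OSError:
            actual = None
        if actual != want:
            wrong[key] = actual
    return wrong

if __name__ == "__main__":
    print(check_sysctls() or "all sysctls as expected")
```

Run it in the namespace you care about: on the host for the sysctl.d route, or inside the container if you used --sysctl.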

3. Set external_addr to your actual public IP

ZEBRA_NETWORK__EXTERNAL_ADDR=<your-ip>:8233

Without this, zebrad advertises [::]:8233, which other nodes drop from their peer pools. You see fewer responsive peers and more stragglers.
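
A small pre-flight check can catch an unspecified address or an unfilled placeholder before the container starts. This is a sketch using Python's standard ipaddress module; the function name is hypothetical, and it only accepts literal IPs (whether zebrad also accepts hostnames here is not something this write-up verified):

```python
import ipaddress

def is_advertisable(addr):
    """True if `addr` is ip:port with a usable, specific IP address."""
    host, _, port = addr.rpartition(":")
    host = host.strip("[]")  # allow bracketed IPv6 such as [::]:8233
    try:
        ip = ipaddress.ip_address(host)
        port_num = int(port)
    except ValueError:
        return False  # not a literal IP, or the port is not a number
    if not 0 < port_num < 65536:
        return False
    return not (ip.is_unspecified or ip.is_loopback)

for candidate in ("[::]:8233", "0.0.0.0:8233", "<your-ip>:8233", "8.8.8.8:8233"):
    print(candidate, is_advertisable(candidate))
```

Only the last candidate passes; the first two are unspecified addresses and the third is the placeholder left unfilled.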

4. (Safety net) Auto-bounce watchdog

A small Python systemd service that polls zebrad's estimated progress log line every 30s and runs systemctl restart zebrad-docker whenever time_since_last_state_block exceeds 90s. With items 1-3 in place this should fire rarely; without them it'll keep sync inching forward at a respectable ~30-90K blocks/hr average even while the underlying bug persists. Source at the end of this gist.

Things you should NOT also do

  • Bump download_concurrency_limit or peerset_initial_target_size (wrong layer; we tested both directions, no effect).
  • Upgrade 4.4.1 → 4.5.1 expecting a sync fix (the sync/verify code is byte-identical between them — upgrade anyway for the security fixes, but not for this bug).
  • Buy bigger instances or faster disks. The stall is not resource-bound. We went all the way to c4-standard-24 with 15K IOPS hyperdisk-balanced and saw zero impact until we set checkpoint_verify_concurrency_limit=400.
  • Touch max_connections_per_ip unless you have confirmed evidence your peers are sharing IPs.

Setting ZEBRA_NETWORK__INITIAL_MAINNET_PEERS via env var

It doesn't work. The figment env-var deserializer in zebrad 4.4.x / 4.5.x rejects both CSV and JSON-array values:

invalid type: string ..., expected a set for key network.initial_mainnet_peers

This is zebra#10658. To pin known-good peers you have to mount a TOML config file into the container. We deferred this; the other changes were enough to get healthy sync.
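
For anyone who does go the TOML route, generating the fragment is straightforward. A sketch in Python, assuming the key path network.initial_mainnet_peers from the error message above; the peer addresses are placeholders and the function name is hypothetical. Where to mount the file inside the container is not covered here.

```python
import json

def render_peers_toml(peers):
    """Render a [network] fragment pinning the given peers.
    json.dumps is used for quoting, which yields valid TOML basic strings
    for plain host:port values."""
    items = ", ".join(json.dumps(p) for p in sorted(set(peers)))
    return f"[network]\ninitial_mainnet_peers = [{items}]\n"

print(render_peers_toml(["peer-b.example:8233", "peer-a.example:8233"]))
```

Duplicates are dropped and the list is sorted so the output is stable across runs.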

What we wish zebrad would log

The single most expensive thing about this debug was that the syncer task emits nothing during the stall. No WARN, no peer-eviction event, no "X requests in flight for Y seconds." If in_flight is saturated and state hasn't advanced for >N seconds, zebrad should log at WARN level with the peer ID and block hash of the oldest in-flight request. That would have collapsed the diagnostic time from hours to minutes.
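
The condition being asked for is simple enough to state in a few lines. The following Python sketch only illustrates the proposal; it is not zebrad code, the threshold default stands in for the unspecified N, and every name in it is hypothetical:

```python
def stall_warning(in_flight, limit, secs_since_progress,
                  threshold_sec=60, oldest_request=None):
    """Return a WARN message if the queue is saturated and state is stuck,
    otherwise None. `oldest_request` is a hypothetical (peer_id, block_hash)."""
    # Treat limit - 1 as saturated, since the logs showed 999 of 1000.
    saturated = in_flight >= limit - 1
    if not (saturated and secs_since_progress > threshold_sec):
        return None
    peer, block_hash = oldest_request or ("unknown", "unknown")
    return (f"WARN sync stalled: in_flight={in_flight}/{limit} "
            f"no state progress for {secs_since_progress}s "
            f"oldest_request peer={peer} hash={block_hash}")

print(stall_warning(999, 1000, 125, oldest_request=("peer-1", "<hash>")))
```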

A documented note that lookahead_limit (gated by checkpoint_verify_concurrency_limit) is the actual ceiling on in_flight — not download_concurrency_limit — would also have saved considerable time.

Final config (us-central1-a, c4-standard-24)

/etc/bedrock/zebra.env:

ZEBRA_DOCKER_IMAGE=zfnd/zebra:4.5.1
ZEBRA_SYNC__FULL_VERIFY_CONCURRENCY_LIMIT=40
ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT=400
ZEBRA_NETWORK__EXTERNAL_ADDR=<public-ip>:8233

/etc/sysctl.d/99-zebra.conf:

net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.ip_forward=1

systemd unit excerpt (the env-var pass-through is required because docker run only forwards host variables that you name explicitly with -e):

ExecStart=/usr/bin/docker run --name zebra --rm \
  -p 8233:8233 \
  -p 8232:8232 \
  -v /var/lib/zebrad-cache:/home/zebra/.cache/zebra \
  --dns=8.8.8.8 --dns=1.1.1.1 \
  -e ZEBRA_NETWORK__NETWORK=Mainnet \
  -e ZEBRA_RPC__LISTEN_ADDR=0.0.0.0:8232 \
  -e ZEBRA_RPC__ENABLE_COOKIE_AUTH=false \
  -e ZEBRA_SYNC__FULL_VERIFY_CONCURRENCY_LIMIT \
  -e ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT \
  -e ZEBRA_NETWORK__EXTERNAL_ADDR \
  ${ZEBRA_DOCKER_IMAGE}
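
A bare -e NAME forwards nothing if NAME is unset in the service's environment, so it is worth cross-checking the unit against the env file. A sketch in Python, assuming the unit loads /etc/bedrock/zebra.env (for example via systemd's EnvironmentFile=); the function name is hypothetical:

```python
import shlex

def missing_passthrough_vars(exec_start, env_file_text):
    """Return names passed as bare `-e NAME` that the env file never defines."""
    defined = set()
    for line in env_file_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            defined.add(line.split("=", 1)[0].strip())
    # Join backslash-continued lines before tokenizing.
    tokens = shlex.split(exec_start.replace("\\\n", " "))
    missing = []
    for flag, value in zip(tokens, tokens[1:]):
        # `-e NAME=value` carries its own value; only bare names need the file.
        if flag == "-e" and "=" not in value and value not in defined:
            missing.append(value)
    return missing

EXEC = """/usr/bin/docker run --name zebra --rm \\
  -e ZEBRA_NETWORK__NETWORK=Mainnet \\
  -e ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT \\
  -e ZEBRA_NETWORK__EXTERNAL_ADDR \\
  ${ZEBRA_DOCKER_IMAGE}"""
ENV = "ZEBRA_SYNC__CHECKPOINT_VERIFY_CONCURRENCY_LIMIT=400\n"
print(missing_passthrough_vars(EXEC, ENV))
```

With the sample inputs it flags ZEBRA_NETWORK__EXTERNAL_ADDR, which is passed through but never set in the sample env text.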

Watchdog script

#!/usr/bin/env python3
"""Zebrad sync stall watchdog. Polls progress logs; restarts on stall."""
import logging, re, signal, subprocess, sys, time

STALL_THRESHOLD_SEC = 90
POLL_INTERVAL_SEC = 30
SETTLE_SEC = 75
SERVICE = "zebrad-docker"
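# Assumes humantime-style durations such as "45s", "8m 12s", or "1h 2m 3s".
DURATION_RE = re.compile(
    r"time_since_last_state_block=(?:(\d+)h)?\s*(?:(\d+)m)?\s*(?:(\d+)s)?"
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def parse_duration_sec(line):
    """Return the stall age in seconds, or None if no duration was parsed."""
    m = DURATION_RE.search(line)
    # Every unit is optional, so a match that captured nothing is not "0s".
    if not m or not any(m.groups()):
        return None
    hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds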

def latest_progress_line():
    try:
        out = subprocess.check_output(
            ["docker", "logs", "zebra", "--since", "5m", "--tail", "20"],
            stderr=subprocess.STDOUT, timeout=15
        ).decode(errors="replace")
    except subprocess.SubprocessError as e:
        logging.warning("docker logs failed: %s", e)
        return None
    for line in reversed(out.splitlines()):
        if "estimated progress" in line:
            return line
    return None

def restart_zebrad():
    logging.warning("Restarting %s", SERVICE)
    subprocess.run(["systemctl", "restart", SERVICE], check=False, timeout=180)

def handle_term(*_): sys.exit(0)

def main():
    signal.signal(signal.SIGTERM, handle_term)
    signal.signal(signal.SIGINT, handle_term)
    logging.info("watchdog started threshold=%ds", STALL_THRESHOLD_SEC)
    while True:
        line = latest_progress_line()
        if line is None:
            time.sleep(POLL_INTERVAL_SEC); continue
        secs = parse_duration_sec(line)
        if secs is None:
            time.sleep(POLL_INTERVAL_SEC); continue
        if secs >= STALL_THRESHOLD_SEC:
            logging.warning("stall age=%ds, restarting", secs)
            restart_zebrad()
            time.sleep(SETTLE_SEC); continue
        logging.info("ok: stall age=%ds", secs)
        time.sleep(POLL_INTERVAL_SEC)

if __name__ == "__main__":
    main()

systemd unit:

[Unit]
Description=Zebrad sync stall watchdog
After=zebrad-docker.service
Wants=zebrad-docker.service

[Service]
Type=simple
ExecStart=/usr/bin/python3 /usr/local/bin/zebra-watchdog.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Acknowledgments

The diagnostic call-out to issue #5709 and the checkpoint_verify_concurrency_limit=400 recommendation came from a second agent who'd already worked the problem. Without that pointer this would have taken many more hours and possibly a fresh-sync from genesis to get past.
