@efstathiosntonas
Created December 6, 2025 09:29
Go Network performance from blog: https://goperf.dev/02-networking/

Go-Networking - Benchmarking

Pages: 7


Comparing TCP, HTTP/2, and gRPC Performance in Go - Go Optimization Guide

URL: https://goperf.dev/02-networking/tcp-http2-grpc/

Contents:

  • Comparing TCP, HTTP/2, and gRPC Performance in Go
  • Raw TCP with Custom Framing
    • Custom Framing Protocol
      • Protocol Structure
      • Disadvantages
    • Performance Insights
  • HTTP/2 via net/http
    • Server Implementation
    • Performance Insights
  • gRPC

In distributed systems, the choice of communication protocol shapes how services interact under real-world load. It influences not just raw throughput and latency, but also how well the system scales, how much CPU and memory it consumes, and how predictable its behavior remains under pressure. In this article, we dissect three prominent options—raw TCP with custom framing, HTTP/2 via Go's built-in net/http package, and gRPC—and explore their performance characteristics through detailed benchmarks and practical scenarios.

Raw TCP provides maximum flexibility with virtually no protocol overhead, but that comes at a cost: all message boundaries, framing logic, and error handling must be implemented manually. Since TCP delivers a continuous byte stream with no inherent notion of messages, applications must explicitly define how to separate and interpret those bytes.

A common way to handle message boundaries over TCP is to use length-prefix framing: each message starts with a 4-byte header that tells the receiver how many bytes to read next. The length is encoded in big-endian format, following the standard network byte order, so it behaves consistently across different systems. This setup solves a core issue with TCP—while it guarantees reliable delivery, it doesn’t preserve message boundaries. Without knowing the size upfront, the receiver has no way to tell where one message ends and the next begins.

TCP guarantees reliable, in-order delivery of bytes, but it does not preserve or indicate message boundaries. For example, if a client sends three logical messages (Example 1 below):

the server may receive them as a continuous byte stream with arbitrary segmentation, as in Example 2 below:

TCP delivers a continuous stream of bytes with no built-in concept of where one message stops and another starts. This means the receiver can’t rely on read boundaries to infer message boundaries—what arrives might be a partial message, multiple messages concatenated, or an arbitrary slice of both. To make sense of structured data over such a stream, the application needs a framing strategy. Length-prefixing does this by including the size of the message up front, so the receiver knows exactly how many bytes to expect before starting to parse the payload.

While length-prefixing is the most common and efficient framing strategy, other options exist, each with its own trade-offs in simplicity, robustness, and flexibility. Delimiter-based framing, for example, ends each message with a sentinel byte such as a newline.

[Content truncated]

Examples:

Example 1 (text):

[msg1][msg2][msg3]

Example 2 (text):

[msg1_part][msg2][msg3_part]

Example 3 (text):

| Length (4 bytes) | Payload (Length bytes) |

Example 4 (go):

func writeFrame(conn net.Conn, payload []byte) error {
    frameLen := uint32(len(payload))
    buf := make([]byte, 4+len(payload))
    binary.BigEndian.PutUint32(buf[:4], frameLen)
    copy(buf[4:], payload)
    _, err := conn.Write(buf)
    return err
}

func readFrame(conn net.Conn) ([]byte, error) {
    lenBuf := make([]byte, 4)
    if _, err := io.ReadFull(conn, lenBuf); err != nil {
        return nil, err
    }
    frameLen := binary.BigEndian.Uint32(lenBuf)
    // Note: production code should cap frameLen before allocating.
    payload := make([]byte, frameLen)
    if _, err := io.ReadFull(conn, payload); err != nil {
        return nil, err
    }
    return payload, nil
}

QUIC – Building Low-Latency Services with quic-go - Go Optimization Guide

URL: https://goperf.dev/02-networking/quic-in-go/

Contents:

  • QUIC in Go: Building Low-Latency Services with quic-go
  • Understanding QUIC
    • QUIC vs. TCP: Key Differences
  • Is QUIC Based on DTLS?
  • Introducing quic-go
    • Getting Started with quic-go
    • Basic QUIC Server
  • Multiplexed Streams
  • Performance: QUIC vs. HTTP/2 and TCP
  • Connection Migration

QUIC has emerged as a robust protocol, solving many inherent limitations of traditional TCP connections. QUIC combines encryption, multiplexing, and connection migration into a unified protocol, designed to optimize web performance, particularly in real-time and mobile-first applications. In Go, quic-go is the main QUIC implementation and serves as a practical base for building efficient, low-latency network services with built-in encryption and stream multiplexing.

Originally developed at Google and later standardized by the IETF, QUIC rethinks the transport layer to overcome longstanding TCP limitations.

QUIC takes a fundamentally different approach from TCP. While TCP is built directly on IP and requires a connection-oriented handshake before data can flow, QUIC runs over UDP and handles its own connection logic, reducing setup overhead and improving startup latency. This architectural choice allows QUIC to provide multiplexed, independent streams that effectively eliminate the head-of-line blocking issue commonly experienced with TCP, where the delay or loss of one packet stalls subsequent packets.

QUIC integrates TLS 1.3 directly into its transport layer, eliminating the layered negotiation seen in TCP+TLS. This design streamlines the handshake process and enables 0-RTT data, where repeat connections can begin transmitting encrypted payloads immediately—something TCP simply doesn’t support.
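As a rough sketch of what a 0-RTT dial looks like with quic-go (API names vary across quic-go releases; DialAddrEarly and the surrounding setup here are assumptions, not code from the article):

// Hypothetical 0-RTT client: with a resumed TLS session, data written on
// an early stream can go out before the handshake fully completes.
conn, err := quic.DialAddrEarly(ctx, "server.example:4242", tlsConf, nil)
if err != nil {
    log.Fatal(err)
}
stream, err := conn.OpenStreamSync(ctx)
if err != nil {
    log.Fatal(err)
}
_, err = stream.Write([]byte("early data"))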

Another key distinction is how connections are identified. TCP connections are bound to a specific IP and port, so any change in network interface results in a broken connection. QUIC avoids this by using connection IDs that remain stable across address changes, allowing sessions to continue uninterrupted when a device moves between networks—critical for mobile and latency-sensitive use cases.

Although QUIC and DTLS both use TLS cryptographic primitives over UDP, QUIC does not build on DTLS. Instead, QUIC incorporates TLS 1.3 directly into its transport layer, inheriting only the cryptographic handshake—not the record framing or protocol structure of DTLS.

QUIC defines its own packet encoding, multiplexing, retransmission, and encryption formats. It wraps TLS handshake messages within QUIC packets and tightly couples encryption state with transport features like packet numbers and stream IDs. In contrast, DTLS operates as a secured datagram layer atop UDP, providing encryption and authentication but leaving transport semantics—such as retransmit, ordering, or

[Content truncated]

Examples:

Example 1 (bash):

go get github.com/quic-go/quic-go

Example 2 (go):

listener, err := quic.ListenAddr("localhost:4242", generateTLSConfig(), nil)
if err != nil {
    log.Fatal(err)
}
fmt.Println("QUIC server listening on localhost:4242")

for {
    conn, err := listener.Accept(context.Background())
    if err != nil {
        log.Println("Accept error:", err)
        continue
    }
    go handleConn(conn)
}

Example 3 (go):

defer conn.CloseWithError(0, "bye")

for {
    stream, err := conn.AcceptStream(context.Background())
    if err != nil {
        return
    }

    go func(s quic.Stream) {
        defer s.Close()

        data, err := io.ReadAll(s)
        if len(data) > 0 {
            log.Printf("Received: %s", string(data))
        }
        if err != nil && err != io.EOF {
            if appErr, ok := err.(*quic.ApplicationError); !ok || appErr.ErrorCode != 0 {
                log.Println("read error:", err)
            }
        }
    }(stream)
}

Example 4 (go):

stream, err := session.OpenStreamSync(context.Background())
if err != nil {
    log.Fatal(err)
}
_, err = stream.Write([]byte("Hello QUIC!"))

Benchmarking and Load Testing for Networked Go Apps - Go Optimization Guide

URL: https://goperf.dev/02-networking/bench-and-load/

Contents:

  • Benchmarking and Load Testing for Networked Go Apps
  • Test App: Simulating Fast/Slow Paths and GC pressure
  • Simulating Load: Tools That Reflect Reality
    • When to Use What
    • Vegeta
    • wrk
    • k6
  • Profiling Networked Go Applications with pprof
    • CPU Profiling
      • What to Look For

Before you reach for a mutex-free queue or tune your goroutine pool, step back. Optimization without a baseline is just guesswork. In Go applications, performance tuning starts with understanding how your system behaves under pressure, which means benchmarking it under load.

Load testing isn't just about pushing requests until things break. It's about simulating realistic usage patterns to extract measurable, repeatable data. That data anchors every optimization that follows.

To benchmark meaningfully, we need endpoints that reflect different workload characteristics.

This is by no means an exhaustive list. The ecosystem of load-testing tools is broad and constantly evolving. Tools like Apache JMeter, Locust, Artillery, and Gatling each bring their own strengths—ranging from UI-driven test design to distributed execution or JVM-based scenarios. The right choice depends on your stack, test goals, and team workflow. The tools listed here are optimized for Go-based services and local-first benchmarking, but they’re just a starting point.

At a glance, vegeta, wrk, and k6 all hammer HTTP endpoints. But they serve different roles depending on what you're testing, how much precision you need, and how complex your scenario is.

Each of these tools has a place in your benchmarking toolkit. Picking the right one depends on whether you're validating performance, exploring scaling thresholds, or simulating end-user behavior.

Vegeta is a flexible HTTP load testing tool written in Go, built for generating constant request rates. This makes it well-suited for simulating steady, sustained traffic patterns instead of sudden spikes.

We reach for Vegeta when precision matters. It maintains exact request rates and captures detailed latency distributions, which helps track how system behavior changes under load. It’s lightweight, easy to automate, and integrates cleanly into CI workflows—making it a reliable option for benchmarking Go services.
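A minimal sketch of a constant-rate run (the endpoint and rate are assumptions; vegeta reads targets from stdin):

echo "GET http://localhost:8080/fast" | vegeta attack -rate=100 -duration=30s | vegeta report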

Next, we pick which endpoints to test: /fast and /slow.

Depending on your goals, there are two recommended approaches for testing both /fast and /slow endpoints in a single run.

Option 1: Round-Robin Between Endpoints

Create a targets.txt with both endpoints:
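For example (the /fast and /slow endpoints come from the test app on localhost:8080):

GET http://localhost:8080/fast
GET http://localhost:8080/slow

Then point vegeta at the file; a hedged sketch of the invocation:

vegeta attack -targets=targets.txt -rate=100 -duration=30s | vegeta report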

Option 2: Weighted Mix Using Multiple Vegeta Runs

To simulate different traffic proportions (e.g., 80% fast, 20% slow):
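A hedged sketch of two concurrent runs at an 80/20 split (file names, rates, and durations are assumptions):

vegeta attack -targets=fast.txt -rate=80 -duration=30s > fast.bin &
vegeta attack -targets=slow.txt -rate=20 -duration=30s > slow.bin &
wait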

Then merge the results and generate a report:
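Assuming the output files from the runs above, vegeta report accepts multiple result files:

vegeta report fast.bin slow.bin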

Both methods are valid—choose based on whether you need simplicity or control.

wrk is a high-performance HTTP benchma

[Content truncated]

Examples:

Example 1 (go):

package main

// pprof-start
import (
// pprof-end
    "flag"
    "fmt"
    "log"
    "math/rand/v2"
    "net/http"
// pprof-start
    _ "net/http/pprof"
// pprof-end
    "os"
    "os/signal"
    "time"
// pprof-start
)
// pprof-end

var (
    fastDelay   = flag.Duration("fast-delay", 0, "Fixed delay for fast handler (if any)")
    slowMin     = flag.Duration("slow-min", 1*time.Millisecond, "Minimum delay for slow handler")
    slowMax     = flag.Duration("slow-max", 300*time.Millisecond, "Maximum delay for slow handler")
    gcMinAlloc  = flag.Int("gc-min-alloc", 50, "Minimum number of alloca
...

Example 2 (bash):

go run main.go

Example 3 (bash):

go install github.com/tsenart/vegeta@latest

Example 4 (bash):

echo "GET http://localhost:8080/slow" > targets.txt

Practical Networking Patterns in Go - Go Optimization Guide

URL: https://goperf.dev/02-networking/

Contents:

  • Practical Networking Patterns in Go
  • Benchmarking First
  • Foundations and Core Concepts
  • Scaling and Performance Engineering
  • Diagnostics and Resilience
  • Transport-Level Optimization
  • Low-Level and Advanced Tuning

A 13-part guide to building scalable, efficient, and resilient networked applications in Go—grounded in real-world benchmarks, low-level optimizations, and practical design patterns.

Benchmarking and Load Testing for Networked Go Apps

Establish performance baselines before optimizing anything. Learn how to simulate realistic traffic using tools like vegeta, wrk, and k6. Covers throughput, latency percentiles, connection concurrency, and profiling under load. Sets the foundation for diagnosing bottlenecks and measuring the impact of every optimization in the series.

How Go Handles Networking: Concurrency, Goroutines, and the net Package

Understand Go’s approach to networking from the ground up. Covers how goroutines, the net package, and the runtime scheduler interact, including blocking I/O behavior, connection handling, and the use of pollers like epoll or kqueue under the hood.

Efficient Use of net/http, net.Conn, and UDP

Compare idiomatic and advanced usage of net/http vs raw net.Conn. Dive into connection pooling, custom dialers, stream reuse, and buffer tuning. Demonstrates how to avoid common pitfalls like leaking connections, blocking handlers, or over-allocating buffers.

Managing 10K+ Concurrent Connections in Go

Handling massive concurrency requires intentional architecture. Explore how to efficiently serve 10,000+ concurrent sockets using Go’s goroutines, proper resource capping, socket tuning, and runtime configuration. Focuses on connection lifecycles, scaling pitfalls, and real-world tuning.

GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning

Dive into low-level performance knobs like GOMAXPROCS, GODEBUG, thread pinning, and how Go’s scheduler interacts with epoll/kqueue. Learn when increasing parallelism helps—and when it doesn’t. Includes tools for CPU affinity and benchmarking the effect of these changes.

Building Resilient Connection Handling with Load Shedding and Backpressure

Learn how to prevent overloads from crashing your system. Covers circuit breakers, passive vs active load shedding, backpressure strategies using channel buffering and timeouts, and how to reject or degrade requests gracefully under pressure.

Memory Management and Leak Prevention in Long-Lived Connections

Long-lived connections like WebSockets or TCP streams can slowly leak memory or accumulate goroutines. This post shows how to identify common leaks, enforce read/write deadlines, manage backpressure, and trace heap growth with memory profiles.

Comp

[Content truncated]


Socket Options That Matter - Go Optimization Guide

URL: https://goperf.dev/02-networking/low-level-optimizations/

Contents:

  • Low-Level Network Optimizations: Socket Options That Matter
  • Disabling Nagle’s Algorithm: TCP_NODELAY
  • SO_REUSEPORT for Scalability
  • Tuning Socket Buffer Sizes: SO_RCVBUF and SO_SNDBUF
  • TCP Keepalives for Reliability
  • Connection Backlog: SOMAXCONN
  • Safely Wrapping Syscalls in Go
  • Real-World Considerations

Socket settings can limit both throughput and latency when the system is under load. The defaults are designed for safety and compatibility, not for any particular workload. In practice they often become the bottleneck before CPU or memory do. Go lets you reach the underlying file descriptors through syscall, so you can change key socket options without giving up its concurrency model or the standard library.

Nagle’s algorithm exists to make TCP more efficient. Every tiny packet you send carries headers that add up to a lot of wasted bandwidth if left unchecked. Nagle fixes that by holding back small writes until it can batch them into a full segment, cutting down on overhead and network chatter. That trade-off — bandwidth at the expense of latency — is usually fine, which is why it’s on by default. But if your application sends lots of small, time-critical messages, like a game server or a trading system, waiting even a few milliseconds for the buffer to fill can hurt.

Nagle’s algorithm trades latency for efficiency by holding back small packets until there’s more data to send or an acknowledgment comes back. That delay is fine for bulk transfers but a problem for anything that needs fast, small messages. Setting TCP_NODELAY turns it off so data goes out immediately. This is critical for workloads like gaming, trading, real-time video, and other interactive systems where you can’t afford to wait.

In Go, you can turn off Nagle’s algorithm with TCP_NODELAY:

SO_REUSEPORT lets multiple sockets on the same machine bind to the same port and accept connections at the same time. Instead of funneling all incoming connections through one socket, the kernel distributes new connections across all of them, so each socket gets its own share of the load. This is useful when running several worker processes or threads that each accept connections independently, because it removes the need for user-space coordination and avoids contention on a single accept queue. It also makes better use of multiple CPU cores by letting each process or thread handle its own queue of connections directly.

Typical scenarios for SO_REUSEPORT:

In Go, SO_REUSEPORT isn’t exposed in the standard library, but it can be set via syscall when creating the socket. This is done with syscall.SetsockoptInt, which operates on the socket’s file descriptor. You pass the protocol level (SOL_SOCKET), the option (SO_REUSEPORT), and the value (1 to enable). This must happen before calling bind(), so it’

[Content truncated]

Examples:

Example 1 (mermaid):

sequenceDiagram
    participant App as Application
    participant TCP as TCP Stack
    participant Net as Network

    App->>TCP: Send 1 byte
    note right of TCP: Buffering (no ACK received)
    App->>TCP: Send 1 byte
    note right of TCP: Still buffering...

    TCP-->>Net: Send 2 bytes (batched)
    Net-->>TCP: ACK received

    App->>TCP: Send 1 byte
    TCP-->>Net: Immediately send (ACK received, buffer clear)

Example 2 (go):

func SetTCPNoDelay(conn *net.TCPConn) error {
    return conn.SetNoDelay(true)
}

Example 3 (go):

listenerConfig := &net.ListenConfig{
    Control: func(network, address string, c syscall.RawConn) error {
        var sockErr error
        if err := c.Control(func(fd uintptr) {
            // Note: on Linux this constant may only be available as
            // golang.org/x/sys/unix.SO_REUSEPORT.
            sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_REUSEPORT, 1)
        }); err != nil {
            return err
        }
        return sockErr // surface the setsockopt error instead of dropping it
    },
}
listener, err := listenerConfig.Listen(context.Background(), "tcp", ":8080")

Example 4 (go):

func SetSocketBuffers(conn *net.TCPConn, recvBuf, sendBuf int) error {
    if err := conn.SetReadBuffer(recvBuf); err != nil {
        return err
    }
    return conn.SetWriteBuffer(sendBuf)
}

Building Resilient Connection Handling - Go Optimization Guide

URL: https://goperf.dev/02-networking/resilient-connection-handling/

Contents:

  • Building Resilient Connection Handling with Load Shedding and Backpressure
  • Circuit Breakers: Failure Isolation
    • What It Does
    • Why It Matters
    • Implementation Sketch
  • Load Shedding: Passive vs Active
    • Passive Load Shedding
      • Why It Matters
    • Active Load Shedding
      • Why It Matters

In high-throughput services, connection floods and sudden spikes can saturate resources, leading to latency spikes or complete system collapse. This article dives into the low-level mechanisms—circuit breakers, load shedding (passive and active), backpressure via channel buffering and timeouts—and shows how to degrade or reject requests gracefully when pressure mounts.

Circuit breakers guard downstream dependencies by short‑circuiting calls when error rates or latencies exceed thresholds. Without them, a slow or failing service causes client goroutines to pile up, consuming all threads or connections and triggering cascading failure. This mechanism isolates failing services, preventing them from affecting the overall system stability. A circuit breaker continuously monitors response times and error rates, intelligently managing request flow and allowing the system to adapt to changing conditions automatically.

A circuit breaker maintains three states: Closed (requests flow normally), Open (requests are rejected immediately), and Half-Open (limited probe traffic tests recovery); see the state diagram in Example 1.

Without circuit breakers, services depending on slow or failing components will eventually experience thread exhaustion, request queue buildup, and degraded tail latencies. Circuit breakers introduce bounded failure response by proactively rejecting requests once a dependency is known to be unstable. This reduces the impact surface of a single failure and increases system recoverability. During the Half-Open phase, only limited traffic probes the system, minimizing the risk of amplifying an unstable recovery. Circuit breakers are especially critical in distributed systems where fault domains span across network and service boundaries. They also serve as a feedback mechanism, signaling operational anomalies without requiring centralized alerting.

There are many ways to implement a Circuit Breaker, each varying in complexity and precision. Some designs use fixed time windows, others rely on exponential backoff, or combine error rates with latency thresholds. In this article, we’ll focus on a simple, practical approach: a sliding window with discrete time buckets for failure tracking, combined with a straightforward three-state machine to control call flow and recovery.
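As a minimal sketch of that three-state machine (the type and constant names are assumptions, not from the article):

type breakerState int

const (
    Closed   breakerState = iota // requests flow normally
    Open                         // requests are rejected immediately
    HalfOpen                     // limited probes test recovery
)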

First, we need a lightweight way to track how many failures have occurred recently. Instead of maintaining an unbounded history, we use a sliding window with fixed-size time buckets:

Each bucket counts events for a short time slice. As time moves forward, we rotate to the next bucket and reset it, ensuring old data naturally fades away. Her

[Content truncated]

Examples:

Example 1 (mermaid):

stateDiagram-v2
    [*] --> Closed
    Closed --> Open : errorRate > threshold
    Open --> HalfOpen : resetTimeout expires
    HalfOpen --> Closed : testSuccess >= threshold
    HalfOpen --> Open : testFailure

Example 2 (mermaid):

flowchart TD
subgraph SlidingWindow ["Sliding Window (last N intervals)"]
    B0((Bucket 0))
    B1((Bucket 1))
    B2((Bucket 2))
    B3((Bucket 3))
    B4((Bucket 4))
end

B0 -.-> Tick1["Tick(): move idx + reset bucket"]
Tick1 --> B1
B1 -.-> Tick2["Tick()"]
Tick2 --> B2
B2 -.-> Tick3["Tick()"]
Tick3 --> B3
B3 -.-> Tick4["Tick()"]
Tick4 --> B4
B4 -.-> Tick5["Tick()"]
Tick5 --> B0

B0 -.-> SumFailures["Sum all failures"]

SumFailures -->|Failures >= errorThreshold| OpenCircuit["Circuit Opens"]

OpenCircuit --> WaitReset["Wait resetTimeout"]
WaitReset --> HalfOpen["Move to Half-Open state"]

su
...

Example 3 (go):

type slidingWindow struct {
    buckets []int32
    size    int
    idx     int
    mu      sync.Mutex
}

Example 4 (go):

func (w *slidingWindow) Tick() {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.idx = (w.idx + 1) % w.size
    atomic.StoreInt32(&w.buckets[w.idx], 0)
}
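As a hedged sketch of how the window might be used (RecordFailure and Failures are assumed names, not from the article): recording a failure increments the current bucket, and the breaker sums the window to compare against its error threshold.

func (w *slidingWindow) RecordFailure() {
    w.mu.Lock()
    idx := w.idx // idx is guarded by the mutex, matching Tick()
    w.mu.Unlock()
    atomic.AddInt32(&w.buckets[idx], 1)
}

func (w *slidingWindow) Failures() int32 {
    var total int32
    for i := range w.buckets {
        total += atomic.LoadInt32(&w.buckets[i])
    }
    return total
}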

Practical Example: Profiling Networked Go Applications with pprof - Go Optimization Guide

URL: https://goperf.dev/02-networking/gc-endpoint-profiling/

Contents:

  • Practical Example: Profiling Networked Go Applications with pprof
  • CPU Profiling in Networked Apps
  • CPU Profiling Walkthrough: Load on the /gc Endpoint
  • Where the Time Went
    • HTTP Stack Dominates the Surface
    • Garbage Collection Overhead is Clearly Visible
    • I/O and Syscalls Take a Big Slice
    • Scheduler Activity Is Non-Trivial
  • Memory Profiling: Retained Heap from the /gc Endpoint
  • Summary: CPU and Memory Profiling of the /gc Endpoint

This section walks through a demo application instrumented with benchmarking tools and runtime profiling to ground profiling concepts in a real-world context. It covers identifying performance bottlenecks, interpreting flame graphs, and analyzing system behavior under various simulated network conditions.

The demo application is intentionally designed to be as simple as possible to highlight key profiling concepts without unnecessary complexity. While the code and patterns used in the demo are basic, the profiling insights gained here are highly applicable to more complex, production-grade applications.

To enable continuous profiling under load, we expose pprof via a dedicated HTTP endpoint:

The next step will be to establish a connection with the profiled app and collect samples:

View results interactively:

Or you can save the profiling graph as an SVG image:
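A minimal sketch, assuming the CPU profile was saved as cpu.prof:

go tool pprof -svg cpu.prof > cpu.svg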

We profiled the application during a 30-second load test targeting the /gc endpoint to see what happens under memory pressure. This handler was intentionally designed to trigger allocations and force garbage collection, which makes it a great candidate for observing runtime behavior under stress.

We used Go’s built-in profiler to capture a CPU trace:

This gave us 3.02 seconds of sampled CPU activity out of 30 seconds of wall-clock time—a useful window into what the runtime and application were doing under pressure.

As expected, the majority of CPU time was spent on request handling:

This aligns with the fact that we were sustaining constant traffic. The Go HTTP stack is doing the bulk of the work, managing connections and dispatching requests.

A large portion of CPU time was spent inside the garbage collector:

This confirms that gcHeavyHandler is achieving its goal. What we care about is whether this kind of allocation pressure leaks into real-world handlers. If it does, we’re paying for it in latency and CPU churn.

We also saw high syscall activity—especially from:

These functions reflect the cost of writing responses back to clients. For simple handlers, this is expected. But if your handler logic is lightweight and most of the time is spent just flushing data over TCP, it’s worth asking whether the payloads or buffer strategies could be optimized.

Functions like runtime.schedule, mcall, and findRunnable were also on the board. These are Go runtime internals responsible for managing goroutines. Seeing them isn’t unusual during high-concurrency tests—but if they dominate, it often poi

[Content truncated]

Examples:

Example 1 (go):

import (

    _ "net/http/pprof"

)

// ...

    // Start pprof in a separate goroutine.
    go func() {
        log.Println("pprof listening on :6060")
        if err := http.ListenAndServe("localhost:6060", nil); err != nil {
            log.Fatalf("pprof server error: %v", err)
        }
    }()

Example 2 (go):

package main

// pprof-start
import (
// pprof-end
    "flag"
    "fmt"
    "log"
    "math/rand/v2"
    "net/http"
// pprof-start
    _ "net/http/pprof"
// pprof-end
    "os"
    "os/signal"
    "time"
// pprof-start
)
// pprof-end

var (
    fastDelay   = flag.Duration("fast-delay", 0, "Fixed delay for fast handler (if any)")
    slowMin     = flag.Duration("slow-min", 1*time.Millisecond, "Minimum delay for slow handler")
    slowMax     = flag.Duration("slow-max", 300*time.Millisecond, "Maximum delay for slow handler")
    gcMinAlloc  = flag.Int("gc-min-alloc", 50, "Minimum number of alloca
...

Example 3 (bash):

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

Example 4 (bash):

go tool pprof -http=:7070 cpu.prof

Go-Networking - Connection Management

Pages: 3


Memory Management and Leak Prevention in Long-Lived Connections - Go Optimization Guide

URL: https://goperf.dev/02-networking/long-lived-connections/

Contents:

  • Memory Management and Leak Prevention in Long-Lived Connections
  • Identifying Common Leak Patterns
    • Goroutine Leaks
    • Buffer and Channel Accumulation
  • Enforcing Read/Write Deadlines
    • Setting Deadlines
    • Context-Based Cancellation
  • Managing Backpressure
    • Rate Limiting and Queuing
    • Flow Control via TCP

Long-lived connections—such as WebSockets or TCP streams—are critical for real-time systems but also prone to gradual degradation. When connections persist, any failure to clean up buffers, goroutines, or timeouts can quietly consume memory over time. These leaks often evade unit tests or staging environments but surface under sustained load in production.

This article focuses on memory management strategies tailored to long-lived connections in Go. It outlines patterns that cause leaks, techniques for enforcing resource bounds, and tools to identify hidden retention through profiling.

In garbage-collected languages like Go, memory leaks typically involve lingering references—objects that are no longer needed but remain reachable. The most common culprits in connection-heavy services include goroutines that don’t exit, buffered channels that accumulate data, and slices that retain large backing arrays.

Handlers for persistent connections often run in their own goroutines. If the control flow within a handler blocks indefinitely—whether due to I/O operations, nested goroutines, or external dependencies—those goroutines can remain active even after the connection is no longer useful.

Here, if process(message) internally spawns goroutines without proper cancellation, or if conn.ReadMessage() blocks indefinitely after a network interruption, the handler goroutine can hang forever, retaining references to stacks and heap objects. Blocking reads prevent the loop from exiting, and unbounded goroutine spawning within process can accumulate if upstream errors aren’t handled. Now multiply by 10,000 connections.
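One hedged way to bound that handler (the 60-second deadline is an assumption; SetReadDeadline is available on gorilla-style websocket connections):

func handleWS(conn *websocket.Conn) {
    defer conn.Close()
    for {
        // Bound each read so a silent or dead peer eventually unblocks the loop.
        conn.SetReadDeadline(time.Now().Add(60 * time.Second))
        _, message, err := conn.ReadMessage()
        if err != nil {
            return
        }
        process(message)
    }
}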

Buffered channels and pooled buffers offer performance advantages, but misuse can lead to retained memory that outlives its usefulness. A typical example involves sync.Pool combined with I/O:

This version (Example 2 below) correctly isolates the active portion of the buffer using a copy. Problems arise when the copy is skipped (Example 3):

Although data appears small, it still points to the original 4 KB buffer. If process stores that slice in a log queue, cache, or channel, the entire backing array remains in memory. Over time, this pattern can hold onto hundreds of megabytes of heap space across thousands of connections.

To prevent this, always create a new slice with just the required data length before handing it off to any code that might retain it. Copying a slice may seem inefficient, but it ensures the larger buffer is no longer indirectly referenced.

Network I/O without

[Content truncated]

Examples:

Example 1 (go):

func handleWS(conn *websocket.Conn) {
    for {
        _, message, err := conn.ReadMessage()
        if err != nil {
            break
        }
        process(message)
    }
}

http.HandleFunc("/ws", func(w http.ResponseWriter, r *http.Request) {
    ws, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        return
    }
    go handleWS(ws)
})

Example 2 (go):

var bufferPool = sync.Pool{
    New: func() interface{} { return make([]byte, 4096) },
}

func handle(conn net.Conn) {
    buf := bufferPool.Get().([]byte)
    defer bufferPool.Put(buf)

    for {
        n, err := conn.Read(buf)
        if err != nil {
            return
        }

        data := make([]byte, n)
        copy(data, buf[:n])
        go process(data)
    }
}

Example 3 (go):

data := buf[:n] // still references the pooled 4 KB backing array
go process(data)

Example 4 (go):

const timeout = 30 * time.Second

func handle(conn net.Conn) {
    defer conn.Close()

    buffer := make([]byte, 4096) // 4 KB buffer; size depends on protocol and usage

    for {
        conn.SetReadDeadline(time.Now().Add(timeout))
        n, err := conn.Read(buffer)
        if err != nil {
            if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
                return // idle timeout: close the connection quietly
            }
            return // any other read error: drop the connection
        }

        conn.SetWriteDeadline(time.Now().Add(timeout))
        _, err = conn.Write(buffer[:n])
        if err != nil {
            return
        }
    }
}

Connection Lifecycle Observability - Go Optimization Guide

URL: https://goperf.dev/02-networking/connection_observability/

Contents:

  • Connection Lifecycle Observability: From Dial to Close
  • DNS Resolution
  • Dialing
  • Handshake and Negotiation
  • Application Reads and Writes
  • Teardown
  • Correlating Spans and Errors
  • Detecting and Explaining Hangs
  • Beyond Logs: Metrics and Structured Events
    • Why reduce logs?

Many observability systems expose only high-level HTTP metrics, but deeper insight comes from explicitly instrumenting each stage — DNS resolution, dialing, handshake, negotiation, reads/writes, and teardown. Observing the full lifecycle of a network connection provides the clarity needed to diagnose latency issues, identify failures, and relate resource usage to external behavior.

Every outbound connection begins with a name resolution step unless an IP is already known. DNS latency and failures often dominate connection setup time and can vary significantly depending on caching and resolver performance.

Capturing DNS resolution duration, errors and resulting address provides visibility into one of the least predictable phases. At this stage, it is valuable to:

In Go, this can be achieved by wrapping the net.Resolver.

This explicit measurement avoids relying on opaque metrics exposed by the OS resolver or libraries.

After obtaining an address, the next phase is dialing — establishing a TCP (or other transport) connection. Here, round-trip latency to the target, route stability, and ephemeral port exhaustion can all surface.

Observing the dial phase often involves intercepting net.Dialer.

Why trace here? Dialing failures can indicate downstream unavailability, but also local issues like SYN flood protection or bad routing. Without a per-dial timestamp and error trace, identifying the locus of failure is guesswork.

For secure connections, the next stage — cryptographic handshake — dominates. TLS negotiation can involve multiple round-trips, certificate validation, and cipher negotiation. Measuring this stage separately is necessary because it isolates pure network latency (dial) from cryptographic and policy enforcement costs (handshake).

The crypto/tls library in Go allows instrumentation of the handshake explicitly.

A common misconception is that slow TLS handshakes always reflect bad network conditions; in practice, slow certificate validation or OCSP/CRL checks are frequently to blame. Separating these phases helps pinpoint the cause.

Once the connection is established and negotiated, application-level traffic proceeds as reads and writes. Observability at this stage is often the least precise yet most critical for correlating client-perceived latency to backend processing.

Instrumenting reads and writes directly yields fine-grained latency and throughput metrics. Wrapping a connection is a common strategy.

Why measure at this granularit

[Content truncated]

Examples:

Example 1 (go):

import (
    "context"
    "log"
    "net"
    "time"
)

func resolveWithTracing(ctx context.Context, hostname string) ([]string, error) {
    start := time.Now()
    ips, err := net.DefaultResolver.LookupHost(ctx, hostname)
    elapsed := time.Since(start)

    log.Printf("dns: host=%s duration=%s ips=%v err=%v", hostname, elapsed, ips, err)
    return ips, err
}

Example 2 (go):

func dialWithTracing(ctx context.Context, network, addr string) (net.Conn, error) {
    var d net.Dialer
    start := time.Now()
    conn, err := d.DialContext(ctx, network, addr)
    elapsed := time.Since(start)

    log.Printf("dial: addr=%s duration=%s err=%v", addr, elapsed, err)
    return conn, err
}

Example 3 (go):

import (
    "crypto/tls"
    "log"
    "net"
    "time"
)

func handshakeWithTracing(conn net.Conn, config *tls.Config) (*tls.Conn, error) {
    tlsConn := tls.Client(conn, config)
    start := time.Now()
    err := tlsConn.Handshake()
    elapsed := time.Since(start)

    log.Printf("handshake: duration=%s err=%v", elapsed, err)
    return tlsConn, err
}

Example 4 (go):

type tracedConn struct {
    net.Conn
}

func (c *tracedConn) Read(b []byte) (int, error) {
    start := time.Now()
    n, err := c.Conn.Read(b)
    elapsed := time.Since(start)

    log.Printf("read: bytes=%d duration=%s err=%v", n, elapsed, err)
    return n, err
}

func (c *tracedConn) Write(b []byte) (int, error) {
    start := time.Now()
    n, err := c.Conn.Write(b)
    elapsed := time.Since(start)

    log.Printf("write: bytes=%d duration=%s err=%v", n, elapsed, err)
    return n, err
}

Managing 10K+ Concurrent Connections in Go - Go Optimization Guide

URL: https://goperf.dev/02-networking/10k-connections/

Contents:

  • Managing 10K+ Concurrent Connections in Go
  • Embracing Go’s Concurrency Model
    • Managing Concurrency at Scale
    • OS-Level and Socket Tuning
    • Go Scheduler and Memory Pressure
    • Optimizing Goroutine Behavior
    • Pooling and Reusing Objects
    • Connection Lifecycle Management
  • Real-World Tuning and Scaling Pitfalls
    • Instrumenting and Benchmarking the Server

While framing the challenge in terms of “100K concurrent connections” is tempting, practical engineering often begins with a more grounded target: 10K to 20K stable, performant connections. This isn’t a limitation of Go itself but a reflection of real-world constraints: ulimit settings, ephemeral port availability, TCP stack configuration, and the nature of the application workload all set hard boundaries.

Cloud environments introduce their own considerations. For instance, AWS Fargate explicitly sets both the soft and hard nofile (number of open files) limit to 65,535, which provides more headroom for socket-intensive applications but still falls short of the 100K+ threshold. On EC2 instances, the practical limits depend on the base operating system and user configuration. By default, many Linux distributions impose a soft limit of 1024 and a hard limit of 65535 for nofile. Even this hard cap is lower than required to handle 100,000 open connections in a single process. Reaching higher limits requires kernel-level tuning, container runtime overrides, and multi-process strategies to distribute file descriptor load.
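Where the platform permits, a process can also raise its own soft limit toward the hard cap at startup. A hedged sketch using the syscall package (Linux-style constants; error handling kept minimal):

// Best-effort: lift the soft RLIMIT_NOFILE up to the hard limit.
var rl syscall.Rlimit
if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err == nil {
    rl.Cur = rl.Max
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        log.Printf("setrlimit: %v", err)
    }
}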

A server handling simple echo logic behaves very differently from one performing CPU-bound processing, structured logging, or real-time transformation. Additionally, platform-level tunability varies—Linux exposes granular control through sysctl, epoll, and reuseport, while macOS lacks many of these mechanisms. In that context, achieving and sustaining 10K+ concurrent connections with real workloads is a demanding, yet practical, benchmark.

Handling massive concurrency in Go is often romanticized—"goroutines are cheap, just spawn them!"—but reality gets harsher as we push towards six-digit concurrency levels. Serving over 10,000 concurrent sockets isn’t something you solve by scaling hardware alone—it requires an architecture that works with the OS, the Go runtime, and the network stack, not against them.

Go’s lightweight goroutines and its powerful runtime scheduler make it an excellent choice for scaling network applications. Goroutines consume only a few kilobytes of stack space, which, in theory, makes them ideal for handling tens of thousands of concurrent connections. However, reality forces us to think beyond just spinning up goroutines. While the language’s abstraction makes concurrency almost “magical,” achieving true efficiency at this scale demands intentional design.

Running a server that spawns one goroutine per connection means

[Content truncated]

Examples:

Example 1 (go):

package main

import (
    "log"
    "net"
    "sync/atomic"
    "time"
)

var activeConnections uint64

func main() {
    listener, err := net.Listen("tcp", ":8080")
    if err != nil {
        log.Fatalf("Error starting TCP listener: %v", err)
    }
    defer listener.Close()

    for {
        conn, err := listener.Accept()
        if err != nil {
            log.Printf("Error accepting connection: %v", err)
            continue
        }

        atomic.AddUint64(&activeConnections, 1)
        go handleConnection(conn)
    }
}

func handleConnection(conn net.Conn) {
    defer func() {
    
...

Example 2 (go):

package main

import (
    "log"
    "net"
)

var connLimiter = make(chan struct{}, 10000) // Max 10K concurrent conns

func main() {
    ln, err := net.Listen("tcp", ":8080")
    if err != nil {
        log.Fatal(err)
    }
    defer ln.Close()

    for {
        conn, err := ln.Accept()
        if err != nil {
            continue
        }

        connLimiter <- struct{}{} // Acquire slot
        go func(c net.Conn) {
            defer func() {
                c.Close()
                <-connLimiter // Release slot
            }()
            // Dummy echo logic: echo back only the bytes actually read
            buf := make([]byte, 1024)
            n, err := c.Read(buf)
            if err != nil {
                return
            }
            c.Write(buf[:n])
        }(conn)
    }
}

Example 3 (bash):

# Increase file descriptor limit
ulimit -n 200000

Example 4 (bash):

sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.ip_local_port_range="10000 65535"
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15

Go-Networking - DNS Tuning

Pages: 1


Tuning DNS Performance in Go Services - Go Optimization Guide

URL: https://goperf.dev/02-networking/dns_performance/

Contents:

  • Tuning DNS Performance in Go Services
  • How DNS Resolution Works in Go: cgo vs. Native Resolver
    • Runtime Dependencies
  • DNS Caching: Why and When
  • Using Custom Dialers and Pre-resolved IPs
  • Metrics and Debugging Real-world DNS Slowdowns
    • Metrics
    • Debugging Tips
    • Advanced DNS Performance Tips

DNS resolution tends to fly under the radar, but it can still slow down Go applications. Even brief delays in lookups can stack up in distributed or microservice architectures where components frequently communicate. Understanding how Go resolves DNS under the hood — and how to adjust it — can make your service more responsive and reliable.

Go supports two different ways of handling DNS queries: the built-in pure-Go resolver and the cgo-based resolver.

The pure-Go resolver is fully self-contained and avoids using any external DNS libraries. It reads its configuration from /etc/resolv.conf (on Unix-like systems) and talks directly to the configured nameservers. This makes it simple and generally performant, though it sometimes struggles to handle more exotic or highly customized DNS environments.

In contrast, the cgo-based resolver delegates DNS resolution to the operating system’s own resolver (through the C standard library, libc). This gives better compatibility with complicated or custom DNS environments—like those involving LDAP or multicast DNS—but it also comes with a tradeoff. The cgo resolver adds overhead due to calls into external C libraries, and it can sometimes lead to issues around thread safety or unpredictable latency spikes.

It's possible to explicitly configure Go to prefer the pure-Go resolver using an environment variable:

Alternatively, force the use of cgo resolver:

Enabling cgo changes how the Go binary interacts with the operating system. With cgo turned on, the binary no longer stands alone — it links dynamically to libc and the system loader, which ldd reveals in its output.

A cgo-enabled binary relies on the system’s C runtime (libc.so.6) and the dynamic loader (ld-linux). Without these shared libraries available on the host, the binary won’t start — which makes it unsuitable for stripped-down environments like scratch containers.

By contrast, a pure-Go binary is completely self-contained and statically linked. If you run ldd on it, you’ll simply see:

This shows that all the code the binary needs is already baked in, with no dependency on shared libraries at runtime. Because of this, pure-Go builds are a good fit for minimal containers or bare environments without a C runtime, offering better portability and fewer operational surprises. The downside is that these binaries can’t take advantage of system-level resolver features that require cgo and the host’s libc.

Caching DNS results prevents the application from sending redundant queries for names it has recently resolved.
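The article's discussion continues beyond this excerpt; as a minimal sketch of the idea (type and field names are assumptions; imports of context, net, sync, and time assumed):

// dnsCache memoizes LookupHost results for a fixed TTL.
type dnsCache struct {
    mu      sync.Mutex
    ttl     time.Duration
    entries map[string]cacheEntry
}

type cacheEntry struct {
    ips     []string
    expires time.Time
}

func (c *dnsCache) LookupHost(ctx context.Context, host string) ([]string, error) {
    c.mu.Lock()
    if e, ok := c.entries[host]; ok && time.Now().Before(e.expires) {
        c.mu.Unlock()
        return e.ips, nil // cache hit: skip the network entirely
    }
    c.mu.Unlock()

    ips, err := net.DefaultResolver.LookupHost(ctx, host)
    if err != nil {
        return nil, err
    }

    c.mu.Lock()
    c.entries[host] = cacheEntry{ips: ips, expires: time.Now().Add(c.ttl)}
    c.mu.Unlock()
    return ips, nil
}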

[Content truncated]

Examples:

Example 1 (bash):

export GODEBUG=netdns=go

Example 2 (bash):

export GODEBUG=netdns=cgo

Example 3 (text):

$ ldd ./app-cgo
    linux-vdso.so.1 (0x0000fa34ddbad000)
    libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000fa34dd9b0000)
    /lib/ld-linux-aarch64.so.1 (0x0000fa34ddb70000)

Example 4 (text):

$ ldd ./app-pure
    not a dynamic executable

Go-Networking Documentation Index

Categories

Benchmarking

File: benchmarking.md Pages: 7

Connection Management

File: connection_management.md Pages: 3

DNS Tuning

File: dns_tuning.md Pages: 1

Networking Fundamentals

File: networking_fundamentals.md Pages: 3

Other

File: other.md Pages: 2

TLS Security

File: tls_security.md Pages: 1

Go-Networking - Networking Fundamentals

Pages: 3


GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning - Go Optimization Guide

URL: https://goperf.dev/02-networking/a-bit-more-tuning/

Contents:

  • GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning
  • Understanding GOMAXPROCS
  • Diving into Go’s Scheduler Internals
  • Netpoller: Deep Dive into epoll on Linux and kqueue on BSD
  • Thread Pinning with LockOSThread and GODEBUG Flags
  • CPU Affinity and External Tools

Go applications operating at high concurrency levels frequently encounter performance ceilings that are not attributable to CPU saturation. These limitations often stem from runtime-level mechanics: how goroutines (G) are scheduled onto logical processors (P) via operating system threads (M), how blocking operations affect thread availability, and how the runtime interacts with kernel facilities like epoll or kqueue for I/O readiness.

Unlike surface-level code optimization, resolving these issues requires awareness of the Go scheduler’s internal design, particularly how GOMAXPROCS governs execution parallelism and how thread contention, cache locality, and syscall latency emerge under load. Misconfigured runtime settings can lead to excessive context switching, stalled P’s, and degraded throughput despite available cores.

System-level tuning—through CPU affinity, thread pinning, and scheduler introspection—provides a critical path to improving latency and throughput in multicore environments. When paired with precise benchmarking and observability, these adjustments allow Go services to scale more predictably and fully take advantage of modern hardware architectures.

In Go, GOMAXPROCS defines the maximum number of operating system threads (M’s) simultaneously executing user-level Go code (G’s). By default it is set to the machine’s logical CPU count. Under the hood, the scheduler maintains P’s (processors) equal to GOMAXPROCS. Each P hosts a run queue of G’s and binds to a single M to execute Go code.

Increasing GOMAXPROCS allows more P’s, and therefore more OS threads, to run goroutines in parallel. That often boosts performance for CPU-bound workloads. However, more P’s also mean more context switches, more cache thrashing, and potentially more contention on shared data structures (e.g., the garbage collector’s work queues). Blindly scaling past the sweet spot can actually degrade latency.

Go’s scheduler organizes three core actors: G (goroutine), M (OS thread), and P (logical processor). When a goroutine makes a blocking syscall, its M detaches from its P, returning the P to the global scheduler so another M can pick it up. This design prevents syscalls from starving CPU-bound goroutines.

The scheduler uses work stealing: each P maintains a local run queue, and idle P’s will steal work from busier peers. If GOMAXPROCS is set too high, you will

[Content truncated]

Examples:

Example 1 (go):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Show current value
    fmt.Printf("GOMAXPROCS = %d\n", runtime.GOMAXPROCS(0))

    // Set to 4 and confirm
    prev := runtime.GOMAXPROCS(4)
    fmt.Printf("Changed from %d to %d\n", prev, runtime.GOMAXPROCS(0))
}

Example 2 (bash):

GODEBUG=schedtrace=1000,scheddetail=1 go run main.go

Example 3 (text):

SCHED 3024ms: gomaxprocs=14 idleprocs=14 threads=26 spinningthreads=0 needspinning=0 idlethreads=20 runqueue=0 gcwaiting=false nmidlelocked=1 stopwait=0 sysmonwait=false
  P0: status=0 schedtick=173 syscalltick=3411 m=nil runqsize=0 gfreecnt=6 timerslen=0
  ...
  P13: status=0 schedtick=96 syscalltick=310 m=nil runqsize=0 gfreecnt=2 timerslen=0
  M25: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  ...

Example 4 (go):

func pollAndRead(conn net.Conn) ([]byte, error) {
    buf := make([]byte, 4096)
    for {
        n, err := conn.Read(buf)
        if n > 0 {
            return buf[:n], nil
        }
        if err != nil && !isTemporary(err) {
            return nil, err
        }
        // Data not ready yet — goroutine will be parked until poller wakes it
    }
}

Efficient Use of net/http, net.Conn, and UDP - Go Optimization Guide

URL: https://goperf.dev/02-networking/efficient-net-use/

Contents:

  • Efficient Use of net/http, net.Conn, and UDP in High-Traffic Go Services
  • The Hidden Complexity Behind a Simple HTTP Call
  • Transport Tuning: When Defaults Aren’t Enough
    • Custom http.Transport Fields to Tune
  • More Advanced Optimization Tricks
    • Set ExpectContinueTimeout Carefully
    • Constrain MaxConnsPerHost
    • Use Small http.Client.Timeout
    • Explicitly Set ReadBufferSize and WriteBufferSize in http.Server
    • Use bufio.Reader.Peek() for Efficient Framing

When we first start building high-traffic services in Go, we often lean heavily on net/http. It’s stable, ergonomic, and remarkably capable for 80% of use cases. But as soon as traffic spikes or latency budgets shrink, the cracks begin to show.

It’s not that net/http is broken—it’s just that the defaults are tuned for convenience, not for performance under stress. And as we scale backend services to handle millions of requests per second, understanding what happens underneath the abstraction becomes the difference between meeting SLOs and fire-fighting in production.

This article is a walkthrough of how to make networked Go services truly efficient—what works, what breaks, and how to go beyond idiomatic usage. We’ll start with net/http, drop into raw net.Conn, and finish with real-world patterns for handling UDP in latency-sensitive systems.

Let’s begin where most Go developers do: a simple http.Client.

This looks harmless. It gets the job done, and in most local tests, it performs reasonably well. But in production, at scale, this innocent-looking code can trigger a surprising range of issues: leaked connections, memory spikes, blocked goroutines, and mysterious latency cliffs.

One of the most common issues is forgetting to fully read resp.Body before closing it. Go’s HTTP client won’t reuse connections unless the body is drained. And under load, that means you're constantly opening new TCP connections—slamming the kernel with ephemeral ports, exhausting file descriptors, and triggering throttling.

Here’s the safe pattern:

It’s easy to overlook how much global state hides behind http.DefaultTransport. If you spin up multiple http.Client instances across your app without customizing the transport, you're probably reusing a shared global pool without realizing it.

This leads to unpredictable behavior under load: idle connections get evicted too quickly, or keep-alive connections linger longer than they should. The fix? Build a tuned Transport that matches your concurrency profile.

The key settings to tune live on the http.Transport struct, with related knobs on http.Client and http.Server, or custom wrappers built on top of them:

If our clients send large POST requests and the server doesn’t support 100-continue properly, we can reduce or eliminate this delay:

Go’s default HTTP client will open an unbounded number of connections to a host. That’s fine until one of your downstreams can’t handle i

[Content truncated]

Examples:

Example 1 (go):

client := &http.Client{
    Timeout: 5 * time.Second,
}

resp, err := client.Get("http://localhost:8080/data")
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()

Example 2 (go):

defer resp.Body.Close()
io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused

Example 3 (go):

transport := &http.Transport{
    MaxIdleConns:          1000,
    MaxConnsPerHost:       100,
    IdleConnTimeout:       90 * time.Second,
    ExpectContinueTimeout: 0,
    DialContext: (&net.Dialer{
        Timeout:   5 * time.Second,
        KeepAlive: 30 * time.Second,
    }).DialContext,
}

client := &http.Client{
    Transport: transport,
    Timeout:   2 * time.Second,
}

Example 4 (go):

transport := &http.Transport{
    ...
    ExpectContinueTimeout: 0, // if not needed, skip the wait entirely
    ...
}

How Go Handles Networking - Go Optimization Guide

URL: https://goperf.dev/02-networking/networking-internals/

Contents:

  • Go Networking Internals
  • Goroutines and the Runtime Scheduler
  • Blocking I/O in Goroutines: What Really Happens?
  • Internals of the net Package
  • The Netpoller: Polling with Epoll/Kqueue/IOCP
  • Example: High-Performance TCP Echo Server
    • Imports and Setup
    • Listener Setup
    • Accept Loop and Goroutine Scheduling
    • Connection Handler

Go’s networking model is deceptively simple on the surface—spawn a goroutine, accept a connection, read from it, and write a response. But behind this apparent ease is a highly optimized and finely tuned runtime that handles tens or hundreds of thousands of connections with minimal OS overhead. In this deep dive, we’ll walk through the mechanisms that make this possible: from goroutines and the scheduler to how Go interacts with OS-level pollers like epoll, kqueue, and IOCP.

Goroutines are lightweight user-space threads managed by the Go runtime. They’re cheap to create (a few kilobytes of stack) and can scale to millions. But they’re not magic—they rely on the runtime scheduler to multiplex execution across a limited number of OS threads.

Go’s scheduler is based on an M:N model:

Each P can execute one G at a time using an M. There are as many Ps as GOMAXPROCS. If a goroutine blocks on I/O, it is parked and another runnable G can take over the thread.

Suppose a goroutine calls conn.Read(). This looks blocking—but only from the goroutine's perspective. Internally, Go’s runtime intercepts the call and uses a mechanism known as the netpoller.

On Unix-based systems, Go uses readiness-based polling (epoll on Linux, kqueue on macOS/BSD). When a goroutine performs a syscall like read(fd), the runtime checks whether the file descriptor is ready. If not, the fd is registered with the poller, the goroutine is parked, and the netpoller wakes and reschedules it once the fd becomes ready (see the flowchart in Example 2).

This system enables Go to serve a massive number of clients concurrently, using a small number of threads, avoiding the overhead of traditional thread-per-connection models.

Let’s take a look at what happens behind net.Listen("tcp", ":8080") and conn.Read().

Here’s a rough diagram of the call chain:

This architecture makes the blocking calls from the developer’s perspective translate into non-blocking interactions with the kernel.

The netpoller is a runtime subsystem that integrates low-level polling mechanisms with Go’s scheduling system. Each fd has an associated pollDesc, which helps coordinate goroutine suspension and resumption.

The poller operates in a dedicated thread (or threads) that loops over OS wait primitives such as epoll_wait, kevent, or GetQueuedCompletionStatus.

When an I/O event fires, the poller finds the associated pollDesc, identifies the parked goroutine, and puts it back into the run queue.

In the Go source, relevant files include runtime/netpoll.go and its platform-specific backends (e.g., netpoll_epoll.go, netpoll_kqueue.go, netpoll_windows.go).

The Go poller is readiness-based (not completion-based, except for Windows IOCP). It handles:

Let's break down a simple Go TCP echo server and map each part to Go’s internal networking and scheduling mechanisms — inclu

[Content truncated]

Examples:

Example 1 (mermaid):

stateDiagram-v2
    [*] --> New : goroutine declared
    New --> Runnable : go func() invoked
    Runnable --> Running : scheduled on an available P
    Running --> Waiting : blocking syscall, channel op, etc.
    Waiting --> Runnable : event ready, rescheduled
    Running --> Terminated : function exits or panics
    Waiting --> Terminated : canceled or panicked
    Terminated --> [*]

    state "Go Scheduler\n(GOMAXPROCS = N)" as Scheduler {
        [*] --> P1
        [*] --> P2
        ...
        [*] --> PN

        P1 --> ScheduleGoroutine1 : pick from global/runq
        P2 --> ScheduleG
...

Example 2 (unknown):

flowchart TD
    A["Goroutine: conn.Read()"] --> B[netpoller checks FD]
    B --> C{FD ready?}
    C -- No --> D[Park goroutine]
    D --> E[FD registered with epoll]
    E --> F[epoll_wait blocks]
    F --> G[FD ready]
    G --> H[Wake goroutine]
    H --> I[Re-schedule]
    C -- Yes --> H

Example 3 (unknown):

flowchart TD
    A[net.Listen] --> B[ListenTCP] --> C[listenFD]
    C --> D["pollDesc (register with netpoll)"]
    D --> E[runtime-integrated non-blocking syscall wrappers]

Example 4 (go):

package main

import (
    "bufio"
    "fmt"
    "net"
    "time"
)

func main() {
    // Start listening on TCP port 9000
    listener, err := net.Listen("tcp", ":9000")
    if err != nil {
        panic(err) // Exit if the port can't be bound
    }
    fmt.Println("Echo server listening on :9000")

    // Accept incoming connections in a loop
    for {
        conn, err := listener.Accept() // Accept new client connection
        if err != nil {
            fmt.Printf("Accept error: %v\n", err)
            continue // Skip this iteration on error
        }

        // Handle the connection i
...

name: go-networking
description: Go networking performance patterns and best practices. Use when optimizing network I/O, building high-performance servers, managing connections, tuning TCP/HTTP/gRPC, or diagnosing networking issues in Go applications.

Go-Networking Skill

Comprehensive assistance with go-networking development, generated from official documentation.

When to Use This Skill

This skill should be triggered when:

  • Working with go-networking
  • Asking about go-networking features or APIs
  • Implementing go-networking solutions
  • Debugging go-networking code
  • Learning go-networking best practices

Quick Reference

Common Patterns

Pattern 1: GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning

Go applications operating at high concurrency levels frequently encounter performance ceilings that are not attributable to CPU saturation. These limitations often stem from runtime-level mechanics: how goroutines (G) are scheduled onto logical processors (P) via operating system threads (M), how blocking operations affect thread availability, and how the runtime interacts with kernel facilities like epoll or kqueue for I/O readiness. Unlike surface-level code optimization, resolving these issues requires awareness of the Go scheduler’s internal design, particularly how GOMAXPROCS governs execution parallelism and how thread contention, cache locality, and syscall latency emerge under load. Misconfigured runtime settings can lead to excessive context switching, stalled P’s, and degraded throughput despite available cores. System-level tuning—through CPU affinity, thread pinning, and scheduler introspection—provides a critical path to improving latency and throughput in multicore environments. When paired with precise benchmarking and observability, these adjustments allow Go services to scale more predictably and take full advantage of modern hardware architectures.

Understanding GOMAXPROCS¶

In Go, GOMAXPROCS defines the maximum number of operating system threads (M’s) simultaneously executing user-level Go code (G’s). It defaults to the machine’s logical CPU count. Under the hood, the scheduler exposes P’s (processors) equal to GOMAXPROCS. Each P hosts a run queue of G’s and binds to a single M to execute Go code.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Show current value
    fmt.Printf("GOMAXPROCS = %d\n", runtime.GOMAXPROCS(0))

    // Set to 4 and confirm
    prev := runtime.GOMAXPROCS(4)
    fmt.Printf("Changed from %d to %d\n", prev, runtime.GOMAXPROCS(0))
}

Increasing GOMAXPROCS allows more P’s—and therefore more OS threads—to run goroutines in parallel. That often boosts performance for CPU-bound workloads. However, more P’s also incur more context switches, more cache thrashing, and potentially more contention in shared data structures (e.g., the garbage collector’s work queues). Blindly scaling past the sweet spot can actually degrade latency.

Diving into Go’s Scheduler Internals¶

Go’s scheduler organizes three core actors: G (goroutine), M (OS thread), and P (logical processor). When a goroutine makes a blocking syscall, its M detaches from its P, returning the P to the global scheduler so another M can pick it up. This design prevents syscalls from starving CPU-bound goroutines. The scheduler uses work stealing: each P maintains a local run queue, and idle P’s will steal work from busier peers. If GOMAXPROCS is set too high, the returns from stolen work diminish against the overhead of balancing those run queues.

Enabling scheduler tracing via GODEBUG can reveal fine-grained metrics:

GODEBUG=schedtrace=1000,scheddetail=1 go run main.go

schedtrace=1000 instructs the runtime to print scheduler state every 1000 milliseconds (1 second). scheddetail=1 enables additional information per logical processor (P), such as individual run queue lengths. Each printed trace includes statistics like:

SCHED 3024ms: gomaxprocs=14 idleprocs=14 threads=26 spinningthreads=0 needspinning=0 idlethreads=20 runqueue=0 gcwaiting=false nmidlelocked=1 stopwait=0 sysmonwait=false
P0: status=0 schedtick=173 syscalltick=3411 m=nil runqsize=0 gfreecnt=6 timerslen=0
...
P13: status=0 schedtick=96 syscalltick=310 m=nil runqsize=0 gfreecnt=2 timerslen=0
M25: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
...

The first line reports global scheduler state, including whether garbage collection is blocking (gcwaiting), whether spinning threads are needed, and idle thread counts. Each P line details the logical processor’s scheduler activity, including the number of times it has scheduled (schedtick), system call activity (syscalltick), timers, and free goroutine slots. The M lines correspond to OS threads. Each line shows which goroutine—if any—is running on that thread, whether the thread is idle, spinning, or blocked, along with memory allocation activity and lock states.

This view makes it easier to spot not only classic concurrency bottlenecks but also deeper issues: scheduler delays, blocking syscalls, threads that spin without doing useful work, or CPU cores that sit idle when they shouldn’t. The output reveals patterns that aren’t visible from logs or metrics alone.

  • gomaxprocs=14: Number of logical processors (P’s).
  • idleprocs=14: All processors are idle, indicating no runnable goroutines.
  • threads=26: Number of M’s (OS threads) created.
  • spinningthreads=0: No threads are actively searching for work.
  • needspinning=0: No additional spinning threads are requested by the scheduler.
  • idlethreads=20: Number of OS threads currently idle.
  • runqueue=0: Global run queue is empty.
  • gcwaiting=false: Garbage collector is not blocking execution.
  • nmidlelocked=1: One P is locked to a thread that is currently idle.
  • stopwait=0: No goroutines waiting to stop the world.
  • sysmonwait=false: The system monitor is actively running, not sleeping.

The global run queue holds goroutines that are not bound to any specific P or that overflowed local queues. In contrast, each logical processor (P) maintains a local run queue of goroutines it is responsible for scheduling. Goroutines are preferentially enqueued locally for performance: local queues avoid lock contention and improve cache locality. A goroutine is placed on the global queue only when a P’s local queue is full or when it originates from outside a P (e.g., from a syscall). This dual-queue strategy reduces synchronization overhead across P’s and enables efficient scheduling under high concurrency.

Understanding the ratio of local vs. global queue activity helps diagnose whether the system is under-provisioned, improperly balanced, or suffering from excessive cross-P migrations. These insights help quantify how efficiently goroutines are scheduled, how much parallelism is actually utilized, and whether the system is under- or over-provisioned in terms of logical processors. Observing these patterns under load is crucial when adjusting GOMAXPROCS, diagnosing tail latency, or identifying scheduler contention.

Netpoller: Deep Dive into epoll on Linux and kqueue on BSD¶

In any Go application handling high connection volumes, the network poller plays a critical behind-the-scenes role. At its core, Go uses the OS-level multiplexing facilities—epoll on Linux and kqueue on BSD/macOS—to monitor thousands of sockets concurrently with minimal threads. The runtime leverages these mechanisms efficiently, but understanding how and why reveals opportunities for tuning, especially under demanding loads.

When a goroutine initiates a network operation like reading from a TCP connection, the runtime doesn’t immediately block the underlying thread. Instead, it registers the file descriptor with the poller—using epoll_ctl in edge-triggered mode or EV_SET with EVFILT_READ—and parks the goroutine. The actual thread (M) becomes free to run other goroutines. When data arrives, the kernel signals the poller thread, which in turn wakes the appropriate goroutine by scheduling it onto a P’s run queue. This wakeup process minimizes contention by relying on per-P notification lists and avoids runtime lock bottlenecks.

Go uses edge-triggered notifications, which signal only on state transitions—like new data becoming available. This design requires the application to drain sockets fully during each wakeup or risk missing future events. While more complex than level-triggered behavior, edge-triggered mode significantly reduces syscall overhead under load.

Here’s a simplified version of what happens under the hood during a read operation:

func pollAndRead(conn net.Conn) ([]byte, error) {
    buf := make([]byte, 4096)
    for {
        n, err := conn.Read(buf)
        if n > 0 {
            return buf[:n], nil
        }
        if err != nil && !isTemporary(err) {
            return nil, err
        }
        // Data not ready yet — the goroutine will be parked until the poller wakes it
    }
}

Internally, Go runs a dedicated poller thread that loops on epoll_wait or kevent, collecting batches of events (typically 512 at a time). After the call returns, the runtime processes these events, distributing wakeups across logical processors to prevent any single P from becoming a bottleneck. To further promote scheduling fairness, the poller thread may rotate across P’s periodically, a behavior governed by GODEBUG=netpollWaitLatency.

Go’s runtime is optimized to reduce unnecessary syscalls and context switches. All file descriptors are set to non-blocking, which allows the poller thread to remain responsive. To avoid the thundering herd problem—where multiple threads wake on the same socket—the poller ensures only one goroutine handles a given FD event at a time. The design goes even further by aligning the circular event buffer with cache lines and distributing wakeups via per-P lists. These details matter at scale. With proper alignment and locality, Go reduces CPU cache contention when thousands of connections are active.

For developers looking to inspect poller behavior, enabling tracing with GODEBUG=netpoll=1 can surface system-level latencies and epoll activity. Additionally, the GODEBUG=netpollWaitLatency=200 flag configures the poller’s willingness to hand off to another P every 200 microseconds. That’s particularly helpful in debugging idle P starvation or evaluating fairness in high-throughput systems.

Here’s a small experiment that logs event activity:

GODEBUG=netpoll=1 go run main.go

You’ll see log lines like:

runtime: netpoll: poll returned n=3
runtime: netpoll: waking g=102 for fd=5

Most developers never need to think about this machinery—and they shouldn’t. But these details become valuable in edge cases, like high-throughput HTTP proxies or latency-sensitive services dealing with hundreds of thousands of concurrent sockets. Tuning parameters like GOMAXPROCS, adjusting the event buffer size, or modifying poller wake-up intervals can yield measurable performance improvements, particularly in tail latencies. For example, in a system handling hundreds of thousands of concurrent HTTP/2 streams, increasing GOMAXPROCS while using GODEBUG=netpollWaitLatency=100 helped reduce the 99th percentile read latency by over 15%, simply by preventing poller starvation under I/O backpressure.

As with all low-level tuning, it’s not about changing knobs blindly. It’s about knowing what Go’s netpoller is doing, why it’s structured the way it is, and where its boundaries can be nudged for just a bit more efficiency—when measurements tell you it’s worth it.

Thread Pinning with LockOSThread and GODEBUG Flags¶

Go offers tools like runtime.LockOSThread() to pin a goroutine to a specific OS thread, but in most real-world applications, the payoff is minimal. Benchmarks consistently show that for typical server workloads—especially those that are CPU-bound—Go’s scheduler handles thread placement well without manual intervention. Introducing thread pinning tends to add complexity without delivering measurable gains.

There are exceptions. In ultra-low-latency or real-time systems, pinning can help reduce jitter by avoiding thread migration. But these gains typically require isolated CPU cores, tightly controlled environments, and strict latency targets. In practice, that means bare metal. On shared infrastructure—especially in cloud environments like AWS where cores are virtualized and noisy neighbors are common—thread pinning rarely delivers any measurable benefit.

If you’re exploring pinning, it’s not enough to assume benefit—you need to benchmark it. Enabling GODEBUG=schedtrace=1000,scheddetail=1 gives detailed insight into how goroutines are scheduled and whether contention or migration is actually a problem. Without that evidence, thread pinning is more likely to hinder than help.

Here’s how developers might pin threads cautiously:

runtime.LockOSThread()
defer runtime.UnlockOSThread()

// perform critical latency-sensitive work here

Always pair such modifications with extensive metrics collection and scheduler tracing (GODEBUG=schedtrace=1000,scheddetail=1) to validate tangible gains over Go’s robust default scheduling behavior.

CPU Affinity and External Tools¶

Using external tools like taskset or system calls such as sched_setaffinity can bind threads or processes to specific CPU cores. While theoretically beneficial for cache locality and predictable performance, extensive benchmarking consistently demonstrates limited practical value in most Go applications. Explicit CPU affinity management typically helps only in tightly controlled environments with:

  • Real-time latency constraints (microsecond-level jitter).
  • Dedicated and isolated CPUs (e.g., via the Linux kernel’s isolcpus).
  • Avoidance of thread migration on NUMA hardware.

Example of cautious CPU affinity usage:

func setAffinity(cpuList []int) error {
    pid := os.Getpid()
    var mask unix.CPUSet
    for _, cpu := range cpuList {
        mask.Set(cpu)
    }
    return unix.SchedSetaffinity(pid, &mask)
}

func main() {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    if err := setAffinity([]int{2, 3}); err != nil {
        log.Fatalf("CPU affinity failed: %v", err)
    }
    // perform critical work with confirmed benefit
}

Without dedicated benchmarking and validation, these techniques may degrade performance, starve other processes, or introduce subtle latency regressions. Treat thread pinning and CPU affinity as highly specialized tools—effective only after meticulous measurement confirms their benefit.

Tuning Go at the scheduler level can unlock significant performance gains, but it demands an intimate understanding of P’s, M’s, and G’s. Blindly upping GOMAXPROCS or pinning threads without measurement can backfire. Treat these knobs as surgical tools: use GODEBUG traces to diagnose, isolate subsystems where affinity or pinning makes sense, and always validate with benchmarks and profiles.

Go’s runtime is ever-evolving. Upcoming work in preemptive scheduling and user-level interrupts promises to reduce tail latency further and improve fairness. Until then, these low-level levers remain some of the most powerful ways to squeeze every drop of performance from your Go services.

epoll

Pattern 2: Comparing TCP, HTTP/2, and gRPC Performance in Go

In distributed systems, the choice of communication protocol shapes how services interact under real-world load. It influences not just raw throughput and latency, but also how well the system scales, how much CPU and memory it consumes, and how predictable its behavior remains under pressure. In this article, we dissect three prominent options—raw TCP with custom framing, HTTP/2 via Go's built-in net/http package, and gRPC—and explore their performance characteristics through detailed benchmarks and practical scenarios.

Raw TCP with Custom Framing¶

Raw TCP provides maximum flexibility with virtually no protocol overhead, but that comes at a cost: all message boundaries, framing logic, and error handling must be implemented manually. Since TCP delivers a continuous byte stream with no inherent notion of messages, applications must explicitly define how to separate and interpret those bytes.

Custom Framing Protocol¶

A common way to handle message boundaries over TCP is to use length-prefix framing: each message starts with a 4-byte header that tells the receiver how many bytes to read next. The length is encoded in big-endian format, following the standard network byte order, so it behaves consistently across different systems. This setup solves a core issue with TCP—while it guarantees reliable delivery, it doesn’t preserve message boundaries. Without knowing the size upfront, the receiver has no way to tell where one message ends and the next begins.

TCP guarantees reliable, in-order delivery of bytes, but it does not preserve or indicate message boundaries. For example, if a client sends three logical messages:

[msg1][msg2][msg3]

the server may receive them as a continuous byte stream with arbitrary segmentations, such as:

[msg1_part][msg2][msg3_part]

TCP delivers a continuous stream of bytes with no built-in concept of where one message stops and another starts. This means the receiver can’t rely on read boundaries to infer message boundaries—what arrives might be a partial message, multiple messages concatenated, or an arbitrary slice of both. To make sense of structured data over such a stream, the application needs a framing strategy. Length-prefixing does this by including the size of the message up front, so the receiver knows exactly how many bytes to expect before starting to parse the payload.

Protocol Structure¶

While length-prefixing is the most common and efficient framing strategy, other options exist depending on the use case, each with its own trade-offs in simplicity, robustness, and flexibility. Delimiter-based framing uses a specific byte or sequence—like \n or \0—to signal the end of a message. It’s easy to implement but fragile if the delimiter can appear in the payload. Fixed-size framing avoids ambiguity by making every message the same length, which simplifies parsing and memory allocation but doesn’t work well when message sizes vary. Self-describing formats like Protobuf or ASN.1 embed length and type information inside the payload itself, allowing for richer structure and evolution over time, but require more sophisticated parsing logic and schema awareness on both ends. Choosing the right approach depends on how much control you need, how predictable your data is, and how much complexity you’re willing to absorb.

Each frame of the length-prefixing implementation consists of:

| Length (4 bytes) | Payload (Length bytes) |

  • Length: A 4-byte unsigned integer encoded in big-endian format (network byte order), representing the number of bytes in the payload.
  • Payload: Raw binary data of arbitrary length.

The use of binary.BigEndian.PutUint32 ensures the frame length is encoded in a standardized format—most significant byte first—allowing for predictable decoding and reliable interoperation between heterogeneous systems. This follows the established convention of network byte order, which is defined as big-endian in RFC 791, Section 3.1 and used consistently in transport and application protocols such as TCP (RFC 793).

func writeFrame(conn net.Conn, payload []byte) error {
    frameLen := uint32(len(payload))
    buf := make([]byte, 4+len(payload))
    binary.BigEndian.PutUint32(buf[:4], frameLen)
    copy(buf[4:], payload)
    _, err := conn.Write(buf)
    return err
}

func readFrame(conn net.Conn) ([]byte, error) {
    lenBuf := make([]byte, 4)
    if _, err := io.ReadFull(conn, lenBuf); err != nil {
        return nil, err
    }
    frameLen := binary.BigEndian.Uint32(lenBuf)
    payload := make([]byte, frameLen)
    if _, err := io.ReadFull(conn, payload); err != nil {
        return nil, err
    }
    return payload, nil
}

This approach is straightforward, performant, and predictable, yet it provides no built-in concurrency management, request multiplexing, or flow control—these must be explicitly managed by the developer.

Disadvantages¶

While the protocol is efficient and minimal, it lacks several features commonly found in more complex transport protocols. The lack of built-in framing features in raw TCP means that key responsibilities shift entirely to the application layer. There’s no support for multiplexing, so only one logical message can be in flight per connection unless additional coordination is built manually—pushing clients to open multiple connections to achieve parallelism. Flow control is also absent; unlike HTTP/2 or gRPC, there’s no way to signal backpressure, making it easy for a fast sender to overwhelm a slow receiver, potentially exhausting memory or triggering a crash. There’s no space for structured metadata like message types, compression flags, or trace context unless you embed them yourself into the payload format. And error handling is purely ad hoc—there’s no protocol-level mechanism for communicating faults, so malformed frames or incorrect lengths often lead to abrupt connection resets or inconsistent state.

These limitations might be manageable in tightly scoped, high-performance systems where both ends of the connection are under full control and the protocol behavior is well understood. In such environments, the minimal overhead and direct access to the wire can justify the trade-offs. But in broader production contexts—especially those involving multiple teams, evolving requirements, or untrusted clients—they introduce significant risk. Without strict validation, clear framing, and robust error handling, even small inconsistencies can lead to silent corruption, resource leaks, or hard-to-diagnose failures. Building on raw TCP demands both precise engineering and long-term maintenance discipline.

Performance Insights¶

  • Latency: Lowest achievable due to minimal overhead; ideal for latency-critical scenarios like financial trading systems.
  • Throughput: Excellent, constrained only by network and application-layer handling.
  • CPU/Memory Cost: Very low, with negligible overhead from protocol management.

HTTP/2 via net/http¶

HTTP/2 brought several protocol-level improvements over HTTP/1.1, including multiplexed streams over a single connection, header compression via HPACK, and support for server push. In Go, these features are integrated directly into the net/http standard library, which handles connection reuse, stream multiplexing, and concurrency without requiring manual intervention.

Unlike raw TCP, where applications must explicitly define message boundaries, HTTP/2 defines them at the protocol level: each request and response is framed using structured HEADERS and DATA frames and explicitly closed with an END_STREAM flag. These frames are handled entirely within Go’s HTTP/2 implementation, so developers interact with complete, logically isolated messages using the standard http.Request and http.ResponseWriter interfaces. You don’t have to deal with byte streams or worry about where a message starts or ends—by the time a request hits your handler, it’s already been framed and parsed. When you write a response, the runtime takes care of wrapping it up and signaling completion. That frees you up to focus on the logic, not the plumbing, while still getting the performance benefits of HTTP/2 like multiplexing and connection reuse.

Server Implementation¶

Beyond framing and multiplexing, HTTP/2 brings a handful of practical advantages that make server code easier to write and faster to run. It handles connection reuse out of the box, applies flow control to avoid overwhelming either side, and compresses headers using HPACK to cut down on overhead. Go’s net/http stack takes care of all of this behind the scenes, so you get the benefits without needing to wire it up yourself. As a result, developers can build concurrent, efficient servers without managing low-level connection or stream state manually.

func handler(w http.ResponseWriter, r *http.Request) {
    payload, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "invalid request", http.StatusBadRequest)
        return
    }
    defer r.Body.Close()

    // Process payload...
    _ = payload

    w.WriteHeader(http.StatusOK)
    w.Write([]byte("processed"))
}

func main() {
    server := &http.Server{
        Addr:    ":8080",
        Handler: http.HandlerFunc(handler),
    }
    log.Fatal(server.ListenAndServeTLS("server.crt", "server.key"))
}

Info: Even though this is not mentioned explicitly, this code serves HTTP/2 because it uses ListenAndServeTLS, which enables TLS-based communication. Go's net/http package automatically upgrades connections to HTTP/2 when a client supports it via ALPN (Application-Layer Protocol Negotiation) during the TLS handshake. Since Go 1.6, this upgrade is implicit—no extra configuration is required. The server transparently handles HTTP/2 requests while remaining compatible with HTTP/1.1 clients.

HTTP/2’s multiplexing capability allows multiple independent streams to share a single TCP connection without blocking each other, which significantly improves connection reuse. This reduces the overhead of establishing and managing parallel connections, especially under high concurrency. As a result, latency is lower and throughput more consistent, even when multiple requests are in flight. These traits make HTTP/2 well-suited for general-purpose web services and internal APIs—places where predictable latency, efficient connection reuse, and solid concurrency handling carry more weight than raw protocol minimalism.

Performance Insights¶

  • Latency: Slightly higher than raw TCP because of framing and compression overhead, but stable and consistent thanks to multiplexing and persistent connections.
  • Throughput: High under concurrent load; stream multiplexing and header compression help sustain performance without opening more sockets.
  • CPU/Memory Cost: Moderate overhead, mostly due to header processing, TLS encryption, and flow control mechanisms.

gRPC¶

gRPC is a high-performance, contract-first RPC framework built on top of HTTP/2, designed for low-latency, cross-language communication between services. It combines streaming-capable transport with strongly typed APIs defined using Protocol Buffers (Protobuf), enabling compact, efficient message serialization and seamless interoperability across platforms. Unlike traditional HTTP APIs, where endpoints are loosely defined by URL patterns and free-form JSON, gRPC enforces strict interface contracts through .proto definitions, which serve as both schema and implementation spec. The gRPC toolchain generates client and server code for multiple languages, eliminating manual serialization, improving safety, and standardizing interactions across heterogeneous systems.

gRPC takes advantage of HTTP/2’s core features—stream multiplexing, flow control, and binary framing—to support both one-off RPC calls and full-duplex streaming, all with built-in backpressure. But it goes further than just transport. It bakes in support for deadlines, cancellation, structured metadata, and standardized error reporting, all of which help services communicate clearly and fail predictably. This makes gRPC a solid choice for internal APIs, service meshes, and performance-critical systems where you need efficiency, strong contracts, and reliable behavior under load.

gRPC Service Definition¶

A minimal .proto file example:

syntax = "proto3";

service EchoService {
  rpc Echo(EchoRequest) returns (EchoResponse);
}

message EchoRequest {
  string message = 1;
}

message EchoResponse {
  string message = 1;
}

Generated Go stubs allow developers to easily implement the service:

type server struct {
    UnimplementedEchoServiceServer
}

func (s *server) Echo(ctx context.Context, req *EchoRequest) (*EchoResponse, error) {
    return &EchoResponse{Message: req.Message}, nil
}

func main() {
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }
    grpcServer := grpc.NewServer()
    RegisterEchoServiceServer(grpcServer, &server{})
    grpcServer.Serve(lis)
}

Performance Insights¶

  • Latency: Slightly higher than raw HTTP/2 due to additional serialization/deserialization steps, yet still performant for most scenarios.
  • Throughput: High throughput thanks to efficient payload serialization (protobuf) and inherent HTTP/2 multiplexing capabilities.
  • CPU/Memory Cost: Higher than HTTP/2 due to protobuf encoding overhead; memory consumption slightly increased due to temporary object allocations.

Choosing the Right Protocol¶

  • Internal APIs and microservices: gRPC usually hits the sweet spot—it’s fast, strongly typed, and easy to work with once the tooling is in place.
  • Low-latency systems and trading platforms: Raw TCP with custom framing gives you the lowest overhead, but you’re on your own for everything else.
  • Public APIs or general web services: HTTP/2 via net/http is a solid choice. You get connection reuse, multiplexing, and good performance without needing to pull in a full RPC stack.

Raw TCP gives you maximum control and the best performance on paper—but it also means building everything yourself: framing, flow control, error handling. HTTP/2 and gRPC trade some of that raw speed for built-in structure, better connection handling, and less code to maintain. What’s right depends entirely on where performance matters and how much complexity you want to own.

net/http

Pattern 3: A minimal .proto file example:

.proto

Pattern 4: Example: Optimizing buffer reuse using sync.Pool greatly reduces GC pressure during high-volume network operations.

sync.Pool
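
The pattern itself isn't shown here, so the following is a minimal, hedged sketch of what buffer reuse with sync.Pool typically looks like in a connection handler (the buffer size and handler name are illustrative, not from the original docs):

import (
    "net"
    "sync"
)

// bufPool hands out reusable 32 KiB read buffers instead of allocating
// a fresh slice per connection, which keeps allocation rate and GC work down.
var bufPool = sync.Pool{
    New: func() any { return make([]byte, 32*1024) },
}

func handleConn(conn net.Conn) {
    defer conn.Close()

    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf) // return the buffer for reuse instead of leaving it to the GC

    for {
        n, err := conn.Read(buf)
        if n > 0 {
            _ = buf[:n] // process the received chunk here
        }
        if err != nil {
            return
        }
    }
}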

Pattern 5: Here’s the safe pattern:

defer resp.Body.Close()
io.Copy(io.Discard, resp.Body)
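
In context, the same pattern might look like this hedged sketch (the function name and URL are illustrative); draining the body before it is closed lets the underlying keep-alive connection return to the pool for reuse:

func ping(url string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Drain any unread bytes so the TCP connection can be reused.
    _, err = io.Copy(io.Discard, resp.Body)
    return err
}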

Pattern 6: Practical Example: Profiling Networked Go Applications with pprof

This section walks through a demo application instrumented with benchmarking tools and runtime profiling to ground profiling concepts in a real-world context. It covers identifying performance bottlenecks, interpreting flame graphs, and analyzing system behavior under various simulated network conditions.

CPU Profiling in Networked Apps¶

The demo application is intentionally designed to be as simple as possible to highlight key profiling concepts without unnecessary complexity. While the code and patterns used in the demo are basic, the profiling insights gained here are highly applicable to more complex, production-grade applications.

To enable continuous profiling under load, we expose pprof via a dedicated HTTP endpoint:

import (
    _ "net/http/pprof"
)

// ...

// Start pprof in a separate goroutine.
go func() {
    log.Println("pprof listening on :6060")
    if err := http.ListenAndServe("localhost:6060", nil); err != nil {
        log.Fatalf("pprof server error: %v", err)
    }
}()

Full net-app source code:

package main

import (
    "context"
    "flag"
    "fmt"
    "log"
    "math/rand/v2"
    "net/http"
    _ "net/http/pprof"
    "os"
    "os/signal"
    "time"
)

var (
    fastDelay  = flag.Duration("fast-delay", 0, "Fixed delay for fast handler (if any)")
    slowMin    = flag.Duration("slow-min", 1*time.Millisecond, "Minimum delay for slow handler")
    slowMax    = flag.Duration("slow-max", 300*time.Millisecond, "Maximum delay for slow handler")
    gcMinAlloc = flag.Int("gc-min-alloc", 50, "Minimum number of allocations in GC heavy handler")
    gcMaxAlloc = flag.Int("gc-max-alloc", 1000, "Maximum number of allocations in GC heavy handler")
)

func randRange(min, max int) int {
    return rand.IntN(max-min) + min
}

func fastHandler(w http.ResponseWriter, r *http.Request) {
    if *fastDelay > 0 {
        time.Sleep(*fastDelay)
    }
    fmt.Fprintln(w, "fast response")
}

func slowHandler(w http.ResponseWriter, r *http.Request) {
    delayRange := int((*slowMax - *slowMin) / time.Millisecond)
    delay := time.Duration(randRange(1, delayRange)) * time.Millisecond
    time.Sleep(delay)
    fmt.Fprintf(w, "slow response with delay %d ms\n", delay.Milliseconds())
}

var longLivedData [][]byte

func gcHeavyHandler(w http.ResponseWriter, r *http.Request) {
    numAllocs := randRange(*gcMinAlloc, *gcMaxAlloc)
    var data [][]byte
    for i := 0; i < numAllocs; i++ {
        // Allocate 10KB slices. Occasionally retain a reference to simulate long-lived objects.
        b := make([]byte, 1024*10)
        data = append(data, b)
        if i%100 == 0 {
            // every 100 allocations, keep the data alive
            longLivedData = append(longLivedData, b)
        }
    }
    fmt.Fprintf(w, "allocated %d KB\n", len(data)*10)
}

func main() {
    flag.Parse()

    http.HandleFunc("/fast", fastHandler)
    http.HandleFunc("/slow", slowHandler)
    http.HandleFunc("/gc", gcHeavyHandler)

    // Start pprof in a separate goroutine.
    go func() {
        log.Println("pprof listening on :6060")
        if err := http.ListenAndServe("localhost:6060", nil); err != nil {
            log.Fatalf("pprof server error: %v", err)
        }
    }()

    // Create a server to allow for graceful shutdown.
    server := &http.Server{Addr: ":8080"}
    go func() {
        log.Println("HTTP server listening on :8080")
        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("HTTP server error: %v", err)
        }
    }()

    // Graceful shutdown on interrupt signal.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt)
    <-sigCh
    log.Println("Shutting down server...")
    if err := server.Shutdown(context.Background()); err != nil {
        log.Fatalf("Server Shutdown Failed: %+v", err)
    }
    log.Println("Server exited")
}

The next step is to establish a connection with the profiled app and collect samples:

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

View results interactively:

go tool pprof -http=:7070 cpu.prof
# the actual cpu.prof path will be something like $HOME/pprof/pprof.net-app.samples.cpu.004.pb.gz

or you can save the profiling graph as an SVG image.

CPU Profiling Walkthrough: Load on the /gc Endpoint¶

We profiled the application during a 30-second load test targeting the /gc endpoint to see what happens under memory pressure. This handler was intentionally designed to trigger allocations and force garbage collection, which makes it a great candidate for observing runtime behavior under stress. We used Go’s built-in profiler to capture a CPU trace.

[Figure: CPU profiling trace for the /gc endpoint]

This gave us 3.02 seconds of sampled CPU activity out of 30 seconds of wall-clock time—a useful window into what the runtime and application were doing under pressure.

Where the Time Went¶

HTTP Stack Dominates the Surface¶

As expected, the majority of CPU time was spent on request handling:

  • http.(*conn).serve accounted for nearly 58% of sampled time
  • http.serverHandler.ServeHTTP appeared prominently as well

This aligns with the fact that we were sustaining constant traffic. The Go HTTP stack is doing the bulk of the work, managing connections and dispatching requests.

Garbage Collection Overhead is Clearly Visible¶

A large portion of CPU time was spent inside the garbage collector:

  • runtime.gcDrain, runtime.scanobject, and runtime.gcBgMarkWorker were all active
  • Combined with memory-related functions like runtime.mallocgc, these accounted for roughly 20% of total CPU time

This confirms that gcHeavyHandler is achieving its goal. What we care about is whether this kind of allocation pressure leaks into real-world handlers. If it does, we’re paying for it in latency and CPU churn.

I/O and Syscalls Take a Big Slice¶

We also saw high syscall activity—especially from:

  • syscall.syscall (linked to poll, Read, and Write)
  • bufio.Writer.Flush and http.response.finishRequest

These functions reflect the cost of writing responses back to clients. For simple handlers, this is expected. But if your handler logic is lightweight and most of the time is spent just flushing data over TCP, it’s worth asking whether the payloads or buffer strategies could be optimized.

Scheduler Activity Is Non-Trivial¶

Functions like runtime.schedule, mcall, and findRunnable were also on the board. These are Go runtime internals responsible for managing goroutines. Seeing them isn’t unusual during high-concurrency tests—but if they dominate, it often points to excessive goroutine churn or blocking behavior.

Memory Profiling: Retained Heap from the /gc Endpoint¶

We also captured a memory profile to complement the CPU view while hammering the /gc endpoint. This profile used the inuse_space metric, which shows how much heap memory is actively retained by each function at the time of capture. We triggered the profile with:

go tool pprof -http=:7070 http://localhost:6060/debug/pprof/heap

[Figure: Memory profiling for the /gc endpoint]

At the time of capture, the application retained 649MB of heap memory, and almost all of it—99.46%—was attributed to a single function: gcHeavyHandler. This was expected. The handler simulates allocation pressure by creating 10KB slices in a tight loop. Every 100th slice is added to a global variable to simulate long-lived memory. Here’s what the handler does:

var longLivedData [][]byte

func gcHeavyHandler(w http.ResponseWriter, r *http.Request) {
    numAllocs := randRange(*gcMinAlloc, *gcMaxAlloc)
    var data [][]byte
    for i := 0; i < numAllocs; i++ {
        // Allocate 10KB slices. Occasionally retain a reference to simulate long-lived objects.
        b := make([]byte, 1024*10)
        data = append(data, b)
        if i%100 == 0 {
            // every 100 allocations, keep the data alive
            longLivedData = append(longLivedData, b)
        }
    }
    fmt.Fprintf(w, "allocated %d KB\n", len(data)*10)
}

The flamegraph confirmed what we expected: gcHeavyHandler accounted for nearly all memory in use. The path traced cleanly from the HTTP connection, through the Go router stack, into the handler logic. No significant allocations came from elsewhere—this was a focused, controlled memory pressure scenario.

This type of profile is valuable because it reveals what is still being held in memory, not just what was allocated. This view is often the most revealing for diagnosing leaks, retained buffers, or forgotten references.

Summary: CPU and Memory Profiling of the /gc Endpoint¶

The /gc endpoint was intentionally built to simulate high allocation pressure and GC activity. Profiling this handler under load gave us a clean, focused view of how the Go runtime behaves when pushed to its memory limits.

From the CPU profile, we saw that:

  • As expected, most of the time was spent in the HTTP handler path during sustained load.
  • Nearly 20% of CPU samples were attributed to memory allocation and garbage collection.
  • Syscall activity was high, mostly from writing responses.
  • The Go scheduler was moderately active, managing the concurrent goroutines handling traffic.

From the memory profile, we captured 649MB of live heap usage, with 99.46% of it retained by gcHeavyHandler. This matched our expectations: the handler deliberately retains every 100th 10KB allocation to simulate long-lived data.

Together, these profiles give us confidence that the /gc endpoint behaves as intended under synthetic pressure:

  • It creates meaningful CPU and memory load.
  • It exposes the cost of sustained allocations and GC cycles.
  • It provides a predictable environment for testing optimizations or GC tuning strategies.

pprof

Reference Files

This skill includes comprehensive documentation in references/:

  • benchmarking.md - Benchmarking documentation
  • connection_management.md - Connection Management documentation
  • dns_tuning.md - Dns Tuning documentation
  • networking_fundamentals.md - Networking Fundamentals documentation
  • other.md - Other documentation
  • tls_security.md - Tls Security documentation

Use view to read specific reference files when detailed information is needed.

Working with This Skill

For Beginners

Start with the networking_fundamentals reference file for foundational concepts.

For Specific Features

Use the appropriate category reference file (api, guides, etc.) for detailed information.

For Code Examples

The quick reference section above contains common patterns extracted from the official docs.

Resources

references/

Organized documentation extracted from official sources. These files contain:

  • Detailed explanations
  • Code examples with language annotations
  • Links to original documentation
  • Table of contents for quick navigation

scripts/

Add helper scripts here for common automation tasks.

assets/

Add templates, boilerplate, or example projects here.

Notes

  • This skill was automatically generated from official documentation
  • Reference files preserve the structure and examples from source docs
  • Code examples include language detection for better syntax highlighting
  • Quick reference patterns are extracted from common usage examples in the docs

Updating

To refresh this skill with updated documentation:

  1. Re-run the scraper with the same configuration
  2. The skill will be rebuilt with the latest information

Go-Networking - Tls Security

Pages: 1


Optimizing TLS for Speed - Go Optimization Guide

URL: https://goperf.dev/02-networking/tls-for-speed/

Contents:

  • Optimizing TLS for Speed: Handshake, Reuse, and Cipher Choice¶
  • Understanding TLS Overhead: Where Performance Suffers¶
  • Session Resumption: Cutting Handshake Latency¶
  • Choosing Cipher Suites Wisely¶
  • Using ALPN Wisely¶
  • Minimizing Certificate Verification Overhead¶
  • TLS Best Practices in Go¶

TLS does what it’s supposed to: it keeps your connections private and trustworthy. But it also slows things down — a lot more than most people realize. In Go, if you care about how quickly your service responds, you can squeeze out better performance by tuning how TLS negotiates and what it negotiates.

Most of the slowdown in TLS happens right at the start. The handshake is a back-and-forth process: picking algorithms, swapping keys, proving identities, and setting up the session. That back-and-forth usually takes two full trips across the network. In something like a trading platform or a real-time app, that delay is noticeable.

To make TLS faster, the most effective place to start is cutting down the handshake steps and making the crypto work less expensive.
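
Before cutting anything, it's worth measuring what the handshake actually costs in your environment. A hedged sketch using net/http/httptrace (the target URL is illustrative):

req, err := http.NewRequest(http.MethodGet, "https://example.com", nil) // illustrative target
if err != nil {
    log.Fatal(err)
}

var start time.Time
trace := &httptrace.ClientTrace{
    TLSHandshakeStart: func() { start = time.Now() },
    TLSHandshakeDone: func(_ tls.ConnectionState, _ error) {
        log.Printf("TLS handshake took %v", time.Since(start))
    },
}
req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

resp, err := http.DefaultClient.Do(req)
if err != nil {
    log.Fatal(err)
}
resp.Body.Close()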

Because every new TLS connection runs the entire handshake — negotiating ciphers, exchanging keys, verifying certificates — it introduces noticeable latency. Session resumption sidesteps most of that by reusing the cryptographic state from an earlier session, making reconnects much faster.

Session resumption is a mechanism in TLS to avoid repeating the full handshake on reconnect. There are two main approaches: session IDs and session tickets. Both rely on the idea that the server remembers (or encodes) the session’s cryptographic state from a prior connection. When a client reconnects, it presents either the session ID or the session ticket, allowing the server to restore the session state and skip expensive asymmetric key exchange.

A session ticket is a data blob issued by the server to the client at the end of the handshake. This ticket contains the encrypted session state (such as negotiated cipher suite, keys, and session parameters) and is opaque to the client. On reconnect, the client sends the ticket back, and the server decrypts it to resume the session without performing a full handshake.

In Go, you enable session resumption by setting up session ticket keys. The server uses these keys to encrypt and decrypt the session state that clients send back when resuming a connection. You can generate a secure 32‑byte key at startup with crypto/rand and reuse it if your service is running across multiple instances behind a load balancer. Just make sure to rotate the key now and then to keep it secure.
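
As a hedged sketch of that setup (the rotation cadence and error handling are illustrative), generate the key with crypto/rand and install it via SetSessionTicketKeys:

var key [32]byte
if _, err := rand.Read(key[:]); err != nil { // crypto/rand
    log.Fatalf("generating session ticket key: %v", err)
}

tlsConfig := &tls.Config{MinVersion: tls.VersionTLS12}

// Pass several keys during rotation: the first encrypts new tickets,
// the rest remain valid for decrypting tickets issued earlier.
tlsConfig.SetSessionTicketKeys([][32]byte{key})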

What makes session resumption effective is that it avoids re‑doing the slowest parts of TLS. Instead of negotiating everything from scratch, the server decrypts the ticket, verifies it, and rest

[Content truncated]

Examples:

Example 1 (unknown):

sequenceDiagram
    participant Client
    participant Server

    Client->>Server: ClientHello (supported ciphers, random)
    Server->>Client: ServerHello (chosen cipher, random)
    Server->>Client: Certificate
    Server->>Client: ServerKeyExchange
    Client->>Server: ClientKeyExchange
    Client->>Server: ChangeCipherSpec
    Server->>Client: ChangeCipherSpec
    Note over Client,Server: Handshake Complete – Encrypted communication begins

Example 2 (unknown):

tlsConfig := &tls.Config{
    SessionTicketsDisabled: false, // Enable session tickets explicitly
    SessionTicketKey: [32]byte{...}, // Persist securely and rotate periodically
    // Note: the SessionTicketKey field is deprecated; prefer
    // Config.SetSessionTicketKeys, which supports key rotation.
}

Example 3 (unknown):

tlsConfig := &tls.Config{
    // CipherSuites applies to TLS 1.2 and earlier; TLS 1.3 suites
    // are not configurable in crypto/tls.
    CipherSuites: []uint16{
        tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
        tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
        tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
    },
    PreferServerCipherSuites: true, // deprecated and ignored since Go 1.18
}

Example 4 (unknown):

tlsConfig := &tls.Config{
    NextProtos: []string{"h2", "http/1.1"},
}
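
A hedged sketch of wiring such a config into a server (the certificate paths are illustrative); Go negotiates the protocol from NextProtos during the TLS handshake:

srv := &http.Server{
    Addr:      ":443",
    TLSConfig: tlsConfig, // the config with NextProtos shown above
}
log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key")) // illustrative cert paths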
