Claude Skill from blog: https://goperf.dev/01-common-patterns/

Go-Performance - Compiler Optimization

Pages: 1


Leveraging Compiler Optimization Flags - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/comp-flags/

Contents:

  • Leveraging Compiler Optimization Flags in Go¶
  • Why Compiler Flags Matter¶
  • Key Compiler and Linker Flags¶
    • -ldflags="-s -w" — Strip Debug Info¶
    • -gcflags — Control Compiler Optimizations¶
    • Cross-Compilation Flags¶
    • Build Tags¶
    • -ldflags="-X ..." — Inject Build-Time Variables¶
    • -extldflags='-static' — Build Fully Static Binaries¶
      • Example: Static Build with libcurl via CGO¶

When tuning Go applications for performance, most of the attention goes to runtime behavior—profiling hot paths, trimming allocations, improving concurrency. But there’s another layer that’s easy to miss: what the Go compiler does with your code before it ever runs. The build process includes several optimization passes, and understanding how to surface or influence them can give you clearer insights into what’s actually happening under the hood. It’s not about tweaking obscure flags to squeeze out extra instructions—it’s about knowing how the compiler treats your code so you’re not working against it.

While Go doesn’t expose the same granular set of compiler flags as C or Rust, it still provides useful ways to influence how your code is built—especially when targeting performance, binary size, or specific environments.

Go's compiler (specifically cmd/compile and cmd/link) performs several default optimizations: inlining, escape analysis, dead code elimination, and more. However, there are scenarios where you can squeeze more performance or control from your build using the right flags.

When you want to shrink binary size, especially in production or containers:

Why it matters: This can reduce binary size by up to 30-40%, depending on your codebase. It is useful in Docker images or when distributing binaries.

The -gcflags flag allows you to control how the compiler treats specific packages. For example, you can disable optimizations for debugging:

When to use: During debugging sessions with Delve or similar tools. Turning off inlining and optimizations makes stack traces and breakpoints more reliable.

Need to build for another OS or architecture?

Build tags allow conditional compilation. Use //go:build or // +build in your source code to control what gets compiled in.
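
For example, a file guarded by a debug tag (like the one in Example 4 below) is only compiled in when that tag is passed at build time. A quick sketch, with illustrative names:

go build -tags debug -o app main.go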

You can inject version numbers or metadata into your binary at build time:

This sets the version variable at link time without modifying your source code. It's useful for embedding release versions, commit hashes, or build dates.
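
A minimal sketch of the idea, assuming a string variable named version in package main (both names are illustrative):

// main.go
package main

var version = "dev" // replaced at link time

func main() {
    println("version:", version)
}

Built, for example, with:

go build -ldflags="-X main.version=1.2.3" -o app main.go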

The -extldflags '-static' option passes the -static flag to the external system linker, instructing it to produce a fully statically linked binary.

This is especially useful when you're using CGO and want to avoid runtime dynamic library dependencies:
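
A sketch of such a build; the exact flags depend on your toolchain and libc (fully static linking against glibc has caveats around NSS and DNS, addressed below):

CGO_ENABLED=1 go build -ldflags '-linkmode external -extldflags "-static"' -o app main.go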

To go further and ensure your binary avoids relying on C library DNS resolution (such as glibc's getaddrinfo), you can use the netgo build tag. This forces Go to use its pure Go implementation of the DNS resolver.

[Content truncated]

Examples:

Example 1 (unknown):

go build -ldflags="-s -w" -o app main.go

Example 2 (unknown):

go build -gcflags="all=-N -l" -o app main.go

Example 3 (unknown):

GOOS=linux GOARCH=arm64 go build -o app main.go

Example 4 (go):

//go:build debug

package main

import "log"

func debugLog(msg string) {
    log.Println("[DEBUG]", msg)
}

Go-Performance - Concurrency

Pages: 5


Lazy Initialization - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/lazy-init/

Contents:

  • Lazy Initialization
  • Lazy Initialization for Performance in Go¶
    • Why Lazy Initialization Matters¶
    • Using sync.Once for Thread-Safe Initialization¶
    • Using sync.OnceValue and sync.OnceValues for Initialization with Output Values¶
    • Custom Lazy Initialization with Atomic Operations¶
    • Performance Considerations¶
  • Benchmarking Impact¶
  • When to Choose Lazy Initialization¶

In Go, some resources are expensive to initialize, or simply unnecessary unless certain code paths are triggered. That’s where lazy initialization becomes useful: it defers the construction of a value until the moment it’s actually needed. This pattern can improve performance, reduce startup overhead, and avoid unnecessary work—especially in high-concurrency applications.

Initializing heavy resources like database connections, caches, or large in-memory structures at startup can slow down application launch and consume memory before it’s actually needed. Lazy initialization defers this work until the first time the resource is used, keeping startup fast and memory usage lean.

It’s also a practical pattern when you have logic that might be triggered multiple times but should only run once—ensuring that expensive operations aren’t repeated and that initialization remains safe and idempotent across concurrent calls.

Go provides the sync.Once type to implement lazy initialization safely in concurrent environments:

In this example, the function expensiveInit() executes exactly once, no matter how many goroutines invoke getResource() concurrently. This ensures thread-safe initialization without additional synchronization overhead.

Since Go 1.21, if your initialization logic returns a value, you might prefer using sync.OnceValue (single value) or sync.OnceValues (multiple values) for simpler, more expressive code:

Here, sync.OnceValue provides a concise way to wrap one-time initialization logic and access the result without managing flags or mutexes manually. It simplifies lazy loading by directly returning the computed value on demand.

For cases where the initializer returns more than one value—such as a resource and an error—sync.OnceValues extends the same idea. It ensures the function runs exactly once and cleanly unpacks the results, keeping the code readable and thread-safe without boilerplate.

Choosing sync.OnceValue or sync.OnceValues helps you clearly express initialization logic with direct value returns, whereas sync.Once remains best suited for general scenarios requiring flexible initialization logic without immediate value returns.

Yes, it’s technically possible to replace sync.Once, sync.OnceValue, or sync.OnceFunc with custom logic using low-level atomic operations like atomic.CompareAndSwap or atomic.Load/Store. In rare, performance-critical paths, this can avoid the small overhead or allocations that come with the standard types.
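
A minimal sketch of that approach, assuming expensiveInit can tolerate the rare duplicate call when two goroutines race to initialize (the first result to be published wins):

var cached atomic.Pointer[MyResource]

func getResource() *MyResource {
    if r := cached.Load(); r != nil {
        return r
    }
    r := expensiveInit()
    // Publish the result; if another goroutine won the race, return its value instead.
    if !cached.CompareAndSwap(nil, r) {
        return cached.Load()
    }
    return r
}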

Howev

[Content truncated]

Examples:

Example 1 (go):

var (
    resource *MyResource
    once     sync.Once
)

func getResource() *MyResource {
    once.Do(func() {
        resource = expensiveInit()
    })
    return resource
}

Example 2 (go):

var getResource = sync.OnceValue(func() *MyResource {
    return expensiveInit()
})

func processData() {
    res := getResource()
    // use res
}

Example 3 (go):

var getConfig = sync.OnceValues(func() (*Config, error) {
    return loadConfig("config.yml")
})

func processData() {
    config, err := getConfig()
    if err != nil {
        log.Fatal(err)
    }
    // use config
}

Goroutine Worker Pools - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/worker-pool/

Contents:

  • Goroutine Worker Pools in Go¶
  • Why Worker Pools Matter¶
  • Basic Worker Pool Implementation¶
    • Worker Count and CPU Cores¶
    • Why Too Many Workers Hurts Performance¶
  • Benchmarking Impact¶
  • When To Use Worker Pools¶

Go’s concurrency model makes it deceptively easy to spin up thousands of goroutines—but that ease can come at a cost. Each goroutine starts small, but under load, unbounded concurrency can cause memory usage to spike, context switches to pile up, and overall performance to become unpredictable.

A worker pool helps apply backpressure by limiting the number of active goroutines. Instead of spawning one per task, a fixed pool handles work in controlled parallelism—keeping memory usage predictable and avoiding overload. This makes it easier to maintain steady performance even as demand scales.

While launching a goroutine for every task is idiomatic and often effective, doing so at scale comes with trade-offs. Each goroutine requires stack space and introduces scheduling overhead. Performance can degrade sharply when the number of active goroutines grows, especially in systems handling unbounded input like HTTP requests, jobs from a queue, or tasks from a channel.

A worker pool maintains a fixed number of goroutines that pull tasks from a shared job queue. This creates a backpressure mechanism, ensuring the system never processes more work concurrently than it can handle. Worker pools are particularly valuable when the cost of each task is predictable, and the overall system throughput needs to be stable.

Here’s a minimal implementation of a worker pool:

In this example, five workers pull from the jobs channel and push results to the results channel. The worker pool limits concurrency to five tasks at a time, regardless of how many tasks are sent.

The optimal number of workers in a pool is closely tied to the number of CPU cores, which you can obtain in Go using runtime.NumCPU() or runtime.GOMAXPROCS(0). For CPU-bound tasks—where each worker consumes substantial CPU time—you generally want the number of workers to be equal to or slightly less than the number of logical CPU cores. This ensures maximum core utilization without excessive overhead.

If your tasks are I/O-bound (e.g., network calls, disk I/O, database queries), the pool size can be larger than the number of cores. This is because workers will spend much of their time blocked, allowing others to run. In contrast, CPU-heavy workloads benefit from a smaller, tightly bounded pool that avoids contention and context switching.
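
As a rough sketch, the pool size can be derived from the core count at startup; the poolSize helper and the 4x multiplier below are illustrative starting points, not rules:

// poolSize picks a worker count based on the cores available to the Go scheduler.
func poolSize(ioBound bool) int {
    n := runtime.GOMAXPROCS(0) // logical CPUs usable by this process
    if ioBound {
        return n * 4 // I/O-bound workers spend much of their time blocked
    }
    return n
}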

Adding more workers can seem like a straightforward way to boost throughput, but the benefits taper off quickly past a certain point. Once you exceed the system’s optimal lev

[Content truncated]

Examples:

Example 1 (go):

func worker(id int, jobs <-chan int, results chan<- [32]byte) {
    for j := range jobs {
        results <- doWork(j)
    }
}

func doWork(n int) [32]byte {
    data := []byte(fmt.Sprintf("payload-%d", n))
    return sha256.Sum256(data)                  // (1)
}

func main() {
    jobs := make(chan int, 100)
    results := make(chan [32]byte, 100)

    for w := 1; w <= 5; w++ {
        go worker(w, jobs, results)
    }

    for j := 1; j <= 10; j++ {
        jobs <- j
    }
    close(jobs)

    for a := 1; a <= 10; a++ {
        <-results
    }
}

Example 2 (go):

package perf

import (
    // "log"
    "fmt"
    // "os"
    "runtime"
    "sync"
    "testing"
    "crypto/sha256"
)

const (
    numJobs     = 10000
    workerCount = 10
)

func doWork(n int) [32]byte {
    data := []byte(fmt.Sprintf("payload-%d", n))
    return sha256.Sum256(data)
}

func BenchmarkUnboundedGoroutines(b *testing.B) {
    for b.Loop() {
        var wg sync.WaitGroup
        wg.Add(numJobs)

        for j := 0; j < numJobs; j++ {
            go func(job int) {
                _ = doWork(job)
                wg.Done()
            }(j)
        }
        wg.Wait()
    }
}

func 
...

Efficient Context Management - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/context/

Contents:

  • Efficient Context Management¶
  • Why Context Matters¶
  • Practical Examples of Context Usage¶
    • HTTP Server Request Cancellation¶
    • Database Operations with Timeouts¶
    • Propagating Request IDs for Distributed Tracing¶
    • Concurrent Worker Management¶
    • Graceful Shutdown in CLI Tools¶
    • Streaming and Real-Time Data Pipelines¶
    • Middleware and Rate Limiting¶

Whether you're handling HTTP requests, coordinating worker goroutines, or querying external services, there's often a need to cancel in-flight operations or enforce execution deadlines. Go’s context package is designed for precisely that—it provides a consistent and thread-safe way to manage operation lifecycles, propagate metadata, and ensure resources are cleaned up promptly.

Go provides two base context constructors: context.Background() and context.TODO().

The context package in Go is designed to carry deadlines, cancellation signals, and other request-scoped values across API boundaries. It's especially useful in concurrent programs where operations need to be coordinated and canceled cleanly.

A typical context workflow begins at the entry point of a program or request—like an HTTP handler, main function, or RPC server. From there, a base context is created using context.Background() or context.TODO(). This context can then be extended using constructors like:

  • context.WithCancel(parent)
  • context.WithTimeout(parent, timeout)
  • context.WithDeadline(parent, deadline)
  • context.WithValue(parent, key, value)

Each of these functions returns a new context that wraps its parent. Cancellation signals, deadlines, and values are automatically propagated down the call stack. When a context is canceled—either manually or by timeout—any goroutines or functions listening on <-ctx.Done() are immediately notified.

By passing context explicitly through function parameters, you avoid hidden dependencies and gain fine-grained control over the execution lifecycle of concurrent operations.

The following examples show how context.Context enables better control, observability, and resource management across a variety of real-world scenarios.

Contexts help gracefully handle cancellations when clients disconnect early. Every incoming HTTP request in Go carries a context that gets canceled if the client closes the connection. By checking <-ctx.Done(), you can exit early instead of doing unnecessary work:

In this example, the handler waits for either a simulated delay or cancellation. If the client closes the connection before the timeout, ctx.Done() is triggered, allowing the handler to clean up without writing a response.

Contexts provide a straightforward way to enforce timeouts on database queries. Many drivers support QueryContext or similar methods that respect cancellation:

In this case, the context is automatically canceled if the database does not respond within two seconds. The query is aborted, and the application doesn’t hang indefinitely. This helps manage resources and avoids cascading failures in

[Content truncated]

Examples:

Example 1 (go):

func handler(w http.ResponseWriter, req *http.Request) {
    ctx := req.Context()
    select {
    case <-time.After(5 * time.Second):
        fmt.Fprintln(w, "Response after delay")
    case <-ctx.Done():
        log.Println("Client disconnected")
    }
}

Example 2 (unknown):

ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()

rows, err := db.QueryContext(ctx, "SELECT * FROM users")
if err != nil {
    log.Fatal(err)
}
defer rows.Close()

Example 3 (go):

func main() {
    ctx := context.WithValue(context.Background(), "requestID", "12345")
    handleRequest(ctx)
}

func handleRequest(ctx context.Context) {
    log.Printf("Handling request with ID: %v", ctx.Value("requestID"))
}

Example 4 (unknown):

ctx, cancel := context.WithCancel(context.Background())

for i := 0; i < 10; i++ {
    go worker(ctx, i)
}

// Cancel workers after some condition or signal
cancel()

Immutable Data Sharing - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/immutable-data/

Contents:

  • Immutable Data Sharing¶
  • Why Immutable Data?¶
  • Practical Example: Shared Config¶
    • Step 1: Define the Config Struct¶
    • Step 2: Ensure Deep Immutability¶
    • Step 3: Atomic Swapping¶
    • Step 4: Using It in Handlers¶
  • Practical Example: Immutable Routing Table¶
    • Step 1: Define Route Structs¶
    • Step 2: Build Immutable Version¶

One common source of slowdown in high-performance Go programs is the way shared data is accessed under concurrency. The usual tools—mutexes and channels—work well, but they’re not free. Mutexes can become choke points if many goroutines try to grab the same lock. Channels, while elegant for coordination, can introduce blocking and make control flow harder to reason about. Both require careful use: it’s easy to introduce subtle bugs or unexpected performance issues if synchronization isn’t tight.

A powerful alternative is immutable data sharing. Instead of protecting data with locks, you design your system so that shared data is never mutated after it's created. This minimizes contention and simplifies reasoning about your program.

Immutability brings several advantages to concurrent programs:

Imagine you have a long-running service that periodically reloads its configuration from a disk or a remote source. Multiple goroutines read this configuration to make decisions.

Here's how immutable data helps:

Maps and slices in Go are reference types. Even if the Config struct isn't changed, someone could accidentally mutate a shared map. To prevent this, we make defensive copies:

Now, every config instance is self-contained and safe to share.

Use an atomic pointer (atomic.Pointer, as in the example below, or atomic.Value) to store and safely update the current config.

Now all goroutines can safely call GetConfig() with no locks. When the config is reloaded, you just Store a new immutable copy.

Suppose you're building a lightweight reverse proxy or API gateway and must route incoming requests based on path or host. The routing table is read thousands of times per second and updated only occasionally (e.g., from a config file or service discovery).

To ensure immutability, we deep-copy the slice of routes when constructing a new routing table.
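
A sketch of that step, assuming an illustrative Route struct and an atomic.Pointer for publishing the new table:

type Route struct {
    Host    string
    Path    string
    Backend string
}

type RoutingTable struct {
    routes []Route
}

var currentTable atomic.Pointer[RoutingTable]

// NewRoutingTable deep-copies the input so callers cannot mutate the shared table.
func NewRoutingTable(routes []Route) *RoutingTable {
    copied := make([]Route, len(routes))
    copy(copied, routes)
    return &RoutingTable{routes: copied}
}

// UpdateRoutes publishes a fresh immutable table; in-flight readers keep the old one.
func UpdateRoutes(routes []Route) {
    currentTable.Store(NewRoutingTable(routes))
}

func lookup(path string) (Route, bool) {
    for _, r := range currentTable.Load().routes {
        if r.Path == path {
            return r, true
        }
    }
    return Route{}, false
}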

Now, your routing logic can scale safely under load with zero locking overhead.

As systems grow, routing tables can expand to hundreds or even thousands of entries. While immutability brings clear benefits—safe concurrent access, predictable behavior—it becomes costly if every update means copying the entire structure. At some point, rebuilding the whole table for each minor change doesn’t scale.

To keep immutability without paying for full reconstruction on every update, the design needs to evolve. There are several ways to do this—each preserving the core benefits while reducing overhead.

Imagine a multi-tenant system where each customer has their own set of routing rules. I

[Content truncated]

Examples:

Example 1 (unknown):

// config.go
type Config struct {
    LogLevel string
    Timeout  time.Duration
    Features map[string]bool // This needs attention!
}

Example 2 (go):

func NewConfig(logLevel string, timeout time.Duration, features map[string]bool) *Config {
    copiedFeatures := make(map[string]bool, len(features))
    for k, v := range features {
        copiedFeatures[k] = v
    }

    return &Config{
        LogLevel: logLevel,
        Timeout:  timeout,
        Features: copiedFeatures,
    }
}

Example 3 (go):

var currentConfig atomic.Pointer[Config]

func LoadInitialConfig() {
    cfg := NewConfig("info", 5*time.Second, map[string]bool{"beta": true})
    currentConfig.Store(cfg)
}

func GetConfig() *Config {
    return currentConfig.Load()
}

Example 4 (go):

func handler(w http.ResponseWriter, r *http.Request) {
    cfg := GetConfig()
    if cfg.Features["beta"] {
        // Enable beta path
    }
    // Use cfg.Timeout, cfg.LogLevel, etc.
}

Atomic Operations and Synchronization Primitives - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/atomic-ops/

Contents:

  • Atomic Operations and Synchronization Primitives¶
  • Understanding Atomic Operations¶
    • Memory Model and Comparison to C++¶
    • Common Atomic Operations¶
    • When to Use Atomic Operations in Real Life¶
      • High-throughput metrics and Counters¶
      • Fast, Lock-Free Flags¶
      • Once-Only Initialization¶
      • Lock-Free Queues or Freelist Structures¶
      • Reducing Lock Contention¶

In high-concurrency systems, performance isn't just about what you do—it's about what you avoid. Lock contention, cache line bouncing and memory fences quietly shape throughput long before you hit your scaling ceiling. Atomic operations are among the leanest tools Go offers to sidestep these pitfalls.

While Go provides the full suite of synchronization primitives, there's a class of problems where locks feel like overkill. Atomics offer clarity and speed for low-level coordination—counters, flags, and simple state machines, especially under pressure.

Atomic operations allow safe concurrent access to shared data without explicit locking mechanisms like mutexes. The sync/atomic package provides low-level atomic memory primitives ideal for counters, flags, or simple state transitions.

The key benefit of atomic operations is performance under contention. Locking introduces coordination overhead—when many goroutines contend for a mutex, performance can degrade due to context switching and lock queue management. Atomics avoid this by operating directly at the hardware level using CPU instructions like CAS (compare-and-swap). This makes them particularly useful for:

  • High-throughput metrics and counters
  • Fast, lock-free flags
  • Once-only initialization
  • Lock-free queues or freelist structures
  • Reducing lock contention

Understanding memory models is crucial when reasoning about concurrency. In C++, developers have fine-grained control over atomic operations via memory orderings, which allows them to trade-off between performance and consistency. By default, Go's atomic operations enforce sequential consistency, which means they behave like std::memory_order_seq_cst in C++. This is the strongest and safest memory ordering:

Go does not expose weaker memory models like relaxed, acquire, or release. This is an intentional simplification to promote safety and reduce the risk of subtle data races. All atomic operations in Go imply synchronization across goroutines, ensuring correct behavior without manual memory fencing.

This means you don’t have to reason about instruction reordering or memory visibility at a low level—but it also means you can’t fine-tune for performance in the way C++ or Rust developers might use relaxed atomics.

Low-level access to relaxed memory ordering in Go exists internally (e.g., in the runtime or through go:linkname), but it’s not safe or supported for use in application-level code.

Tracking request counts, dropped packets, or other lightweight stats:

This code allows multiple goroutines to safely increment a shared counter without using locks. atomic.AddInt64 ensures each addition is applied atomically.

[Content truncated]

Examples:

Example 1 (go):

var requests atomic.Int64

func handleRequest() {
    requests.Add(1)
}

Example 2 (go):

var shutdown atomic.Int32

func mainLoop() {
    for {
        if shutdown.Load() == 1 {
            break
        }
        // do work
    }
}

func stop() {
    shutdown.Store(1)
}

Example 3 (go):

import (
    "runtime"
    "sync/atomic"
    "unsafe"
)

var resource unsafe.Pointer
var initStatus int32 // 0: not started, 1: in progress, 2: completed

func getResource() *MyResource {
    if atomic.LoadInt32(&initStatus) == 2 {
        return (*MyResource)(atomic.LoadPointer(&resource))
    }

    if atomic.CompareAndSwapInt32(&initStatus, 0, 1) {
        newRes := expensiveInit() // initialization logic
        atomic.StorePointer(&resource, unsafe.Pointer(newRes))
        atomic.StoreInt32(&initStatus, 2)
        return newRes
    }

    for atomic.LoadInt32(&initStatus) != 2 {
        runtime.Gosched() // yield while another goroutine completes initialization
    }
    return (*MyResource)(atomic.LoadPointer(&resource))
}

Example 4 (go):

type node struct {
    next *node
    val  any
}

var head atomic.Pointer[node]

func push(n *node) {
    for {
        old := head.Load()
        n.next = old
        if head.CompareAndSwap(old, n) {
            return
        }
    }
}

Go-Performance - Escape Analysis

Pages: 1


Stack Allocations and Escape Analysis - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/stack-alloc/

Contents:

  • Stack Allocations and Escape Analysis¶
  • What Is Escape Analysis?¶
    • Why does it matter?¶
    • Example: Stack vs Heap¶
  • How to View Escape Analysis Output¶
  • What Causes Variables to Escape?¶
    • Returning Pointers to Local Variables¶
    • Capturing Variables in Closures¶
    • Interface Conversions¶
    • Assignments to Global Variables or Struct Fields¶

When writing performance-critical Go applications, one of the subtle but significant optimizations you can make is encouraging values to be allocated on the stack rather than the heap. Stack allocations are cheaper, faster, and garbage-free—but Go doesn't always put your variables there automatically. That decision is made by the Go compiler during escape analysis.

In this article, we’ll explore what escape analysis is, how to read the compiler’s escape diagnostics, what causes values to escape, and how to structure your code to minimize unnecessary heap allocations. We'll also benchmark different scenarios to show the real-world impact.

Escape analysis is a static analysis performed by the Go compiler to determine whether a variable can be safely allocated on the stack or if it must be moved ("escape") to the heap.

The compiler decides where to place each variable based on how it's used. If a variable can be guaranteed to not outlive its declaring function, it can stay on the stack. If not, it escapes to the heap.

In allocate, x is returned as a pointer. Since the pointer escapes the function, the Go compiler places x on the heap. In noEscape, x is a plain value and doesn’t escape.

You can inspect escape analysis with the -gcflags compiler option:

Or for a specific file:

This will print lines like:

Look for messages like moved to heap to identify escape points.

Here are common scenarios that force heap allocation:

  • Returning pointers to local variables
  • Capturing variables in closures
  • Interface conversions
  • Assignments to global variables or struct fields

When a value is stored in an interface, it may escape:
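
A minimal sketch of that case; the sink variable and payload type are illustrative:

type payload struct {
    data [256]byte
}

var sink interface{}

func store(p payload) {
    sink = p // boxing: the value is copied into an interface and moved to the heap
}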

Go may allocate large structs or slices on the heap even if they don’t strictly escape.

Let’s run a benchmark to explore when heap allocations actually occur—and when they don’t, even if we return a pointer.

You might expect HeapAlloc to always allocate memory on the heap—but it doesn’t here. That’s because the compiler is smart: in this isolated benchmark, the pointer returned by HeapAlloc doesn’t escape the function in any meaningful way. The compiler can see it’s only used within the benchmark and short-lived, so it safely places it on the stack too.

As shown in BenchmarkHeapAllocEscape, assigning the pointer to a global variable causes a real heap escape. This introduces real overhead: a 40x slower call, a 24-byte allocation, and one garbage-collected object per call.

Not all escapes are worth preventing. Here’s when it makes sense to focus on stack allocation—and when it’s better to let values escape.

When It’s Fine to Let Values Escape

Examples:

Example 1 (go):

func allocate() *int {
    x := 42
    return &x // x escapes to the heap
}

func noEscape() int {
    x := 42
    return x // x stays on the stack
}

Example 2 (unknown):

go build -gcflags="-m" ./path/to/pkg

Example 3 (unknown):

go run -gcflags="-m" main.go

Example 4 (unknown):

main.go:10:6: moved to heap: x
main.go:14:6: can inline noEscape

Go-Performance - Garbage Collector

Pages: 1


Memory Efficiency and Go’s Garbage Collector - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/gc/

Contents:

  • Memory Efficiency: Mastering Go’s Garbage Collector¶
  • How Go's Garbage Collector Works¶
    • Non-generational¶
    • Concurrent¶
    • Tri-color Mark and Sweep¶
  • GC Tuning: GOGC¶
    • Memory Limiting with GOMEMLIMIT¶
    • GOMEMLIMIT=X and GOGC=off configuration¶
  • Practical Strategies for Reducing GC Pressure¶
    • Prefer Stack Allocation¶

Memory management in Go is automated—but it’s not invisible. Every allocation you make contributes to GC workload. The more frequently objects are created and discarded, the more work the runtime has to do reclaiming memory.

This becomes especially relevant in systems prioritizing low latency, predictable resource usage, or high throughput. Tuning your allocation patterns and leveraging newer features like weak references can help reduce pressure on the GC without adding complexity to your code.

We highly encourage you to read the official A Guide to the Go Garbage Collector! The document provides a detailed description of many of Go's GC internals.

Go uses a non-generational, concurrent, tri-color mark-and-sweep garbage collector. Here's what that means in practice and how it's implemented.

Many modern GCs, like those in the JVM or .NET CLR, divide memory into generations (young and old) under the assumption that most objects die young. These collectors focus on the young generation, which leads to shorter collection cycles.

Go’s GC takes a different approach. It treats all objects equally—no generational segmentation—not because generational GC conflicts with short pause times or concurrent scanning, but because it hasn’t shown clear, consistent benefits in real-world Go programs with the designs tried so far. This choice avoids the complexity of promotion logic and specialized memory regions. While it can mean scanning more objects overall, this cost is mitigated by concurrent execution and efficient write barriers.

Go’s GC runs concurrently with your application, which means it does most of its work without stopping the world. Concurrency is implemented using multiple phases that interleave with normal program execution:

  • Sweep termination (brief stop-the-world)
  • Concurrent marking
  • Mark termination (brief stop-the-world)
  • Concurrent sweeping

Even though Go’s garbage collector is mostly concurrent, it still requires brief Stop-The-World (STW) pauses at several points to maintain correctness. These pauses are kept extremely short—typically under 100 microseconds—even with large heaps and hundreds of goroutines.

STW is essential for ensuring that memory structures are not mutated while the GC analyzes them. In most applications, these pauses are imperceptible. However, even sub-millisecond pauses in latency-sensitive systems can be significant—so understanding and monitoring STW behavior becomes important when optimizing for tail latencies or jitter.

Write barriers ensure correctness while the application mutates objects during concurrent marking. These barriers help t

[Content truncated]

Examples:

Example 1 (unknown):

GOGC=100  # Default: GC runs when heap grows 100% since last collection
GOGC=off  # Disables GC (use only in special cases like short-lived CLI tools)

Example 2 (unknown):

GOMEMLIMIT=400MiB

Example 3 (unknown):

import "runtime/debug"

debug.SetMemoryLimit(2 << 30) // 2 GiB

Example 4 (unknown):

GOGC=100 GOMEMLIMIT=4GiB ./your-service

Go-Performance Documentation Index

Categories

Compiler Optimization

File: compiler_optimization.md Pages: 1

Concurrency

File: concurrency.md Pages: 5

Escape Analysis

File: escape_analysis.md Pages: 1

Garbage Collector

File: garbage_collector.md Pages: 1

Io Optimization

File: io_optimization.md Pages: 2

Memory Management

File: memory_management.md Pages: 6

Go-Performance - Io Optimization

Pages: 2


Batching Operations - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/batching-ops/

Contents:

  • Batching Operations in Go¶
  • Why Batching Matters¶
  • How a generic Batcher may look¶
  • Benchmarking Impact¶
  • When To Use Batching¶

Batching is one of those techniques that’s easy to overlook but incredibly useful when performance starts to matter. Instead of handling one operation at a time, you group them together—cutting down on the overhead of repeated calls, whether that’s hitting the network, writing to disk, or making a database commit. It’s a practical, low-complexity approach that can reduce latency and stretch your system’s throughput further than you’d expect.

Most systems don’t struggle because individual operations are too slow—they struggle because they do too many of them. Every call out to a database, API, or filesystem adds some fixed cost: a system call, a network round trip, maybe a lock or a context switch. When those costs add up across high-volume workloads, the impact is hard to ignore. Batching helps by collapsing those calls into fewer, more efficient units of work, which often leads to measurable gains in both performance and resource usage.

Consider a logging service writing to disk:

When invoked thousands of times per second, the file system is inundated with individual write system calls, significantly degrading performance. A better approach is to aggregate log entries and flush them in bulk:

With batching, each write operation handles multiple entries simultaneously, reducing syscall overhead and improving disk I/O efficiency.

While batching offers substantial performance advantages, it also introduces the risk of data loss. If an application crashes before a batch is flushed, the in-memory data can be lost. Systems dealing with critical or transactional data must incorporate safeguards such as periodic flushes, persistent storage buffers, or recovery mechanisms to mitigate this risk.

We can implement a generic batcher in a very straightforward manner:

This batcher implementation expects that you will never call Batcher.Add(...) from your flush() function. We have this limitation because Go mutexes are not recursive.

This batcher works with any data type, making it a flexible solution for aggregating logs, metrics, database writes, or other grouped operations. Internally, the buffer acts as a queue that accumulates items until a flush threshold is reached. The use of sync.Mutex ensures that Add() and flushNow() are safe for concurrent access, which is necessary in most real-world systems where multiple goroutines may write to the batcher.
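
A short usage sketch, assuming the Batcher from the example below and the log file f from the earlier snippets:

var logBatcher = NewBatcher[string](100, func(items []string) {
    // Invoked with a full batch; write it with one call instead of one per entry.
    f.WriteString(strings.Join(items, "\n") + "\n")
})

func logLineBatched(line string) {
    logBatcher.Add(line) // triggers a flush automatically once 100 entries accumulate
}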

From a performance standpoint, it's true that a lock-free implementation—using atomic operations or conc

[Content truncated]

Examples:

Example 1 (unknown):

func logLine(line string) {
    f.WriteString(line + "\n")
}

Example 2 (go):

var batch []string

func logBatch(line string) {
    batch = append(batch, line)
    if len(batch) >= 100 {
        f.WriteString(strings.Join(batch, "\n") + "\n")
        batch = batch[:0]
    }
}

Example 3 (go):

type Batcher[T any] struct {
    mu     sync.Mutex
    buffer []T
    size   int
    flush  func([]T)
}

func NewBatcher[T any](size int, flush func([]T)) *Batcher[T] {
    return &Batcher[T]{
        buffer: make([]T, 0, size),
        size:   size,
        flush:  flush,
    }
}

func (b *Batcher[T]) Add(item T) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.buffer = append(b.buffer, item)
    if len(b.buffer) >= b.size {
        b.flushNow()
    }
}

func (b *Batcher[T]) flushNow() {
    if len(b.buffer) == 0 {
        return
    }
    b.flush(b.buffer)
    b.buffer = b.buffer[:0]
}

Example 4 (go):

package perf

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "os"
    "strings"
    "testing"
)

var lines = make([]string, 10000)

func init() {
    for i := range lines {
        lines[i] = fmt.Sprintf("log entry %d %s", i, strings.Repeat("x", 100))
    }
}

// --- 1. No I/O ---

func BenchmarkUnbatchedProcessing(b *testing.B) {
    for b.Loop() {
        for _, line := range lines {
            strings.ToUpper(line)
        }
    }
}

func BenchmarkBatchedProcessing(b *testing.B) {
    batchSize := 100
    for b.Loop() {
        for i := 0; i < len(lines); i += batchSize {
  
...

Efficient Buffering - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/buffered-io/

Contents:

  • Efficient Buffering in Go¶
  • Why Buffering Matters¶
    • With Buffering¶
    • Controlling Buffer Capacity¶
  • Benchmarking Impact¶
  • When To Buffer¶

Buffering is a core performance technique in systems programming. In Go, it's especially relevant when working with I/O—file access, network communication, and stream processing. Without buffering, many operations incur excessive system calls or synchronization overhead. Proper buffering reduces the frequency of such interactions, improves throughput, and smooths latency spikes.

Every time you read from or write to a file or socket, there’s a good chance you’re triggering a system call—and that’s not cheap. System calls move control from user space into kernel space, which means crossing a boundary that comes with overhead: entering kernel mode, possible context switches, interacting with I/O buffers, and sometimes queuing operations behind the scenes. Doing that once in a while is fine. Doing it thousands of times per second? That’s a problem. Buffering helps by batching small reads or writes into larger chunks, reducing how often you cross that boundary and making far better use of each syscall.

For example, writing to a file in a loop without buffering, like this:

This can easily result in 10,000 separate system calls, each carrying its own overhead and dragging down performance. On top of that, a flood of small writes tends to fragment disk operations, which puts extra pressure on I/O subsystems and wastes CPU cycles handling what could have been a single, efficient batch.

This version significantly reduces the number of system calls. The bufio.Writer accumulates writes in an internal memory buffer (typically 4KB or more). It only triggers a syscall when the buffer is full or explicitly flushed. As a result, you achieve faster I/O, reduced CPU usage, and improved performance.

bufio.Writer does not automatically flush when closed. If you forget to call Flush(), any unwritten data remaining in the buffer will be lost. Always call Flush() before closing or returning from a function, especially if the total written size is smaller than the buffer capacity.

By default, bufio.NewWriter() allocates a 4096-byte (4 KB) buffer. This size aligns with the common block size of file systems and the standard memory page size on most operating systems (such as Linux, BSD, and macOS). Reading or writing in 4 KB increments minimizes page faults, aligns with kernel read-ahead strategies, and maps efficiently onto underlying disk I/O operations.

While 4 KB is a practical general-purpose default, it might not be optimal for all workloads. For high-throughput scenari

[Content truncated]

Examples:

Example 1 (unknown):

f, _ := os.Create("output.txt")
for i := 0; i < 10000; i++ {
    f.Write([]byte("line\n"))
}

Example 2 (unknown):

f, _ := os.Create("output.txt")
buf := bufio.NewWriter(f)
for i := 0; i < 10000; i++ {
    buf.WriteString("line\n")
}
buf.Flush() // ensure all buffered data is written

Example 3 (unknown):

f, _ := os.Create("output.txt")
buf := bufio.NewWriterSize(f, 16*1024) // 16 KB buffer

Example 4 (unknown):

reader := bufio.NewReaderSize(f, 32*1024) // 32 KB buffer for input

Go-Performance - Memory Management

Pages: 6


Memory Preallocation - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/mem-prealloc/

Contents:

  • Memory Preallocation¶
  • Why Preallocation Matters¶
  • Practical Preallocation Examples¶
    • Slice Preallocation¶
    • Map Preallocation¶
  • Benchmarking Impact¶
  • When To Preallocate¶

Memory preallocation is a simple but effective way to improve performance in Go programs that work with slices or maps that grow over time. Instead of letting the runtime resize these structures as they fill up—often at unpredictable points—you allocate the space you need upfront. This avoids the cost of repeated allocations, internal copying, and extra GC pressure as intermediate objects are created and discarded.

In high-throughput or latency-sensitive systems, preallocating memory makes execution more predictable and helps avoid performance cliffs that show up under load. If the workload size is known or can be reasonably estimated, there’s no reason to let the allocator do the guessing.

Go’s slices and maps grow automatically as new elements are added, but that convenience comes with a cost. When capacity is exceeded, the runtime allocates a larger backing array or hash table and copies the existing data over. This reallocation adds memory pressure, burns CPU cycles, and can stall tight loops in high-throughput paths. In performance-critical code—especially where the size is known or can be estimated—frequent resizing is unnecessary overhead. Preallocating avoids these penalties by giving the runtime enough room to work without interruption.

Go uses a hybrid growth strategy for slices to balance speed and memory efficiency. Early on, capacities double with each expansion—2, 4, 8, 16—minimizing the number of allocations. But once a slice exceeds around 1024 elements, the growth rate slows to roughly 25%. So instead of jumping from 1024 to 2048, the next allocation might grow to about 1280.

This shift reduces memory waste on large slices but increases the frequency of allocations if the final size is known but not preallocated. In those cases, using make([]T, 0, expectedSize) is the more efficient choice—it avoids repeated resizing and cuts down on unnecessary copying.

Output illustrating typical growth:

Without preallocation, each append operation might trigger new allocations:

This pattern causes Go to allocate larger underlying arrays repeatedly as the slice grows, resulting in memory copying and GC pressure. We can avoid that by using make with a specified capacity:

If it is known that the slice will be fully populated, we can be even more efficient by avoiding bounds checks:
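
A sketch of that variant, assuming the final length of 10,000 is known exactly:

// Preallocate the full length and assign by index instead of appending.
result := make([]int, 10000)
for i := 0; i < 10000; i++ {
    result[i] = i
}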

Maps grow similarly. By default, Go doesn’t know how many elements you’ll add, so it resizes the underlying structure as needed.

Starting with Go 1.11, you can preallocate map capacity by passing a size hint to make:
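
A sketch of a preallocated map, assuming roughly 10,000 entries will be inserted:

// The size hint lets the runtime size the map's internal buckets up front.
m := make(map[int]string, 10000)
for i := 0; i < 10000; i++ {
    m[i] = fmt.Sprintf("user-%d", i)
}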

[Content truncated]

Examples:

Example 1 (unknown):

s := make([]int, 0)
for i := 0; i < 10_000; i++ {
    s = append(s, i)
    fmt.Printf("Len: %d, Cap: %d\n", len(s), cap(s))
}

Example 2 (unknown):

Len: 1, Cap: 1
Len: 2, Cap: 2
Len: 3, Cap: 4
Len: 5, Cap: 8
...
Len: 1024, Cap: 1024
Len: 1025, Cap: 1280

Example 3 (unknown):

// Inefficient
var result []int
for i := 0; i < 10000; i++ {
    result = append(result, i)
}

Example 4 (unknown):

// Efficient
result := make([]int, 0, 10000)
for i := 0; i < 10000; i++ {
    result = append(result, i)
}

Zero-Copy Techniques - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/zero-copy/

Contents:

  • Zero-Copy Techniques¶
  • Understanding Zero-Copy¶
  • Common Zero-Copy Techniques in Go¶
    • Using io.Reader and io.Writer Interfaces¶
    • Slicing for Efficient Data Access¶
    • Memory Mapping (mmap)¶
  • Benchmarking Impact¶
    • File I/O: Memory Mapping vs. Standard Read¶
  • When to Use Zero-Copy¶
    • Real-World Use Cases and Libraries¶

When writing performance-critical Go code, how memory is managed often has a bigger impact than it first appears. Zero-copy techniques are one of the more effective ways to tighten that control. Instead of moving bytes from buffer to buffer, these techniques work directly on existing memory—avoiding copies altogether. That means less pressure on the CPU, better cache behavior, and fewer GC-triggered pauses. For I/O-heavy systems—whether you’re streaming files, handling network traffic, or parsing large datasets—this can translate into much higher throughput and lower latency without adding complexity.

In the usual I/O path, data moves back and forth between user space and kernel space—first copied into a kernel buffer, then into your application’s buffer, or the other way around. It works, but it’s wasteful. Every copy burns CPU cycles and clogs up memory bandwidth. Zero-copy changes that. Instead of bouncing data between buffers, it lets applications work directly with what’s already in place—no detours, no extra copies. The result? Lower CPU load, better use of memory, and faster I/O, especially when throughput or latency actually matter.

Using interfaces like io.Reader and io.Writer gives you fine-grained control over how data flows. Instead of spinning up new buffers every time, you can reuse existing ones and keep memory usage steady. In practice, this avoids unnecessary garbage collection pressure and keeps your I/O paths clean and efficient—especially when you’re dealing with high-throughput or streaming workloads.

io.CopyBuffer reuses a provided buffer, avoiding repeated allocations and intermediate copies. An in-depth io.CopyBuffer explanation is available on SO.

Slicing large byte arrays or buffers instead of copying data into new slices is a powerful zero-copy strategy:

Slices in Go are inherently zero-copy since they reference the underlying array.

Using memory mapping enables direct access to file contents without explicit read operations:

This approach maps file contents directly into memory, entirely eliminating copying between kernel and user-space.

Here's a basic benchmark illustrating performance differences between explicit copying and zero-copy slicing:

In BenchmarkCopy, each iteration copies a 64KB buffer into a fresh slice—allocating memory and duplicating data every time. That cost adds up fast. BenchmarkSlice, on the other hand, just re-slices the same buffer—no allocation, no copying, just new view on the same data. The di

[Content truncated]

Examples:

Example 1 (go):

func StreamData(src io.Reader, dst io.Writer) error {
    buf := make([]byte, 4096) // Reusable buffer
    _, err := io.CopyBuffer(dst, src, buf)
    return err
}

Example 2 (unknown):

func process(buffer []byte) []byte {
    return buffer[128:256] // returns a slice reference without copying
}

Example 3 (go):

import "golang.org/x/exp/mmap"

func ReadFileZeroCopy(path string) ([]byte, error) {
    r, err := mmap.Open(path)
    if err != nil {
        return nil, err
    }
    defer r.Close()

    data := make([]byte, r.Len())
    _, err = r.ReadAt(data, 0)
    return data, err
}

Example 4 (go):

func BenchmarkCopy(b *testing.B) {
    data := make([]byte, 64*1024)
    for b.Loop() {
        buf := make([]byte, len(data))
        copy(buf, data)
    }
}

func BenchmarkSlice(b *testing.B) {
    data := make([]byte, 64*1024)
    for b.Loop() {
        _ = data[:]
    }
}

Object Pooling - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/object-pooling/

Contents:

  • Object Pooling¶
  • How Object Pooling Works¶
    • Using sync.Pool for Object Reuse¶
      • Without Object Pooling (Inefficient Memory Usage)¶
      • With Object Pooling (Optimized Memory Usage)¶
    • Pooling Byte Buffers for Efficient I/O¶
  • Benchmarking Impact¶
  • When Should You Use sync.Pool?¶

Object pooling helps reduce allocation churn in high-throughput Go programs by reusing objects instead of allocating fresh ones each time. This avoids repeated work for the allocator and eases pressure on the garbage collector, especially when dealing with short-lived or frequently reused structures.

Go’s sync.Pool provides a built-in way to implement pooling with minimal code. It’s particularly effective for objects that are expensive to allocate or that would otherwise contribute to frequent garbage collection cycles. While not a silver bullet, it’s a low-friction tool that can lead to noticeable gains in latency and CPU efficiency under sustained load.

Object pooling allows programs to reuse memory by recycling previously allocated objects instead of creating new ones on every use. Rather than hitting the heap each time, objects are retrieved from a shared pool and returned once they’re no longer needed. This reduces the number of allocations, cuts down on garbage collection workload, and leads to more predictable performance—especially in workloads with high object churn or tight latency requirements.

In the above example, every iteration creates a new Data instance, leading to unnecessary allocations and increased GC pressure.

Object pooling is especially effective when working with large byte slices that would otherwise lead to high allocation and garbage collection overhead.

Using sync.Pool for byte buffers significantly reduces memory pressure when dealing with high-frequency I/O operations.

To prove that object pooling actually reduces allocations and improves speed, we can use Go's built-in memory profiling tools (pprof) and compare memory allocations between the non-pooled and pooled versions. Simulating a full-scale application that actively uses memory for benchmarking is challenging, so we need a controlled test to evaluate direct heap allocations versus pooled allocations.

The benchmark results highlight the contrast in performance and memory usage between direct allocations and object pooling. In BenchmarkWithoutPooling, each iteration creates a new object on the heap, leading to higher execution time and increased memory consumption. This constant allocation pressure triggers more frequent garbage collection, which adds latency and reduces throughput. The presence of nonzero allocation counts per operation confirms that each iteration contributes to GC load, making this approach less efficient in high-throughput scenarios.

Avoid sy

[Content truncated]

Examples:

Example 1 (go):

package main

import (
    "fmt"
)

type Data struct {
    Value int
}

func createData() *Data {
    return &Data{Value: 42}
}

func main() {
    for i := 0; i < 1000000; i++ {
        obj := createData() // Allocating a new object every time
        _ = obj // Simulate usage
    }
    fmt.Println("Done")
}

Example 2 (go):

package main

import (
    "fmt"
    "sync"
)

type Data struct {
    Value int
}

var dataPool = sync.Pool{
    New: func() any {
        return &Data{}
    },
}

func main() {
    for i := 0; i < 1000000; i++ {
        obj := dataPool.Get().(*Data) // Retrieve from pool
        obj.Value = 42 // Use the object
        dataPool.Put(obj) // Return object to pool for reuse
    }
    fmt.Println("Done")
}

Example 3 (go):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

var bufferPool = sync.Pool{
    New: func() any {
        return new(bytes.Buffer)
    },
}

func main() {
    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()
    buf.WriteString("Hello, pooled world!")
    fmt.Println(buf.String())
    bufferPool.Put(buf) // Return buffer to pool for reuse
}

Example 4 (go):

package perf

import (
    "sync"
    "testing"
)

// Data is a struct with a large fixed-size array to simulate a memory-intensive object.
type Data struct {
    Values [1024]int
}

// BenchmarkWithoutPooling measures the performance of direct heap allocations.
func BenchmarkWithoutPooling(b *testing.B) {
    for b.Loop() {
        data := &Data{}      // Allocating a new object each time
        data.Values[0] = 42  // Simulating some memory activity
    }
}

// dataPool is a sync.Pool that reuses instances of Data to reduce memory allocations.
var dataPool = sync.Pool{
    New: func() any {
...

Avoiding Interface Boxing - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/interface-boxing/

Contents:

  • Avoiding Interface Boxing¶
  • What is Interface Boxing?¶
  • Why It Matters¶
  • Benchmarking Impact¶
    • Boxing Large Structs¶
    • Passing to a Function That Accepts an Interface¶
  • When Interface Boxing Is Acceptable¶
    • When abstraction is more important than performance¶
    • When values are small and boxing is allocation-free¶
    • When values are short-lived¶

Go’s interfaces make it easy to write flexible, decoupled code. But behind that convenience is a detail that can trip up performance: when a concrete value is assigned to an interface, Go wraps it in a hidden structure—a process called interface boxing.

In many cases, boxing is harmless. But in performance-sensitive code—like tight loops, hot paths, or high-throughput services—it can introduce hidden heap allocations, extra memory copying, and added pressure on the garbage collector. These effects often go unnoticed during development, only showing up later as latency spikes or memory bloat.

Interface boxing refers to the process of converting a concrete value to an interface type. In Go, an interface value is internally represented as two words:

When you assign a value to an interface variable, Go creates this two-part structure. If the value is a non-pointer type—like a struct or primitive—and is not already on the heap, Go may allocate a copy of it on the heap to satisfy the interface assignment. This behavior is especially relevant when working with large values or when storing items in a slice of interfaces, where each element gets individually boxed. These implicit allocations can add up and are a common source of hidden memory pressure in Go programs.

Here’s a simple example:

In this case, the integer 42 is boxed into an interface: Go stores the type information (int) and a copy of the value 42. This is inexpensive for small values like int, but for large structs, the cost becomes non-trivial.

Pay attention to this code! In this example, even though shapes is a slice of interfaces, each Square value is copied into an interface when appended to shapes. If Square were a large struct, this would introduce 1000 allocations and large memory copying.

To avoid that, you could pass pointers:

This way, only an 8-byte pointer is stored in the interface, reducing both allocation size and copying overhead.

In tight loops or high-throughput paths, such as unmarshalling JSON, rendering templates, or processing large collections, interface boxing can degrade performance by triggering unnecessary heap allocations and increasing GC pressure. This overhead is especially costly in systems with high concurrency or real-time responsiveness constraints.

Boxing can also make profiling and benchmarking misleading, since allocations attributed to innocuous-looking lines may actually stem from implicit conversions to interfaces.

For the benchmarking we will define

[Content truncated]

Examples:

Example 1 (unknown):

var i interface{}
i = 42

Example 2 (go):

type Shape interface {
    Area() float64
}

type Square struct {
    Size float64
}

func (s Square) Area() float64 { return s.Size * s.Size }

func main() {
    var shapes []Shape
    for i := 0; i < 1000; i++ {
        s := Square{Size: float64(i)}
        shapes = append(shapes, s) // boxing occurs here
    }
}

Example 3 (unknown):

shapes = append(shapes, &s) // avoids large struct copy

Example 4 (unknown):

type Worker interface {
    Work()
}

type LargeJob struct {
    payload [4096]byte
}

func (LargeJob) Work() {}

Struct Field Alignment - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/fields-alignment/

Contents:

  • Struct Field Alignment¶
  • Why Alignment Matters¶
  • Benchmarking Impact¶
  • Avoiding False Sharing in Concurrent Workloads¶
  • When To Align Structs¶

When optimizing Go programs for performance, struct layout and memory alignment often go unnoticed—yet they have a measurable impact on memory usage and cache efficiency. Go automatically aligns struct fields based on platform-specific rules, inserting padding to satisfy alignment constraints. Understanding and controlling memory alignment isn’t just a low-level detail—it can have a real impact on how your Go programs perform, especially in tight loops or high-throughput systems. Proper alignment can reduce the overall memory footprint, make better use of CPU caches, and eliminate subtle performance penalties that add up under load.

Modern CPUs are tuned for predictable memory access. When struct fields are misaligned or split across cache lines, the processor often has to do extra work to fetch the data. That can mean additional memory cycles, more cache misses, and slower performance overall. These costs are easy to overlook in everyday code but show up quickly in code that’s sensitive to throughput or latency. In Go, struct fields are aligned according to their type requirements, and the compiler inserts padding bytes to meet these constraints. If fields are arranged without care, unnecessary padding may inflate struct size significantly, affecting memory use and bandwidth.

Consider the following two structs:

On a 64-bit system, PoorlyAligned requires 24 bytes due to the padding between fields, whereas WellAligned fits into 16 bytes by ordering fields from largest to smallest alignment requirement.
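
You can verify the sizes directly with unsafe.Sizeof; the exact numbers below assume a 64-bit platform:

fmt.Println(unsafe.Sizeof(PoorlyAligned{})) // 24: 1 (flag) + 7 padding + 8 (count) + 1 (id) + 7 padding
fmt.Println(unsafe.Sizeof(WellAligned{}))   // 16: 8 (count) + 1 (flag) + 1 (id) + 6 padding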

We benchmarked both struct layouts by allocating 10 million instances of each and measuring allocation time and memory usage:

In a test with 10 million structs, the WellAligned version used 80MB less memory than its poorly aligned counterpart—and it also ran a bit faster. This isn’t just about saving RAM; it shows how struct layout directly affects allocation behavior and memory bandwidth. When you’re working with large volumes of data or performance-critical paths, reordering fields for better alignment can lead to measurable gains with minimal effort.

In addition to memory layout efficiency, struct alignment also plays a crucial role in concurrent systems. When multiple goroutines access different fields of the same struct that reside on the same CPU cache line, they may suffer from false sharing—where changes to one field cause invalidations in the other, even if logically unrelated.

On modern CPUs, a typical cache line is 64 bytes wide. When a struct is accessed in memory, the CPU loads the entire cache line that contains it, not just the specific field.

[Content truncated]

Examples:

Example 1 (go):

type PoorlyAligned struct {
    flag bool
    count int64
    id byte
}

type WellAligned struct {
    count int64
    flag bool
    id byte
}

Example 2 (go):

func BenchmarkPoorlyAligned(b *testing.B) {
    for b.Loop() {
        var items = make([]PoorlyAligned, 10_000_000)
        for j := range items {
            items[j].count = int64(j)
        }
    }
}

func BenchmarkWellAligned(b *testing.B) {
    for b.Loop() {
        var items = make([]WellAligned, 10_000_000)
        for j := range items {
            items[j].count = int64(j)
        }
    }
}

Example 3 (go):

type SharedCounterBad struct {
    a int64
    b int64
}

type SharedCounterGood struct {
    a int64
    _ [56]byte // Padding to prevent a and b from sharing a cache line
    b int64
}

Example 4 (go):

func BenchmarkFalseSharing(b *testing.B) {
    var c SharedCounterBad // swap in SharedCounterGood for the NoFalseSharing variant
    var wg sync.WaitGroup

    for b.Loop() {
        wg.Add(2)
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.a++
            }
            wg.Done()
        }()
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.b++
            }
            wg.Done()
        }()
        wg.Wait()
    }
}
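
For comparison, the guide's NoFalseSharing benchmark is identical except that it uses the padded SharedCounterGood from Example 3; a sketch (assuming the sync and testing packages):

Example 5 (go):

func BenchmarkNoFalseSharing(b *testing.B) {
    var c SharedCounterGood // padding keeps c.a and c.b on separate cache lines
    var wg sync.WaitGroup

    for b.Loop() {
        wg.Add(2)
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.a++
            }
            wg.Done()
        }()
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.b++
            }
            wg.Done()
        }()
        wg.Wait()
    }
}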

Common Go Patterns for Performance - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/

Contents:

  • Common Go Patterns for Performance¶
  • Memory Management & Efficiency¶
  • Concurrency and Synchronization¶
  • I/O Optimization and Throughput¶
  • Compiler-Level Optimization and Tuning¶

Optimizing Go applications requires understanding common patterns that help reduce latency, improve memory efficiency, and enhance concurrency. This guide organizes 15 key techniques into four practical categories.

These strategies help reduce memory churn, avoid excessive allocations, and improve cache behavior.

Object Pooling Reuse objects to reduce GC pressure and allocation overhead.

Memory Preallocation Allocate slices and maps with capacity upfront to avoid costly resizes.

Struct Field Alignment Optimize memory layout to minimize padding and improve locality.

Avoiding Interface Boxing Prevent hidden allocations by avoiding unnecessary interface conversions.

Zero-Copy Techniques Minimize data copying with slicing and buffer tricks.

Memory Efficiency and Go’s Garbage Collector Reduce GC overhead by minimizing heap usage and reusing memory.

Stack Allocations and Escape Analysis Use escape analysis to help values stay on the stack where possible.

Manage goroutines, shared resources, and coordination efficiently.

Goroutine Worker Pools Control concurrency with a fixed-size pool to limit resource usage.

Atomic Operations and Synchronization Primitives Use atomic operations or lightweight locks to manage shared state.

Lazy Initialization (sync.Once) Delay expensive setup logic until it's actually needed.

Immutable Data Sharing Share data safely between goroutines without locks by making it immutable.

Efficient Context Management Use context to propagate timeouts and cancel signals across goroutines.

Reduce system call overhead and increase data throughput for I/O-heavy workloads.

Efficient Buffering Use buffered readers/writers to minimize I/O calls.

Batching Operations Combine multiple small operations to reduce round trips and improve throughput.

Tap into Go’s compiler and linker to further optimize your application.

Leveraging Compiler Optimization Flags Use build flags like -gcflags and -ldflags for performance tuning.

Stack Allocations and Escape Analysis Analyze which values escape to the heap to help the compiler optimize memory placement.


name: go-performance
description: Go performance optimization patterns and best practices. Use when optimizing Go code, reducing memory allocations, improving GC behavior, tuning concurrency, or diagnosing performance issues in Go applications.

Go-Performance Skill

Comprehensive assistance with go-performance development, generated from official documentation.

When to Use This Skill

This skill should be triggered when:

  • Working with go-performance
  • Asking about go-performance features or APIs
  • Implementing go-performance solutions
  • Debugging go-performance code
  • Learning go-performance best practices

Quick Reference

Common Patterns

Pattern 1: Go Optimization Guide GitHub Home Blog Common Performance Patterns Common Performance Patterns Memory Management & Efficiency Memory Management & Efficiency Object Pooling Memory Preallocation Struct Field Alignment Avoiding Interface Boxing Zero-Copy Techniques Memory Efficiency and Go’s Garbage Collector Stack Allocations and Escape Analysis Concurrency and Synchronization Concurrency and Synchronization Goroutine Worker Pools Atomic Operations and Synchronization Primitives Lazy Initialization Immutable Data Sharing Efficient Context Management I/O Optimization and Throughput I/O Optimization and Throughput Efficient Buffering Batching Operations Compiler-Level Optimization and Tuning Compiler-Level Optimization and Tuning Leveraging Compiler Optimization Flags Leveraging Compiler Optimization Flags Table of contents Why Compiler Flags Matter Key Compiler and Linker Flags -ldflags="-s -w" — Strip Debug Info -gcflags — Control Compiler Optimizations Cross-Compilation Flags Build Tags -ldflags="-X ..." — Inject Build-Time Variables -extldflags='-static' — Build Fully Static Binaries Example: Static Build with libcurl via CGO Practical Networking Patterns Practical Networking Patterns Benchmarking First Benchmarking First Benchmarking and Load Testing for Networked Go Apps Practicle example of Profiling Networked Go Applications with pprof Foundations and Core Concepts Foundations and Core Concepts How Go Handles Networking Efficient Use of net/http, net.Conn, and UDP Scaling and Performance Engineering Scaling and Performance Engineering Managing 10K+ Concurrent Connections in Go GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning Diagnostics and Resilience Diagnostics and Resilience Building Resilient Connection Handling Memory Management and Leak Prevention in Long-Lived Connections Transport-Level Optimization Transport-Level Optimization Comparing TCP, HTTP/2, and gRPC Performance in Go QUIC – Building Low-Latency Services with quic-go Low-Level and Advanced Tuning Low-Level and Advanced Tuning Socket Options That Matter Tuning DNS Performance in Go Services Optimizing TLS for Speed Connection Lifecycle Observability Leveraging Compiler Optimization Flags in Go¶ When tuning Go applications for performance, most of the attention goes to runtime behavior—profiling hot paths, trimming allocations, improving concurrency. But there’s another layer that’s easy to miss: what the Go compiler does with your code before it ever runs. The build process includes several optimization passes, and understanding how to surface or influence them can give you clearer insights into what’s actually happening under the hood. It’s not about tweaking obscure flags to squeeze out extra instructions—it’s about knowing how the compiler treats your code so you’re not working against it. While Go doesn’t expose the same granular set of compiler flags as C or Rust, it still provides useful ways to influence how your code is built—especially when targeting performance, binary size, or specific environments. Why Compiler Flags Matter¶ Go's compiler (specifically cmd/compile and cmd/link) performs several default optimizations: inlining, escape analysis, dead code elimination, and more. However, there are scenarios where you can squeeze more performance or control from your build using the right flags. 
Use cases include: Reducing binary size for minimal containers or embedded systems Building for specific architectures or OSes Removing debug information for release builds Disabling optimizations temporarily for easier debugging Enabling experimental or unsafe performance tricks (carefully) Key Compiler and Linker Flags¶ -ldflags="-s -w" — Strip Debug Info¶ When you want to shrink binary size, especially in production or containers: go build -ldflags="-s -w" -o app main.go -s: Omit the symbol table -w: Omit DWARF debugging information Why it matters: This can reduce binary size by up to 30-40%, depending on your codebase. It is useful in Docker images or when distributing binaries. -gcflags — Control Compiler Optimizations¶ The -gcflags flag allows you to control how the compiler treats specific packages. For example, you can disable optimizations for debugging: go build -gcflags="all=-N -l" -o app main.go -N: Disable optimizations -l: Disable inlining When to use: During debugging sessions with Delve or similar tools. Turning off inlining and optimizations make stack traces and breakpoints more reliable. Cross-Compilation Flags¶ Need to build for another OS or architecture? GOOS=linux GOARCH=arm64 go build -o app main.go GOOS, GOARCH: Set target OS and architecture Common values: windows, darwin, linux, amd64, arm64, 386, wasm Build Tags¶ Build tags allow conditional compilation. Use //go:build or // +build in your source code to control what gets compiled in. Example: //go:build debug package main import "log" func debugLog(msg string) { log.Println("[DEBUG]", msg) } Then build with: go build -tags=debug -o app main.go -ldflags="-X ..." — Inject Build-Time Variables¶ You can inject version numbers or metadata into your binary at build time: // main.go package main import "fmt" var version = "dev" func main() { fmt.Printf("App version: %s\n", version) } Then build with: go build -ldflags="-s -w -X main.version=1.0.0" -o app main.go This sets the version variable at link time without modifying your source code. It's useful for embedding release versions, commit hashes, or build dates. -extldflags='-static' — Build Fully Static Binaries¶ The -extldflags '-static' option passes the -static flag to the external system linker, instructing it to produce a fully statically linked binary. This is especially useful when you're using CGO and want to avoid runtime dynamic library dependencies: CGO_ENABLED=1 GOOS=linux GOARCH=amd64 \ CC=gcc \ go build -ldflags="-linkmode=external -extldflags '-static'" -o app main.go What it does: Statically links all C libraries into the binary Produces a portable, self-contained executable Ideal for minimal containers (like scratch or distroless) To go further and ensure your binary avoids relying on C library DNS resolution (such as glibc's getaddrinfo), you can use the netgo build tag. This forces Go to use its pure Go implementation of the DNS resolver: CGO_ENABLED=1 GOOS=linux GOARCH=amd64 \ CC=gcc \ go build -tags netgo -ldflags="-linkmode=external -extldflags '-static'" -o app main.go This step is especially important when building for minimal container environments, where dynamic libc dependencies may not be available. Note Static linking requires static versions (.a) of the libraries you're using, and may not work with all C libraries by default. 
Example: Static Build with libcurl via CGO¶

If you’re using libcurl via CGO, here’s how you can create a statically linked Go binary:

package main

/*
#cgo LDFLAGS: -lcurl
#include <curl/curl.h>
*/
import "C"

import "fmt"

func main() {
    fmt.Println("libcurl version:", C.GoString(C.curl_version()))
}

Static Build Command:

CGO_ENABLED=1 GOOS=linux GOARCH=amd64 \
CC=gcc \
go build -tags netgo -ldflags="-linkmode=external -extldflags '-static'" -o app main.go

Ensure the static version of libcurl (libcurl.a) is available on your system. You may need to install development packages or build libcurl from source with --enable-static.

cmd/compile

Pattern 2: Go Optimization Guide GitHub Home Blog Common Performance Patterns Common Performance Patterns Memory Management & Efficiency Memory Management & Efficiency Object Pooling Memory Preallocation Struct Field Alignment Avoiding Interface Boxing Zero-Copy Techniques Memory Efficiency and Go’s Garbage Collector Stack Allocations and Escape Analysis Stack Allocations and Escape Analysis Table of contents What Is Escape Analysis? Why does it matter? Example: Stack vs Heap How to View Escape Analysis Output What Causes Variables to Escape? Returning Pointers to Local Variables Capturing Variables in Closures Interface Conversions Assignments to Global Variables or Struct Fields Large Composite Literals Benchmarking Stack vs Heap Allocations Forcing a Heap Allocation When to Optimize for Stack Allocation Concurrency and Synchronization Concurrency and Synchronization Goroutine Worker Pools Atomic Operations and Synchronization Primitives Lazy Initialization Immutable Data Sharing Efficient Context Management I/O Optimization and Throughput I/O Optimization and Throughput Efficient Buffering Batching Operations Compiler-Level Optimization and Tuning Compiler-Level Optimization and Tuning Leveraging Compiler Optimization Flags Practical Networking Patterns Practical Networking Patterns Benchmarking First Benchmarking First Benchmarking and Load Testing for Networked Go Apps Practicle example of Profiling Networked Go Applications with pprof Foundations and Core Concepts Foundations and Core Concepts How Go Handles Networking Efficient Use of net/http, net.Conn, and UDP Scaling and Performance Engineering Scaling and Performance Engineering Managing 10K+ Concurrent Connections in Go GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning Diagnostics and Resilience Diagnostics and Resilience Building Resilient Connection Handling Memory Management and Leak Prevention in Long-Lived Connections Transport-Level Optimization Transport-Level Optimization Comparing TCP, HTTP/2, and gRPC Performance in Go QUIC – Building Low-Latency Services with quic-go Low-Level and Advanced Tuning Low-Level and Advanced Tuning Socket Options That Matter Tuning DNS Performance in Go Services Optimizing TLS for Speed Connection Lifecycle Observability Stack Allocations and Escape Analysis¶ When writing performance-critical Go applications, one of the subtle but significant optimizations you can make is encouraging values to be allocated on the stack rather than the heap. Stack allocations are cheaper, faster, and garbage-free—but Go doesn't always put your variables there automatically. That decision is made by the Go compiler during escape analysis. In this article, we’ll explore what escape analysis is, how to read the compiler’s escape diagnostics, what causes values to escape, and how to structure your code to minimize unnecessary heap allocations. We'll also benchmark different scenarios to show the real-world impact. What Is Escape Analysis?¶ Escape analysis is a static analysis performed by the Go compiler to determine whether a variable can be safely allocated on the stack or if it must be moved ("escape") to the heap. Why does it matter?¶ Stack allocations are cheap: the memory is automatically freed when the function returns. Heap allocations are more expensive: they involve garbage collection overhead. The compiler decides where to place each variable based on how it's used. If a variable can be guaranteed to not outlive its declaring function, it can stay on the stack. If not, it escapes to the heap. 
Example: Stack vs Heap¶ func allocate() *int { x := 42 return &x // x escapes to the heap } func noEscape() int { x := 42 return x // x stays on the stack } In allocate, x is returned as a pointer. Since the pointer escapes the function, the Go compiler places x on the heap. In noEscape, x is a plain value and doesn’t escape. How to View Escape Analysis Output¶ You can inspect escape analysis with the -gcflags compiler option: go build -gcflags="-m" ./path/to/pkg Or for a specific file: go run -gcflags="-m" main.go This will print lines like: main.go:10:6: moved to heap: x main.go:14:6: can inline noEscape Look for messages like moved to heap to identify escape points. What Causes Variables to Escape?¶ Here are common scenarios that force heap allocation: Returning Pointers to Local Variables¶ func escape() *int { x := 10 return &x // escapes } Capturing Variables in Closures¶ func closureEscape() func() int { x := 5 return func() int { return x } // x escapes } Interface Conversions¶ When a value is stored in an interface, it may escape: func toInterface(i int) interface{} { return i // escapes if type info needed at runtime } Assignments to Global Variables or Struct Fields¶ var global *int func assignGlobal() { x := 7 global = &x // escapes } Large Composite Literals¶ Go may allocate large structs or slices on the heap even if they don’t strictly escape. func makeLargeSlice() []int { s := make([]int, 10000) // may escape due to size return s } Benchmarking Stack vs Heap Allocations¶ Let’s run a benchmark to explore when heap allocations actually occur—and when they don’t, even if we return a pointer. func StackAlloc() Data { return Data{1, 2, 3} // stays on stack } func HeapAlloc() *Data { return &Data{1, 2, 3} // escapes to heap } func BenchmarkStackAlloc(b *testing.B) { for b.Loop() { _ = StackAlloc() } } func BenchmarkHeapAlloc(b *testing.B) { for b.Loop() { _ = HeapAlloc() } } Benchmark Results Benchmark Iterations Time per op (ns) Bytes per op Allocs per op BenchmarkStackAlloc-14 1,000,000,000 0.2604 ns 0 B 0 BenchmarkHeapAlloc-14 1,000,000,000 0.2692 ns 0 B 0 You might expect HeapAlloc to always allocate memory on the heap—but it doesn’t here. That’s because the compiler is smart: in this isolated benchmark, the pointer returned by HeapAlloc doesn’t escape the function in any meaningful way. The compiler can see it’s only used within the benchmark and short-lived, so it safely places it on the stack too. Forcing a Heap Allocation¶ var sink *Data func HeapAllocEscape() { d := &Data{1, 2, 3} sink = d // d escapes to heap } func BenchmarkHeapAllocEscape(b *testing.B) { for b.Loop() { HeapAllocEscape() } } Benchmark Iterations Time per op (ns) Bytes per op Allocs per op BenchmarkHeapAllocEscape-14 331,469,049 10.55 ns 24 B 1 As shown in BenchmarkHeapAllocEscape, assigning the pointer to a global variable causes a real heap escape. This introduces real overhead: a 40x slower call, a 24-byte allocation, and one garbage-collected object per call. 
Show the benchmark file package main import "testing" type Data struct { A, B, C int } // heap-alloc-start func StackAlloc() Data { return Data{1, 2, 3} // stays on stack } func HeapAlloc() *Data { return &Data{1, 2, 3} // escapes to heap } func BenchmarkStackAlloc(b *testing.B) { for b.Loop() { _ = StackAlloc() } } func BenchmarkHeapAlloc(b *testing.B) { for b.Loop() { _ = HeapAlloc() } } // heap-alloc-end // escape-start var sink *Data func HeapAllocEscape() { d := &Data{1, 2, 3} sink = d // d escapes to heap } func BenchmarkHeapAllocEscape(b *testing.B) { for b.Loop() { HeapAllocEscape() } } // escape-end When to Optimize for Stack Allocation¶ Not all escapes are worth preventing. Here’s when it makes sense to focus on stack allocation—and when it’s better to let values escape. When to Avoid Escape In performance-critical paths. Reducing heap usage in tight loops or latency-sensitive code lowers GC pressure and speeds up execution. For short-lived, small objects. These can be efficiently stack-allocated without involving the garbage collector, reducing memory churn. When you control the full call chain. If the object stays within your code and you can restructure it to avoid escape, it’s often worth the small refactor. If profiling reveals GC bottlenecks. Escape analysis helps you target and shrink memory-heavy allocations identified in real-world traces. When It’s Fine to Let Values Escape When returning values from constructors or factories. Returning a pointer from NewThing() is idiomatic Go—even if it causes an escape, it improves clarity and usability. When objects must outlive the function. If you're storing data in a global, sending to a goroutine, or saving it in a struct, escaping is necessary and correct. When allocation size is small and infrequent. If the heap allocation isn’t in a hot path, the benefit of avoiding it is often negligible. When preventing escape hurts readability. Writing awkward code to keep everything on the stack can reduce maintainability for a micro-optimization that won’t matter.

func allocate() *int {
    x := 42
    return &x // x escapes to the heap
}

func noEscape() int {
    x := 42
    return x // x stays on the stack
}
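
To see which of these values the compiler actually moves to the heap, the guide's -gcflags="-m" invocation can be run against the package; the diagnostics below are the ones quoted in the guide, with line numbers that are purely illustrative:

go build -gcflags="-m" ./path/to/pkg
# Example diagnostics:
#   main.go:10:6: moved to heap: x
#   main.go:14:6: can inline noEscape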

Pattern 3: Go Optimization Guide GitHub Home Blog Common Performance Patterns Common Performance Patterns Memory Management & Efficiency Memory Management & Efficiency Object Pooling Memory Preallocation Struct Field Alignment Avoiding Interface Boxing Avoiding Interface Boxing Table of contents What is Interface Boxing? Why It Matters Benchmarking Impact Boxing Large Structs Passing to a Function That Accepts an Interface When Interface Boxing Is Acceptable When abstraction is more important than performance When values are small and boxing is allocation-free When values are short-lived When dynamic behavior is required How to Avoid Interface Boxing Zero-Copy Techniques Memory Efficiency and Go’s Garbage Collector Stack Allocations and Escape Analysis Concurrency and Synchronization Concurrency and Synchronization Goroutine Worker Pools Atomic Operations and Synchronization Primitives Lazy Initialization Immutable Data Sharing Efficient Context Management I/O Optimization and Throughput I/O Optimization and Throughput Efficient Buffering Batching Operations Compiler-Level Optimization and Tuning Compiler-Level Optimization and Tuning Leveraging Compiler Optimization Flags Practical Networking Patterns Practical Networking Patterns Benchmarking First Benchmarking First Benchmarking and Load Testing for Networked Go Apps Practicle example of Profiling Networked Go Applications with pprof Foundations and Core Concepts Foundations and Core Concepts How Go Handles Networking Efficient Use of net/http, net.Conn, and UDP Scaling and Performance Engineering Scaling and Performance Engineering Managing 10K+ Concurrent Connections in Go GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning Diagnostics and Resilience Diagnostics and Resilience Building Resilient Connection Handling Memory Management and Leak Prevention in Long-Lived Connections Transport-Level Optimization Transport-Level Optimization Comparing TCP, HTTP/2, and gRPC Performance in Go QUIC – Building Low-Latency Services with quic-go Low-Level and Advanced Tuning Low-Level and Advanced Tuning Socket Options That Matter Tuning DNS Performance in Go Services Optimizing TLS for Speed Connection Lifecycle Observability Avoiding Interface Boxing¶ Go’s interfaces make it easy to write flexible, decoupled code. But behind that convenience is a detail that can trip up performance: when a concrete value is assigned to an interface, Go wraps it in a hidden structure—a process called interface boxing. In many cases, boxing is harmless. But in performance-sensitive code—like tight loops, hot paths, or high-throughput services—it can introduce hidden heap allocations, extra memory copying, and added pressure on the garbage collector. These effects often go unnoticed during development, only showing up later as latency spikes or memory bloat. What is Interface Boxing?¶ Interface boxing refers to the process of converting a concrete value to an interface type. In Go, an interface value is internally represented as two words: A type descriptor, which holds information about the concrete type (its identity and method set). A data pointer, which points to the actual value being stored. When you assign a value to an interface variable, Go creates this two-part structure. If the value is a non-pointer type—like a struct or primitive—and is not already on the heap, Go may allocate a copy of it on the heap to satisfy the interface assignment. 
This behavior is especially relevant when working with large values or when storing items in a slice of interfaces, where each element gets individually boxed. These implicit allocations can add up and are a common source of hidden memory pressure in Go programs. Here’s a simple example: var i interface{} i = 42 In this case, the integer 42 is boxed into an interface: Go stores the type information (int) and a copy of the value 42. This is inexpensive for small values like int, but for large structs, the cost becomes non-trivial. Another example: type Shape interface { Area() float64 } type Square struct { Size float64 } func (s Square) Area() float64 { return s.Size * s.Size } func main() { var shapes []Shape for i := 0; i < 1000; i++ { s := Square{Size: float64(i)} shapes = append(shapes, s) // boxing occurs here } } Warning Pay attention to this code! In this example, even though shapes is a slice of interfaces, each Square value is copied into an interface when appended to shapes. If Square were a large struct, this would introduce 1000 allocations and large memory copying. To avoid that, you could pass pointers: shapes = append(shapes, &s) // avoids large struct copy This way, only an 8-byte pointer is stored in the interface, reducing both allocation size and copying overhead. Why It Matters¶ In tight loops or high-throughput paths, such as unmarshalling JSON, rendering templates, or processing large collections, interface boxing can degrade performance by triggering unnecessary heap allocations and increasing GC pressure. This overhead is especially costly in systems with high concurrency or real-time responsiveness constraints. Boxing can also make profiling and benchmarking misleading, since allocations attributed to innocuous-looking lines may actually stem from implicit conversions to interfaces. Benchmarking Impact¶ For the benchmarking we will define an interface and a struct with a significant payload that implements the interface. type Worker interface { Work() } type LargeJob struct { payload [4096]byte } func (LargeJob) Work() {} Boxing Large Structs¶ To demonstrate the real impact of boxing large values vs. pointers, we benchmarked the cost of assigning 1,000 large structs to an interface slice: func BenchmarkBoxedLargeSlice(b *testing.B) { jobs := make([]Worker, 0, 1000) for b.Loop() { jobs = jobs[:0] for j := 0; j < 1000; j++ { var job LargeJob jobs = append(jobs, job) } } } func BenchmarkPointerLargeSlice(b *testing.B) { jobs := make([]Worker, 0, 1000) for b.Loop() { jobs := jobs[:0] for j := 0; j < 1000; j++ { job := &LargeJob{} jobs = append(jobs, job) } } } Benchmark Results Benchmark Time per op (ns) Bytes per op Allocs per op BoxedLargeSliceGrowth 404,649 ~4.13 MB 1011 PointerLargeSliceGrowth 340,549 ~4.13 MB 1011 Boxing large values is significantly slower—about 19% in this case—due to the cost of copying the entire 4KB struct for each interface assignment. Boxing a pointer, however, avoids that cost and keeps the copy small (just 8 bytes). While both approaches allocate the same overall memory (since all values escape to the heap), pointer boxing has clear performance advantages under pressure. Passing to a Function That Accepts an Interface¶ Another common source of boxing is when a large value is passed directly to a function that accepts an interface. Even without storing to a slice, boxing will occur at the call site. 
var sink Worker func call(w Worker) { sink = w } func BenchmarkCallWithValue(b *testing.B) { for b.Loop() { var j LargeJob call(j) } } func BenchmarkCallWithPointer(b *testing.B) { for b.Loop() { j := &LargeJob{} call(j) } } Benchmark Results Benchmark ns/op B/op allocs/op CallWithValue 422.5 4096 1 CallWithPointer 379.9 4096 1 Passing a value to a function expecting an interface causes boxing, copying the full struct and allocating it on the heap. In our benchmark, this results in approximately 11% higher CPU cost compared to using a pointer. Passing a pointer avoids copying the struct, reduces memory movement, and results in smaller, more cache-friendly interface values, making it the more efficient choice in performance-sensitive scenarios. Show the complete benchmark file package perf import "testing" // interface-start type Worker interface { Work() } type LargeJob struct { payload [4096]byte } func (LargeJob) Work() {} // interface-end // bench-slice-start func BenchmarkBoxedLargeSlice(b *testing.B) { jobs := make([]Worker, 0, 1000) for b.Loop() { jobs = jobs[:0] for j := 0; j < 1000; j++ { var job LargeJob jobs = append(jobs, job) } } } func BenchmarkPointerLargeSlice(b *testing.B) { jobs := make([]Worker, 0, 1000) for b.Loop() { jobs := jobs[:0] for j := 0; j < 1000; j++ { job := &LargeJob{} jobs = append(jobs, job) } } } // bench-slice-end // bench-call-start var sink Worker func call(w Worker) { sink = w } func BenchmarkCallWithValue(b *testing.B) { for b.Loop() { var j LargeJob call(j) } } func BenchmarkCallWithPointer(b testing.B) { for b.Loop() { j := &LargeJob{} call(j) } } // bench-call-end When Interface Boxing Is Acceptable¶ Despite its performance implications in some contexts, interface boxing is often perfectly reasonable—and sometimes preferred. When abstraction is more important than performance¶ Interfaces enable decoupling and modularity. If you're designing a clean, testable API, the cost of boxing is negligible compared to the benefit of abstraction. type Storage interface { Save([]byte) error } func Process(s Storage) { / ... */ } When values are small and boxing is allocation-free¶ Boxing small, copyable values like int, float64, or small structs typically causes no allocations. var i interface{} i = 123 // safe and cheap When values are short-lived¶ If the boxed value is used briefly (e.g. for logging or interface-based sorting), the overhead is minimal. fmt.Println("value:", someStruct) // implicit boxing is fine When dynamic behavior is required¶ Interfaces allow runtime polymorphism. If you need different types to implement the same behavior, boxing is necessary and idiomatic. for _, s := range []Shape{Circle{}, Square{}} { fmt.Println(s.Area()) } Use boxing when it supports clarity, reusability, or design goals—and avoid it only in performance-critical code paths. How to Avoid Interface Boxing¶ Use pointers when assigning to interfaces. If the method set requires a pointer receiver or the value is large, explicitly pass a pointer to avoid repeated copying and heap allocation. for i := range tasks { result = append(result, &tasks[i]) // Avoids boxing copies } Avoid interfaces in hot paths. If the concrete type is known and stable, avoid interface indirection entirely—especially in compute-intensive or allocation-sensitive functions. Use type-specific containers. Instead of []interface{}, prefer generic slices or typed collections where feasible. This preserves static typing and reduces unnecessary allocations. Benchmark and inspect with pprof. 
Use go test -bench and pprof to observe where allocations occur. If the allocation site is in runtime.convT2E (convert T to interface), you're likely boxing.

var i interface{}
i = 42

Pattern 4: Go Optimization Guide GitHub Home Blog Common Performance Patterns Common Performance Patterns Memory Management & Efficiency Memory Management & Efficiency Object Pooling Memory Preallocation Struct Field Alignment Avoiding Interface Boxing Zero-Copy Techniques Memory Efficiency and Go’s Garbage Collector Memory Efficiency and Go’s Garbage Collector Table of contents How Go's Garbage Collector Works Non-generational Concurrent Tri-color Mark and Sweep GC Tuning: GOGC Memory Limiting with GOMEMLIMIT GOMEMLIMIT=X and GOGC=off configuration Practical Strategies for Reducing GC Pressure Prefer Stack Allocation Use sync.Pool for Short-Lived Objects Batch Allocations Weak References in Go Benchmarking Impact Stack Allocations and Escape Analysis Concurrency and Synchronization Concurrency and Synchronization Goroutine Worker Pools Atomic Operations and Synchronization Primitives Lazy Initialization Immutable Data Sharing Efficient Context Management I/O Optimization and Throughput I/O Optimization and Throughput Efficient Buffering Batching Operations Compiler-Level Optimization and Tuning Compiler-Level Optimization and Tuning Leveraging Compiler Optimization Flags Practical Networking Patterns Practical Networking Patterns Benchmarking First Benchmarking First Benchmarking and Load Testing for Networked Go Apps Practicle example of Profiling Networked Go Applications with pprof Foundations and Core Concepts Foundations and Core Concepts How Go Handles Networking Efficient Use of net/http, net.Conn, and UDP Scaling and Performance Engineering Scaling and Performance Engineering Managing 10K+ Concurrent Connections in Go GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning Diagnostics and Resilience Diagnostics and Resilience Building Resilient Connection Handling Memory Management and Leak Prevention in Long-Lived Connections Transport-Level Optimization Transport-Level Optimization Comparing TCP, HTTP/2, and gRPC Performance in Go QUIC – Building Low-Latency Services with quic-go Low-Level and Advanced Tuning Low-Level and Advanced Tuning Socket Options That Matter Tuning DNS Performance in Go Services Optimizing TLS for Speed Connection Lifecycle Observability Memory Efficiency: Mastering Go’s Garbage Collector¶ Memory management in Go is automated—but it’s not invisible. Every allocation you make contributes to GC workload. The more frequently objects are created and discarded, the more work the runtime has to do reclaiming memory. This becomes especially relevant in systems prioritizing low latency, predictable resource usage, or high throughput. Tuning your allocation patterns and leveraging newer features like weak references can help reduce pressure on the GC without adding complexity to your code. How Go's Garbage Collector Works¶ Info Highly encourage you to read the official A Guide to the Go Garbage Collector! The document provides a detailed description of multiple Go's GC internals. Go uses a non-generational, concurrent, tri-color mark-and-sweep garbage collector. Here's what that means in practice and how it's implemented. Non-generational¶ Many modern GCs, like those in the JVM or .NET CLR, divide memory into generations (young and old) under the assumption that most objects die young. These collectors focus on the young generation, which leads to shorter collection cycles. Go’s GC takes a different approach. 
It treats all objects equally—no generational segmentation—not because generational GC conflicts with short pause times or concurrent scanning, but because it hasn’t shown clear, consistent benefits in real-world Go programs with the designs tried so far. This choice avoids the complexity of promotion logic and specialized memory regions. While it can mean scanning more objects overall, this cost is mitigated by concurrent execution and efficient write barriers. Concurrent¶ Go’s GC runs concurrently with your application, which means it does most of its work without stopping the world. Concurrency is implemented using multiple phases that interleave with normal program execution: Even though Go’s garbage collector is mostly concurrent, it still requires brief Stop-The-World (STW) pauses at several points to maintain correctness. These pauses are kept extremely short—typically under 100 microseconds—even with large heaps and hundreds of goroutines. STW is essential for ensuring that memory structures are not mutated while the GC analyzes them. In most applications, these pauses are imperceptible. However, even sub-millisecond pauses in latency-sensitive systems can be significant—so understanding and monitoring STW behavior becomes important when optimizing for tail latencies or jitter. STW Start Phase: The application is briefly paused to initiate GC. The runtime scans stacks, globals, and root objects. Concurrent Mark Phase: The garbage collector traverses the heap, marking all reachable objects while the program continues running. This is the heaviest phase in terms of work but runs concurrently to avoid long stop-the-world pauses. STW Mark Termination: Once marking is mostly complete, the GC briefly pauses the program to finish any remaining work and ensure the heap is in a consistent state before sweeping begins. This pause is typically very short—measured in microseconds. Concurrent Sweep Phase: The GC reclaims memory from unreachable (white) objects and returns it to the heap for reuse, all while your program continues running. Write barriers ensure correctness while the application mutates objects during concurrent marking. These barriers help track references created or modified mid-scan so the GC doesn’t miss them. Tri-color Mark and Sweep¶ The tri-color algorithm breaks the heap into three working sets during garbage collection: White: Objects that haven’t been reached—if they stay white, they’ll be collected. Grey: Objects that have been discovered (i.e., marked as reachable) but haven’t had their references scanned yet. Black: Objects that are both reachable and fully scanned—they’re retained and don’t need further processing. Garbage collection starts by marking all root objects (stack, globals, etc.) grey. It then walks the grey set: for each object, it scans its fields. Any referenced objects that are still white are added to the grey set. Once an object’s references are fully processed, it’s marked black. When no grey objects remain, anything still white is unreachable and gets cleaned up during the sweep phase. This model ensures that no live object is accidentally collected—even if references change mid-scan—thanks to Go’s write barriers that maintain the algorithm’s core invariants. A key optimization is incremental marking: Go spreads out GC work to avoid long pauses, supported by precise stack scanning and conservative write barriers. The use of concurrent sweeping further reduces latency, allowing memory to be reclaimed without halting execution. 
This design gives Go a GC that’s safe, fast, and friendly to server workloads with large heaps and many cores. GC Tuning: GOGC¶ Go’s garbage collector is tuned to deliver good performance without manual configuration. The default GOGC setting typically strikes the right balance between memory consumption and CPU effort, adapting well across a wide range of workloads. In most cases, manually tweaking it offers little benefit—and in many, it actually makes things worse by increasing either pause times or memory pressure. Unless you’ve profiled a specific bottleneck and understand the trade-offs, it’s usually best to leave GOGC alone. That said, there are specific cases where tuning GOGC can yield significant gains. For example, Uber implemented dynamic GC tuning across their Go services to reduce CPU usage and saved tens of thousands of cores in the process. Their approach relied on profiling, metric collection, and automation to safely adjust GC behavior based on actual memory pressure and workload characteristics. Another unusual case is from Cloudflare. They profiled a high-concurrency cryptographic workload and found that Go’s GC became a bottleneck as goroutines increased. Their application produced minimal garbage, yet GC overhead grew with concurrency. By tuning GOGC to a much higher value—specifically 11300—they significantly reduced GC frequency and improved throughput, achieving over 22× performance gains compared to the single-core baseline. This case highlights how allowing more heap growth in CPU-bound and low-allocation scenarios can yield major improvements. So, if you decide to tune the garbage collector, be methodical: Always profile first. Use tools like pprof to confirm that GC activity is a bottleneck. Change settings incrementally. For example, increasing GOGC from 100 to 150 means the GC will run less frequently, using less CPU but more memory. Verify impact. After tuning, validate with profiling data that the change had a positive effect. Without that confirmation, it's easy to make things worse. GOGC=100 # Default: GC runs when heap grows 100% since last collection GOGC=off # Disables GC (use only in special cases like short-lived CLI tools) Memory Limiting with GOMEMLIMIT¶ In addition to GOGC, Go provides GOMEMLIMIT—a soft memory limit that caps the total heap size the runtime will try to stay under. This allows you to explicitly control memory growth, especially useful in environments like containers or systems with strict memory budgets. Why is this helpful? In containerized environments (like Kubernetes), memory limits are typically enforced at the OS or orchestrator level. If your application exceeds its memory quota, the OOM killer may abruptly terminate the container. Go's GC isn't aware of those limits by default. Setting a GOMEMLIMIT helps prevent this. For example, if your container has a 512MiB memory limit, you might set: GOMEMLIMIT=400MiB This buffer gives the Go runtime room to act before reaching the hard system-imposed memory cap. It allows the garbage collector to become more aggressive as total memory usage grows, reducing the chances of the process being killed due to an out-of-memory condition. It also leaves space for non-heap allocations—like goroutine stacks, OS threads, and other internal runtime structures—which don’t count toward heap size but still consume real memory. 
You can also set the limit programmatically: import "runtime/debug" debug.SetMemoryLimit(2 << 30) // 2 GiB The GC will become more aggressive as heap usage nears the limit, which can increase CPU load. Be careful not to set the limit too low—especially if your application maintains a large live set of objects—or you may trigger excessive GC cycles. While GOGC controls how frequently the GC runs based on heap growth, GOMEMLIMIT constrains the heap size itself. The two can be combined for more precise control: GOGC=100 GOMEMLIMIT=4GiB ./your-service This tells the GC to operate with the default growth ratio and to start collecting sooner if heap usage nears 4 GiB. GOMEMLIMIT=X and GOGC=off configuration¶ In scenarios where memory availability is fixed and predictable—such as within containers or VMs, you can use these two variables together: GOMEMLIMIT=X tells the runtime to aim for a specific memory ceiling. For example, GOMEMLIMIT=2GiB will trigger garbage collection when total memory usage nears 2 GiB. GOGC=off disables the default GC pacing algorithm, so garbage collection only runs when the memory limit is hit. This configuration maximizes memory usage efficiency and avoids the overhead of frequent GC cycles. It's especially effective in high-throughput or latency-sensitive systems where predictable memory usage matters. Example: GOMEMLIMIT=2GiB GOGC=off ./my-app With this setup, memory usage grows freely until the 2 GiB threshold is reached. At that point, Go performs a full garbage collection pass. Warning Always benchmark with your real workload. Disabling automatic GC can backfire if your application produces a lot of short-lived allocations. Monitor memory pressure and GC pause times using runtime.ReadMemStats or pprof. This approach works best when your memory usage patterns are well understood and stable. Practical Strategies for Reducing GC Pressure¶ Prefer Stack Allocation¶ Go allocates variables on the stack whenever possible. Avoid escaping variables to the heap: // BAD: returns pointer to heap-allocated struct func newUser(name string) *User { return &User{Name: name} // escapes to heap } // BETTER: use value types if pointer is unnecessary func printUser(u User) { fmt.Println(u.Name) } Use go build -gcflags="-m" to view escape analysis diagnostics. See Stack Allocations and Escape Analysis for more details. Use sync.Pool for Short-Lived Objects¶ sync.Pool is ideal for temporary, reusable allocations that are expensive to GC. var bufPool = sync.Pool{ New: func() any { return new(bytes.Buffer) }, } func handler(w http.ResponseWriter, r *http.Request) { buf := bufPool.Get().(*bytes.Buffer) buf.Reset() defer bufPool.Put(buf) // Use buf... } See Object Pooling for more details. Batch Allocations¶ Group allocations into fewer objects to reduce GC pressure. // Instead of allocating many small structs, allocate a slice of structs users := make([]User, 0, 1000) // single large allocation See Memory Preallocation for more details. Weak References in Go¶ Go 1.24 added the weak package, providing a standardized way to create weak references—pointers that don’t keep their target objects alive. In garbage-collected systems like Go, strong references extend an object’s lifetime: as long as something points to it, it won’t be collected. That’s usually what you want, but in structures like caches, deduplication maps, or object graphs, this can lead to memory staying alive much longer than intended. 
Weak references solve that by allowing you to refer to an object without blocking the GC from reclaiming it when nothing else is using it. A weak reference, by contrast, tells the garbage collector: “you can collect this object if nothing else is strongly referencing it.” This pattern is important for building memory-sensitive data structures that should not interfere with garbage collection. package main import ( "fmt" "runtime" "weak" ) type Data struct { Value string } func main() { data := &Data{Value: "Important"} wp := weak.Make(data) // create weak pointer fmt.Println("Original:", wp.Value().Value) data = nil // remove strong reference runtime.GC() if v := wp.Value(); v != nil { fmt.Println("Still alive:", v.Value) } else { fmt.Println("Data has been collected") } } Original: Important Data has been collected In this example, wp holds a weak reference to a Data object. After the strong reference (data) goes out of scope and the garbage collector runs, the Data may be collected—at which point wp.Value() will return nil. This pattern is especially useful in memory-sensitive contexts like caches or canonicalization maps, where you want to avoid artificially extending object lifetimes. Always check the result of Value() before using it, since the target may have been reclaimed. Benchmarking Impact¶ It's tempting to rely on synthetic benchmarks to evaluate the performance of Go's garbage collector, but generic benchmarks rarely capture the nuances of real-world workloads. Memory behavior is highly dependent on allocation patterns, object lifetimes, concurrency, and how frequently short-lived versus long-lived data structures are used. For example, the impact of GC in a CPU-bound microservice that maintains large in-memory indexes will differ dramatically from an I/O-heavy API server with minimal heap usage. As such, tuning decisions should always be informed by your application's profiling data. We cover targeted use cases and their GC performance trade-offs in more focused articles: Object Pooling: Reducing allocation churn using sync.Pool Stack Allocations and Escape Analysis: Minimizing heap usage by keeping values on the stack Memory Preallocation: Avoiding unnecessary growth of slices and maps When applied to the right context, these techniques can make a measurable difference, but they don’t lend themselves to one-size-fits-all benchmarks.

GOGC

Pattern 5: Example:

GOMEMLIMIT=2GiB GOGC=off ./my-app
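
The same configuration can also be applied from inside the program; a minimal sketch, where the 2 GiB figure simply mirrors the example above and debug.SetGCPercent(-1) is the programmatic counterpart of GOGC=off:

package main

import "runtime/debug"

func init() {
    debug.SetMemoryLimit(2 << 30) // soft heap limit of 2 GiB, like GOMEMLIMIT=2GiB
    debug.SetGCPercent(-1)        // disable GOGC-based pacing, like GOGC=off
}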

Pattern 6: Go Optimization Guide GitHub Home Blog Common Performance Patterns Common Performance Patterns Memory Management & Efficiency Memory Management & Efficiency Object Pooling Memory Preallocation Struct Field Alignment Struct Field Alignment Table of contents Why Alignment Matters Benchmarking Impact Avoiding False Sharing in Concurrent Workloads When To Align Structs Avoiding Interface Boxing Zero-Copy Techniques Memory Efficiency and Go’s Garbage Collector Stack Allocations and Escape Analysis Concurrency and Synchronization Concurrency and Synchronization Goroutine Worker Pools Atomic Operations and Synchronization Primitives Lazy Initialization Immutable Data Sharing Efficient Context Management I/O Optimization and Throughput I/O Optimization and Throughput Efficient Buffering Batching Operations Compiler-Level Optimization and Tuning Compiler-Level Optimization and Tuning Leveraging Compiler Optimization Flags Practical Networking Patterns Practical Networking Patterns Benchmarking First Benchmarking First Benchmarking and Load Testing for Networked Go Apps Practicle example of Profiling Networked Go Applications with pprof Foundations and Core Concepts Foundations and Core Concepts How Go Handles Networking Efficient Use of net/http, net.Conn, and UDP Scaling and Performance Engineering Scaling and Performance Engineering Managing 10K+ Concurrent Connections in Go GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning Diagnostics and Resilience Diagnostics and Resilience Building Resilient Connection Handling Memory Management and Leak Prevention in Long-Lived Connections Transport-Level Optimization Transport-Level Optimization Comparing TCP, HTTP/2, and gRPC Performance in Go QUIC – Building Low-Latency Services with quic-go Low-Level and Advanced Tuning Low-Level and Advanced Tuning Socket Options That Matter Tuning DNS Performance in Go Services Optimizing TLS for Speed Connection Lifecycle Observability Struct Field Alignment¶ When optimizing Go programs for performance, struct layout and memory alignment often go unnoticed—yet they have a measurable impact on memory usage and cache efficiency. Go automatically aligns struct fields based on platform-specific rules, inserting padding to satisfy alignment constraints. Understanding and controlling memory alignment isn’t just a low-level detail—it can have a real impact on how your Go programs perform, especially in tight loops or high-throughput systems. Proper alignment can reduce the overall memory footprint, make better use of CPU caches, and eliminate subtle performance penalties that add up under load. Why Alignment Matters¶ Modern CPUs are tuned for predictable memory access. When struct fields are misaligned or split across cache lines, the processor often has to do extra work to fetch the data. That can mean additional memory cycles, more cache misses, and slower performance overall. These costs are easy to overlook in everyday code but show up quickly in code that’s sensitive to throughput or latency. In Go, struct fields are aligned according to their type requirements, and the compiler inserts padding bytes to meet these constraints. If fields are arranged without care, unnecessary padding may inflate struct size significantly, affecting memory use and bandwidth. 
Consider the following two structs: type PoorlyAligned struct { flag bool count int64 id byte } type WellAligned struct { count int64 flag bool id byte } On a 64-bit system, PoorlyAligned requires 24 bytes due to the padding between fields, whereas WellAligned fits into 16 bytes by ordering fields from largest to smallest alignment requirement. Benchmarking Impact¶ We benchmarked both struct layouts by allocating 10 million instances of each and measuring allocation time and memory usage: func BenchmarkPoorlyAligned(b *testing.B) { for b.Loop() { var items = make([]PoorlyAligned, 10_000_000) for j := range items { items[j].count = int64(j) } } } func BenchmarkWellAligned(b *testing.B) { for b.Loop() { var items = make([]WellAligned, 10_000_000) for j := range items { items[j].count = int64(j) } } } Benchmark Results Benchmark Iterations Time per op (ns) Bytes per op Allocs per op PoorlyAligned-14 177 20,095,621 240,001,029 1 WellAligned-14 186 19,265,714 160,006,148 1 In a test with 10 million structs, the WellAligned version used 80MB less memory than its poorly aligned counterpart—and it also ran a bit faster. This isn’t just about saving RAM; it shows how struct layout directly affects allocation behavior and memory bandwidth. When you’re working with large volumes of data or performance-critical paths, reordering fields for better alignment can lead to measurable gains with minimal effort. Avoiding False Sharing in Concurrent Workloads¶ In addition to memory layout efficiency, struct alignment also plays a crucial role in concurrent systems. When multiple goroutines access different fields of the same struct that reside on the same CPU cache line, they may suffer from false sharing—where changes to one field cause invalidations in the other, even if logically unrelated. On modern CPUs, a typical cache line is 64 bytes wide. When a struct is accessed in memory, the CPU loads the entire cache line that contains it, not just the specific field. This means that two unrelated fields within the same 64-byte block will both reside in the same line—even if they are used independently by separate goroutines. If one goroutine writes to its field, the cache line becomes invalidated and must be reloaded on the other core, leading to degraded performance due to false sharing. To illustrate, we compared two structs—one vulnerable to false sharing, and another with padding to separate fields across cache lines: type SharedCounterBad struct { a int64 b int64 } type SharedCounterGood struct { a int64 _ [56]byte // Padding to prevent a and b from sharing a cache line b int64 } Each field is incremented by a separate goroutine 1 million times: func BenchmarkFalseSharing(b *testing.B) { var c SharedCounterBad // (1) var wg sync.WaitGroup for b.Loop() { wg.Add(2) go func() { for i := 0; i < 1_000_000; i++ { c.a++ } wg.Done() }() go func() { for i := 0; i < 1_000_000; i++ { c.b++ } wg.Done() }() wg.Wait() } } FalseSharing and NoFalseSharing benchmarks are identical, except we will use SharedCounterGood for the NoFalseSharing benchmark. Benchmark Results: Benchmark Time per op (ns) Bytes per op Allocs per op FalseSharing 996,234 55 2 NoFalseSharing 958,180 58 2 Placing padding between the two fields prevented false sharing, resulting in a measurable performance improvement. The version with padding completed ~3.8% faster (the value could vary between re-runs from 3% to 6%), which can make a difference in tight concurrent loops or high-frequency counters. 
It also shows how false sharing may unpredictably affect memory use due to invalidation overhead. Show the complete benchmark file package perf import ( "sync" "testing" ) // types-simple-start type PoorlyAligned struct { flag bool count int64 id byte } type WellAligned struct { count int64 flag bool id byte } // types-simple-end // simple-start func BenchmarkPoorlyAligned(b *testing.B) { for b.Loop() { var items = make([]PoorlyAligned, 10_000_000) for j := range items { items[j].count = int64(j) } } } func BenchmarkWellAligned(b *testing.B) { for b.Loop() { var items = make([]WellAligned, 10_000_000) for j := range items { items[j].count = int64(j) } } } // simple-end // types-shared-start type SharedCounterBad struct { a int64 b int64 } type SharedCounterGood struct { a int64 _ [56]byte // Padding to prevent a and b from sharing a cache line b int64 } // types-shared-end // shared-start func BenchmarkFalseSharing(b *testing.B) { var c SharedCounterBad // (1) var wg sync.WaitGroup for b.Loop() { wg.Add(2) go func() { for i := 0; i < 1_000_000; i++ { c.a++ } wg.Done() }() go func() { for i := 0; i < 1_000_000; i++ { c.b++ } wg.Done() }() wg.Wait() } } // shared-end func BenchmarkNoFalseSharing(b *testing.B) { var c SharedCounterGood var wg sync.WaitGroup for b.Loop() { wg.Add(2) go func() { for i := 0; i < 1_000_000; i++ { c.a++ } wg.Done() }() go func() { for i := 0; i < 1_000_000; i++ { c.b++ } wg.Done() }() wg.Wait() } } When To Align Structs¶ Always align structs. It's free to implement and often leads to better memory efficiency without changing any logic—only field order needs to be adjusted. Guidelines for struct alignment: Order fields from largest to smallest. Starting with larger fields helps the compiler avoid inserting padding to meet alignment requirements. Smaller fields can fill in the gaps naturally. Group fields of the same size together. This lets the compiler pack them more efficiently and minimizes wasted space. Insert padding intentionally when needed. In concurrent code, separating fields that are accessed by different goroutines can prevent false sharing—a subtle but costly issue where multiple goroutines compete over the same cache line. Avoid interleaving small and large fields. Mixing sizes leads to inefficient memory usage due to extra alignment padding between fields. Use the fieldalignment linter to verify. This tool helps catch suboptimal layouts automatically during development.

type PoorlyAligned struct {
    flag bool
    count int64
    id byte
}

type WellAligned struct {
    count int64
    flag bool
    id byte
}
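
The guide recommends the fieldalignment linter to catch suboptimal layouts automatically. As a sketch (the install path below is an assumption based on the golang.org/x/tools module layout), it can be run as a standalone tool:

go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest
fieldalignment ./...   # add -fix to let it reorder struct fields automatically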

Pattern 7: Go Optimization Guide GitHub Home Blog Common Performance Patterns Common Performance Patterns Memory Management & Efficiency Memory Management & Efficiency Object Pooling Memory Preallocation Struct Field Alignment Avoiding Interface Boxing Zero-Copy Techniques Memory Efficiency and Go’s Garbage Collector Stack Allocations and Escape Analysis Concurrency and Synchronization Concurrency and Synchronization Goroutine Worker Pools Atomic Operations and Synchronization Primitives Lazy Initialization Immutable Data Sharing Immutable Data Sharing Table of contents Why Immutable Data? Practical Example: Shared Config Step 1: Define the Config Struct Step 2: Ensure Deep Immutability Step 3: Atomic Swapping Step 4: Using It in Handlers Practical Example: Immutable Routing Table Step 1: Define Route Structs Step 2: Build Immutable Version Step 3: Store It Atomically Step 4: Route Requests Concurrently Scaling Immutable Routing Tables Scenario 1: Segmented Routing Scenario 2: Indexed Routing Table Scenario 3: Hybrid Staging and Publishing Benchmarking Impact When to Use This Pattern Efficient Context Management I/O Optimization and Throughput I/O Optimization and Throughput Efficient Buffering Batching Operations Compiler-Level Optimization and Tuning Compiler-Level Optimization and Tuning Leveraging Compiler Optimization Flags Practical Networking Patterns Practical Networking Patterns Benchmarking First Benchmarking First Benchmarking and Load Testing for Networked Go Apps Practicle example of Profiling Networked Go Applications with pprof Foundations and Core Concepts Foundations and Core Concepts How Go Handles Networking Efficient Use of net/http, net.Conn, and UDP Scaling and Performance Engineering Scaling and Performance Engineering Managing 10K+ Concurrent Connections in Go GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning Diagnostics and Resilience Diagnostics and Resilience Building Resilient Connection Handling Memory Management and Leak Prevention in Long-Lived Connections Transport-Level Optimization Transport-Level Optimization Comparing TCP, HTTP/2, and gRPC Performance in Go QUIC – Building Low-Latency Services with quic-go Low-Level and Advanced Tuning Low-Level and Advanced Tuning Socket Options That Matter Tuning DNS Performance in Go Services Optimizing TLS for Speed Connection Lifecycle Observability Immutable Data Sharing¶ One common source of slowdown in high-performance Go programs is the way shared data is accessed under concurrency. The usual tools—mutexes and channels—work well, but they’re not free. Mutexes can become choke points if many goroutines try to grab the same lock. Channels, while elegant for coordination, can introduce blocking and make control flow harder to reason about. Both require careful use: it’s easy to introduce subtle bugs or unexpected performance issues if synchronization isn’t tight. A powerful alternative is immutable data sharing. Instead of protecting data with locks, you design your system so that shared data is never mutated after it's created. This minimizes contention and simplifies reasoning about your program. Why Immutable Data?¶ Immutability brings several advantages to concurrent programs: No locks needed: Multiple goroutines can safely read immutable data without synchronization. Easier reasoning: If data can't change, you avoid entire classes of race conditions. 
  • Copy-on-write optimizations: You can create new versions of a structure without altering the original, which is useful for config reloading or versioning a state.

Practical Example: Shared Config¶

Imagine you have a long-running service that periodically reloads its configuration from a disk or a remote source. Multiple goroutines read this configuration to make decisions. Here's how immutable data helps:

Step 1: Define the Config Struct¶

// config.go
type Config struct {
    LogLevel string
    Timeout  time.Duration
    Features map[string]bool // This needs attention!
}

Step 2: Ensure Deep Immutability¶

Maps and slices in Go are reference types. Even if the Config struct isn't changed, someone could accidentally mutate a shared map. To prevent this, we make defensive copies:

func NewConfig(logLevel string, timeout time.Duration, features map[string]bool) *Config {
    copiedFeatures := make(map[string]bool, len(features))
    for k, v := range features {
        copiedFeatures[k] = v
    }
    return &Config{
        LogLevel: logLevel,
        Timeout:  timeout,
        Features: copiedFeatures,
    }
}

Now, every config instance is self-contained and safe to share.

Step 3: Atomic Swapping¶

Use atomic.Pointer to store and safely update the current config.

var currentConfig atomic.Pointer[Config]

func LoadInitialConfig() {
    cfg := NewConfig("info", 5*time.Second, map[string]bool{"beta": true})
    currentConfig.Store(cfg)
}

func GetConfig() *Config {
    return currentConfig.Load()
}

Now all goroutines can safely call GetConfig() with no locks. When the config is reloaded, you just Store a new immutable copy.

Step 4: Using It in Handlers¶

func handler(w http.ResponseWriter, r *http.Request) {
    cfg := GetConfig()
    if cfg.Features["beta"] {
        // Enable beta path
    }
    // Use cfg.Timeout, cfg.LogLevel, etc.
}
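To make the reload step concrete, here is a minimal sketch, not from the original article, that reuses the NewConfig constructor and currentConfig pointer defined above (the function name ReloadConfig is illustrative):

func ReloadConfig(logLevel string, timeout time.Duration, features map[string]bool) {
    // Build a fresh, deep-copied config and publish it atomically.
    // Goroutines that already loaded the old *Config keep using it;
    // subsequent GetConfig() calls observe the new version.
    cfg := NewConfig(logLevel, timeout, features)
    currentConfig.Store(cfg)
}
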
Practical Example: Immutable Routing Table¶

Suppose you're building a lightweight reverse proxy or API gateway and must route incoming requests based on path or host. The routing table is read thousands of times per second and updated only occasionally (e.g., from a config file or service discovery).

Step 1: Define Route Structs¶

type Route struct {
    Path    string
    Backend string
}

type RoutingTable struct {
    Routes []Route
}

Step 2: Build Immutable Version¶

To ensure immutability, we deep-copy the slice of routes when constructing a new routing table.

func NewRoutingTable(routes []Route) *RoutingTable {
    copied := make([]Route, len(routes))
    copy(copied, routes)
    return &RoutingTable{Routes: copied}
}

Step 3: Store It Atomically¶

var currentRoutes atomic.Pointer[RoutingTable]

func LoadInitialRoutes() {
    table := NewRoutingTable([]Route{
        {Path: "/api", Backend: "http://api.internal"},
        {Path: "/admin", Backend: "http://admin.internal"},
    })
    currentRoutes.Store(table)
}

func GetRoutingTable() *RoutingTable {
    return currentRoutes.Load()
}

Step 4: Route Requests Concurrently¶

func routeRequest(path string) string {
    table := GetRoutingTable()
    for _, route := range table.Routes {
        if strings.HasPrefix(path, route.Path) {
            return route.Backend
        }
    }
    return ""
}

Now, your routing logic can scale safely under load with zero locking overhead.

Scaling Immutable Routing Tables¶

As systems grow, routing tables can expand to hundreds or even thousands of entries. While immutability brings clear benefits—safe concurrent access, predictable behavior—it becomes costly if every update means copying the entire structure. At some point, rebuilding the whole table for each minor change doesn’t scale. To keep immutability without paying for full reconstruction on every update, the design needs to evolve.

There are several ways to do this—each preserving the core benefits while reducing overhead.

Scenario 1: Segmented Routing¶

Imagine a multi-tenant system where each customer has their own set of routing rules. Instead of one giant slice of routes, you can split them into a map:

type MultiTable struct {
    Tables map[string]RoutingTable // key = tenant ID
}

If only customer "acme" updates their rules, you clone just that slice and update the map. Then you atomically swap in a new version of the full map. All other tenants continue using their existing, untouched routing tables. (A clone-and-swap sketch appears at the end of this section.) This approach reduces memory pressure and speeds up updates without losing immutability. It also isolates blast radius: a broken rule set in one segment doesn’t affect others.

Scenario 2: Indexed Routing Table¶

Let’s say your router matches by exact path, and lookup speed is critical. You can use a map[string]RouteHandler as an index:

type RouteIndex map[string]RouteHandler

When a new path is added, clone the current map, add the new route, and publish the new version. Because the copy is shallow, this is fast for moderate numbers of routes. Reads are constant time, and updates are efficient because only a small part of the structure changes.

Scenario 3: Hybrid Staging and Publishing¶

Suppose you’re doing a batch update — maybe reading hundreds of routes from a database. Instead of rebuilding live, you keep a mutable staging area:

var mu sync.Mutex
var stagingRoutes []Route

You load and manipulate data in staging under a mutex, then convert to an immutable RoutingTable and store it atomically. This lets you safely prepare complex changes without locking readers or affecting live traffic.

Benchmarking Impact¶

Benchmarking immutable data sharing in real-world systems is difficult to do in a generic, meaningful way. Factors like structure size, read/write ratio, and memory layout all heavily influence results. Rather than presenting artificial benchmarks here, we recommend reviewing the results in the Atomic Operations and Synchronization Primitives article. Those benchmarks clearly illustrate the potential performance benefits of using atomic.Value over traditional synchronization primitives like sync.RWMutex, especially in highly concurrent read scenarios.

When to Use This Pattern¶

Immutable data sharing is ideal when:

  • The data is read-heavy and write-light (e.g., configuration, feature flags, global mappings). This works well because the cost of creating new immutable versions is amortized over many reads, and avoiding locks provides a performance boost.
  • You want to minimize locking without sacrificing safety. By sharing read-only data, you remove the need for mutexes or coordination, reducing the chances of deadlocks or race conditions.
  • You can tolerate minor delays between update and read (eventual consistency). Since data updates are not coordinated with readers, there might be a small delay before all goroutines see the new version. If exact timing isn't critical, this tradeoff simplifies your concurrency model.

It’s less suitable when updates must be transactional across multiple pieces of data or happen frequently. In those cases, the cost of repeated copying or lack of coordination can outweigh the benefits.
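Here is the clone-and-swap sketch referenced in Scenario 1: a minimal illustration, not from the original article, that assumes the MultiTable, RoutingTable, Route, and NewRoutingTable definitions above, plus a package-level atomic.Pointer holding the current set of tables.

var currentTables atomic.Pointer[MultiTable]

// UpdateTenantRoutes rebuilds only the given tenant's routing table.
// The other tenants' tables are carried over unchanged into a new
// MultiTable, which is then published atomically. A single publishing
// goroutine is assumed; see the note below for concurrent writers.
func UpdateTenantRoutes(tenant string, routes []Route) {
    old := currentTables.Load()
    if old == nil {
        old = &MultiTable{}
    }

    next := &MultiTable{Tables: make(map[string]RoutingTable, len(old.Tables)+1)}
    for k, v := range old.Tables {
        next.Tables[k] = v // shallow copy: untouched tenants keep their existing routes
    }
    next.Tables[tenant] = *NewRoutingTable(routes) // deep-copied routes for the updated tenant

    currentTables.Store(next)
}

If more than one goroutine can publish updates, guard the load-copy-store sequence with a writer-side mutex, as in the staging scenario; readers still need no locks.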

// config.go
type Config struct {
    LogLevel string
    Timeout  time.Duration
    Features map[string]bool // This needs attention!
}

Example Code Patterns

Example 1 (go):

func StreamData(src io.Reader, dst io.Writer) error {
    buf := make([]byte, 4096) // Reusable buffer
    _, err := io.CopyBuffer(dst, src, buf)
    return err
}

Example 2 (go):

func allocate() *int {
    x := 42
    return &x // x escapes to the heap
}

func noEscape() int {
    x := 42
    return x // x stays on the stack
}

Example 3 (go):

var (
    resource *MyResource
    once     sync.Once
)

func getResource() *MyResource {
    once.Do(func() {
        resource = expensiveInit()
    })
    return resource
}

Example 4 (go):

var getResource = sync.OnceValue(func() *MyResource {
    return expensiveInit()
})

func processData() {
    res := getResource()
    // use res
}

Example 5 (go):

var (
    batch []string
    f     *os.File // destination log file, opened elsewhere (e.g. via os.OpenFile)
)

func logBatch(line string) {
    batch = append(batch, line)
    if len(batch) >= 100 {
        f.WriteString(strings.Join(batch, "\n") + "\n")
        batch = batch[:0]
    }
}

Reference Files

This skill includes comprehensive documentation in references/:

  • compiler_optimization.md - Compiler Optimization documentation
  • concurrency.md - Concurrency documentation
  • escape_analysis.md - Escape Analysis documentation
  • garbage_collector.md - Garbage Collector documentation
  • io_optimization.md - Io Optimization documentation
  • memory_management.md - Memory Management documentation

Use view to read specific reference files when detailed information is needed.

Working with This Skill

For Beginners

Start with the memory_management.md and concurrency.md reference files for foundational concepts.

For Specific Features

Use the appropriate category reference file (compiler_optimization.md, io_optimization.md, etc.) for detailed information.

For Code Examples

The quick reference section above contains common patterns extracted from the official docs.

Resources

references/

Organized documentation extracted from official sources. These files contain:

  • Detailed explanations
  • Code examples with language annotations
  • Links to original documentation
  • Table of contents for quick navigation

scripts/

Add helper scripts here for common automation tasks.

assets/

Add templates, boilerplate, or example projects here.

Notes

  • This skill was automatically generated from official documentation
  • Reference files preserve the structure and examples from source docs
  • Code examples include language detection for better syntax highlighting
  • Quick reference patterns are extracted from common usage examples in the docs

Updating

To refresh this skill with updated documentation:

  1. Re-run the scraper with the same configuration
  2. The skill will be rebuilt with the latest information