Building Resilient Distributed Systems with Go

Introduction

Building distributed systems is hard. Really hard. When you’re dealing with multiple services communicating over a network, you need to expect failures and design for resilience from day one.

In this post, I’ll share some patterns I’ve found useful when building distributed systems with Go.

The Circuit Breaker Pattern

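One of my favorite patterns is the circuit breaker. It works like a safety switch for your system: when a downstream service starts failing, the breaker “trips” (opens) and rejects requests immediately for a timeout period, giving the service time to recover. Once the timeout has passed, it lets a trial request through (the half-open state) and closes again if that request succeeds.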

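import (
    "errors"
    "sync"
    "time"
)

// State is the circuit breaker's current mode.
type State int

const (
    Closed   State = iota // requests flow normally
    Open                  // requests are rejected immediately
    HalfOpen              // a trial request is allowed through
)

// ErrCircuitOpen is returned when the breaker rejects a call.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    mu          sync.Mutex // guards the fields below
    maxFailures int
    timeout     time.Duration
    failures    int
    lastFailure time.Time
    state       State
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    if cb.state == Open {
        if time.Since(cb.lastFailure) <= cb.timeout {
            cb.mu.Unlock()
            return ErrCircuitOpen
        }
        // The timeout has elapsed: let a trial request through.
        cb.state = HalfOpen
    }
    cb.mu.Unlock()

    // Run fn without holding the lock so a slow call doesn't block other callers.
    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        // Trip on too many failures, or as soon as a half-open trial fails.
        if cb.state == HalfOpen || cb.failures >= cb.maxFailures {
            cb.state = Open
        }
        return err
    }

    // Success: reset the failure count and close the circuit.
    cb.failures = 0
    cb.state = Closed
    return nil
}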

Retry with Exponential Backoff

Sometimes services are just temporarily unavailable. In these cases, retrying with exponential backoff can help smooth over transient failures:

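import (
    "context"
    "time"
)

// RetryWithBackoff calls fn up to maxRetries times, waiting 1s, 2s, 4s, ...
// between attempts. It returns early if ctx is cancelled while waiting.
func RetryWithBackoff(ctx context.Context, fn func() error, maxRetries int) error {
    var err error
    for i := 0; i < maxRetries; i++ {
        err = fn()
        if err == nil {
            return nil
        }
        // Don't wait after the final attempt; just report the last error.
        if i == maxRetries-1 {
            break
        }

        // Doubles on every attempt; assumes maxRetries is small enough
        // that the shift does not overflow.
        backoff := time.Second << uint(i)
        timer := time.NewTimer(backoff)
        select {
        case <-timer.C:
        case <-ctx.Done():
            timer.Stop()
            return ctx.Err()
        }
    }
    return err
}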
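
A delay that doubles on every attempt grows quickly, so in practice it helps to cap it. Here is a minimal sketch of how the wait could be computed separately from the retry loop. The helper BackoffDelay and the constants DefaultBase and DefaultMax are hypothetical names for illustration, not part of RetryWithBackoff above, and the values are assumptions you would tune for your own service.

```go
package main

import "time"

// Hypothetical defaults for this sketch; tune them for your service.
const (
	DefaultBase = time.Second
	DefaultMax  = 30 * time.Second
)

// BackoffDelay returns the wait before retry number attempt (0-based):
// base, 2*base, 4*base, ... and never more than limit.
func BackoffDelay(attempt int, base, limit time.Duration) time.Duration {
	delay := base
	for i := 0; i < attempt; i++ {
		if delay > limit/2 {
			return limit // doubling again would exceed the cap
		}
		delay *= 2
	}
	if delay > limit {
		return limit
	}
	return delay
}
```

With these defaults the waits would be 1s, 2s, 4s, 8s, 16s, and then 30s for every later attempt. Checking against the cap before doubling also avoids overflowing time.Duration when the attempt count is large.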

Graceful Degradation

Not all features are equally critical. When things go wrong, gracefully degrading functionality is often better than complete failure.

For example, if your recommendation service is down, you could:

Show popular items instead
Show items from cache
Hide the recommendations section entirely

The key is to keep the core user journey working.
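
One way to express that list of fallbacks in code is as an ordered chain of sources, tried one after another. This is only a sketch: Source, FirstAvailable, and ErrSourceUnavailable are hypothetical names I made up for illustration, and a real version would likely pass a context.Context and log why each source was skipped.

```go
package main

import "errors"

// ErrSourceUnavailable is a hypothetical sentinel error a source can return
// when it cannot serve recommendations right now.
var ErrSourceUnavailable = errors.New("source unavailable")

// Source is one way of producing recommendations (hypothetical type).
type Source struct {
	Name  string
	Fetch func() ([]string, error)
}

// FirstAvailable tries each source in order and returns the items from the
// first one that succeeds with a non-empty result, plus that source's name.
// If every source fails it returns (nil, ""), which the caller can treat as
// "hide the recommendations section".
func FirstAvailable(sources ...Source) ([]string, string) {
	for _, s := range sources {
		items, err := s.Fetch()
		if err != nil || len(items) == 0 {
			continue // degrade to the next option
		}
		return items, s.Name
	}
	return nil, ""
}
```

You would call it with the live recommendation service first, then the cache, then a popular-items list. The returned name tells you which level of degradation the user ended up with, which is a useful thing to put on a dashboard.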

Conclusion

Building resilient distributed systems requires thinking about failure modes from the start. Patterns like circuit breakers, retries, and graceful degradation can help your system stay available even when components fail.

What patterns do you use? Let me know!