Building Resilient Distributed Systems with Go
Exploring patterns and practices for building fault-tolerant distributed systems using Go, including circuit breakers, retries, and graceful degradation.
Introduction
Building distributed systems is hard. Really hard. When you’re dealing with multiple services communicating over a network, you need to expect failures and design for resilience from day one.
In this post, I’ll share some patterns I’ve found useful when building distributed systems with Go.
The Circuit Breaker Pattern
One of my favorite patterns is the circuit breaker. It’s like a safety switch for your system - when a downstream service starts failing, the circuit breaker “trips” and temporarily stops sending requests to give it time to recover.
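import (
	"errors"
	"sync"
	"time"
)

// State is the mode the breaker is currently in.
type State int

const (
	Closed   State = iota // requests pass through
	Open                  // requests are rejected immediately
	HalfOpen              // a trial request checks for recovery
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	mu          sync.Mutex // guards the fields below
	maxFailures int
	timeout     time.Duration
	failures    int
	lastFailure time.Time
	state       State
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.state == Open {
		if time.Since(cb.lastFailure) <= cb.timeout {
			cb.mu.Unlock()
			return ErrCircuitOpen
		}
		cb.state = HalfOpen // timeout elapsed: let a trial request through
	}
	cb.mu.Unlock()

	err := fn() // run the call without holding the lock

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailure = time.Now()
		// A failed trial request reopens the circuit right away.
		if cb.state == HalfOpen || cb.failures >= cb.maxFailures {
			cb.state = Open
		}
		return err
	}
	cb.failures = 0
	cb.state = Closed
	return nil
}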
Retry with Exponential Backoff
Sometimes services are just temporarily unavailable. In these cases, retrying with exponential backoff can help smooth over transient failures:
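import (
	"context"
	"time"
)

func RetryWithBackoff(ctx context.Context, fn func() error, maxRetries int) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == maxRetries-1 {
			break // no point waiting after the final attempt
		}
		backoff := time.Second << i // 1s, 2s, 4s, ...
		timer := time.NewTimer(backoff)
		select {
		case <-timer.C:
			// backoff elapsed, try again
		case <-ctx.Done():
			timer.Stop() // release the timer before giving up
			return ctx.Err()
		}
	}
	return err
}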
Graceful Degradation
Not all features are equally critical. When things go wrong, gracefully degrading functionality is often better than complete failure.
For example, if your recommendation service is down, you could:
- Show popular items instead
- Show items from cache
- Hide the recommendations section entirely
The key is to keep the core user journey working.
Conclusion
Building resilient distributed systems requires thinking about failure modes from the start. Patterns like circuit breakers, retries, and graceful degradation can help your system stay available even when components fail.
What patterns do you use? Let me know!