Webhook Retry Strategies and Exponential Backoff Explained
A webhook that fires once and gives up isn't a webhook you can rely on. Networks are unreliable, servers go down for deployments, and rate limits kick in at the worst possible moment. If your webhook system doesn't have a solid retry strategy, you're going to lose events — and your users are going to lose trust.
This post breaks down the most common retry strategies, explains why exponential backoff with jitter is the industry standard, and gives you production-ready implementations in Node.js and Python.
Why Retries Are Non-Negotiable
Here's a scenario that happens every day: a payment processor fires a webhook to notify your app that a charge succeeded. Your server happens to be restarting during a deployment. The webhook gets a connection refused error. Without retries, that payment event is gone. Your user paid, but your system never found out.
Transient failures are the norm, not the exception. Across any meaningful volume of webhook traffic, you'll see TCP timeouts, 502 Bad Gateway responses from load balancers during deploys, 429 Too Many Requests from rate limiting, and occasional 500 Internal Server Errors from temporary bugs. A well-designed retry strategy handles all of these gracefully.
Strategy 1: Fixed Interval Retries
The simplest approach is to retry at a fixed interval — say, every 60 seconds.
// Fixed interval: simple but problematic
const RETRY_INTERVAL_MS = 60_000;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function retryFixed(event, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await deliverWebhook(event);
    if (result.success) return result;
    // Wait before the next attempt, but not after the final one
    if (attempt < maxAttempts - 1) await sleep(RETRY_INTERVAL_MS);
  }
  return { success: false, exhausted: true };
}
This works for low-volume systems with a small number of endpoints. But it has a serious flaw: if an endpoint goes down and you have 10,000 pending webhooks, they'll all retry at exactly the same interval, creating a thundering herd that can overwhelm the endpoint the moment it comes back online. The server recovers, gets slammed with 10,000 simultaneous requests, and goes right back down.
When to use it: Internal systems with low volume where simplicity matters more than resilience.
When to avoid it: Anything production-facing or multi-tenant.
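To make the thundering herd concrete, here is a minimal Python sketch. It assumes every pending webhook failed at the same instant (t = 0) and ignores delivery latency. The helper names `retry_times_fixed` and `peak_load` are made up for this illustration.

```python
from collections import Counter
from typing import List


def retry_times_fixed(num_events: int, interval_s: int, attempt: int) -> List[int]:
    """Second (relative to the shared failure at t=0) of each event's Nth retry."""
    # Every event waits the same interval, so every event computes the same time.
    return [interval_s * attempt for _ in range(num_events)]


def peak_load(times: List[int]) -> int:
    """Largest number of requests that land in the same one-second bucket."""
    return max(Counter(times).values())


times = retry_times_fixed(num_events=10_000, interval_s=60, attempt=1)
print(peak_load(times))  # 10000: the whole backlog arrives in the same second
```

A real system has some natural spread from queueing and network latency, but rarely enough to protect an endpoint that has just recovered.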
Strategy 2: Linear Backoff
Linear backoff increases the delay by a fixed amount with each attempt: 30 seconds, 60 seconds, 90 seconds, 120 seconds, and so on.
// Linear backoff
function getLinearDelay(attempt, stepSeconds = 30) {
  return (attempt + 1) * stepSeconds * 1000;
}
This is better than fixed intervals because it spreads retries out over time, but the delays grow slowly. After 8 attempts you're only at 4 minutes between retries. If an endpoint is down for an hour, you'll have burned through all your attempts long before it recovers.
When to use it: Short-lived transient failures where you expect recovery within minutes.
When to avoid it: Systems where outages can last hours.
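The arithmetic behind that limitation is easy to check. This short Python sketch uses the 30-second step and 8 attempts from the example above; `linear_delays` is a hypothetical helper that mirrors the JavaScript function but works in seconds.

```python
from typing import List


def linear_delays(attempts: int, step_s: int = 30) -> List[int]:
    """Delay before each retry: step, 2 * step, 3 * step, ..."""
    return [(attempt + 1) * step_s for attempt in range(attempts)]


delays = linear_delays(8)
print(delays[-1])        # 240 seconds (4 minutes) before the 8th retry
print(sum(delays) / 60)  # 18.0 minutes covered in total
```

Eight linear retries cover only 18 minutes, so an hour-long outage outlasts the whole schedule.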
Strategy 3: Exponential Backoff
Exponential backoff doubles (or otherwise multiplies) the delay with each attempt. This is the foundation of every serious retry strategy.
// Basic exponential backoff
function getExponentialDelay(attempt, baseSeconds = 10, multiplier = 2) {
  return baseSeconds * Math.pow(multiplier, attempt) * 1000;
}
// Attempt 0: 10s
// Attempt 1: 20s
// Attempt 2: 40s
// Attempt 3: 80s (1.3 min)
// Attempt 4: 160s (2.7 min)
// Attempt 5: 320s (5.3 min)
// Attempt 6: 640s (10.7 min)
// Attempt 7: 1280s (21.3 min)
The key advantage is that retries spread out rapidly. Early attempts catch quick transient blips (server restart, momentary network issue). Later attempts give sustained outages time to resolve without you hammering the endpoint.
But pure exponential backoff still has the thundering herd problem. If 5,000 webhooks all fail at the same time, they'll all retry at exactly 10s, then 20s, then 40s — perfectly synchronized.
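Here is the same schedule as a small Python sketch, working in seconds. `exponential_delays` is a hypothetical helper, and its optional `cap_s` ceiling is my own assumption rather than something the basic algorithm requires.

```python
from typing import List, Optional


def exponential_delays(attempts: int, base_s: int = 10, multiplier: int = 2,
                       cap_s: Optional[int] = None) -> List[int]:
    """Delay before each retry; cap_s (an assumed, optional ceiling) bounds growth."""
    delays = [base_s * multiplier ** attempt for attempt in range(attempts)]
    if cap_s is not None:
        delays = [min(delay, cap_s) for delay in delays]
    return delays


schedule = exponential_delays(8)
print(schedule)            # [10, 20, 40, 80, 160, 320, 640, 1280]
print(sum(schedule) / 60)  # 42.5 minutes covered in total

# Two events that failed together compute identical schedules, so they stay in lockstep.
print(exponential_delays(8) == schedule)  # True
```

The delays are a pure function of the attempt number, which is exactly why simultaneous failures stay synchronized.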
Strategy 4: Exponential Backoff with Jitter (The Standard)
Adding randomness — jitter — to exponential backoff solves the thundering herd problem. Instead of all retries landing at exactly the same moment, they spread across a window of time.
There are three common jitter approaches:
Full Jitter
The delay is a random value between 0 and the exponential backoff value.
function fullJitter(attempt, baseSeconds = 10) {
  const maxDelay = baseSeconds * Math.pow(2, attempt);
  return Math.random() * maxDelay * 1000;
}
This produces the widest spread, but it can also produce delays near zero, so some retries fire almost immediately after the failure.
Equal Jitter
The delay is half the exponential value plus a random portion of the other half.
function equalJitter(attempt, baseSeconds = 10) {
  const expDelay = baseSeconds * Math.pow(2, attempt);
  const half = expDelay / 2;
  return (half + Math.random() * half) * 1000;
}
This guarantees a minimum delay of half the exponential value while still spreading retries out.
Decorrelated Jitter
Each delay is derived from the previous delay rather than the attempt number, producing good spread without tight coupling to attempt count.
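function decorrelatedJitter(previousDelayMs, baseSeconds = 10, maxSeconds = 86400) {
  // Work in seconds so the units match, and never start below the base delay
  const previousSeconds = Math.max(baseSeconds, previousDelayMs / 1000);
  const delay = Math.min(
    maxSeconds,
    baseSeconds + Math.random() * (previousSeconds * 3 - baseSeconds)
  );
  return delay * 1000; // milliseconds, ready to pass back in as previousDelayMs
}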
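For comparison, here is a rough Python sketch of all three jitter strategies side by side. Everything is in seconds, and the optional `rng` parameter is an assumption I added so that runs can be seeded and repeated. The function names are illustrative only.

```python
import random
from typing import Optional

_default_rng = random.Random()


def full_jitter(attempt: int, base_s: float = 10.0,
                rng: Optional[random.Random] = None) -> float:
    """Uniform between 0 and the exponential delay."""
    rng = _default_rng if rng is None else rng
    return rng.uniform(0, base_s * 2 ** attempt)


def equal_jitter(attempt: int, base_s: float = 10.0,
                 rng: Optional[random.Random] = None) -> float:
    """Half the exponential delay, plus a random share of the other half."""
    rng = _default_rng if rng is None else rng
    half = base_s * 2 ** attempt / 2
    return half + rng.uniform(0, half)


def decorrelated_jitter(previous_s: float, base_s: float = 10.0,
                        max_s: float = 86_400.0,
                        rng: Optional[random.Random] = None) -> float:
    """Uniform between the base and 3x the previous delay, capped at max_s."""
    rng = _default_rng if rng is None else rng
    upper = max(base_s, previous_s * 3)
    return min(max_s, rng.uniform(base_s, upper))


# Decorrelated jitter is driven by the previous delay, so feed each result back in.
rng = random.Random(7)
delay = 10.0
chain = []
for _ in range(5):
    delay = decorrelated_jitter(delay, rng=rng)
    chain.append(round(delay, 1))
print(chain)
```

Note that decorrelated jitter needs you to persist the previous delay per event, while the other two only need the attempt count.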
Which Jitter Should You Use?
For webhook delivery, equal jitter or a fixed schedule with ±20% jitter is the most practical. You want predictable retry windows that you can communicate to users ("we'll retry for up to 24 hours") while still avoiding thundering herds.
Here's the approach I recommend:
// Node.js (returns milliseconds)
const RETRY_SCHEDULE = [10, 30, 120, 600, 1800, 7200, 28800, 86400]; // seconds
function getRetryDelay(attempt) {
  const base = RETRY_SCHEDULE[Math.min(attempt, RETRY_SCHEDULE.length - 1)];
  const jitter = 0.8 + Math.random() * 0.4; // ±20%
  return Math.round(base * jitter) * 1000;
}
# Python (returns seconds, not milliseconds)
import random
RETRY_SCHEDULE = [10, 30, 120, 600, 1800, 7200, 28800, 86400]  # seconds
def get_retry_delay(attempt: int) -> int:
    base = RETRY_SCHEDULE[min(attempt, len(RETRY_SCHEDULE) - 1)]
    jitter = 0.8 + random.random() * 0.4  # ±20%
    return round(base * jitter)
This gives you a human-readable schedule (10s, 30s, 2min, 10min, 30min, 2h, 8h, 24h) with enough randomness to prevent synchronization issues.
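To see how much ±20% buys you, this sketch simulates 5,000 events that all reach their fourth retry (base 600 seconds) at the same moment. It reuses the schedule above, but passes in a seeded `random.Random` so the run is repeatable. That `rng` parameter is my addition for the demo.

```python
import random
from collections import Counter

RETRY_SCHEDULE = [10, 30, 120, 600, 1800, 7200, 28800, 86400]  # seconds


def get_retry_delay(attempt: int, rng: random.Random) -> int:
    """Scheduled delay in seconds with ±20% jitter."""
    base = RETRY_SCHEDULE[min(attempt, len(RETRY_SCHEDULE) - 1)]
    return round(base * (0.8 + rng.random() * 0.4))


rng = random.Random(42)  # seeded so the run is repeatable
delays = [get_retry_delay(3, rng) for _ in range(5_000)]
per_second = Counter(delays)

print(min(delays), max(delays))   # always inside the 480..720 second window
print(max(per_second.values()))   # peak requests landing in any single second
```

Without jitter the peak would be all 5,000 requests in one second. With it, the same load is spread across a four-minute window.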
Knowing When to Stop
An infinite retry loop is just a memory leak with extra steps. You need clear rules for when to stop retrying.
Attempt Limits
Set a maximum number of attempts. With the schedule above, 8 retries span roughly 35 hours before jitter (the delays sum to 124,960 seconds). After that, mark the event as exhausted and notify the endpoint owner.
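If you want to quote a retry window to users, it helps to compute it instead of eyeballing it. This short sketch sums the 8-step schedule from earlier and applies the ±20% jitter bounds to get the shortest and longest possible windows.

```python
RETRY_SCHEDULE = [10, 30, 120, 600, 1800, 7200, 28800, 86400]  # seconds

nominal_s = sum(RETRY_SCHEDULE)
shortest_s = sum(round(base * 0.8) for base in RETRY_SCHEDULE)  # every delay at -20%
longest_s = sum(round(base * 1.2) for base in RETRY_SCHEDULE)   # every delay at +20%

print(nominal_s)                    # 124960 seconds
print(round(nominal_s / 3600, 1))   # 34.7 hours
print(round(shortest_s / 3600, 1))  # 27.8 hours
print(round(longest_s / 3600, 1))   # 41.7 hours
```

The extremes are very unlikely in practice, since they require every retry to draw the same end of the jitter range, but they are the honest bounds for any promise you make.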
Response-Based Rules
Not all failures are retryable. Your retry logic should distinguish between transient and permanent failures:
Always retry:
- 408 Request Timeout
- 429 Too Many Requests (respect the Retry-After header if present)
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
- Connection refused / timeout / reset errors
Never retry:
- 400 Bad Request (your payload is wrong — retrying won't fix it)
- 401 Unauthorized (credentials are invalid)
- 403 Forbidden (access denied)
- 404 Not Found (endpoint doesn't exist)
- 410 Gone (endpoint was intentionally removed)
- 422 Unprocessable Content (the payload failed validation)
function isRetryable(statusCode) {
  if (!statusCode) return true; // network error, no response
  const retryable = [408, 429, 500, 502, 503, 504];
  const permanent = [400, 401, 403, 404, 410, 422];
  if (permanent.includes(statusCode)) return false;
  if (retryable.includes(statusCode)) return true;
  // Default: retry 5xx, don't retry 4xx
  return statusCode >= 500;
}
Circuit Breakers
If an endpoint has been failing consistently for a long period — say, every delivery in the last 24 hours has failed — it's reasonable to stop sending to it entirely. This is a circuit breaker pattern.
async function checkCircuitBreaker(endpointUrl) {
  const recentFailures = await db.query(
    `SELECT COUNT(*) AS failures FROM webhook_events
     WHERE endpoint_url = $1
       AND status = 'exhausted'
       AND created_at > NOW() - INTERVAL '24 hours'`,
    [endpointUrl]
  );
  // Some drivers return COUNT(*) as a string, so convert before comparing
  if (Number(recentFailures.rows[0].failures) >= 10) {
    return { open: true, reason: '10+ exhausted deliveries in 24h' };
  }
  return { open: false };
}
When the circuit is open, queue events but don't attempt delivery. Notify the endpoint owner that their webhook is disabled and needs attention. Once they confirm the endpoint is healthy, close the circuit and replay the queued events.
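The open, queue, and replay flow described above can be sketched as a small in-memory Python class. This is a simplification under stated assumptions: `EndpointCircuit` and its method names are hypothetical, state lives in memory rather than a database, and the 24-hour window from the query is left out to keep the example short.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class EndpointCircuit:
    failure_threshold: int = 10
    exhausted_count: int = 0
    is_open: bool = False
    queued: deque = field(default_factory=deque)

    def record_exhausted(self) -> None:
        """Count an exhausted delivery and open the circuit at the threshold."""
        self.exhausted_count += 1
        if self.exhausted_count >= self.failure_threshold:
            self.is_open = True

    def submit(self, event: dict) -> bool:
        """Return True if delivery should be attempted now, False if the event was parked."""
        if self.is_open:
            self.queued.append(event)
            return False
        return True

    def close_and_drain(self) -> List[dict]:
        """Call once the owner confirms health; returns parked events in arrival order."""
        self.is_open = False
        self.exhausted_count = 0
        events = list(self.queued)
        self.queued.clear()
        return events


circuit = EndpointCircuit(failure_threshold=2)
circuit.record_exhausted()
circuit.record_exhausted()
print(circuit.submit({"id": "evt_1"}))  # False: the circuit is open, event is queued
print(circuit.close_and_drain())        # [{'id': 'evt_1'}]
```

In production you would persist this state and replay the drained events gradually, so the recovery itself does not become a thundering herd.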
Respecting Rate Limits
Some endpoints return 429 Too Many Requests with a Retry-After header. Your retry logic should respect this.
async function handleResponse(response, event) {
  if (response.status === 429) {
    const retryAfter = response.headers.get('retry-after');
    // getRetryDelay returns milliseconds; convert so both branches yield seconds
    let delaySec = getRetryDelay(event.attempts) / 1000;
    if (retryAfter) {
      // Could be seconds or an HTTP date
      const parsed = /^\d+$/.test(retryAfter.trim())
        ? parseInt(retryAfter, 10)
        : (new Date(retryAfter).getTime() - Date.now()) / 1000;
      // Keep the normal backoff if the header is unparseable
      if (!Number.isNaN(parsed)) delaySec = Math.max(0, parsed);
    }
    await scheduleRetryAt(event, delaySec);
    return;
  }
  // ... handle other status codes
}
This is both polite and practical. If you ignore rate limits, the endpoint provider may block you entirely.
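A Python version of the same parsing might look like the sketch below. It handles both forms of the header, delay in seconds and HTTP date, using only the standard library. `parse_retry_after` is a hypothetical helper, and the `now` parameter exists only so the function can be tested with a fixed clock.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds, or None if the header is missing or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:                  # treat a zone-less date as UTC
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 26, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))                                           # 120.0
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))  # 120.0
print(parse_retry_after("soon"))                                          # None
```

When the function returns None, fall back to your normal backoff delay. You may also want to clamp very large values so a misbehaving endpoint cannot park an event for days.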
Putting It All Together
Here's a complete retry-aware delivery function:
import random
from datetime import datetime, timedelta, timezone
from enum import Enum

import requests

RETRY_SCHEDULE = [10, 30, 120, 600, 1800, 7200, 28800, 86400]  # seconds
PERMANENT_FAILURES = {400, 401, 403, 404, 410, 422}

# build_headers, mark_delivered, mark_exhausted, and update_event are
# application-specific helpers assumed to be defined elsewhere.


class DeliveryResult(Enum):
    DELIVERED = "delivered"
    RETRYING = "retrying"
    EXHAUSTED = "exhausted"


def deliver_with_retry(event: dict) -> DeliveryResult:
    try:
        response = requests.post(
            event["endpoint_url"],
            json=event["payload"],
            headers=build_headers(event),
            timeout=30,
        )
        if 200 <= response.status_code < 300:
            mark_delivered(event["id"])
            return DeliveryResult.DELIVERED
        if response.status_code in PERMANENT_FAILURES:
            mark_exhausted(event["id"], f"Permanent failure: {response.status_code}")
            return DeliveryResult.EXHAUSTED
        if response.status_code == 429:
            # Only the delay-seconds form of Retry-After is handled here
            retry_after = response.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)
            else:
                delay = get_retry_delay(event["attempts"])
        else:
            delay = get_retry_delay(event["attempts"])
        return schedule_next_retry(event, delay)
    except (requests.Timeout, requests.ConnectionError) as e:
        delay = get_retry_delay(event["attempts"])
        return schedule_next_retry(event, delay, error=str(e))


def get_retry_delay(attempt: int) -> int:
    base = RETRY_SCHEDULE[min(attempt, len(RETRY_SCHEDULE) - 1)]
    jitter = 0.8 + random.random() * 0.4  # ±20%
    return round(base * jitter)


def schedule_next_retry(event: dict, delay: int, error=None) -> DeliveryResult:
    next_attempt = event["attempts"] + 1
    if next_attempt >= event["max_attempts"]:
        mark_exhausted(event["id"], error or "Max attempts reached")
        return DeliveryResult.EXHAUSTED
    # Timezone-aware UTC; datetime.utcnow() is deprecated as of Python 3.12
    next_retry = datetime.now(timezone.utc) + timedelta(seconds=delay)
    update_event(event["id"], attempts=next_attempt, next_retry_at=next_retry)
    return DeliveryResult.RETRYING
Summary
The hierarchy of retry strategies from worst to best:
- No retries — unacceptable for production
- Fixed interval — simple but causes thundering herds
- Linear backoff — slightly better but burns through attempts too fast
- Exponential backoff — good spread but still synchronizes
- Exponential backoff + jitter — the industry standard
For webhook delivery specifically, use a predefined schedule with ±20% jitter, respect Retry-After headers, distinguish between retryable and permanent failures, and implement circuit breakers for consistently failing endpoints.
Or Just Let Someone Else Handle It
Retry logic is one of those things that seems simple until you're debugging why 50,000 webhooks all fired at the same second and took down a customer's endpoint. WebhookStream handles retries with exponential backoff, jitter, circuit breakers, and automatic endpoint health monitoring — so you can focus on building features instead of infrastructure.