How to Set Up Reliable Webhook Delivery — Retries, Signatures & Monitoring

Webhook Stream

Webhooks are simple in concept — an HTTP POST fires when something happens — but making them reliable in production is a different story. Endpoints go down, networks flake out, servers return 500s, and suddenly your payment notifications are vanishing into the void.

This guide walks through everything you need to build a webhook delivery system that actually works: retry logic, idempotency, signature verification, timeout handling, and monitoring. We'll cover the architecture decisions and include code examples in Node.js and Python.


Why Webhooks Fail

Before building anything, it helps to understand the failure modes you're designing against.

Receiver-side failures are the most common. The destination server might be temporarily down, deploying, rate-limiting you, or returning errors. These are usually transient — the same request will succeed if you try again in a few minutes.

Network-level failures include DNS resolution errors, TCP connection timeouts, and TLS handshake failures. These tend to be transient too, but can persist if there's a misconfiguration on the receiver's end.

Payload issues are less common but harder to debug. Malformed JSON, payloads that exceed the receiver's body size limit, or content-type mismatches can cause consistent failures that retries won't fix.

Your own infrastructure can also be the bottleneck. If you're sending webhooks synchronously during an API request, a slow receiver can tie up your application threads and degrade performance for everyone.

The takeaway: you need to assume every webhook delivery can fail, and design your system to handle it gracefully.
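One way to make these failure modes concrete is to classify each delivery outcome before deciding what to do next. The sketch below is only an illustration: the function name `classify_failure` and the choice of which status codes count as permanent are assumptions made for this example.

```python
from typing import Optional

# Assumed grouping for illustration: responses that a retry is unlikely to fix.
# 400 = malformed payload, 410 = endpoint gone, 413 = payload too large,
# 415 = content-type mismatch.
PERMANENT_STATUSES = {400, 410, 413, 415}


def classify_failure(status_code: Optional[int], network_error: bool = False) -> str:
    """Return 'ok', 'transient', or 'permanent' for one delivery attempt."""
    if network_error or status_code is None:
        # DNS, TCP, and TLS failures usually clear up on their own.
        return "transient"
    if 200 <= status_code < 300:
        return "ok"
    if status_code in PERMANENT_STATUSES:
        return "permanent"
    # 5xx, 429, and anything unrecognized: assume the problem may be temporary.
    return "transient"


print(classify_failure(503))                      # receiver-side outage
print(classify_failure(413))                      # payload issue
print(classify_failure(None, network_error=True)) # network-level failure
```

The delivery code later in this guide takes a simpler route and retries every non-2xx response except 410. A finer-grained classification like this one mainly saves wasted retries on payload problems.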


Architecture: Decouple Event Creation from Delivery

The single most important architectural decision is to never send webhooks inline with your application logic. If a user creates an order and you try to POST a webhook to their endpoint before returning the API response, you've coupled your application's reliability to theirs.

Instead, use a queue-based architecture:

[Your App] → writes event to database/queue
                ↓
[Worker Process] → reads events → delivers webhooks
                ↓
[Retry Queue] → failed deliveries get retried on a schedule

When an event occurs in your application, write a record to a persistent store (database table or message queue) and return immediately. A separate worker process picks up pending events and attempts delivery. Failed attempts get scheduled for retry.

Here's a basic schema for tracking webhook deliveries:

CREATE TABLE webhook_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    endpoint_url TEXT NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    payload JSONB NOT NULL,
    status VARCHAR(20) DEFAULT 'pending',  -- pending, delivered, failed, exhausted
    attempts INT DEFAULT 0,
    max_attempts INT DEFAULT 8,
    next_retry_at TIMESTAMPTZ DEFAULT NOW(),  -- new events are due immediately
    last_response_code INT,
    last_response_body TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    delivered_at TIMESTAMPTZ
);

CREATE INDEX idx_webhook_pending ON webhook_events (next_retry_at)
    WHERE status IN ('pending', 'failed');

This gives you a complete audit trail of every delivery attempt, which becomes invaluable when debugging.
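The write path described above (record the event, return immediately) can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the persistent store, a cut-down version of the table above, and hypothetical names (`create_store`, `enqueue_event`, `handle_create_order`, and the example.com URL).

```python
import json
import sqlite3
import uuid


def create_store() -> sqlite3.Connection:
    """In-memory stand-in for the persistent store (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE webhook_events (
               id TEXT PRIMARY KEY,
               endpoint_url TEXT NOT NULL,
               event_type TEXT NOT NULL,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return conn


def enqueue_event(conn: sqlite3.Connection, endpoint_url: str,
                  event_type: str, payload: dict) -> str:
    """Record the event and return at once; no HTTP request is made here."""
    event_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO webhook_events (id, endpoint_url, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (event_id, endpoint_url, event_type, json.dumps(payload)),
        )
    return event_id


def handle_create_order(conn: sqlite3.Connection, order: dict) -> dict:
    """Hypothetical API handler: the response never waits on the receiver."""
    # ... application logic that creates the order would run here ...
    event_id = enqueue_event(conn, "https://example.com/hooks", "order.created", order)
    return {"order": order, "webhook_event_id": event_id}
```

Because the handler only performs a local insert, a slow or unavailable receiver cannot delay the API response. Delivery happens later in the worker.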


Retry Strategy: Exponential Backoff with Jitter

When a delivery fails, you should retry — but not immediately and not at fixed intervals. A naive approach like "retry every 60 seconds" will hammer a recovering server and can look like a DDoS attack.

Exponential backoff increases the delay between each retry attempt. Jitter adds randomness so that if thousands of webhooks fail simultaneously (because the receiver went down), the retries don't all hit at the exact same moment.

A common schedule looks like this:

Retry   Base delay    With ±20% jitter
1       10 seconds    8–12 seconds
2       30 seconds    24–36 seconds
3       2 minutes     1.6–2.4 minutes
4       10 minutes    8–12 minutes
5       30 minutes    24–36 minutes
6       2 hours       1.6–2.4 hours
7       8 hours       6.4–9.6 hours
8       24 hours      19.2–28.8 hours

After all attempts are exhausted, mark the event as exhausted and alert the webhook owner. The total retry window here spans roughly 35 hours, which gives transient outages plenty of time to resolve.
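The 35-hour figure is just the sum of the base delays. The short calculation below reproduces it and shows how far jitter can move the total, assuming the ±20% jitter used in the code in this guide. The helper name `retry_window_hours` is made up for this example.

```python
BASE_DELAYS = [10, 30, 120, 600, 1800, 7200, 28800, 86400]  # seconds, from the schedule
JITTER = 0.2  # assumed: each delay varies by up to +/-20%


def retry_window_hours(delays=BASE_DELAYS, jitter=JITTER):
    """Return (shortest, nominal, longest) total retry window in hours."""
    total = sum(delays)  # 124,960 seconds
    return tuple(round(total * factor / 3600, 1)
                 for factor in (1 - jitter, 1.0, 1 + jitter))


low, nominal, high = retry_window_hours()
print(f"nominal window: {nominal} h")            # 34.7 h
print(f"with jitter: between {low} and {high} h")
```

The extremes assume every delay lands at the same edge of its jitter range, which is unlikely in practice. Real windows cluster near the nominal value.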

Node.js Implementation

// db, markDelivered, markExhausted and generateSignature are assumed to be
// defined elsewhere in your application.

// Retry schedule in seconds; retry N uses baseDelays[N - 1].
const baseDelays = [10, 30, 120, 600, 1800, 7200, 28800, 86400];

function getRetryDelay(retryNumber) {
  // retryNumber is 1-based, so the first retry waits about 10 seconds
  const index = Math.min(Math.max(retryNumber, 1), baseDelays.length) - 1;
  const base = baseDelays[index];
  return Math.round(base * (0.8 + Math.random() * 0.4)); // ±20% jitter
}

async function deliverWebhook(event) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30000); // 30s timeout

  // Compute the timestamp once: generateSignature must sign the same value
  // that is sent in the X-Webhook-Timestamp header.
  const timestamp = Math.floor(Date.now() / 1000).toString();

  try {
    // event.secret is the endpoint's signing secret. It is not a column in
    // webhook_events, so load it together with the event.
    const response = await fetch(event.endpoint_url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Webhook-ID': event.id,
        'X-Webhook-Timestamp': timestamp,
        'X-Webhook-Signature': generateSignature(event.payload, event.secret, timestamp),
      },
      body: JSON.stringify(event.payload),
      redirect: 'manual', // never follow redirects; a redirect counts as a failed delivery
      signal: controller.signal,
    });

    if (response.status >= 200 && response.status < 300) {
      await markDelivered(event.id);
      return { success: true };
    }

    // Treat 410 Gone as "unsubscribe": don't retry
    if (response.status === 410) {
      await markExhausted(event.id, 'Endpoint returned 410 Gone');
      return { success: false, permanent: true };
    }

    // Any other non-2xx: schedule a retry
    return await scheduleRetry(event, response.status);
  } catch (err) {
    if (err.name === 'AbortError') {
      return await scheduleRetry(event, null, 'Request timed out after 30s');
    }
    return await scheduleRetry(event, null, err.message);
  } finally {
    clearTimeout(timeout);
  }
}

async function scheduleRetry(event, statusCode, errorMessage = null) {
  const nextAttempt = event.attempts + 1; // failed deliveries so far, including this one

  // max_attempts is treated as the number of retries after the first delivery,
  // so all eight delays in the schedule are used before giving up.
  if (nextAttempt > event.max_attempts) {
    await markExhausted(event.id, `Failed after ${nextAttempt} delivery attempts`);
    return { success: false, permanent: true };
  }

  const delaySec = getRetryDelay(nextAttempt);
  const nextRetryAt = new Date(Date.now() + delaySec * 1000);

  // last_response_body stores the error message here; no response body is captured.
  await db.query(
    `UPDATE webhook_events
     SET attempts = $1, status = 'failed', next_retry_at = $2,
         last_response_code = $3, last_response_body = $4
     WHERE id = $5`,
    [nextAttempt, nextRetryAt, statusCode, errorMessage, event.id]
  );

  return { success: false, retryAt: nextRetryAt };
}

Python Implementation

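import hashlib
import hmac
import json
import random
import time
from datetime import datetime, timedelta, timezone

import requests

# db, mark_delivered and mark_exhausted are assumed application helpers.

BASE_DELAYS = [10, 30, 120, 600, 1800, 7200, 28800, 86400]  # seconds

def get_retry_delay(retry_number: int) -> int:
    # retry_number is 1-based, so the first retry waits about 10 seconds
    index = min(max(retry_number, 1), len(BASE_DELAYS)) - 1
    base = BASE_DELAYS[index]
    return round(base * (0.8 + random.random() * 0.4))  # ±20% jitter

def deliver_webhook(event: dict) -> dict:
    # Serialize once and sign exactly the bytes that are sent, using the same
    # timestamp that goes in the header (the scheme generate_signature uses).
    # event["secret"] is the endpoint's signing secret, loaded with the event.
    timestamp = str(int(time.time()))
    body = json.dumps(event["payload"], separators=(",", ":"))
    digest = hmac.new(
        event["secret"].encode(), f"{timestamp}.{body}".encode(), hashlib.sha256
    ).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-ID": str(event["id"]),
        "X-Webhook-Timestamp": timestamp,
        "X-Webhook-Signature": f"v1={digest}",
    }

    try:
        response = requests.post(
            event["endpoint_url"],
            data=body.encode(),
            headers=headers,
            timeout=(10, 30),       # (connect, read) timeouts in seconds
            allow_redirects=False,  # never follow redirects
        )

        if 200 <= response.status_code < 300:
            mark_delivered(event["id"])
            return {"success": True}

        # Treat 410 Gone as "unsubscribe": don't retry
        if response.status_code == 410:
            mark_exhausted(event["id"], "Endpoint returned 410 Gone")
            return {"success": False, "permanent": True}

        return schedule_retry(event, status_code=response.status_code)

    except requests.Timeout:
        return schedule_retry(event, error="Request timed out")
    except requests.RequestException as e:
        # Covers connection errors and any other request-level failure
        return schedule_retry(event, error=str(e))

def schedule_retry(event: dict, status_code=None, error=None) -> dict:
    next_attempt = event["attempts"] + 1  # failed deliveries so far, including this one

    # max_attempts is treated as the number of retries after the first delivery,
    # so all eight delays in the schedule are used before giving up.
    if next_attempt > event["max_attempts"]:
        mark_exhausted(event["id"], f"Failed after {next_attempt} delivery attempts")
        return {"success": False, "permanent": True}

    delay = get_retry_delay(next_attempt)
    next_retry_at = datetime.now(timezone.utc) + timedelta(seconds=delay)

    # last_response_body stores the error message here; no response body is captured.
    db.execute(
        """UPDATE webhook_events
           SET attempts = %s, status = 'failed', next_retry_at = %s,
               last_response_code = %s, last_response_body = %s
           WHERE id = %s""",
        (next_attempt, next_retry_at, status_code, error, event["id"]),
    )

    return {"success": False, "retry_at": next_retry_at}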

Signing Webhooks: Proving Authenticity

Every webhook you send should include a cryptographic signature so the receiver can verify it actually came from you. Without this, anyone who discovers the endpoint URL can send fake events.

The standard approach is HMAC-SHA256. You and the receiver share a secret key. You compute a hash of the payload using that key and include it in a header. The receiver computes the same hash and compares.

Generating the Signature (Sender Side)

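const crypto = require('crypto');

// Pass the same timestamp that is sent in the X-Webhook-Timestamp header.
// It defaults to the current time only for convenience.
function generateSignature(payload, secret, timestamp = Math.floor(Date.now() / 1000).toString()) {
  const body = JSON.stringify(payload); // must match the request body byte for byte
  const signedContent = `${timestamp}.${body}`;
  const signature = crypto
    .createHmac('sha256', secret)
    .update(signedContent)
    .digest('hex');
  return `v1=${signature}`;
}

# Python version of the same function
import hashlib
import hmac
import json
import time
from typing import Optional

def generate_signature(payload: dict, secret: str, timestamp: Optional[str] = None) -> str:
    # Pass the timestamp sent in X-Webhook-Timestamp so both values agree.
    if timestamp is None:
        timestamp = str(int(time.time()))
    # The request body must be this exact compact serialization.
    body = json.dumps(payload, separators=(",", ":"))
    signed_content = f"{timestamp}.{body}"
    signature = hmac.new(
        secret.encode(), signed_content.encode(), hashlib.sha256
    ).hexdigest()
    return f"v1={signature}"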

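Including the timestamp in the signed content prevents replay attacks. The receiver should reject any webhook where the timestamp is more than 5 minutes old. For verification to succeed, the signed timestamp must be the exact value sent in the X-Webhook-Timestamp header, and the signed body must be the exact bytes sent in the request, so compute each of them once per delivery.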

Verifying the Signature (Receiver Side)

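const crypto = require('crypto');

// Assumes req.rawBody holds the unparsed request body (for example, captured
// with the `verify` option of express.json()). Re-serializing req.body can
// produce different bytes than the ones the sender signed.
function verifyWebhook(req, secret) {
  const timestamp = req.headers['x-webhook-timestamp'];
  const signature = req.headers['x-webhook-signature'];
  if (!timestamp || !signature) {
    return false;
  }

  // Reject if the timestamp is invalid or more than 5 minutes from now
  const age = Math.floor(Date.now() / 1000) - Number.parseInt(timestamp, 10);
  if (Number.isNaN(age) || Math.abs(age) > 300) {
    return false;
  }

  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.`)
    .update(req.rawBody)
    .digest('hex');

  const received = Buffer.from(signature.replace(/^v1=/, ''));
  const expectedBuf = Buffer.from(expected);

  // timingSafeEqual throws if the lengths differ, so check them first
  return received.length === expectedBuf.length &&
    crypto.timingSafeEqual(received, expectedBuf);
}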

Always use a constant-time comparison function (timingSafeEqual in Node, hmac.compare_digest in Python) to prevent timing attacks.
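For receivers written in Python, the same check might look like the sketch below. It assumes the receiver has access to the raw request body as bytes and that the sender signs `<timestamp>.<body>` with the `v1=` prefix shown above. The function name `verify_webhook` and the `now` parameter (there only to make the function testable) are choices made for this example.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_AGE_SECONDS = 300  # the 5-minute window described above


def verify_webhook(raw_body: bytes, timestamp: str, signature: str,
                   secret: str, now: Optional[float] = None) -> bool:
    """Return True only if the timestamp is fresh and the signature matches."""
    if not timestamp or not signature:
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False  # a non-numeric timestamp is rejected outright

    current = time.time() if now is None else now
    if abs(current - sent_at) > MAX_AGE_SECONDS:
        return False

    # Recompute the HMAC over the exact bytes received, not a re-serialized copy.
    signed_content = timestamp.encode() + b"." + raw_body
    expected = "v1=" + hmac.new(
        secret.encode(), signed_content, hashlib.sha256
    ).hexdigest()

    # compare_digest runs in constant time with respect to the contents.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verifying against the raw bytes avoids false rejections when the receiver's JSON library formats the payload differently from the sender's.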


Idempotency: Handling Duplicate Deliveries

Network issues can cause a webhook to be delivered more than once. Your request might succeed, but the response gets lost, so your system retries and the receiver processes it twice.

The fix is to include a unique event ID (the X-Webhook-ID header from our examples above) and have receivers track which IDs they've already processed.

On the sender side, this is straightforward — just make sure every event has a stable unique ID that doesn't change across retries. The webhook_events.id column from our schema works perfectly.

On the receiver side, they should:

  1. Check if they've seen this X-Webhook-ID before
  2. If yes, return 200 without reprocessing
  3. If no, process the event and store the ID

This is the receiver's responsibility to implement, but you should document it clearly and make idempotency keys easy to find in your webhook headers.
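When documenting this for receivers, a small reference implementation can help. The sketch below follows the three steps above, using an in-memory SQLite table as a stand-in for durable storage. The names `open_dedupe_store`, `handle_webhook`, and `processed_webhooks` are hypothetical, and `process` stands for the receiver's own business logic.

```python
import sqlite3
from typing import Callable


def open_dedupe_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")  # a real receiver would use durable storage
    conn.execute("CREATE TABLE processed_webhooks (webhook_id TEXT PRIMARY KEY)")
    return conn


def handle_webhook(conn: sqlite3.Connection, webhook_id: str, payload: dict,
                   process: Callable[[dict], None]) -> int:
    """Return the HTTP status code the receiver should respond with."""
    try:
        # Steps 1 and 3 combined: the primary key rejects IDs seen before.
        conn.execute(
            "INSERT INTO processed_webhooks (webhook_id) VALUES (?)", (webhook_id,)
        )
    except sqlite3.IntegrityError:
        conn.rollback()
        return 200  # step 2: duplicate, acknowledge without reprocessing

    try:
        process(payload)
    except Exception:
        conn.rollback()  # forget the ID so the sender's retry is processed
        return 500

    conn.commit()
    return 200
```

Recording the ID and running the handler in one transaction means a failed handler does not mark the event as seen, so the sender's next retry still gets processed.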


Timeout Handling

Set a hard timeout on every webhook delivery. 30 seconds is a reasonable default. If the receiver's endpoint takes longer than that, abort the request and treat it as a failure.

Why 30 seconds and not longer? Because your delivery worker has limited concurrency. A handful of slow endpoints can tie up all your workers and delay delivery for everyone else. A receiver that needs more than 30 seconds to acknowledge a webhook should be returning 202 Accepted immediately and processing asynchronously on their end.
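The "acknowledge first, process later" pattern on the receiver side can be sketched with a queue and a background thread. Everything here is illustrative: `receive_webhook`, `slow_handler`, and the in-memory queue are assumptions for the example, and a production receiver would normally use a durable queue so that accepted events survive a restart.

```python
import queue
import threading

jobs: "queue.Queue[dict]" = queue.Queue()
processed = []


def slow_handler(payload: dict) -> None:
    """Stands in for work that may take longer than the sender's timeout."""
    processed.append(payload["id"])


def worker() -> None:
    while True:
        payload = jobs.get()
        try:
            slow_handler(payload)
        finally:
            jobs.task_done()


def receive_webhook(payload: dict) -> int:
    """Called by the HTTP layer: enqueue the event and acknowledge right away."""
    jobs.put(payload)
    return 202  # Accepted: processing continues in the background


threading.Thread(target=worker, daemon=True).start()
```

The HTTP response is sent as soon as the payload is queued, so the sender's 30-second timeout only has to cover the enqueue step.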

Some additional timeout considerations:

  • Connection timeout should be shorter than the overall timeout — 10 seconds is enough to establish a TCP connection. If DNS resolution or the TLS handshake takes longer, something is wrong.
  • Don't follow redirects. Webhook endpoints should return a direct response. Following redirects opens you up to SSRF (server-side request forgery) attacks where a malicious user points their webhook URL at your internal infrastructure.
  • Limit response body reading. You don't need to download a 50MB response body. Read the status code and the first few KB for logging, then close the connection.
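Two of these points, refusing redirects and capping how much of the response is read, can be sketched with Python's standard library. The names `NoRedirect`, `read_limited`, and `post_once`, and the 4 KB cap, are assumptions for this example. Note that `urllib` applies a single socket timeout to connecting and to each read, so it cannot express a separate, shorter connection timeout. An HTTP client that accepts distinct connect and read timeouts is a better fit for the first point.

```python
import urllib.error
import urllib.request

MAX_LOG_BYTES = 4096  # assumed cap on how much response body to keep for logging


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Declining to build a follow-up request makes urllib raise HTTPError
        # for the 3xx response instead of following it.
        return None


opener = urllib.request.build_opener(NoRedirect)


def read_limited(stream, limit: int = MAX_LOG_BYTES) -> bytes:
    """Read at most `limit` bytes and leave the rest of the body unread."""
    return stream.read(limit)


def post_once(url: str, body: bytes, timeout: float = 30.0):
    """POST `body` and return (status, first bytes of the response body)."""
    request = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, read_limited(response)
    except urllib.error.HTTPError as err:
        # Non-2xx responses, including redirects we refused to follow.
        try:
            return err.code, read_limited(err)
        finally:
            err.close()
```

Timeouts and connection failures surface as `urllib.error.URLError` or `TimeoutError`, which the caller would treat as a failed attempt and schedule for retry.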

Monitoring and Alerting

Reliable delivery means nothing if you can't prove it. You need visibility into what's happening.

Metrics to Track

Delivery success rate is your primary metric. Track it per-endpoint and globally. A healthy system should be above 99% on first-attempt delivery. If it drops, something systemic is happening.

Delivery latency — how long between event creation and successful delivery. P50 should be under a few seconds. P99 will be higher due to retries, but if it's consistently in the hours range, you have reliability issues.

Retry rate — what percentage of webhooks require at least one retry. A high retry rate for a specific endpoint usually means that endpoint has issues. A high global retry rate means your delivery infrastructure might be the problem.

Exhaustion rate — what percentage of webhooks exhaust all retry attempts and are never delivered. This should be very close to zero. If it's not, your retry policy might need to be more aggressive, or you need to alert endpoint owners sooner.
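As a rough illustration, all four metrics can be computed from per-event delivery records. The `Delivery` record shape, the `summarize` function, and the nearest-rank percentile method are assumptions made for this sketch. In practice these numbers would more likely come from SQL queries over the events table or from a metrics system.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Delivery:
    attempts: int                     # total delivery attempts made
    status: str                       # 'delivered' or 'exhausted'
    latency_seconds: Optional[float]  # creation to successful delivery, None if never delivered


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; assumes `values` is non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(deliveries: List[Delivery]) -> dict:
    """Compute the four metrics; assumes at least one delivered event."""
    total = len(deliveries)
    delivered = [d for d in deliveries if d.status == "delivered"]
    latencies = [d.latency_seconds for d in delivered if d.latency_seconds is not None]
    return {
        "first_attempt_success_rate": sum(1 for d in delivered if d.attempts == 1) / total,
        "retry_rate": sum(1 for d in deliveries if d.attempts > 1) / total,
        "exhaustion_rate": sum(1 for d in deliveries if d.status == "exhausted") / total,
        "p50_latency_seconds": percentile(latencies, 50),
        "p99_latency_seconds": percentile(latencies, 99),
    }
```

Running the same summary grouped by endpoint URL gives the per-endpoint view, which helps separate a single bad receiver from a problem in your own delivery infrastructure.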

Alerting Endpoint Owners

Don't wait until all retries are exhausted to notify the endpoint owner. A good notification strategy:

  • After 3 consecutive failures: send an email or in-app notification saying "We're having trouble delivering to your endpoint"
  • After retries are exhausted: send a clear alert with the event IDs that failed and the error details
  • If an endpoint has been failing for 24+ hours: consider disabling it and requiring the owner to manually re-enable after fixing their issues
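The escalation steps above can be expressed as a small decision function. This is a sketch under assumptions: the function name `next_action`, the action labels it returns, and the ordering of the checks (disable takes priority over the other notifications) are choices made for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARN_AFTER_FAILURES = 3
DISABLE_AFTER = timedelta(hours=24)


def next_action(consecutive_failures: int,
                failing_since: Optional[datetime],
                retries_exhausted: bool,
                now: Optional[datetime] = None) -> str:
    """Decide what to tell the endpoint owner after a failed delivery."""
    now = now or datetime.now(timezone.utc)

    if failing_since is not None and now - failing_since >= DISABLE_AFTER:
        return "disable_endpoint"   # failing for 24+ hours
    if retries_exhausted:
        return "alert_exhausted"    # include event IDs and error details
    if consecutive_failures == WARN_AFTER_FAILURES:
        # Exactly at the threshold, so the owner is warned once per outage.
        return "warn_owner"
    return "none"
```

The caller would reset `consecutive_failures` and `failing_since` after the next successful delivery, so a later outage triggers a fresh warning.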

The Worker Loop

Tying it all together, your delivery worker is a simple polling loop:

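const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processWebhookQueue() {
  while (true) {
    // Claim a batch in a single statement. Row locks from FOR UPDATE SKIP LOCKED
    // last only until the statement's transaction ends, so the UPDATE also pushes
    // next_retry_at forward as a short lease that keeps other workers away.
    const events = await db.query(
      `UPDATE webhook_events
       SET next_retry_at = NOW() + INTERVAL '5 minutes'
       WHERE id IN (
         SELECT id FROM webhook_events
         WHERE status IN ('pending', 'failed')
           AND (next_retry_at IS NULL OR next_retry_at <= NOW())
         ORDER BY next_retry_at ASC
         LIMIT 50
         FOR UPDATE SKIP LOCKED
       )
       RETURNING *`,
      []
    );

    if (events.rows.length === 0) {
      await sleep(1000); // nothing to do, wait 1s
      continue;
    }

    // Deliver the batch concurrently; the LIMIT above caps concurrency at 50.
    // allSettled keeps one rejected delivery from stopping the loop.
    await Promise.allSettled(
      events.rows.map(event => deliverWebhook(event))
    );
  }
}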

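The FOR UPDATE SKIP LOCKED clause matters if you run multiple worker instances: a worker skips rows that another worker has already locked instead of waiting on them. The locks last only until the surrounding transaction ends, so the rows must be claimed within that same transaction (or the transaction held open during delivery) to make sure two workers never pick up the same event.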

For higher throughput, replace the database polling loop with a dedicated message queue (Redis, RabbitMQ, SQS) so that workers consume events from the broker instead of repeatedly querying the events table.


Quick Checklist

Before going to production, make sure you've covered:

  • Events are queued asynchronously, not sent inline with API requests
  • Retry logic uses exponential backoff with jitter
  • Every webhook includes a cryptographic signature
  • Signatures include a timestamp to prevent replay attacks
  • Every webhook includes a unique event ID for idempotency
  • Delivery requests have a 30-second timeout
  • Redirects are not followed
  • Failed deliveries are logged with status codes and response bodies
  • Endpoint owners are notified of persistent failures
  • You have metrics on success rate, latency, retry rate, and exhaustion rate

Skip the Infrastructure, Ship Your Product

Building all of this yourself is doable, but it's a significant amount of infrastructure to build, maintain, and monitor. You'll spend weeks on retry logic, queue management, signature schemes, and monitoring — time you could spend on your actual product.

WebhookStream handles all of this out of the box: automatic retries with exponential backoff, cryptographic signatures, delivery monitoring, and real-time logs. You get an API to send events and a dashboard for your customers to manage their endpoints.

If you'd rather build it yourself, the code in this guide gives you a solid foundation. Either way, don't ship webhooks without retry logic and signatures — your users are counting on those events arriving.