Monitoring Webhook Failures in Production

Webhook Stream

You can build the most robust webhook delivery system in the world — exponential backoff, cryptographic signatures, idempotency — and still have no idea it's failing until a customer files a support ticket asking why they haven't received events in three days.

Webhook delivery is fire-and-forget by nature. Unlike API calls where the caller sees the error immediately, webhook failures are silent. The sender fires a POST into the void and hopes it lands. Without active monitoring, failures accumulate invisibly.

This post covers what to monitor, how to build alerting that catches problems early, and how to give your customers visibility into their own webhook health.


The Metrics That Matter

Not all metrics are equally useful. Here are the ones that actually tell you something actionable, ranked by importance.

1. Delivery Success Rate

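This is your primary health indicator. Track two ratios over a rolling window (1 hour, 24 hours): the first-attempt success rate (events delivered on the first attempt divided by total events) and the eventual delivery rate (events delivered after any number of attempts divided by total events).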

-- Global delivery success rate (last 24 hours)
SELECT
  COUNT(*) FILTER (WHERE status = 'delivered' AND attempts = 1) AS first_attempt_success,
  COUNT(*) FILTER (WHERE status = 'delivered') AS total_delivered,
  COUNT(*) AS total_events,
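  ROUND(
    100.0 * COUNT(*) FILTER (WHERE status = 'delivered' AND attempts = 1) / NULLIF(COUNT(*), 0),
    2
  ) AS first_attempt_rate_pct,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE status = 'delivered') / NULLIF(COUNT(*), 0),
    2
  ) AS delivery_rate_pct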
FROM webhook_events
WHERE created_at > NOW() - INTERVAL '24 hours';

A healthy system should see 95%+ first-attempt success and 99%+ eventual delivery (after retries). If first-attempt success drops below 90%, something systemic is likely wrong — either your delivery infrastructure has an issue or multiple endpoints are degraded simultaneously.
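
The thresholds above can be turned into a small check that runs against the counts the query returns. This is only a sketch: the function name `assessDeliveryHealth`, the status labels, and the return shape are hypothetical, and the input fields mirror the column aliases in the query.

```javascript
// Hypothetical helper: converts the query's counts into rates and compares
// them against the 90% / 95% / 99% thresholds discussed in this section.
function assessDeliveryHealth({ first_attempt_success, total_delivered, total_events }) {
  // Some drivers (node-postgres, for example) return COUNT(*) as a string,
  // so coerce before doing arithmetic.
  const first = Number(first_attempt_success);
  const delivered = Number(total_delivered);
  const total = Number(total_events);

  if (total === 0) {
    return { firstAttemptPct: null, deliveredPct: null, status: 'no_data' };
  }

  const firstAttemptPct = (100 * first) / total;
  const deliveredPct = (100 * delivered) / total;

  let status = 'healthy';
  if (firstAttemptPct < 90) {
    status = 'systemic_problem'; // below 90% first-attempt success
  } else if (firstAttemptPct < 95 || deliveredPct < 99) {
    status = 'degraded'; // below the 95% / 99% targets
  }
  return { firstAttemptPct, deliveredPct, status };
}
```

Events created near the end of the window may still be pending, so a short window can understate both rates slightly.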

2. Delivery Latency

How long between event creation and successful delivery? This is best tracked as percentiles:

  • P50 (median): Should be under 5 seconds. This tells you how fast your worker picks up and delivers events under normal conditions.
  • P95: Should be under 30 seconds. Higher values suggest your worker queue is backing up.
  • P99: Will be higher due to retries. Anything under 10 minutes is reasonable. If this is consistently in the hours range, your retry strategy might be too conservative or you have endpoints that are borderline functional.

-- Delivery latency percentiles (last 24 hours)
SELECT
  PERCENTILE_CONT(0.5) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM delivered_at - created_at)
  ) AS p50_seconds,
  PERCENTILE_CONT(0.95) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM delivered_at - created_at)
  ) AS p95_seconds,
  PERCENTILE_CONT(0.99) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM delivered_at - created_at)
  ) AS p99_seconds
FROM webhook_events
WHERE status = 'delivered'
  AND created_at > NOW() - INTERVAL '24 hours';

3. Retry Rate

What percentage of deliveries require at least one retry? Track this globally and per-endpoint.

A high global retry rate (above 15%) suggests an infrastructure problem on your end. A high per-endpoint retry rate is normal — individual endpoints have issues all the time. The actionable signal is when you can distinguish between "their endpoint is flaky" and "our delivery system is struggling."

-- Per-endpoint retry rates
SELECT
  endpoint_url,
  COUNT(*) AS total_events,
  COUNT(*) FILTER (WHERE attempts > 1) AS retried_events,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE attempts > 1) / NULLIF(COUNT(*), 0),
    2
  ) AS retry_rate_pct,
  ROUND(AVG(attempts), 1) AS avg_attempts
FROM webhook_events
WHERE created_at > NOW() - INTERVAL '24 hours'
  -- No status filter: endpoints whose events never succeed must still show up here
GROUP BY endpoint_url
ORDER BY retry_rate_pct DESC;
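
One way to separate "their endpoint is flaky" from "our delivery system is struggling" is to look at how many endpoints have an elevated retry rate at the same time. The sketch below assumes the rows from the per-endpoint query; the function name `classifyRetryPattern` and the "more than half of endpoints" cutoff are illustrative assumptions, not established thresholds.

```javascript
// Hypothetical classifier over the rows returned by the per-endpoint query.
// The 15% threshold comes from the text; the "half of endpoints" cutoff is
// an assumption chosen for illustration.
function classifyRetryPattern(rows, thresholdPct = 15) {
  if (rows.length === 0) return 'no_data';
  const elevated = rows.filter((r) => Number(r.retry_rate_pct) > thresholdPct);
  if (elevated.length === 0) return 'normal';
  // Many endpoints retrying at once points at the sender, not the receivers
  return elevated.length / rows.length > 0.5 ? 'widespread' : 'isolated';
}
```

An `isolated` result suggests notifying the affected customers, while `widespread` suggests checking your own workers and network first.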

4. Exhaustion Rate

How many webhooks exhaust all retry attempts and are never delivered? This should be very close to zero — under 0.1% in a healthy system.

Any increase in the exhaustion rate is a high-severity signal. It means events are being permanently lost (from the delivery perspective). Track this as both an absolute count and a percentage, and alert aggressively on it.
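
As a rough sketch of tracking both numbers together, the helper below takes a count of exhausted events and a total for the same window. The function name `exhaustionStats` and the severity labels are hypothetical; the 0.1% baseline comes from this section and the 1% cutoff from the Tier 1 alerting rule later in the post.

```javascript
// Hypothetical helper: reports exhausted events as both a count and a rate.
// Assumes events that ran out of retries are counted by the caller
// (for example, rows with status = 'exhausted').
function exhaustionStats(exhaustedCount, totalCount) {
  const pct = totalCount === 0 ? 0 : (100 * exhaustedCount) / totalCount;
  let severity = 'ok';
  if (pct > 1) {
    severity = 'critical'; // above 1%: the paging threshold
  } else if (pct >= 0.1) {
    severity = 'elevated'; // above the 0.1% healthy baseline
  }
  return { count: exhaustedCount, pct, severity };
}
```

Keeping the absolute count next to the percentage matters at low volume, where a single lost event can be a large percentage, and at high volume, where a tiny percentage can still be thousands of events.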

5. Queue Depth and Age

How many events are sitting in the pending/retry queue, and how old is the oldest one?

A growing queue means your workers can't keep up with delivery volume. The oldest pending event tells you the worst-case delivery delay. If the oldest event is 6 hours old and hasn't been attempted yet, you have a capacity problem.

SELECT
  COUNT(*) AS pending_events,
  MIN(created_at) AS oldest_event,
  EXTRACT(EPOCH FROM NOW() - MIN(created_at)) / 60 AS oldest_age_minutes
FROM webhook_events
WHERE status IN ('pending', 'failed')
  AND next_retry_at <= NOW();
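
A single reading of queue depth doesn't say whether the queue is growing. A minimal sketch, assuming you run the query above on a fixed schedule and keep the last few `pending_events` values (the sampling schedule, the function name `isQueueGrowing`, and the majority rule are all assumptions):

```javascript
// Hypothetical helper: decides whether queue depth is trending upward.
// `samples` holds pending_events counts, oldest first.
function isQueueGrowing(samples, minIncrease = 0) {
  if (samples.length < 2) return false; // not enough data to call a trend
  let rises = 0;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] > samples[i - 1]) rises++;
  }
  const netIncrease = samples[samples.length - 1] - samples[0];
  // Growing: most intervals rose and the net change exceeds the threshold
  return rises > (samples.length - 1) / 2 && netIncrease > minIncrease;
}
```

Requiring both conditions keeps one noisy sample from flipping the result in either direction.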

Alerting Strategy

Metrics are useless without alerts. Here's a tiered alerting approach that catches problems at different severity levels.

Tier 1: Page Someone (Critical)

These conditions mean events are being lost or severely delayed right now:

  • Exhaustion rate exceeds 1% in the last hour. Events are permanently failing.
  • Queue depth exceeds 10,000 and is growing. Workers can't keep up; delivery is backing up for everyone.
  • Oldest pending event is more than 2 hours old. The queue is severely backed up.
  • Worker processes are down. No deliveries are being attempted at all.

Tier 2: Investigate Soon (Warning)

These conditions suggest a developing problem:

  • Global first-attempt success rate drops below 90%. Something is degrading, but retries are handling it for now.
  • P95 delivery latency exceeds 5 minutes. Deliveries are getting slow.
  • A single endpoint has failed 50+ consecutive deliveries. That customer's integration is broken.
  • Retry rate exceeds 20% globally. More webhooks than usual are failing on first attempt.
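
The Tier 1 and Tier 2 rules can be expressed as one evaluation function over a metrics snapshot. This is a sketch, not a full alerting system: every field name on the `m` object is an assumption about how your metrics are collected, and routing the result to a pager or chat channel is left out.

```javascript
// Hypothetical evaluator for the Tier 1 and Tier 2 rules listed above.
// Missing fields simply don't trigger their rule.
function evaluateAlerts(m) {
  const alerts = [];
  const add = (tier, rule) => alerts.push({ tier, rule });

  // Tier 1: page someone
  if (m.exhaustionRatePct > 1) add(1, 'exhaustion_rate');
  if (m.queueDepth > 10000 && m.queueGrowing) add(1, 'queue_depth');
  if (m.oldestPendingMinutes > 120) add(1, 'oldest_pending');
  if (m.workersRunning === 0) add(1, 'workers_down');

  // Tier 2: investigate soon
  if (m.firstAttemptSuccessPct < 90) add(2, 'first_attempt_success');
  if (m.p95LatencySeconds > 300) add(2, 'p95_latency');
  if (m.maxConsecutiveEndpointFailures >= 50) add(2, 'endpoint_failures');
  if (m.retryRatePct > 20) add(2, 'retry_rate');

  return alerts;
}
```

Keeping the thresholds in one place makes it easier to review them together and to tune them as traffic patterns change.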

Tier 3: Notify the Customer

These are endpoint-specific issues the customer needs to know about:

  • 3+ consecutive failures for their endpoint. Send an email or in-app notification.
  • Retries exhausted for an event. Tell them which events weren't delivered and include error details.
  • Endpoint disabled by circuit breaker. Explain what happened and how to re-enable it.

// Approximates "consecutive failures" as the number of failed or exhausted
// events for an endpoint within the last hour.
async function checkEndpointHealth() {
  const unhealthyEndpoints = await db.query(
    `SELECT
       endpoint_url,
       user_id,
       COUNT(*) AS recent_failures,
       -- Take the most recent response. MAX() would return the largest status
       -- code and the alphabetically last body, not the latest ones.
       (ARRAY_AGG(last_response_code ORDER BY created_at DESC))[1] AS last_status,
       (ARRAY_AGG(last_response_body ORDER BY created_at DESC))[1] AS last_error
     FROM webhook_events
     WHERE created_at > NOW() - INTERVAL '1 hour'
       AND status IN ('failed', 'exhausted')
     GROUP BY endpoint_url, user_id
     HAVING COUNT(*) >= 3`
  );

  for (const endpoint of unhealthyEndpoints.rows) {
    const alreadyNotified = await wasRecentlyNotified(endpoint.user_id, endpoint.endpoint_url);
    if (!alreadyNotified) {
      await sendEndpointHealthAlert(endpoint);
    }
  }
}
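
The `wasRecentlyNotified` helper is not defined in this post. One possible shape for it is a cooldown keyed by user and endpoint, sketched below with an in-memory map. The factory name `createNotificationThrottle`, the `markNotified` method, and the injectable clock are all assumptions made for illustration.

```javascript
// Hypothetical in-memory throttle. A real deployment would likely persist
// this in the database or a shared cache so it survives restarts and is
// shared across processes.
function createNotificationThrottle(cooldownMs, now = () => Date.now()) {
  const lastSent = new Map(); // key -> timestamp of the last alert

  const keyFor = (userId, endpointUrl) => `${userId}|${endpointUrl}`;

  return {
    // True if an alert for this user/endpoint went out within the cooldown
    wasRecentlyNotified(userId, endpointUrl) {
      const sentAt = lastSent.get(keyFor(userId, endpointUrl));
      return sentAt !== undefined && now() - sentAt < cooldownMs;
    },
    // Call this after an alert has actually been sent
    markNotified(userId, endpointUrl) {
      lastSent.set(keyFor(userId, endpointUrl), now());
    },
  };
}
```

Because `await` on a plain boolean simply yields that boolean, a synchronous `wasRecentlyNotified` like this one can be dropped into the loop above without other changes.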

Building a Delivery Log

Every webhook delivery attempt should be logged with enough detail to debug failures after the fact. This is the single most valuable thing you can build for both your own operations team and your customers.

CREATE TABLE webhook_delivery_attempts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id UUID NOT NULL REFERENCES webhook_events(id),
    attempt_number INT NOT NULL,
    attempted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    response_code INT,
    response_body TEXT,          -- first 4KB only
    response_time_ms INT,
    error_type VARCHAR(50),      -- timeout, connection_refused, dns_error, etc.
    error_message TEXT,
    request_headers JSONB,
    response_headers JSONB
);

CREATE INDEX idx_delivery_attempts_event ON webhook_delivery_attempts (event_id);

For every attempt — successful or not — record the response code, a truncated response body, the response time, and any error details. This turns debugging from "it didn't work" into "attempt 3 got a 503 with body 'Service Temporarily Unavailable' after 12,340ms from their nginx proxy."

Limit the stored response body to a few KB. You don't need to store a 50MB error page, and unbounded storage will cost you.
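
A minimal sketch of that truncation in Node.js, applied before the body is written to `response_body`. The helper name `truncateBody` and the `MAX_BODY_BYTES` constant are hypothetical; the 4 KB limit matches the comment in the schema above.

```javascript
const MAX_BODY_BYTES = 4096; // matches the "first 4KB only" note in the schema

// Hypothetical helper: caps a response body at maxBytes of UTF-8.
function truncateBody(body, maxBytes = MAX_BODY_BYTES) {
  if (body == null) return null;
  const text = String(body);
  const buf = Buffer.from(text, 'utf8');
  if (buf.length <= maxBytes) return text;
  // A multi-byte character cut at the boundary decodes to U+FFFD
  return buf.subarray(0, maxBytes).toString('utf8');
}
```

Measuring in bytes rather than characters keeps the limit predictable for storage, at the cost of a possible replacement character at the very end of a truncated body.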


Customer-Facing Dashboard

If you're building a webhook delivery service, giving customers visibility into their own delivery health is a significant differentiator. Most webhook providers are a black box — events go in, and the customer has no idea what happened unless they check their own logs.

A good customer dashboard shows:

  • Recent deliveries, with status, response code, and timing for each event. This is the first thing a customer checks when debugging.
  • Event detail view, showing every delivery attempt for a specific event: what was sent (headers and payload), what came back (status, headers, body), and how long each attempt took.
  • Endpoint health overview, with success rate, average response time, and recent failures. A customer should be able to see at a glance whether their endpoint is healthy.
  • Retry schedule for failed events. If an event is pending retry, show when the next attempt will happen and how many attempts remain. This prevents customers from filing support tickets asking "where's my webhook?" when it's sitting in the retry queue.
  • Manual retry button for exhausted events. After a customer fixes their endpoint, they need a way to replay events that permanently failed. Without this, they have to ask you to do it.
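
For the retry schedule and the manual retry button, the dashboard needs a few derived values per event. The sketch below assumes the event fields used in the queries in this post (`status`, `attempts`, `next_retry_at`); the function name `describeRetryState` and the `maxAttempts` setting are hypothetical.

```javascript
// Hypothetical view-model helper for the dashboard. `maxAttempts` is an
// assumed configuration value for the retry policy.
function describeRetryState(event, maxAttempts) {
  const attemptsRemaining = Math.max(0, maxAttempts - event.attempts);
  const pendingRetry = event.status === 'failed' && attemptsRemaining > 0;
  return {
    attemptsRemaining,
    // Only show a next attempt time while the event is still in the retry queue
    nextAttemptAt: pendingRetry ? event.next_retry_at : null,
    // The manual retry button applies to events that ran out of attempts
    canManualRetry: event.status === 'exhausted',
  };
}
```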

// API endpoint for customer delivery log
app.get('/api/webhooks/deliveries', authenticate, async (req, res) => {
  const deliveries = await db.query(
    `SELECT
       e.id, e.event_type, e.status, e.attempts, e.created_at, e.delivered_at,
       e.last_response_code, e.next_retry_at,
       -- Aliased as attempt_log so it doesn't overwrite the e.attempts counter
       (SELECT json_agg(json_build_object(
         'attempt', a.attempt_number,
         'status', a.response_code,
         'time_ms', a.response_time_ms,
         'error', a.error_message,
         'at', a.attempted_at
       ) ORDER BY a.attempt_number)
       FROM webhook_delivery_attempts a WHERE a.event_id = e.id) AS attempt_log
     FROM webhook_events e
     WHERE e.user_id = $1
     ORDER BY e.created_at DESC
     LIMIT 50`,
    [req.user.id]
  );

  res.json({ deliveries: deliveries.rows });
});

Structured Logging for Debugging

Beyond database records, emit structured logs for every delivery attempt. These are invaluable when you need to debug infrastructure issues that span multiple events.

function logDeliveryAttempt(event, result) {
  const logEntry = {
    level: result.success ? 'info' : 'warn',
    message: 'webhook_delivery_attempt',
    event_id: event.id,
    endpoint_url: event.endpoint_url,
    event_type: event.event_type,
    attempt: event.attempts + 1,
    response_code: result.statusCode,
    response_time_ms: result.responseTime,
    success: result.success,
    error: result.error || null,
    will_retry: Boolean(result.retryAt),
    next_retry_at: result.retryAt || null,
  };

  // Structured JSON logging — searchable in any log aggregator
  console.log(JSON.stringify(logEntry));
}

With structured logs, you can quickly answer questions like:

  • "How many deliveries to endpoint X failed in the last hour?" → Filter by endpoint_url and success: false
  • "What's the average response time for this customer?" → Aggregate response_time_ms by endpoint_url
  • "Are timeouts increasing?" → Filter by error: 'timeout' and look at the trend
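
Most log aggregators answer these questions with their own query language. As a rough local equivalent, the sketch below parses newline-delimited JSON in the format emitted by `logDeliveryAttempt`; the helper names are hypothetical.

```javascript
// Hypothetical helper: parses newline-delimited JSON logs and keeps the
// delivery-attempt entries, skipping lines that aren't valid JSON.
function parseDeliveryLogs(text) {
  const entries = [];
  for (const line of text.split('\n')) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (entry.message === 'webhook_delivery_attempt') entries.push(entry);
    } catch {
      // Non-JSON output from other parts of the process; ignore it
    }
  }
  return entries;
}

// "How many deliveries to endpoint X failed?"
function countFailures(entries, endpointUrl) {
  return entries.filter((e) => e.endpoint_url === endpointUrl && !e.success).length;
}

// "What's the average response time for this endpoint?"
function averageResponseTime(entries, endpointUrl) {
  const times = entries
    .filter((e) => e.endpoint_url === endpointUrl && typeof e.response_time_ms === 'number')
    .map((e) => e.response_time_ms);
  if (times.length === 0) return null;
  return times.reduce((sum, t) => sum + t, 0) / times.length;
}
```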

Health Check Endpoint

Give your own monitoring system a way to verify the webhook delivery pipeline is functioning. A dedicated health check endpoint can confirm that the queue isn't backed up and report how many deliveries recently succeeded. A stalled worker shows up here indirectly, as overdue events that keep aging:

app.get('/health/webhooks', async (req, res) => {
  const stats = await db.query(`
    SELECT
      COUNT(*) FILTER (
        WHERE status IN ('pending', 'failed') AND next_retry_at <= NOW()
      ) AS overdue_events,
      COUNT(*) FILTER (
        WHERE status = 'delivered' AND created_at > NOW() - INTERVAL '5 minutes'
      ) AS recent_deliveries,
      MAX(CASE
        WHEN status IN ('pending', 'failed') AND next_retry_at <= NOW()
        THEN EXTRACT(EPOCH FROM NOW() - next_retry_at)
      END) AS max_overdue_seconds
    FROM webhook_events
    WHERE created_at > NOW() - INTERVAL '1 hour'
  `);

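  // Some drivers (node-postgres, for example) return COUNT(*) and numeric
  // values as strings, so convert once up front
  const row = stats.rows[0];
  const overdueEvents = Number(row.overdue_events);
  const recentDeliveries = Number(row.recent_deliveries);
  const maxOverdueSeconds = Number(row.max_overdue_seconds || 0);

  // Unhealthy when too many events are overdue or the oldest is over 5 minutes late
  const healthy = overdueEvents < 1000 && maxOverdueSeconds < 300;

  res.status(healthy ? 200 : 503).json({
    healthy,
    overdue_events: overdueEvents,
    recent_deliveries: recentDeliveries,
    max_overdue_seconds: Math.round(maxOverdueSeconds),
  });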
});

Point your uptime monitor (Pingdom, Better Uptime, or even a simple cron curl) at this endpoint. If it returns 503, your webhook pipeline needs attention.
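
If you'd rather script the check than rely on a hosted monitor, a small poller along these lines could run from cron. This is a sketch under assumptions: the function name `checkWebhookHealth` is hypothetical, the URL is whatever your deployment exposes, and the default relies on the global `fetch` available in Node 18 and later.

```javascript
// Hypothetical cron-style checker. The fetch implementation is injectable so
// the function can be exercised without a network.
async function checkWebhookHealth(url, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(url);
    const body = await res.json();
    return { ok: res.status === 200 && body.healthy === true, status: res.status, body };
  } catch (err) {
    // Network errors count as unhealthy: the pipeline can't be verified
    return { ok: false, status: null, body: null, error: String(err) };
  }
}
```

Treating an unreachable endpoint as unhealthy is a deliberate choice here, since a health check that can't be reached tells you nothing about whether deliveries are flowing.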


Summary

Webhook monitoring isn't optional — it's the difference between catching failures in minutes versus days. At minimum, you need:

  1. Delivery success rate tracked globally and per-endpoint
  2. Delivery latency percentiles to catch queue backlogs
  3. Exhaustion rate alerting to catch permanent failures
  4. Per-attempt logging for debugging individual failures
  5. Customer-facing visibility so users can debug delivery problems on their own

The investment in monitoring pays for itself the first time you catch a queue backup before customers notice, or the first time a customer can see "oh, my endpoint was returning 500 for 2 hours during our deploy" without filing a support ticket.


Built-In Monitoring with WebhookStream

If building a monitoring stack on top of your webhook infrastructure sounds like a lot of work, it's because it is. WebhookStream includes real-time delivery logs, per-endpoint health dashboards, automatic failure alerts, and manual retry controls — all accessible to both you and your customers through the dashboard and API.