Mykyta Yaremenko

Building Resilient Real-Time Payment Systems in Next.js

How we architected a fault-tolerant payment notification system using WebSockets, singleton patterns, and intelligent fallback mechanisms.

The Problem Space

Modern payment flows demand instant feedback. Users click "Pay" and expect immediate confirmation - not a loading spinner that spins into eternity. But here's the reality: WebSocket connections fail. Networks hiccup. Tokens expire mid-transaction. And when you're processing real money, "it usually works" isn't good enough.

At scale, we discovered that a certain percentage of WebSocket connections fail silently. For a platform processing thousands of transactions daily, that's hundreds of frustrated users watching infinite spinners. We needed a system that:

  1. Maintains persistent connections across route transitions
  2. Handles token rotation without dropping messages
  3. Falls back gracefully when real-time channels fail
  4. Avoids duplicate notifications to prevent double-processing

This article dissects our solution - a pattern we call "Optimistic Real-Time with Pessimistic Fallback."

Architecture Overview

[Figure: Payment Flow Architecture]

The Singleton Connection Pattern

Our first insight: WebSocket connections should outlive React components. Standard hooks create connections on mount and destroy them on unmount. In an SPA with frequent route changes, this creates connection churn - and lost messages during the critical milliseconds of reconnection.

Code
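// Assumes centrifuge-js v3+, which exposes these as named exports
import { Centrifuge, Subscription } from 'centrifuge';

// Global singleton - lives outside React's lifecycle
let globalCentrifuge: Centrifuge | null = null;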
let globalSubscriptions: Map<string, Subscription> | null = null;
let globalCentrifugeToken: string | null = null;

// Callback registry - allows multiple components to listen
let subscriptionCallbacks: Map<
  string,
  Set<{
    onError?: (error?: any, orderUuid?: string, status?: string) => void;
    onSuccess?: (orderUuid: string) => void;
  }>
> | null = null;

This isn't "just" a global variable - it's a connection pool of one with attached callback registries. The pattern enables:

  • Connection Reuse: Navigate from /checkout to /processing - same connection
  • Token Rotation: Detect token changes, tear down gracefully, reconnect
  • Multi-Component Subscription: Multiple UI elements can react to the same payment event

Why Not React Context? Every consumer of a context re-renders whenever the context value changes. For a high-frequency WebSocket emitting order status updates, that means cascading re-renders. The singleton pattern decouples connection state from React's render cycle entirely.
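
A minimal sketch of how a module-level registry can fan one payment event out to several listeners without touching React state. The helper names `registerCallbacks` and `notifySuccess` are illustrative assumptions, not the helpers used in our codebase.

```typescript
type PaymentCallbacks = {
  onError?: (error?: unknown, orderUuid?: string, status?: string) => void;
  onSuccess?: (orderUuid: string) => void;
};

// Module-level registry: channel name -> callbacks of every interested component.
const registry = new Map<string, Set<PaymentCallbacks>>();

// Hypothetical helper: adds one component's callbacks to a channel.
function registerCallbacks(channel: string, callbacks: PaymentCallbacks): void {
  let listeners = registry.get(channel);
  if (!listeners) {
    listeners = new Set<PaymentCallbacks>();
    registry.set(channel, listeners);
  }
  listeners.add(callbacks);
}

// Hypothetical helper: notifies every listener on a channel and returns how
// many were registered. No React state is involved, so nothing re-renders
// unless a callback decides to set state itself.
function notifySuccess(channel: string, orderUuid: string): number {
  const listeners = registry.get(channel);
  if (!listeners) return 0;
  // Copy first so a callback that unregisters itself cannot disturb iteration.
  const snapshot = Array.from(listeners);
  snapshot.forEach(cb => cb.onSuccess?.(orderUuid));
  return snapshot.length;
}
```

Each component decides for itself whether an event should become a state update, which keeps render frequency under the component's control.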

Defensive Token Management

Tokens expire. Users leave tabs open overnight. Sessions rotate. We built a multi-layer validation strategy:

Code
const validateUserToken = useCallback(
  (userToken: string) => {
    if (!userId) {
      return validateToken(userToken);  // Basic JWT validation
    }
    return validateCentrifugeToken(userToken, userId);  // User-scoped validation
  },
  [userId]
);
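
The bodies of `validateToken` and `validateCentrifugeToken` are not shown here. As a rough sketch of what such client-side checks commonly involve, the example below decodes a JWT payload, rejects expired tokens, and optionally compares the standard `sub` claim against the current user. The function names, the `options` shape, and the 30-second skew are assumptions for illustration; the signature is deliberately not verified, since that is the server's job.

```typescript
// Hypothetical helper: reads the JWT payload WITHOUT verifying the signature.
function decodeJwtPayload(token: string): Record<string, unknown> | null {
  const parts = token.split('.');
  if (parts.length !== 3) return null;
  try {
    // JWT segments are base64url; convert to plain base64 and restore padding.
    const base64 = (parts[1] ?? '').replace(/-/g, '+').replace(/_/g, '/');
    const padded = base64 + '='.repeat((4 - (base64.length % 4)) % 4);
    const parsed: unknown = JSON.parse(atob(padded));
    return typeof parsed === 'object' && parsed !== null
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null; // malformed base64 or JSON
  }
}

// Hypothetical check: the token is unexpired and, when a user ID is known,
// scoped to that user.
function isTokenUsable(
  token: string,
  options: { userId?: string; nowMs?: number; skewMs?: number } = {},
): boolean {
  const { userId, nowMs = Date.now(), skewMs = 30_000 } = options;
  const payload = decodeJwtPayload(token);
  if (!payload) return false;
  const exp = payload['exp'];
  if (typeof exp !== 'number') return false; // assumption: tokens always carry exp
  // Treat tokens that expire within the skew window as already expired.
  if (exp * 1000 <= nowMs + skewMs) return false;
  return userId === undefined || payload['sub'] === userId;
}
```

Rejecting tokens slightly before they expire avoids opening a connection that the server would drop moments later.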

When a token mismatch is detected on a reconnection attempt:

Code
if (
  globalCentrifuge &&
  globalCentrifugeToken &&
  globalCentrifugeToken !== token
) {
  // Nuclear option: full teardown
  globalSubscriptions?.forEach(subscription => {
    subscription.unsubscribe();
    subscription.removeAllListeners();
  });

  globalCentrifuge.disconnect();

  // Reset state completely
  globalCentrifuge = null;
  globalSubscriptions?.clear();
  globalCentrifugeToken = null;

  // This triggers fresh connection with new token
}

⚠️ Critical Insight: Partial cleanup causes ghost listeners. If you replace a connection without removing old subscriptions, you get duplicate event handling - double charges, double analytics, confused users.
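
The teardown-then-reconnect rule can be sketched as a single function that either reuses the current client or replaces it completely. The `ClientLike` and `SubscriptionLike` interfaces are minimal stand-ins invented for this example, and `ensureClient` is a hypothetical name; the point is only that no old subscription survives a token change.

```typescript
// Minimal stand-ins for the real client and subscription objects (assumptions).
interface SubscriptionLike {
  unsubscribe(): void;
  removeAllListeners(): void;
}
interface ClientLike {
  disconnect(): void;
}
interface ConnectionState {
  client: ClientLike | null;
  token: string | null;
  subscriptions: Map<string, SubscriptionLike>;
}

// Hypothetical helper: returns the existing client when the token is unchanged,
// otherwise tears everything down and builds a fresh client.
function ensureClient(
  state: ConnectionState,
  token: string,
  createClient: (token: string) => ClientLike,
): ClientLike {
  if (state.client && state.token === token) {
    return state.client; // same token: reuse the connection
  }
  if (state.client) {
    // Token rotated: remove every listener before dropping the old client,
    // so no ghost listener can fire on a stale subscription.
    state.subscriptions.forEach(sub => {
      sub.unsubscribe();
      sub.removeAllListeners();
    });
    state.subscriptions.clear();
    state.client.disconnect();
  }
  state.client = createClient(token);
  state.token = token;
  return state.client;
}
```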

The Fallback Mechanism: Pessimistic Polling

WebSockets are optimistic infrastructure - they assume stable connections. Real networks aren't optimistic. Our fallback triggers under three conditions:

  1. Connection blocked: Firewalls, corporate proxies, or network policies
  2. Connection failed: Server unreachable, handshake timeout
  3. Message timeout: Connected but no response within threshold

Code
// Timeout-based fallback
const timeoutId = helperFunc && !fallbackTriggeredRef.current && withPolling
  ? setTimeout(() => {
      if (helperFunc && !fallbackTriggeredRef.current) {
        fallbackTriggeredRef.current = true;
        helperFunc();  // Trigger polling
      }
    }, 10000)
  : null;

// Immediate fallback when connection blocked
if (
  (!centrifugeRef.current && !centrifugeInstance) ||
  !token ||
  !isTokenValid
) {
  if (helperFunc && !fallbackTriggeredRef.current && orderId && withPolling) {
    fallbackTriggeredRef.current = true;
    helperFunc();  // Don't wait, poll now
  }
}

The fallbackTriggeredRef is critical - it's a circuit breaker. Without it, reconnection attempts would spawn multiple polling loops, hammering your backend.
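
Stripped of React, the circuit breaker is a one-shot guard around the polling starter. The sketch below is an illustration under that reading; `createFallbackGuard` and its method names are hypothetical, and the timer stands in for the 10-second message timeout.

```typescript
// Hypothetical one-shot guard: however many code paths ask for the fallback,
// polling starts at most once.
function createFallbackGuard(startPolling: () => void) {
  let triggered = false;
  let timer: ReturnType<typeof setTimeout> | null = null;

  // Starts polling on the first call only; returns whether this call started it.
  const trigger = (): boolean => {
    if (triggered) return false;
    triggered = true;
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    startPolling();
    return true;
  };

  return {
    trigger,
    // Arms the timeout-based fallback; a no-op if already armed or triggered.
    arm(timeoutMs: number): void {
      if (triggered || timer !== null) return;
      timer = setTimeout(trigger, timeoutMs);
    },
    // Cancels a pending timeout, for example on unmount or on success.
    cancel(): void {
      if (timer !== null) {
        clearTimeout(timer);
        timer = null;
      }
    },
    isTriggered: (): boolean => triggered,
  };
}
```

Both the "connection blocked" path and the timeout path can call `trigger()` freely; only the first call reaches the backend.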

Preventing Duplicate Processing

The most dangerous bug in payment systems isn't failed payments - it's duplicate successful payments. We use publication-level deduplication:

Code
subscription?.on('publication', ctx => {
  const { isPaid, orderUuid, status } = ctx.data?.data || {};

  if (status === 'PROCESSED' && isPaid) {
    // CRITICAL: Disconnect BEFORE calling success callbacks
    disconnect(subscription);

    // Now safe to trigger UI updates
    const registeredCallbacks = subscriptionCallbacks?.get(channel);
    registeredCallbacks?.forEach(cb => cb.onSuccess?.(orderUuid));
  }
});

💡 Order matters: disconnect first, then notify. This ensures late-arriving duplicate messages hit a dead connection rather than triggering duplicate success handlers.
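
As an extra layer on top of the disconnect-first ordering, a handler can also remember which orders it has already settled. This is a sketch of that idea rather than the code we run; `createPaidHandler`, the `Publication` shape, and the in-memory `Set` are assumptions made for the example.

```typescript
type Publication = { orderUuid?: string; status?: string; isPaid?: boolean };

// Hypothetical factory: returns a handler that reports success at most once
// per order, tearing down the subscription before notifying anyone.
function createPaidHandler(
  teardown: () => void,
  onSuccess: (orderUuid: string) => void,
) {
  const settled = new Set<string>();

  return (publication: Publication): boolean => {
    const { orderUuid, status, isPaid } = publication;
    if (!orderUuid || status !== 'PROCESSED' || !isPaid) return false;
    if (settled.has(orderUuid)) return false; // duplicate delivery: ignore

    settled.add(orderUuid);
    teardown(); // stop listening first
    onSuccess(orderUuid); // then update the UI
    return true;
  };
}
```

Client-side deduplication only protects the UI; preventing a duplicate charge still depends on idempotency on the server.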

The 3D Secure Interruption Problem

3D Secure redirects users to bank verification pages. When they return, the original React tree is gone - along with all component state. Our solution leverages the singleton pattern:

Code
if (verifyUrl) {
  const isFinalPageCard =
    paymentMethod === PaymentMethods.CARD && isFinalPage;

  if (!isFinalPageCard) {
    // Compare parsed origins: a plain prefix check would also match
    // look-alike hosts that merely begin with our origin string
    const target = new URL(verifyUrl, window.location.origin);

    if (target.origin === window.location.origin) {
      // Internal redirect - use client-side navigation
      router.push(`${target.pathname}${target.search}${target.hash}`);
    } else {
      // External redirect - full page navigation
      window.location.assign(target.href);
    }
  }
}

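When users return from 3DS through an internal redirect, client-side navigation keeps the page alive, so the global connection persists: the hook re-mounts, re-registers its callbacks, and receives any queued messages. An external redirect is a full page navigation, which reloads the JavaScript context and discards module-level state; in that case the hook creates a fresh connection on mount.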
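
The internal-versus-external decision can be isolated in a pure function, which makes it easy to test without a router. This is a sketch under the assumption that `verifyUrl` is either absolute or relative to the current origin; `resolveRedirect` and `RedirectTarget` are hypothetical names. Parsing with `URL` and comparing origins avoids accepting look-alike hosts that a string-prefix comparison would treat as internal.

```typescript
type RedirectTarget =
  | { kind: 'internal'; path: string }
  | { kind: 'external'; href: string };

// Hypothetical helper: classifies a verification URL relative to the current origin.
function resolveRedirect(verifyUrl: string, currentOrigin: string): RedirectTarget {
  // Relative URLs resolve against the current origin; absolute ones ignore it.
  const target = new URL(verifyUrl, currentOrigin);

  if (target.origin === currentOrigin) {
    // Same origin: safe to hand to the client-side router.
    return {
      kind: 'internal',
      path: `${target.pathname}${target.search}${target.hash}`,
    };
  }
  // Different origin: requires a full page navigation.
  return { kind: 'external', href: target.href };
}
```

The caller would pass `window.location.origin` as `currentOrigin`, then use `router.push` for the internal case and `window.location.assign` for the external one.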

Observability: You Can't Fix What You Can't See

Every critical path includes instrumentation:

Code
logEvent('event_publication', {
  ctxData,
  orderId,
  timestamp,
});

// On failure paths
logEvent('socket_connection_failed_fallback', {
  orderId,
  timestamp,
});

This isn't vanity logging - it's distributed tracing for client-side code. When a user reports "payment stuck," we can correlate:

  1. When they initiated payment
  2. Whether WebSocket connected
  3. If fallback triggered
  4. What messages (if any) arrived
  5. Exact timestamp of each state transition
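
To make that correlation concrete, here is a small in-memory sketch of an event log that stamps every event and can reconstruct the timeline for one order. It only illustrates the idea: `createEventLog` and `timelineFor` are hypothetical, and a production `logEvent` would ship events to an analytics or logging backend instead of an array.

```typescript
type ClientEvent = {
  name: string;
  orderId: string;
  timestamp: number;
  props: Record<string, unknown>;
};

// Hypothetical in-memory log; the clock is injectable so tests stay deterministic.
function createEventLog(now: () => number = Date.now) {
  const events: ClientEvent[] = [];

  return {
    // Records an event; every event must carry the order it belongs to.
    logEvent(name: string, props: { orderId: string; [key: string]: unknown }): void {
      events.push({ name, orderId: props.orderId, timestamp: now(), props });
    },
    // Returns one order's events in chronological order as "timestamp name" lines.
    timelineFor(orderId: string): string[] {
      return events
        .filter(event => event.orderId === orderId)
        .sort((a, b) => a.timestamp - b.timestamp)
        .map(event => `${event.timestamp} ${event.name}`);
    },
  };
}
```

Keying every event by `orderId` is what lets a support query line up payment initiation, connection state, and fallback activity for a single stuck payment.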

Cleanup: The Forgotten Edge Case

React's useEffect cleanup runs on unmount AND before each effect re-run. Forgetting this causes callback accumulation:

Code
return () => {
  if (timeoutId) clearTimeout(timeoutId);

  listenerAttachedRef.current = false;

  // Remove THIS component's callbacks, not all callbacks
  if (callbackSetRef.current && callbacksRef.current) {
    callbackSetRef.current.delete(callbacksRef.current);

    // Clean empty sets to prevent memory leaks
    if (callbackSetRef.current.size === 0) {
      subscriptionCallbacks?.delete(channel);
    }
  }
};

ℹ️ The ref pattern here is deliberate: we store callbacks in refs to reference the current callback identity in cleanup, not a stale closure capture.
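
Outside React, the same cleanup rule looks like the sketch below: each component removes only the callback object it registered, identified through a ref-like holder, and the channel entry is pruned once nobody is listening. The names `register` and `unregister` and the plain `{ current }` object standing in for a React ref are assumptions for illustration.

```typescript
type Callbacks = { onSuccess?: (orderUuid: string) => void };
type CallbacksRef = { current: Callbacks | null };

const callbacksByChannel = new Map<string, Set<Callbacks>>();

// Hypothetical helper: registers the callbacks currently held by the ref.
function register(channel: string, ref: CallbacksRef): void {
  if (!ref.current) return;
  const listeners = callbacksByChannel.get(channel) ?? new Set<Callbacks>();
  listeners.add(ref.current);
  callbacksByChannel.set(channel, listeners);
}

// Hypothetical helper: removes THIS component's callbacks by identity and
// prunes the channel entry when the last listener leaves.
function unregister(channel: string, ref: CallbacksRef): void {
  const listeners = callbacksByChannel.get(channel);
  const mine = ref.current;
  if (!listeners || !mine) return;

  listeners.delete(mine);
  ref.current = null; // makes a second cleanup call harmless
  if (listeners.size === 0) {
    callbacksByChannel.delete(channel);
  }
}
```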

Performance Considerations

Memory Leaks

Global singletons persist indefinitely, so in long-running sessions anything left behind in the registries is never garbage-collected:

Code
// We clean subscriptions when callbacks empty
if (callbackSetRef.current.size === 0) {
  subscriptionCallbacks?.delete(channel);
}

// But also need periodic connection health checks
// (not shown: heartbeat monitoring)
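
Our heartbeat monitoring is not shown in this article. As one possible shape for such a check, the sketch below tracks the time of the last observed activity and reports the connection as stale once a threshold passes. `createHealthMonitor`, its method names, and the threshold are hypothetical, and what counts as "activity" (a publication, a ping reply) is left to the caller.

```typescript
// Hypothetical staleness tracker; the clock is injectable so it can be tested
// without real timers.
function createHealthMonitor(staleAfterMs: number, now: () => number = Date.now) {
  let lastActivity = now();

  return {
    // Call whenever the connection shows signs of life.
    recordActivity(): void {
      lastActivity = now();
    },
    // True once no activity has been seen for longer than the threshold.
    isStale(): boolean {
      return now() - lastActivity > staleAfterMs;
    },
  };
}
```

A periodic check of `isStale()` could then trigger a reconnect or the polling fallback for connections that look open but have gone quiet.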

Lessons Learned

Key Takeaways

  1. Global state isn't evil - unmanaged global state is. Explicit singleton patterns with clear ownership work.

  2. Fallbacks aren't fallbacks if they're slow. Our polling triggers at 10s, but payment UX research shows users abandon at 8s. We're investigating optimistic UI updates while awaiting confirmation.

  3. Refs beat state for synchronization logic. React state is for rendering. Connection state is for coordination.

  4. Test the failure modes. Our Jest suite includes tests that deliberately block WebSocket connections to verify fallback behavior.

Conclusion

Building resilient real-time systems requires thinking beyond the happy path. The patterns here - singleton connections, callback registries, circuit-breaker fallbacks, and aggressive deduplication - emerged from production incidents, not theoretical design sessions.

The code isn't pretty. There are refs everywhere, globals that would make a purist wince, and cleanup logic that spans 30+ lines. But it handles thousands of daily payments with a 99.9% real-time notification success rate, falling back gracefully for the remainder.

Sometimes robust beats elegant.