Building Resilient Real-Time Payment Systems in Next.js
How we architected a fault-tolerant payment notification system using WebSockets, singleton patterns, and intelligent fallback mechanisms.
The Problem Space
Modern payment flows demand instant feedback. Users click "Pay" and expect immediate confirmation - not a loading spinner that spins into eternity. But here's the reality: WebSocket connections fail. Networks hiccup. Tokens expire mid-transaction. And when you're processing real money, "it usually works" isn't good enough.
At scale, we discovered that a certain percentage of WebSocket connections fail silently. For a platform processing thousands of transactions daily, that's hundreds of frustrated users watching infinite spinners. We needed a system that:
- Maintains persistent connections across route transitions
- Handles token rotation without dropping messages
- Falls back gracefully when real-time channels fail
- Avoids duplicate notifications to prevent double-processing
This article dissects our solution - a pattern we call "Optimistic Real-Time with Pessimistic Fallback."
Architecture Overview
The Singleton Connection Pattern
Our first insight: WebSocket connections should outlive React components. Standard hooks create connections on mount and destroy them on unmount. In a SPA with frequent route changes, this creates connection churn - and lost messages during the critical milliseconds of reconnection.
// Assumes centrifuge-js v3+, which exposes these as named exports
import { Centrifuge, Subscription } from 'centrifuge';
// Global singleton - lives outside React's lifecycle
let globalCentrifuge: Centrifuge | null = null;
let globalSubscriptions: Map<string, Subscription> | null = null;
let globalCentrifugeToken: string | null = null;
// Callback registry - allows multiple components to listen
let subscriptionCallbacks: Map<
  string,
  Set<{
    onError?: (error?: any, orderUuid?: string, status?: string) => void;
    onSuccess?: (orderUuid: string) => void;
  }>
> | null = null;
This isn't "just" a global variable - it's a connection pool of one with attached callback registries. The pattern enables:
- Connection Reuse: Navigate from /checkout to /processing - same connection
- Token Rotation: Detect token changes, gracefully tear down, reconnect
- Multi-Component Subscription: Multiple UI elements can react to the same payment event
Why Not React Context? Context re-renders consumers on every state change. For a high-frequency WebSocket emitting order status updates, this creates cascade re-renders. The singleton pattern decouples connection state from React's render cycle entirely.
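To make the callback registry concrete, here is a minimal sketch of how several components might register and remove listeners on one channel. The names `registerCallbacks`, `notifySuccess`, `registry`, and `PaymentCallbacks` are hypothetical and chosen for illustration; the production module described above may be organised differently.

```typescript
// Hypothetical names throughout; this is an illustration, not the production module.
type PaymentCallbacks = {
  onSuccess?: (orderUuid: string) => void;
  onError?: (error?: unknown, orderUuid?: string, status?: string) => void;
};

// Module-level registry: channel name -> set of listeners.
const registry = new Map<string, Set<PaymentCallbacks>>();

// Register one component's callbacks; returns a function that removes only them.
function registerCallbacks(channel: string, callbacks: PaymentCallbacks): () => void {
  let listeners = registry.get(channel);
  if (!listeners) {
    listeners = new Set<PaymentCallbacks>();
    registry.set(channel, listeners);
  }
  listeners.add(callbacks);

  return () => {
    const current = registry.get(channel);
    if (!current) return;
    current.delete(callbacks);
    // Drop empty sets so the map does not grow over a long session.
    if (current.size === 0) registry.delete(channel);
  };
}

// Fan a success event out to every listener currently registered on the channel.
function notifySuccess(channel: string, orderUuid: string): void {
  registry.get(channel)?.forEach(cb => cb.onSuccess?.(orderUuid));
}
```

Returning the unregister function from `registerCallbacks` fits naturally into a `useEffect` cleanup, so each component removes only its own listeners and none of this touches React state.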
Defensive Token Management
Tokens expire. Users leave tabs open overnight. Sessions rotate. We built a multi-layer validation strategy:
const validateUserToken = useCallback(
  (userToken: string) => {
    if (!userId) {
      return validateToken(userToken); // Basic JWT validation
    }
    return validateCentrifugeToken(userToken, userId); // User-scoped validation
  },
  [userId]
);
When a token mismatch is detected on a reconnection attempt:
if (
  globalCentrifuge &&
  globalCentrifugeToken &&
  globalCentrifugeToken !== token
) {
  // Nuclear option: full teardown
  globalSubscriptions?.forEach(subscription => {
    subscription.unsubscribe();
    subscription.removeAllListeners();
  });
  globalCentrifuge.disconnect();
  // Reset state completely
  globalCentrifuge = null;
  globalSubscriptions?.clear();
  globalCentrifugeToken = null;
  // This triggers fresh connection with new token
}
⚠️ Critical Insight: Partial cleanup causes ghost listeners. If you replace a connection without removing old subscriptions, you get duplicate event handling - double charges, double analytics, confused users.
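The teardown-then-reconnect flow can be reduced to a small token-aware accessor. The sketch below is an assumption-laden illustration: `RealtimeConnection`, `getConnection`, and the injected `connect` factory are hypothetical stand-ins, not the Centrifuge client API, and real code would also unsubscribe and remove listeners as described above.

```typescript
// Minimal stand-in for a real-time client; a real client has a much richer API.
interface RealtimeConnection {
  disconnect(): void;
}

let activeConnection: RealtimeConnection | null = null;
let activeToken: string | null = null;

// Return the shared connection, rebuilding it only when the token has changed.
function getConnection(
  token: string,
  connect: (token: string) => RealtimeConnection, // hypothetical factory
): RealtimeConnection {
  if (activeConnection && activeToken === token) {
    return activeConnection; // same token: reuse the existing connection
  }
  if (activeConnection) {
    // Token rotated: tear the old connection down fully before replacing it.
    activeConnection.disconnect();
    activeConnection = null;
    activeToken = null;
  }
  const next = connect(token);
  activeConnection = next;
  activeToken = token;
  return next;
}
```

Passing the factory in as a parameter keeps the accessor testable without a live server, which helps when verifying that exactly one teardown happens per token change.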
The Fallback Mechanism: Pessimistic Polling
WebSockets are optimistic infrastructure - they assume stable connections. Real networks aren't optimistic. Our fallback triggers under three conditions:
- Connection blocked: Firewalls, corporate proxies, or network policies
- Connection failed: Server unreachable, handshake timeout
- Message timeout: Connected but no response within threshold
// Timeout-based fallback
const timeoutId = helperFunc && !fallbackTriggeredRef.current && withPolling
  ? setTimeout(() => {
      if (helperFunc && !fallbackTriggeredRef.current) {
        fallbackTriggeredRef.current = true;
        helperFunc(); // Trigger polling
      }
    }, 10000)
  : null;
// Immediate fallback when connection blocked
if (
  (!centrifugeRef.current && !centrifugeInstance) ||
  !token ||
  !isTokenValid
) {
  if (helperFunc && !fallbackTriggeredRef.current && orderId && withPolling) {
    fallbackTriggeredRef.current = true;
    helperFunc(); // Don't wait, poll now
  }
}
The fallbackTriggeredRef is critical - it's a circuit breaker. Without it, reconnection attempts would spawn multiple polling loops, hammering your backend.
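The guard can also be pulled out of the component into a small one-shot helper. This is a sketch under assumptions: `createFallbackTrigger` and its `trigger`, `reset`, and `tripped` members are hypothetical names, and `startPolling` stands in for the helperFunc used above.

```typescript
// Wrap a polling starter so it fires at most once, no matter how many
// failure paths (timeout, blocked connection, invalid token) call it.
function createFallbackTrigger(startPolling: () => void) {
  let tripped = false;
  return {
    // Returns true if this call started polling, false if it was already started.
    trigger(): boolean {
      if (tripped) return false;
      tripped = true;
      startPolling();
      return true;
    },
    // Re-arm the guard, for example when a new payment attempt begins.
    reset(): void {
      tripped = false;
    },
    get tripped(): boolean {
      return tripped;
    },
  };
}
```

Both the 10-second timeout and the immediate blocked-connection path could then call `trigger()`, and whichever arrives second becomes a no-op.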
Preventing Duplicate Processing
The most dangerous bug in payment systems isn't failed payments - it's duplicate successful payments. We use publication-level deduplication:
subscription?.on('publication', ctx => {
  const { isPaid, orderUuid, status } = ctx.data?.data || {};
  if (status === 'PROCESSED' && isPaid) {
    // CRITICAL: Disconnect BEFORE calling success callbacks
    disconnect(subscription);
    // Now safe to trigger UI updates
    const registeredCallbacks = subscriptionCallbacks?.get(channel);
    registeredCallbacks?.forEach(cb => cb.onSuccess?.(orderUuid));
  }
});
💡 Order matters: disconnect first, then notify. This ensures late-arriving duplicate messages hit a dead connection rather than triggering duplicate success handlers.
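Disconnect ordering can be paired with an explicit guard keyed on the order ID. Note that this is an additional technique, not something the handler above does: the sketch below assumes a hypothetical `settledOrders` set and `handlePublication` function, and it keeps the same disconnect-then-notify order.

```typescript
type Publication = { orderUuid: string; status: string; isPaid: boolean };

// Orders that have already been reported as paid (hypothetical module-level guard).
const settledOrders = new Set<string>();

// Returns true only the first time a paid, processed publication is seen for an order.
function handlePublication(
  pub: Publication,
  disconnect: () => void,
  onSuccess: (orderUuid: string) => void,
): boolean {
  if (pub.status !== 'PROCESSED' || !pub.isPaid) return false;
  if (settledOrders.has(pub.orderUuid)) return false; // late duplicate: ignore
  settledOrders.add(pub.orderUuid);
  disconnect(); // stop listening first
  onSuccess(pub.orderUuid); // then notify
  return true;
}
```

The set makes the handler idempotent even if a duplicate slips through before the disconnect takes effect. Client-side deduplication only protects the UI, though; it is no substitute for idempotency on the server.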
The 3D Secure Interruption Problem
3D Secure redirects users to bank verification pages. When they return, the original React tree is gone - along with all component state. Our solution leverages the singleton pattern:
if (verifyUrl) {
  const isFinalPageCard =
    paymentMethod === PaymentMethods.CARD && isFinalPage;
  if (!isFinalPageCard) {
    const { origin } = window.location;
    if (verifyUrl.startsWith(origin)) {
      // Internal redirect - use client-side navigation
      const relativePath = verifyUrl.slice(origin.length) || '/';
      router.push(relativePath);
    } else {
      // External redirect - full page navigation
      window.location.assign(verifyUrl);
    }
  }
}
When users return from 3DS through client-side navigation, the global connection persists. The hook re-mounts, re-registers its callbacks, and receives any queued messages. Note that a full page navigation reloads the JavaScript context, so after an external redirect the module-level singleton is created afresh when the page loads again.
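One caveat with a prefix check such as `startsWith(origin)`: a URL like https://shop.example.com.evil.test/3ds also starts with https://shop.example.com. A stricter variant compares parsed origins with the standard `URL` class. The sketch below is illustrative only; `planRedirect` and `RedirectPlan` are hypothetical names, and the caller would still decide between `router.push` and `window.location.assign`.

```typescript
type RedirectPlan =
  | { kind: 'internal'; path: string }
  | { kind: 'external'; url: string };

// Decide how to follow a verification URL by comparing parsed origins.
// Throws a TypeError if verifyUrl cannot be parsed.
function planRedirect(verifyUrl: string, currentOrigin: string): RedirectPlan {
  // Relative URLs resolve against the current origin; absolute URLs are kept as-is.
  const target = new URL(verifyUrl, currentOrigin);
  if (target.origin === currentOrigin) {
    return { kind: 'internal', path: target.pathname + target.search + target.hash };
  }
  return { kind: 'external', url: target.href };
}
```

Keeping the decision in a pure function separates it from the navigation side effects, so the look-alike-domain case can be unit tested without a browser.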
Observability: You Can't Fix What You Can't See
Every critical path includes instrumentation:
logEvent('event_publication', {
  ctxData,
  orderId,
  timestamp,
});
// On failure paths
logEvent('socket_connection_failed_fallback', {
  orderId,
  timestamp,
});
This isn't vanity logging - it's distributed tracing for client-side code. When a user reports "payment stuck," we can correlate:
- When they initiated payment
- Whether WebSocket connected
- If fallback triggered
- What messages (if any) arrived
- Exact timestamp of each state transition
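The correlation steps above amount to grouping events by order and sorting them by time. Here is a minimal in-memory sketch of that idea; `recordEvent`, `timelineFor`, and `eventBuffer` are hypothetical, and a real setup would ship events to a logging backend instead of an array. The event names `payment_initiated` and `socket_connected` are invented for the example.

```typescript
type ClientEvent = { name: string; orderId: string; timestamp: number };

// Hypothetical in-memory stand-in for a logging backend.
const eventBuffer: ClientEvent[] = [];

function recordEvent(name: string, orderId: string, timestamp: number): void {
  eventBuffer.push({ name, orderId, timestamp });
}

// Rebuild the ordered sequence of events for a single order.
function timelineFor(orderId: string): string[] {
  return eventBuffer
    .filter(e => e.orderId === orderId) // filter returns a copy, so sorting leaves the buffer untouched
    .sort((a, b) => a.timestamp - b.timestamp)
    .map(e => `${e.timestamp} ${e.name}`);
}
```

Including the order ID and a timestamp on every event is what makes this reconstruction possible, which is why both appear in each logEvent payload.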
Cleanup: The Forgotten Edge Case
React's useEffect cleanup runs on unmount AND before each effect re-run. Forgetting this causes callback accumulation:
return () => {
  if (timeoutId) clearTimeout(timeoutId);
  listenerAttachedRef.current = false;
  // Remove THIS component's callbacks, not all callbacks
  if (callbackSetRef.current && callbacksRef.current) {
    callbackSetRef.current.delete(callbacksRef.current);
    // Clean empty sets to prevent memory leaks
    if (callbackSetRef.current.size === 0) {
      subscriptionCallbacks?.delete(channel);
    }
  }
};
ℹ️ The ref pattern here is deliberate: we store callbacks in refs so that cleanup references the current callback identity, not a stale closure capture.
Performance Considerations
Memory Leaks
Global singletons persist indefinitely. In long-running sessions:
// We clean subscriptions when callbacks empty
if (callbackSetRef.current.size === 0) {
  subscriptionCallbacks?.delete(channel);
}
// But also need periodic connection health checks
// (not shown: heartbeat monitoring)
Lessons Learned
Key Takeaways
- Global state isn't evil - unmanaged global state is. Explicit singleton patterns with clear ownership work.
- Fallbacks aren't fallbacks if they're slow. Our polling triggers at 10s, but payment UX research shows users abandon at 8s. We're investigating optimistic UI updates while awaiting confirmation.
- Refs beat state for synchronization logic. React state is for rendering. Connection state is for coordination.
- Test the failure modes. Our Jest suite includes tests that deliberately block WebSocket connections to verify fallback behavior.
Conclusion
Building resilient real-time systems requires thinking beyond the happy path. The patterns here - singleton connections, callback registries, circuit-breaker fallbacks, and aggressive deduplication - emerged from production incidents, not theoretical design sessions.
The code isn't pretty. There are refs everywhere, globals that would make a purist wince, and cleanup logic that spans 30+ lines. But it handles thousands of daily payments with a 99.9% real-time notification success rate, falling back gracefully for the remainder.
Sometimes robust beats elegant.