    INTRODUCTION

    While working on a digital transformation initiative for a global logistics provider, our team was tasked with modernizing their fleet tracking capabilities. The goal was to move from a polling-based legacy system to a real-time, event-driven architecture capable of tracking over 50,000 active assets simultaneously.

    We successfully deployed the initial version of the tracking engine to a staging environment. Functional tests passed, and latency was minimal. However, during a load simulation designed to mimic peak holiday traffic, our Kubernetes pods began crash-looping, and the logs were flooded with: FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory.

    For a system designed to be the backbone of operational visibility, this was a showstopper. This article outlines how we identified the root cause—a subtle memory leak in our WebSocket handling logic—and the specific steps we took to fix it. We share this so other engineering teams can better validate event-driven architectures before going to production.

    PROBLEM CONTEXT

    The system in question was a high-throughput middleware layer built with Node.js. Its primary role was to ingest telemetry data (GPS, temperature, fuel status) from Kafka, process it, and push updates to frontend dashboards via WebSockets.

    The architecture consisted of:

    • Ingestion Service: Consumed generic telemetry events from the message bus.
    • State Management: Used Redis for storing the latest state of assets (geo-hashing and metadata).
    • Real-Time Gateway: A cluster of Node.js instances using Socket.io and a Redis adapter to broadcast updates to connected operations managers.

    The business requirement dictated that operators could subscribe to specific “regions” or “fleets.” The backend needed to dynamically filter the firehose of data and send only relevant updates to specific socket IDs. To achieve this, we implemented dynamic subscription logic that utilized Redis Pub/Sub channels extensively.

    WHAT WENT WRONG

    The issue surfaced only under sustained load. In a development environment with 50 connected clients, memory usage was stable. However, when we scaled the simulation to 5,000 concurrent connections with frequent connect/disconnect cycles (simulating unstable cellular networks for field operators), the memory footprint of our Node.js processes grew linearly until they hit the V8 heap limit.

    We observed the following symptoms:

    • Sawtooth Memory Pattern: Monitoring tools showed RAM usage climbing steadily, dropping slightly on Garbage Collection (GC) events, but never returning to the baseline.
    • Event Loop Lag: As memory pressure increased, the event loop latency spiked from 10ms to over 500ms, causing jitter in the real-time feeds.
    • Zombie Listeners: Even after clients disconnected, the internal metrics suggested that the server was still processing subscription logic for those sessions.
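
    The sawtooth itself is easy to observe without external tooling. A periodic sampler like the sketch below (the interval is illustrative; in production we fed the samples into our monitoring stack rather than stdout) is enough to make a climbing baseline visible in the logs:

```javascript
// Log V8 heap usage at a fixed interval so a baseline that never
// returns to normal after GC shows up as a climbing trend in the logs.
function heapSampleMb() {
    const { heapUsed, heapTotal } = process.memoryUsage();
    return {
        usedMb: Math.round(heapUsed / 1024 / 1024),
        totalMb: Math.round(heapTotal / 1024 / 1024),
    };
}

setInterval(() => {
    const { usedMb, totalMb } = heapSampleMb();
    console.log(`[heap] ${usedMb}MB used / ${totalMb}MB total`);
}, 10_000).unref(); // unref() so the sampler never keeps the process alive
```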

    HOW WE APPROACHED THE SOLUTION

    To diagnose the leak, we couldn’t rely on standard logs. We needed to look inside the V8 engine’s memory allocation, using a systematic debugging approach that any team running Node.js at scale can reproduce.

    1. Heap Snapshot Analysis

    We attached the Chrome DevTools inspector to a running remote instance and took three heap snapshots:

    • Snapshot A: Baseline (just after startup).
    • Snapshot B: After 1,000 client connections were established.
    • Snapshot C: After those 1,000 clients were forcibly disconnected.

    Theoretically, Snapshot C should have been nearly identical to Snapshot A. It was not. Comparing Snapshot C against A revealed a massive accumulation of Closure and Subscriber objects.

    2. Identifying the Retainer

    Drilling down into the retainers, we found that our custom Redis subscription wrapper was creating an anonymous function for every incoming socket connection to handle specific channel patterns. When the socket disconnected, the socket object was cleaned up, but the reference to the anonymous function inside the Redis client’s message event listener remained active.

    Essentially, the Redis client (which is a global singleton in this context) was holding onto a callback for every client that had ever connected, preventing the closure scope from being garbage collected.

    FINAL IMPLEMENTATION

    The fix required refactoring how we handled dynamic subscriptions. Instead of binding a new listener to the global Redis client for every socket, we implemented a centralized dispatcher pattern.

    Here is a sanitized representation of the problematic approach versus the corrected architecture.

    The Anti-Pattern (Memory Leak)

    // BAD: This creates a permanent reference in the Redis client
    io.on('connection', (socket) => {
        const fleetId = socket.handshake.query.fleetId;
        
        // This listener is never removed from the subClient
        subClient.on('message', (channel, message) => {
            if (channel === `updates:${fleetId}`) {
                socket.emit('fleet_update', message);
            }
        });
    });
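
    For completeness: a minimal patch would be to keep a named reference to the handler and detach it on disconnect. This plugs the leak, but the singleton still holds one listener per connected socket, which is why we chose a refactor instead. A sketch, wrapped in a function here so it is self-contained (io and subClient are the same objects as above):

```javascript
// LESS BAD: detaching the per-socket handler on disconnect plugs the
// leak, but still scales to one listener per connected socket.
function attachPerSocketHandler(io, subClient) {
    io.on('connection', (socket) => {
        const fleetId = socket.handshake.query.fleetId;

        const onMessage = (channel, message) => {
            if (channel === `updates:${fleetId}`) {
                socket.emit('fleet_update', message);
            }
        };

        subClient.on('message', onMessage);

        socket.on('disconnect', () => {
            // Remove exactly the handler that was added for this socket
            subClient.removeListener('message', onMessage);
        });
    });
}
```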
    

    The Corrected Pattern

    We refactored the code to use a single listener that routes messages based on a local map of active sockets. This ensures that the Redis client only holds one reference, regardless of how many users are connected.

    // GOOD: Centralized dispatching
    const activeSubscriptions = new Map(); // Map<fleetId, Set<socketId>>
    
    // Single global listener
    subClient.on('message', (channel, message) => {
        // Extract ID from channel string
        const fleetId = extractId(channel); 
        
        if (activeSubscriptions.has(fleetId)) {
            const recipients = activeSubscriptions.get(fleetId);
            recipients.forEach(socketId => {
                io.to(socketId).emit('fleet_update', message);
            });
        }
    });
    
    io.on('connection', (socket) => {
        const fleetId = socket.handshake.query.fleetId;
        
        // Register socket
        if (!activeSubscriptions.has(fleetId)) {
            activeSubscriptions.set(fleetId, new Set());
            // Only subscribe to Redis if it's the first user for this fleet
            subClient.subscribe(`updates:${fleetId}`);
        }
        activeSubscriptions.get(fleetId).add(socket.id);
    
        // CLEANUP on disconnect
        socket.on('disconnect', () => {
            if (activeSubscriptions.has(fleetId)) {
                const set = activeSubscriptions.get(fleetId);
                set.delete(socket.id);
                
                if (set.size === 0) {
                    activeSubscriptions.delete(fleetId);
                    // Unsubscribe from Redis to save bandwidth
                    subClient.unsubscribe(`updates:${fleetId}`);
                }
            }
        });
    });
    

    Validation:

    We re-ran the load test with 5,000 concurrent connections. The memory profile remained flat. The heap size grew as connections came in and shrank immediately upon disconnection. The “sawtooth” pattern disappeared, and the event loop lag stabilized at sub-15ms levels.

    LESSONS FOR ENGINEERING TEAMS

    This experience highlighted several key practices that we now emphasize for any team building real-time applications:

    • Understand Closure Scope: In Node.js, closures are powerful but dangerous. If a closure is referenced by a long-lived object (like a database client or singleton), everything in that closure’s scope is immune to Garbage Collection.
    • Simulate Network Instability: Testing with stable connections is not enough. You must simulate “stormy” network conditions where clients rapidly connect and disconnect to trigger edge cases in cleanup logic.
    • Monitor Event Loop Lag: CPU usage is a lagging indicator. Event loop lag is a leading indicator of performance degradation in Node.js.
    • Profile Early: Do not wait for production crashes. Integrate heap profiling into your staging pipeline.
    • Centralize Event Handling: Avoid creating unique event handlers for individual users when a routed/multiplexed approach can serve the same purpose with constant memory complexity.

    WRAP UP

    Memory leaks in event-driven systems are often subtle, hiding behind successful functional tests until scale reveals them. By adopting strict patterns for listener management and rigorous load testing, we ensured the logistics platform could handle enterprise-scale traffic without degradation.

    If you are facing similarly complex backend challenges, our teams are ready to assist.
