
    Scatter-Gather Workflow Architecture

    INTRODUCTION

    During a recent engagement with a LegalTech SaaS client, we were tasked with modernizing their document review pipeline. The system needed to ingest massive legal contracts, split them into individual clauses, and process each clause through a specialized Large Language Model (LLM) for risk analysis.

    The scale was significant. A single document could spawn anywhere from 50 to 200 independent analysis tasks. To maintain a responsive user experience, these tasks had to run in parallel. However, the system could not proceed to the “Report Generation” phase until every single sub-task was complete.

    Under load, our initial orchestration logic, which relied on standard workflow webhooks, began to show cracks. We hit race conditions where the main process either resumed too early or, more frequently, never resumed at all because it missed the final signal. That architectural challenge prompted us to document the most robust pattern we have found for handling dynamic “Scatter-Gather” workflows in production environments.

    PROBLEM CONTEXT

    The business requirement was straightforward: Parallelize, then Synchronize.

    In distributed systems architecture, this is known as the Scatter-Gather pattern. The “Scatter” phase triggers $N$ sub-workflows (where $N$ is dynamic). The “Gather” phase waits for results and aggregates them.

    Our architecture utilized a low-code workflow orchestration platform integrated with custom Node.js microservices. The workflow structure looked like this:

    • Main Workflow: Received the document, calculated $N$ (number of chunks), and triggered $N$ sub-workflows via HTTP requests. It then entered a “Wait for Webhook” state to pause execution.
    • Sub-Workflows: Performed the long-running LLM analysis (taking 2–5 minutes per chunk).
    • The Challenge: How does the Main Workflow know when the last sub-workflow has finished so it can wake up and compile the final report?

    To solve this, the initial implementation used a “Collector Workflow.” Every sub-workflow, upon completion, would ping this centralized Collector. The Collector maintained a counter. When the counter reached $N$, the Collector would trigger the resume webhook for the Main Workflow.
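
    For reference, the Collector’s counting step looked roughly like the sketch below. This is illustrative pseudo-code in the same style as the snippets later in this article: getJobState and saveJobState are hypothetical stand-ins for the workflow engine’s variable storage, and the key detail is that the read-increment-write sequence is not atomic.

    // Illustrative sketch of the original Collector logic (not production code).
    // getJobState/saveJobState are hypothetical stand-ins for the workflow
    // engine's variable storage; nothing here is transactional.
    async function onSubWorkflowPing(jobId) {
        const state = await getJobState(jobId);   // read the current count
        state.completed = state.completed + 1;    // increment it in memory
        await saveJobState(jobId, state);         // write it back

        // If this ping appears to be the last one, resume the Main Workflow
        if (state.completed === state.total) {
            await http.post(mainWorkflowResumeUrl, { jobId: jobId, status: "COMPLETED" });
        }
    }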

    WHAT WENT WRONG

    While the Collector approach seemed logical on paper, it introduced significant fragility when we scaled to processing hundreds of sub-workflows simultaneously.

    1. Concurrency and Race Conditions

    When 100 sub-workflows finish at roughly the same time (e.g., within a few seconds of each other), they bombard the Collector workflow with concurrent requests. If the Collector is not perfectly transactional, two requests might read the current counter value (say, 98) simultaneously, increment it to 99, and write it back. The system effectively “loses” a completion event. The counter never reaches 100, and the Main Workflow stays paused forever—a “zombie” process.

    2. Single Point of Failure

    The Collector became a bottleneck. If the Collector workflow timed out or crashed due to a memory spike during the aggregation phase, all progress state was lost. There was no persistence layer independent of the workflow engine’s memory.

    3. Complexity in Error Handling

    If one sub-workflow failed (e.g., an LLM API timeout), the Collector would sit at $N-1$ indefinitely. The Main Workflow had no way of knowing that one task had died, locking the entire job in a pending state.

    HOW WE APPROACHED THE SOLUTION

    To fix this, we needed to move state management out of the ephemeral workflow execution context and into a dedicated, atomic data store. We evaluated three approaches:

    • Native Loops: Most workflow tools allow looping, but they often run sequentially or have strict limits on parallelism, which would kill our performance requirements.
    • Polling: The Main Workflow could wake up every minute to check a database. This is inefficient and introduces unnecessary latency.
    • Atomic Check-and-Resume (The Winner): We decided to implement an external state store (Redis or a lightweight SQL table) to track the “Job” status. This allowed us to perform atomic operations that are concurrency-safe.

    When companies hire workflow automation developers, distinguishing between a “working” solution and a “resilient” solution often comes down to this understanding of state persistence.

    FINAL IMPLEMENTATION

    We deprecated the “Collector Workflow” entirely. Instead, we implemented a pattern where the sub-workflows themselves determine if they are the final task.

    1. The Setup (Main Workflow)

    Before triggering the sub-workflows, the Main Workflow generates a unique job_id and writes the total count ($N$) to a Redis cache with a TTL (Time To Live).

    // Pseudo-code for Main Workflow Setup
    const jobId = generateUUID();
    const totalTasks = chunks.length; // N is dynamic: one task per clause chunk

    // Set the counters in Redis
    await redis.set(`job:${jobId}:total`, totalTasks);
    await redis.set(`job:${jobId}:completed`, 0);

    // Apply a TTL so the keys of abandoned jobs eventually expire
    await redis.expire(`job:${jobId}:total`, 86400);
    await redis.expire(`job:${jobId}:completed`, 86400);

    // Trigger N sub-workflows, passing the jobId
    await triggerSubWorkflows(subWorkflowUrl, { jobId: jobId, payload: chunks });
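
    The triggerSubWorkflows call above is left as pseudo-code. One way to implement the scatter step, assuming chunks is an array of clause texts and reusing the http.post pseudo-call from later in this article, is a plain HTTP fan-out like the hedged sketch below; chunkIndex is an illustrative field, not part of the original payload.

    // Minimal sketch of the scatter fan-out (illustrative only).
    // Each chunk becomes one sub-workflow invocation carrying the shared jobId.
    async function triggerSubWorkflows(subWorkflowUrl, { jobId, payload }) {
        const requests = payload.map((chunk, index) =>
            http.post(subWorkflowUrl, {
                jobId: jobId,        // lets the sub-workflow report back to the right job
                chunkIndex: index,   // handy for ordering and log correlation
                chunk: chunk,        // the clause text to analyze
            })
        );

        // Fire all N requests in parallel and surface any trigger failures
        await Promise.all(requests);
    }

    In practice you may want to batch or rate-limit these triggers, but the essential point is that every request carries the jobId.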
    

    2. The Execution (Sub-Workflows)

    Each sub-workflow performs its heavy lifting (the LLM call). Once finished, it performs an atomic increment operation on the external store.

    This is the critical shift: The sub-workflow checks if it is the last one.

     

    // Pseudo-code for Sub-Workflow completion logic
    async function onTaskComplete(jobId, result) {
        // 1. Save individual result to a persistent DB/Storage
        await saveResult(jobId, result);
    
        // 2. Atomic Increment of completed count
        // INCR returns the new value after incrementing
        const newCompletedCount = await redis.incr(`job:${jobId}:completed`);
        const totalCount = await redis.get(`job:${jobId}:total`);
    
        // 3. Check if this is the final task
        if (newCompletedCount === parseInt(totalCount, 10)) {
            console.log("All tasks finished. Resuming Main Workflow.");
            
            // 4. Trigger the Main Workflow Resume Webhook
            await http.post(mainWorkflowResumeUrl, { 
                jobId: jobId, 
                status: "COMPLETED" 
            });
        }
    }
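
    For completeness, here is what the “gather” side can look like once the resume webhook fires. This is a hedged sketch, not the client’s exact implementation: fetchResultsByJobId and generateReport are hypothetical helpers standing in for the persistent-store query and the report builder, while the Redis keys match the ones created in the setup step.

    // Pseudo-code for the Main Workflow after it resumes (the "Gather" phase).
    async function onMainWorkflowResumed({ jobId }) {
        // Pull every per-clause result that the sub-workflows persisted
        const results = await fetchResultsByJobId(jobId);

        // Aggregate into the final report; only the jobId traveled over the webhook
        const report = await generateReport(jobId, results);

        // Clean up the Redis counters now that the job is finished
        await redis.del(`job:${jobId}:total`);
        await redis.del(`job:${jobId}:completed`);

        return report;
    }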
    

    Why this works better

    • Atomicity: Redis INCR operations are atomic. Even if 50 tasks finish simultaneously, Redis guarantees the counter increments correctly from 1 to 100 without race conditions.
    • Decoupling: There is no “Collector” sitting in the middle consuming resources. The logic is distributed.
    • Resilience: If the Main Workflow sleeps for hours, the state remains safe in Redis/DB.

    LESSONS FOR ENGINEERING TEAMS

    Implementing high-concurrency workflows requires a shift from linear thinking to distributed state management. Here are key takeaways for teams looking to hire software developer talent for backend orchestration:

    • Trust Atomic Stores, Not Variables: Never rely on in-memory variables within a workflow engine to track distributed state. Use Redis, DynamoDB, or PostgreSQL for atomic counters.
    • Design for “At Least Once” Delivery: Webhooks can fail. Ensure your “Resume” signal has a retry mechanism if the Main Workflow doesn’t acknowledge receipt immediately.
    • Implement Dead Letter Logic: What if one task fails? Your logic should include a timeout or an error handler that increments a “Failed” counter. If Completed + Failed == Total, resume the main workflow with a “Partial Success” status. A sketch of this combined check follows this list.
    • Don’t Pass Huge Data: Don’t pass the actual results (e.g., extensive LLM text) back through the resume webhook. Store results in a database and only pass the job_id to the Main Workflow. The Main Workflow can then query the database for the aggregated results.
    • Observability is Mandatory: Tag every log with a job_id. When debugging a process with 100 parallel threads, you need to filter logs by the parent job to make sense of the noise.
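
    The sketch below makes the “At Least Once” and “Dead Letter” points concrete by extending the completion check from the final implementation. The failed counter, finalizeIfDone, and postWithRetry are illustrative assumptions rather than part of the original build; saveResult, http.post, mainWorkflowResumeUrl, and the Redis keys are the ones used earlier.

    // Illustrative extension of the completion check: a "failed" counter plus a
    // retried resume call. Helper names here are assumptions, not the original code.
    async function onTaskFailed(jobId, error) {
        // Persist the failure reason alongside the successful results
        await saveResult(jobId, { error: String(error) });

        // Count the failure, then run the same "is the job finished?" check
        await redis.incr(`job:${jobId}:failed`);
        await finalizeIfDone(jobId);
    }

    async function finalizeIfDone(jobId) {
        const completed = parseInt(await redis.get(`job:${jobId}:completed`), 10) || 0;
        const failed = parseInt(await redis.get(`job:${jobId}:failed`), 10) || 0;
        const total = parseInt(await redis.get(`job:${jobId}:total`), 10);

        // Counters only ever increase, so the last task to report always sees the
        // final values. Duplicate resumes are possible; the resume endpoint should
        // tolerate being called more than once.
        if (completed + failed === total) {
            const status = failed === 0 ? "COMPLETED" : "PARTIAL_SUCCESS";
            await postWithRetry(mainWorkflowResumeUrl, { jobId: jobId, status: status }, 3);
        }
    }

    // Naive retry helper so the "resume" signal survives transient webhook failures
    async function postWithRetry(url, body, attempts) {
        for (let i = 0; i < attempts; i++) {
            try {
                return await http.post(url, body);
            } catch (err) {
                if (i === attempts - 1) throw err;
                await new Promise((resolve) => setTimeout(resolve, 1000 * (i + 1)));
            }
        }
    }

    In this variant, onTaskComplete from the final implementation would call finalizeIfDone() after its own INCR instead of comparing the completed count against the total directly.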

    WRAP UP

    Transitioning from a fragile collector pattern to an atomic, state-driven architecture allowed our client to scale their document processing from tens to thousands of simultaneous pages without “zombie” workflows. Whether you are building internal automation or customer-facing SaaS products, handling parallelism requires respecting the complexities of distributed state.

    If you need help architecting scalable automation pipelines or need to hire backend engineers for distributed systems to audit your current workflow infrastructure, contact us.

    Social Hashtags

    #DistributedSystems #ScatterGather #WorkflowArchitecture #BackendEngineering #Concurrency #ScalableSystems #LLMEngineering
