    INTRODUCTION

    During a recent project for an enterprise SaaS platform, our team was tasked with a critical data retention requirement. To comply with strict data privacy regulations, the system had to systematically purge user analytics data that exceeded a specific age threshold. The scope involved deleting over one million records from an AWS DocumentDB cluster.

    However, this was not a simple bulk drop operation. For audit and compliance tracking, every individual document deletion required an event to be dispatched to Amazon EventBridge. While working on this data purge workflow, we realized the seemingly straightforward task was causing severe instability in our infrastructure. The process triggered high CPU usage and rapid memory consumption, ultimately leading to continuous Amazon Elastic Container Service (ECS) task restarts due to resource exhaustion.

    Massive database operations combined with network-bound API calls often expose architectural bottlenecks that remain hidden during testing at a smaller scale. We encountered a situation where a monolithic processing approach simply could not survive in production. This challenge inspired this article so other engineering teams can avoid the pitfalls of synchronous mass deletions and adopt decoupled data processing architectures.

    PROBLEM CONTEXT

    The core business use case required an automated background job to identify expired records, remove them from AWS DocumentDB, and emit an audit trail payload to Amazon EventBridge. This workflow was housed within a background worker service running as an ECS task.

    In the initial architectural design, the ECS task would execute a database query to find all expired records. Because the downstream EventBridge event required specific data fields from the deleted documents, the task had to read the documents before invoking the delete command. The logic followed a sequential pattern: query the database, load the records, iterate through the list, send the event to EventBridge, and finally execute the deletion.

    While this synchronous pattern is highly readable and works perfectly for a few hundred records, it breaks down entirely when the dataset scales into the millions. It forces the application container to handle heavy I/O operations, high memory allocation, and prolonged network connections simultaneously.
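
    The sequential pattern described above can be sketched roughly as follows. This is a hypothetical reconstruction for illustration, not the original code: `purge_expired_naive` and its parameters are invented names, and the collection and EventBridge client are passed in so the shape of the loop is easy to see.

```python
import json


def purge_expired_naive(collection, events_client, bus_name="audit-bus"):
    """Hypothetical reconstruction of the sequential pattern (illustrative only)."""
    # list() materializes every matching document in memory at once,
    # which is the step that stops scaling.
    expired = list(collection.find({"status": "expired"}))

    for document in expired:
        # One blocking EventBridge call per document ...
        events_client.put_events(
            Entries=[{
                "Source": "com.saas.data.cleanup",
                "DetailType": "DocumentDeleted",
                "Detail": json.dumps({"id": str(document["_id"])}),
                "EventBusName": bus_name,
            }]
        )
        # ... followed by one blocking delete per document.
        collection.delete_one({"_id": document["_id"]})

    return len(expired)
```

    With a few hundred documents the list is small and the loop finishes quickly. With a million, the whole result set sits in memory while two network round trips are made per document.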

    WHAT WENT WRONG

    The symptoms surfaced almost immediately during the first major production data purge. Our monitoring dashboards lit up with high CPU utilization alerts, followed shortly by memory exhaustion warnings. The container was repeatedly killed for exceeding its memory limit (an out-of-memory, or OOM, kill), and ECS kept restarting the task.

    The root causes of this failure were tied to three fundamental architectural oversights:

    • In-Memory Data Overload: The query fetched all one million records into memory at once. Even with relatively small document sizes, the application heap limit was quickly exceeded, causing fatal crashes.
    • Synchronous Network Bottlenecks: For every document held in memory, the system made synchronous API calls to EventBridge and DocumentDB. This caused the application to pause, holding onto memory allocations for extensive periods while waiting for network responses.
    • Lack of Resilience: The job had no checkpointing, so a crash midway discarded all progress. Upon restart, the system would query and load the exact same massive dataset again, creating a crash loop.

    It became evident that forcing a single container to handle discovery, transformation, API publishing, and database deletion was an anti-pattern. When companies hire software development teams, they expect robust solutions that handle edge cases without cascading failures, so we had to rethink the data pipeline.

    HOW WE APPROACHED THE SOLUTION

    Our primary objective was to relieve the ECS task of the heavy lifting. We needed to transition from a monolithic batch process to a streaming, decoupled architecture. This is a common realization when you hire cloud developers for enterprise modernization; dividing the workload into independent, scalable components is often the safest path forward.

    We considered implementing cursor-based pagination directly within the ECS task to process records in chunks. While this would solve the memory issue, it would still tie up the container for hours and leave the job exposed to network timeouts, or to incomplete processing if the task was stopped during a scale-in event.
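
    For comparison, the in-task pagination option could look something like the sketch below. It is not the design we shipped. The function name, the `handle_page` and `save_checkpoint` callbacks, and the default page size are all assumptions made for illustration. It pages by `_id` so that a persisted checkpoint lets a restarted task resume instead of starting over.

```python
def process_in_pages(collection, handle_page, save_checkpoint,
                     last_id=None, page_size=1000):
    """Walk expired documents in _id order, one bounded page at a time."""
    total = 0
    while True:
        query = {"status": "expired"}
        if last_id is not None:
            # Resume strictly after the last document that was fully handled.
            query["_id"] = {"$gt": last_id}

        page = list(
            collection.find(query, {"_id": 1, "audit_data": 1})
            .sort("_id", 1)
            .limit(page_size)
        )
        if not page:
            return total

        handle_page(page)          # e.g. publish events, then delete
        last_id = page[-1]["_id"]
        save_checkpoint(last_id)   # persisted so a restart can resume here
        total += len(page)
```

    Memory stays bounded by `page_size`, but the container still performs every network call itself, which is the limitation that led us to the queue-based design.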

    Instead, we opted for an event-driven fan-out architecture using Amazon Simple Queue Service (SQS) and AWS Lambda. The new architectural flow was designed as follows:

    • Discovery Phase (ECS): The ECS task queries DocumentDB using a highly optimized projection query. It only retrieves the document ID and the minimal fields required for the EventBridge payload. It processes these via a cursor and batches them into SQS messages.
    • Buffering Phase (SQS): SQS acts as a highly durable buffer, holding the deletion instructions. This prevents the downstream processing from being overwhelmed.
    • Execution Phase (Lambda): AWS Lambda functions consume the SQS messages in manageable batches. Each Lambda invocation performs the EventBridge dispatch and the DocumentDB deletion.

    This separation of concerns meant the ECS task only needed enough memory to process a small cursor batch, while Lambda could automatically scale to handle the I/O-heavy deletion process.

    FINAL IMPLEMENTATION

    The implementation required careful configuration to ensure AWS Lambda did not overwhelm the DocumentDB cluster with too many concurrent connections. When you hire Python developers for backend architecture, connection pooling and concurrency limits are critical considerations in serverless environments.

    1. The ECS Discovery Script

    We updated the ECS task to use cursor-based iteration and projection. Instead of loading full documents, we queried only what was necessary and pushed payloads to SQS in batches of 10, the maximum number of entries a single SendMessageBatch call accepts.

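    import json

    import boto3

    sqs_client = boto3.client("sqs")

    def send_batch(sqs_queue_url, entries):
        response = sqs_client.send_message_batch(QueueUrl=sqs_queue_url, Entries=entries)
        # A successful call can still contain per-entry failures.
        if response.get('Failed'):
            raise RuntimeError(f"SQS rejected {len(response['Failed'])} entries")

    def queue_records_for_deletion(db_collection, sqs_queue_url):
        query = {"status": "expired"}
        # Projection: fetch only the fields the audit event needs.
        projection = {"_id": 1, "audit_data": 1}

        # batch_size bounds how many documents the driver pulls per round trip.
        cursor = db_collection.find(query, projection).batch_size(1000)
        batch = []

        for document in cursor:
            doc_id = str(document['_id'])
            batch.append({
                'Id': doc_id,  # must be unique within a single batch request
                'MessageBody': json.dumps({
                    'document_id': doc_id,
                    'audit_data': document.get('audit_data')
                }, default=str)  # default=str guards against BSON values such as datetimes
            })

            # SendMessageBatch accepts at most 10 entries per call.
            if len(batch) == 10:
                send_batch(sqs_queue_url, batch)
                batch.clear()

        # Send remaining
        if batch:
            send_batch(sqs_queue_url, batch)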
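
    One detail worth handling in the discovery step: `send_message_batch` can succeed as an API call while rejecting individual entries, which it lists under `Failed` in the response. The sketch below resends only those entries. The helper name, attempt count, and backoff values are illustrative assumptions, not part of the original job.

```python
import time


def send_batch_with_retry(sqs_client, queue_url, entries,
                          max_attempts=3, backoff_seconds=0.5):
    """Send up to 10 entries, resending only those SQS reports as failed."""
    pending = list(entries)
    for attempt in range(1, max_attempts + 1):
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=pending)
        failed_ids = {item["Id"] for item in response.get("Failed", [])}
        if not failed_ids:
            return
        # Keep only the entries that were rejected and try those again.
        pending = [entry for entry in pending if entry["Id"] in failed_ids]
        if attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(
        f"{len(pending)} SQS entries still failing after {max_attempts} attempts"
    )
```

    Raising after the final attempt keeps the failure visible, so a record is never silently skipped and left undeleted.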
    

    2. The Lambda Processor

    The Lambda function was configured with an SQS trigger. Crucially, we capped the function's reserved concurrency. DocumentDB enforces a maximum connection count that depends on instance size, so letting Lambda scale out unchecked would exhaust the available connections and cause timeouts. We set the limit to 50, ensuring a steady, safe drain of the queue.

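    import json
    import os

    import boto3
    from bson import ObjectId
    from pymongo import MongoClient

    # Clients are created once per execution environment and reused across invocations.
    eventbridge = boto3.client("events")
    # Assumption: DOCDB_URI, "analytics" and "user_events" are placeholder names.
    db_collection = MongoClient(os.environ["DOCDB_URI"])["analytics"]["user_events"]

    def lambda_handler(event, context):
        for record in event['Records']:
            payload = json.loads(record['body'])
            doc_id = payload['document_id']
            audit_data = payload['audit_data']

            # 1. Send to EventBridge
            response = eventbridge.put_events(
                Entries=[{
                    'Source': 'com.saas.data.cleanup',
                    'DetailType': 'DocumentDeleted',
                    'Detail': json.dumps({'id': doc_id, 'data': audit_data}),
                    'EventBusName': 'audit-bus'
                }]
            )
            # put_events reports rejected entries in the response instead of raising.
            if response['FailedEntryCount']:
                raise RuntimeError(f"EventBridge rejected event for {doc_id}")

            # 2. Delete from DocumentDB (a repeat delete on retry is a harmless no-op)
            db_collection.delete_one({"_id": ObjectId(doc_id)})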
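
    The concurrency cap itself can be applied with a single API call. The sketch below is one way to do it with boto3, alongside a small sizing helper. Both function names, and the idea of reserving some connections for other clients, are illustrative assumptions; the actual connection limit depends on your DocumentDB instance class.

```python
def max_safe_concurrency(connection_limit, connections_per_invocation=1,
                         reserved_for_others=0):
    """Upper bound on concurrent invocations given a database connection budget."""
    usable = connection_limit - reserved_for_others
    return max(usable // connections_per_invocation, 0)


def cap_lambda_concurrency(lambda_client, function_name, limit=50):
    """Set reserved concurrency, which also acts as the function's hard cap."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

    In use, `lambda_client` would be `boto3.client("lambda")` and `function_name` the name of the deployed processor function. Choosing a limit well below the figure returned by `max_safe_concurrency` leaves headroom for connection churn during cold starts.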
    

    By shifting to this architecture, the ECS task memory footprint dropped to a flat, predictable baseline. The 1 million records were safely queued within minutes, and the Lambda functions steadily drained the queue over the next hour without dropping a single EventBridge payload or straining the database.

    LESSONS FOR ENGINEERING TEAMS

    When engineering leaders hire AWS developers for scalable data systems, they prioritize the ability to design resilient architectures. Here are the core insights from this implementation:

    • Never Load Unbounded Datasets into Memory: Always assume your dataset will grow beyond your container’s heap capacity. Use database cursors and pagination by default.
    • Use Database Projections: If you only need an ID and an audit field, do not fetch the entire 50KB document. Projections drastically reduce network payload size and memory consumption.
    • Decouple I/O-Heavy Operations: Mixing discovery (reading) and execution (API calls and deletes) in a single synchronous loop is a recipe for failure. Buffering with SQS isolates failures and enables independent scaling.
    • Control Serverless Concurrency: While Lambda scales beautifully, relational and document databases do not scale their connection pools infinitely. Always set concurrency limits on database-facing Lambda functions.
    • Design for Idempotency: Because network calls to EventBridge can fail, ensure your Lambda function can safely retry a message. If the event is sent but the database delete fails, the next retry should handle the state gracefully.
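
    To make the retry behavior in the last point concrete, here is a minimal sketch of a handler that uses Lambda's partial batch response for SQS. It assumes `ReportBatchItemFailures` is enabled on the event source mapping. `handle_batch` and `process_record` are hypothetical names, with `process_record` standing in for the event dispatch and delete for one message.

```python
def handle_batch(event, process_record):
    """Process each SQS record and report only the failed ones back to Lambda."""
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            # Returning the messageId tells Lambda to retry just this message
            # instead of redelivering the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

    A retried message may publish its audit event a second time if the first attempt failed only at the delete step, so downstream consumers should deduplicate on the document ID. The delete itself is safe to repeat, since deleting an already-removed document affects nothing.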

    WRAP UP

    Handling large-scale data modifications safely requires an architectural mindset that prioritizes decoupling and resource management. By moving from a centralized, memory-intensive batch process to a distributed SQS and Lambda pipeline, we eliminated ECS task crashes and ensured 100% compliance with the EventBridge audit requirements. If your organization is facing similar scaling challenges and you need to augment your team, you can contact us to discuss your requirements.
