AWS DocumentDB Mass Deletion with SQS & Lambda

Q: Why did loading records into memory crash the ECS task?

Containers have strict memory limits. Fetching one million database records simultaneously instantiates massive arrays of objects in the application's heap memory, surpassing the allocated RAM and triggering the operating system's Out of Memory (OOM) killer.

Q: Why use SQS instead of processing batches directly in ECS?

Direct processing holds the ECS task hostage for long periods. If the task fails or is interrupted by scaling policies, progress is lost. SQS provides a durable queue: if a deletion fails, the affected message becomes visible again and is retried, and with partial batch responses enabled on the Lambda trigger, the rest of the batch is not reprocessed.

Q: How does Lambda concurrency affect AWS DocumentDB?

AWS DocumentDB instances have finite connection limits based on their instance class. If thousands of SQS messages trigger thousands of parallel Lambda functions, the database will reject connections. Limiting Lambda concurrency ensures the database is queried at a safe, sustainable rate.

Q: Can we use AWS Step Functions instead of SQS?

Yes, Step Functions with a Map state can be used for distributed processing. However, SQS is often more cost-effective and simpler to implement for straightforward fan-out deletion patterns where complex workflow orchestration is not strictly required.

Q: How do you handle partial failures in the Lambda function?

By combining SQS dead-letter queues (DLQs) with idempotent operations. If the EventBridge call succeeds but the database deletion fails, the message becomes visible in SQS again after its visibility timeout. The next retry resends the event and attempts the deletion again, so downstream systems must be able to handle duplicate events. Messages that keep failing beyond the queue's maximum receive count move to the DLQ for inspection.
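
As a rough illustration of the DLQ side of that answer, the sketch below builds the `RedrivePolicy` attribute that tells SQS when to move a message to a dead-letter queue. The ARN, the receive count of 5, and the helper name `build_redrive_attributes` are assumptions made for this example, not values from the project.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the Attributes argument for the SQS SetQueueAttributes call.

    Once a message has been received max_receive_count times without being
    deleted, SQS moves it to the dead-letter queue identified by dlq_arn.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects RedrivePolicy as a JSON-encoded string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only.
attributes = build_redrive_attributes(
    "arn:aws:sqs:us-east-1:123456789012:deletion-dlq", max_receive_count=5
)
print(attributes)
# With boto3 this could be applied to the source queue as:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```

A low receive count surfaces poison messages quickly, while a higher one gives transient database or network errors more chances to clear before a message is parked.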

INTRODUCTION

During a recent project for an enterprise SaaS platform, our team was tasked with a critical data retention requirement. To comply with strict data privacy regulations, the system had to systematically purge user analytics data that exceeded a specific age threshold. The scope involved deleting over one million records from an AWS DocumentDB cluster.

However, this was not a simple bulk drop operation. For audit and compliance tracking, every individual document deletion required an event to be dispatched to Amazon EventBridge. While working on this data purge workflow, we realized the seemingly straightforward task was causing severe instability in our infrastructure. The process triggered high CPU usage and rapid memory consumption, ultimately leading to continuous Amazon Elastic Container Service (ECS) task restarts due to resource exhaustion.

Massive database operations combined with network-bound API calls often expose architectural bottlenecks that remain hidden during testing at a smaller scale. We encountered a situation where a monolithic processing approach simply could not survive in production. This challenge inspired this article so other engineering teams can avoid the pitfalls of synchronous mass deletions and adopt decoupled data processing architectures.

PROBLEM CONTEXT

The core business use case required an automated background job to identify expired records, remove them from AWS DocumentDB, and emit an audit trail payload to Amazon EventBridge. This workflow was housed within a background worker service running as an ECS task.

In the initial architectural design, the ECS task would execute a database query to find all expired records. Because the downstream EventBridge event required specific data fields from the deleted documents, the task had to read the documents before invoking the delete command. The logic followed a sequential pattern: query the database, load the records, iterate through the list, send the event to EventBridge, and finally execute the deletion.

While this synchronous pattern is highly readable and works perfectly for a few hundred records, it breaks down entirely when the dataset scales into the millions. It forces the application container to handle heavy I/O operations, high memory allocation, and prolonged network connections simultaneously.

WHAT WENT WRONG

The symptoms surfaced almost immediately during the first major production data purge. Our monitoring dashboards lit up with high CPU utilization alerts, followed shortly by memory exhaustion warnings. The ECS task was repeatedly killed for exceeding its memory limit (OOMKilled).

The root causes of this failure were tied to three fundamental architectural oversights:

In-Memory Data Overload: The query fetched all one million records into memory at once. Even with relatively small document sizes, the application heap limit was quickly exceeded, causing fatal crashes.
Synchronous Network Bottlenecks: For every document held in memory, the system made synchronous API calls to EventBridge and DocumentDB. This caused the application to pause, holding onto memory allocations for extensive periods while waiting for network responses.
Lack of Resilience: Because the task crashed midway, there was no checkpointing. Upon restart, the system would attempt to query and load the exact same massive dataset, creating an infinite crash loop.
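
To put rough numbers on the first point, the back-of-the-envelope sketch below compares the raw payload of loading every record at once with the payload of one bounded batch of trimmed-down records. All sizes are illustrative assumptions, not measurements from this incident, and in-process object overhead would make the real footprint larger.

```python
def payload_megabytes(record_count: int, bytes_per_record: int) -> float:
    """Raw payload size in megabytes (10**6 bytes), ignoring object overhead."""
    return record_count * bytes_per_record / 1_000_000


# Illustrative assumptions only.
TOTAL_RECORDS = 1_000_000
FULL_DOCUMENT_BYTES = 2_000    # a "relatively small" full document
TRIMMED_RECORD_BYTES = 200     # an ID plus a small audit field
BATCH_RECORDS = 1_000          # records held in memory at any one time

everything_at_once = payload_megabytes(TOTAL_RECORDS, FULL_DOCUMENT_BYTES)
one_bounded_batch = payload_megabytes(BATCH_RECORDS, TRIMMED_RECORD_BYTES)

print(f"all full documents at once: {everything_at_once:,.0f} MB")
print(f"one bounded batch:          {one_bounded_batch:.1f} MB")
```

Under these assumptions the unbounded load needs about 2,000 MB of raw payload, while a bounded batch needs about 0.2 MB regardless of how large the full dataset grows.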

It became evident that forcing a single container to handle discovery, transformation, API publishing, and database deletion was an anti-pattern. When companies hire software development teams, they expect robust solutions that handle edge cases without cascading failures, so we had to rethink the data pipeline.

HOW WE APPROACHED THE SOLUTION

Our primary objective was to relieve the ECS task of the heavy lifting. We needed to transition from a monolithic batch process to a streaming, decoupled architecture. This is a common realization when you hire cloud developers for enterprise modernization; dividing the workload into independent, scalable components is often the safest path forward.

We considered implementing database cursor pagination directly within the ECS task to process chunks of records. While this would solve the memory issue, it would still tie up the container for hours and leave us vulnerable to network timeouts or incomplete processing if the task scaled in.
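
The memory half of that trade-off can be sketched without a database. The helper below, a hypothetical `chunked` function, consumes any lazy iterable in fixed-size chunks so that only one chunk is held in memory at a time; a generator stands in for a real cursor here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in for a database cursor: a generator that produces records lazily.
fake_cursor = ({"_id": n} for n in range(25))
chunk_sizes = [len(chunk) for chunk in chunked(fake_cursor, 10)]
print(chunk_sizes)  # [10, 10, 5]
```

Chunking bounds memory, but it does nothing for the duration or restart problems described above, which is why it was not sufficient on its own.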

Instead, we opted for an event-driven fan-out architecture using Amazon Simple Queue Service (SQS) and AWS Lambda. The new architectural flow was designed as follows:

Discovery Phase (ECS): The ECS task queries DocumentDB using a highly optimized projection query. It only retrieves the document ID and the minimal fields required for the EventBridge payload. It processes these via a cursor and batches them into SQS messages.
Buffering Phase (SQS): SQS acts as a highly durable buffer, holding the deletion instructions. This prevents the downstream processing from being overwhelmed.
Execution Phase (Lambda): AWS Lambda functions consume the SQS messages in manageable batches. Each Lambda invocation performs the EventBridge dispatch and the DocumentDB deletion.

This separation of concerns meant the ECS task only needed enough memory to process a small cursor batch, while Lambda could automatically scale to handle the I/O-heavy deletion process.

FINAL IMPLEMENTATION

The implementation required careful configuration to ensure AWS Lambda did not overwhelm the DocumentDB cluster with too many concurrent connections. When you hire Python developers for backend architecture, connection pooling and concurrency limits are critical considerations in serverless environments.

1. The ECS Discovery Script

We updated the ECS task to use cursor-based iteration and projection. Instead of loading full documents, we queried only what was necessary and pushed payloads to SQS in batches of 10.

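import json

import boto3

# Created once at import time; region and credentials come from the task environment.
sqs_client = boto3.client('sqs')


def queue_records_for_deletion(db_collection, sqs_queue_url):
    query = {"status": "expired"}
    # Fetch only the fields the audit event needs, not the full document.
    projection = {"_id": 1, "audit_data": 1}

    # batch_size bounds how many documents each round trip to the server returns.
    cursor = db_collection.find(query, projection).batch_size(1000)
    batch = []

    for document in cursor:
        batch.append({
            'Id': str(document['_id']),
            'MessageBody': json.dumps({
                'document_id': str(document['_id']),
                'audit_data': document.get('audit_data')
            })
        })

        # SQS accepts at most 10 messages per SendMessageBatch call.
        if len(batch) == 10:
            send_batch(sqs_queue_url, batch)
            batch.clear()

    # Send remaining
    if batch:
        send_batch(sqs_queue_url, batch)


def send_batch(sqs_queue_url, entries):
    response = sqs_client.send_message_batch(QueueUrl=sqs_queue_url, Entries=entries)
    # The call can succeed overall while rejecting individual entries.
    if response.get('Failed'):
        raise RuntimeError(f"SQS rejected {len(response['Failed'])} messages")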
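
One detail worth guarding against: `send_message_batch` can return a `Failed` list even when the call itself succeeds, so a rejected entry is easy to lose silently. The following is a minimal sketch of resending only the rejected entries. The helper name, attempt count, and backoff are assumptions for illustration, and a small test double replaces the real SQS client so the example runs on its own.

```python
import time


def send_batch_with_retry(sqs_client, queue_url, entries,
                          max_attempts=3, backoff_seconds=0.5):
    """Send up to 10 entries, resending only those SQS reports as failed.

    Returns the entries that still failed after max_attempts.
    """
    pending = list(entries)
    for attempt in range(1, max_attempts + 1):
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=pending)
        failed_ids = {failure["Id"] for failure in response.get("Failed", [])}
        pending = [entry for entry in pending if entry["Id"] in failed_ids]
        if not pending:
            return []
        if attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)
    return pending


class FlakySqs:
    """Test double that rejects entry 'b' on the first call only."""

    def __init__(self):
        self.calls = 0

    def send_message_batch(self, QueueUrl, Entries):
        self.calls += 1
        if self.calls == 1:
            return {
                "Successful": [{"Id": e["Id"]} for e in Entries if e["Id"] != "b"],
                "Failed": [{"Id": "b", "SenderFault": False, "Code": "InternalError"}],
            }
        return {"Successful": [{"Id": e["Id"]} for e in Entries]}


fake = FlakySqs()
entries = [{"Id": "a", "MessageBody": "{}"}, {"Id": "b", "MessageBody": "{}"}]
leftover = send_batch_with_retry(
    fake, "https://example.invalid/queue", entries, backoff_seconds=0
)
print(leftover, fake.calls)  # [] 2
```

In a real task, `boto3.client("sqs")` would be passed in place of the test double. Entries flagged with `SenderFault` set to true are unlikely to succeed on retry, so a production version would probably log those and skip them.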

2. The Lambda Processor

The Lambda function was configured with an SQS trigger. Crucially, we limited the Reserved Concurrency of the Lambda function. DocumentDB has a maximum connection limit depending on instance size; allowing Lambda to scale infinitely would cause connection timeouts. We restricted concurrency to 50, ensuring a steady, safe drain of the queue.
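
The reasoning behind a concurrency cap is simple arithmetic: concurrent invocations multiplied by connections per invocation must stay below the connection limit, after subtracting what other clients need. A hedged sketch follows; the function name and every number in it are illustrative assumptions, not the limits of any particular instance class.

```python
def safe_reserved_concurrency(connection_limit: int,
                              connections_per_invocation: int,
                              reserved_for_others: int = 0,
                              headroom_percent: int = 20) -> int:
    """Largest Lambda concurrency that keeps total connections under the limit.

    headroom_percent is the share of the limit deliberately left unused.
    """
    if connections_per_invocation < 1:
        raise ValueError("connections_per_invocation must be at least 1")
    usable = connection_limit * (100 - headroom_percent) // 100 - reserved_for_others
    return max(usable // connections_per_invocation, 0)


# Illustrative numbers only; look up the real limit for your instance class.
concurrency = safe_reserved_concurrency(
    connection_limit=500,
    connections_per_invocation=2,
    reserved_for_others=100,
)
print(concurrency)  # 150
# With boto3, the result could be applied as:
#   boto3.client("lambda").put_function_concurrency(
#       FunctionName="your-function-name",
#       ReservedConcurrentExecutions=concurrency,
#   )
```

Keeping the database client at module scope in the Lambda function helps hold `connections_per_invocation` low, because warm execution environments reuse the existing connection.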

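import json
import os

import boto3
from bson import ObjectId
from pymongo import MongoClient

# Clients live at module scope so warm invocations reuse connections.
eventbridge = boto3.client('events')
# The environment variable names below are placeholders for this example.
db_collection = MongoClient(os.environ['DOCDB_URI'])[
    os.environ['DOCDB_DATABASE']][os.environ['DOCDB_COLLECTION']]


def lambda_handler(event, context):
    for record in event['Records']:
        payload = json.loads(record['body'])
        doc_id = payload['document_id']
        audit_data = payload['audit_data']

        # 1. Send to EventBridge
        response = eventbridge.put_events(
            Entries=[{
                'Source': 'com.saas.data.cleanup',
                'DetailType': 'DocumentDeleted',
                'Detail': json.dumps({'id': doc_id, 'data': audit_data}),
                'EventBusName': 'audit-bus'
            }]
        )
        # put_events reports rejected entries in the response rather than raising.
        if response['FailedEntryCount'] > 0:
            raise RuntimeError(f"EventBridge rejected the event for {doc_id}")

        # 2. Delete from DocumentDB (a retry that finds nothing to delete is harmless)
        db_collection.delete_one({"_id": ObjectId(doc_id)})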
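
With a handler shaped like this, an exception anywhere in the loop fails the invocation, so Lambda returns the entire batch to the queue, including messages that were already processed. One way to retry only the failed messages is the SQS partial batch response, sketched below under the assumption that the event source mapping has `ReportBatchItemFailures` enabled. `handle_batch` and `flaky_process` are hypothetical names, and the per-record work is injected so the example runs without AWS access.

```python
def handle_batch(event, process_record):
    """Process each SQS record and report only the failed ones back to Lambda."""
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception as error:  # any failure sends just this message back
            print(f"record {record['messageId']} failed: {error}")
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def flaky_process(record):
    """Stand-in for the EventBridge dispatch plus the DocumentDB delete."""
    if record["body"] == "bad":
        raise RuntimeError("simulated delete failure")


sample_event = {"Records": [
    {"messageId": "m-1", "body": "ok"},
    {"messageId": "m-2", "body": "bad"},
    {"messageId": "m-3", "body": "ok"},
]}
result = handle_batch(sample_event, flaky_process)
print(result)  # {'batchItemFailures': [{'itemIdentifier': 'm-2'}]}
```

Returning an empty `batchItemFailures` list tells Lambda the whole batch succeeded, so every message is deleted from the queue.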

By shifting to this architecture, the ECS task memory footprint dropped to a flat, predictable baseline. The 1 million records were safely queued within minutes, and the Lambda functions steadily drained the queue over the next hour without dropping a single EventBridge payload or straining the database.

LESSONS FOR ENGINEERING TEAMS

When engineering leaders hire AWS developers for scalable data systems, they prioritize the ability to design resilient architectures. Here are the core insights from this implementation:

Never Load Unbounded Datasets into Memory: Always assume your dataset will grow beyond your container’s heap capacity. Use database cursors and pagination by default.
Use Database Projections: If you only need an ID and an audit field, do not fetch the entire 50KB document. Projections drastically reduce network payload size and memory consumption.
Decouple I/O-Heavy Operations: Mixing discovery (reading) and execution (API calls and deletes) in a single synchronous loop is a recipe for failure. Buffering with SQS isolates failures and enables independent scaling.
Control Serverless Concurrency: While Lambda scales beautifully, relational and document databases do not scale their connection pools infinitely. Always set concurrency limits on database-facing Lambda functions.
Design for Idempotency: Because network calls to EventBridge can fail, ensure your Lambda function can safely retry a message. If the event is sent but the database delete fails, the next retry should handle the state gracefully.
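
The idempotency point can be made concrete with two small pieces: an event ID that stays the same across retries, so downstream consumers can discard duplicates, and a delete that treats an already-missing document as success. This is a sketch under assumptions: the helper names are hypothetical, EventBridge does not deduplicate on this ID by itself, and a fake collection stands in for the real DocumentDB client.

```python
import uuid
from types import SimpleNamespace


def deletion_event_id(document_id: str) -> str:
    """Derive the same event ID for every retry of the same document."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"urn:document-deleted:{document_id}"))


def delete_if_present(collection, document_id) -> bool:
    """Delete the document; a missing document means the work is already done."""
    result = collection.delete_one({"_id": document_id})
    return result.deleted_count == 1


class FakeCollection:
    """Minimal stand-in exposing delete_one with a deleted_count result."""

    def __init__(self, ids):
        self.ids = set(ids)

    def delete_one(self, query):
        found = query["_id"] in self.ids
        self.ids.discard(query["_id"])
        return SimpleNamespace(deleted_count=1 if found else 0)


collection = FakeCollection({"doc-1"})
first_attempt = delete_if_present(collection, "doc-1")   # document removed
second_attempt = delete_if_present(collection, "doc-1")  # retry, nothing left to remove
print(first_attempt, second_attempt)  # True False
print(deletion_event_id("doc-1") == deletion_event_id("doc-1"))  # True
```

Including the derived ID in the event detail gives audit consumers a stable key to deduplicate on when a retry resends an event for a document that was already reported.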

WRAP UP

Handling large-scale data modifications safely requires an architectural mindset that prioritizes decoupling and resource management. By moving from a centralized, memory-intensive batch process to a distributed SQS and Lambda pipeline, we eliminated ECS task crashes and ensured 100% compliance with the EventBridge audit requirements. If your organization is facing similar scaling challenges and you need to augment your team, you can contact us to discuss your requirements.


