INTRODUCTION
During a recent project for an enterprise SaaS platform, our team was tasked with a critical data retention requirement. To comply with strict data privacy regulations, the system had to systematically purge user analytics data that exceeded a specific age threshold. The scope involved deleting over one million records from an Amazon DocumentDB cluster.
However, this was not a simple bulk drop operation. For audit and compliance tracking, every individual document deletion required an event to be dispatched to Amazon EventBridge. While working on this data purge workflow, we realized the seemingly straightforward task was causing severe instability in our infrastructure. The process triggered high CPU usage and rapid memory consumption, ultimately leading to continuous Amazon Elastic Container Service (ECS) task restarts due to resource exhaustion.
Massive database operations combined with network-bound API calls often expose architectural bottlenecks that remain hidden during testing at a smaller scale. We encountered a situation where a monolithic processing approach simply could not survive in production. This challenge inspired this article so other engineering teams can avoid the pitfalls of synchronous mass deletions and adopt decoupled data processing architectures.
PROBLEM CONTEXT
The core business use case required an automated background job to identify expired records, remove them from Amazon DocumentDB, and emit an audit trail payload to Amazon EventBridge. This workflow was housed within a background worker service running as an ECS task.
In the initial architectural design, the ECS task would execute a database query to find all expired records. Because the downstream EventBridge event required specific data fields from the deleted documents, the task had to read the documents before invoking the delete command. The logic followed a sequential pattern: query the database, load the records, iterate through the list, send the event to EventBridge, and finally execute the deletion.
While this synchronous pattern is highly readable and works perfectly for a few hundred records, it breaks down entirely when the dataset scales into the millions. It forces the application container to handle heavy I/O operations, high memory allocation, and prolonged network connections simultaneously.
WHAT WENT WRONG
The symptoms surfaced almost immediately during the first major production data purge. Our monitoring dashboards lit up with high CPU utilization alerts, followed shortly by memory exhaustion warnings. ECS repeatedly stopped the task after its container was killed for exceeding its memory limit (an out-of-memory kill).
The root causes of this failure were tied to three fundamental architectural oversights:
- In-Memory Data Overload: The query fetched all one million records into memory at once. Even with relatively small document sizes, the application heap limit was quickly exceeded, causing fatal crashes.
- Synchronous Network Bottlenecks: For every document held in memory, the system made synchronous API calls to EventBridge and DocumentDB. This caused the application to pause, holding onto memory allocations for extensive periods while waiting for network responses.
- Lack of Resilience: There was no checkpointing, so a crash midway through the run lost all progress. Upon restart, the system would attempt to query and load the exact same massive dataset, creating a crash loop that never completed.
It became evident that forcing a single container to handle discovery, transformation, API publishing, and database deletion was an anti-pattern. A robust solution has to handle edge cases without cascading failures, so we had to rethink the data pipeline.
HOW WE APPROACHED THE SOLUTION
Our primary objective was to relieve the ECS task of the heavy lifting. We needed to move from a monolithic batch process to a streaming, decoupled architecture; dividing the workload into independent, scalable components is often the safest path forward.
We considered implementing database cursor pagination directly within the ECS task to process chunks of records. While this would solve the memory issue, it would still tie up the container for hours and leave us vulnerable to network timeouts or incomplete processing if the task scaled in.
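For reference, the chunked-cursor option we considered can be sketched roughly as follows. This is a minimal illustration, not the code we ran: `iter_chunks` and `fake_cursor` are hypothetical names, and the generator stands in for a real database cursor so the example runs without a cluster.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without materializing `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fake_cursor(total: int) -> Iterator[dict]:
    """Stand-in for a database cursor: produces documents lazily."""
    for i in range(total):
        yield {"_id": i, "audit_data": {"n": i}}


# Only one chunk of documents is held in memory at any time.
chunk_sizes = [len(chunk) for chunk in iter_chunks(fake_cursor(25), 10)]
print(chunk_sizes)  # [10, 10, 5]
```

Memory stays bounded by the chunk size, but the container still performs every network call itself, which is the limitation described above.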
Instead, we opted for an event-driven fan-out architecture using Amazon Simple Queue Service (SQS) and AWS Lambda. The new architectural flow was designed as follows:
- Discovery Phase (ECS): The ECS task queries DocumentDB using a highly optimized projection query. It only retrieves the document ID and the minimal fields required for the EventBridge payload. It processes these via a cursor and batches them into SQS messages.
- Buffering Phase (SQS): SQS acts as a highly durable buffer, holding the deletion instructions. This prevents the downstream processing from being overwhelmed.
- Execution Phase (Lambda): AWS Lambda functions consume the SQS messages in manageable batches. Each Lambda invocation performs the EventBridge dispatch and the DocumentDB deletion.
This separation of concerns meant the ECS task only needed enough memory to process a small cursor batch, while Lambda could automatically scale to handle the I/O-heavy deletion process.
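To make the buffering phase concrete, here is a hedged sketch of the queue attributes such a buffer might use, including a dead-letter queue for messages that keep failing. The function name, the ARN, and the default values are assumptions for illustration, not our production settings. SQS expects attribute values as strings, and `RedrivePolicy` as a JSON string.

```python
import json


def build_queue_attributes(dlq_arn: str,
                           max_receive_count: int = 5,
                           visibility_timeout_s: int = 180) -> dict:
    """Build the Attributes argument for an SQS create_queue call."""
    return {
        # How long a message stays hidden after a consumer receives it.
        "VisibilityTimeout": str(visibility_timeout_s),
        # After max_receive_count failed receives, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


# Placeholder ARN; a real one comes from the dead-letter queue you create first.
attributes = build_queue_attributes("arn:aws:sqs:us-east-1:123456789012:purge-dlq")

# Usage with boto3 (requires AWS credentials; the queue name is a placeholder):
#   import boto3
#   boto3.client("sqs").create_queue(QueueName="purge-queue", Attributes=attributes)
```

When Lambda consumes the queue, AWS guidance is to set the visibility timeout to several times the function timeout so that in-flight messages are not redelivered while an invocation is still working on them.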
FINAL IMPLEMENTATION
The implementation required careful configuration to ensure AWS Lambda did not overwhelm the DocumentDB cluster with too many concurrent connections. Connection reuse and concurrency limits are critical considerations in serverless environments.
1. The ECS Discovery Script
We updated the ECS task to use cursor-based iteration and projection. Instead of loading full documents, we queried only what was necessary and pushed payloads to SQS in batches of 10, the maximum number of entries a single SendMessageBatch call accepts.
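import json

import boto3

sqs_client = boto3.client("sqs")


def queue_records_for_deletion(db_collection, sqs_queue_url):
    """Stream expired documents from DocumentDB and enqueue them in SQS."""
    query = {"status": "expired"}
    # Fetch only the fields the audit event needs.
    projection = {"_id": 1, "audit_data": 1}
    cursor = db_collection.find(query, projection).batch_size(1000)
    batch = []
    for document in cursor:
        doc_id = str(document["_id"])
        batch.append({
            "Id": doc_id,  # must be unique within a single batch request
            "MessageBody": json.dumps({
                "document_id": doc_id,
                "audit_data": document.get("audit_data"),
            }, default=str),  # default=str guards against BSON types such as datetime
        })
        # SendMessageBatch accepts at most 10 entries per call.
        if len(batch) == 10:
            # Note: the response's "Failed" list is not inspected here.
            sqs_client.send_message_batch(QueueUrl=sqs_queue_url, Entries=batch)
            batch.clear()
    # Send the remaining partial batch.
    if batch:
        sqs_client.send_message_batch(QueueUrl=sqs_queue_url, Entries=batch)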
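One detail worth hedging against: `send_message_batch` can succeed as an HTTP call while rejecting individual entries, which it reports in the `Failed` list of the response. A minimal sketch of resending only those entries might look like this; `send_batch_with_retry`, the attempt count, and the backoff values are assumptions for illustration rather than part of our original script.

```python
import time


def send_batch_with_retry(sqs_client, queue_url, entries,
                          max_attempts=3, backoff_s=0.5):
    """Send up to 10 entries and resend only those SQS reports as failed."""
    pending = list(entries)
    for attempt in range(1, max_attempts + 1):
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=pending)
        # Each failed entry is identified by the Id we supplied in the request.
        failed_ids = {item["Id"] for item in response.get("Failed", [])}
        if not failed_ids:
            return
        pending = [entry for entry in pending if entry["Id"] in failed_ids]
        if attempt < max_attempts:
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise RuntimeError(
        f"{len(pending)} SQS entries still failing after {max_attempts} attempts"
    )
```

Raising after the final attempt keeps the discovery job from silently skipping documents that should have been queued for deletion.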
2. The Lambda Processor
The Lambda function was configured with an SQS trigger. Crucially, we limited the reserved concurrency of the Lambda function. DocumentDB has a maximum connection limit that depends on instance size; allowing Lambda to scale unchecked would exhaust those connections and cause connection errors and timeouts. We restricted concurrency to 50, ensuring a steady, safe drain of the queue.
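import json
import os

import boto3
from bson import ObjectId
from pymongo import MongoClient

# Clients are created once per execution environment so warm invocations reuse them.
eventbridge = boto3.client("events")
# DOCDB_URI, "analytics", and "user_events" are placeholder names for illustration.
db_collection = MongoClient(os.environ["DOCDB_URI"])["analytics"]["user_events"]

def lambda_handler(event, context):
    for record in event["Records"]:
        payload = json.loads(record["body"])
        doc_id = payload["document_id"]
        audit_data = payload["audit_data"]
        # 1. Send to EventBridge
        response = eventbridge.put_events(
            Entries=[{
                "Source": "com.saas.data.cleanup",
                "DetailType": "DocumentDeleted",
                "Detail": json.dumps({"id": doc_id, "data": audit_data}),
                "EventBusName": "audit-bus",
            }]
        )
        # put_events reports per-entry failures in the response instead of raising.
        if response["FailedEntryCount"]:
            raise RuntimeError(f"EventBridge rejected the event for {doc_id}")
        # 2. Delete from DocumentDB (a no-op if an earlier attempt already deleted it)
        db_collection.delete_one({"_id": ObjectId(doc_id)})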
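A refinement worth considering: by default, an exception in the handler causes every message in the SQS batch to be retried, including those that were already processed. Lambda's partial batch response feature lets the handler report only the failed messages. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled in its function response types; `handle_batch` and `process_record` are hypothetical names, with `process_record` standing in for the EventBridge dispatch and DocumentDB delete.

```python
import json
import logging

logger = logging.getLogger(__name__)


def handle_batch(event, process_record):
    """Process each SQS record and report only the failures back to Lambda."""
    failures = []
    for record in event["Records"]:
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            logger.exception("Failed to process message %s", record["messageId"])
            # Only these message IDs are returned to the queue for retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

If the setting is not enabled on the event source mapping, our understanding is that Lambda ignores this return value, so failed messages would be treated as successful. Verify the configuration before relying on this pattern.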
By shifting to this architecture, the ECS task memory footprint dropped to a flat, predictable baseline. The one million records were safely queued within minutes, and the Lambda functions steadily drained the queue over the next hour without dropping a single EventBridge payload or straining the database.
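For teams reproducing the concurrency cap on the Lambda processor, the limit can be applied through the Lambda API. The sketch below shows two options under stated assumptions: reserved concurrency on the function, which is what we used, and the maximum concurrency setting on the SQS event source mapping, which is an alternative we did not use in this project. The helper names, the function name, and the mapping UUID are placeholders.

```python
def cap_reserved_concurrency(lambda_client, function_name, limit=50):
    """Cap the total number of concurrent executions of one function."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


def cap_sqs_trigger_concurrency(lambda_client, mapping_uuid, limit=50):
    """Cap how many concurrent invocations the SQS trigger itself may start."""
    # Lambda documents a valid range of 2 to 1000 for this setting.
    if not 2 <= limit <= 1000:
        raise ValueError("MaximumConcurrency must be between 2 and 1000")
    lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,
        ScalingConfig={"MaximumConcurrency": limit},
    )


# Usage with boto3 (requires AWS credentials; names are placeholders):
#   import boto3
#   client = boto3.client("lambda")
#   cap_reserved_concurrency(client, "purge-processor")
```

With reserved concurrency alone, throttled invocations can return messages to the queue and count toward the dead-letter threshold, so capping at the event source mapping may be the gentler option for some workloads.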
LESSONS FOR ENGINEERING TEAMS
Teams building scalable data systems need to prioritize resilient architecture. Here are the core insights from this implementation:
- Never Load Unbounded Datasets into Memory: Always assume your dataset will grow beyond your container’s heap capacity. Use database cursors and pagination by default.
- Use Database Projections: If you only need an ID and an audit field, do not fetch the entire 50KB document. Projections drastically reduce network payload size and memory consumption.
- Decouple I/O-Heavy Operations: Mixing discovery (reading) and execution (API calls and deletes) in a single synchronous loop is a recipe for failure. Buffering with SQS isolates failures and enables independent scaling.
- Control Serverless Concurrency: While Lambda scales beautifully, relational and document databases do not scale their connection pools infinitely. Always set concurrency limits on database-facing Lambda functions.
- Design for Idempotency: Because network calls to EventBridge can fail, ensure your Lambda function can safely retry a message. If the event is sent but the database delete fails, the next retry should handle the state gracefully.
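The idempotency lesson can be illustrated with a small sketch. The delete side is naturally safe to retry, since deleting an already-deleted document simply matches nothing. The event side is not, so one approach is to attach a deterministic identifier to each audit event and let consumers discard repeats. Everything below is an assumption for illustration: the `event_id` field, the namespace, and the in-memory `AuditConsumer` are not part of our original payload or system.

```python
import uuid

# Hypothetical namespace for this sketch, derived from the event Source string.
EVENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "com.saas.data.cleanup")


def build_detail(document_id, audit_data):
    """Build an event Detail payload whose event_id is stable across retries."""
    return {
        "event_id": str(uuid.uuid5(EVENT_NAMESPACE, document_id)),
        "id": document_id,
        "data": audit_data,
    }


class AuditConsumer:
    """Toy downstream consumer that ignores events it has already recorded."""

    def __init__(self):
        self.seen = set()
        self.recorded = []

    def handle(self, detail):
        if detail["event_id"] in self.seen:
            return False  # duplicate delivery caused by a retry
        self.seen.add(detail["event_id"])
        self.recorded.append(detail["id"])
        return True


consumer = AuditConsumer()
first = consumer.handle(build_detail("64b7f0c2a1", {"user": "u1"}))
retry = consumer.handle(build_detail("64b7f0c2a1", {"user": "u1"}))
print(first, retry, consumer.recorded)  # True False ['64b7f0c2a1']
```

A real consumer would keep the seen identifiers in durable storage rather than in memory, but the principle is the same: the retry produces the same identifier, so the duplicate is detectable.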
WRAP UP
Handling large-scale data modifications safely requires an architectural mindset that prioritizes decoupling and resource management. By moving from a centralized, memory-intensive batch process to a distributed SQS and Lambda pipeline, we eliminated ECS task crashes and ensured 100% compliance with the EventBridge audit requirements.
Frequently Asked Questions
Why did the ECS task run out of memory?
Containers have strict memory limits. Fetching one million database records at once instantiates massive arrays of objects in the application's heap, exceeding the allocated RAM and triggering an out-of-memory kill of the container.
Why use SQS instead of processing the records directly in the ECS task?
Direct processing ties up the ECS task for long periods. If the task fails or is interrupted by scaling policies, progress is lost. SQS provides a durable queue: if a deletion fails, the affected message becomes visible again and is retried, and the rest of the job does not have to restart.
Why limit Lambda concurrency?
Amazon DocumentDB instances have finite connection limits based on their instance class. If thousands of SQS messages trigger thousands of parallel Lambda invocations, the database will reject connections. Limiting Lambda concurrency ensures the database is queried at a safe, sustainable rate.
Can AWS Step Functions be used instead of SQS and Lambda?
Yes, Step Functions with a Map state can be used for distributed processing. However, SQS is often more cost-effective and simpler to implement for straightforward fan-out deletion patterns where complex workflow orchestration is not strictly required.
How are failures partway through processing a message handled?
By using SQS dead-letter queues (DLQs) and keeping operations idempotent. If the EventBridge call succeeds but the database deletion fails, the message returns to SQS. The next retry will resend the event and attempt the deletion again, so downstream systems must be able to handle duplicate events.