    INTRODUCTION

    While working on a core ledger modernization project for a FinTech client specializing in high-frequency peer-to-peer payments, we encountered a concurrency challenge that standard testing initially missed. The system was designed to handle thousands of concurrent transactions per second, ensuring strict ACID compliance for every fund transfer. However, during a load simulation mirroring “Black Friday” traffic volumes, we noticed a non-trivial percentage of transactions failing not due to logic errors, but due to database timeouts.

    We realized that as concurrency scaled, the database was aggressively terminating transactions to protect itself. The root cause wasn’t hardware capacity or index inefficiency; it was a fundamental architectural oversight in how resources were being locked. That challenge inspired this article, which dissects how we moved from erratic deadlocks to a stable, deterministic locking mechanism. It serves as a guide for engineering leaders looking to stabilize high-volume transactional systems.

    PROBLEM CONTEXT

    The application in question was a double-entry bookkeeping system serving a digital wallet platform. In this domain, a single “transfer” operation is actually three distinct database operations wrapped in a single transaction context:

    • Debit the Sender’s account balance.
    • Credit the Receiver’s account balance.
    • Insert a ledger entry recording the movement.

    The business requirement demanded strict consistency; money could not be created or destroyed. Consequently, we utilized pessimistic locking (`SELECT FOR UPDATE`) to ensure that no other transaction could modify an account’s balance while a transfer was in progress. Under low to moderate load, this architecture performed flawlessly. The latency was low, and data integrity was 100%.

    However, the issue surfaced when the system scaled to support a surge in user activity where localized clusters of users were transferring funds back and forth rapidly.

    WHAT WENT WRONG

    The failures appeared in the application logs as `Deadlock found when trying to get lock; try restarting transaction`. In the database monitoring tools, we observed a spike in “rolled back” transactions.

    The architectural oversight was the order in which locks were acquired. Consider two users, User A and User B, initiating transfers simultaneously:

    • Transaction 1 (A pays B): Locks Record A, then attempts to lock Record B.
    • Transaction 2 (B pays A): Locks Record B, then attempts to lock Record A.

    If these two transactions execute at the exact same millisecond, Transaction 1 holds the lock on A and waits for B. Transaction 2 holds the lock on B and waits for A. Neither can proceed. The database deadlock detector eventually steps in and kills one of the transactions to let the other proceed.
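
The cycle above can be reconstructed in miniature. Below is a hypothetical two-thread sketch using `java.util.concurrent` primitives, with `tryLock` plus a timeout standing in for the database's deadlock detector; the class and method names are illustrative, not part of the production system. A latch guarantees each thread holds its first lock before either requests its second, forcing the A-waits-for-B / B-waits-for-A cycle every time:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockCycleDemo {

    // Returns whether each "transaction" managed to acquire its second lock.
    public static boolean[] runCycle() throws InterruptedException {
        ReentrantLock lockA = new ReentrantLock();
        ReentrantLock lockB = new ReentrantLock();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        boolean[] gotSecondLock = new boolean[2];

        Thread tx1 = new Thread(() -> { // A pays B: locks A, then wants B
            lockA.lock();
            try {
                bothHoldFirstLock.countDown();
                bothHoldFirstLock.await(); // wait until tx2 also holds its first lock
                gotSecondLock[0] = lockB.tryLock(200, TimeUnit.MILLISECONDS);
                if (gotSecondLock[0]) lockB.unlock();
            } catch (InterruptedException ignored) {
            } finally {
                lockA.unlock();
            }
        });
        Thread tx2 = new Thread(() -> { // B pays A: locks B, then wants A
            lockB.lock();
            try {
                bothHoldFirstLock.countDown();
                bothHoldFirstLock.await();
                gotSecondLock[1] = lockA.tryLock(200, TimeUnit.MILLISECONDS);
                if (gotSecondLock[1]) lockA.unlock();
            } catch (InterruptedException ignored) {
            } finally {
                lockB.unlock();
            }
        });
        tx1.start(); tx2.start();
        tx1.join(); tx2.join();
        return gotSecondLock;
    }
}
```

Both `tryLock` calls time out because each thread keeps its first lock while waiting for the other's, which is exactly the circular wait the database detector resolves by killing a victim.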

    In a high-velocity environment, simply retrying the transaction (the standard advice) created a “retry storm,” further bogging down the database with failed lock-acquisition attempts. This is a classic scenario where companies realize they need to hire software developers with deep backend architectural experience rather than just feature-implementation skills.
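
For reference, the standard advice usually looks something like the hypothetical helper below: bounded attempts with jittered exponential backoff, so colliding retries spread out rather than stampede. Names and the use of `RuntimeException` as a stand-in for a deadlock-rollback error are assumptions for illustration; note this only softens a retry storm, it does not remove the underlying lock-ordering flaw:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class DeadlockRetry {

    public static <T> T withRetry(Supplier<T> transaction, int maxAttempts)
            throws InterruptedException {
        long windowMillis = 10; // initial backoff window
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transaction.get();
            } catch (RuntimeException e) { // stand-in for a deadlock-rollback error
                lastFailure = e;
                // Full jitter: sleep a random slice of the current window,
                // then double the window for the next attempt.
                Thread.sleep(ThreadLocalRandom.current().nextLong(windowMillis + 1));
                windowMillis *= 2;
            }
        }
        throw lastFailure; // exhausted all attempts
    }
}
```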

    HOW WE APPROACHED THE SOLUTION

    We gathered the engineering team to analyze the deadlock graphs provided by the database engine. We evaluated three potential solutions:

    1. Optimistic Locking:

    Instead of locking rows, we could use a version column: if the version has changed between the read and the write, the transaction fails and must be retried.

    Tradeoff: Under high contention (hot accounts), this leads to excessive retries and poor user experience.
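
A minimal in-memory sketch of the idea, with illustrative names (this is not how we implemented it, since we rejected this option): each record carries a version number, and a write commits only if the version the writer originally read is still current:

```java
// Hypothetical in-memory analogue of an optimistic-locking version column.
public class VersionedAccount {
    private long balance;
    private long version;

    public VersionedAccount(long openingBalance) {
        this.balance = openingBalance;
    }

    public synchronized long getBalance() { return balance; }
    public synchronized long getVersion() { return version; }

    // Succeeds only if nothing committed since the caller read expectedVersion;
    // a false return means the caller must re-read and retry.
    public synchronized boolean tryUpdateBalance(long expectedVersion, long newBalance) {
        if (version != expectedVersion) {
            return false; // stale read: another writer got there first
        }
        balance = newBalance;
        version++;
        return true;
    }
}
```

On a hot account, many concurrent writers all read the same version, one wins, and the rest loop back to re-read, which is exactly the excessive-retry tradeoff noted above.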

    2. Queue-Based Serialization:

    Push all transfers into a single queue and process them sequentially.

    Tradeoff: This destroys scalability. The throughput is limited by the processing speed of a single consumer.
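
As a rough illustration of this option (again with hypothetical names, not our chosen design), a single-threaded executor can play the role of the queue plus its lone consumer: every transfer is applied strictly one at a time, so no locks and no deadlocks, but throughput is capped by that one consumer:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of queue-based serialization of transfers.
public class SerializedLedger {
    private final Map<Long, Long> balances = new HashMap<>();
    private final ExecutorService consumer = Executors.newSingleThreadExecutor();

    public void open(long accountId, long openingBalance) {
        balances.put(accountId, openingBalance);
    }

    // Enqueue the transfer; the single consumer thread applies it in order.
    public Future<?> transfer(long sourceId, long targetId, long amount) {
        return consumer.submit(() -> {
            balances.merge(sourceId, -amount, Long::sum);
            balances.merge(targetId, amount, Long::sum);
        });
    }

    public long balanceOf(long accountId) {
        return balances.get(accountId);
    }

    public void shutdown() {
        consumer.shutdown();
    }
}
```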

    3. Deterministic Resource Ordering:

    Enforce a rule where locks are always acquired in a specific mathematical order, regardless of the transaction direction.

    Decision: We chose this approach. It maintains parallelism while mathematically guaranteeing that circular dependencies (deadlocks) cannot occur.
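
The guarantee can be demonstrated with a hypothetical in-memory analogue (illustrative names; the actual database-backed fix follows in the next section): both directions of a transfer acquire the same two locks in id-sorted order, so the circular wait required for a deadlock can never form:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-memory analogue of deterministic resource ordering.
public class OrderedLockLedger {

    public static class Account {
        final long id;
        long balance;
        final ReentrantLock lock = new ReentrantLock();

        public Account(long id, long openingBalance) {
            this.id = id;
            this.balance = openingBalance;
        }
    }

    public static void transfer(Account source, Account target, long amount) {
        // Canonical ordering: the lower id is always locked first,
        // regardless of which side is paying.
        Account first = source.id < target.id ? source : target;
        Account second = (first == source) ? target : source;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                source.balance -= amount;
                target.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }
}
```

Two threads hammering transfers in opposite directions never deadlock, and the total money in the system is conserved, which is precisely the property the ledger needs.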

    FINAL IMPLEMENTATION

    The fix involved refactoring the service layer to implement a canonical sorting strategy before interacting with the database repository. We mandated that whenever a transaction involves multiple resources (accounts), the IDs of those resources must be sorted, and locks must be acquired in that sorted order.

    Here is a sanitized logic representation of the fix:

    // Generic representation of the Transfer Service Logic
    public void executeTransfer(Long sourceId, Long targetId, BigDecimal amount) {
        // Guard: a self-transfer would load the same row twice, and the
        // second save would silently overwrite the first.
        if (sourceId.equals(targetId)) {
            throw new IllegalArgumentException("Source and target accounts must differ");
        }
        Long firstLockId;
        Long secondLockId;
        // DETERMINISTIC SORTING
        // Always lock the smaller ID first, then the larger ID.
        if (sourceId < targetId) {
            firstLockId = sourceId;
            secondLockId = targetId;
        } else {
            firstLockId = targetId;
            secondLockId = sourceId;
        }
        transactionManager.executeInTransaction(() -> {
            // Acquire locks in strict order
            Account first = accountRepo.findByIdAndLock(firstLockId);
            Account second = accountRepo.findByIdAndLock(secondLockId);
            // Perform business logic (Debit/Credit)
            // Note: We must identify which account is source/target 
            // regardless of locking order.
            if (first.getId().equals(sourceId)) {
                first.debit(amount);
                second.credit(amount);
            } else {
                second.debit(amount);
                first.credit(amount);
            }
            accountRepo.save(first);
            accountRepo.save(second);
        });
    }
    

    Validation:

    We redeployed the service and re-ran the “Black Friday” load test. The deadlock exceptions dropped to zero. While individual transaction latency increased slightly (microseconds) due to the wait times for locks on hot accounts, the overall system throughput stabilized because we eliminated the rollback-and-retry overhead.

    This implementation proved critical for the client’s stability. When you hire backend developers for financial systems, ensuring they understand concurrency patterns like this is non-negotiable.

    LESSONS FOR ENGINEERING TEAMS

    Reflecting on this implementation, here are the key takeaways for technical leaders:

    • Database constraints are not enough: Foreign keys ensure referential integrity, but application-level logic is required to ensure transactional consistency in multi-row updates.
    • Reproducibility requires load: Concurrency bugs rarely show up in unit tests or local dev environments. You must test with high concurrency simulation.
    • Deadlocks are usually architectural: If you see deadlocks, do not just increase timeout thresholds. Analyze the lock acquisition order.
    • Deterministic ordering is powerful: Simple sorting of resource IDs is a lightweight, robust way to prevent circular dependencies in distributed systems.
    • Expertise matters: Complex locking strategies require engineers who understand database internals. If your team lacks this depth, it may be time to hire database architects for high-concurrency apps to audit your core transaction paths.

    WRAP UP

    Handling high-concurrency transactions requires looking beyond the code and understanding how the database engine manages resources. By switching to a deterministic locking strategy, we eliminated deadlocks and ensured the reliability of a critical financial ledger.

    Social Hashtags

    #DatabaseDeadlocks #HighConcurrency #FinTechEngineering #BackendArchitecture #ACIDTransactions #DistributedSystems #LedgerSystems #DatabaseDesign #ScalableSystems #EngineeringLeadership

    If you are facing stability issues in your high-scale applications, contact us to discuss how our dedicated engineering teams can help.

    Frequently Asked Questions