Table of Contents

    Book an Appointment

    INTRODUCTION

    While working on a high-throughput FinTech application, our engineering team was tasked with building a real-time market data engine. The system was designed to process and store candlestick (K-line) financial data—capturing open, high, low, close, and volume metrics—across millions of daily transactions. We utilized a distributed time-series database with a keyed stream table architecture to handle real-time ingestion and automatic deduplication.

    However, during a routine data validation phase, we realized something was wrong. Duplicate records were surfacing in our queries. A keyed stream table is explicitly designed to prevent this by enforcing uniqueness on a composite primary key. In our case, the composite key was based on the financial instrument’s symbol and its unixTime.

    Duplicate data in a financial system is a critical flaw that can heavily skew analytics, trigger false algorithmic trading signals, and compromise system integrity. We needed to identify why the database’s deduplication mechanism was failing despite the data flowing from highly controlled Python environments. This real-world challenge inspired this article, aiming to help other teams avoid similar oversights when handling time-series data at scale.

    PROBLEM CONTEXT

    The architecture of this market data engine relied heavily on Python for data ingestion. To handle both historical data backfills and live market feeds, we implemented two separate write paths into our time-series database:

    • Batch Ingestion Path: A historical data loader that pushed full batches of data via pandas DataFrames.
    • Real-Time Streaming Path: A low-latency, row-by-row writer that inserted individual candlestick bars directly from a live market websocket.

    Within the database, we configured a shared, persistent keyed stream table. The schema was defined strictly, with the composite key mapped to symbol (a string/symbol type) and unixTime (a LONG integer). The expectation was simple: if a historical batch insert and a live real-time insert both wrote a record for TICKER_A at the exact same unixTime, the database would naturally deduplicate them, keeping only the latest update.

    WHAT WENT WRONG

    Despite our schema definitions, queries consistently returned double the expected rows for overlapping time windows. Our initial assumption was a misconfiguration in the table’s cache or persistence settings. We experimented by toggling the database’s internal memory sharing and persistence flags, and even tested using a combined date and time composite key instead of unixTime. The duplicates persisted.

    When you hire software developer teams to build scalable architectures, diagnosing such anomalies requires a deep dive into the boundary where the application meets the database. We logged the payloads on the Python side before insertion. Both write paths were successfully outputting what appeared to be matching symbol and unixTime values, and both were strictly typed as LONG integers in the database.

    To the naked eye, the logs from the real-time pipeline and the historical batch pipeline looked correct. The data types matched perfectly. Yet, the deduplication engine simply refused to kick in.

    HOW WE APPROACHED THE SOLUTION

    To isolate the root cause, we built a deterministic, reproducible test environment. We seeded a pseudo-random number generator and pushed exactly five rows through the batch write path, followed by the “exact same” five rows through the row-by-row write path.

    When we queried the total row count, the result was 10. We then ran a side-by-side SQL query to inspect the raw numeric values of the unixTime column without any date-time formatting applied by the database client.

    The underlying issue instantly revealed itself: a timestamp precision mismatch.

    In our batch path, the pandas DataFrame logic preserved millisecond precision. A timestamp for a specific minute looked like 1746000000000 (13 digits). Conversely, the row-by-row ingestion path utilized standard Python datetime conversions—specifically int(bar.datetime.timestamp()). This standard library function outputs seconds, resulting in a timestamp like 1746000000 (10 digits).

    Because both 10-digit and 13-digit numbers fit perfectly within a LONG integer data type, the database accepted both without throwing a schema error. However, from the database’s perspective, 1746000000000 is mathematically distinct from 1746000000. Thus, it treated them as two entirely different moments in time, effectively bypassing our composite key deduplication.

    FINAL IMPLEMENTATION

    The fix required strict enforcement of temporal precision across all ingestion boundaries. We standardized the entire system on millisecond precision.

    1. Correcting the Python Ingestion Logic

    We updated the row-by-row ingestion script to explicitly calculate and cast the timestamp into milliseconds:

    # Corrected row-by-row write logic with millisecond precision
    self.writer.insert(
        bar.symbol,
        exchange_map.get(bar.exchange, ''),
        day,
        bar.datetime.date(),
        _time,
        round(bar.open_price, 2),
        round(bar.high_price, 2),
        round(bar.low_price, 2),
        round(bar.close_price, 2),
        int(bar.volume),
        bar.amount,
        # Standardizing to 13-digit millisecond precision
        int(bar.datetime.timestamp() * 1000) 
    )
    

    2. Validating the Fix in the Database

    We then ran our validation suite. By explicitly inserting records with matching 13-digit timestamps, we confirmed the keyed stream table behaved as designed:

    // Verification: consistent precision works correctly
    unixTime_test = 1746000060000L
    // Insert the same simulated bar twice
    insert into market_candle_stream values(`TICKER_A, `EXCHANGE_X, 2026.05.18, 2026.05.18, 10:01:00, 101.0, 102.0, 99.5, 101.5, 8500, 862750.0, unixTime_test)
    insert into market_candle_stream values(`TICKER_A, `EXCHANGE_X, 2026.05.18, 2026.05.18, 10:01:00, 101.0, 102.0, 99.5, 101.5, 8500, 862750.0, unixTime_test)
    // Query the count for that specific timestamp
    select count(*) from market_candle_stream where unixTime = 1746000060000L
    // Expected: 1 row (Deduplication successful)

    This simple mathematical normalization instantly resolved the duplication bottleneck. Memory usage stabilized, and our downstream algorithmic consumers received clean, deduplicated market states.

    LESSONS FOR ENGINEERING TEAMS

    If you plan to hire python developers for scalable data systems, it is vital to evaluate their understanding of data boundaries. Here are actionable insights engineering teams should apply:

    • Schema Types Do Not Guarantee Semantic Consistency: A LONG integer is just a bucket. It does not enforce whether the data inside represents seconds, milliseconds, or nanoseconds. Always document and enforce the semantic meaning of your primitives.
    • Standardize Time at the Edges: Do not rely on downstream systems to figure out time precision. Normalize all datetime objects into a strict UTC millisecond or nanosecond standard before the data leaves the application layer.
    • Beware of Framework Discrepancies: Pandas handles time differently than standard Python datetime, which handles it differently than numpy. When combining multiple tools, verify the output formats at the integration points.
    • Implement Idempotency Tests: Write automated tests that intentionally send duplicate records through different API paths. If your database size increases, your deduplication logic is flawed.
    • Log Raw Values During Debugging: Formatted dates in a database GUI can mask underlying integer differences. Always query the raw underlying metric (e.g., strFormat(unixTime)) when diagnosing time-series anomalies.

    WRAP UP

    Data duplication in stream processing is rarely the fault of the database engine itself; it is almost always an artifact of mismatched serialization across distributed application boundaries. By simply aligning our historical and real-time pipelines to utilize identical 13-digit millisecond Unix timestamps, we restored the integrity of our time-series architecture.

    Engineering complex, high-throughput systems requires meticulous attention to detail. Companies looking to hire data engineers for time-series platforms or hire dedicated remote backend developers with deep problem-solving maturity trust teams that can look beyond the surface level of a bug. If you need a reliable technology partner to scale your data architecture, contact us to explore how our experienced engineering teams can secure your platform.

    Social Hashtags

    #TimeSeriesDatabase #FinTechEngineering #PythonDevelopers #DataEngineering #RealTimeData #BigData #BackendDevelopment #StreamingData #DatabaseOptimization #AlgorithmicTrading #DataArchitecture #DistributedSystems #TechLeadership #SoftwareEngineering

     

    Frequently Asked Questions

    Success Stories That Inspire

    See how our team takes complex business challenges and turns them into powerful, scalable digital solutions. From custom software and web applications to automation, integrations, and cloud-ready systems, each project reflects our commitment to innovation, performance, and long-term value.