    INTRODUCTION

    While working on a large-scale data engineering and NLP project for a global regulatory organization, our team was tasked with processing decades of archived plenary session transcripts. The goal was to extract conversational data into a structured format for quantitative text analysis. Specifically, we needed to build a “speaker turn” dataset—identifying who was speaking and capturing everything they said before the next speaker took the floor.

    The scale was substantial: hundreds of documents spanning over sixty years, entirely in PDF format. On the surface, it seemed like a standard text extraction task. However, we quickly encountered a situation where the underlying OCR (Optical Character Recognition) completely broke our text parsing logic. Because the documents were originally printed in a two-column layout, the OCR engine read the text horizontally across the entire page rather than vertically down the left column and then the right.

    In production NLP pipelines, accurate text sequentiality is paramount. If sentences from the left column are spliced directly into sentences from the right column, the resulting text data is unintelligible, rendering downstream AI and sentiment analysis useless. This challenge inspired this article, demonstrating why naive string manipulation fails on complex PDF layouts and how treating documents as spatial geometries can salvage legacy data pipelines.

    PROBLEM CONTEXT

    The business use case required tracking specific policy statements made by regional delegates over time. The architecture relied on an automated ingestion pipeline that would scan the historical PDFs, isolate the text, identify known speaker names using regex, and generate a structured dataset mapping the speaker to their respective text blocks.

    We began our initial prototyping in R, utilizing standard text extraction libraries. Our speaker registry contained known historical delegates, and we needed to match their names within the text stream. The desired architectural flow was:

    • Ingest the PDF.
    • Extract raw text page by page.
    • Clean and normalize spacing.
    • Run regex matching against the speaker registry to define start and end indices for speaker turns.
    • Compile the results into a relational dataframe.
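    As a sketch, the regex matching step in this flow might look like the following (the speaker names and transcript are illustrative, not the project's actual registry):

```r
library(stringr)

# Hypothetical speaker registry (illustrative, regex-escaped names)
speakers <- c("MR\\. SMITH", "MS\\. JONES")

# Locate each speaker label; a turn runs from one label to the next
build_speaker_turns <- function(text, speakers) {
  pattern <- paste0("(", paste(speakers, collapse = "|"), "):")
  hits <- str_locate_all(text, pattern)[[1]]
  starts <- hits[, "start"]
  ends <- c(starts[-1] - 1, nchar(text))
  data.frame(
    speaker = str_remove(str_sub(text, hits[, "start"], hits[, "end"]), ":"),
    text = str_trim(str_sub(text, hits[, "end"] + 1, ends)),
    stringsAsFactors = FALSE
  )
}

transcript <- "MR. SMITH: We support the proposal. MS. JONES: We object."
turns <- build_speaker_turns(transcript, speakers)
```

    This only works, of course, if the text between two labels is actually what the first speaker said — which is exactly the assumption the scrambled OCR output violated.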

    However, when we inspected the extracted text strings, the context was hopelessly scrambled.

    WHAT WENT WRONG

    The primary symptom of the failure became obvious during manual log inspection. The extracted text strings contained mid-sentence interruptions that made no syntactic sense. When we opened the source PDFs and manually dragged our cursors down the left column to highlight the text, the highlight bled directly across the center margin into the right column.

    The OCR layer had no concept of the visual layout. It grouped characters by their Y-coordinates and read each row from left to right across the full width of the page, as if the two columns were a single continuous block of text.
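    To make the failure mode concrete, here is a toy reproduction in R (the column text is invented):

```r
# Two columns as they appear on the printed page, line by line
left_col  <- c("The delegate said", "the motion passed.")
right_col <- c("A second speaker", "then objected.")

# What the legacy OCR produced: each row read straight across both columns
horizontal_read <- paste(paste(left_col, right_col), collapse = " ")
# "The delegate said A second speaker the motion passed. then objected."

# What the pipeline needed: the left column in full, then the right column
vertical_read <- paste(c(left_col, right_col), collapse = " ")
# "The delegate said the motion passed. A second speaker then objected."
```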

    Our initial attempt to fix this programmatically was to rely on character counts. We wrote a function to split each extracted line of text in half based on the maximum character width of the page:

    # Initial flawed approach: splitting strings by character width
    library(stringr)

    read_two_columns_naive <- function(page_text) {
      lines <- str_split(page_text, "\n")[[1]]
      max_width <- max(nchar(lines))
      mid <- ceiling(max_width / 2)
      
      # Assume the column boundary sits at the midpoint of the widest line
      left_lines <- str_sub(lines, 1, mid)
      right_lines <- str_sub(lines, mid + 1, nchar(lines))
      
      # Reassemble: left column in full, then right column
      paste(c(left_lines, right_lines), collapse = " ")
    }
    

    This approach failed spectacularly in production. Because the text used proportional fonts, and because the OCR spacing was inconsistent (often introducing irregular whitespace or misaligned text boxes), splitting the string exactly at the mathematical midpoint regularly chopped words in half or missed the column margin entirely. We realized that string manipulation could not fix a geometry problem.

    HOW WE APPROACHED THE SOLUTION

    To ensure high-fidelity data extraction, we had to stop treating the PDF as a pure text stream and start treating it as a spatial image. If the OCR layer was corrupted by horizontal reading, the only reliable way to prevent horizontal bleed was to physically separate the right side of the page from the left side before the text was processed.

    We evaluated a few tradeoffs:

    • Attempting to reconstruct reading order via bounding boxes: We could use lower-level PDF tools to extract the exact X/Y coordinates of every word and write complex clustering algorithms to separate the columns. This is highly accurate but computationally expensive and difficult to maintain.
    • Image-based splitting and re-OCR: We could convert the PDF pages into high-resolution images, crop the images down the mathematical center, and run a fresh OCR process specifically on the left image, followed by the right image.
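    For comparison, the coordinate-based alternative can be sketched against the word-level output that pdftools::pdf_data() returns (a data frame with one row per word, including its x/y position). The toy words below stand in for that output; real pages would also need a detected column boundary rather than a fixed midpoint:

```r
# Reorder words from a two-column page using their x/y coordinates.
# `words` mimics one element of pdftools::pdf_data(): columns text, x, y.
reorder_two_columns <- function(words, page_width) {
  midpoint <- page_width / 2
  words$column <- ifelse(words$x < midpoint, 1L, 2L)
  # Read column 1 top-to-bottom (and left-to-right within a row), then column 2
  ordered <- words[order(words$column, words$y, words$x), ]
  paste(ordered$text, collapse = " ")
}

# Toy page of width 600: two lines on the left, one line on the right
words <- data.frame(
  text = c("left", "right", "one", "left", "two", "one"),
  x = c(10, 310, 60, 10, 55, 360),
  y = c(20, 20, 20, 40, 40, 20),
  stringsAsFactors = FALSE
)
reorder_two_columns(words, page_width = 600)
# "left one left two right one"
```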

    Given the poor quality of the original OCR and the requirement for automated batch processing across hundreds of pages, we opted for the image-cropping approach. By rendering the PDF to an image and splitting the canvas, we created a physical boundary that no OCR engine could cross. When companies hire software developer teams to build document processing systems, establishing robust, failure-proof data boundaries like this is a hallmark of mature engineering.

    FINAL IMPLEMENTATION

    We implemented the solution using a combination of image processing and modern OCR libraries. We discarded the legacy OCR layer embedded in the PDFs and built a script that iterates through the document, renders the page layout, slices the image in half, and extracts the text in strict sequential order.

    # Required libraries for spatial processing
    pacman::p_load(tidyverse, magick, tesseract, pdftools)

    # Function to physically split and extract text
    process_two_column_page <- function(pdf_path, page_num, dpi = 300) {
      # Render the specific page as a high-res image
      page_img <- image_read_pdf(pdf_path, pages = page_num, density = dpi)
      
      # Get image dimensions
      img_info <- image_info(page_img)
      width <- img_info$width
      height <- img_info$height
      midpoint <- floor(width / 2)
      
      # Define cropping geometries: Width x Height + X_offset + Y_offset
      left_geo <- paste0(midpoint, "x", height, "+0+0")
      right_geo <- paste0(width - midpoint, "x", height, "+", midpoint, "+0")
      
      # Crop the images physically
      img_left <- image_crop(page_img, left_geo)
      img_right <- image_crop(page_img, right_geo)
      
      # Run fresh OCR on separated columns
      eng <- tesseract("eng")
      text_left <- ocr(img_left, engine = eng)
      text_right <- ocr(img_right, engine = eng)
      
      # Combine text sequentially
      combined_text <- paste(text_left, text_right, sep = "\n")
      return(str_squish(combined_text))
    }

    # Example execution across a batch of pages
    extract_all_pages <- function(pdf_path, start_page, end_page) {
      map_chr(start_page:end_page, ~process_two_column_page(pdf_path, .x)) %>%
        paste(collapse = " ")
    }
    

    Once the clean, correctly sequenced text was generated, our regex-based speaker turn identification logic worked perfectly. The pipeline mapped out exactly where SPEAKER_A began and ended their remarks, seamlessly transitioning to SPEAKER_B.

    While this R-based prototype solved the logic flaw efficiently, scaling to thousands of books or millions of documents in parallel requires distributed computing. In such scenarios, enterprise teams often hire Python developers to rewrite these workflows in PySpark or orchestrate them with Airflow, using Python libraries such as pdfplumber or unstructured.

    LESSONS FOR ENGINEERING TEAMS

    Tackling legacy unstructured data often reveals hidden complexities. Here are the core insights from this architectural shift:

    • Don’t fix spatial problems with string methods: If text is spatially disorganized, character-splitting and regex will always be brittle. You must enforce physical or coordinate-based boundaries.
    • Never trust legacy OCR: Historical PDFs often contain invisible layout errors. Visually inspect how the text layer is mapped before designing your ingestion pipeline.
    • Use rendering as a sanitization step: When document layouts are overly complex, rasterizing them into images and applying computer vision (or fresh OCR) guarantees layout control.
    • Plan for scaling early: Image processing is computationally heavier than text extraction. Ensure your architecture can handle asynchronous batch processing.
    • Anticipate NLP model requirements: LLMs and sentiment models require contextually accurate phrasing. Feeding horizontally bled text to an AI model invites hallucinated or nonsensical output. If you intend to hire AI developers for production deployment later, your data engineering pipelines must supply clean, sequential tokens.

    WRAP UP

    By stepping back from standard text manipulation and analyzing the physical geometry of the documents, we successfully decoupled the horizontal bleeding that plagued the legacy OCR layer. Implementing a visual cropping pipeline ensured accurate, scalable speaker turn extraction, unblocking the downstream quantitative text analysis.

    Solving edge cases in unstructured data extraction requires experience, strategic problem-solving, and robust engineering practices. If you need to scale your data processing pipelines, build resilient automation workflows, or expand your engineering capacity with dedicated remote experts, contact us.
