    INTRODUCTION

    While working on a semantic search optimization project for a large-scale e-commerce retailer targeting the Italian market, we encountered a subtle but critical failure in our Natural Language Processing (NLP) pipeline. The system was designed to parse millions of product titles to extract attributes, relationships, and categorization data.

    Italian product titles are notoriously difficult to parse due to their rich morphology and the “keyword-stuffing” nature of e-commerce listings. During a routine quality assurance audit, we noticed that while our Part-of-Speech (POS) tagger was performing well—especially after we injected manual overrides for domain-specific jargon—the Dependency Parser was still generating incorrect syntax trees. It was effectively ignoring our corrections.

    We realized that our architecture was treating these components as independent silos. The parser was predicting relationships based on raw token vectors, completely bypassing the “gold” POS tags we had carefully injected. This disconnect threatened the accuracy of the search engine, as incorrect dependencies lead to wrong attribute mapping (e.g., associating a color with the wrong noun). This article outlines how we diagnosed the architectural limitation in spaCy v3 and the configuration changes we implemented to solve it.

    PROBLEM CONTEXT

    The client’s platform relied on an NLP engine to structure unstructured text. For example, a title like “Tavolo cucina legno massello rovere antico” (Kitchen table solid wood antique oak) contains a chain of nouns and adjectives where the dependency structure dictates which adjective modifies which noun.

    To improve accuracy, we implemented a “human-in-the-loop” mechanism where domain experts could define rules to override POS tags. If the system tagged “rosa” (pink/rose) as a noun when it should be an adjective in a specific context, we forced the tag to ADJ.

    The expectation was simple: if we tell the pipeline that a token is an ADJ, the Dependency Parser should treat it as a modifier. However, in production, the parser continued to attach the token as if it were a NOUN, resulting in a fractured parse tree. This rendered our manual corrections useless for downstream logic that relied on the dependency graph.
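The override mechanism itself is straightforward and can be sketched as a rule table; the names and rules below are illustrative stand-ins, not our production code.

```python
# Minimal sketch of the rule-based POS override table. Each rule maps a
# lowercased token plus a coarse context to a forced POS tag (illustrative).
POS_OVERRIDES = {
    # "rosa" is the adjective "pink" when it follows a noun in a product title
    ("rosa", "after_noun"): "ADJ",
}

def resolve_pos(token_text, predicted_pos, prev_pos):
    """Return the forced POS tag if an override rule matches, else the model's prediction."""
    context = "after_noun" if prev_pos == "NOUN" else "default"
    return POS_OVERRIDES.get((token_text.lower(), context), predicted_pos)

resolve_pos("rosa", "NOUN", "NOUN")  # → "ADJ": the override fires
resolve_pos("rosa", "NOUN", None)    # → "NOUN": no matching rule, keep the prediction
```

The problem described below is exactly that the parser never consumed the output of logic like this.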

    WHAT WENT WRONG

    The issue lay in the default architectural design of spaCy v3 pipelines. In modern transformer-based or vector-based NLP pipelines, components often run in parallel or share a common embedding layer (Tok2Vec) without necessarily feeding into each other sequentially during inference.

    Our investigation revealed the following root causes:

    • Component Independence: The parser component was generating its predictions based on the tok2vec embeddings, not the output of the tagger.
    • Inference Isolation: When we manually set doc[i].pos_ = "ADJ" in a custom component before the parser ran, the parser simply didn’t look at that attribute. It looked at the underlying vector representation of the token, which hadn’t changed.
    • Missing Features: The model configuration for the parser had not been trained to include POS tags as input features. It was trained to predict dependencies directly from text embeddings.

    Essentially, we were changing the label on the outside of the box, but the parser was still judging the contents based on an X-ray of the box.
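The failure mode can be reduced to a toy example (purely illustrative, not spaCy internals): a "parser" whose decision depends only on the learned vector is unaffected by any change to the surface attribute.

```python
class Token:
    """Minimal stand-in for a pipeline token: a surface label plus a learned vector."""
    def __init__(self, text, pos, vector):
        self.text, self.pos, self.vector = text, pos, vector

def attaches_as_modifier(token):
    # The decision is driven entirely by the learned vector,
    # never by the surface `pos` attribute.
    return token.vector[0] > 0.5

tok = Token("rosa", "NOUN", vector=[0.1])
tok.pos = "ADJ"              # changing the label on the outside of the box
attaches_as_modifier(tok)    # still False: the vector (the "X-ray") never changed
```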

    HOW WE APPROACHED THE SOLUTION

    We needed a strategy to make the Dependency Parser “POS-aware.” We evaluated three potential approaches:

    1. Constraint-Based Parsing: Attempting to force dependency links using hard-coded constraints. We discarded this because it is brittle and doesn’t scale well across millions of varied titles.
    2. Sequential Component Dependency: Reverting to an older architecture where the parser explicitly waits for the tagger. This creates a bottleneck and doesn’t inherently solve the feature usage issue.
    3. Feature Engineering & Retraining: The robust solution. We needed to reconfigure the parser’s neural network to accept POS tags as an explicit input feature (embedding) alongside the token vectors.

    We chose the third approach. This required modifying the training configuration to ensure that the parser learns to rely on the POS tag information. This way, if we inject a “gold” POS tag at inference time, the parser incorporates that signal into its decision-making process.

    FINAL IMPLEMENTATION

    To implement this, we had to adjust the config.cfg file used for training the spaCy pipeline. The goal was to include POS tags as extra features for the parser model.

    Here is the conceptual breakdown of the configuration change. We switched from a standard listener architecture to one that explicitly concatenates POS embeddings.

    1. Configuration Adjustment

    In the standard config, the parser model looks only at the shared tok2vec output. Whether extra features can be injected depends on the specific architecture (e.g., the transition-based parser).

    The most direct way in spaCy v3 to enforce this dependency is to make sure the parser is trained with the POS tags available. When using a transformer, the contextual vectors tend to dominate; for our non-transformer, efficiency-focused model, we used a configuration that fed the tagger’s output into the parser.

    # config.cfg (Conceptual Snippet)
    [components.parser]
    factory = "parser"

    [components.parser.model]
    @architectures = "spacy.TransitionBasedParser.v2"
    state_type = "parser"
    nO = null

    # The parser listens to the shared embedding layer; widths must match.
    [components.parser.model.tok2vec]
    @architectures = "spacy.Tok2VecListener.v1"
    width = ${components.tok2vec.model.encode.width}

    # CRITICAL: the listener alone does not make the parser 'see' the tags.
    # In custom architectures, we might concatenate tag embeddings into this input.
    # For standard pipelines, we ensured the training data included gold POS tags
    # and that the parser was trained strictly *after* the tagger in the pipeline order.
    

    2. The “Pre-tagged” Training Strategy

    The configuration change alone wasn’t enough. We also had to change how we trained the parser, verifying that it was trained on data where the POS tags were treated as “gold” features. We trained from a --paths.train corpus that included correct POS tags and ensured the parser component was configured to use predicted tags when available.

    However, to strictly enforce the “Gold” tag usage at inference, we implemented a custom component that runs before the parser.

    # custom_component.py
    from spacy.language import Language

    # Domain lexicon of tokens that must be treated as adjectives (illustrative)
    SPECIALIZED_ADJECTIVES = {"rosa", "antico", "massello"}

    @Language.component("gold_pos_injector")
    def gold_pos_injector(doc):
        # Look up gold tags from a dictionary or rules engine.
        # Example: override 'rosa' (pink) to ADJ in this domain.
        for token in doc:
            if token.text.lower() in SPECIALIZED_ADJECTIVES:
                token.pos_ = "ADJ"
                token.tag_ = "ADJ"  # also set the fine-grained tag (use the model's own tagset, e.g. "A" for UD Italian-ISDT)
        return doc

    # Pipeline assembly
    # nlp.add_pipe("gold_pos_injector", before="parser")
    

    3. Retraining with Dependency

    Ideally, for the parser to respect these tags, it must be trained in an environment where the POS tags are accurate and predictive. We retrained the parser while providing the “Gold” POS tags as features. In spaCy, this is often implicit if the components are sequential, but for stronger enforcement, we verified that the internal features of the parser model included tag embeddings.

    Note: If standard spaCy architectures resist this coupling, we sometimes utilize custom architectures where the input tensor to the parser is concat(tok2vec, pos_embedding).
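A minimal sketch of that concatenated input, using plain Python lists in place of the real Thinc tensors (the tag inventory below is truncated for illustration):

```python
POS_TAGS = ["NOUN", "ADJ", "VERB", "ADP", "DET"]  # truncated tag inventory

def pos_one_hot(pos):
    """One-hot stand-in for a learned POS embedding."""
    return [1.0 if pos == tag else 0.0 for tag in POS_TAGS]

def parser_input(tok2vec_row, pos):
    # concat(tok2vec, pos_embedding): the POS signal is now part of what
    # the parser model actually sees, so an injected gold tag changes the input.
    return tok2vec_row + pos_one_hot(pos)

parser_input([0.2, -0.1], "ADJ")  # → [0.2, -0.1, 0.0, 1.0, 0.0, 0.0, 0.0]
```

In the real model the one-hot row is replaced by a trained embedding, but the effect is the same: flipping the tag flips the parser's input.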

    LESSONS FOR ENGINEERING TEAMS

    This experience highlighted several key takeaways for engineering teams tackling advanced NLP tasks:

    • Don’t Assume Connectivity: Just because components reside in the same pipeline doesn’t mean they share data during inference. Validate the flow of information.
    • Inference vs. Training Logic: A model might learn a correlation during training but fail to utilize an overridden attribute during inference if the architecture relies on deep vectors rather than surface attributes.
    • Ambiguity requires Hierarchy: In languages with high morphological ambiguity like Italian, hierarchical processing (fix POS -> then fix Dependency) is mandatory. You cannot solve both simultaneously with high accuracy on limited data.
    • Configuration over Code: In frameworks like spaCy v3, the solution often lies in the config.cfg and feature definition rather than writing more Python code.
    • Validation Metrics: Always test your “override” logic. We created a unit test suite specifically for the “Gold” tags to ensure the dependency tree shifted as expected when the tag was changed.
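The validation idea in the last point can be shaped as a small pytest-style check. The override function and names here are illustrative stand-ins; the real suite runs the full pipeline and compares dependency heads before and after injection.

```python
OVERRIDES = {"rosa": "ADJ"}  # illustrative rules engine

def apply_overrides(tagged_tokens):
    """Apply gold-tag overrides to (text, pos) pairs."""
    return [(text, OVERRIDES.get(text.lower(), pos)) for text, pos in tagged_tokens]

def test_gold_tag_overrides_prediction():
    # "maglia rosa" (pink shirt): the tagger wrongly calls 'rosa' a NOUN
    tagged = apply_overrides([("maglia", "NOUN"), ("rosa", "NOUN")])
    assert tagged[1] == ("rosa", "ADJ")
```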

    WRAP UP

    By understanding the disconnect between the Tagger and Parser in vector-based NLP pipelines, we were able to re-architect our solution to respect business rules. This ensured that our Italian product catalog was parsed accurately, driving better search relevance and user experience.


    Frequently Asked Questions