INTRODUCTION
During a recent engagement for a large-scale telecommunications provider, our team was tasked with modernizing a customer support chatbot. The goal was to transition from a rigid decision-tree bot to a flexible, GenAI-powered assistant capable of handling complex queries across thousands of support topics. The system was built on a standard RAG (Retrieval-Augmented Generation) architecture, utilizing a vector database to retrieve technical documentation and policies.
However, during user acceptance testing, we uncovered a significant flaw in the user experience. Users rarely speak in precise technical terms. A user would ask, “How do I get a SIM?” while the knowledge base contained distinct, conflicting procedures for “Physical SIMs,” “eSIMs,” and “Data-only SIMs.”
Because the query was broad, the vector search retrieved a mix of all three documents. The LLM, trying to be helpful, often blended these instructions into a confusing, incorrect answer. We realized we couldn’t just improve the search; we needed a mechanism to detect when a query was too vague and force the bot to ask for clarification before attempting a search.
This challenge—balancing conversational fluidity with technical precision—inspired this article. Here is how we designed a generic, scalable ambiguity detection layer.
PROBLEM CONTEXT
The core business requirement was to automate Level 1 support for a massive knowledge base. The system needed to interpret natural language and fetch specific troubleshooting steps or provisioning protocols. We used Python as the orchestration layer, integrating with fast, low-latency LLMs like Gemini Flash Lite.
The architecture followed a standard pattern:
1. User submits a query.
2. The system embeds the query.
3. A vector search retrieves the top ‘k’ chunks.
4. An LLM synthesizes the answer.
The issue surfaced in step 3. In high-density knowledge domains, a vague query like “connection issue” is semantically close to hundreds of documents: fiber optics, 5G mobile data, home router setups, and VPN configurations. By passing ambiguous queries directly to the RAG pipeline, we were polluting the context window with irrelevant data, leading to answers that were technically correct but contextually wrong.
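To make the failure mode concrete, here is a toy illustration (not the production system): a trivial bag-of-words retriever over four invented "documents." The document texts, the Jaccard scoring, and the `top_k` helper are all simplifications for demonstration, but they show the same effect we saw with real embeddings: a vague query sits almost equidistant from every topic, so the top-k results mix all of them.

```python
# Toy corpus: four support topics that all mention "connection issue".
docs = {
    "fiber":  "troubleshoot fiber optic connection issue no signal on ont",
    "5g":     "troubleshoot 5g mobile data connection issue apn settings",
    "router": "troubleshoot home router connection issue restart wifi",
    "vpn":    "troubleshoot vpn connection issue certificate profile",
}

def jaccard(a: str, b: str) -> float:
    # Word-overlap similarity, a crude stand-in for embedding distance.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def top_k(query: str, k: int = 3):
    ranked = sorted(docs, key=lambda d: jaccard(query, docs[d]), reverse=True)
    return ranked[:k]

# A vague query scores similarly against every topic, so top-k mixes them:
print(top_k("connection issue"))
# A specific query separates cleanly:
print(top_k("5g mobile data apn settings", k=1))  # -> ['5g']
```

With real vector search the scores are cosine similarities over dense embeddings, but the ranking behavior for broad queries is analogous.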
WHAT WENT WRONG
Initially, we attempted to solve this via “prompt engineering” at the final generation stage. We instructed the model: “If the context contains conflicting information, ask the user to clarify.”
This approach failed for two reasons:
- Latency and Cost: We were performing expensive vector searches and retrieving heavy context blocks only to discard them and ask a question. This was computationally wasteful.
- Context Confusion: Sometimes the retrieved documents were too similar. The model would confidently hallucinate a solution that combined the activation steps of an eSIM with the shipping logic of a physical SIM, creating a procedure that didn’t exist in reality.
We realized that hardcoding rules (e.g., “if keyword ‘SIM’ is present, ask X”) was not viable. With thousands of topics evolving daily, maintaining a rule engine would mean dedicating developers solely to maintaining `if-else` statements, which does not scale.
HOW WE APPROACHED THE SOLUTION
We decided to introduce a lightweight “Triage Layer” before the RAG retrieval process. This layer acts as a semantic traffic cop. Its sole responsibility is to analyze the user’s input and determine if it contains enough specificity to warrant a database search.
To make this generic and scalable, we utilized structured outputs (using `pydantic_ai` capabilities). Instead of asking the LLM to chat, we asked it to classify and return a structured object.
We evaluated three approaches:
- Keyword Taxonomy: Checking queries against a fixed list of ambiguous terms. (Discarded due to maintenance overhead).
- Vector Similarity Thresholding: Checking if the distance between the query and multiple disparate clusters was high. (Discarded due to complexity and latency).
- LLM-based Intent Disambiguation: Using a fast, smaller model (Gemini Flash Lite) to classify intent and generate dynamic clarification questions. (Selected approach).
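For context on why option 2 was discarded, here is a minimal sketch of what vector similarity thresholding would look like. The 2-D "embeddings," the topic centroids, and the `margin` parameter are all made up for illustration; a real version would need tuned thresholds per domain, maintained cluster centroids, and an extra embedding call per query.

```python
import numpy as np

# Hypothetical topic centroids in a toy 2-D embedding space.
centroids = {
    "physical_sim": np.array([1.0, 0.0]),
    "esim":         np.array([0.0, 1.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_ambiguous(query_vec: np.ndarray, margin: float = 0.1) -> bool:
    # If the top two cluster similarities are within `margin`,
    # the query is too close to call.
    sims = sorted((cosine(query_vec, c) for c in centroids.values()), reverse=True)
    return (sims[0] - sims[1]) < margin

print(is_ambiguous(np.array([0.7, 0.7])))   # equidistant from both -> True
print(is_ambiguous(np.array([0.95, 0.1])))  # clearly "physical_sim" -> False
```

Even this sketch hints at the maintenance burden: every new product category means a new centroid and potentially a re-tuned margin, which is exactly the overhead we were trying to avoid.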
This approach allowed our Python developers to focus on the architecture of the agent rather than the content of the knowledge base.
FINAL IMPLEMENTATION
The solution involved creating a strict Pydantic model to govern the LLM’s output. We used a “Router” pattern. The Router does not have access to the full knowledge base but understands the types of entities the system handles.
Below is a sanitized version of the implementation logic using Python.
1. Defining the Structured Intent
We defined the possible states of a user query. Either it is ambiguous (needs clarification) or specific (ready for RAG).
from pydantic import BaseModel, Field
from typing import Optional, List

class QueryAnalysis(BaseModel):
    is_ambiguous: bool = Field(
        ...,
        description="True if the user query is too broad and refers to multiple distinct concepts."
    )
    clarification_options: Optional[List[str]] = Field(
        default=None,
        description="A list of specific concepts the user might be referring to (e.g., 'eSIM', 'Physical SIM')."
    )
    follow_up_question: Optional[str] = Field(
        default=None,
        description="A polite question asking the user to specify their intent."
    )
    optimized_search_query: Optional[str] = Field(
        default=None,
        description="If not ambiguous, a rewritten query optimized for vector search."
    )
2. The Triage Agent
We implemented a lightweight agent using a system prompt that encourages the model to act as a strict filter. We inject high-level domain knowledge into the prompt so the model knows what “ambiguity” looks like in this specific context.
from pydantic_ai import Agent

# Lightweight model for low latency; the "google-gla:" prefix selects the
# Gemini API provider in pydantic_ai.
model_name = "google-gla:gemini-2.0-flash-lite"

system_prompt = """
You are an expert intent classifier for a telecommunications support bot.
Your job is to analyze user input and decide if it is specific enough to search the knowledge base.

Our Knowledge Base covers:
- Physical SIM cards (shipping, insertion)
- eSIMs (QR codes, activation)
- ExtraSIMs (watch plans)
- Fiber Internet vs. 5G Home Internet

If a user asks "How do I get a SIM?", this is AMBIGUOUS because it could refer to Physical or eSIM.
If a user asks "How do I scan my eSIM QR code?", this is SPECIFIC.
"""

# result_type forces the model's output to validate against QueryAnalysis,
# so no manual JSON parsing is needed.
agent = Agent(model_name, system_prompt=system_prompt, result_type=QueryAnalysis)

async def process_user_query(user_text: str):
    result = await agent.run(user_text)
    data = result.data
    if data.is_ambiguous:
        # Return the clarification question directly to the UI
        return {
            "action": "CLARIFY",
            "message": data.follow_up_question,
            "options": data.clarification_options,
        }
    else:
        # Proceed to RAG search using the optimized query
        return {
            "action": "SEARCH",
            "query": data.optimized_search_query,
        }
3. The Workflow
When the user says “I need a SIM,” the Triage Agent returns is_ambiguous: True and generates a question: “Are you looking for a physical SIM card for a phone, or an eSIM profile?”
The user clicks “eSIM.” The system appends this to the conversation history. The next run produces an optimized query: “How to obtain and activate an eSIM profile.” The Triage Agent sees this as is_ambiguous: False and passes it to the RAG engine.
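The history-merging step can be sketched as follows. `merge_clarification` and the message format are hypothetical simplifications of our orchestration code; the point is that the user's clicked option is folded back into the prior vague query to form a specific one before the Triage Agent runs again.

```python
# Hypothetical sketch: combine the last vague query with the user's
# clarification answer into one enriched query for the next triage run.
def merge_clarification(history: list[dict], user_reply: str) -> str:
    last_user_query = next(
        m["text"] for m in reversed(history) if m["role"] == "user"
    )
    return f"{last_user_query} (specifically: {user_reply})"

history = [
    {"role": "user", "text": "I need a SIM"},
    {"role": "bot",  "text": "Physical SIM card or eSIM profile?"},
]
print(merge_clarification(history, "eSIM"))
# -> "I need a SIM (specifically: eSIM)"
```

In production the enriched query is what the Triage Agent classifies as `is_ambiguous: False` and rewrites into the optimized search query.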
LESSONS FOR ENGINEERING TEAMS
Building this ambiguity layer provided us with several key insights for future projects involving scalable data systems:
- Shift Left on Logic: Don’t wait for the RAG step to fail. Validate the query quality immediately. It saves tokens and reduces latency by avoiding unnecessary vector searches.
- Use Structured Output Everywhere: Never parse raw strings from an LLM in a production pipeline. Pydantic models ensure that your control flow logic (if/else) is reliable.
- Optimize for Latency: The clarification step adds a round-trip. Using huge models like GPT-4 or Gemini Pro for this simple triage is wasteful. Flash/Lite models are sufficient for classification tasks.
- Avoid Hardcoded Trees: By letting the LLM generate the clarification options based on its general understanding of the domain, the system remains robust even if the product catalog changes slightly.
- Context Injection is Key: The Triage Agent doesn’t need the content of the documents, but it does need a high-level summary of the categories available to make good decisions.
WRAP UP
By implementing a pre-search ambiguity detection layer, we significantly improved the accuracy of our RAG system. The chatbot transformed from a tool that frequently hallucinated mixed answers into a helpful assistant that guides users toward the correct information. This architecture also frees development teams to focus on feature expansion rather than debugging an endless stream of edge cases.
If you are looking to build enterprise-grade AI agents or need architectural guidance, we are ready to help.
Frequently Asked Questions
Does the triage layer add latency?
Yes, it adds a small amount of latency (usually 300-500ms with "Lite" models). However, it prevents the much larger latency penalty of performing a useless vector search and generating a long, incorrect answer that the user has to reject.
Does this approach work for multi-turn conversations?
Yes. The Triage Agent should receive the immediate chat history. If the previous turn was a clarification question, the agent combines the user's new response with the context to form a specific query.
What happens when new products or topics are added to the knowledge base?
You simply update the system prompt of the Triage Agent to include the new high-level categories. You do not need to retrain the model or rewrite complex Python conditional logic.
Do I need a dedicated library for structured outputs?
While not strictly necessary, libraries like `pydantic_ai` or `instructor` dramatically simplify the process of forcing LLMs to return valid JSON that matches your schema, reducing runtime errors.