    INTRODUCTION

    While working on a high-volume AI customer support platform for an enterprise logistics provider, we encountered a significant challenge with real-time audio streaming. We were building a conversational voicebot utilizing an Asterisk IVR system connected to a Python-based AI microservice via Asterisk’s AudioSocket capability. The goal was seamless, human-like interaction.

    However, during initial deployment, we realized the system was operating in a rigid, half-duplex state, much like a walkie-talkie. While the AI agent was speaking, the system deliberately discarded incoming caller audio so that the bot would not hear its own voice and enter an infinite feedback loop. As a result, when callers tried to interrupt mid-sentence to clarify an address or tracking number, their speech was completely ignored.

    This limitation breaks the illusion of conversational AI. When enterprises hire software development teams to build AI voice agents, they expect fluid, full-duplex communication in which the caller can interrupt the bot at any time (barge-in). That challenge inspired this article, in which we break down why standard echo cancellation fails over AudioSocket and how we engineered a reliable, full-duplex barge-in architecture without false triggers.

    PROBLEM CONTEXT

    In our architecture, Asterisk handles the telephony layer (SIP/PSTN), while a Python backend orchestrates the AI logic. The call flow utilizes Asterisk’s AudioSocket to establish a bidirectional raw TCP audio stream.

    Once connected, the flow looks like this:

    • User speech is streamed from Asterisk to the Python server via AudioSocket.
    • The Python server performs Speech-to-Text (STT) and passes the text to an agentic LLM workflow.
    • The generated text response is sent to a Text-to-Speech (TTS) engine.
    • TTS audio chunks are streamed back through the same AudioSocket and played to the caller.
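    The stream in the first step has very little structure, which matters for everything that follows. As a minimal sketch (not our production code), the framing can be parsed as shown here. The 3-byte header and the type codes follow the published AudioSocket protocol description as we understand it, so verify them against your Asterisk version. `encode_frame` and `decode_frames` are hypothetical helper names.

```python
import struct
from typing import Iterator, Tuple

# Message types as described in the AudioSocket protocol documentation
# (assumption: check these against the Asterisk version you run).
KIND_HANGUP = 0x00  # the connection is terminating
KIND_UUID = 0x01    # 16-byte call UUID, sent first by Asterisk
KIND_DTMF = 0x03    # one ASCII DTMF digit
KIND_AUDIO = 0x10   # signed 16-bit, 8 kHz, mono PCM
KIND_ERROR = 0xFF   # error reported by the peer

# 1-byte type followed by a 2-byte big-endian payload length.
# Note what is missing: no timestamp and no sequence number.
HEADER = struct.Struct("!BH")


def encode_frame(kind: int, payload: bytes = b"") -> bytes:
    """Build one AudioSocket message."""
    return HEADER.pack(kind, len(payload)) + payload


def decode_frames(buffer: bytearray) -> Iterator[Tuple[int, bytes]]:
    """Yield every complete (kind, payload) frame and remove it from buffer.

    A trailing partial frame is left in place until more bytes arrive,
    because TCP does not preserve message boundaries.
    """
    while len(buffer) >= HEADER.size:
        kind, length = HEADER.unpack_from(buffer, 0)
        end = HEADER.size + length
        if len(buffer) < end:
            break
        payload = bytes(buffer[HEADER.size:end])
        del buffer[:end]
        yield kind, payload
```

    In use, the server appends every `recv()` result to one `bytearray` per call and drains it with `decode_frames`. Because the header carries no timing information, any alignment between sent and received audio has to be reconstructed by the application.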

    To avoid processing the bot’s own TTS output as user input, the initial implementation simply dropped incoming audio packets during TTS playback. This created a jarring user experience. To fix this, we needed a full-duplex pipeline capable of listening continuously and accurately identifying when the caller was speaking over the bot.

    WHAT WENT WRONG

    To enable continuous listening, we removed the half-duplex block and introduced Voice Activity Detection (VAD) and WebRTC-based Acoustic Echo Cancellation (AEC) directly into the Python audio processing loop.

    Despite heavy tuning, this introduced two critical failures in the production environment:

    • False Barge-Ins: The system would instantly trigger a user interruption the moment the bot started speaking. The AI was detecting its own TTS audio echoing back from the PSTN network.
    • Missed Barge-Ins: To combat the false triggers, we increased the VAD aggressiveness and AEC suppression levels. This over-correction caused the system to miss genuine human speech when the user attempted to interrupt.

    The root of the issue lies in how AudioSocket works. AudioSocket is a “dumb” pipe: it streams raw 8 kHz or 16 kHz PCM audio with no timestamps. When you feed TTS audio out through the socket, it takes time to travel through Asterisk, to the SIP provider, and over the PSTN to the caller’s mobile phone, and then to return as line or acoustic echo.

    Software AEC needs a “reference signal” (what the bot is saying) that is closely time-aligned with the “capture signal” (what the mic is hearing). Because of variable network jitter and PSTN latency, the TTS reference signal in our Python app was out of sync with the echo, which arrived hundreds of milliseconds later. Under these conditions, WebRTC AEC could not cancel the echo.
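    To see why alignment matters, here is a deliberately simplified illustration. It is not WebRTC AEC, which uses adaptive filters, but it depends on the same correspondence between reference and capture. The 30 ms delay, the 0.5 echo gain, and the `residual_energy` helper are assumptions chosen for the example.

```python
import random


def residual_energy(capture, reference, delay, gain):
    """Energy left after subtracting gain * reference[n - delay] from capture[n]."""
    total = 0.0
    for n, sample in enumerate(capture):
        ref_index = n - delay
        ref = reference[ref_index] if 0 <= ref_index < len(reference) else 0.0
        total += (sample - gain * ref) ** 2
    return total


rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # stand-in for TTS audio

true_delay = 240  # 240 samples = 30 ms at 8 kHz
gain = 0.5        # the echo returns at half amplitude
capture = [0.0] * true_delay + [gain * s for s in reference]

echo_energy = sum(s * s for s in capture)
aligned = residual_energy(capture, reference, true_delay, gain)
misaligned = residual_energy(capture, reference, 0, gain)

print(f"echo energy:          {echo_energy:.1f}")
print(f"residual, aligned:    {aligned:.1f}")
print(f"residual, misaligned: {misaligned:.1f}")
```

    With the correct delay the echo cancels completely. With the delay ignored, the residual carries more energy than the echo itself, which is exactly the kind of signal a downstream VAD reads as speech.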

    HOW WE APPROACHED THE SOLUTION

    We had to rethink the boundary between the telephony switch and the AI application. We evaluated whether Asterisk dialplan features like `Background()` could help, but native Asterisk media applications are not designed for asynchronous, continuous chunked streaming via TCP.

    We realized that relying solely on Python for echo cancellation over an asynchronous TCP pipe was an architectural flaw. The solution required a hybrid approach:

    • Asterisk Layer: Normalize the audio timing and eliminate as much network echo as possible before the audio ever enters the AudioSocket.
    • Python Layer: Implement a dynamic delay-buffer to precisely align the TTS reference stream with the incoming stream before passing it to the VAD/AEC pipeline.

    If your team is struggling with similar media streaming challenges, this is often the point at which to hire Python developers for scalable voice AI systems who understand digital signal processing (DSP) as well as application code.

    FINAL IMPLEMENTATION

    Our final fix required changes in both the Asterisk dialplan and the Python streaming architecture.

    1. Asterisk Dialplan Optimizations

    Before launching the AudioSocket connection, we configured Asterisk to handle jitter and apply native software echo cancellation (if the SIP channel supports it). We utilized Asterisk’s `JITTERBUFFER` to stabilize the incoming stream, ensuring the Python server receives a consistent flow of PCM frames.


    ; Standard Initialization
    same => n,Wait(1)
    same => n,Set(CIVR_HOST=ipconfig)
    same => n,Set(CALL_UUID=${UUID()})

    ; Enable Jitterbuffer to normalize network latency before AudioSocket
    same => n,Set(JITTERBUFFER(adaptive)=default)

    ; Notify the Python microservice of the incoming call
    same => n,Log(NOTICE, Initiating call context for UUID: ${CALL_UUID})
    same => n,System(curl -s "http://${CIVR_HOST}:1650/api/call-start?uuid=${CALL_UUID}" >/dev/null 2>&1)

    ; Open AudioSocket
    same => n,Log(NOTICE, Starting AudioSocket to ${CIVR_HOST}:3000)
    same => n,AudioSocket(${CALL_UUID},${CIVR_HOST}:3000)
    same => n,Log(NOTICE, AudioSocket connection closed)

    2. Python Buffer Alignment (The Barge-in Logic)

    On the Python side, we could no longer just pass the TTS audio directly to the AEC module. We implemented a dynamic cross-correlation function that continuously measures the delay between the outgoing TTS stream and the incoming AudioSocket stream.
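    A minimal sketch of delay estimation by cross-correlation follows. It is written in pure Python so that it is self-contained; a production version would more likely use `numpy.correlate` or an FFT-based method over a longer window. The function name `estimate_delay`, the search range, and the synthetic signals are assumptions for illustration, not our exact implementation.

```python
import random
from typing import Sequence


def estimate_delay(reference: Sequence[float],
                   capture: Sequence[float],
                   max_lag: int) -> int:
    """Return the lag, in samples, at which capture best matches reference.

    Slides the reference over the capture for every lag in [0, max_lag]
    and keeps the lag with the largest correlation score.
    """
    window = min(len(reference), len(capture) - max_lag)
    if window <= 0:
        raise ValueError("capture must be longer than max_lag")
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(reference[i] * capture[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic check: a noise-like reference echoed back 37 samples late
# at 40% of its original level.
rng = random.Random(7)
reference = [rng.uniform(-1.0, 1.0) for _ in range(400)]
capture = [0.0] * 37 + [0.4 * s for s in reference]

lag = estimate_delay(reference, capture, max_lag=80)
sample_rate = 8000
print(f"estimated delay: {lag} samples ({lag * 1000 / sample_rate:.1f} ms)")
```

    Real echo paths are measured in hundreds of milliseconds, so `max_lag` has to cover thousands of samples at 8 kHz. The estimate also only makes sense while the bot is actually speaking, since silence in the reference correlates with nothing.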

    When the Python server writes TTS audio to the socket, it also pushes that audio into a ring buffer. Using the measured delay (e.g., 250 ms), we fetch the frame of TTS audio from the buffer that corresponds to the current incoming microphone frame. Only then do we pass both frames to the WebRTC AEC process.
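    The ring buffer lookup can be sketched as follows. The class name `ReferenceRingBuffer` and its methods are hypothetical, and the sketch assumes the reference is written at real-time pace, with silence written while the bot is not speaking, so that sample counts on both streams advance together.

```python
from typing import Iterable, List


class ReferenceRingBuffer:
    """Fixed-capacity history of recently sent TTS samples."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._buf = [0] * capacity
        self._written = 0  # total number of samples ever written

    def write(self, samples: Iterable[int]) -> None:
        """Append samples, overwriting the oldest ones once full."""
        for sample in samples:
            self._buf[self._written % self._capacity] = sample
            self._written += 1

    def read_delayed(self, delay: int, count: int) -> List[int]:
        """Return count samples ending delay samples before the newest one.

        Positions that were never written, or that have already been
        overwritten, are returned as silence (0).
        """
        if delay < 0:
            raise ValueError("delay must be non-negative")
        end = self._written - delay
        oldest = max(0, self._written - self._capacity)
        out = []
        for index in range(end - count, end):
            if index < oldest:
                out.append(0)
            else:
                out.append(self._buf[index % self._capacity])
        return out
```

    At 8 kHz, a 250 ms delay corresponds to 2,000 samples and a 20 ms frame to 160 samples, so the reference for the current microphone frame would be `read_delayed(2000, 160)`. The capacity has to exceed the largest delay you expect plus one frame.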

    Once the echo is cleanly subtracted, the remaining audio is passed to a lightweight VAD (like Silero VAD). If the VAD detects speech, we immediately halt the TTS streaming queue, flush the AudioSocket output buffer, and send a signal to the LLM that the user has barged in.
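    The interruption step can be sketched like this. The `is_speech` callable is a hypothetical stand-in for a real VAD such as Silero, and the requirement of several consecutive speech frames is an assumption added here to reduce one-frame false triggers; it is not something the pipeline above prescribes. The sketch clears only the application-side queue, so audio already written to the socket still has to be dealt with separately.

```python
from collections import deque
from typing import Callable, Deque


class BargeInController:
    """Stops TTS playback when the caller speaks over the bot."""

    def __init__(self,
                 is_speech: Callable[[bytes], bool],
                 on_barge_in: Callable[[], None],
                 min_speech_frames: int = 3) -> None:
        self._is_speech = is_speech      # hypothetical VAD wrapper
        self._on_barge_in = on_barge_in  # e.g. notify the LLM workflow
        self._min_speech_frames = min_speech_frames
        self._streak = 0                 # consecutive speech frames seen
        self.tts_queue: Deque[bytes] = deque()  # TTS frames not yet sent

    def bot_is_speaking(self) -> bool:
        return bool(self.tts_queue)

    def process_clean_frame(self, frame: bytes) -> bool:
        """Feed one echo-cancelled frame; return True if barge-in fired."""
        if not self.bot_is_speaking():
            # Nothing to interrupt: normal turn-taking applies instead.
            self._streak = 0
            return False
        self._streak = self._streak + 1 if self._is_speech(frame) else 0
        if self._streak < self._min_speech_frames:
            return False
        self.tts_queue.clear()  # halt the pending TTS audio
        self._streak = 0
        self._on_barge_in()
        return True
```

    With 20 ms frames, `min_speech_frames=3` adds roughly 60 ms before the bot stops talking. Lower values react faster but are more sensitive to residual echo that survives cancellation.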

    LESSONS FOR ENGINEERING TEAMS

    Implementing barge-in over raw TCP sockets is notoriously difficult. When you hire Asterisk developers for enterprise communication platforms, ensure they understand the intersection of VoIP protocols and machine learning pipelines. Here are our key takeaways:

    • AudioSocket is a Transport, Not a Media Engine: AudioSocket does not provide timestamps, echo cancellation, or media synchronization. It is solely a transport mechanism for raw bits.
    • Latency Alignment is Mandatory: Software AEC (like WebRTC) will fail and cause false VAD triggers if the reference audio (TTS) and capture audio (Mic) are not time-aligned. In our case, the echo lagged the reference by hundreds of milliseconds.
    • Buffer Your References: Always maintain a ring buffer of your outgoing audio on the AI server so you can dynamically align it with incoming echo based on real-time latency calculations.
    • Asterisk Jitter Buffers Help: Applying `JITTERBUFFER` in the Asterisk dialplan reduces the packet variability, making the DSP algorithms on the Python side much more accurate.
    • Separate VAD from STT: Do not rely on your cloud STT provider to detect barge-in. STT has too much latency. Run a local, low-latency VAD model (like Silero) purely on the echo-cancelled audio to trigger the interruption instantly.

    WRAP UP

    By shifting SIP network normalization back to Asterisk and implementing a dynamically aligned DSP pipeline in Python, we transformed a rigid, half-duplex IVR into a fluid, conversational AI agent capable of handling complex barge-ins reliably. Designing these systems requires a deep understanding of network behavior, Asterisk internals, and signal processing. If your organization is looking to build or scale complex enterprise AI and telephony architectures, contact us to discuss your engineering needs.


    Frequently Asked Questions