How to optimise latency for voice agents

Over the last week, I have been researching different horizontal platforms like Vapi, Retell and Bland to understand how they work behind the scenes. My main motivation was to 1) figure out the voice agent stack, 2) see if there are any unsolved problems in the space, and 3) look for interesting companies that I can then take up with my team at Accel for investing. Along the way, I learnt a fair bit about the importance of latency in building voice apps and how important it is to achieve sub ~800ms end-to-end latency. It really makes or breaks the voice agent experience, along with other things like interrupt and turn detection, mid-sentence redirection, etc. This post looks at all these considerations from a latency point of view (and suggests tips) to help you build a kickass voice agent. To see how important latency is in building voice applications, I vibe coded a small application to simulate the user experience for different latency configurations. You can play with it at comparevoiceai.com.

Most voice AI apps follow the pattern below. The user speaks into the microphone. The audio is processed client side (noise suppression, speaker isolation, etc.) and then piped over WebRTC (e.g., Daily) to a server, where a Speech-to-Text (STT) model (like Deepgram) transcribes the speech to text. The Dialogue/LLM layer turns that text into an appropriate reply transcript (possibly calling other LLMs, making function calls, etc.), which a Text-to-Speech (TTS) provider like ElevenLabs renders as audio. A second WebRTC hop streams the audio back to the user, with each leg adding latency and failure points that the orchestration layer must hide.
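To make the latency budget concrete, here is a minimal Python sketch that simply adds up the legs of this pipeline. The per-stage numbers are assumed and purely illustrative (real figures depend on your providers, models and network); it mirrors the idea behind the simulator at comparevoiceai.com.

```python
import time

# Assumed, illustrative per-stage latency budgets (ms) for a turn-based pipeline.
STAGES_MS = {
    "client capture + WebRTC uplink": 80,
    "STT (e.g. Deepgram)": 150,
    "LLM reply generation": 350,
    "TTS (e.g. ElevenLabs)": 150,
    "WebRTC downlink + playback": 80,
}

def simulate_turn() -> float:
    """Simulate one user turn by sleeping for each stage's budget and
    returning the end-to-end latency in milliseconds."""
    start = time.perf_counter()
    for stage, ms in STAGES_MS.items():
        time.sleep(ms / 1000)
        print(f"{stage:<35} +{ms} ms")
    total = (time.perf_counter() - start) * 1000
    print(f"end-to-end: {total:.0f} ms (target: < ~800 ms)")
    return total

if __name__ == "__main__":
    simulate_turn()
```

With these (made-up) numbers the naive sequential pipeline already sits at the edge of the ~800ms budget, which is why the overlap and caching tricks below matter so much.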

Achieving human-like responsiveness requires optimisations at every layer of this pipeline, especially in how we manage LLM context and system architecture. In the first section, we will look at the methods we can use to cut latency on the central LLM block. If you want to learn more about choosing the right LLM provider for your voice agents, you can read the blog post I wrote on comparevoiceai.com.

Link: Which LLM to choose for voice agents.

Semantic caching
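The idea here is to reuse a previously generated reply when a new user query is semantically close to one the agent has already answered, so you skip the LLM call (and its latency) entirely on a hit. Below is a minimal sketch of the pattern; a production version would use an embedding model and a vector store, while the stdlib difflib similarity here is only a stand-in so the example runs without extra dependencies, and the 0.85 threshold is an assumed cutoff.

```python
from difflib import SequenceMatcher

CACHE: dict[str, str] = {}
THRESHOLD = 0.85  # assumed similarity cutoff; tune for your domain

def lookup(query: str) -> str | None:
    """Return a cached reply for a sufficiently similar earlier query, if any."""
    for cached_query, cached_reply in CACHE.items():
        if SequenceMatcher(None, query.lower(), cached_query.lower()).ratio() >= THRESHOLD:
            return cached_reply  # reuse the reply, skipping the LLM call entirely
    return None

def answer(query: str, call_llm) -> str:
    """Serve from the cache when possible; only pay LLM latency on a miss."""
    if (hit := lookup(query)) is not None:
        return hit
    reply = call_llm(query)
    CACHE[query] = reply
    return reply
```

Caching works best for the repetitive parts of a call (greetings, FAQs, confirmations) where the same intent shows up again and again across users.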

Prompt Optimisation Techniques

Feeding long conversation histories or verbose prompts into an LLM is a major source of latency. The more tokens the model must process, the longer it takes to produce a response. In this section, we will look at some tricks to minimise prompt size and complexity per request while making sure the context and functionality aren't lost. A simple example of the idea is sketched below.
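One common trick is to send only the most recent turns verbatim and compress everything older into a short rolling summary. The sketch below assumes OpenAI-style chat message dicts and assumes older turns are summarised out of band (for example by a cheap background LLM call); the window size is an arbitrary illustration.

```python
MAX_RECENT_TURNS = 6  # assumed window; tune for your use case

def build_messages(system_prompt: str, summary: str, history: list[dict]) -> list[dict]:
    """Build a compact prompt: the system prompt, a rolling summary of older
    turns, and only the most recent exchanges verbatim."""
    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append({"role": "system",
                         "content": f"Conversation so far (summary): {summary}"})
    messages.extend(history[-MAX_RECENT_TURNS:])  # keep only the freshest turns
    return messages
```

Fewer input tokens means faster time-to-first-token on every turn, which compounds over a long call.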

Streaming and Overlap of STT, LLM, and TTS

Traditional voice agents operated in a strictly turn-based fashion: the user speaks, the system waits for them to finish, then processes the query, and finally speaks the response. This results in noticeable dead air while the user waits for the agent’s reply. Modern real-time architectures instead use streaming at each stage, overlapping tasks to eliminate idle gaps. The goal is to make the conversation feel fluid, as if the agent is listening and formulating a response almost simultaneously.
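A concrete way to overlap the LLM and TTS stages is to stream tokens out of the LLM, cut the stream at sentence boundaries, and hand each completed sentence to TTS immediately, so playback starts while the rest of the reply is still being generated. This is a minimal, self-contained sketch with fake token and TTS stages standing in for real streaming APIs.

```python
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?]\s")

async def stream_reply(token_stream, speak):
    """Flush each completed sentence to TTS while the LLM is still generating,
    so audio playback starts before the full reply is ready."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            await speak(sentence)          # hand off to TTS immediately
    if buffer.strip():
        await speak(buffer)                # flush whatever is left

# Demo with stand-in stages so the sketch runs standalone.
async def fake_tokens():
    for token in ["Sure, ", "I can ", "help. ", "What time ", "works for you?"]:
        await asyncio.sleep(0.05)          # pretend per-token LLM delay
        yield token

async def fake_tts(sentence: str):
    print(f"TTS started on: {sentence!r}")

asyncio.run(stream_reply(fake_tokens(), fake_tts))
```

The same overlap applies upstream: streaming STT can emit partial transcripts so the LLM starts working before the user has fully finished speaking.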

Startup and Warm up Latency Minimisation

Latency isn’t only about how fast the model generates tokens; it also includes any delays in starting up the model or service. In real-time voice interactions, even a one-time delay (like a cold start) can ruin the user experience on the first query. Therefore, systems must minimise initialization overhead and avoid cold starts during a session.
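One common pattern is to pre-warm every leg of the pipeline as soon as the call connects, before the user says a word, so the first real turn doesn't pay connection-setup or cold-start costs. The clients below are hypothetical stubs; real ones would wrap your Deepgram, LLM and ElevenLabs streaming sessions.

```python
import asyncio

class StubClient:
    """Hypothetical async client standing in for an STT/LLM/TTS session."""
    def __init__(self, name: str):
        self.name = name

    async def warm(self):
        await asyncio.sleep(0.2)   # stand-in for connect / cold-start cost
        print(f"{self.name} warm")

async def prewarm(stt: StubClient, llm: StubClient, tts: StubClient):
    """Run once when the call connects: warm all legs in parallel so the
    first user turn only pays per-turn latency, not setup latency."""
    await asyncio.gather(stt.warm(), llm.warm(), tts.warm())

asyncio.run(prewarm(StubClient("STT"), StubClient("LLM"), StubClient("TTS")))
```

Keeping these sessions alive (and periodically pinged) for the duration of the call avoids paying the same setup cost again mid-conversation.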
