
Say Hello to ‘Her’: Real-Time AI Voice Agents with 500ms Latency, Now Open Source

2024-08-17


Voice Mode is hands down one of the coolest features in ChatGPT, right?


OpenAI didn’t just slap some voices together — they went all out, working with top-notch casting and directing pros.


They sifted through over 400 voice submissions before finally landing on the 5 voices you hear in ChatGPT today.


Just imagine all the people involved — voice actors, talent agencies, casting directors, industry advisors. It’s wild!


There was even some drama along the way, like with the voice of Sky, which had to be paused because it sounded a little too much like Scarlett Johansson.


Now, here’s where it gets really exciting…


Imagine having that kind of speech-to-speech magic, but with less than 500ms latency, running locally on your own GPU.


Sounds like a dream, right?


Well, it’s now open source.


Today, I’m going to walk you through how you can set up your very own modular speech-to-speech pipeline, built as a series of consecutive parts: voice activity detection (VAD), speech to text (STT), a language model (LM), and text to speech (TTS).



Let’s GOOOO!



Why modularity matters


One of the coolest things about this pipeline is how modular it is.


Each component is designed as a class, making it super easy to swap things in and out depending on your needs.


Want to use a different Whisper model for STT? Go ahead.


Need to change the language model? No problem — just tweak the Hugging Face model ID.


This modularity isn’t just a nice-to-have; it’s crucial for keeping your pipeline adaptable.


With the field moving this fast, a flexible setup means you can quickly integrate new models and techniques without rewriting everything from scratch.
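
To make this concrete, here is a minimal sketch of what a class-per-component design looks like. The class names, model IDs, and return values below are made up for illustration; the repo’s actual classes differ, but the swap-one-component idea is the same.

# Hypothetical sketch of a class-based pipeline (not the repo's actual code).
from abc import ABC, abstractmethod

class PipelineStage(ABC):
    @abstractmethod
    def process(self, data):
        """Take the previous stage's output and return this stage's output."""

class WhisperSTT(PipelineStage):
    def __init__(self, model_id: str = "openai/whisper-large-v3"):
        self.model_id = model_id  # swap in any Whisper checkpoint here

    def process(self, audio):
        return f"<transcript produced by {self.model_id}>"

class HFLanguageModel(PipelineStage):
    def __init__(self, model_id: str = "microsoft/Phi-3-mini-4k-instruct"):
        self.model_id = model_id  # change the Hugging Face model ID to swap LMs

    def process(self, text):
        return f"<reply to '{text}' from {self.model_id}>"

# The pipeline is just the stages run back to back.
stages = [WhisperSTT(), HFLanguageModel()]
data = b"raw audio bytes"
for stage in stages:
    data = stage.process(data)
print(data)

Swapping the STT or LM is then a one-line change: instantiate a different class (or the same class with a different model ID) in the stages list.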


OK, let’s see it in action.


Setting up the environment for the speech-to-speech pipeline


Let’s first set up a virtual environment and install libraries. Open your command line interface — this could be your Command Prompt, Terminal, or any other CLI tool you’re comfortable with — and run the following commands:

# Clone the repository  
git clone https://github.com/eustlb/speech-to-speech.git  
cd speech-to-speech  

# Create and activate a virtual environment  
python3 -m venv speech-to-speech-env  
source speech-to-speech-env/bin/activate  

# Install dependencies  
pip3 install git+https://github.com/nltk/nltk.git@3.8.2  
pip3 install -r requirements.txt
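
Before moving on, it’s worth a quick sanity check that PyTorch can actually see your GPU, since the sub-500ms latency depends on running the models on it. This little snippet is my own addition, not part of the repo:

# Optional sanity check: confirm that the freshly installed PyTorch sees your GPU.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

If CUDA shows up as unavailable, fix your driver and CUDA setup before worrying about latency; everything will still run on CPU, just much slower.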

The rest is relatively easy.


How to Run the Pipeline


You’ve got two main options for running the pipeline:



Server/Client Approach:


If you want to run the pipeline on a server and stream audio input/output from a client, you first run:

python3 s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0

Then, on your client machine you run:

python3 listen_and_play.py --host <IP address of your server>
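
Conceptually, the client side just captures microphone audio and streams the raw frames to the server (and plays back whatever audio comes back). Here is a rough, hypothetical sketch of that idea using the sounddevice library; it is not the repo’s listen_and_play.py, it does not speak its actual wire protocol, and the host and port are placeholders:

# Conceptual illustration only: stream raw microphone audio to a server over TCP.
import socket
import sounddevice as sd

SERVER_HOST = "192.168.1.10"  # placeholder: your server's IP address
SERVER_PORT = 12345           # placeholder port

with socket.create_connection((SERVER_HOST, SERVER_PORT)) as sock:
    def on_audio(indata, frames, time_info, status):
        # Forward each captured chunk of 16-bit PCM audio to the server.
        sock.sendall(bytes(indata))

    with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                           blocksize=512, callback=on_audio):
        input("Streaming microphone audio... press Enter to stop.\n")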

Local Approach:


If you prefer to keep everything on one machine, just use the loopback address:

python s2s_pipeline.py --recv_host localhost --send_host localhost  
python listen_and_play.py --host localhost

This flexibility is great because it allows you to scale your setup depending on your needs.


Running locally? No problem.


Need to deploy on a remote server? You’re covered.


The only thing you may still need at this point is a performance boost from Torch Compile. Before getting into that, a quick reminder:


If you want to build a full-stack GenAI SaaS Products that people love — don’t miss out on our upcoming cohort-based course. Together, we’ll build, ship, and scale your GenAI product alongside a community of like-minded people!


On to the next section.


Boosting performance with Torch Compile


For the best performance, especially with Whisper and Parler-TTS, you’ll want to leverage Torch Compile.


Here’s a command that puts it all together:


python s2s_pipeline.py \  
    --recv_host 0.0.0.0 \  
    --send_host 0.0.0.0 \  
    --lm_model_name microsoft/Phi-3-mini-4k-instruct \  
    --init_chat_role system \  
    --stt_compile_mode reduce-overhead \  
    --tts_compile_mode default

This setup optimizes the pipeline for reduced latency, making it super responsive.


Just a heads-up: while these modes are great, certain CUDA Graphs modes aren’t compatible with streaming in Parler-TTS yet, so keep that in mind.
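
If you’re curious what those compile-mode flags boil down to, here is a generic, standalone illustration of torch.compile applied to a toy model. The "reduce-overhead" mode leans on CUDA Graphs to cut per-call launch overhead, which is exactly what you want for the many short forward passes in STT and TTS. This is not the pipeline’s internal code, just the underlying idea:

# Generic torch.compile example on a toy model (not the pipeline's internal code).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).to(device)

# mode="reduce-overhead" trades a longer warm-up for lower per-call latency.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 256, device=device)
with torch.no_grad():
    # The first call triggers compilation; later calls reuse the optimized graph.
    for _ in range(3):
        out = compiled(x)
print(out.shape)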


Command line power


Now, let’s talk about the command-line options that give you fine-grained control over each part of the pipeline.


Model Parameters: You can specify parameters for each model component (STT, LM, TTS) directly via the command line. For example:


--lm_model_name google/gemma-2b-it

Generation Parameters: You can also adjust generation-specific settings, like setting the maximum number of new tokens for STT:


--stt_gen_max_new_tokens 128
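
A cap like this presumably maps onto the standard max_new_tokens generation argument in Hugging Face transformers. Here is a generic illustration with a deliberately small model (not the pipeline’s actual code) just to show what the cap does:

# Generic illustration of max_new_tokens with a small model (not the pipeline's code).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # tiny model, purely for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Real-time voice agents need short replies because", return_tensors="pt")
# Generation stops after at most 128 newly generated tokens.
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))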

This kind of granular control is also a lifesaver when you’re trying to fine-tune performance or tweak specific behaviors in the pipeline.


Some Key Parameters You Should Know About


Here are a few parameters that you’ll likely find useful:



Language Model:

--lm_model_name sets the Hugging Face model ID to use (for example microsoft/Phi-3-mini-4k-instruct or google/gemma-2b-it), and --init_chat_role lets you seed the conversation with a system role.


Speech to Text:

--stt_gen_max_new_tokens caps how many tokens the transcription step can generate, and --stt_compile_mode picks the Torch Compile mode used for Whisper.


This pipeline is a fantastic starting point if you’re building AI voice agents.


Stay in the loop


The modularity and flexibility it offers mean you can adapt it to a wide range of use cases, whether you’re focusing on real-time applications or something more static.


So go ahead, clone the repo, and start experimenting.


The more you play around with these tools, the more you’ll realize just how much potential they have.


Happy building!


Bonus Content: Building with AI


And don’t forget to have a look at some practitioner resources that we published recently:


Say Hello to ‘Her’: Real-Time AI Voice Agents with 500ms Latency, Now Open Source

Fine-Tune Meta’s Latest AI Model: Customize Llama 3.1 5x Faster with 80% Less Memory

Fine Tuning FLUX: Personalize AI Image Models on Minimal Data for Custom Look and Feel

Data Management with Drizzle ORM, Supabase and Next.js for Web & Mobile Applications


Thank you for stopping by and for being an integral part of our community.


Happy building!