Say Hello to ‘Her’: Real-Time AI Voice Agents with 500ms Latency, Now Open Source
2024-08-17
Voice Mode is hands down one of the coolest features in ChatGPT, right?
OpenAI didn’t just slap some voices together — they went all out, working with top-notch casting and directing pros.
They sifted through over 400 voice submissions before finally landing on the 5 voices you hear in ChatGPT today.
Just imagine all the people involved — voice actors, talent agencies, casting directors, industry advisors. It’s wild!
There was even some drama along the way, like with the voice of Sky, which had to be paused because it sounded a little too much like Scarlett Johansson.
Now, here’s where it gets really exciting…
Imagine having that kind of speech-to-speech magic, but with less than 500ms latency, running locally on your own GPU.
Sounds like a dream, right?
Well, it’s open source now.
Today, I’m going to walk you through setting up your very own modular speech-to-speech pipeline, built as a cascade of four consecutive parts (a minimal sketch of how they chain together follows this list):
- Voice Activity Detection (VAD): The pipeline uses Silero VAD v5; it’s the gatekeeper that decides when someone is actually speaking.
- Speech to Text (STT): This is where Whisper comes in. Whisper models are known for their accuracy, and you can even use distilled versions for faster performance.
- Language Model (LM): You can plug in any instruct model available on the Hugging Face Hub. Whether you’re into GPT-like models or something else, the choice is yours!
- Text to Speech (TTS): Parler-TTS takes care of converting the text back into speech. It’s flexible enough to work with different checkpoints, so you can tweak the output to fit your needs.
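To make the flow concrete, here’s a minimal, self-contained sketch of how the four stages chain together. The class names and bodies are illustrative stand-ins, not the repository’s actual classes; the point is just that each stage consumes the previous stage’s output:

# illustrative stand-ins, not the repo's actual classes
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioChunk:
    samples: bytes            # raw PCM audio
    sample_rate: int = 16_000

class VoiceActivityDetector:
    def is_speech(self, chunk: AudioChunk) -> bool:
        return len(chunk.samples) > 0              # stand-in for Silero VAD

class SpeechToText:
    def transcribe(self, chunk: AudioChunk) -> str:
        return "hello there"                       # stand-in for a Whisper model

class LanguageModel:
    def reply(self, prompt: str) -> str:
        return f"You said: {prompt}"               # stand-in for an instruct model

class TextToSpeech:
    def synthesize(self, text: str) -> AudioChunk:
        return AudioChunk(samples=text.encode())   # stand-in for Parler-TTS

def speech_to_speech(chunk: AudioChunk) -> Optional[AudioChunk]:
    vad, stt, lm, tts = VoiceActivityDetector(), SpeechToText(), LanguageModel(), TextToSpeech()
    if not vad.is_speech(chunk):
        return None                                # gatekeeper: ignore silence
    text = stt.transcribe(chunk)                   # speech -> text
    answer = lm.reply(text)                        # text -> text
    return tts.synthesize(answer)                  # text -> speech

print(speech_to_speech(AudioChunk(b"\x00\x01")).samples)

Because every stage hides behind a tiny interface like this, upgrading one of them never forces you to touch the others.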
Let’s GOOOO!
Why modularity matters
One of the coolest things about this pipeline is how modular it is.
Each component is designed as a class, making it super easy to swap things in and out depending on your needs.
Want to use a different Whisper model for STT? Go ahead.
Need to change the language model? No problem — just tweak the Hugging Face model ID.
This modularity isn’t just a nice-to-have; it’s crucial for keeping your pipeline adaptable.
With everything in this space moving so fast, a flexible setup means you can quickly integrate new models and techniques without rewriting everything from scratch.
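As a concrete (and deliberately simplified) illustration using the plain transformers pipeline rather than the repository’s own wrappers, swapping the full-size Whisper checkpoint for a distilled one is just a change of model ID:

from transformers import pipeline

# full-size Whisper checkpoint
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# ...or a distilled checkpoint for lower latency, with the exact same interface
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v3")

print(asr("sample.wav")["text"])   # "sample.wav" is a placeholder for any local audio file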
OK, let’s see it in action.
Setting up the environment
Let’s first set up a virtual environment and install libraries. Open your command line interface — this could be your Command Prompt, Terminal, or any other CLI tool you’re comfortable with — and run the following commands:
# Clone the repository
git clone https://github.com/eustlb/speech-to-speech.git
cd speech-to-speech

# Create and activate a virtual environment
python3 -m venv speech-to-speech-env
source speech-to-speech-env/bin/activate

# Install dependencies
pip3 install git+https://github.com/nltk/nltk.git@3.8.2
pip3 install -r requirements.txt
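Before going further, it doesn’t hurt to confirm that PyTorch (pulled in via requirements.txt) can actually see your GPU:

import torch

# the pipeline can run on CPU, but sub-500ms latency really wants a GPU
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())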
The rest is relatively easy.
How to Run the Pipeline
You’ve got two main options for running the pipeline:
- a server/client approach
- a local approach
Server/Client Approach:
If you want to run the pipeline on a server and stream audio input/output from a client, you first run:
python3 s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
Then, on your client machine you run:
python3 listen_and_play.py --host <IP address of your server>
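If you’re curious what the client side boils down to, here’s a rough, hypothetical sketch of the idea: stream raw microphone chunks to the server as they arrive. This is not the repository’s listen_and_play.py (which also plays the returned audio and has its own framing); the host, port, and audio format below are made up for illustration:

# hypothetical client sketch; not the repo's listen_and_play.py
import socket
import sounddevice as sd

SERVER_HOST = "192.168.1.10"   # hypothetical server address
SERVER_PORT = 12345            # hypothetical port

sock = socket.create_connection((SERVER_HOST, SERVER_PORT))

def on_audio(indata, frames, time_info, status):
    # forward each raw microphone chunk to the server as soon as it arrives
    sock.sendall(bytes(indata))

with sd.RawInputStream(samplerate=16_000, channels=1, dtype="int16",
                       blocksize=512, callback=on_audio):
    input("Streaming microphone audio... press Enter to stop.\n")

sock.close()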
Local Approach:
If you prefer to keep everything on one machine, just use the loopback address:
python s2s_pipeline.py --recv_host localhost --send_host localhost
python listen_and_play.py --host localhost
This flexibility is great because it allows you to scale your setup depending on your needs.
Running locally? No problem.
Need to deploy on a remote server? You’re covered.
The only thing you might still want at this point is a performance boost, and that’s exactly what Torch Compile is for. On to the next section.
Boosting performance with Torch Compile
For the best performance, especially with Whisper and Parler-TTS, you’ll want to leverage Torch Compile.
Here’s a command that puts it all together:
python s2s_pipeline.py \
--recv_host 0.0.0.0 \
--send_host 0.0.0.0 \
--lm_model_name microsoft/Phi-3-mini-4k-instruct \
--init_chat_role system \
--stt_compile_mode reduce-overhead \
--tts_compile_mode default
This setup optimizes the pipeline for reduced latency, making it super responsive.
Just a heads-up: while these modes are great, certain CUDA Graphs modes aren’t compatible with streaming in Parler-TTS yet, so keep that in mind.
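Under the hood, those compile-mode flags boil down to wrapping a model’s forward pass with torch.compile. Here’s a generic illustration of the idea (not the pipeline’s exact code), using Whisper as the example:

# generic torch.compile usage, not the pipeline's exact code
import torch
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch.float16
).to("cuda")

# "reduce-overhead" leans on CUDA Graphs to cut per-call launch overhead,
# which pays off for the many short forward passes of streaming generation
model.forward = torch.compile(model.forward, mode="reduce-overhead")

Expect the first couple of calls to be slow while compilation warms up; the latency win comes after that.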
Command line power
Now, let’s talk about the command-line options that give you fine-grained control over each part of the pipeline.
Model Parameters: You can specify parameters for each model component (STT, LM, TTS) directly via the command line. For example:
--lm_model_name google/gemma-2b-it
Generation Parameters: You can also adjust generation-specific settings, like setting the maximum number of new tokens for STT:
--stt_gen_max_new_tokens 128
This kind of granular control is also a lifesaver when you’re trying to fine-tune performance or tweak specific behaviors in the pipeline.
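To give a feel for how this prefixing can be wired up, here’s a simplified, hypothetical sketch that routes prefixed flags to the right component. The real pipeline’s argument parsing is more elaborate, and --stt_model_name below is just an illustrative flag name:

# hypothetical sketch of prefix-based flag routing, not the repo's parser
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lm_model_name", default="microsoft/Phi-3-mini-4k-instruct")
parser.add_argument("--stt_model_name", default="distil-whisper/distil-large-v3")
parser.add_argument("--stt_gen_max_new_tokens", type=int, default=128)
args = parser.parse_args()

def kwargs_for(prefix: str) -> dict:
    # group flags by prefix so each component only sees its own settings
    return {k[len(prefix):]: v for k, v in vars(args).items() if k.startswith(prefix)}

print(kwargs_for("stt_"))   # {'model_name': ..., 'gen_max_new_tokens': 128}
print(kwargs_for("lm_"))    # {'model_name': 'microsoft/Phi-3-mini-4k-instruct'}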
Some Key Parameters You Should Know About
Here are a few parameters that you’ll likely find useful:
Voice Activity Detection:
- --thresh: Sets the threshold for detecting voice activity.
- --min_speech_ms: The minimum duration for detected speech.
- --min_silence_ms: The minimum silence duration before an utterance is considered finished; it balances not cutting sentences short against keeping latency low (a toy sketch of how these three flags interact follows this parameter list).
Language Model:
- --init_chat_role: Sets the initial role in the chat template. For some models, this is crucial for getting the right output tone.
- --init_chat_prompt: Defines the initial context for the conversation. This can be key in steering the model’s responses.
Text to Speech:
- --description: Customizes the voice description used by Parler-TTS. This is where you can get creative!
- --play_steps_s: Adjusts the size of the audio chunks sent out during streaming, trading off how quickly the first audio is ready against decoding efficiency.
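And here’s the toy sketch of the VAD flags promised above. It’s a deliberately simplified, hypothetical end-of-turn detector, not the repository’s Silero-based handler, but it shows how the threshold, minimum speech, and minimum silence settings interact:

# toy illustration, not the repo's Silero VAD handler
def end_of_turn(speech_probs, thresh=0.3, min_speech_ms=500,
                min_silence_ms=1000, chunk_ms=30):
    """Return True once enough speech has been followed by enough silence."""
    speech_ms = silence_ms = 0
    for p in speech_probs:                 # one speech probability per audio chunk
        if p >= thresh:                    # chunk classified as speech
            speech_ms += chunk_ms
            silence_ms = 0                 # any new speech resets the silence counter
        else:
            silence_ms += chunk_ms
        if speech_ms >= min_speech_ms and silence_ms >= min_silence_ms:
            return True                    # a long enough pause: the user is done
    return False

# a burst of speech followed by a long pause ends the turn
print(end_of_turn([0.9] * 20 + [0.05] * 40))   # True

Raising min_silence_ms makes the agent less likely to cut you off mid-sentence, at the cost of extra latency before it starts answering.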
This pipeline is a fantastic starting point if you’re building AI voice agents.
The modularity and flexibility it offers mean you can adapt it to a wide range of use cases, whether you’re focusing on real-time applications or something more static.
Stay in the loop
So go ahead, clone the repo, and start experimenting.
The more you play around with these tools, the more you’ll realize just how much potential they have.
Bonus Content: Building with AI
And don’t forget to have a look at some practitioner resources that we published recently:
Thank you for stopping by and being an integral part of our community.
Happy building!