Llama 3 Powered Voice Assistant: Integrating Local RAG with Qdrant, Whisper, and LangChain


Voice-enabled AI applications will forever change how we interact with technology.

You have all heard the recent news from OpenAI and Google: multimodal systems are the future.

With human-like voices, voice assistants will scale any conversational task, whether it’s inbound sales, customer support or data collection and verification.

That’s why OpenAI and Google introduced multimodal capabilities across the GPT and Gemini families of models, accommodating text, audio, image, and video inputs, to get an early share of enterprise adoption for various use cases.

For example, GPT-4o matches and exceeds the performance of GPT-4, and it’s also

There were also many posts on social media showing how much better the code interpreter is, and that it does a noticeably better job at data analysis and visualisations.

This is huge for application developers, and we can expect open-source models to catch up with closed-source models in 2024.

That’s why in this tutorial, I would like to walk you through the creation of a sophisticated voice assistant using some of the most advanced open source models available today.

We will use the following:

Let’s build a Llama 3 powered voice assistant that is not only responsive but also intelligent and capable of scaling efficiently.

Voice Assistants across Industries

Multimodal voice assistants can interact with users through smart speakers, smartphones, wearable devices, and smart home systems.

They are capable of integrating with other technologies, such as augmented reality (AR) and virtual reality (VR), to offer immersive experiences.

For instance, a user could ask a voice assistant for directions, and it could not only provide verbal instructions but also display the route on a connected VR/AR headset such as Apple Vision Pro.

This convergence of modalities allows for richer and more interactive engagements, catering to a wider range of user needs and preferences.

The way industries provide their products and services to their customers will also change:

You get the idea: by integrating these systems into our daily operations, we can enhance productivity, customer experience, and satisfaction.

Let’s start crafting a tool that could redefine the way we interact with our digital environments.

Developing Llama 3 Powered Voice Assistant

Before starting the tutorial, ensure you have the following resources ready:

For Google Colab users, we first need to mount Google Drive in the Colab environment to access and use our data for computation.

from google.colab import drive

# Specify a different mountpoint
mountpoint = "/content/my_drive"

# Mount the Google Drive
drive.mount(mountpoint)

Otherwise, I will run all the examples locally.

Now, we need to install the following libraries

Let’s first set up a virtual environment and install libraries. Open your command line interface — this could be your Command Prompt, Terminal, or any other CLI tool you’re comfortable with — and run the following commands:

# Create a virtual environment  
mkdir llama3-whisper && cd llama3-whisper  
python3 -m venv llama3-whisper-env
source llama3-whisper-env/bin/activate
# Install dependencies  
pip3 install --no-deps torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1  
pip3 install openai  
pip3 install -q transformers==4.33.0   
pip3 install -q accelerate==0.22.0   
pip3 install -q einops==0.6.1   
pip3 install -q langchain==0.0.300   
pip3 install -q xformers==0.0.21  
pip3 install -q bitsandbytes==0.41.1   
pip3 install -q sentence_transformers==2.2.2  
pip3 install arxiv  
pip3 install -q ipykernel jupyter  
pip3 install -q --upgrade huggingface_hub

Lastly, to prepare your environment to extract data from PDF files, perform OCR, and create embeddings for advanced data handling and retrieval, we have to install a few more:

pip3 install unstructured  
pip3 install "unstructured[pdf]"  
apt-get install -y poppler-utils  
pip3 install pytesseract  
apt-get install -y tesseract-ocr  
pip3 install --upgrade qdrant-client  
pip3 install WhisperSpeech

As a last step, let’s log in to the Hugging Face Hub and open our IDE.

# Log in to the Hugging Face Hub  
huggingface-cli login  

# Optionally, fire up VSCode or your favorite IDE and let's get rolling!  
code .

Great. To continue, you can create either a .py file or an .ipynb file (notebook). I will continue with a Jupyter notebook to run code in blocks and interactively inspect the results.

Time to build the voice assistant!

Importing Libraries

We import all necessary libraries that support various aspects of this setup, including model interactions, document processing, and embeddings management.

import os  
import sys  
import arxiv  
from torch import cuda, bfloat16  
import torch  
import transformers  
from transformers import AutoTokenizer, AutoModelForCausalLM  
from time import time  
from langchain.llms import HuggingFacePipeline  
from langchain.document_loaders import PyPDFLoader,DirectoryLoader,WebBaseLoader  
from langchain.text_splitter import RecursiveCharacterTextSplitter,CharacterTextSplitter  
from langchain.embeddings import HuggingFaceEmbeddings  
from langchain.chains import RetrievalQA  
from langchain.vectorstores import Qdrant  
from pathlib import Path  
from openai import OpenAI  
from IPython.display import Audio, display  
from whisperspeech.pipeline import Pipeline

Handling Data for Voice Assistants

Before we continue, I want to stop and elaborate a little on building data pipelines for AI-enabled applications.

Data pipelines are crucial for efficiently managing and processing data within modern applications, especially when developing sophisticated applications like voice assistants powered by RAG-enabled LLMs.

These pipelines typically involve five key phases:

However, building data pipelines can be painstakingly complex and it’s beyond the scope of this tutorial. If you want to see it in action, please drop a comment.

To keep this tutorial simple, we will use research papers from arXiv.

Let’s create a directory, then search for and download papers matching the “LLM” search term:

dirpath = "arxiv_papers"
# Create the download directory if it does not exist yet
if not os.path.exists(dirpath):
    os.makedirs(dirpath)

search = arxiv.Search(
  query = "LLM", # your query length is limited by ARXIV_MAX_QUERY_LENGTH which is 300 characters
  max_results = 10,
  sort_by = arxiv.SortCriterion.LastUpdatedDate, # you can also use SubmittedDate or Relevance
  sort_order = arxiv.SortOrder.Descending
)

With the search defined, download the papers:

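from time import sleep
from urllib.error import HTTPError

for result in search.results():
    while True:
        try:
            result.download_pdf(dirpath=dirpath)
            print(f"-> Paper id {result.get_short_id()} with title '{result.title}' is downloaded.")
            break
        except FileNotFoundError:
            print("File not found")
            break
        except HTTPError as e:
            print(f"HTTP error: {e}")
            break
        except ConnectionResetError:
            print("Connection reset by peer")
            sleep(5)  # wait briefly, then retry the same paper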

The output will look similar to this (the exact papers depend on when you run the search):

-> Paper id 2405.10311v1 with title 'UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models' is downloaded.
-> Paper id 2405.10288v1 with title 'Timeline-based Sentence Decomposition with In-Context Learning for Temporal Fact Extraction' is downloaded.
-> Paper id 2405.07703v4 with title 'OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs' is downloaded.
-> Paper id 2405.10276v1 with title 'Revisiting OPRO: The Limitations of Small-Scale LLMs as Optimizers' is downloaded.
-> Paper id 2405.10255v1 with title 'When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models' is downloaded.
-> Paper id 2405.10251v1 with title 'A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks' is downloaded.
-> Paper id 2405.10250v1 with title 'IntelliExplain: Enhancing Interactive Code Generation through Natural Language Explanations for Non-Professional Programmers' is downloaded.
-> Paper id 2405.08997v2 with title 'LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages' is downloaded.
-> Paper id 2308.04662v2 with title 'VulLibGen: Identifying Vulnerable Third-Party Libraries via Generative Pre-Trained Model' is downloaded.
-> Paper id 2405.10212v1 with title 'CPsyExam: A Chinese Benchmark for Evaluating Psychology using Examinations' is downloaded.
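The download loop retries inside a `while True`, which can spin forever if the connection keeps resetting. As a minimal sketch of an alternative, here is a bounded retry helper with exponential backoff. The `retry` function and its parameters are my own illustration, not part of the arxiv library.

```python
import time


def retry(action, attempts=3, base_delay=1.0,
          retry_on=(ConnectionResetError,), sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
    The sleep function is injectable so the helper can be tested quickly.
    """
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

In the loop above, this could be used as `retry(lambda: result.download_pdf(dirpath=dirpath))`, assuming you are happy to let the final error propagate.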

Cool, we will now chunk these papers into meaningful pieces.

Brief Overview of Retrieval Augmented Generation

RAG workflows help us to manage and utilize data from various sources, to deliver accurate and relevant results.

Here’s a brief overview:

1. Data Loading: Collect data from different sources like text files, PDFs, websites, databases, or APIs. For example, Llama Hub provides many connectors to make this step easier.

2. Indexing: In the indexing stage, the system transforms raw data into vector embeddings and organizes them.

3. Storage: Save the indexed data and labels so you won’t have to organize it again later.

4. Querying: During the querying stage, the system retrieves the most relevant documents based on the query vector.

5. Evaluation: Evaluation can be quite challenging due to LLMs’ stochastic nature. However, there are effective metrics and tools available to conduct objective evaluations.

Example metrics include faithfulness, answer relevance, context precision, context recall, context relevance, context entities recall, answer semantic similarity, and answer correctness.
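To make the retrieval-side metrics a bit more concrete, here is a simplified, set-based sketch of precision and recall for a single query. It assumes you have hand-labeled which chunk IDs are relevant. Evaluation frameworks may compute their context metrics quite differently, so treat this as an illustration of the idea only.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query.

    precision: share of retrieved chunks that are relevant
    recall:    share of relevant chunks that were retrieved
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical chunk IDs: 4 chunks retrieved, 3 chunks labeled as relevant
p, r = precision_recall(["c1", "c2", "c3", "c4"], ["c2", "c4", "c9"])
print(f"precision={p:.2f} recall={r:.2f}")
```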

If you want me to elaborate on LLM and RAG evaluation with latest libraries and frameworks, please drop a comment.

Let’s move on.

Text Splitters

To do that, we will use text_splitter to manage large text documents by dividing them into smaller, manageable chunks:

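1. RecursiveCharacterTextSplitter splits text recursively into smaller pieces, suitable for very large texts.

It has 2 main parameters: chunk_size, the maximum size of each chunk, and chunk_overlap, the amount of text shared between consecutive chunks.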

This is usually best suited for very large texts without natural segmentation points, and it prevents context loss by maintaining overlaps between chunks, ensuring that subsequent processing has continuity.
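To show what the overlap buys us, here is a minimal sliding-window sketch in plain Python. It is a simplification, not the algorithm LangChain uses (the recursive splitter also tries to break on separators), and `chunk_with_overlap` is a name I made up for illustration.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbours share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk repeats the tail of the previous one, which is what keeps a sentence that straddles a boundary readable in at least one chunk.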

2. CharacterTextSplitter splits text based on specified character separators, ideal for texts with natural breaks.

It has 3 main parameters:

Ideal for texts with clear demarcation points, such as scripts or documents with well-defined sections, and it ensures data integrity by splitting texts at natural breaks, which helps in maintaining the meaning and context without the need for overlaps.
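As a rough sketch of separator-based splitting, the function below splits on a separator and then greedily merges neighbouring pieces until a size limit is reached. The name `split_on_separator` and the merge policy are my own assumptions for illustration; the library's behaviour may differ in the details.

```python
def split_on_separator(text, separator="\n\n", chunk_size=200):
    """Split text at separator, then merge pieces up to chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut,
    so chunks always end at a natural break.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Intro.\n\nSection one.\n\nSection two is longer."
print(split_on_separator(text, chunk_size=20))
```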

These tools are crucial for preparing text for NLP models, because we want the data to be in a manageable size while retaining the necessary context.

Documents Loader

Another piece of the puzzle is document loaders, which are essential for handling different data sources in NLP workflows.

Each type of loader is tailored for specific sources:

All loaders serve the primary function of collecting data, which is then processed and potentially used for generating embeddings.

In this setup, we will use DirectoryLoader and RecursiveCharacterTextSplitter to efficiently chunk and manage multiple files, but you can select any loader that suits your data source needs.

Let’s see how the splitter and document loader come together in practice:

papers = []
loader = DirectoryLoader(dirpath, glob="./*.pdf", loader_cls=PyPDFLoader)
papers = loader.load()
print("Total number of pages loaded:", len(papers))

# This merges all pages from all papers into a single text block for chunking
full_text = ''
for paper in papers:
    full_text = full_text + paper.page_content

# Drop empty lines and join the rest with spaces
full_text = " ".join(l for l in full_text.splitlines() if l)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 50
)

paper_chunks = text_splitter.create_documents([full_text])

Total number of pages loaded: 157

OK, we can now configure our model.

Model Configuration

This code configures a Meta LLaMA 3 model for language generation tasks:

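1. Configuration: set the model ID (meta-llama/Meta-Llama-3-8B-Instruct), the device (cuda), and the data type (torch.bfloat16).

2. Initialization: load the tokenizer and the model with these settings.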

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  
device = "cuda"  
dtype = torch.bfloat16  

tokenizer = AutoTokenizer.from_pretrained(model_id)  
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)
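Before loading the model, it helps to estimate whether the weights will fit on your GPU. Here is a back-of-the-envelope sketch. I am assuming roughly 8 billion parameters based on the "8B" in the model name, and the estimate covers the weights only, not activations or the KV cache.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


approx_params = 8e9  # assumption: taken from the "8B" in the model name
print(f"bfloat16: {weight_memory_gib(approx_params):.1f} GiB")
print(f"float32:  {weight_memory_gib(approx_params, 'float32'):.1f} GiB")
```

This is why bfloat16 is a sensible choice here: it halves the footprint compared to float32.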

Setting Up the Query Pipeline

Now we will set up a query_pipeline for text generation using the Hugging Face transformers library, designed to simplify the use of the pre-trained model and tokenizer:

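# The arguments were cut off in the source; this minimal call is inferred from the text above
query_pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)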

Initializing the Pipeline

The code initializes a HuggingFacePipeline object with our configured query_pipeline for streamlined text generation.

llm = HuggingFacePipeline(pipeline=query_pipeline)

Handling Model Loading with Fallback to Local Resources

We will now load the sentence-transformers/all-mpnet-base-v2 embedding model from Hugging Face's repository, configured to run on a CUDA device.

If this process encounters any issues, such as connectivity problems or access restrictions, you can also add an exception handler that falls back to a locally stored embedding model.

With this approach, our application can continue processing with an alternative model if the primary source is unavailable, which helps us to maintain robustness in different operational environments.

model_name = "sentence-transformers/all-mpnet-base-v2"  
model_kwargs = {"device": "cuda"}  

# try to access the sentence transformers from HuggingFace: https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2
try:
    embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
except Exception as ex:  
    print("Exception: ", ex)  
    # # alternatively, we will access the embeddings models locally  
    # local_model_path = "/kaggle/input/sentence-transformers/minilm-l6-v2/all-MiniLM-L6-v2"  
    # print(f"Use alternative (local) model: {local_model_path}\n")  
    # embeddings = HuggingFaceEmbeddings(model_name=local_model_path, model_kwargs=model_kwargs)
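The same try-then-fall-back idea can be written as a small reusable helper. This is a minimal sketch with stand-in loaders. `load_with_fallback` is a hypothetical name, and in the real setup each loader would wrap a `HuggingFaceEmbeddings(...)` call.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in order.

    Returns (name, result) for the first loader that succeeds and raises
    RuntimeError with all collected messages if every loader fails.
    """
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as ex:
            errors.append(f"{name}: {ex}")
    raise RuntimeError("All loaders failed: " + "; ".join(errors))


# Stand-in loaders for illustration only
def load_from_hub():
    raise ConnectionError("hub unreachable")


def load_from_disk():
    return "local-embeddings"


source, model = load_with_fallback([("hub", load_from_hub), ("disk", load_from_disk)])
print(f"Loaded embeddings from: {source}")
```

Returning the name of the source that worked makes it easy to log which model your application actually ended up using.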

Integrating Qdrant for Embedding Storage and Retrieval

We will use Qdrant as our vector database for its exceptional capabilities in handling vector similarity searches, scalability, and flexible vector data management.

Additionally, Qdrant supports local and cloud storage options, so that you can adapt to various on-prem and cloud environments.
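If vector similarity search is new to you, here is a toy version in plain Python. It uses word counts as a stand-in for real embeddings and cosine similarity for ranking. None of this is Qdrant's API; it only illustrates the idea that the database implements at scale.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, vec), text) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


docs = [
    "qdrant stores vectors for similarity search",
    "whisper converts speech to text",
    "llama 3 generates text answers",
]
index = [(doc, embed(doc)) for doc in docs]
print(search(index, "vector similarity search", k=1))
```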

We have already installed Qdrant and are importing it from LangChain’s vector stores, as indicated by the line in our code: from langchain.vectorstores import Qdrant

We can now integrate Qdrant’s vector database capabilities into our application to manage and retrieve embeddings. Let’s go!

Storing Document Embeddings in Qdrant Vector Database

The Qdrant.from_documents method handles this step: it takes the documents and the embedding model as input, computes the embeddings, and stores them in a collection.

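# The arguments were cut off in the source; they are inferred from the surrounding text, and the collection name is a placeholder
vectordb = Qdrant.from_documents(paper_chunks, embeddings, path="Qdrant_Persist", collection_name="voice_assistant_documents")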

Here's a breakdown of the parameters used:

Reusing Persisted Data in Qdrant Vector Database

You can skip this section, but if you want to use an existing persisted vector database, you can set up a QdrantClient that connects to the specific storage location.

This configuration allows you to reconnect to and use the existing database:

from qdrant_client import QdrantClient  

client = QdrantClient(path = "Qdrant_Persist")  

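# The arguments were cut off in the source; collection_name is a placeholder and must match the name used when the data was stored
vectordb = Qdrant(client=client, collection_name="voice_assistant_documents", embeddings=embeddings)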

Setting Up the Retriever

We now need to set up a retrieval-based question-answering (QA) system that utilizes the stored embeddings within our Qdrant vector database:

retriever = vectordb.as_retriever()  

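# The arguments were cut off in the source; llm and retriever are the objects defined earlier
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)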
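Conceptually, a retrieval QA chain places the retrieved chunks into the prompt ahead of the question. Here is a minimal sketch of that step. The template wording and the `max_chars` budget are my own assumptions, not LangChain's actual prompt.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Place retrieved chunks into a single prompt, in ranking order.

    Stops adding chunks once the context budget (max_chars) would be exceeded.
    """
    context_parts, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("What is Qdrant?", ["Qdrant is a vector database."]))
```

Seeing the prompt this way also explains why chunk size matters: larger chunks mean fewer of them fit into the model's context window.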

Testing and Visualizing the RAG System

We implement functions to test and visualize the Retrieval-Augmented Generation (RAG) system:

These functions help test the RAG system’s performance while presenting the results in a clear, visually appealing manner.

from IPython.display import display, Markdown  

def colorize_text(text):  
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):  
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")  
    return text  

def test_rag(qa, query):  

    time_start = time()  
    response = qa.run(query)  
    time_end = time()  
    total_time = f"{round(time_end-time_start, 3)} sec."  

    full_response =  f"Question: {query}\nAnswer: {response}\nTotal time: {total_time}"
    # Render the colorized summary in the notebook
    display(Markdown(colorize_text(full_response)))
    return response
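If you want to check the timing and formatting logic without loading the model or running a notebook, you can separate it from the display code. This is a small sketch where `timed_answer` is a hypothetical helper and a simple stub stands in for the QA chain.

```python
from time import perf_counter


def timed_answer(answer_fn, query):
    """Run answer_fn(query) and return (response, formatted report)."""
    start = perf_counter()
    response = answer_fn(query)
    elapsed = perf_counter() - start
    report = (
        f"Question: {query}\n"
        f"Answer: {response}\n"
        f"Total time: {round(elapsed, 3)} sec."
    )
    return response, report


# A stub stands in for the real chain here
response, report = timed_answer(lambda q: q.upper(), "hello")
print(report)
```

With the real chain, the call would look like `timed_answer(qa.run, query)`.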

Integration of Llama 3 and Whisper for Text-to-Speech Processing

Let’s see what’s happening here:

Let’s first define the WhisperSpeech pipeline.

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

Then we use Llama 3 for text generation by passing our query, and after that we use WhisperSpeech for audio generation.

query = "How LLMs can be used to understand and interact with the complex 3D world"
aud = test_rag(qa, query)

# Synthesize the answer as speech and play it in the notebook
pipe.generate_to_notebook(aud)


Answer: LLMs can be used to understand and interact with the complex 3D
world by leveraging their inherent strengths, including world knowledge
and reasoning abilities. This can be achieved by integrating LLMs with 3D
data, such as 3D models, point clouds, or meshes, to enable tasks like
spatial comprehension, navigation, and interaction within 3D environments.
LLMs can also be used to generate 3D data, such as 3D models or textures,
and to reason about the relationships between objects and their spatial
layout. Furthermore, LLMs can be used to plan and predict the outcomes of
actions in 3D environments, enabling more sophisticated forms of interaction
and manipulation. Overall, the integration of LLMs with 3D data presents a
unique opportunity to enhance computational models' understanding of and
interaction with the physical world, leading to innovations across various
domains. 11, 12], and robotic manipulation [13, 14, 15]. Recent works have
demonstrated the potential of integrating LLMs with 3D data to interpret,
reason, or plan in complex 3D environments, by leveraging the inherent
strengths of LLMs
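Notice that the answer ends with citation residue from the source papers. Before sending text to a speech model, it may help to clean it up and break it into shorter pieces. Here is a minimal sketch. The function name, the regular expressions, and the 200-character limit are assumptions for illustration, not requirements of WhisperSpeech.

```python
import re


def prepare_for_speech(text, max_chars=200):
    """Strip bracketed numeric citations and group sentences into short pieces."""
    text = re.sub(r"\[\d+(?:\s*,\s*\d+)*\]", "", text)    # e.g. "[13, 14, 15]"
    text = re.sub(r"\s+", " ", text)                       # collapse whitespace
    text = re.sub(r"\s+([.,;:!?])", r"\1", text).strip()   # no space before punctuation
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    pieces, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


sample = ("LLMs help with navigation [11, 12] and robotic manipulation [13, 14, 15]. "
          "They can also plan actions.")
for piece in prepare_for_speech(sample, max_chars=60):
    print(piece)
```

Each piece could then be passed to the speech pipeline one at a time.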

Nice, you can now go ahead and try a few more questions about the downloaded papers to understand the system's strengths, and start thinking of ways to overcome its weaknesses.

Typically, to enhance the performance, several key parameters and strategies can be optimized. For example:

There is a lot to focus on, but don’t get overwhelmed by everything you could do. Just focus on creating the first prototype, as the components that we are using here are already high quality.

Over time, you can iterate on the system to make it more accurate, efficient, and user-friendly.

Closing Remarks

Multimodal applications are the future.

I wanted to give a glimpse of how integrating technologies like Whisper, LLaMA 3, LangChain, and vector databases such as Qdrant lets us build responsive, intelligent voice assistants that process human language in real time.

Thank you for stopping by, and being an integral part of our community.

Happy building!