Denser Retriever: Combining Keyword, Vector, and ML Re-Ranking for Superior RAG Performance

2024-11-12


Denser Retriever uses gradient boosting to combine different retrieval paradigms: keyword search, vector search, and ML re-ranking.



Now, here’s where Denser Retriever shines.


It uses XGBoost, a powerful gradient boosting algorithm, to combine the scores from these different methods.


By training on features like keyword relevance scores, vector similarities, and re-ranker outputs, it learns the optimal way to weight each component to maximize retrieval performance.
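To make the idea concrete, here is a minimal sketch of how per-retriever scores could be turned into feature rows and combined into one ranking. It rests on assumptions: the min-max normalization, the sample scores, and the hand-picked weights are all made up for illustration, and the weighted sum is only a stand-in for the trained XGBoost model (which learns decision trees, not fixed weights). It is not Denser Retriever's actual implementation.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so they are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def build_features(
    es: Dict[str, float], vs: Dict[str, float], rr: Dict[str, float]
) -> Dict[str, List[float]]:
    """One row per passage: [keyword, vector, re-ranker]; missing scores become 0.0."""
    es_n, vs_n, rr_n = (min_max_normalize(s) for s in (es, vs, rr))
    doc_ids = set(es) | set(vs) | set(rr)
    return {d: [es_n.get(d, 0.0), vs_n.get(d, 0.0), rr_n.get(d, 0.0)] for d in doc_ids}


def rank(features: Dict[str, List[float]], weights: List[float]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for the learned model: a fixed weighted sum."""
    scored = {d: sum(w * x for w, x in zip(weights, row)) for d, row in features.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores from the three components for three passages.
es_scores = {"p1": 12.0, "p2": 4.0}             # keyword search
vs_scores = {"p2": 0.9, "p3": 0.5}              # vector search
rr_scores = {"p1": 2.0, "p2": 6.0, "p3": -1.0}  # re-ranker

features = build_features(es_scores, vs_scores, rr_scores)
ranking = rank(features, weights=[0.2, 0.3, 0.5])
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

In the real system the weights are not chosen by hand: the gradient-boosted model is trained on labeled query-passage pairs, so it can also pick up interactions between the features.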


In experiments on Massive Text Embedding Benchmark (MTEB) datasets, this ensemble approach (denoted as ES+VS+RR_n) significantly outperformed the baseline vector search (VS), with substantial improvements in metrics like NDCG@20 and Recall@20.
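If these metrics are new to you, the sketch below computes Recall@k and NDCG@k for a single query. The passage IDs and relevance labels are invented for illustration, and the DCG here uses the linear-gain form (gain divided by log2 of the rank plus one); some benchmarks use an exponential gain instead, which gives the same result for binary labels.

```python
import math
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def dcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """Discounted cumulative gain: relevant hits count less the lower they rank."""
    return sum(gains.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """DCG divided by the DCG of the best possible ordering."""
    ideal = sorted(gains, key=gains.get, reverse=True)
    idcg = dcg_at_k(ideal, gains, k)
    return dcg_at_k(ranked, gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical query: p1 and p2 are relevant, the retriever returned p2, p9, p1.
gains = {"p1": 1.0, "p2": 1.0}
ranked = ["p2", "p9", "p1"]

recall_2 = recall_at_k(ranked, set(gains), k=2)
recall_3 = recall_at_k(ranked, set(gains), k=3)
ndcg_3 = ndcg_at_k(ranked, gains, k=3)
print(recall_2, recall_3, round(ndcg_3, 4))
```

Benchmark numbers such as NDCG@20 are the average of this per-query value over all queries in the dataset.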



The team also put Denser Retriever to the test using the Anthropic Contextual Retrieval Dataset.


What’s cool about this dataset is that it includes augmented document contexts, allowing us to see how well our approach handles more complex retrieval scenarios.


Some key findings from the Anthropic experiments:



Let’s see it in action.



Getting Started with Denser Retriever


Let’s create a virtual environment and install required libraries.

mkdir denser-retriever && cd denser-retriever  

python3 -m venv denser-retriever-env  
source denser-retriever-env/bin/activate  

pip3 install denser-retriever  
pip3 install cohere  
pip3 install jupyter ipykernel


Then install Elasticsearch and Milvus locally. You need Docker and Docker Compose, both of which are included in Docker Desktop for Mac and Windows. First, download the compose file:

wget https://raw.githubusercontent.com/denser-org/denser-retriever/main/docker-compose.dev.yml -O docker-compose.yml


Then start the services:

docker compose up -d



Once it’s finished, you can confirm that the containers are running (for example, with docker ps).
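You can also check from Python that both services accept connections. This is a minimal sketch using only the standard library; it assumes the default ports used later in this tutorial (9200 for Elasticsearch, 19530 for Milvus), and an open port only tells you that something is listening there, not that the service is fully initialized.

```python
import socket
from typing import Dict

# Ports taken from the connection URLs used later in this tutorial.
SERVICES = {"Elasticsearch": 9200, "Milvus": 19530}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> Dict[str, bool]:
    """Map each service name to whether its port is reachable."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(f"{name}: {'reachable' if is_up else 'not reachable'}")
```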



One last thing: create the file experiments/models/msmarco_xgb_es+vs+rr_n.json in the project’s root folder. This is the XGBoost model file that the retriever loads in Step 3.


We are ready to roll!


Indexing and Querying a Local File with Denser Retriever


The following script sets up a retrieval system that pulls relevant document chunks for a text query.


It combines keyword search, vector search, and re-ranking to produce the final results.


Here’s how it works, step-by-step.


Step 0: Initial Setup


We import the building blocks, then create the re-ranker and the embedding model:

from langchain_community.document_loaders import TextLoader  

from langchain_text_splitters import RecursiveCharacterTextSplitter  

from denser_retriever.gradient_boost import XGradientBoost  
from denser_retriever.keyword import ElasticKeywordSearch, create_elasticsearch_client  
from denser_retriever.retriever import DenserRetriever  
from denser_retriever.vectordb.milvus import MilvusDenserVectorDB  

from denser_retriever.embeddings import SentenceTransformerEmbeddings  
from denser_retriever.reranker import HFReranker  

reranker = HFReranker(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=3)  
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2", embedding_size=384, one_model=True)


Step 1: Loading the Document


We start by loading a text file with TextLoader from LangChain.

docs = TextLoader("./state_of_the_union.txt").load()


Here, docs is a list of documents ready for processing. The state_of_the_union.txt file (place it in the root folder) holds the content we’ll query against later.


Step 2: Splitting the Text


Since documents can be too large for our model, we split them into smaller chunks. RecursiveCharacterTextSplitter divides the document into manageable parts that overlap slightly, making it easier for the model to catch context between chunks.

# Set up the splitter  
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)  
texts = text_splitter.split_documents(docs)  

index_name = "state_of_the_union"


Now, texts contains smaller sections of our document, each one overlapping slightly with the next. This overlap can help when a query matches across chunk boundaries.
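To see what the overlap does, here is a deliberately simplified sliding-window chunker. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name and the sample string are ours, and the sketch only illustrates how chunk_size and chunk_overlap interact.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows, each sharing chunk_overlap characters with the previous one."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.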


Step 3: Setting Up Denser Retriever

retriever = DenserRetriever(  
    index_name=index_name,  
    vector_db=MilvusDenserVectorDB(  
        auto_id=True,  
        connection_args={"uri": "http://localhost:19530"},  
    ),  
    keyword_search=ElasticKeywordSearch(  
        es_connection=create_elasticsearch_client(url="http://localhost:9200"), drop_old=True  
    ),  
    reranker=reranker,  
    gradient_boost=XGradientBoost("experiments/models/msmarco_xgb_es+vs+rr_n.json"),  
    embeddings=embeddings,  
    combine_mode="model",  
    xgb_model_features="es+vs+rr_n",  
)


It has a few components, so let’s go through them one by one:


Configuring the Vector Database


Here, we set up Milvus as our vector database, where all those split text chunks will be stored as embeddings. It’ll be responsible for storing the semantic vectors, so we can quickly retrieve similar ones later.

vector_db = MilvusDenserVectorDB(  
    auto_id=True,  
    connection_args={"uri": "http://localhost:19530"}  
)


In this case, auto_id=True generates an ID for each document automatically. We also connect to Milvus at localhost:19530, assuming it’s already running on our machine.
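Conceptually, a vector search compares the query embedding with every stored embedding and returns the closest ones. The sketch below does this by brute force with cosine similarity on tiny hand-written vectors. Treat it as an illustration only: the three-dimensional vectors and chunk IDs are invented, cosine is an assumed metric, and Milvus relies on approximate nearest-neighbor indexes rather than a full scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "index": real embeddings from all-MiniLM-L6-v2 would have 384 dimensions.
index = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.6, 0.8, 0.0],
    "chunk-2": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], index, k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```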


Setting Up Keyword Search


For traditional keyword search, we use Elasticsearch. This lets us combine dense (semantic) and sparse (keyword) search results, so the system is flexible with user queries.

keyword_search = ElasticKeywordSearch(  
    es_connection=create_elasticsearch_client(url="http://localhost:9200")  
)


Here, create_elasticsearch_client connects to an Elasticsearch instance, letting us pull results based on keyword matching.
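Elasticsearch scores keyword matches with BM25 by default. To give a feel for what that score rewards, here is a small sketch of the textbook BM25 formula using the same default parameters (k1=1.2, b=0.75). The whitespace tokenizer and the three sample sentences are simplifications of ours; Elasticsearch applies proper text analysis and its exact scores will differ.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the president nominated judge jackson",
    "the economy grew last year",
    "judge jackson is a top legal mind",
]
scores = bm25_scores("judge jackson", docs)
for doc, score in zip(docs, scores):
    print(round(score, 3), doc)
```

The second sentence shares no terms with the query and scores zero, which is exactly the kind of miss that the vector search is there to compensate for when the wording differs but the meaning is close.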


Step 4: Ingesting the Documents


We feed the texts we split earlier into our retriever, which stores their embeddings in Milvus and indexes them in Elasticsearch so they’re ready for querying.

retriever.ingest(texts)


Step 5: Running a Query


Finally, we test our retriever by querying what the president said about “Ketanji Brown Jackson.” The retriever returns the top result based on the combined search and ranking strategy.

query = "What did the president say about Ketanji Brown Jackson"  
res = retriever.retrieve(query, 1)  

for r in res:  
    print("page_content: " + r[0].page_content)  
    print("metadata: " + str(r[0].metadata))  
    print("score: " + str(r[1]))


Each result includes content, metadata, and a score. The score shows how relevant the document chunk is to the query.

page_content: One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.   

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.  
metadata: {'source': './state_of_the_union.txt', 'title': '', 'pid': 'bd5195489a294650b4bae5f4ae011b89'}  
score: 1.7323288917541504




Step 6: Cleanup


When done, we can clear the index to free up space.

retriever.delete_all()


That’s it! By combining Milvus vector search, Elasticsearch keyword search, a re-ranker, and an XGBoost ranking model, the retriever aims to return the chunks that are most relevant to each query.


Here’s the full code snippet:

from langchain_community.document_loaders import TextLoader  
from langchain_text_splitters import RecursiveCharacterTextSplitter  

from denser_retriever.gradient_boost import XGradientBoost  
from denser_retriever.keyword import ElasticKeywordSearch, create_elasticsearch_client  
from denser_retriever.retriever import DenserRetriever  
from denser_retriever.vectordb.milvus import MilvusDenserVectorDB  
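from denser_retriever.embeddings import SentenceTransformerEmbeddings
from denser_retriever.reranker import HFReranker

# Same re-ranker and embedding model as in Step 0
reranker = HFReranker(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=3)
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2", embedding_size=384, one_model=True)

# Same file location as in Step 1
docs = TextLoader("./state_of_the_union.txt").load()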

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)  
texts = text_splitter.split_documents(docs)  

index_name = "state_of_the_union"  
retriever = DenserRetriever(  
    index_name=index_name,  
    vector_db=MilvusDenserVectorDB(  
        auto_id=True,  
        connection_args={"uri": "http://localhost:19530"},  
    ),  
    keyword_search=ElasticKeywordSearch(  
        es_connection=create_elasticsearch_client(url="http://localhost:9200"),  
    ),  
    reranker=reranker,  
    gradient_boost=XGradientBoost("experiments/models/msmarco_xgb_es+vs+rr_n.json"),  
    embeddings=embeddings,  
    combine_mode="model",  
    xgb_model_features="es+vs+rr_n",  
)  

retriever.ingest(texts)  

query = "What did the president say about Ketanji Brown Jackson"  
res = retriever.retrieve(query, 1)  

for r in res:  
    print("page_content: " + r[0].page_content)  
    print("metadata: " + str(r[0].metadata))  
    print("score: " + str(r[1]))  

retriever.delete_all()


For more information, check out the repo.



Thank you for stopping by, and being an integral part of our community.


Happy building!