Denser Retriever: Combining Keyword, Vector, and ML Re-Ranking for Superior RAG Performance
2024-11-12
Denser Retriever leverages gradient boosting to combine different retrieval paradigms:
- Keyword-Based Search: This is our classic search method, leveraging techniques like BM25 to fetch documents that precisely match the query terms. It’s great for exact matches but can miss semantically relevant content that uses different wording.
- Vector Search with Embeddings: By encoding documents and queries into high-dimensional vectors using models like BERT or Sentence Transformers, we capture semantic relationships. This means we can find relevant results even when the exact keywords aren’t present.
- Machine Learning Re-Rankers: After retrieving candidates from the above methods, we apply a re-ranker — often a transformer-based model — that fine-tunes the ordering based on deeper contextual understanding.
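To make the re-ranking step concrete, here is a minimal sketch using sentence-transformers' CrossEncoder with the same MS MARCO model the tutorial below uses (the query and passages are made up for illustration):
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly,
# capturing interactions that independent embeddings miss.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do vaccines work"
passages = [
    "Vaccines train the immune system to recognize pathogens.",
    "The stock market closed higher on Friday.",
]
scores = reranker.predict([(query, p) for p in passages])
print(sorted(zip(scores, passages), reverse=True))  # relevant passage ranks first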
Now, here’s where Denser Retriever shines.
It uses XGBoost, a powerful gradient boosting algorithm, to combine the scores from these different methods.
By training on features like keyword relevance scores, vector similarities, and re-ranker outputs, it learns the optimal way to weight each component to maximize retrieval performance.
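Conceptually, the combination looks something like the following sketch: an illustrative XGBRanker trained on made-up feature rows, not the denser-retriever internals.
import numpy as np
import xgboost as xgb

# Each row is one (query, candidate) pair:
# [keyword (BM25) score, vector similarity, re-ranker score]
X = np.array([
    [12.3, 0.82, 0.91],
    [8.1, 0.77, 0.40],
    [3.4, 0.65, 0.12],
])
y = np.array([2, 1, 0])   # graded relevance labels
group = [3]               # all three candidates belong to the same query

ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=50)
ranker.fit(X, y, group=group)

# At query time, score fresh candidates with the learned combination
print(ranker.predict(np.array([[10.0, 0.80, 0.85]])))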
In experiments on the Massive Text Embedding Benchmark (MTEB) datasets, this ensemble approach (denoted ES+VS+RR_n) significantly outperformed the vector search baseline (VS), with substantial improvements in metrics like NDCG@20 and Recall@20.
The team also put Denser Retriever to the test using the Anthropic Contextual Retrieval Dataset.
What’s cool about this dataset is that it includes augmented document contexts, allowing us to see how well our approach handles more complex retrieval scenarios.
Some key findings from the Anthropic experiments:
- Scalability: The original Anthropic setup wasn’t scalable — it loaded all embeddings into memory, which isn’t feasible for large corpora. Denser Retriever integrates with scalable solutions like Elasticsearch for keyword search and vector databases like Milvus for vector search, making it suitable for industrial-scale applications.
- Flexibility with Models: Denser Retriever lets you choose between paid API models and open-source alternatives. For instance, the team compared results using the paid Voyage-2 embedding model and Cohere’s re-ranker against open-source models like BAAI’s bge-reranker-base. Surprisingly, the open-source models achieved comparable accuracy, which is a big win for cost-sensitive deployments.
- Improved Accuracy: By combining keyword search, vector search, and re-ranking using XGBoost, you can achieve higher recall and NDCG scores compared to using any single method. For example, in the contextual experiments, the combination method achieved an NDCG@20 of 0.81169, outperforming individual methods.
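For reference, NDCG@k, the headline metric above, rewards placing relevant documents near the top of the ranked list. A minimal implementation looks like this:
import numpy as np

def ndcg_at_k(relevances, k):
    # DCG of the ranked list divided by the DCG of the ideal ordering
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts))
    return float(np.sum(rel * discounts)) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0], k=20))  # 1.0: relevant doc at rank 1
print(ndcg_at_k([0, 0, 1, 0], k=20))  # 0.5: same doc buried at rank 3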
Let’s see it in action.
Getting Started with Denser Retriever
Let’s create a virtual environment and install the required libraries.
mkdir denser-retriever && cd denser-retriever
python3 -m venv denser-retriever-env
source denser-retriever-env/bin/activate
pip3 install denser-retriever
pip3 install cohere
pip3 install jupyter ipykernel
Then set up Elasticsearch and Milvus locally via Docker (you need Docker and Docker Compose, both included in Docker Desktop for Mac and Windows). First, download the Compose file:
wget https://raw.githubusercontent.com/denser-org/denser-retriever/main/docker-compose.dev.yml -O docker-compose.yml
Then start the services:
docker compose up -d
Once it’s finished, you can confirm that the containers are running, for example with:
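docker compose ps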
One last thing: place the XGBoost model file experiments/models/msmarco_xgb_es+vs+rr_n.json (from the denser-retriever repository) under the project root.
We are ready to roll!
Indexing and Querying a Local File with Denser Retriever
The following script sets up a hybrid retrieval system that pulls relevant documents for text queries.
It combines different approaches for searching and ranking results, making it pretty powerful.
Here’s how it works, step-by-step.
Step 0: Initial Setup
This step is pretty self-explanatory:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from denser_retriever.gradient_boost import XGradientBoost
from denser_retriever.keyword import ElasticKeywordSearch, create_elasticsearch_client
from denser_retriever.retriever import DenserRetriever
from denser_retriever.vectordb.milvus import MilvusDenserVectorDB
from denser_retriever.embeddings import SentenceTransformerEmbeddings
from denser_retriever.reranker import HFReranker
# Cross-encoder re-ranker that re-scores the top candidates
reranker = HFReranker(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=3)
# Sentence-transformer embedding model producing 384-dimensional vectors
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2", embedding_size=384, one_model=True)
Step 1: Loading the Document
We start by loading a text file into our system using TextLoader from LangChain, which helps us load documents with ease.
docs = TextLoader("./state_of_the_union.txt").load()
Here, docs is now a list of documents ready for processing. The state_of_the_union.txt file (place it in the project root) holds the content we’ll query against later.
Step 2: Splitting the Text
Since documents can be too large for our model, we split them into smaller chunks. RecursiveCharacterTextSplitter divides the document into manageable parts that overlap slightly, making it easier for the model to catch context between chunks.
# Set up the splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(docs)
index_name = "state_of_the_union"
Now, texts contains smaller sections of our document, each one overlapping slightly with the next. This overlap can help when a query matches across chunk boundaries.
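A quick, optional sanity check on the split:
print(f"{len(texts)} chunks")
print(texts[0].page_content[:200])  # preview the first chunk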
Step 3: Setting Up Denser Retriever
retriever = DenserRetriever(
    index_name=index_name,
    vector_db=MilvusDenserVectorDB(
        auto_id=True,
        connection_args={"uri": "http://localhost:19530"},
    ),
    keyword_search=ElasticKeywordSearch(
        es_connection=create_elasticsearch_client(url="http://localhost:9200"),
        drop_old=True,
    ),
    reranker=reranker,
    gradient_boost=XGradientBoost("experiments/models/msmarco_xgb_es+vs+rr_n.json"),
    embeddings=embeddings,
    combine_mode="model",  # combine scores with the trained XGBoost model
    xgb_model_features="es+vs+rr_n",  # features: Elasticsearch, vector search, normalized re-ranker
)
It has a few components, so let’s walk through them one by one:
Configuring the Vector Database
Here, we set up Milvus as our vector database, where all those split text chunks will be stored as embeddings. It’ll be responsible for storing the semantic vectors, so we can quickly retrieve similar ones later.
vector_db = MilvusDenserVectorDB(
    auto_id=True,
    connection_args={"uri": "http://localhost:19530"},
)
In this case, auto_id=True generates an ID for each document automatically. We also connect to Milvus at localhost:19530, assuming it’s already running on our machine.
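If you want to verify that connection independently of Denser Retriever, a quick check with the pymilvus client (assuming it is installed) might look like:
from pymilvus import connections, utility

connections.connect(uri="http://localhost:19530")
print(utility.get_server_version())  # should print the running Milvus version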
Setting Up Keyword Search
For traditional keyword search, we use Elasticsearch. This lets us combine dense (semantic) and sparse (keyword) search results, so the system is flexible with user queries.
keyword_search = ElasticKeywordSearch(
    es_connection=create_elasticsearch_client(url="http://localhost:9200")
)
Here, create_elasticsearch_client connects to an Elasticsearch instance, letting us pull results based on keyword matching.
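Under the hood this is standard BM25-scored full-text search. Purely for illustration, a raw query against the same local instance might look like this (the index and field names here are hypothetical, not necessarily what Denser Retriever creates):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# BM25-scored full-text match; "content" is an illustrative field name
resp = es.search(index="state_of_the_union", query={"match": {"content": "Ketanji Brown Jackson"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("content", "")[:80])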
Step 4: Ingesting the Documents
We feed the texts we split earlier into our retriever, effectively storing them in Milvus so they’re ready for querying.
retriever.ingest(texts)
Step 5: Running a Query
Finally, we test our retriever by querying what the president said about “Ketanji Brown Jackson.” The retriever returns the top result based on the combined search and ranking strategy.
query = "What did the president say about Ketanji Brown Jackson"
res = retriever.retrieve(query, 1)
for r in res:
    print("page_content: " + r[0].page_content)
    print("metadata: " + str(r[0].metadata))
    print("score: " + str(r[1]))
Each result includes content, metadata, and a score. The score shows how relevant the document chunk is to the query.
page_content: One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
metadata: {'source': './state_of_the_union.txt', 'title': '', 'pid': 'bd5195489a294650b4bae5f4ae011b89'}
score: 1.7323288917541504
Step 6: Cleanup
When done, we can clear the index to free up space.
retriever.delete_all()
That’s it! By combining Milvus, Elasticsearch, and an XGBoost ranking model, we make sure that the responses to our queries are accurate and relevant.
Here’s the full code snippet:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from denser_retriever.gradient_boost import XGradientBoost
from denser_retriever.keyword import ElasticKeywordSearch, create_elasticsearch_client
from denser_retriever.retriever import DenserRetriever
from denser_retriever.vectordb.milvus import MilvusDenserVectorDB
from denser_retriever.embeddings import SentenceTransformerEmbeddings
from denser_retriever.reranker import HFReranker

reranker = HFReranker(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=3)
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2", embedding_size=384, one_model=True)

docs = TextLoader("./state_of_the_union.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(docs)
index_name = "state_of_the_union"

retriever = DenserRetriever(
    index_name=index_name,
    vector_db=MilvusDenserVectorDB(
        auto_id=True,
        connection_args={"uri": "http://localhost:19530"},
    ),
    keyword_search=ElasticKeywordSearch(
        es_connection=create_elasticsearch_client(url="http://localhost:9200"),
        drop_old=True,
    ),
    reranker=reranker,
    gradient_boost=XGradientBoost("experiments/models/msmarco_xgb_es+vs+rr_n.json"),
    embeddings=embeddings,
    combine_mode="model",
    xgb_model_features="es+vs+rr_n",
)

retriever.ingest(texts)

query = "What did the president say about Ketanji Brown Jackson"
res = retriever.retrieve(query, 1)
for r in res:
    print("page_content: " + r[0].page_content)
    print("metadata: " + str(r[0].metadata))
    print("score: " + str(r[1]))

retriever.delete_all()
For more information, check out the repo.
Thank you for stopping by and for being an integral part of our community.
Happy building!