•
Tech stack: Gradio, Hugging Face, Llama 3.1, Llama Index, Ollama, Python, and Weaviate
Update:
TL;DR: A local book RAG system powered by Llama3.1, Llama Index, and a vector embedding, allowing users to retrieve information and summaries with or without books' metadata. The retrieval processes can be done on a UI for ease of use. Even though this system is capable of providing relatively accurate responses, this RAG system is still generating summaries in real time. Meaning that the system has to go through thousand of chunks of documents whenever users ask for book summaries, and it's considered computationally expensive.
This project is a local Retrieval Augmented Generation (RAG) system that retrieves and summarizes books. Since I have over 200 books in my library, I often find it difficult to discover interesting readings.
I created this project to help me summarize books that I have never read before. If the summary is interesting, I would consider spend time reading the book. If not, I would love to just read the summary of each chapter of the book.
What Llama Index and Langchain focus on (Source: https://superwise.ai/blog/lets-talk-about-llamaindex-and-langchain/)
Why Llama Index? From the figure above, Llama Index focuses on the indexing, storing, and retrieving data. While Langchain focuses on prompting and the output of the query. Since I am only focusing on indexing and retrieving data, I am going to incorporate Llama Index only for this project.
Why would we need RAG? According to Llama Index, RAG helps processing and adding external or personal data in the generation process. Adding this external data can help the model to generate more relevant and accurate responses. However, adding this external data to the model requires five stages: loading, indexing, storing, querying, and evaluation.
In order to get the local RAG system to work, three components are needed: an embedding model, a vector index, and a retrieval model.
Note that LLM models process information in chunks. If documents are too long, the model may be slow to find relevant or, worse, not be able to process them due to memory constraints. If too short, the model would generate irrelevant responses, or halluciante, due to lack of context.
1024 chunk size, and 100 chunk overlap
https://chunkviz.up.railway.app/ can help visualize the chunking process.
Since I am dealing with .epub
files, each document would be seperated into a chunk of 1024 words.
Thus, each document would have different number of chunks.
Initially, these are the limitations of the RAG system:
It is challenging for users who do not have a technical background to interact with the RAG system since it was only accessible via the command line.
Llama Index converts documents into vector embeddings and stores them in memory by default. This will be a problem if the number of documents increases.
The RAG system is only capable of retrieving job descriptions from 2-3 documents at a time. That is the default setting of Llama Index. User should be able to have more control over how many documents and how the documents are retrieved.
In some cases, the model responds with "No documents found" even though the documents are in the vector database. This is due to each document not having metadata.
In the case of this project, the model produces irrelevent responses even though the one of the documents retrieved is relevant. I later found out that, as the number of retrieved documents increases, the model is sometimes lost in contexts and generate irrelevant responses because the retrieval model has to combine the retrieved documents into a big chunk of text before passing them it into the language model.
I also tried the refine
mode to generate responses.
However, the model also generates irrelevant reponses as well.
The reason being is the model refines its response as it sees a new document.
Meaning, if there were 5 documents retrieved, the model will refine its response 5 times.
In addition to that, the generation process is slower since the model has to generate 5 different responses.
It's computationally expensive to retrieve information from thousand relevant chunks of documents and generate summarizations in real time whenever users ask for book summaries. So, it's better to have a separate database that stores each book's summary in a few paragraphs.
To improve the user experience, I created a web interface for the RAG system using Gradio.
Initial RAG UI
In order to improve the efficiency of retrieval processes, I incorporated Weaviate, a vector database, to store the documents that have been converted into vector embeddings. Not only the RAM usage is reduced significantly, but the retrieval process is generally faster.
From the figure above, you can see that uses can determine how many relevant documents they want to retrieve, as well as how the retrieval processes should be done.
The alpha parameter is used to determine the retrieval mode of the model. If the alpha parameter is 0, the model will retrieve data using the bm25 approach, meaning the model will retrieve data based on the similarity of the query and the document.
If the alpha parameter is 1, the model will retrieve data using the vector approach, meaning the model will retrieve data based on the similarity of the query and the vector embeddings. By default, the alpha parameter is set to .
RAG UI with search settings
Since each document has metadata, users can search documents accurately based on titles and authors. Thus, the retrieval model can retrieve corrent details from the correct documents so that the language model provides accurate facts or summarizations in its responses.
RAG UI with metadata search and search settings
By default, the model retrieves documents using compact
mode.
Meaning the model will combine texts into a big chunk of text and compose the relevant answer.
However, this mode is very accurate when the number of documents is small.
Conversely, the model may produce irrelevant responses as the number of documents increases.
To fix this issue, the model should use the tree_summarize
mode.
From the codebase tells me, this is what it does:
Build a tree index over the set of candidate nodes, with a summary prompt seeded with the query. The tree is built in a bottoms-up fashion, and in the end the root node is returned as the response.