30 May 2024

Personalized Retrieval-Augmented Generation System

Tech stack: Gradio, Hugging Face, Llama 3, Llama Index, Ollama, Python, and Weaviate

I created this project to gain insight into the job market and the specific requirements of industries like Machine Learning or Front-end Development. However, I want to take it a step further by designing my own Local Retrieval Augmented Generation (RAG) system that can generate more relevant responses to the job descriptions I've collected.

What Llama Index and Langchain focus on (Source: Llama Index and Langchain focus on (Source:

You might be wondering why use Llama Index? Why not Langchain? Looking at the image above, we can see that Llama Index focuses on the indexing, storing, and retrieving data. While Langchain focuses on prompting and the output of the query. Since I am only focusing on indexing and retrieving data, Llama Index is the choice for this project.

Why would we need RAG? According to Llama Index, RAG helps processing and adding external or personal data in the generation process. Adding this external data can help the model to generate more relevant and accurate responses. However, adding this external data to the model requires five stages: loading, indexing, storing, querying, and evaluation.

In this project, I will be including my resume and job descriptions from a local job board in Taiwan and see how the RAG system can help generate more relevant responses to the job descriptions. The job descriptions are in markdown format, and here one example:

AI Engineer
**Company:** ...
**Industry:** IT  
**Position:** AI Engineer
**Job Updated:** about 1 month ago  
**Type:** Full-time  
**Level:** Entry level  
**Experience Required:** 3 years  
**Salary:** $$$ TWD / month  
**Remote Work:** Optional
Job Description
- Research and share new AI-related technologies.
- Prepare and deliver in-house training course materials.
- Guide and execute AI-related projects.
**Mandatory Requirements:**
1. Bachelor's degree or above, any major.
2. Proactive, highly cooperative, with analytical problem-solving and communication skills.
3. Proficient in Python programming.
4. Experience in machine learning/deep learning/statistical analysis.
5. Willingness to travel occasionally to partner companies or institutions.
**Preferred Requirements:**
1. Experience in data visualization.
2. Familiarity with TensorFlow/Keras/Scikit-learn/Pytorch.
3. Familiarity with various data analysis algorithms.

My local RAG system is limited only to dealing with markdown files.

In order to get the local RAG system to work, three components are needed: an embedding model, a vector index, and a retrieval model.

As for the embedding mode, I am using BAAI/bge-large-en-v1.5 from Hugging Face As for the data processor and indexing, I am using Llama Index. As for the language model, I am using Llama 3 from Meta.

Note that LLM models process information in chunks. If the job descriptions are too long, the model may not be able to process them. If too short, the model may not be able to generate relevant responses.

1500 chunk size, and 100 chunk overlap.1500 chunk size, and 100 chunk overlap. can help visualize the chunking process.

Since all job descriptions are separated in different markdown files, I did not need to split the job descriptions into chunks. In addition to that, the job descriptions are quite short, so the model should be able to process with no hassle.


Prompt 2: "What skills required for those job descriptions?"Prompt 2: "What skills required for those job descriptions?"

Prompt 3: "What are common responsibilities these jobs have?"Prompt 3: "What are common responsibilities these jobs have?"

Prompt 4: "As a senior HR, how would you rate my resume?"Prompt 4: "As a senior HR, how would you rate my resume?"



It is challenging for users who do not have a technical background to interact with the RAG system on the command line.

Insufficent Memory

Llama Index converts documents into vector embeddings and stores them in memory by default. This will be a problem if the number of documents increases. Thus a vector database is needed to store the vector embeddings to avoid the over reliance on memory.

Limited Control

The RAG system is only capable of retrieving job descriptions from 2-3 documents at a time. That is the default setting of Llama Index. User should be able to have more control over how many documents and how the documents are retrieved.


In the case of this project, the model produces irrelevent responses even though the one of the documents retrieved is relevant. I later found out that, as the number of retrieved documents increases, the model captures a lot of noises from irrelevant documents and produces irrelevant responses.


User Interface

To improve the user experience, I created a web interface for the RAG system using Gradio.

Local RAG System Web InterfaceLocal RAG System Web Interface

Vector Database

In order to reduce the RAM usage, I incorporated Weaviate, a vector database, to store the documents that have been converted into vector embeddings. Not only the RAM usage is reduced significantly, but the retrieval process is also faster.

More Control

From the figure above, you can see that uses can determine how many relevant documents they want to retrieve, as well as the alpha parameter for the retrieval model.

The alpha parameter is used to determine the retrieval mode of the model. If the alpha parameter is 0, the model will retrieve data using the bm25 approach, meaning the model will retrieve data based on the similarity of the query and the document.

If the alpha parameter is 1, the model will retrieve data using the vector approach, meaning the model will retrieve data based on the similarity of the query and the vector embeddings. By default, the alpha parameter is set to 0.750.75.

Better Responses

By default, the model retrieves documents using compact mode. Meaning the model will combine texts into a big chunk of text and compose the relevant answer. However, this mode is very accurate when the number of documents is small. Conversely, the model may produce irrelevant responses as the number of documents increases.

I also tried the refine mode. As result, the model will refine its response as it sees a new document. Meaning, if there were 5 documents retrieved, the model will refine its response 5 times. In addition to that, the generation process is slower since the model has to generate 5 different responses.

To fix this issue, I used the tree_summarize mode. From the code base, this is what it does

Build a tree index over the set of candidate nodes, with a summary prompt seeded with the query. The tree is built in a bottoms-up fashion, and in the end the root node is returned as the response.

Instead of

query_engine = RetrieverQueryEngine.from_args(

The retriever engine should be configured as follows:

query_engine = RetrieverQueryEngine.from_args(
    "The original query is as follows: {query_str}. Here are the sources: {sources}"
    "Given the sources, refine the answer to better answer the original query."
    "If the answer is in the sources, please provide the answer or refine the answer."
    "If the answer is not in the sources, return the original answer."
  1. Observations
    • Limitations
    • Improvements