RAG Introduction

In this part, we’ll start the journey of building Retrieval-Augmented Generation (RAG) applications using Spring AI. RAG applications have already gained a lot of attention from industry, as RAG is a typical area where AI technology delivers practical value.

Reduce Hallucinations

The main goal of using RAG is to reduce hallucinations when using language models for chat completions. Language models are trained on material from various sources. Once a language model is trained, its parametric knowledge is frozen. When a prompt asks about information that was not present in the training material, the model is likely to hallucinate in its output.

For example, when we ask a language model (Llama 3) the question Who won the gold medal in men’s 100 meters at 2020 Olympic Games?, the model provides the correct result.

Figure 4. QA using model internal knowledge

If we change the question to Who won the gold medal in men’s 100 meters at 2024 Olympic Games?, the language model cannot provide a meaningful answer, because the 2024 Olympic Games took place after the model’s training data was collected.

Figure 5. QA without model internal knowledge

There are three approaches to reduce hallucinations: model fine-tuning, tools, and RAG.

  • Model fine-tuning works by further training the model on extra material so that the new knowledge becomes part of its parameters.
  • Tools allow a model to interact with external systems that provide the necessary information.
  • RAG works by augmenting the original prompt with retrieved content that the model uses to generate output.

The idea behind RAG is quite simple. A frozen language model by itself lacks the information needed to generate output for certain prompts. If we augment the prompt with content from external sources, the model can leverage the provided content to generate the output. The included content is retrieved from an external system and must be semantically similar to the original prompt.

Let’s go back to the example above. If we augment the original prompt with content from a web page, the model can generate meaningful output.

Figure 6. QA with provided content

Naive RAG

Naive RAG is the simplest way to implement RAG. As the name suggests, it may not provide the best results, but it is a good starting point.

The diagram below shows the architecture of naive RAG. Given a user prompt, documents similar to the prompt are retrieved from a vector database. The retrieved documents and the original prompt are combined to form the final input to a language model, which generates the output.

Figure 7. Naive RAG
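
Spring AI provides built-in support for this flow through its advisor mechanism. The sketch below is only a rough illustration: it assumes the QuestionAnswerAdvisor API (whose package has moved between Spring AI versions) and a ChatModel and VectorStore that are already configured.

import org.springframework.ai.chat.client.ChatClient;
// In newer Spring AI versions the advisor lives in the
// org.springframework.ai.chat.client.advisor.vectorstore package.
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.vectorstore.VectorStore;

public class NaiveRagClient {

    public String ask(ChatModel chatModel, VectorStore vectorStore, String question) {
        // The advisor retrieves documents similar to the user prompt from the
        // vector store and appends them to the prompt before the model is called,
        // which is the flow shown in the diagram above.
        ChatClient chatClient = ChatClient.builder(chatModel)
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .build();

        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}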

Vector databases play an important role in RAG. Reference documents are stored in a vector database. Text content is converted to a vector (an array of floating-point numbers) by a text embedding model. For a given document, both its content and its embedding vector are stored in the vector database. A prompt is converted to a vector with the same embedding model. The similarity between the prompt’s vector and the vectors of the reference documents is then calculated with a vector similarity algorithm. In this way, semantic similarity between texts is turned into similarity between vectors.
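
To make this concrete, here is a rough sketch of storing documents and searching for documents similar to a question, using Spring AI’s VectorStore and EmbeddingModel abstractions. It assumes a recent Spring AI version; older releases named the embedding interface EmbeddingClient and returned List<Double> instead of float[].

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.VectorStore;

public class VectorStoreSketch {

    public List<Document> storeAndSearch(VectorStore vectorStore,
                                         EmbeddingModel embeddingModel,
                                         String question) {
        // The vector store embeds each document's text with its configured
        // embedding model and stores both the text and the vector.
        vectorStore.add(List.of(
                new Document("Noah Lyles won the gold medal in men's 100 meters at the 2024 Olympic Games."),
                new Document("Lamont Marcell Jacobs won the gold medal in men's 100 meters at the 2020 Olympic Games.")));

        // The same embedding model converts the question into a vector;
        // shown here only to illustrate the raw embedding.
        float[] questionVector = embeddingModel.embed(question);

        // similaritySearch() compares the question vector with the stored
        // document vectors and returns the most similar documents.
        return vectorStore.similaritySearch(question);
    }
}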

After similar documents are retrieved from the vector database, they are combined with the original prompt. The combination is usually done with a predefined template. In the template below, we instruct the language model to answer the question using the provided content.

Answer the question using provided content:

Question: {question}

Provided content: {content}

Answer:
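
As a sketch of how the pieces fit together, the retrieved documents can be joined into the {content} variable and the template rendered before calling the model. The example below assumes Spring AI’s PromptTemplate and ChatClient APIs; Document.getText() was getContent() in older versions.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

public class NaiveRagFlow {

    private static final String TEMPLATE = """
            Answer the question using provided content:

            Question: {question}

            Provided content: {content}

            Answer:
            """;

    public String answer(ChatClient chatClient, VectorStore vectorStore, String question) {
        // 1. Retrieve documents similar to the question from the vector database.
        List<Document> documents = vectorStore.similaritySearch(question);

        // 2. Join the text of the retrieved documents into a single block of content.
        String content = documents.stream()
                .map(Document::getText)
                .collect(Collectors.joining("\n"));

        // 3. Fill in the template and send the augmented prompt to the language model.
        Map<String, Object> variables = Map.of("question", question, "content", content);
        String prompt = new PromptTemplate(TEMPLATE).render(variables);

        return chatClient.prompt().user(prompt).call().content();
    }
}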