Why is RAG all the rage?

· One min read
Benjamin D. Brodie

Image credit: DALL-E 3 by OpenAI - prompt: "Cool header image for an article about Retrieval Augmented Generation (RAG)"

Around the time ChatGPT hit the scene in late 2022, a technique began drawing broad attention for its ability to turn just about any collection of documents into a conversational agent, capable of answering an endless stream of questions without ever tiring.

Probably due to its simplicity and universal appeal, this use case quickly became a popular entry point for experimenting with large language models and private data. So, how does it work?

I started by running the source code that accompanies the following tutorial, one of the first to demonstrate a local RAG pipeline built on the then brand-new Llama 2 model, with no calls to external services. Perfect for querying private data!

Running Llama 2 on CPU Inference Locally for Document Q&A
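To give a feel for what such a pipeline involves, here is a minimal sketch in the spirit of that tutorial (not its exact code). The document path, model file, chunk sizes, and the sample question are all placeholders of my own; the tutorial's repository has the authoritative version.

```python
# Minimal local RAG pipeline sketch (classic langchain 0.0.x-era API).
# All paths, parameter values, and the sample question are illustrative placeholders.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA

# 1. Load the private documents and split them into overlapping chunks.
docs = TextLoader("data/my_private_notes.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# 2. Embed each chunk and index the vectors in a local FAISS store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Load a quantized Llama 2 model for CPU inference via ctransformers.
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # placeholder path to a local GGML file
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.01},
)

# 4. Wire retrieval and generation together: fetch the most similar chunks
#    and "stuff" their text into the prompt alongside the question.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
)

print(qa.run("What does the vacation policy say about carry-over days?"))
```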

As the tutorial explains, the documents are split into chunks, the chunks are converted into embeddings and stored in a vector index, and at query time the chunks most similar to the question are retrieved and handed to the LLM, together with the question, as a plain text prompt.
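In other words, the model never sees the embedding vectors themselves; they are only used to find the relevant chunks, whose text is then concatenated into the prompt. A framework-free sketch of that final step follows; the template wording is my own illustration, not LangChain's actual internal format.

```python
# Sketch of "stuffing" retrieved chunks into a single prompt.
# The template text here is illustrative, not LangChain's internal template.

def build_stuffed_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Concatenate the retrieved chunk texts into one context block.
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Example: two chunks returned by the vector-store similarity search.
chunks = [
    "Employees may carry over up to five unused vacation days.",
    "Carry-over days must be used before the end of March.",
]
print(build_stuffed_prompt("How many vacation days can I carry over?", chunks))
```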

But one thing still confuses me - what prompt template does LangChain actually use to "stuff" the retrieved chunks into the prompt? A topic for another post, when I figure it out.