Image credit: DALL-E 3 by OpenAI - prompt: "Cool header image for an article about Retrieval Augmented Generation (RAG)"
Around the time ChatGPT hit the scene in December 2022, a technique called Retrieval Augmented Generation (RAG) was gaining traction that could turn any kind of documentation into a conversational agent, capable of answering an endless stream of questions without ever tiring.
Probably due to its simplicity and universal appeal, this use case quickly became a popular entry point for experimenting with large language models and private data. So, how does it work?
I started by running the source code that accompanies the following tutorial, one of the first to demonstrate a local RAG pipeline using the brand-new Llama 2 model and no external dependencies. Perfect for querying private data!
Running Llama 2 on CPU Inference Locally for Document Q&A
As the tutorial explains, the pipeline embeds the documents into vectors ahead of time, embeds each incoming query the same way, retrieves the most similar chunks, and "stuffs" their text into the prompt alongside the question.
But one thing still confuses me: how exactly does langchain format the retrieved chunks when it "stuffs" the prompt? A topic for another post, when I figure it out.
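To make the retrieve-then-stuff flow concrete, here is a toy sketch of the idea. It is not langchain's implementation: it uses bag-of-words counts in place of a real embedding model, and the document snippets and prompt template are made up for illustration.

```python
# Toy RAG sketch: embeddings are used only to *retrieve* documents;
# the retrieved text (not the vectors) is stuffed into the prompt.
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (illustration only)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical document chunks standing in for a private knowledge base.
documents = [
    "Llama 2 is an open-weight large language model released by Meta.",
    "Retrieval Augmented Generation stuffs retrieved text into the prompt.",
    "CPU inference is slower than GPU inference but needs no GPU hardware.",
]
doc_vectors = [embed(d) for d in documents]  # indexed once, up front

def build_prompt(question, k=1):
    # 1. Embed the question and rank documents by cosine similarity.
    q_vec = embed(question)
    ranked = sorted(range(len(documents)),
                    key=lambda i: cosine(q_vec, doc_vectors[i]),
                    reverse=True)
    # 2. "Stuff" the top-k documents' text into the prompt as context.
    context = "\n".join(documents[i] for i in ranked[:k])
    return (f"Use the context to answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

prompt = build_prompt("Who released the Llama 2 model?")
print(prompt)
```

The key point the sketch shows: by the time the LLM sees anything, the embeddings are gone; only plain text reaches the prompt.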