What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by allowing them to pull in external knowledge during inference. Instead of relying solely on knowledge frozen into its weights at training time, a RAG system retrieves relevant documents from a database or knowledge base and feeds them into the model to generate more accurate, up-to-date responses.
The Problem It Solves
LLMs often struggle with:
- Outdated knowledge (training data cutoff dates)
- Hallucinations (confidently wrong answers)
- Lack of specific domain expertise
RAG solves these by connecting the model to a living knowledge source.
How RAG Works
- Indexing: Documents are chunked, embedded into vectors, and stored in a vector database.
- Retrieval: A user query is embedded, and the system finds the most semantically similar document chunks.
- Generation: The retrieved chunks are added to the prompt as context, and the LLM produces an answer grounded in that content.
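The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, the `index` list stands in for a vector database, and the final prompt would be sent to an LLM rather than printed.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector.
    A real system would use a learned embedding model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing: chunk documents, embed each chunk, store the vectors.
chunks = [
    "RAG retrieves documents at inference time.",
    "Fine-tuning updates model weights offline.",
    "Vector databases store document embeddings.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: embed the query and rank chunks by semantic similarity.
query = "How does RAG use documents at inference time?"
q_vec = embed(query)
ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
top_chunks = [chunk for chunk, _ in ranked[:2]]

# 3. Generation: prepend the retrieved chunks to the prompt as context,
# then hand the grounded prompt to the LLM.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

In practice the only pieces that change are the components behind each step: a real embedding model, an approximate-nearest-neighbor vector store, and an LLM call, but the index, retrieve, generate shape stays the same.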
A Simple Analogy
Think of RAG as an open-book exam: the LLM is the student, and the retrieved documents are the reference book. The student uses the book to answer questions accurately, rather than relying on memory alone.
Why RAG > Fine-Tuning
- RAG is dynamic: you can swap or update the knowledge base without retraining the model.
- Fine-tuning adjusts the model's weights for a specific task or style, but it is an unreliable way to add new factual knowledge.
For most use cases requiring current or domain-specific information, RAG is the more flexible and cost-effective choice.
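The "dynamic" point is worth making concrete: updating a RAG system's knowledge is a data operation, not a training run. In this sketch, a plain list stands in for the vector store, and the hypothetical pricing documents are invented for illustration.

```python
# Stand-in for a vector store: updating knowledge means editing data,
# not retraining the model. (Documents here are hypothetical examples.)
knowledge_base = ["The 2023 pricing page lists three tiers."]

# New information arrives: re-index the new document and it is
# immediately available to the next query.
knowledge_base.append("The 2024 pricing page lists four tiers.")

# A retrieval over the updated store now surfaces the current fact,
# with zero changes to the model's weights.
latest = [doc for doc in knowledge_base if "2024" in doc]
print(latest[0])
```

By contrast, getting the same update into a fine-tuned model would require preparing training examples and running another training job.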