RAG: Retrieval Augmented Generation Explained

What is RAG?

Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models by giving them access to external knowledge sources. Instead of relying solely on the knowledge learned during training, RAG allows LLMs to retrieve relevant information from your documents, databases, or other sources before generating a response.

The RAG Process

Query: User asks a question
Retrieve: System finds relevant documents/chunks
Augment: Retrieved context is added to the prompt
Generate: LLM produces an answer using the context
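
The four steps map onto one small function. The sketch below is illustrative only: `answer_with_rag`, `toy_retrieve`, and `toy_generate` are hypothetical names, and the retriever and LLM are passed in as plain callables so the flow runs without any external service. The handbook sentence is made-up sample data.

```python
def answer_with_rag(question, retrieve, generate, k=4):
    """Run the RAG steps for one question: retrieve -> augment -> generate."""
    # Retrieve: find the k chunks most relevant to the question
    chunks = retrieve(question, k)
    # Augment: place the retrieved text in the prompt ahead of the question
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # Generate: the LLM answers with the context in view
    return generate(prompt)


# Hypothetical stand-ins so the sketch runs on its own
def toy_retrieve(question, k):
    handbook = ["Employees accrue 20 vacation days per year."]  # made-up data
    return handbook[:k]


def toy_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer_with_rag("What is the vacation policy?", toy_retrieve, toy_generate))
```

In a real system, `retrieve` would wrap a vector store lookup and `generate` would wrap an LLM call; the control flow stays the same.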

                    

Why RAG Matters

LLMs have significant limitations that RAG addresses:

Knowledge cutoff: LLMs only know information from their training data, which has a cutoff date
Hallucinations: LLMs can confidently generate false information
No private data: LLMs don't have access to your company's internal documents
Domain specificity: Generic LLMs may lack deep expertise in specialized fields
Source attribution: Without RAG, it's hard to verify where information came from

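RAG addresses these by grounding LLM responses in your actual data, which makes answers more accurate, current, and verifiable. It reduces hallucinations rather than eliminating them: the model can still misread or ignore the retrieved context, so the cited sources are worth checking.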

How RAG Works: The Technical Flow

1. Document Ingestion

First, your documents are processed and prepared for retrieval:

from langchain_community.document_loaders import PyPDFLoader  # Requires the pypdf package
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

2. Creating Embeddings

Each chunk is converted into a numerical vector (embedding) that captures its semantic meaning:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # Reads OPENAI_API_KEY from the environment

# Each chunk becomes a vector like:
# [0.023, -0.041, 0.089, ..., 0.012]  # 1536 dimensions with the default model
# Similar content = similar vectors
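
To make "similar content = similar vectors" concrete, the sketch below computes cosine similarity, a common measure for comparing embeddings. The three-dimensional vectors are made-up stand-ins for real embeddings, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real ones have far more dimensions
vacation_policy = [0.9, 0.1, 0.0]
time_off_rules = [0.8, 0.2, 0.1]
server_setup = [0.0, 0.1, 0.9]

print(cosine_similarity(vacation_policy, time_off_rules))  # close to 1
print(cosine_similarity(vacation_policy, server_setup))    # close to 0
```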

3. Storing in Vector Database

Embeddings are stored in a vector database for fast similarity search:

from langchain_community.vectorstores import Chroma

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

4. Retrieval at Query Time

When a user asks a question, we find the most relevant chunks:

# User's question is also embedded
query = "What is the vacation policy?"

# Find similar chunks using vector similarity
relevant_docs = vectorstore.similarity_search(query, k=4)

# Returns the 4 most relevant document chunks
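
Conceptually, a similarity search scores every stored vector against the query vector and keeps the best k. The sketch below does this by brute force in plain Python; `top_k` and the tiny index are hypothetical, and production vector databases typically use approximate nearest-neighbor indexes rather than a full scan.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, index, k=4):
    """Return the k texts whose vectors are most similar to the query vector.

    index is a list of (text, vector) pairs.
    """
    scored = ((cosine_similarity(query_vector, vec), text) for text, vec in index)
    return [text for _, text in heapq.nlargest(k, scored)]


# Made-up 2-dimensional vectors standing in for real embeddings
index = [
    ("vacation", [0.9, 0.1]),
    ("servers", [0.1, 0.9]),
    ("holidays", [0.7, 0.3]),
]
print(top_k([1.0, 0.0], index, k=2))
```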

5. Generation with Context

The retrieved chunks are added to the prompt, and the LLM generates an answer:

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")
retriever = vectorstore.as_retriever()

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the vacation policy?"})
print(result["result"])
print(result["source_documents"])  # Citations!
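
Under the hood, the "augment" step is string formatting. The sketch below shows one possible prompt layout with numbered sources; `build_prompt` and the template wording are assumptions for illustration, not the prompt that RetrievalQA uses internally.

```python
def build_prompt(question, chunks):
    """Format retrieved chunks as numbered sources ahead of the question."""
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number, and say so if the sources do not contain the answer.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_prompt(
    "What is the vacation policy?",
    ["Employees accrue 20 vacation days per year."],  # made-up sample chunk
))
```

Numbering the sources gives the model something to cite, which supports the source attribution that RAG is meant to provide.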

Chunking Strategies

How you split documents significantly impacts retrieval quality:

Fixed Size Chunking

Split by character count. Simple but may break sentences.

chunk_size=1000, chunk_overlap=200
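
A minimal sketch of fixed-size chunking with overlap, written in plain Python rather than with a library splitter. `chunk_fixed` is a hypothetical helper; the parameter names mirror the LangChain ones used earlier.

```python
def chunk_fixed(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunk_size-character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_fixed("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.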

Recursive Chunking

Tries to split at natural boundaries (paragraphs, sentences).

RecursiveCharacterTextSplitter
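
The idea behind recursive chunking can be sketched in plain Python: split on the coarsest separator first, pack neighboring pieces together while they fit, and fall back to finer separators only for pieces that are still too long. This is a simplified illustration with no overlap handling, not the library's implementation; `recursive_split` and its default separators are assumptions.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring coarse boundaries."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        # Pack neighboring parts into groups while they still fit
        groups, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_len:
                current = candidate
            else:
                if current:
                    groups.append(current)
                current = part  # may still be too long; handled below
        if current:
            groups.append(current)
        # Any group that is still too long is split on the finer separators
        chunks = []
        for group in groups:
            chunks.extend(recursive_split(group, max_len, separators[i + 1:]))
        return chunks
    # No separator left: hard cut by character count
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for chunk in recursive_split(sample, max_len=30):
    print(repr(chunk))
```

The short first paragraph survives intact, while the long second paragraph is broken at word boundaries instead of mid-word.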

Semantic Chunking

Groups content by meaning, keeping related ideas together.

SemanticChunker

Document-Specific

Special splitters for Markdown, code, HTML, etc.

MarkdownHeaderTextSplitter

Best Practices for Chunking

Chunk size should be large enough for context but small enough for specificity
Use overlap to avoid losing information at boundaries
Consider your document structure (headers, sections)
Test different chunk sizes for your specific use case

Vector Databases

Vector databases are specialized for storing and querying embeddings:

Database	Type	Best For	Key Features
Pinecone	Managed	Production apps	Fully managed, scales well
ChromaDB	Open-source	Prototyping	Easy setup, Python-native
Weaviate	Open-source	Complex queries	GraphQL, hybrid search
Qdrant	Open-source	High performance	Rust-based, filtering
Milvus	Open-source	Large scale	Billion-scale vectors
pgvector	Postgres extension	Existing Postgres	Vector search inside your Postgres DB

Advanced RAG Techniques

1. Hybrid Search

Combine semantic search with keyword search for better results:

# Combine vector similarity with BM25 keyword search
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # Requires the rank_bm25 package

bm25 = BM25Retriever.from_documents(chunks)  # Same chunks that were embedded
vector_retriever = vectorstore.as_retriever()

ensemble = EnsembleRetriever(
    retrievers=[bm25, vector_retriever],
    weights=[0.4, 0.6]  # Relative weight of keyword vs. semantic results
)
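
One common way to merge ranked lists from different retrievers is weighted reciprocal rank fusion (RRF): each document earns weight / (c + rank) from every list it appears in, and the sums decide the final order. The sketch below illustrates the idea in plain Python and is not LangChain's implementation; `weighted_rrf`, the document IDs, and the constant c=60 (a conventional default) are assumptions.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # keyword search results, best first
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # semantic search results, best first

print(weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.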

2. Reranking

Use a more sophisticated model to reorder retrieved results:

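from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # Requires langchain-cohere and COHERE_API_KEY

# The model name is an example; pick one from Cohere's current rerank models
reranker = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever  # First-stage retriever, e.g. vectorstore.as_retriever()
)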
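
The two-stage pattern is: retrieve a generous candidate set cheaply, then score each (query, document) pair with a stronger model and keep the top few. The sketch below shows only that control flow. `rerank` and `word_overlap` are hypothetical, and the word-overlap scorer is a toy stand-in for a real reranking model such as a cross-encoder.

```python
import re


def word_overlap(query, doc):
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    doc_words = set(re.findall(r"[a-z]+", doc.lower()))
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


def rerank(query, candidates, score=word_overlap, top_n=3):
    """Reorder first-stage candidates by a pairwise score and keep the best top_n."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]


# Made-up first-stage results, in the order the retriever returned them
candidates = [
    "Servers restart nightly.",
    "Vacation policy: 20 days per year.",
    "The vacation request form is online.",
]
print(rerank("vacation policy", candidates, top_n=2))
```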

3. Query Transformation

Improve queries before searching:

Query expansion: Generate multiple versions of the query
HyDE: Generate a hypothetical answer, then search for similar content
Step-back prompting: Ask a more general question first
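
Query expansion can be sketched as: rewrite the question several ways, search with each version, and merge the results without duplicates. In the sketch below, `multi_query_retrieve`, `toy_rewrite`, and `toy_retrieve` are hypothetical; a real system would call an LLM for the rewrites and a vector store for retrieval. The corpus is made-up sample data.

```python
import re


def multi_query_retrieve(question, rewrite, retrieve, k=4):
    """Search with the original question plus its rewrites; merge without duplicates."""
    seen, merged = set(), []
    for query in [question, *rewrite(question)]:
        for chunk in retrieve(query, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


CORPUS = [
    "Employees accrue 20 vacation days per year.",
    "Paid time off requests need manager approval.",
    "The office closes on public holidays.",
]


def toy_rewrite(question):
    # Stand-in for an LLM call that paraphrases the question
    return ["How much paid time off do employees get?"]


def toy_retrieve(query, k):
    # Stand-in for vector search: match on shared words longer than 3 letters
    words = {w for w in re.findall(r"[a-z]+", query.lower()) if len(w) > 3}
    hits = [c for c in CORPUS if words & set(re.findall(r"[a-z]+", c.lower()))]
    return hits[:k]


print(multi_query_retrieve("What is the vacation policy?", toy_rewrite, toy_retrieve))
```

The original wording only matches the first chunk; the rewrite also surfaces the "paid time off" chunk. HyDE and step-back prompting could be slotted into the same shape by changing what `rewrite` returns.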

4. Parent Document Retrieval

Store small chunks for retrieval but return larger parent documents for context:

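from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # Example size
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # Example size
docstore = InMemoryStore()  # Holds the parent chunks

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # Ideally a fresh collection for child chunks
    docstore=docstore,
    child_splitter=child_splitter,  # Small chunks
    parent_splitter=parent_splitter  # Large chunks
)
retriever.add_documents(documents)  # Indexes children, stores parents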
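
The mechanism is a lookup from each small chunk to the larger passage it came from. The sketch below shows that mapping with plain dictionaries; `retrieve_parents`, the IDs, and the texts are hypothetical, and the vector search that produces the matched child IDs is left out.

```python
def retrieve_parents(matched_child_ids, child_to_parent, parent_store):
    """Map matched child chunks to their parents, keeping rank order and dropping duplicates."""
    seen, parents = set(), []
    for child_id in matched_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(parent_store[parent_id])
    return parents


# Made-up data: two parent passages, each split into small child chunks
parent_store = {"p1": "Full leave policy section", "p2": "Full IT setup section"}
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}

# Suppose vector search matched these child chunks, best first
print(retrieve_parents(["c2", "c1", "c3"], child_to_parent, parent_store))
# ['Full leave policy section', 'Full IT setup section']
```

Two children of the same parent collapse into a single result, so the LLM sees each larger passage once.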

Common RAG Use Cases

Document Q&A

Answer questions about PDFs, Word docs, and other files.

Customer Support

AI that answers based on knowledge base articles and FAQs.

Code Documentation

Query codebases and technical documentation.

Research Assistant

Search and synthesize information from research papers.

Enterprise Search

Find information across company documents and wikis.

Legal/Compliance

Query regulations, contracts, and legal documents.

Evaluation & Monitoring

Measuring RAG quality is crucial:

Retrieval metrics: Precision, recall, MRR (Mean Reciprocal Rank)
Generation metrics: Faithfulness, relevance, groundedness
End-to-end: Answer correctness, user satisfaction

Tools like Ragas, TruLens, and LangSmith help evaluate RAG systems automatically.
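
The retrieval metrics are simple enough to compute by hand on a small labeled set. The sketch below uses the standard definitions of precision@k, recall@k, and MRR; the function names and document IDs are hypothetical, and each query is assumed to have at least one relevant document.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant document across queries.

    runs is a list of (retrieved, relevant) pairs; a query with no hit contributes 0.
    """
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


retrieved = ["d3", "d1", "d7", "d2"]  # made-up ranking for one query
relevant = {"d1", "d2"}               # made-up ground-truth labels

print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=3))
print(mean_reciprocal_rank([(retrieved, relevant), (["d5"], {"d5"})]))
```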
