What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by giving them access to external knowledge sources. Instead of relying solely on what they learned during training, LLMs that use RAG retrieve relevant information from your documents, databases, or other sources before generating a response.
The RAG Process
- Query: User asks a question
- Retrieve: System finds relevant documents/chunks
- Augment: Retrieved context is added to the prompt
- Generate: LLM produces an answer using the context
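The four steps above can be sketched in a few lines of plain Python. This is a toy illustration rather than a production pipeline: retrieval uses simple word overlap instead of embeddings, the three documents are made up, and `fake_llm` is a hypothetical stand-in for a real model call.

```python
import re

# Toy knowledge base; in a real system these would be chunks of your documents.
DOCUMENTS = [
    "Employees receive 20 days of paid vacation per year.",
    "The office is open from 9am to 6pm on weekdays.",
    "Expense reports must be submitted within 30 days.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Retrieve: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Augment: place the retrieved context in front of the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


def fake_llm(prompt: str) -> str:
    """Generate: hypothetical stand-in for a real LLM call."""
    return f"(model answer based on a prompt of {len(prompt)} characters)"


def answer(query: str) -> str:
    """Run the full Query -> Retrieve -> Augment -> Generate loop."""
    context = retrieve(query, DOCUMENTS)
    prompt = augment(query, context)
    return fake_llm(prompt)
```

Calling `answer("How many vacation days do employees receive?")` retrieves the vacation sentence, wraps it in a prompt, and passes that prompt to the stand-in model.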
Why RAG Matters
LLMs have significant limitations that RAG addresses:
- Knowledge cutoff: LLMs only know information from their training data, which has a cutoff date
- Hallucinations: LLMs can confidently generate false information
- No private data: LLMs don't have access to your company's internal documents
- Domain specificity: Generic LLMs may lack deep expertise in specialized fields
- Source attribution: Without RAG, it's hard to verify where information came from
RAG mitigates these limitations by grounding LLM responses in your actual data, which makes answers more accurate, current, and verifiable.
How RAG Works: The Technical Flow
1. Document Ingestion
First, your documents are processed and prepared for retrieval:
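# Requires: pip install langchain-community langchain-text-splitters pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents (one Document per PDF page)
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # maximum characters per chunk
    chunk_overlap=200  # characters shared between neighboring chunks
)
chunks = splitter.split_documents(documents)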
2. Creating Embeddings
Each chunk is converted into a numerical vector (embedding) that captures its semantic meaning:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# Each chunk becomes a vector like:
# [0.023, -0.041, 0.089, ..., 0.012] # 1536 dimensions
# Similar content = similar vectors
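"Similar vectors" is commonly measured with cosine similarity, although the exact metric depends on how the vector store is configured. A minimal sketch, using made-up 3-dimensional vectors in place of real embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings"; real ones have hundreds or thousands of dimensions.
vacation_policy = [0.9, 0.1, 0.0]
holiday_rules = [0.8, 0.2, 0.1]
server_setup = [0.0, 0.2, 0.9]

related = cosine_similarity(vacation_policy, holiday_rules)    # close to 1
unrelated = cosine_similarity(vacation_policy, server_setup)   # close to 0
```

Texts about related topics point in nearly the same direction, so their score approaches 1, while unrelated texts score near 0.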
3. Storing in Vector Database
Embeddings are stored in a vector database for fast similarity search:
from langchain_community.vectorstores import Chroma
# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
4. Retrieval at Query Time
When a user asks a question, we find the most relevant chunks:
query = "What is the vacation policy?"
# similarity_search embeds the query with the same embedding model,
# then finds the stored chunks whose vectors are closest to it
relevant_docs = vectorstore.similarity_search(query, k=4)
# Returns the 4 most relevant document chunks
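Conceptually, a similarity search scores every stored vector against the query vector and keeps the best k. The brute-force sketch below assumes cosine similarity and tiny hand-made vectors; `top_k` is a hypothetical helper, not part of LangChain or Chroma, and real vector databases typically use approximate nearest-neighbor indexes instead of scanning everything.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, index, k=2):
    """Return the k (text, score) pairs most similar to query_vector.

    `index` is a list of (text, vector) pairs.
    """
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in index]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up chunks and vectors for illustration only.
index = [
    ("Vacation: 20 paid days per year", [0.9, 0.1, 0.0]),
    ("Sick leave requires a doctor's note", [0.6, 0.4, 0.1]),
    ("VPN setup instructions", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of "What is the vacation policy?"
results = top_k(query_vector, index, k=2)
```

Here `results` holds the vacation chunk first and the sick-leave chunk second; the VPN chunk is dropped because its vector points in a different direction.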
5. Generation with Context
The retrieved chunks are added to the prompt, and the LLM generates an answer:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4")
retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)
result = qa_chain.invoke({"query": "What is the vacation policy?"})
print(result["result"])
print(result["source_documents"]) # Citations!
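What "adding chunks to the prompt" means can be sketched without any library. The template wording, the `build_prompt` helper, and the sample chunks below are assumptions for illustration; `RetrievalQA` uses its own internal prompt.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    context_lines = [
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite sources by number. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up retrieved chunks with their source metadata.
retrieved = [
    {"source": "handbook.pdf, p. 12", "text": "Employees accrue 20 vacation days per year."},
    {"source": "handbook.pdf, p. 13", "text": "Unused days carry over for one year."},
]
prompt = build_prompt("What is the vacation policy?", retrieved)
```

Numbering the chunks is one simple way to make citations possible: the model can refer to `[1]` or `[2]`, and the application maps those numbers back to the source metadata.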
Chunking Strategies
How you split documents significantly impacts retrieval quality:
Fixed Size Chunking
Split by character count. Simple but may break sentences.
chunk_size=1000, chunk_overlap=200
Recursive Chunking
Tries to split at natural boundaries (paragraphs, sentences).
RecursiveCharacterTextSplitter
Semantic Chunking
Groups content by meaning, keeping related ideas together.
SemanticChunker
Document-Specific
Special splitters for Markdown, code, HTML, etc.
MarkdownHeaderTextSplitter
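To make the first two strategies concrete, here is a from-scratch sketch of fixed-size chunking with overlap and a simplified recursive splitter. Both functions are illustrative assumptions, not the LangChain implementations; in particular, this recursive version drops the separators and does not merge small pieces back together the way `RecursiveCharacterTextSplitter` does.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def recursive_chunks(text: str, chunk_size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split at the coarsest separator first, recursing into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_chunks(piece, chunk_size, tuple(rest)))
    return chunks
```

With `fixed_size_chunks("abcdefghij", 4, 1)`, each chunk starts with the last character of the previous one, which is the overlap at work. The recursive version only cuts mid-sentence when no paragraph, sentence, or word boundary produces a small enough piece.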
Best Practices for Chunking
- Chunk size should be large enough for context but small enough for specificity
- Use overlap to avoid losing information at boundaries
- Consider your document structure (headers, sections)
- Test different chunk sizes for your specific use case
Vector Databases
Vector databases are specialized for storing and querying embeddings:
| Database | Type | Best For | Key Features |
|---|---|---|---|
| Pinecone | Managed | Production apps | Fully managed, scales well |
| ChromaDB | Open-source | Prototyping | Easy setup, Python-native |
| Weaviate | Open-source | Complex queries | GraphQL, hybrid search |
| Qdrant | Open-source | High performance | Rust-based, filtering |
| Milvus | Open-source | Large scale | Billion-scale vectors |
| pgvector | Extension | Existing Postgres | Use with Postgres DB |
Advanced RAG Techniques
1. Hybrid Search
Combine semantic search with keyword search for better results:
# Combine vector similarity with BM25 keyword search
# BM25Retriever requires: pip install rank_bm25
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Build the keyword index from the same chunks stored in the vector store
bm25 = BM25Retriever.from_documents(chunks)
vector_retriever = vectorstore.as_retriever()

ensemble = EnsembleRetriever(
    retrievers=[bm25, vector_retriever],
    weights=[0.4, 0.6]  # 40% keyword, 60% semantic
)
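One common way to merge a keyword ranking with a semantic ranking is weighted Reciprocal Rank Fusion (RRF): each document earns `weight / (c + rank)` from every list it appears in, and the totals decide the final order. The sketch below is a from-scratch illustration with made-up document IDs; the constant `c = 60` is a conventional default, and none of this is LangChain's actual code.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of doc IDs; a higher fused score means a better document."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # keyword search order
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # semantic search order
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
```

Documents that appear near the top of both lists, such as `doc_b` here, rise above documents that only one retriever found.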
2. Reranking
Use a more sophisticated model to reorder retrieved results:
# Requires: pip install langchain-cohere, plus a COHERE_API_KEY environment variable
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Keep only the 3 best chunks after reranking
reranker = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever
)
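The underlying idea is a two-stage pipeline: a fast retriever fetches a generous candidate list, then a slower, more accurate scorer reorders it and keeps only the top few. In the sketch below, `overlap_score` is a toy stand-in for a real reranking model (such as a cross-encoder or a hosted rerank API), and the candidate texts are made up.

```python
import re


def overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    doc_words = set(re.findall(r"[a-z0-9]+", document.lower()))
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


def rerank(query, candidates, score_fn=overlap_score, top_n=3):
    """Re-score first-stage candidates and keep the top_n best."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:top_n]


# Pretend these came back from the first-stage retriever, in its original order.
candidates = [
    "Office parking rules",
    "Vacation policy: 20 paid days",
    "How to request vacation days",
    "Policy on remote work",
]
top = rerank("vacation policy days", candidates, top_n=2)
```

Because `score_fn` is a parameter, the toy scorer can be swapped for a real model without changing the surrounding logic.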
3. Query Transformation
Improve queries before searching:
- Query expansion: Generate multiple versions of the query
- HyDE: Generate a hypothetical answer, then search for similar content
- Step-back prompting: Ask a more general question first
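Query expansion, the first technique above, can be sketched as follows: search once per query variant, then merge the results without duplicates. `generate_variants` is a hypothetical stand-in for an LLM call that rephrases the query (here it returns canned rewrites), and `keyword_search` is a toy retriever over made-up documents.

```python
def generate_variants(query: str) -> list[str]:
    """Hypothetical stand-in for an LLM that rephrases the query."""
    canned = {
        "vacation policy": ["paid time off rules", "annual leave entitlement"],
    }
    return [query] + canned.get(query, [])


def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Toy retriever: return documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc for doc in documents if words & set(doc.lower().split())]


def multi_query_retrieve(query: str, documents: list[str]) -> list[str]:
    """Search with every query variant and merge results, keeping first-seen order."""
    seen = {}
    for variant in generate_variants(query):
        for doc in keyword_search(variant, documents):
            seen.setdefault(doc, None)  # dicts preserve insertion order
    return list(seen)


documents = [
    "vacation requests go through the HR portal",
    "annual leave is 20 days",
    "paid parental leave lasts 16 weeks",
    "printer troubleshooting guide",
]
single = keyword_search("vacation policy", documents)
expanded = multi_query_retrieve("vacation policy", documents)
```

The original query matches only the first document, while the expanded search also surfaces the documents that say "leave" instead of "vacation". That vocabulary mismatch is what query expansion is meant to address.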
4. Parent Document Retrieval
Store small chunks for retrieval but return larger parent documents for context:
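from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Example sizes: small chunks are searched, large chunks are returned
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
docstore = InMemoryStore()  # holds the parent chunks

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,    # Small chunks
    parent_splitter=parent_splitter   # Large chunks
)
retriever.add_documents(documents)  # index children, store parents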
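The mechanism behind parent document retrieval is a lookup from each small chunk back to the larger text it came from. The library-free sketch below uses word overlap in place of vector search; the function names and sample texts are assumptions for illustration.

```python
def split_into_children(parents: dict[str, str], child_size: int) -> list[tuple[str, str]]:
    """Cut each parent into small child chunks that remember their parent ID."""
    children = []
    for parent_id, text in parents.items():
        for start in range(0, len(text), child_size):
            children.append((parent_id, text[start:start + child_size]))
    return children


def retrieve_parents(query: str, children, parents, k: int = 1) -> list[str]:
    """Match against small chunks, then return the full parent texts (deduplicated)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        children,
        key=lambda child: len(query_words & set(child[1].lower().split())),
        reverse=True,
    )
    parent_ids = []
    for parent_id, _ in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == k:
            break
    return [parents[pid] for pid in parent_ids]


# Made-up parent documents keyed by ID.
parents = {
    "hr": "vacation policy grants 20 days. carry over is limited to 5 days.",
    "it": "reset your password in the portal. contact support for vpn issues.",
}
children = split_into_children(parents, child_size=35)
result = retrieve_parents("vacation policy", children, parents, k=1)
```

The match happens on a 35-character child chunk, but the caller receives the whole `hr` document, including the carry-over sentence that the matching chunk alone would have cut off.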
Common RAG Use Cases
Document Q&A
Answer questions about PDFs, Word docs, and other files.
Customer Support
AI that answers based on knowledge base articles and FAQs.
Code Documentation
Query codebases and technical documentation.
Research Assistant
Search and synthesize information from research papers.
Enterprise Search
Find information across company documents and wikis.
Legal/Compliance
Query regulations, contracts, and legal documents.
Evaluation & Monitoring
Measuring RAG quality is crucial:
- Retrieval metrics: Precision, recall, MRR (Mean Reciprocal Rank)
- Generation metrics: Faithfulness, relevance, groundedness
- End-to-end: Answer correctness, user satisfaction
Tools like Ragas, TruLens, and LangSmith help evaluate RAG systems automatically.
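The retrieval metrics listed above are simple enough to compute by hand. A minimal sketch, assuming each query comes with a ranked list of retrieved document IDs and a set of IDs judged relevant (the IDs below are made up):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Each run is (ranked retrieved IDs, set of relevant IDs) for one query.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # first relevant hit at rank 1
    (["d4", "d5", "d6"], {"d5"}),        # first relevant hit at rank 2
]
mrr = mean_reciprocal_rank(runs)  # (1/1 + 1/2) / 2 = 0.75
```

For the first query, precision@2 and recall@2 are both 0.5: one of the two retrieved items is relevant, and one of the two relevant items was found.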