What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by giving them access to external knowledge sources. Instead of relying solely on what the model learned during training, a RAG system retrieves relevant information from your documents, databases, or other sources and passes it to the LLM before it generates a response.

The RAG Process

  1. Query: User asks a question
  2. Retrieve: System finds relevant documents/chunks
  3. Augment: Retrieved context is added to the prompt
  4. Generate: LLM produces an answer using the context
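
The four steps can be sketched in plain Python. This is a toy illustration, not a production design: retrieval here is simple word overlap rather than vector search, `generate` is a hypothetical stand-in for a real LLM call, and the sample chunks are invented for the example.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 2: rank chunks by how many words they share with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Step 3: place the retrieved chunks in the prompt ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Step 4: hypothetical stand-in for an LLM call; swap in a real client."""
    return f"[LLM answer based on a {len(prompt)}-character prompt]"

# Invented sample data for illustration
chunks = [
    "Employees accrue 20 vacation days per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the 5th of each month.",
]
query = "How many vacation days do employees get?"  # Step 1: the user's question
context = retrieve(query, chunks)
prompt = augment(query, context)
answer = generate(prompt)
```

Keeping each step as its own function mirrors how real pipelines are built: the retriever and the generator can be swapped independently.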

Why RAG Matters

LLMs have significant limitations that RAG addresses:

  • Knowledge cutoff: LLMs only know information from their training data, which has a cutoff date
  • Hallucinations: LLMs can confidently generate false information
  • No private data: LLMs don't have access to your company's internal documents
  • Domain specificity: Generic LLMs may lack deep expertise in specialized fields
  • Source attribution: Without RAG, it's hard to verify where information came from

RAG mitigates these limitations by grounding LLM responses in your actual data, making answers more accurate, current, and verifiable.

How RAG Works: The Technical Flow

1. Document Ingestion

First, your documents are processed and prepared for retrieval:

# Requires: pip install langchain-community langchain-text-splitters pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

2. Creating Embeddings

Each chunk is converted into a numerical vector (embedding) that captures its semantic meaning:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Each chunk becomes a vector like:
# [0.023, -0.041, 0.089, ..., 0.012]  # 1536 dimensions
# Similar content = similar vectors
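
To make "similar content = similar vectors" concrete, here is a small sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made up for illustration; real embeddings come from the model and have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" chosen by hand for this example
vacation_policy = [0.9, 0.1, 0.0]
paid_time_off = [0.8, 0.2, 0.1]
server_config = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(vacation_policy, paid_time_off)
sim_unrelated = cosine_similarity(vacation_policy, server_config)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The related pair scores much higher than the unrelated pair, which is the property similarity search relies on.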

3. Storing in Vector Database

Embeddings are stored in a vector database for fast similarity search:

from langchain_community.vectorstores import Chroma

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

4. Retrieval at Query Time

When a user asks a question, we find the most relevant chunks:

# User's question is also embedded
query = "What is the vacation policy?"

# Find similar chunks using vector similarity
relevant_docs = vectorstore.similarity_search(query, k=4)

# Returns the 4 most relevant document chunks
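
Conceptually, a similarity search scores the query vector against every stored vector and keeps the best k. The sketch below does this by brute force, assuming vectors are normalized once at indexing time so that a dot product equals cosine similarity. The chunk IDs and vectors are hypothetical, and real vector databases typically use approximate indexes rather than a full scan.

```python
import heapq
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[tuple[float, str]]:
    """Return the k (score, chunk_id) pairs with the highest similarity."""
    q = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(q, vec)), cid) for cid, vec in index.items())
    return heapq.nlargest(k, scored)

# Hypothetical stored vectors, normalized when they are added to the index
index = {
    "vacation": normalize([0.9, 0.1, 0.0]),
    "holidays": normalize([0.7, 0.3, 0.1]),
    "expenses": normalize([0.1, 0.9, 0.2]),
    "servers": normalize([0.0, 0.1, 0.9]),
}
results = top_k([1.0, 0.2, 0.0], index, k=2)
top_ids = [cid for _, cid in results]
```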

5. Generation with Context

The retrieved chunks are added to the prompt, and the LLM generates an answer:

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")
retriever = vectorstore.as_retriever()

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the vacation policy?"})
print(result["result"])
print(result["source_documents"])  # Citations!
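
Under the hood, the augmentation step is string assembly. The sketch below shows one possible prompt layout that numbers each chunk so the model can cite it; the wording of the instructions, the dictionary keys, and the sample chunks are all assumptions for illustration, not what any particular chain uses internally.

```python
def build_prompt(question: str, docs: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        blocks.append(f"[{i}] ({doc['source']}, p. {doc['page']})\n{doc['text']}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the numbered context below. "
        "Cite the sources you used, e.g. [1].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical retrieved chunks with metadata
docs = [
    {"text": "Employees accrue 20 vacation days per year.",
     "source": "company_handbook.pdf", "page": 12},
    {"text": "Unused days carry over for up to one year.",
     "source": "company_handbook.pdf", "page": 13},
]
prompt = build_prompt("What is the vacation policy?", docs)
print(prompt)
```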

Chunking Strategies

How you split documents significantly impacts retrieval quality:

Fixed Size Chunking

Split by character count. Simple but may break sentences.

chunk_size=1000, chunk_overlap=200
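
A minimal sketch of fixed-size chunking with overlap, written in plain Python rather than with a library splitter. The function name `chunk_fixed` is made up for this example, and the tiny sizes are chosen only so the overlap is easy to see.

```python
def chunk_fixed(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunk_size-character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_fixed(text, chunk_size=10, overlap=4)
print(chunks)
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so the last four characters of one chunk reappear at the start of the next.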

Recursive Chunking

Tries to split at natural boundaries (paragraphs, sentences).

RecursiveCharacterTextSplitter
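
The idea behind recursive splitting can be sketched as follows: try the coarsest separator first, and only fall back to a finer one for pieces that are still too long. This is a simplified illustration, not the library's implementation. It assumes only whitespace separators (paragraphs, lines, words); splitting on sentence punctuation would also need to preserve the punctuation, which is omitted here.

```python
def split_recursive(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator, recursing with finer ones for oversized parts."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to a hard character cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_len:
            current = candidate  # Keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            chunks.extend(split_recursive(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks

text = ("First paragraph is short.\n\n"
        "Second paragraph is a bit longer. It has two sentences.\n\n"
        "Third.")
chunks = split_recursive(text, max_len=40)
```

The short first paragraph stays whole, while the longer second paragraph is broken at word boundaries because it has no finer paragraph or line breaks to use.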

Semantic Chunking

Groups content by meaning, keeping related ideas together.

SemanticChunker

Document-Specific

Special splitters for Markdown, code, HTML, etc.

MarkdownHeaderTextSplitter

Best Practices for Chunking

  • Chunk size should be large enough for context but small enough for specificity
  • Use overlap to avoid losing information at boundaries
  • Consider your document structure (headers, sections)
  • Test different chunk sizes for your specific use case

Vector Databases

Vector databases are specialized for storing and querying embeddings:

Database   Type         Best For           Key Features
Pinecone   Managed      Production apps    Fully managed, scales well
ChromaDB   Open-source  Prototyping        Easy setup, Python-native
Weaviate   Open-source  Complex queries    GraphQL, hybrid search
Qdrant     Open-source  High performance   Rust-based, filtering
Milvus     Open-source  Large scale        Billion-scale vectors
pgvector   Extension    Existing Postgres  Use with Postgres DB

Advanced RAG Techniques

1. Hybrid Search

Combine semantic search with keyword search for better results:

# Combine vector similarity with BM25 keyword search
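from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package

# Build the keyword index from the same chunks that were embedded
bm25 = BM25Retriever.from_documents(chunks)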
vector_retriever = vectorstore.as_retriever()

ensemble = EnsembleRetriever(
    retrievers=[bm25, vector_retriever],
    weights=[0.4, 0.6]
)
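
One common way to merge two ranked lists is Reciprocal Rank Fusion (RRF), where each list contributes a score based on rank rather than on raw similarity values. The sketch below is a weighted variant written from scratch; the function name, the document IDs, and the constant `c = 60` (a conventional default) are assumptions for illustration.

```python
from collections import defaultdict

def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list adds weight / (c + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical rankings from a keyword retriever and a vector retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that appear in both lists rise to the top, and the weights tilt the result toward the retriever you trust more.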

2. Reranking

Use a more sophisticated model to reorder retrieved results:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # pip install langchain-cohere; needs COHERE_API_KEY

reranker = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever
)
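
The two-stage shape of reranking can be shown without any external service: retrieve a wide candidate set cheaply, then re-score each candidate with a more careful function and keep the best few. In this sketch `term_coverage` is a toy stand-in for a real reranking model, and the candidate texts are invented.

```python
import re
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int) -> list[str]:
    """Re-score every candidate against the query and keep the top_n best."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_n]

def term_coverage(query: str, doc: str) -> float:
    """Toy scorer (not a real reranker): fraction of query words found in the doc."""
    q_words = set(re.findall(r"\w+", query.lower()))
    if not q_words:
        return 0.0
    d_words = set(re.findall(r"\w+", doc.lower()))
    return len(q_words & d_words) / len(q_words)

# Hypothetical first-stage results, in the order the retriever returned them
candidates = [
    "Holiday schedule for the current year.",
    "Vacation policy: employees accrue 20 days per year.",
    "Policy on remote work and equipment.",
    "Vacation requests need manager approval.",
]
best = rerank("vacation policy", candidates, term_coverage, top_n=2)
```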

3. Query Transformation

Improve queries before searching:

  • Query expansion: Generate multiple versions of the query
  • HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed it, then search for content similar to that answer
  • Step-back prompting: Ask a more general question first
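
As an illustration of query expansion, the sketch below runs several variants of a question and merges the hits without duplicates. The variants would normally come from an LLM; here they are hard-coded, and `fake_index` is a hypothetical stand-in for a real search backend.

```python
from typing import Callable

def multi_query_retrieve(queries: list[str],
                         search: Callable[[str], list[str]]) -> list[str]:
    """Run every query variant and merge hits in first-seen order, without duplicates."""
    seen: dict[str, None] = {}
    for q in queries:
        for doc_id in search(q):
            seen.setdefault(doc_id, None)
    return list(seen)

# Hypothetical variants an LLM might produce for the first question
variants = [
    "What is the vacation policy?",
    "How much paid time off do employees get?",
    "annual leave rules",
]

# Hypothetical search backend: maps each query to ranked chunk IDs
fake_index = {
    variants[0]: ["c1", "c2"],
    variants[1]: ["c2", "c3"],
    variants[2]: ["c4", "c1"],
}
merged = multi_query_retrieve(variants, lambda q: fake_index.get(q, []))
print(merged)
```

Different phrasings surface different chunks, so the merged set usually has better recall than any single query.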

4. Parent Document Retrieval

Store small chunks for retrieval but return larger parent documents for context:

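from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks are embedded and searched; large chunks are what gets returned
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
docstore = InMemoryStore()  # Holds the parent chunks

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,  # Small chunks
    parent_splitter=parent_splitter  # Large chunks
)
retriever.add_documents(documents)  # Splits, embeds children, stores parents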
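
The mechanism is a lookup from each small chunk back to the document it was cut from. The sketch below shows that mapping in plain Python, assuming crude fixed-width child chunks and word-overlap matching in place of embeddings; the parent texts and function names are invented for the example.

```python
def build_child_index(parents: dict[str, str], child_size: int) -> list[tuple[str, str]]:
    """Cut each parent into small child chunks, remembering each child's parent ID."""
    children = []
    for parent_id, text in parents.items():
        for start in range(0, len(text), child_size):
            children.append((text[start:start + child_size], parent_id))
    return children

def retrieve_parents(query: str, children: list[tuple[str, str]],
                     parents: dict[str, str], k: int = 1) -> list[str]:
    """Match on small chunks, then return the larger parents they came from."""
    words = set(query.lower().split())
    ranked = sorted(children,
                    key=lambda c: len(words & set(c[0].lower().split())),
                    reverse=True)
    parent_ids: list[str] = []
    for _, pid in ranked:
        if pid not in parent_ids:  # Several children may share one parent
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids[:k]]

# Hypothetical parent documents
parents = {
    "hr": "Vacation policy. Employees accrue 20 days per year. Unused days carry over once.",
    "it": "Laptop policy. Devices are replaced every three years. Report losses to IT.",
}
children = build_child_index(parents, child_size=40)
result = retrieve_parents("vacation days accrue", children, parents)
```

The match happens on a 40-character fragment, but the caller receives the full parent text, which gives the LLM more surrounding context.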

Common RAG Use Cases

Document Q&A

Answer questions about PDFs, Word docs, and other files.

Customer Support

AI that answers based on knowledge base articles and FAQs.

Code Documentation

Query codebases and technical documentation.

Research Assistant

Search and synthesize information from research papers.

Enterprise Search

Find information across company documents and wikis.

Legal/Compliance

Query regulations, contracts, and legal documents.

Evaluation & Monitoring

Measuring RAG quality is crucial:

  • Retrieval metrics: Precision, recall, MRR (Mean Reciprocal Rank)
  • Generation metrics: Faithfulness, relevance, groundedness
  • End-to-end: Answer correctness, user satisfaction
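
The retrieval metrics are simple to compute once you have labeled which chunks are relevant for each test question. A minimal sketch, with hypothetical chunk IDs and relevance labels:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average over queries of 1 / rank of the first relevant result (0 if none)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Each run pairs the retrieved IDs (in rank order) with the labeled relevant IDs
runs = [
    (["c1", "c7", "c3"], {"c1", "c3"}),  # first relevant result at rank 1
    (["c9", "c2", "c5"], {"c2"}),        # first relevant result at rank 2
]
p = precision_at_k(*runs[0], k=3)
r = recall_at_k(*runs[0], k=3)
mrr = mean_reciprocal_rank(runs)
print(f"precision@3={p:.2f} recall@3={r:.2f} MRR={mrr:.2f}")
```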

Tools like Ragas, TruLens, and LangSmith help evaluate RAG systems automatically.
