RAG Pipelines Explained: Building AI That Knows Your Data

TL;DR: Retrieval-Augmented Generation (RAG) mitigates the hallucination problem by grounding LLM responses in your actual data. The pipeline works in five stages: chunking your documents, embedding them into vectors, storing them in a vector database, retrieving relevant chunks at query time, and feeding them into the LLM for generation. This post walks through each stage, compares popular vector databases, and includes a working Python example with ChromaDB.


Table of Contents

  1. The Hallucination Problem
  2. What Is RAG?
  3. The RAG Pipeline, Stage by Stage
  4. Vector Databases Compared
  5. Building a RAG Pipeline in Python
  6. Common Pitfalls and How to Avoid Them
  7. When RAG Is Not Enough
  8. References

The Hallucination Problem

Large Language Models are extraordinarily capable, but they have a fundamental flaw: they make things up. Ask GPT-4 about your company's internal API documentation and it will confidently generate something that looks plausible but is entirely fabricated. Ask it about a recent event past its training cutoff and it will either refuse or confabulate.

This is not a bug that will be patched away. Hallucination is an inherent property of how language models work. They are trained to predict the next most likely token, not to retrieve facts from a verified knowledge base. The model has no concept of "I don't know" -- it only has probability distributions over vocabulary tokens.

For consumer chatbots, this is an inconvenience. For production applications where accuracy matters -- legal research, medical information, customer support, internal tooling -- it is a dealbreaker.

Enter Retrieval-Augmented Generation.


What Is RAG?

RAG is an architecture pattern that gives an LLM access to external knowledge at inference time. Instead of relying solely on what the model memorized during training, you retrieve relevant documents from your own data store and include them in the prompt context.

The concept was formalized in a 2020 paper by Lewis et al. at Facebook AI Research, but the underlying idea is simple: don't ask the model to remember everything -- give it an open-book exam.

Here is the core intuition in pseudocode:

user_query = "What is our refund policy for enterprise customers?"

# Without RAG:
response = llm.generate(user_query)  # Hallucinates a policy

# With RAG:
relevant_docs = vector_store.search(user_query, top_k=5)
context = "\n".join(relevant_docs)
prompt = f"Based on the following documents:\n{context}\n\nAnswer: {user_query}"
response = llm.generate(prompt)  # Answers from your actual policy

The result is dramatically more accurate and verifiable (you can cite the source documents), and your data stays fresh without retraining the model.


The RAG Pipeline, Stage by Stage

A production RAG system consists of five stages. Let's walk through each one.

Stage 1: Chunking

Raw documents -- PDFs, web pages, Markdown files, database records -- need to be split into smaller pieces. Language models have finite context windows, and even with 128K token models, stuffing entire documents into the prompt is wasteful and degrades retrieval quality.

Common chunking strategies include:

  • Fixed-size chunks (e.g., 512 tokens with 50-token overlap): Simple and predictable, but can split sentences and ideas mid-thought.
  • Recursive character splitting: Tries to split on paragraph boundaries, then sentences, then words. This is the default in LangChain and works well for most text.
  • Semantic chunking: Uses embeddings to detect topic shifts and splits at natural boundaries. More expensive but produces more coherent chunks.
  • Document-aware chunking: Respects document structure like headers, sections, and code blocks. Essential for technical documentation.

The overlap between chunks is important. Without it, a question whose answer spans two chunks might miss critical context. A 10-20% overlap is a reasonable starting point.
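
Here is a minimal sketch of the fixed-size strategy, using tiktoken for token counting. The chunk_text helper and its defaults are illustrative, not taken from any particular library:

import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token windows with overlap."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start:start + chunk_size]))
    return chunks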

Stage 2: Embedding

Each chunk is converted into a dense vector -- a list of floating-point numbers that captures the semantic meaning of the text. Similar meanings produce similar vectors, enabling semantic search rather than keyword matching.

Popular embedding models include:

  • OpenAI text-embedding-3-small: 1536 dimensions, excellent quality, API-based.
  • Cohere embed-v3: Strong multilingual support.
  • BGE / E5 (open source): Run locally, no API costs, competitive quality.
  • Sentence Transformers: Flexible open-source library with many model options.

The choice of embedding model matters more than most people think. A poor embedding model will produce poor retrieval, and no amount of prompt engineering will fix it downstream.
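
To make "similar meanings produce similar vectors" concrete, here is a small sketch that scores two candidate chunks against a query with cosine similarity. The embed and cosine_similarity helpers and the sample sentences are made up for illustration; the embeddings call itself is the same OpenAI endpoint used in the full example later.

import numpy as np
from openai import OpenAI

client = OpenAI()  # uses the OPENAI_API_KEY environment variable

def embed(text: str) -> list[float]:
    """Embed a single text with OpenAI's text-embedding-3-small model."""
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("How do I get my money back?")
# The refund sentence shares almost no keywords with the query,
# yet it typically scores noticeably higher than the unrelated one.
print(cosine_similarity(query, embed("Returns are accepted within 30 days of purchase.")))
print(cosine_similarity(query, embed("The API rate limit is 10,000 requests per minute.")))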

Stage 3: Vector Store

The embedded chunks need to be stored somewhere that supports efficient similarity search. This is where vector databases come in. They use approximate nearest-neighbor algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to find the closest vectors to a query without scanning every record.
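
You rarely implement this yourself -- the vector database does it for you -- but a short sketch with the open-source hnswlib library shows the idea: build the index once, then answer queries without a full scan. The dimensions, counts, and parameters below are arbitrary placeholders.

import hnswlib
import numpy as np

dim, num_vectors = 128, 10_000
data = np.random.rand(num_vectors, dim).astype(np.float32)

# Build an HNSW index; M and ef_construction trade build time for recall
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(data, np.arange(num_vectors))
index.set_ef(50)  # query-time speed/accuracy trade-off

# Top-5 approximate neighbors, found without comparing against all 10,000 vectors
labels, distances = index.knn_query(np.random.rand(dim).astype(np.float32), k=5)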

We will compare the options in detail in the next section.

Stage 4: Retrieval

When a user query arrives, it is embedded using the same model, and the vector store returns the top-K most similar chunks. This is where retrieval quality is won or lost.

Advanced retrieval strategies include:

  • Hybrid search: Combine vector similarity with BM25 keyword search for better recall (a fusion sketch follows this list).
  • Re-ranking: Use a cross-encoder model to re-score the top-K results for precision.
  • Query expansion: Rephrase the query or generate multiple variants to improve recall.
  • Metadata filtering: Narrow the search space by date, source, category, or access level before running similarity search.
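
As a sketch of how hybrid search can be wired up, here is reciprocal rank fusion (RRF), one common way to merge a vector-similarity ranking with a BM25 keyword ranking. The two input rankings are assumed to come from your vector store and a keyword index (Elasticsearch, rank_bm25, or similar); the document IDs are invented.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs using RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc_3", "doc_1", "doc_7", "doc_2"]  # from similarity search
bm25_ranking = ["doc_1", "doc_9", "doc_3", "doc_4"]    # from keyword search
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
# doc_1 and doc_3 rise to the top because both retrievers rank them highly

A cross-encoder re-ranker can then re-score the fused top-K for precision before anything reaches the prompt.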

Stage 5: Generation

The retrieved chunks are assembled into a prompt alongside the user's query and any system instructions. The LLM generates a response grounded in the provided context.

A well-designed generation prompt typically includes:

  • A system message defining the assistant's role and constraints
  • The retrieved context, clearly delineated
  • Instructions to only answer from the provided context
  • The user's question

For example:

You are a helpful assistant for Acme Corp. Answer questions using ONLY
the provided context. If the context does not contain enough information
to answer, say "I don't have enough information to answer that."

Context:
---
{retrieved_chunks}
---

Question: {user_query}

Vector Databases Compared

Choosing a vector database is one of the most consequential decisions in a RAG pipeline. Here is how the major options stack up:

| Feature | Pinecone | Weaviate | ChromaDB | pgvector |
| --- | --- | --- | --- | --- |
| Type | Managed SaaS | Self-hosted / Cloud | Embedded / Self-hosted | PostgreSQL extension |
| Ease of Setup | Very easy | Moderate | Very easy | Easy (if you use Postgres) |
| Scalability | Excellent | Excellent | Limited (single-node) | Good (Postgres-bound) |
| Hybrid Search | Yes | Yes (BM25 + vector) | No (vector only) | Yes (with tsvector) |
| Metadata Filtering | Yes | Yes | Yes | Yes (SQL WHERE) |
| Cost | Pay-per-use | Free (self-hosted) | Free (open source) | Free (open source) |
| Best For | Production SaaS apps | Feature-rich self-hosted | Prototyping, small-scale | Teams already on Postgres |
| Language Support | Python, Node, Go, REST | Python, Go, Java, REST | Python, JS | Any Postgres client |
| Max Dimensions | 20,000 | Configurable | Configurable | 2,000 |

Pinecone is the easiest path to production if you want a managed service. You get automatic scaling, backups, and a clean API. The trade-off is vendor lock-in and cost at scale.

Weaviate is feature-rich and supports hybrid search natively. It is a strong choice if you need BM25 + vector search and want to self-host.

ChromaDB is the go-to for prototyping and small projects. It runs in-process with no server setup. However, it is not designed for large-scale production workloads.

pgvector is the pragmatic choice for teams already running PostgreSQL. You get vector search alongside your relational data with no additional infrastructure. Performance is reasonable up to a few million vectors.
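
For a sense of what that route looks like from Python, here is a minimal sketch using psycopg and the pgvector helper package (roughly pip install psycopg pgvector numpy, plus a Postgres instance with the extension available). The table name, column size, connection string, and placeholder vectors are all assumptions; see the pgvector README for the authoritative API.

import os
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(os.environ["DATABASE_URL"])  # assumed environment variable
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg send/receive vector values directly

conn.execute("""
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)  -- matches text-embedding-3-small
    )
""")

# Insert one chunk (a random vector stands in for a real embedding here)
conn.execute(
    "INSERT INTO doc_chunks (content, embedding) VALUES (%s, %s)",
    ("Enterprise customers have a 90-day refund window.", np.random.rand(1536)),
)

# Top-K retrieval: <=> is pgvector's cosine-distance operator
rows = conn.execute(
    "SELECT content FROM doc_chunks ORDER BY embedding <=> %s LIMIT 3",
    (np.random.rand(1536),),  # the embedded user query would go here
).fetchall()
conn.commit()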


Building a RAG Pipeline in Python

Let's build a working RAG pipeline using ChromaDB and the OpenAI API. This example is intentionally minimal to focus on the core mechanics.

Installation

pip install chromadb openai tiktoken

The Complete Pipeline

import chromadb
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var
chroma_client = chromadb.Client()

# --- Stage 1 & 2: Chunk and embed documents ---
# ChromaDB handles embedding automatically using its default model,
# but we'll use OpenAI embeddings for better quality.

def get_embedding(text: str) -> list[float]:
    """Get OpenAI embedding for a text chunk."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# --- Stage 3: Create vector store ---
collection = chroma_client.create_collection(
    name="company_docs",
    metadata={"hnsw:space": "cosine"}
)

# Sample documents (in production, these come from your data pipeline)
documents = [
    "Our standard refund policy allows returns within 30 days of purchase. "
    "Enterprise customers have an extended 90-day window.",

    "Enterprise plans include 24/7 priority support with a guaranteed "
    "4-hour response time for critical issues.",

    "All API rate limits are documented at docs.example.com/limits. "
    "Enterprise tier gets 10,000 requests per minute.",

    "Data residency options are available for enterprise customers. "
    "We support US, EU, and APAC regions.",
]

# Add documents with embeddings
for i, doc in enumerate(documents):
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[get_embedding(doc)],
        documents=[doc],
        metadatas=[{"source": "internal_docs", "chunk_index": i}]
    )

# --- Stage 4: Retrieve relevant chunks ---
def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Retrieve the most relevant document chunks for a query."""
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]

# --- Stage 5: Generate response ---
def rag_query(user_question: str) -> str:
    """Full RAG pipeline: retrieve context, then generate."""
    relevant_chunks = retrieve(user_question)
    context = "\n\n".join(relevant_chunks)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant for our company. "
                    "Answer questions using ONLY the provided context. "
                    "If the context doesn't contain the answer, say so."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Context:\n---\n{context}\n---\n\n"
                    f"Question: {user_question}"
                )
            }
        ],
        temperature=0.1  # Low temperature for factual responses
    )

    return response.choices[0].message.content

# --- Usage ---
answer = rag_query("What is the refund policy for enterprise customers?")
print(answer)
# Output: Enterprise customers have an extended 90-day refund window...

Adding Persistent Storage

For production use, swap the in-memory client for persistent storage:

# Persistent ChromaDB (data survives restarts)
chroma_client = chromadb.PersistentClient(path="./chroma_data")

# Or connect to a ChromaDB server
chroma_client = chromadb.HttpClient(host="localhost", port=8000)

Common Pitfalls and How to Avoid Them

Chunks too large or too small. Chunks that are too large dilute the relevant information with noise. Chunks that are too small lose context. Start with 256-512 tokens and experiment.

Ignoring chunk overlap. Without overlap, answers that span chunk boundaries will be missed. A 10-20% overlap is a sensible default.

Using the wrong similarity metric. Cosine similarity is the standard for normalized embeddings. Euclidean distance can work but is sensitive to magnitude differences. Match the metric to your embedding model's recommendations.
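
The magnitude point is easy to see with two vectors that point the same way but differ in length:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a * 10  # same direction, 10x the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 -- identical direction
euclidean = np.linalg.norm(a - b)  # large -- penalizes the scale difference
print(cosine, euclidean)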

Not evaluating retrieval quality. Most teams obsess over the generation step and ignore retrieval. If the right documents are not being retrieved, no prompt will save you. Build an evaluation set of queries with known relevant documents and measure recall.
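
A minimal version of that evaluation is only a few lines. The sketch below measures recall@k against a hand-labeled set of queries, reusing the collection and get_embedding objects from the example above; the eval_set entries are invented.

# Each entry maps a query to the IDs of the chunks that should be retrieved
eval_set = [
    {"query": "What is the refund policy for enterprise customers?", "relevant_ids": {"doc_0"}},
    {"query": "How fast is support for critical issues?", "relevant_ids": {"doc_1"}},
]

def recall_at_k(eval_set: list[dict], k: int = 3) -> float:
    """Fraction of labeled relevant chunks that show up in the top-k results."""
    hits, total = 0, 0
    for example in eval_set:
        results = collection.query(
            query_embeddings=[get_embedding(example["query"])],
            n_results=k,
        )
        retrieved_ids = set(results["ids"][0])
        hits += len(example["relevant_ids"] & retrieved_ids)
        total += len(example["relevant_ids"])
    return hits / total

print(f"recall@3 = {recall_at_k(eval_set):.2f}")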

Stuffing too many chunks into context. More is not always better. Including irrelevant chunks adds noise and can confuse the model. Three to five high-quality chunks often outperform twenty mediocre ones.


When RAG Is Not Enough

RAG is not a silver bullet. It struggles with:

  • Reasoning over entire datasets: RAG retrieves snippets, not full databases. Questions like "what is the average deal size across all customers" require aggregation, not retrieval.
  • Multi-hop reasoning: Questions that require connecting information across many documents can exceed retrieval capabilities.
  • Structured data: For SQL-like queries over structured data, text-to-SQL or direct database access is more appropriate.
  • Real-time data: If your data changes by the second, the indexing lag in vector stores can be a problem.

For these cases, consider combining RAG with tool use, SQL agents, or fine-tuning. The best production systems use RAG as one component in a larger architecture, not as the entire solution.


References

  1. Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401. https://arxiv.org/abs/2005.11401
  2. ChromaDB Documentation. https://docs.trychroma.com/
  3. OpenAI Embeddings Guide. https://platform.openai.com/docs/guides/embeddings
  4. Pinecone Learning Center. https://www.pinecone.io/learn/
  5. Weaviate Documentation. https://weaviate.io/developers/weaviate
  6. pgvector GitHub Repository. https://github.com/pgvector/pgvector
  7. LangChain RAG Tutorial. https://python.langchain.com/docs/tutorials/rag/
  8. Gao, Y. et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv:2312.10997. https://arxiv.org/abs/2312.10997