Building RAG Pipelines in Python: A Step-by-Step Guide for Engineering Teams

DSi Team · 13 min read

Retrieval-augmented generation is the standard architecture for AI applications that need to work with private or domain-specific data. Rather than fine-tuning a model on your corpus or hoping that the LLM memorized relevant facts during pre-training, RAG fetches the most relevant documents at query time and supplies them to the model as context. You get more accurate, grounded, and current responses without the expense and overhead of custom model training.

The gap between a RAG demo that works on 10 documents and a production pipeline that processes thousands of queries daily against millions of documents is enormous -- and most tutorials never cross it. This guide covers the complete pipeline: architecture, document ingestion, chunking, embedding, retrieval, reranking, evaluation, and production deployment. Everything is in Python, and the recommendations reflect patterns we have seen succeed and fail in real-world AI product development.

RAG Architecture: The Two-Phase Pipeline

Before writing code, you need a clear mental model of what a RAG pipeline does. It is fundamentally a two-phase system: an offline indexing phase that prepares your data, and an online query phase that retrieves context and generates answers.

The indexing phase (offline)

This phase executes whenever you add or update documents. It consists of four steps:

  1. Document ingestion: Load raw documents from wherever they live -- PDFs, web pages, databases, APIs, Confluence wikis, Notion exports, or any other knowledge repository your organization maintains.
  2. Chunking: Break documents into smaller, semantically coherent pieces. A 50-page PDF cannot be passed to an LLM as a single block. How you chunk directly determines retrieval quality.
  3. Embedding: Transform each chunk into a dense vector using an embedding model. These vectors encode semantic meaning, enabling similarity-based retrieval at query time.
  4. Storage: Persist the vectors and their associated text in a vector database optimized for fast approximate nearest neighbor search.

The query phase (online)

This phase runs on every user query:

  1. Query embedding: Convert the user's question into a vector using the same embedding model from the indexing phase.
  2. Retrieval: Search the vector database for the top-k chunks most similar to the query vector.
  3. Reranking (strongly recommended): Apply a cross-encoder model to re-score the retrieved chunks by actual relevance to the query, not just vector proximity.
  4. Generation: Pass the question and the retrieved (and reranked) chunks to the LLM, which produces a grounded answer.

The most common mistake in RAG development is spending 90 percent of effort on the generation step and 10 percent on retrieval. In reality, retrieval quality drives 80 percent of output quality. If the right documents never reach the model, no prompt engineering in the world will produce a good answer.

Step 1: Document Ingestion

Ingestion is the foundation of your pipeline, and the step teams most frequently underestimate. Your RAG system can only be as good as the data it indexes. Both LangChain and LlamaIndex provide document loaders for standard formats, but production pipelines almost always need custom ingestion logic.

Common sources and loaders

  • PDFs: Use PyMuPDF or pdfplumber for extraction. Steer clear of PyPDF2 for anything with complex layouts -- it routinely mangles tables and multi-column text.
  • Web pages: BeautifulSoup with requests handles static pages; Playwright is necessary for JavaScript-rendered content. Always strip navigation, footers, and boilerplate before chunking.
  • Databases: Query directly and transform rows or records into text representations. For relational data, generating natural-language descriptions of each record often produces better retrieval results than raw field dumps.
  • APIs and SaaS tools: Build custom loaders for Notion, Confluence, Google Drive, or Slack. Both frameworks have community-maintained integrations, but test them carefully -- they tend to break when upstream APIs change.
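To make the record-to-text idea above concrete, here is a minimal sketch. The field names (`name`, `category`, `price`, `description`) are hypothetical — substitute your own schema:

```python
def record_to_text(row: dict) -> str:
    # Turn a relational row into a natural-language description,
    # which typically embeds and retrieves better than a raw field dump.
    return (
        f"{row['name']} is a {row['category']} product priced at "
        f"${row['price']:.2f}. {row['description']}"
    )

row = {
    "name": "Acme Widget",
    "category": "hardware",
    "price": 19.99,
    "description": "A general-purpose fastener.",
}
print(record_to_text(row))
```

The same pattern applies to API payloads: flatten each record into a sentence or two before chunking and embedding.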

Production ingestion patterns

In production, ingestion is not a one-time script. It is a continuous process that requires:

  • Incremental updates: Re-process only documents that have changed. Track document hashes or modification timestamps to avoid re-indexing the full corpus on every run.
  • Metadata extraction: Store metadata alongside each chunk -- source document, page number, section heading, author, last modified date. This metadata powers filtering during retrieval and enables citation in responses.
  • Robust error handling: Documents will fail to parse. PDFs will be corrupted. APIs will time out. Build retry logic, dead-letter queues for unparseable documents, and alerting for ingestion failures that need manual attention.
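A hash-based change check covers the incremental-update requirement. This sketch assumes you persist a small state store mapping document IDs to the content hash recorded at last indexing:

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable fingerprint of a document's current content.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(docs: dict[str, str], index_state: dict[str, str]) -> list[str]:
    """Return IDs of documents whose content differs from the last indexed hash.

    docs:        {doc_id: current_text}
    index_state: {doc_id: hash recorded at last indexing run}
    New documents (absent from index_state) are included automatically.
    """
    return [
        doc_id
        for doc_id, text in docs.items()
        if index_state.get(doc_id) != content_hash(text)
    ]
```

Only the returned IDs need re-chunking, re-embedding, and upserting; everything else is skipped.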

Step 2: Chunking Strategies

Chunking is where RAG pipelines are won or lost. The objective is to split documents into pieces small enough to be relevant to a specific query, yet large enough to preserve meaningful context. Getting this wrong undermines everything downstream.

Fixed-size chunking

The simplest strategy: divide text into chunks of a fixed token count (for example, 512 tokens) with overlap between adjacent chunks (for example, 100 tokens). This works acceptably for homogeneous documents like articles or reports.

  • Pros: Easy to implement, predictable sizing, simple to reason about.
  • Cons: Splits can land mid-sentence or mid-paragraph, breaking semantic coherence and reducing retrieval precision.
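A minimal fixed-size chunker looks like this. It splits on whitespace as a stand-in for real tokenization — a production version should count model tokens (for example with `tiktoken`):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows of whitespace tokens.

    Requires overlap < chunk_size. Uses word count as a rough token proxy.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Note how the overlap means the tail of one chunk reappears at the head of the next — that redundancy is what protects boundary context.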

Recursive character splitting

LangChain's RecursiveCharacterTextSplitter attempts to split on paragraph boundaries first, then sentence boundaries, then word boundaries, falling back to character splits only when necessary. This preserves significantly more semantic structure than naive fixed-size approaches.
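To show the idea without a framework dependency, here is a toy reimplementation of recursive splitting — not LangChain's actual code, just the same fallback logic in miniature:

```python
def recursive_split(text: str, max_len: int = 200,
                    seps: tuple = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on paragraphs first, then lines, then words, then raw characters."""
    if len(text) <= max_len:
        return [text]
    sep, rest = seps[0], seps[1:]
    if sep == "":
        # Last resort: hard character split.
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
        else:
            if current:
                chunks.append(current)
            if len(part) <= max_len:
                current = part
            else:
                # Piece too big for this separator: recurse with the next one.
                chunks.extend(recursive_split(part, max_len, rest))
                current = ""
    if current:
        chunks.append(current)
    return chunks
```

The real splitter adds overlap handling and length functions, but the separator-fallback structure is the essential idea.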

Semantic chunking

Uses embedding similarity to detect natural breakpoints. Consecutive sentences with similar meaning stay grouped together; points where the topic shifts become chunk boundaries. LlamaIndex's SemanticSplitterNodeParser implements this approach. It is more expensive computationally but produces notably higher-quality chunks for documents with variable structure or mixed topics.

Document-aware chunking

For structured documents -- legal contracts, technical manuals, medical records -- chunking should follow the document's inherent organization. Split on section headings, clause boundaries, or logical units specific to the document type. This requires domain-specific parsing logic but yields substantially better retrieval results than any generic strategy.

Whichever strategy you choose, include chunk overlap (typically 10 to 20 percent of chunk size) to prevent losing context at boundaries. And always test your chunking quality against real documents before optimizing any other component.

Step 3: Embedding Models

The embedding model transforms text into dense vectors that encode semantic meaning. This choice directly determines retrieval quality and is one of the most consequential decisions in the entire pipeline.

Selecting an embedding model

The embedding landscape has matured considerably. Here are the options worth evaluating in 2026:

  • OpenAI text-embedding-3-large: Excellent general-purpose performance with easy integration. Supports dimensionality reduction via the dimensions parameter, letting you trade off storage cost against retrieval precision. A strong default for teams that prefer managed infrastructure.
  • Cohere embed-v4: Competitive with OpenAI, with strong multilingual support and separate query/document embedding modes that improve retrieval accuracy out of the box.
  • Open-source models (BGE, E5, GTE families): Run locally using Hugging Face or sentence-transformers. Zero API costs, full data sovereignty, and benchmark performance that rivals commercial offerings. Ideal for teams with GPU infrastructure or strict privacy requirements.

The MTEB leaderboard tracks current benchmark rankings, but remember that benchmark scores do not always predict performance on your specific domain. Test candidate models against your own data and queries before committing.

Embedding best practices

  • Use the same model for indexing and querying. Mixing embedding models across phases produces meaningless similarity scores -- the vectors live in different spaces.
  • Normalize your vectors. Most vector databases expect normalized vectors for cosine similarity. Some models normalize by default; verify this before indexing.
  • Be deliberate about dimensionality. Higher dimensions capture more nuance but cost more to store and slow down retrieval. For most applications, 1024 dimensions is sufficient. OpenAI's model lets you reduce from 3072 without re-embedding.
  • Batch embedding calls. Processing chunks one at a time is slow and wasteful. Batch in groups of 100 to 500 for dramatically better throughput.
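The batching advice can be wrapped in a small helper. `embed_fn` here is a hypothetical stand-in for one batched call to your provider (for example, one OpenAI embeddings request per batch):

```python
from typing import Callable


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 200,
) -> list[list[float]]:
    """Embed texts in fixed-size batches, preserving input order.

    embed_fn takes a list of strings and returns one vector per string.
    """
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i : i + batch_size]))
    return vectors
```

In practice you would also add retry-with-backoff around each `embed_fn` call, since embedding APIs rate-limit under sustained load.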

Step 4: Vector Databases

The vector database stores your embedded chunks and handles similarity search at query time. The right choice depends on your scale, operational maturity, and existing infrastructure. If you are adding AI features to an existing SaaS product, this decision has long-term maintainability implications.

| Database | Type | Best For | Key Trade-off |
| --- | --- | --- | --- |
| Chroma | In-process / self-hosted | Prototyping, small datasets | Fastest to start, limited production scaling |
| Pinecone | Fully managed | Production with minimal ops burden | Effortless scaling, vendor dependency |
| Weaviate | Self-hosted or cloud | Hybrid search (vector + keyword) | Feature-rich, higher setup complexity |
| pgvector | PostgreSQL extension | Teams already running PostgreSQL | No new infrastructure, limited at very large scale |
| Qdrant | Self-hosted or cloud | High-performance filtered search | Very fast, smaller ecosystem |
| Milvus | Self-hosted or cloud (Zilliz) | Billion-scale production deployments | Massive scale ceiling, significant operational overhead |

Practical guidance

For teams building their first RAG pipeline:

  • Start with Chroma in development. It runs in-process, needs no external services, and the API patterns transfer cleanly to production databases.
  • Graduate to Pinecone or Weaviate for production. Choose Pinecone for zero operational overhead, Weaviate if you need hybrid search (combining vector similarity with BM25 keyword matching) or prefer to self-host.
  • Consider pgvector if you already run PostgreSQL and your dataset is under a few million vectors. Adding a column to an existing table is far simpler than standing up a new database.

Regardless of your choice, always persist the original text alongside the vectors. You will need it for debugging, reranking, and passing context to the LLM.

Step 5: Retrieval Strategies

Basic vector similarity search -- returning the top-k most similar chunks -- is a starting point, not a final answer. Production RAG pipelines layer multiple retrieval strategies to improve both recall and precision.
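Under the hood, basic top-k retrieval is just cosine similarity over stored vectors. Here is a brute-force sketch — a real vector database replaces the linear scan with an approximate index such as HNSW:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Assumes non-zero vectors of equal dimension.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3):
    """index: list of (chunk_text, vector) pairs. Returns (score, text) tuples."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return scored[:k]
```

This linear scan is O(n) per query, which is exactly why approximate nearest neighbor indexes exist once the corpus grows past a few hundred thousand vectors.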

Hybrid search

Combine dense vector search with sparse keyword search (BM25). Vector search handles semantic similarity well ("find documents about employee onboarding") but can miss exact keyword matches ("find the Q3 2025 revenue report"). BM25 catches exact matches but misses semantic relationships. Hybrid search merges both result sets, typically using reciprocal rank fusion (RRF).

Weaviate and Qdrant support hybrid search natively. For other vector databases, you can run BM25 separately using a library like rank_bm25 and merge the results in application code.
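Reciprocal rank fusion itself is only a few lines. This sketch merges any number of ranked result lists; `k = 60` is the constant conventionally used with RRF:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both the dense and sparse lists accumulate the most score, which is why RRF favors agreement between the two retrievers.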

Metadata filtering

Narrow the search space using metadata before vector similarity runs. If the user asks about a specific product, filter to documents tagged with that product. If they need recent information, filter by date. This reduces noise and sharpens relevance without changing the core retrieval algorithm.

Multi-query retrieval

A single user query often fails to capture the complete information need. Multi-query retrieval uses the LLM to generate several rephrased versions of the original question, runs retrieval for each variant, and deduplicates the combined results. This materially improves recall for ambiguous or complex questions.

Parent document retrieval

Index small chunks for precise matching, but return the larger parent document or surrounding chunks for context. You get the precision of fine-grained retrieval with the comprehensiveness of broader context. LlamaIndex's AutoMergingRetriever implements this pattern directly.

Step 6: Reranking

Reranking is the single highest-leverage improvement for a RAG pipeline that already retrieves reasonable results. Vector similarity search returns chunks that are semantically close to the query, but "close in embedding space" and "actually relevant" are not the same thing.

A reranker evaluates each query-chunk pair using a cross-encoder model and reorders the results by genuine relevance rather than embedding distance. This step reliably improves answer quality by 10 to 30 percent across standard benchmarks.

Reranking options in Python

  • Cohere Rerank: Managed API with no infrastructure to run. Submit your query and candidate documents, receive reranked results with relevance scores. Easy to integrate and effective across domains.
  • Cross-encoder models via sentence-transformers: Self-hosted using models like cross-encoder/ms-marco-MiniLM-L-12-v2. Eliminates API costs, but needs a GPU for acceptable latency at scale.
  • ColBERT-based rerankers: Late interaction models that strike a practical balance between bi-encoder speed and cross-encoder accuracy.

The standard pattern: retrieve a broad initial set (20 to 50 chunks) with vector search, then rerank down to the top 3 to 5 for the LLM. This cuts context window usage while dramatically improving the relevance of what reaches the model.
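The retrieve-broad-then-rerank pattern can be expressed as a small orchestration function. Both `search` and `score` are hypothetical injection points — a vector-store top-k lookup and a cross-encoder relevance scorer, respectively:

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    search: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    fetch_k: int = 30,
    final_k: int = 4,
) -> list[str]:
    """Fetch a broad candidate set, then keep only the highest-scored chunks.

    search(query, k) -> candidate chunks from the vector store.
    score(query, chunk) -> cross-encoder relevance score (higher is better).
    """
    candidates = search(query, fetch_k)
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ranked[:final_k]
```

Keeping the scorer injectable also makes the pipeline testable: unit tests can pass a cheap dummy scorer instead of loading a cross-encoder model.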

Step 7: Prompt Construction and Generation

With retrieval and reranking complete, you need to assemble a prompt that combines the user's question with the retrieved context and passes it to the LLM for generation.

Prompt anatomy

A production RAG prompt typically contains:

  • System instructions: Define the model's role, behavioral constraints (such as "answer only from the provided context"), and output formatting requirements.
  • Retrieved context: The reranked chunks, clearly separated so the model can distinguish between different source documents.
  • User question: The original query, placed after the context so the model addresses it with the retrieved material top of mind.
  • Citation guidance: Instructions for the model to reference which source documents it drew from, enabling users to verify the answer.
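Putting the anatomy together, a minimal prompt builder might look like this. The message shape follows the common chat-completions convention; adapt it to your client library:

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> list[dict]:
    """Assemble a RAG prompt from reranked chunks.

    chunks: [{'text': ..., 'source': ...}] — each chunk is labeled with its
    source so the model can cite and the user can verify.
    """
    context = "\n\n".join(f"[Source: {c['source']}]\n{c['text']}" for c in chunks)
    system = (
        "Answer using only the provided context. "
        "If the context does not contain the answer, say so explicitly. "
        "Cite the sources you used by name."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The question goes last deliberately, per the ordering described above, so the model addresses it with the retrieved material top of mind.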

Keep the system prompt focused and concise. The most frequent prompt engineering mistake in RAG systems is overloading the system instructions to the point where the model starts ignoring the retrieved context or hallucinating despite having accurate information available.

Edge case handling

  • No relevant results: When the reranker scores all results below your confidence threshold, respond with a clear "I do not have enough information to answer this" rather than generating a hallucinated response.
  • Context window overflow: If retrieved chunks exceed the LLM's context window, truncate or summarize the lowest-ranked chunks rather than discarding them entirely.
  • Multi-turn conversations: Carry conversation history and use it to reformulate the current query before retrieval. A follow-up like "What about the pricing?" is meaningless without the context of the previous question.

Step 8: Evaluation with RAGAS

You cannot improve a pipeline you cannot measure. RAG evaluation is notoriously challenging because it requires assessing both retrieval quality and generation quality simultaneously. The RAGAS framework (Retrieval Augmented Generation Assessment) has become the standard automated evaluation approach for RAG pipelines in Python.

Key RAGAS metrics

  • Faithfulness: Does the generated answer remain grounded in the retrieved context? A score of 0.95 means 95 percent of the claims in the answer trace back to the retrieved chunks. This is your primary defense against hallucination.
  • Answer relevancy: Does the generated response actually address the question asked? High scores indicate focused, on-topic answers rather than tangential output.
  • Context precision: Are the retrieved chunks relevant to the question? Low scores indicate noisy retrieval -- chunks that are semantically adjacent but not actually useful for generating the answer.
  • Context recall: Did the retrieval step capture all the information needed to answer the question fully? Low scores mean relevant data exists in your corpus but was not found.

Building your evaluation dataset

RAGAS needs a dataset of questions, ground-truth answers, retrieved contexts, and generated responses. For a production pipeline:

  1. Start with 50 to 100 question-answer pairs drawn from actual user queries. Source these from support tickets, search logs, or user interviews -- not from your own imagination.
  2. Include challenging cases: Questions spanning multiple documents, questions with no answer in the corpus, ambiguous phrasing, and queries requiring reasoning across several facts.
  3. Run evaluation on every change: New chunking strategy? Different embedding model? Revised system prompt? Run RAGAS before and after to quantify the impact. Integrate evaluation into your CI/CD pipeline with pytest assertions against RAGAS thresholds.

Teams that skip evaluation debug RAG failures in production through user complaints. Building an evaluation suite takes a day. Investigating a production hallucination incident takes a week and costs user trust you cannot easily rebuild. Invest in evaluation before you deploy.
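For the CI integration, a simple threshold check is enough to start. The scores below are hypothetical — in practice you would populate the dict from the output of a RAGAS evaluation run:

```python
# Minimum acceptable scores per metric; tune these to your use case.
THRESHOLDS = {
    "faithfulness": 0.85,
    "context_precision": 0.80,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
}


def check_regression(scores: dict[str, float]) -> list[str]:
    """Return the metrics that fell below their CI threshold (missing = failing)."""
    return [m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor]


# Hypothetical scores from an evaluation run on the golden dataset.
failures = check_regression({
    "faithfulness": 0.92,
    "context_precision": 0.78,
    "answer_relevancy": 0.88,
    "context_recall": 0.81,
})
```

In a pytest suite, a single `assert not failures, failures` then fails the build with the offending metric names, which is usually all the signal a reviewer needs.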

Step 9: Production Deployment

Moving from a functional RAG pipeline to a production system demands attention to performance, reliability, and operational visibility. This is where having engineers experienced with production AI systems makes a measurable difference.

Performance optimization

  • Caching: Cache retrieval results for frequent queries. If 20 percent of queries are repeated or near-duplicates, caching eliminates redundant embedding calls and vector searches.
  • Streaming: Stream LLM responses so users see output as it generates rather than waiting for the complete answer. This dramatically improves perceived responsiveness.
  • Async processing: Run embedding, retrieval, and reranking concurrently where dependencies allow. Python's asyncio with async-compatible database and API clients makes this straightforward.
  • Batch indexing: During ingestion, batch document processing, embedding calls, and vector upserts. Single-document processing is an order of magnitude slower than batched operations.
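An exact-match retrieval cache is a quick win along these lines. This sketch normalizes the query string before hashing; catching near-duplicates would additionally require an embedding-similarity check:

```python
import hashlib


class RetrievalCache:
    """Exact-match cache keyed on a normalized query string."""

    def __init__(self):
        self._store: dict[str, list[str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Case-fold and collapse whitespace so trivially different
        # phrasings of the same query hit the same entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def put(self, query: str, chunks: list[str]) -> None:
        self._store[self._key(query)] = chunks
```

In production you would back this with Redis and attach a TTL so cached results expire when the index is refreshed.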

Monitoring and observability

In production, you need visibility into every pipeline stage:

  • Per-phase latency: Measure time consumed by embedding, retrieval, reranking, and generation independently. When total response time degrades, you need to know which stage is the bottleneck.
  • Retrieval quality tracking: Log retrieved chunks and their relevance scores for every query. Alert on queries that consistently return low-relevance results.
  • Cost monitoring: Track LLM API cost per query. A single poorly optimized query path that retrieves too many chunks or calls an oversized model can cost 10x more than a tuned one.
  • User feedback loops: Add thumbs-up/thumbs-down controls to RAG responses. This direct signal is your most valuable input for identifying failure patterns and prioritizing improvements.

LangSmith, Phoenix (Arize), and Langfuse are the leading observability platforms for LLM applications in 2026. Integrate one from the start -- retrofitting observability into an already-deployed pipeline is significantly more painful than building it in from day one.

Reliability patterns

  • Model fallbacks: When your primary LLM is unavailable, route to a secondary model rather than returning an error. Claude to GPT-4o, or a cloud API to a self-hosted open-source model.
  • Circuit breakers: If your vector database or LLM API is failing, stop sending requests rather than accumulating thousands of timeouts. Fail fast, recover gracefully.
  • Rate limiting: Protect the pipeline from traffic bursts that could overwhelm your vector database or exhaust your LLM API quota.
  • Data freshness: Build automated re-indexing pipelines that keep vectors in sync with source data. Stale indexes are one of the most common production RAG failures -- a user asks about a document that was updated yesterday, but the pipeline still serves last month's version.

Common RAG Failures and Their Fixes

After building RAG pipelines across multiple production systems, these are the failure patterns that surface repeatedly.

Failure 1: Retrieving irrelevant context

Symptom: The LLM generates plausible but incorrect answers because the retrieved chunks are tangentially related rather than genuinely relevant.

Fix: Introduce reranking. If already reranking, tighten the relevance threshold. Also audit your chunking -- chunks that blend multiple topics produce misleading retrieval results.

Failure 2: Missing information that exists in the corpus

Symptom: Users ask answerable questions but the pipeline claims insufficient information.

Fix: Increase the initial retrieval count (top-k), add multi-query retrieval to rephrase the question, or implement hybrid search to catch keyword matches that pure vector search misses.

Failure 3: Hallucination with correct context present

Symptom: The right chunks are retrieved, but the model ignores them or fabricates information not present in the context.

Fix: Strengthen grounding instructions in the system prompt ("Answer only from the provided context. If it does not contain the answer, say so explicitly."). Reduce the number of retrieved chunks to minimize distraction. If the problem persists, switch to a model with stronger instruction adherence.

Failure 4: Unacceptable response latency

Symptom: Queries take 10 or more seconds, rendering the application unusable for interactive scenarios.

Fix: Profile each pipeline stage individually. Typical bottlenecks: embedding API latency (solve with batching and caching), slow vector search on unoptimized indexes (configure HNSW parameters properly), reranking across too many candidates (reduce the initial retrieval count), and LLM generation with excessive context (feed fewer, higher-quality chunks).

Failure 5: Strong dev performance, weak production results

Symptom: The pipeline performs well on test questions but struggles with real user queries, which tend to be shorter, more ambiguous, or phrased in different vocabulary than your indexed documents.

Fix: Build your evaluation dataset from actual user queries, not synthetic ones. Add query preprocessing that expands abbreviations, resolves ambiguity, and reformulates vague questions before they hit retrieval.

Production Readiness Checklist

Before deploying a RAG pipeline to production, verify that each of these items is addressed:

  • Ingestion supports incremental updates, not just full re-indexing
  • Chunking strategy has been tested against your actual documents, not just sample data
  • Embedding model is evaluated on domain-specific queries, not only general benchmarks
  • Vector database is provisioned for your current corpus plus 2 to 3x projected growth
  • Retrieval uses hybrid search or multi-query strategies beyond basic top-k
  • Reranking filters low-relevance results before they consume LLM context
  • RAGAS evaluation suite runs automatically on every pipeline change
  • Monitoring covers latency per stage, cost per query, retrieval quality, and user feedback
  • Fallback and circuit breaker patterns handle component failures without user-facing errors
  • Automated re-indexing keeps the vector database synchronized with source data

Conclusion

Building a RAG pipeline in Python is straightforward at the prototype stage and deceptively complex at the production stage. The frameworks are mature -- LangChain, LlamaIndex, and the vector database ecosystem provide solid foundations. The difficult part is making the right decisions at each step: chunking strategies that suit your document types, embedding models that perform on your domain, retrieval strategies that handle the diversity of real user queries, and evaluation practices that catch failures before users do.

Start simple. Fixed-size chunking, a single embedding model, basic top-k retrieval, a managed vector database. Get the pipeline running end to end, measure with RAGAS, identify the weakest stage, and improve it. Iterate based on actual user queries, not assumptions about what users will ask.

The teams that build the strongest RAG systems are not the ones with the most elaborate architecture. They are the ones that ship early, measure rigorously, and improve iteratively. Every improvement to retrieval quality compounds: better chunks produce better retrieval, which yields better context, which generates better answers, which earns deeper user trust.

At DSi, our AI engineers build production RAG pipelines for engineering teams across industries. Whether you are starting from a blank slate or scaling an existing prototype, our team of 300+ engineers can help you move from retrieval experiments to production-grade systems. Talk to our AI engineering team about the right approach for your pipeline.

Frequently Asked Questions

Which vector database should I use for RAG?

There is no single best option — it depends on your requirements. For rapid prototyping and small datasets, Chroma is the simplest to start with because it runs in-process with no external infrastructure. For production workloads requiring managed scaling, Pinecone offers a fully hosted solution with minimal operational overhead. Weaviate provides a strong balance of features and self-hosting flexibility. If your team already runs PostgreSQL, pgvector lets you add vector search without introducing a new database. Most teams start with Chroma during development and migrate to Pinecone or Weaviate for production.

What chunk size should I use?

Chunk size depends on your document type and retrieval goals. Smaller chunks (200 to 500 tokens) provide more precise retrieval but can lose surrounding context. Larger chunks (500 to 1500 tokens) preserve more context but may dilute relevance and consume more of the LLM context window. The best approach is to start with 500 to 800 tokens with 100 to 200 token overlap, then evaluate retrieval quality on your specific data using metrics like context precision and context recall from the RAGAS framework. Semantic chunking often outperforms fixed-size chunking for unstructured documents.

Should I use LangChain or LlamaIndex?

Both frameworks are mature options in 2026. LangChain is more general-purpose and better suited if your project involves complex agent orchestration, tool use, and multi-step chains beyond just RAG. LlamaIndex is purpose-built for data indexing and retrieval, making it a more focused choice if RAG is your primary use case. Many production teams use LlamaIndex for the retrieval pipeline and LangChain for the broader application orchestration layer. Either can work — choose based on your team's existing familiarity and the scope of your project.

How do I evaluate a RAG pipeline?

Use the RAGAS framework, which provides automated metrics specifically designed for RAG evaluation. The key metrics are faithfulness (does the answer stay grounded in the retrieved context), answer relevancy (does the answer address the question), context precision (are the retrieved chunks relevant), and context recall (did retrieval capture all needed information). Run these metrics against a golden dataset of question-answer pairs specific to your domain. Scores above 0.8 on faithfulness and context precision are a reasonable production threshold, but the target depends on your use case.

What are the most common RAG failures?

The most common failures are poor chunking strategies that destroy meaning, using generic embeddings when the domain requires specialized ones, retrieving too few or too many chunks, and not implementing reranking to filter out low-relevance results. Other frequent issues include stale data from missing re-indexing pipelines, exceeding LLM context windows by stuffing too many chunks into the prompt, and hallucination from tangentially related but not actually relevant context. Most of these failures are preventable with proper evaluation metrics and testing against real user queries before deployment.