RAG Architecture: How to Design a Production Retrieval Pipeline
Retrieval-augmented generation (RAG) connects large language models to external knowledge — but the quality of that connection depends entirely on how you design the pipeline between your documents and your LLM. This guide walks through RAG architecture end-to-end, from ingestion through generation, with a focus on where production systems fail and how to avoid those failures.
The two pipelines
Every RAG system has two distinct pipelines:
Offline: ingestion pipeline — runs when documents change
Document → Parse → Chunk → Embed → Store (vector DB)
Online: query pipeline — runs on every user question
Query → Embed → Retrieve → Rerank (optional) → Generate
The ingestion pipeline determines what's available for retrieval. The query pipeline determines what's selected and how it's presented to the LLM. Both matter, but mistakes in ingestion are harder to fix because they affect every downstream query.
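A minimal way to picture the split is two functions, one run at ingestion time and one per query. Every component below is a placeholder object for illustration, not a specific library's API:

# Sketch of the two pipelines; all components are placeholders
def ingest(documents, parser, chunker, embedder, vector_db):
    """Offline: runs whenever source documents change."""
    for doc in documents:
        text = parser.parse(doc)            # Document -> clean text
        chunks = chunker.chunk(text)        # text -> retrieval units
        vectors = embedder.embed(chunks)    # units -> dense vectors
        vector_db.upsert(chunks, vectors)   # store vectors + metadata

def answer(query, embedder, vector_db, reranker, llm, top_k=5):
    """Online: runs on every user question."""
    query_vector = embedder.embed([query])[0]
    candidates = vector_db.search(query_vector, limit=20)
    context = reranker.rerank(query, candidates)[:top_k]  # reranking is optional
    prompt = f"Answer from this context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)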
Component breakdown
1. Document ingestion
The first step is getting clean text out of your source documents. PDFs, Word files, HTML pages, and Markdown all have different structures, and the quality of your parser directly affects everything downstream.
Common tools: Unstructured.io, Docling, Apache Tika, PyMuPDF, POMA PrimeCut.
What breaks: PDF parsing is the most common failure point. Tables get linearized into nonsense, headers are missed, and multi-column layouts produce interleaved text. If your retrieval quality is poor, start debugging here — not at the chunking or embedding layer.
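As a baseline, plain-text extraction with PyMuPDF (one of the tools listed above) takes only a few lines. The sketch below is illustrative, and it shows exactly what naive extraction lacks: no table structure, no column awareness, no header hierarchy.

# Sketch: baseline text extraction with PyMuPDF
import fitz  # PyMuPDF

def extract_text(pdf_path):
    with fitz.open(pdf_path) as doc:
        # get_text() returns plain text per page; tables and multi-column
        # layouts are flattened, which is where naive extraction breaks down
        return "\n".join(page.get_text() for page in doc)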
For a deeper dive, see the document ingestion guide.
2. Chunking
Chunking splits parsed documents into retrieval units — the pieces your vector database stores and your retriever returns. This is the most consequential design decision in RAG architecture, because it determines both what the LLM sees and how much context it gets.
The standard approach is to cut text into fixed-size or recursively-split fragments:
# Typical LangChain chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(docs)
This works for prototyping, but production systems hit three recurring problems:
- Context fragmentation — section headers get separated from content
- Table destruction — rows split across chunks
- Overlap inflation — near-duplicate chunks degrade retrieval precision
Hierarchical alternative: Instead of slicing text, POMA PrimeCut preserves the document's structure as chunksets — complete root-to-leaf paths that keep every sentence linked to its structural context. No overlap needed, no orphaned facts, no parameter tuning.
# POMA hierarchical chunking
from poma import PrimeCut
pc = PrimeCut()
result = pc.process("policy.pdf")
# → chunksets with full document lineage
For a comprehensive comparison of all chunking strategies, see the chunking strategies guide.
3. Embedding
Embedding converts text chunks into dense vectors that capture semantic meaning. The embedding model determines how well your retrieval can match queries to relevant content.
Key decisions:
- Model choice — OpenAI text-embedding-3-small/large, Cohere embed-v4, open-source models like BGE-M3 or E5. Match your model to your domain and language requirements.
- Dimensionality — higher dimensions capture more nuance but increase storage and latency. 768–1536 dimensions is the common range.
- Batch processing — embed at ingestion time, not query time. Store vectors alongside metadata in your vector database.
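To make the batch-processing point concrete, here is a minimal sketch using the OpenAI embeddings client; the model name is one of the options above, and the batch size is an illustrative choice:

# Sketch: embed chunks in batches at ingestion time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_chunks(chunks, batch_size=100):
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors  # store alongside chunk text and metadata in the vector DB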
What breaks: Embedding quality is bounded by chunk quality. If your chunks are incoherent fragments, even the best embedding model can't produce meaningful vectors. Fix chunking first, then optimize embeddings.
4. Vector storage and retrieval
Vector databases store your embedded chunks and handle similarity search at query time. The retrieval step finds the most relevant chunks for a given query.
Common options: Qdrant, Pinecone, Weaviate, Chroma, pgvector, TurboPuffer.
Retrieval strategies:
- Dense retrieval (ANN) — approximate nearest neighbor search on embeddings. Fast, handles semantic similarity well.
- Sparse retrieval (BM25) — keyword matching. Catches exact terms that dense search misses.
- Hybrid — combine dense + sparse results, typically with Reciprocal Rank Fusion (RRF). Best of both worlds.
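RRF itself is simple enough to sketch in a few lines. The k=60 constant is the value commonly used; the function below is an illustrative implementation, not any particular library's:

# Sketch: Reciprocal Rank Fusion over any number of ranked result lists
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Each list is an ordered sequence of document IDs, best first."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse dense (ANN) and sparse (BM25) results:
# fused_ids = reciprocal_rank_fusion([dense_ids, bm25_ids])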
Reranking: After initial retrieval, a cross-encoder reranker (e.g., BGE-reranker, Cohere Rerank) scores each candidate against the actual query. This dramatically improves precision — the reranker sees the full query-document pair, not just vector similarity.
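A minimal reranking sketch, assuming the sentence-transformers CrossEncoder wrapper and an open-source BGE reranker checkpoint (Cohere Rerank would be an API call instead):

# Sketch: cross-encoder reranking of retrieved candidates
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query, candidates, top_k=5):
    """candidates: list of chunk texts from the initial retrieval step."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]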
What breaks: Retrieval returns the wrong chunks when chunking loses context. A chunk containing "employees are eligible for..." without the preceding "Health Insurance" header matches too many queries. Hierarchical chunking avoids this by keeping the structural path attached.
5. Generation
The LLM receives the retrieved context in its prompt and generates a grounded answer. The prompt structure matters:
System: You are a helpful assistant. Answer based on the provided context.
Context: [retrieved chunks inserted here]
User: [original question]
What breaks:
- "Lost in the middle" — LLMs attend more to the beginning and end of context. If the relevant information is buried in the middle of many chunks, accuracy drops.
- Context overflow — stuffing too many chunks into the prompt wastes tokens and dilutes attention. Fewer, higher-quality chunks outperform more, lower-quality ones.
- Hallucination from missing context — if chunking lost the structural context (e.g., "the above policy" without the policy), the LLM confabulates.
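A common mitigation for the first two failure modes is to cap the number of chunks and reorder them so the strongest ones sit at the start and end of the prompt, where attention is highest. A minimal sketch (the selection size is an illustrative choice):

# Sketch: place the highest-scoring chunks at the edges of the context
def order_for_context(chunks_by_score, max_chunks=6):
    """chunks_by_score: chunk texts sorted best-first by retrieval/rerank score."""
    selected = chunks_by_score[:max_chunks]
    front, back = [], []
    for i, chunk in enumerate(selected):
        (front if i % 2 == 0 else back).append(chunk)
    # best chunks end up at the start and end, weakest in the middle
    return front + back[::-1]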
Hierarchical chunksets help here because they produce fewer, more complete retrieval units. A chunkset carries its full lineage, so the LLM gets structured context rather than a pile of fragments.
Common RAG architecture patterns
Naive RAG
The simplest pattern: chunk → embed → retrieve → generate. No reranking, no hybrid search, fixed chunk sizes. Good for prototyping, inadequate for production.
Advanced RAG
Adds pre-retrieval optimization (query rewriting, HyDE), hybrid retrieval (dense + sparse), reranking, and careful chunking. Most production systems land here.
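As one example of pre-retrieval optimization, HyDE has the LLM draft a hypothetical answer and retrieves against that draft instead of the raw query. A hedged sketch using the OpenAI client; the model names are placeholders, not recommendations:

# Sketch: HyDE (Hypothetical Document Embeddings)
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(query):
    # 1. Ask the LLM to draft a short hypothetical answer passage
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content
    # 2. Embed the draft instead of the raw query and use it for retrieval
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=[draft],
    ).data[0].embedding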
Modular RAG
Treats each pipeline component as a swappable module. Different document types get different parsers and chunking strategies. Retrieval can route to different indexes based on query type. This is where POMA fits naturally — it replaces the chunking module without requiring changes to your embedding, retrieval, or generation layers.
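A hedged illustration of the modular idea: define chunking behind a small interface so implementations can be swapped without touching embedding, retrieval, or generation. The class names below are illustrative, not a published API:

# Sketch: chunking as a swappable module (names are illustrative)
from typing import Protocol

class Chunker(Protocol):
    def chunk(self, document_path: str) -> list[str]: ...

class FixedSizeChunker:
    def chunk(self, document_path: str) -> list[str]:
        ...  # wrap RecursiveCharacterTextSplitter here

class HierarchicalChunker:
    def chunk(self, document_path: str) -> list[str]:
        ...  # wrap POMA PrimeCut here

def ingest(document_path: str, chunker: Chunker):
    chunks = chunker.chunk(document_path)
    # embed and store as before; nothing downstream changes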
6. Evaluation and feedback
Production RAG systems need a way to measure retrieval quality and catch regressions. Without evaluation, you're flying blind — changes to chunking, embedding models, or retrieval parameters can silently degrade answer quality.
Key metrics:
- Retrieval precision/recall — are the right chunks being returned?
- Answer faithfulness — is the LLM's response grounded in the retrieved context?
- Answer relevancy — does the response actually address the query?
Tools like RAGAS, DeepEval, and TruLens automate these checks. Build evaluation into your CI pipeline — run it on a fixed question set after every ingestion pipeline change.
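For a framework-free sense of what these tools automate, here is a sketch of retrieval recall@k over a fixed, labeled question set; the data format and threshold are assumptions for illustration:

# Sketch: retrieval recall@k over a fixed, labeled question set
def recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of {"question": str, "relevant_ids": set of chunk IDs}.
    retrieve: function mapping a question to a ranked list of chunk IDs."""
    hits = 0
    for item in eval_set:
        retrieved = set(retrieve(item["question"])[:k])
        if retrieved & item["relevant_ids"]:
            hits += 1
    return hits / len(eval_set)

# Fail CI if a pipeline change drops recall below a fixed baseline, e.g.:
# assert recall_at_k(eval_set, retrieve) >= 0.85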
Where most RAG systems fail
Across hundreds of RAG deployments, the failure pattern is remarkably consistent:
- Chunking, not retrieval, is the bottleneck. Teams spend weeks tuning embedding models and retrieval parameters when the real problem is that their chunks are incoherent fragments.
- Tables and structured data break everything. Financial tables, compliance matrices, and technical specs are where RAG accuracy matters most — and where naive chunking fails hardest.
- Overlap is a symptom, not a solution. If you need 20% overlap to avoid losing context at boundaries, your chunking strategy is treating the symptom. Hierarchical chunking eliminates the boundary problem entirely.
- Token efficiency matters at scale. Returning 10 chunks at 512 tokens each is 5,120 tokens of context. Hierarchical chunksets typically achieve the same or better accuracy with 77% fewer tokens, directly reducing LLM costs.
Continue reading
- RAG chunking strategies — compare all chunking approaches side by side
- Document ingestion guide — deep dive into the parsing layer
- Chunksets explained — how POMA's hierarchical approach works
- SDK Quickstart — integrate PrimeCut in four lines of Python
- LangChain integration — drop-in replacement for RecursiveCharacterTextSplitter
- LlamaIndex integration — drop-in replacement for SentenceSplitter
- Try PrimeCut for free — 1,000 pages free, no credit card required
- Pricing — from €0.003/page