RAG Architecture: How to Design a Production Retrieval Pipeline
Retrieval-augmented generation (RAG) connects large language models to external knowledge — but the quality of that connection depends entirely on how you design the pipeline between your documents and your LLM. This guide walks through RAG architecture end-to-end, from ingestion through generation, with a focus on where production systems fail and how to avoid those failures.
The two pipelines
Every RAG system has two distinct pipelines:
Offline: ingestion pipeline — runs when documents change
Document → Parse → Chunk → Embed → Store (vector DB)
Online: query pipeline — runs on every user question
Query → Embed → Retrieve → Rerank (optional) → Generate
The ingestion pipeline determines what's available for retrieval. The query pipeline determines what's selected and how it's presented to the LLM. Both matter, but mistakes in ingestion are harder to fix because they affect every downstream query.
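A minimal way to picture the split is two functions, one run at ingestion time and one per query. Every component below is a placeholder object for illustration, not a specific library's API:

# Sketch of the two pipelines; all components are placeholders
def ingest(documents, parser, chunker, embedder, vector_db):
    """Offline: runs whenever source documents change."""
    for doc in documents:
        text = parser.parse(doc)            # Document -> clean text
        chunks = chunker.chunk(text)        # text -> retrieval units
        vectors = embedder.embed(chunks)    # units -> dense vectors
        vector_db.upsert(chunks, vectors)   # store vectors + metadata

def answer(query, embedder, vector_db, reranker, llm, top_k=5):
    """Online: runs on every user question."""
    query_vector = embedder.embed([query])[0]
    candidates = vector_db.search(query_vector, limit=20)
    context = reranker.rerank(query, candidates)[:top_k]  # reranking is optional
    prompt = f"Answer from this context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)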
Component breakdown
1. Document ingestion
The first step is getting clean text out of your source documents. PDFs, Word files, HTML pages, and Markdown all have different structures, and the quality of your parser directly affects everything downstream.
Common tools: Unstructured.io, Docling, Apache Tika, PyMuPDF, POMA PrimeCut.
What breaks: PDF parsing is the most common failure point. Tables get linearized into nonsense, headers are missed, and multi-column layouts produce interleaved text. If your retrieval quality is poor, start debugging here — not at the chunking or embedding layer.
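As a baseline, plain-text extraction with PyMuPDF (one of the tools listed above) takes only a few lines. The sketch below is illustrative, and it shows exactly what naive extraction lacks: no table structure, no column awareness, no header hierarchy.

# Sketch: baseline text extraction with PyMuPDF
import fitz  # PyMuPDF

def extract_text(pdf_path):
    with fitz.open(pdf_path) as doc:
        # get_text() returns plain text per page; tables and multi-column
        # layouts are flattened, which is where naive extraction breaks down
        return "\n".join(page.get_text() for page in doc)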
For a deeper dive, see the document ingestion guide.
2. Chunking
Chunking splits parsed documents into retrieval units — the pieces your vector database stores and your retriever returns. This is the most consequential design decision in RAG architecture, because it determines both what the LLM sees and how much context it gets.
The standard approach is to cut text into fixed-size or recursively-split fragments:
# Typical LangChain chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(docs)
This works for prototyping, but production systems hit three recurring problems:
- Context fragmentation — section headers get separated from content
- Table destruction — rows split across chunks
- Overlap inflation — near-duplicate chunks degrade retrieval precision
Hierarchical alternative: Instead of slicing text, POMA PrimeCut preserves the document's structure as chunksets — complete root-to-leaf paths that keep every sentence linked to its structural context. No overlap needed, no orphaned facts, no parameter tuning.
# POMA hierarchical chunking
from poma import PrimeCut
pc = PrimeCut()
result = pc.process("policy.pdf")
# → chunksets with full document lineage
For a comprehensive comparison of all chunking strategies, see the chunking strategies guide.
3. Embedding
Embedding converts text chunks into dense vectors that capture semantic meaning. The embedding model determines how well your retrieval can match queries to relevant content.
Key decisions:
- Model choice — OpenAI text-embedding-3-small/large, Cohere embed-v4, open-source models like BGE-M3 or E5. Match your model to your domain and language requirements.
- Dimensionality — higher dimensions capture more nuance but increase storage and latency. 768–1536 dimensions is the common range.
- Batch processing — embed at ingestion time, not query time. Store vectors alongside metadata in your vector database.
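To make the batch-processing point concrete, here is a minimal sketch using the OpenAI embeddings client; the model name is one of the options above, and the batch size is an illustrative choice:

# Sketch: embed chunks in batches at ingestion time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_chunks(chunks, batch_size=100):
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors  # store alongside chunk text and metadata in the vector DB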
What breaks: Embedding quality is bounded by chunk quality. If your chunks are incoherent fragments, even the best embedding model can't produce meaningful vectors. Fix chunking first, then optimize embeddings.
4. Vector storage and retrieval
Vector databases store your embedded chunks and handle similarity search at query time. The retrieval step finds the most relevant chunks for a given query.
Common options: Qdrant, Pinecone, Weaviate, Chroma, pgvector, TurboPuffer.
Retrieval strategies:
- Dense retrieval (ANN) — approximate nearest neighbor search on embeddings. Fast, handles semantic similarity well.
- Sparse retrieval (BM25) — keyword matching. Catches exact terms that dense search misses.
- Hybrid — combine dense + sparse results, typically with Reciprocal Rank Fusion (RRF). Best of both worlds.
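RRF itself is simple enough to sketch in a few lines. The k=60 constant is the value commonly used; the function below is an illustrative implementation, not any particular library's:

# Sketch: Reciprocal Rank Fusion over any number of ranked result lists
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Each list is an ordered sequence of document IDs, best first."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse dense (ANN) and sparse (BM25) results:
# fused_ids = reciprocal_rank_fusion([dense_ids, bm25_ids])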
Reranking: After initial retrieval, a cross-encoder reranker (e.g., BGE-reranker, Cohere Rerank) scores each candidate against the actual query. This dramatically improves precision — the reranker sees the full query-document pair, not just vector similarity.
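A minimal reranking sketch, assuming the sentence-transformers CrossEncoder wrapper and an open-source BGE reranker checkpoint (Cohere Rerank would be an API call instead):

# Sketch: cross-encoder reranking of retrieved candidates
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query, candidates, top_k=5):
    """candidates: list of chunk texts from the initial retrieval step."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]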
What breaks: Retrieval returns the wrong chunks when chunking loses context. A chunk containing "employees are eligible for..." without the preceding "Health Insurance" header matches too many queries. Hierarchical chunking avoids this by keeping the structural path attached.
5. Generation
The LLM receives the retrieved context in its prompt and generates a grounded answer. The prompt structure matters:
System: You are a helpful assistant. Answer based on the provided context.
Context: [retrieved chunks inserted here]
User: [original question]
What breaks:
- "Lost in the middle" — LLMs attend more to the beginning and end of context. If the relevant information is buried in the middle of many chunks, accuracy drops.
- Context overflow — stuffing too many chunks into the prompt wastes tokens and dilutes attention. Fewer, higher-quality chunks outperform more, lower-quality ones.
- Hallucination from missing context — if chunking lost the structural context (e.g., "the above policy" without the policy), the LLM confabulates.
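A common mitigation for the first two failure modes is to cap the number of chunks and reorder them so the strongest ones sit at the start and end of the prompt, where attention is highest. A minimal sketch (the selection size is an illustrative choice):

# Sketch: place the highest-scoring chunks at the edges of the context
def order_for_context(chunks_by_score, max_chunks=6):
    """chunks_by_score: chunk texts sorted best-first by retrieval/rerank score."""
    selected = chunks_by_score[:max_chunks]
    front, back = [], []
    for i, chunk in enumerate(selected):
        (front if i % 2 == 0 else back).append(chunk)
    # best chunks end up at the start and end, weakest in the middle
    return front + back[::-1]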
Hierarchical chunksets help here because they produce fewer, more complete retrieval units. A chunkset carries its full lineage, so the LLM gets structured context rather than a pile of fragments.
Common RAG architecture patterns
Naive RAG
The simplest pattern: chunk → embed → retrieve → generate. No reranking, no hybrid search, fixed chunk sizes. Good for prototyping, inadequate for production.
Advanced RAG
Adds pre-retrieval optimization (query rewriting, HyDE), hybrid retrieval (dense + sparse), reranking, and careful chunking. Most production systems land here.
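As one example of pre-retrieval optimization, HyDE has the LLM draft a hypothetical answer and retrieves against that draft instead of the raw query. A hedged sketch using the OpenAI client; the model names are placeholders, not recommendations:

# Sketch: HyDE (Hypothetical Document Embeddings)
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(query):
    # 1. Ask the LLM to draft a short hypothetical answer passage
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content
    # 2. Embed the draft instead of the raw query and use it for retrieval
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=[draft],
    ).data[0].embedding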
Modular RAG
Treats each pipeline component as a swappable module. Different document types get different parsers and chunking strategies. Retrieval can route to different indexes based on query type. This is where POMA fits naturally — it replaces the chunking module without requiring changes to your embedding, retrieval, or generation layers.
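A hedged illustration of the modular idea: define chunking behind a small interface so implementations can be swapped without touching embedding, retrieval, or generation. The class names below are illustrative, not a published API:

# Sketch: chunking as a swappable module (names are illustrative)
from typing import Protocol

class Chunker(Protocol):
    def chunk(self, document_path: str) -> list[str]: ...

class FixedSizeChunker:
    def chunk(self, document_path: str) -> list[str]:
        ...  # wrap RecursiveCharacterTextSplitter here

class HierarchicalChunker:
    def chunk(self, document_path: str) -> list[str]:
        ...  # wrap POMA PrimeCut here

def ingest(document_path: str, chunker: Chunker):
    chunks = chunker.chunk(document_path)
    # embed and store as before; nothing downstream changes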
6. Evaluation and feedback
Production RAG systems need a way to measure retrieval quality and catch regressions. Without evaluation, you're flying blind — changes to chunking, embedding models, or retrieval parameters can silently degrade answer quality.
Key metrics:
- Retrieval precision/recall — are the right chunks being returned?
- Answer faithfulness — is the LLM's response grounded in the retrieved context?
- Answer relevancy — does the response actually address the query?
Tools like RAGAS, DeepEval, and TruLens automate these checks. Build evaluation into your CI pipeline — run it on a fixed question set after every ingestion pipeline change.
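For a framework-free sense of what these tools automate, here is a sketch of retrieval recall@k over a fixed, labeled question set; the data format and threshold are assumptions for illustration:

# Sketch: retrieval recall@k over a fixed, labeled question set
def recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of {"question": str, "relevant_ids": set of chunk IDs}.
    retrieve: function mapping a question to a ranked list of chunk IDs."""
    hits = 0
    for item in eval_set:
        retrieved = set(retrieve(item["question"])[:k])
        if retrieved & item["relevant_ids"]:
            hits += 1
    return hits / len(eval_set)

# Fail CI if a pipeline change drops recall below a fixed baseline, e.g.:
# assert recall_at_k(eval_set, retrieve) >= 0.85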
Where most RAG systems fail
Across hundreds of RAG deployments, the failure pattern is remarkably consistent:
- Chunking, not retrieval, is the bottleneck. Teams spend weeks tuning embedding models and retrieval parameters when the real problem is that their chunks are incoherent fragments.
- Tables and structured data break everything. Financial tables, compliance matrices, and technical specs are where RAG accuracy matters most — and where naive chunking fails hardest.
- Overlap is a symptom, not a solution. If you need 20% overlap to avoid losing context at boundaries, your chunking strategy is treating the symptom. Hierarchical chunking eliminates the boundary problem entirely.
- Token efficiency matters at scale. Returning 10 chunks at 512 tokens each is 5,120 tokens of context. Hierarchical chunksets typically achieve the same or better accuracy with 77% fewer tokens, directly reducing LLM costs.
Continue reading
- RAG chunking strategies — compare all chunking approaches side by side
- Document ingestion guide — deep dive into the parsing layer
- Chunksets explained — how POMA's hierarchical approach works
- SDK Quickstart — integrate PrimeCut in four lines of Python
- LangChain integration — drop-in replacement for RecursiveCharacterTextSplitter
- LlamaIndex integration — drop-in replacement for SentenceSplitter
- Try PrimeCut for free — 1,000 pages free, no credit card required
- Pricing — from €0.003/page