Ingestion and chunking as one system
If you care about correctness, latency, and token spend, you need to design document ingestion and chunking for RAG as one coherent system.
The end-to-end processing chain
In a well-designed RAG system, the flow looks like this:
- Data ingestion: pull data from source systems, decode formats, and normalize content into a structured representation.
- Chunking: split the normalized content into retrieval-sized units aligned with logical boundaries instead of arbitrary character limits.
- Embedding: generate embeddings for each chunk with the chosen embedding model.
- Indexing: store chunks and embeddings in vector and keyword indexes.
- Retrieval and generation: retrieve the most relevant context, compose the prompt, and call the LLM.
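The five stages can be sketched as a minimal skeleton. Every function body here is an illustrative stub under stated assumptions: a real system would use an actual parser, embedding model, and vector store rather than these toy stand-ins.

```python
def ingest(raw: bytes) -> str:
    """Decode and normalize source content into plain text (stub)."""
    return raw.decode("utf-8").strip()

def chunk(text: str) -> list[str]:
    """Split on paragraph boundaries -- a stand-in for structure-aware splitting."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def embed(chunks: list[str]) -> list[list[float]]:
    """Toy embedding: bag-of-characters counts. Real systems call a model."""
    return [[float(c.count(ch)) for ch in "etaoin"] for c in chunks]

def index(chunks: list[str], vectors: list[list[float]]):
    """Store chunk/vector pairs. A real system writes to vector + keyword indexes."""
    return list(zip(chunks, vectors))

def retrieve(store, query_vec: list[float], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors score highest against the query (dot product)."""
    scored = sorted(store, key=lambda cv: -sum(a * b for a, b in zip(cv[1], query_vec)))
    return [c for c, _ in scored[:k]]
```

The point of the skeleton is the interface between stages: chunking sits between ingestion and indexing, and everything downstream inherits the quality of its output.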
Without chunking between ingestion and indexing, retrieval quality degrades and token costs explode.
What the system needs to do well
An effective design should:
- respect logical units instead of arbitrary character cuts
- leverage document types and preserved structure from ingestion
- control chunk size and overlap explicitly rather than inheriting blind defaults
- make sure the tokens you pay for during embedding and generation actually improve answers
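Controlling size and overlap explicitly can be as simple as a sliding window over tokens. This is a hedged sketch, not a recommendation of fixed-size chunking itself; the parameters `size` and `overlap` are illustrative names, and the "tokens" here are whatever units your tokenizer produces.

```python
def split_with_overlap(tokens: list, size: int = 200, overlap: int = 20) -> list[list]:
    """Sliding window over tokens with an explicit chunk size and overlap,
    instead of inheriting a framework's blind defaults."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Making these two numbers explicit, and tuning them against retrieval quality, is how you verify that the tokens you pay for actually improve answers.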
Why "just ingest it and let the framework handle the rest" fails
That shortcut is attractive, but expensive. Consider a typical scenario: you extract text from a 40-page contract using a general-purpose parser. The parser returns a flat string — section numbers, headings, body paragraphs, table cells, and footnotes all collapsed into one stream. By the time a chunker sees this text, there is no way to distinguish a section heading from a footnote. Every chunking strategy you apply downstream is working with degraded input.
The consequences compound at each stage:
- Chunking splits on character or token boundaries because it has no structural signal to split on. Clauses land in different chunks than their governing definitions.
- Embedding produces vectors that blend unrelated topics within the same chunk, diluting similarity scores for precise queries.
- Retrieval returns chunks that are partially relevant at best. The model receives context where only a fraction of the tokens actually answer the question.
- Generation hallucinates or hedges because the retrieved context is noisy. You increase `top_k` to compensate, which raises token cost without proportionally improving accuracy.
This is the "flat ingest" trap: the information is technically in the system, but retrieval cannot surface it cleanly.
What integrated design looks like
When ingestion and chunking share a representation, the pipeline changes fundamentally:
- Ingestion recovers the document's hierarchy — headings, subheadings, list nesting, table structure — and passes that hierarchy forward, not just raw text.
- Chunking uses the hierarchy to define boundaries. A chunk never splits a logical unit. Each chunk carries its ancestry (e.g. "Chapter 3 → Section 3.2 → Paragraph 4") as metadata.
- Retrieval returns chunks that are self-contained. The model gets clean, contextually grounded input without needing to infer where a chunk came from.
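A minimal sketch of hierarchy-aware chunking, assuming ingestion has already recovered the document tree. The `Node` type and `hierarchy_chunks` function are hypothetical names for illustration, not an actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One recovered structural unit: a chapter, section, or paragraph."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def hierarchy_chunks(node: Node, ancestry: tuple = ()) -> list[dict]:
    """Walk the document tree. Each body of text becomes one chunk that never
    splits a logical unit and carries its full ancestry as metadata."""
    path = ancestry + (node.title,)
    chunks = []
    if node.text:
        chunks.append({"text": node.text, "ancestry": " → ".join(path)})
    for child in node.children:
        chunks.extend(hierarchy_chunks(child, path))
    return chunks
```

Because boundaries come from the tree rather than a character count, a clause and its governing definition stay in the same chunk, and every chunk can report exactly where in the document it came from.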
POMA AI's PrimeCut is built around this principle. The ingestion step reconstructs document hierarchy before chunking begins, so chunks always respect structural boundaries. The result is fewer tokens per query (often 70–80% fewer) with equal or higher recall.
The practical levers
Done right, document ingestion and chunking become the main levers for:
- higher recall — chunks align with the units users actually ask about
- fewer hallucinations — context arrives without contradictory fragments from unrelated sections
- lower token spend — no redundant overlap, no irrelevant padding
- more reliable grounding — every chunk traces back to a specific location in the source document
TL;DR
Design ingestion and chunking as one system: recover document structure during ingestion, chunk along that structure, and carry ancestry metadata forward. Retrieval quality goes up and token spend goes down.
Continue reading
- Tooling comparison — see how different tools handle the ingestion-chunking boundary
- Why RAG needs chunking — the fundamentals
- The full ingestion guide — end-to-end deep dive
- PrimeCut product page