Ingestion and chunking as one system
If you care about correctness, latency, and token spend, you need to design document ingestion and chunking for RAG as one coherent system.
The end-to-end processing chain
In a well-designed RAG system, the flow looks like this:
- Data ingestion: pull data from source systems, decode formats, and normalize content into a structured representation.
- Chunking: split the normalized content into retrieval-sized units aligned with logical boundaries instead of arbitrary character limits.
- Embedding: generate embeddings for each chunk with the chosen embedding model.
- Indexing: store chunks and embeddings in vector and keyword indexes.
- Retrieval and generation: retrieve the most relevant context, compose the prompt, and call the LLM.
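The five stages can be sketched as a minimal skeleton. Every function body here is an illustrative stub under stated assumptions: a real system would use an actual parser, embedding model, and vector store rather than these toy stand-ins.

```python
def ingest(raw: bytes) -> str:
    """Decode and normalize source content into plain text (stub)."""
    return raw.decode("utf-8").strip()

def chunk(text: str) -> list[str]:
    """Split on paragraph boundaries -- a stand-in for structure-aware splitting."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def embed(chunks: list[str]) -> list[list[float]]:
    """Toy embedding: bag-of-characters counts. Real systems call a model."""
    return [[float(c.count(ch)) for ch in "etaoin"] for c in chunks]

def index(chunks: list[str], vectors: list[list[float]]):
    """Store chunk/vector pairs. A real system writes to vector + keyword indexes."""
    return list(zip(chunks, vectors))

def retrieve(store, query_vec: list[float], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors score highest against the query (dot product)."""
    scored = sorted(store, key=lambda cv: -sum(a * b for a, b in zip(cv[1], query_vec)))
    return [c for c, _ in scored[:k]]
```

The point of the skeleton is the interface between stages: chunking sits between ingestion and indexing, and everything downstream inherits the quality of its output.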
Without chunking between ingestion and indexing, retrieval quality degrades and token costs explode.
What the system needs to do well
An effective design should:
- respect logical units instead of arbitrary character cuts
- leverage document types and preserved structure from ingestion
- control chunk size and overlap explicitly rather than inheriting blind defaults
- make sure the tokens you pay for during embedding and generation actually improve answers
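Controlling size and overlap explicitly can be as simple as a sliding window over tokens. This is a hedged sketch, not a recommendation of fixed-size chunking itself; the parameters `size` and `overlap` are illustrative names, and the "tokens" here are whatever units your tokenizer produces.

```python
def split_with_overlap(tokens: list, size: int = 200, overlap: int = 20) -> list[list]:
    """Sliding window over tokens with an explicit chunk size and overlap,
    instead of inheriting a framework's blind defaults."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Making these two numbers explicit, and tuning them against retrieval quality, is how you verify that the tokens you pay for actually improve answers.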
Why "just ingest it and let the framework handle the rest" fails
That shortcut is attractive, but expensive. Consider a typical scenario: you extract text from a 40-page contract using a general-purpose parser. The parser returns a flat string — section numbers, headings, body paragraphs, table cells, and footnotes all collapsed into one stream. By the time a chunker sees this text, there is no way to distinguish a section heading from a footnote. Every chunking strategy you apply downstream is working with degraded input.
The consequences compound at each stage:
- Chunking splits on character or token boundaries because it has no structural signal to split on. Clauses land in different chunks than their governing definitions.
- Embedding produces vectors that blend unrelated topics within the same chunk, diluting similarity scores for precise queries.
- Retrieval returns chunks that are partially relevant at best. The model receives context where only a fraction of the tokens actually answer the question.
- Generation hallucinates or hedges because the retrieved context is noisy. You increase `top_k` to compensate, which raises token cost without proportionally improving accuracy.
This is the "flat ingest" trap: the information is technically in the system, but retrieval cannot surface it cleanly.
What integrated design looks like
When ingestion and chunking share a representation, the pipeline changes fundamentally:
- Ingestion recovers the document's hierarchy — headings, subheadings, list nesting, table structure — and passes that hierarchy forward, not just raw text.
- Chunking uses the hierarchy to define boundaries. A chunk never splits a logical unit. Each chunk carries its ancestry (e.g. "Chapter 3 → Section 3.2 → Paragraph 4") as metadata.
- Retrieval returns chunks that are self-contained. The model gets clean, contextually grounded input without needing to infer where a chunk came from.
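A minimal sketch of hierarchy-aware chunking, assuming ingestion has already recovered the document tree. The `Node` type and `hierarchy_chunks` function are hypothetical names for illustration, not an actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One recovered structural unit: a chapter, section, or paragraph."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def hierarchy_chunks(node: Node, ancestry: tuple = ()) -> list[dict]:
    """Walk the document tree. Each body of text becomes one chunk that never
    splits a logical unit and carries its full ancestry as metadata."""
    path = ancestry + (node.title,)
    chunks = []
    if node.text:
        chunks.append({"text": node.text, "ancestry": " → ".join(path)})
    for child in node.children:
        chunks.extend(hierarchy_chunks(child, path))
    return chunks
```

Because boundaries come from the tree rather than a character count, a clause and its governing definition stay in the same chunk, and every chunk can report exactly where in the document it came from.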
POMA AI's PrimeCut is built around this principle. The ingestion step reconstructs document hierarchy before chunking begins, so chunks always respect structural boundaries. The result is fewer tokens per query (often 70–80% fewer) with equal or higher recall.
The practical levers
Done right, document ingestion and chunking become the main levers for:
- higher recall — chunks align with the units users actually ask about
- fewer hallucinations — context arrives without contradictory fragments from unrelated sections
- lower token spend — no redundant overlap, no irrelevant padding
- more reliable grounding — every chunk traces back to a specific location in the source document
TL;DR
Design ingestion and chunking as one system: recover document structure during ingestion, chunk along that structure, and carry ancestry metadata forward. Retrieval quality goes up and token spend goes down.
Continue reading
- Tooling comparison — see how different tools handle the ingestion-chunking boundary
- Why RAG needs chunking — the fundamentals
- The full ingestion guide — end-to-end deep dive
- PrimeCut product page