
The Ultimate Guide to RAG Chunking Strategies & Text Splitters

How to find the right chunking strategy, chunk size and chunk overlap

Retrieval-augmented generation (RAG) has been used as a way to turbocharge large language models (LLMs) since the early 2020s. By allowing LLMs to draw on information from sources not included in their training data, RAG solves the problems inherent in a static knowledge base.

But just as you can’t instantly absorb all the information in a book by glancing at it, RAG can’t magically transfer all the relevant information from Outside Source X into an LLM pipeline. The solution is called “chunking.”

Chunking is splitting large text into smaller units (“chunks”) so (1) embedding models don’t truncate your input, and (2) retrieval returns self-contained pieces that are actually useful for search and answering.

The sweet spot is chunks small enough for precise retrieval, but complete enough to read sensibly on their own. What confuses a human also confuses the model. Producing perfectly-sized, perfectly-composed chunks is a puzzle that has stumped researchers for years.

However, the Great Chunking Conundrum might have just been solved.


Chunking strategies: which one is the best?

Ever since RAG was first developed, researchers have experimented with a wide range of chunking strategies. Predictably, they’ve also argued over which strategy works best. Some approaches have been simplistic (e.g. character count), others have been more advanced (e.g. modality-specific), and all have come with both advantages and downsides.

And theoretically one could simply not chunk, so let’s start by explaining that option:

No chunking

Also known as: full-document embedding; “don’t chunk”; “each doc is a chunk”, “whole-document embedding”, “single chunk”

What it is: You embed an entire document as one vector and retrieve whole documents (or you only store already-small “documents”, like single sentences, so chunking is unnecessary).

Upsides

  • Simplest pipeline: no boundary bugs, no overlap tuning.
  • Works when “documents” are naturally tiny (e.g., sentence-level corpora).

Downsides

  • Often impossible: embedding models have token limits. Exceeding them can truncate input and lose important context.
  • Whole-document vectors dilute fine-grained facts. Retrieval gets coarse.

Fixed-size chunking

Also known as: fixed-size chunking; token chunking; TokenChunker (Chonkie); split-by-token splitters; character/word splitters; “length-based”; “naive”

What it is: Split every N tokens (or characters/words), optionally with overlap to reduce boundary loss. “Overlap” is when parts of the same piece of information appear in multiple chunks.
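
As a rough illustration, here is a minimal sketch in plain Python (whitespace-split words stand in for real tokens; the chunk_size and overlap defaults are illustrative, not recommendations):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Cut every `chunk_size` tokens, sharing `overlap` tokens between neighboring chunks."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```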

Upsides

  • Fast, deterministic baseline. Great for a first RAG prototype.
  • Often good enough when documents are messy or unreliably structured.

Downsides

  • Can cut mid-sentence or mid-idea. Overlap helps but increases duplication and index size.

Chunk size & overlap

Now that we’ve established the concepts of chunk size and overlap, let’s find their sweet spots:

  • A common rule-of-thumb baseline is 512 tokens with a 50–100 token overlap (i.e., roughly 10–20% overlap).
  • Others suggest testing a range from 128/256 tokens (granular) up to 512/1024 (more context); see the small sweep sketch after this list.
  • A 2025 multi-dataset study reports that “best” fixed chunk size can swing: 64–128 tokens wins on concise fact-style tasks, while 512–1024 helps when broader context is required (and embedding models differ in sensitivity).
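
If you want to test those ranges rather than trust a single rule of thumb, a small parameter sweep is enough. This is only a sketch: build_index and evaluate are placeholders for your own ingestion pipeline and retrieval metric (e.g., recall@k or answer accuracy on a held-out query set).

```python
# Candidate (chunk_size, overlap) pairs covering the ranges discussed above.
CANDIDATES = [(128, 16), (256, 32), (512, 64), (512, 100), (1024, 128)]

def sweep_chunk_params(documents, queries, build_index, evaluate):
    """Return the best-scoring (chunk_size, overlap) pair and all scores."""
    scores = {}
    for chunk_size, overlap in CANDIDATES:
        index = build_index(documents, chunk_size=chunk_size, overlap=overlap)
        scores[(chunk_size, overlap)] = evaluate(index, queries)
    best = max(scores, key=scores.get)
    return best, scores
```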

Important: Chunk size and overlap remain relevant for all of the chunking methods below (though exactly what counts as “overlap” differs between them).


Sliding-window chunking

Also known as: sliding window; “overlap + stride”; Slumber Chunker (Chonkie), “windowed chunking”, “stride-based overlap”

What it is: Fixed-size windows marched across the text (high overlap by design).
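
In code this is just the fixed-size splitter with a stride smaller than the window; a sketch with whitespace “tokens” again (window and stride values are illustrative):

```python
def sliding_window_chunks(text: str, window: int = 256, stride: int = 128) -> list[str]:
    """Windows of `window` tokens advanced by `stride` tokens; stride < window means heavy overlap."""
    tokens = text.split()  # stand-in for a real tokenizer
    if not tokens:
        return []
    # March the window forward until it has covered the end of the token list.
    return [" ".join(tokens[start:start + window])
            for start in range(0, max(len(tokens) - window + stride, 1), stride)]
```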

Upsides

  • Great continuity. Facts near boundaries appear in multiple chunks.

Downsides

  • Lots of near-duplicates. Retrieval gets noisy unless you dedupe or rerank.
  • Still “blind cutting,” just repeated.

Sentence / paragraph chunking

Also known as: sentence splitting; paragraph splitting; Sentence Chunker (Chonkie); NLTK/spaCy sentence segmentation pipelines, “sentence splitter”, “passage splitter”

What it is: Split on sentence/paragraph boundaries. Frequently paired with a max-size cap.
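
A minimal sketch, using a naive regex in place of a real sentence segmenter (NLTK/spaCy in practice) and packing whole sentences up to a word-count cap:

```python
import re

def sentence_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Pack whole sentences into chunks of at most `max_tokens` whitespace-split words."""
    # Naive sentence boundary: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```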

Upsides

  • Natural boundaries. Avoids mid-sentence cuts.

Downsides

  • Sentences are often too small to answer questions. You end up retrieving many fragments (higher top‑k, more prompt stuffing).

Recursive delimiter chunking

Also known as: recursive chunking; RecursiveCharacterTextSplitter (LangChain); Recursive Chunker (Chonkie), “recursive character splitter”, “multi-separator splitting”

What it is: Try higher-level separators first (paragraph → sentence → word…), and only fall back when chunks are still too large.
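
The gist in a sketch, loosely inspired by LangChain’s RecursiveCharacterTextSplitter but written from scratch here, so the separator list and behavior are illustrative:

```python
def recursive_chunks(text: str, max_chars: int = 1000,
                     separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer separators only for oversized pieces."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate                     # keep merging small pieces
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) > max_chars:                 # piece alone is still too big: go one level finer
            chunks.extend(recursive_chunks(piece, max_chars, finer))
        else:
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks
```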

Upsides

  • A solid general-purpose default. More structure-respecting than fixed-size, simpler than semantic/LLM methods.

Downsides

  • Separator lists become a maintenance chore across formats (PDF dumps vs Markdown vs HTML).

Structure-aware chunking

Also known as: document structure-based chunking; header-based; Markdown/HTML splitting; “by_title” / “by_page” / “basic” / “by_similarity” strategies (Unstructured.io); format-aware chunking, “document-based”, “header/title/page-aware”, “element packing”

What it is: Use the document’s native structure (headings, pages, elements) to decide safe boundaries, instead of generic separators.

Format structure chunking (Markdown headings, HTML tags, code blocks)

Split Markdown by headings, HTML by tags, code by functions/classes—so units align with what the author meant.
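
For the Markdown case, a minimal sketch that keeps each heading attached to the text beneath it (real pipelines typically use a dedicated Markdown/HTML-aware splitter):

```python
import re

def markdown_sections(markdown_text: str) -> list[tuple[str | None, str]]:
    """Group a Markdown document into (heading, body) sections, splitting on '#' heading lines."""
    sections, heading, body = [], None, []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):           # a heading line starts a new section
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    return sections
```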

Partition-then-pack (“element-based chunking”)

Instead of splitting raw text, first parse the document into semantic elements (paragraphs, list items, titles, tables…), then pack consecutive elements into chunks up to a max. Only “text-split” when a single element is too large.

You’ll see parameters like:

  • hard max (max_characters)
  • soft max (new_after_n_chars)
  • overlap used when text-splitting oversized elements (not for every boundary)
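
A sketch of the packing step, borrowing the hard-max/soft-max parameter names above purely as illustration (this is not the Unstructured.io API, just the idea):

```python
def pack_elements(elements: list[str], max_characters: int = 1000,
                  new_after_n_chars: int = 800) -> list[str]:
    """Pack consecutive parsed elements (paragraphs, titles, list items...) into chunks:
    never exceed the hard max, and start a new chunk once the soft max has been passed."""
    chunks, current, size = [], [], 0
    for element in elements:
        over_hard_max = size + len(element) > max_characters
        past_soft_max = size >= new_after_n_chars
        if current and (over_hard_max or past_soft_max):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(element)   # a single oversized element would be text-split here instead
        size += len(element)
    if current:
        chunks.append("\n".join(current))
    return chunks
```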

Upsides

  • Fewer nonsense splits. Keeps tables/titles/lists coherent.
  • More “universal” across doc types once parsing is good, because you’re not manually curating separator lists.

Downsides

  • Depends on extraction quality. PDFs are visually structured, so text extraction can be unreliable. Scanned PDFs need OCR.

Semantic similarity chunking

Also known as: semantic chunking; similarity-based chunking; Semantic Chunker (Chonkie); “by_similarity”; “context-aware chunking”

What it is: Use embeddings to detect topic shifts and cut where meaning changes, not where characters hit N. A common recipe: split into sentences → group locally → embed groups → compute semantic distance between neighbors → choose boundaries at big jumps.

Some platforms also use similarity to decide which sequential elements are safe to combine into a chunk.
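
A sketch of that recipe (with a group size of one sentence for simplicity); embed is a placeholder for any sentence-embedding model, and the percentile threshold is one common way to define a “big jump”:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, percentile: float = 90.0) -> list[str]:
    """Cut where the cosine distance between neighboring sentence embeddings jumps.
    `embed(list_of_str) -> np.ndarray of shape (n, d)` is a placeholder for your embedding model."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vectors = embed(sentences)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - np.sum(vectors[:-1] * vectors[1:], axis=1)   # neighbor-to-neighbor distance
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:                   # big semantic jump -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```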

Upsides

  • Typically improves coherence vs delimiter-only methods. Helps retrieval precision.

Downsides

  • More compute (more embeddings).
  • Still assumes the “right unit” is a contiguous slice of text; it just picks smarter cut points.

LLM-based / agentic chunking

Also known as: LLM-based chunking; agentic chunking; “proposition extraction”; “summarize-then-embed chunks”, “propositional chunking”, “LLM decides boundaries”

What it is: An LLM chooses boundaries and/or rewrites text into retrieval-friendly units (propositions, summaries, key points). Agentic variants choose among strategies per-document.
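
A sketch of the proposition-extraction flavor; call_llm is a placeholder for whichever model client you use, and the prompt wording is only one plausible framing:

```python
import json

PROPOSITION_PROMPT = """Rewrite the passage below as a JSON array of short, self-contained propositions.
Each proposition must stand on its own: resolve pronouns and keep names, dates, and conditions explicit.

Passage:
{passage}
"""

def llm_propositions(passage: str, call_llm) -> list[str]:
    """`call_llm(prompt: str) -> str` is a placeholder for your LLM client; each returned
    proposition becomes its own retrieval unit (embedded and indexed separately)."""
    raw = call_llm(PROPOSITION_PROMPT.format(passage=passage))
    return json.loads(raw)
```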

Upsides

  • Can align chunks to what QA actually needs (claims, procedures, requirements).
  • Potentially less “junk text”, better grounding.

Downsides

  • Cost + latency + nondeterminism.
  • You’re trusting an LLM to preprocess your truth.

Neural chunking

Also known as: Neural Chunker (Chonkie), “learned boundary detection”

What it is: A trained model predicts “good boundaries” based on learned coherence patterns.

Upsides

  • Can outperform hand-built heuristics in domains it matches.

Downsides

  • Harder to debug. Can fail silently under domain shift.

Late chunking

Also known as: Late Chunker (Chonkie), “embed first, split second”

What it is: Embed the full document with a long-context embedding model to get token-level embeddings, then pool token embeddings into chunk embeddings after you decide chunk spans. This keeps each chunk’s vector “aware” of surrounding context.
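
The pooling step in a sketch: it assumes you already have token-level embeddings for the whole document (from a long-context embedding model) plus the (start, end) token spans you chose for the chunks.

```python
import numpy as np

def late_chunk_embeddings(token_embeddings: np.ndarray,
                          spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool the already context-aware token embeddings over each chunk's token span.
    `token_embeddings` has shape (num_tokens, dim); returns an array of shape (num_chunks, dim)."""
    return np.stack([token_embeddings[start:end].mean(axis=0) for start, end in spans])
```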

Upsides

  • Directly attacks “context loss” from independently-embedded chunks (pronouns, cross-references, definitions earlier in the doc).

Downsides

  • Requires token-level embedding outputs and long-context embedding infra.

Hierarchical chunking

Also known as: hierarchical chunking; HierarchicalNodeParser (LlamaIndex); parent/child nodes; auto-merging retrieval, “multi-level chunks”, “parent/child”, “multi-granularity”

What it is: Create multiple layers: big chunks for sections/themes, smaller chunks for details. Retrieval can start granular and then “roll up” to parents when needed.

One common default hierarchy (example): 2048 → 512 → 128 token chunk sizes.
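
A sketch of the parent/child bookkeeping using the example sizes above (word-level “tokens” again; real implementations such as LlamaIndex’s HierarchicalNodeParser add proper node objects and metadata):

```python
def hierarchical_chunks(text: str, sizes=(2048, 512, 128)) -> list[dict]:
    """Build multi-level chunks: each level re-splits its parent chunks into smaller pieces,
    and every node remembers its parent so retrieval can 'roll up' from leaves."""
    tokens = text.split()                          # stand-in for a real tokenizer
    root = {"id": 0, "span": (0, len(tokens)), "parent": None, "level": 0}
    nodes, parents = [root], [root]
    for level, size in enumerate(sizes, start=1):
        children = []
        for parent in parents:
            start, end = parent["span"]
            for s in range(start, end, size):
                children.append({"id": len(nodes) + len(children),
                                 "span": (s, min(s + size, end)),
                                 "parent": parent["id"],
                                 "level": level})
        nodes.extend(children)
        parents = children
    return nodes
```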

Upsides

  • Handles both “high-level” and “needle detail” questions without committing to one chunk size.

Downsides

  • More moving parts: extra indexing, metadata, retrieval logic.

Problems with these common chunking strategies

No matter how fancy these strategies get (even LLM-based or late chunking), they’re all still variations of the same primitive approach: choose cut points and slice the text into chunks.

Fixed-size chunking makes that obvious, but the more advanced methods just choose “better” boundaries.

That means you keep running into the same failure modes:

  • Orphaned facts: the line you retrieve often depends on context that lives “just outside the chunk”.
  • Overlap inflation: overlap reduces boundary loss, but bloats storage and retrieval noise.
  • Context dilution: stuffing more chunks into the prompt can degrade model performance (“lost in the middle” / attention dilution).

You can patch this by attaching extra context (e.g., adding a generated “context description” to each chunk before embedding), but that’s still chunking. You’re just stapling a helper note onto a slice.


POMA AI Chunksets: the non-breaking, hierarchical alternative

POMA changes the retrieval unit entirely, rather than just cutting at a different point. Instead of returning a chunk that might start mid-thought, POMA returns a chunkset: a complete, unbreakable root-to-leaf path through the document’s hierarchy.

A chunkset contains leaf sentences you actually care about and the full breadcrumb trail that tells you what those sentences mean in context.

What a chunkset looks like

Traditional chunking can split a section header away from its content:

  • Chunk 1: “…end of paragraph. 3. Health Insurance”
  • Chunk 2: “Employees are eligible for…”
  • Chunk 3: “…enrollment deadline is December 15. 4. Dental Coverage”

A POMA chunkset keeps the “breadcrumbs” attached:

  • Chunkset: Employee Handbook → Benefits → Health Insurance → “Employees are eligible for… enrollment deadline is December 15.”

How POMA builds chunksets (in plain terms)

POMA’s pipeline is (a conceptual sketch follows these steps):

  1. Parse the document into a clean sentence-by-sentence structure.
  2. Identify hierarchy by assigning each sentence a depth in a tree representation—based on both explicit structure (headings/formatting) and implicit “this sentence elaborates that one” structure.
  3. Group sentences into chunksets: complete, unbreakable root-to-leaf paths.
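
To make “root-to-leaf paths” concrete, here is a purely conceptual sketch (not POMA’s actual implementation), assuming each sentence has already been assigned a parent in the tree:

```python
def root_to_leaf_chunksets(sentences: list[str], parent: dict) -> list[list[str]]:
    """Conceptual illustration only: `parent` maps each sentence index to its parent index
    (or None for a root). Emit one chunkset per leaf: the full path root -> ... -> leaf."""
    children = {i: [] for i in range(len(sentences))}
    for i, p in parent.items():
        if p is not None:
            children[p].append(i)
    chunksets = []
    for leaf in (i for i in range(len(sentences)) if not children[i]):
        path, node = [], leaf
        while node is not None:
            path.append(sentences[node])
            node = parent[node]
        chunksets.append(list(reversed(path)))    # root -> ... -> leaf
    return chunksets
```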

As a result, retrieved text never arrives without its lineage (section → subsection → procedure → requirement), so the model doesn’t have to guess what a retrieved sentence was “under.”

Cheatsheets: query-time compression without losing lineage

At query time, POMA assembles the relevant chunksets and compiles them into a per-document “cheatsheet”: a single, deduplicated, structured block of text optimized for LLM consumption.

As a result, the LLM requires far fewer tokens to produce a far more relevant and accurate output.

To illustrate with a (very) niche example: a legal-document query about Andorra’s personalized license-plate law (a notoriously tough document for RAG pipelines) needed 1,542 tokens of retrieved context with traditional RAG, versus 337 tokens with POMA (a roughly 80% reduction), with zero information loss.


Comparison table: All chunking strategies compared

| Complexity | Chunker strategy | Names you’ll see (mapping across sources) | How boundaries are chosen | Retrieval unit | Typical knobs (most common) | Advantages | Disadvantages / common failure modes | Good fit |
|---|---|---|---|---|---|---|---|---|
| 0 | No chunking | “No chunking”, “whole document”, “document-as-unit” | Don’t split; store each record/document as-is | Whole doc / record | Mostly retrieval knobs (top‑k, filters) | Simplest; preserves full context; minimal ingest logic | Coarse embeddings; “lost-in-the-middle” risk; higher latency/cost if you stuff whole docs into prompts (even if they fit) | FAQs, short tickets, short posts, already atomic content |
| 1 | Fixed-size (tokens/chars) | Weaviate “Fixed-Size (Token) Chunking”; Pinecone “Fixed-size chunking”; Chonkie TokenChunker | Cut every N tokens (or chars), regardless of meaning | Chunk | chunk_size, overlap / chunk_overlap | Fast, cheap, deterministic baseline; easy to A/B test | Breaks sentences/ideas; can “orphan” facts at boundaries | Quick baseline; messy text; speed-first ingestion |
| 2 | Sliding window (heavy overlap) | “Sliding window”, “windowed chunking”, “overlap chunking”; Chonkie API “Slumber Chunker” page describes sliding window; Unstructured overlap / overlap_all | Like fixed-size, but step size < chunk size so chunks heavily overlap | Chunk | overlap ratio; step size; (Unstructured: overlap chars + overlap_all) | Reduces boundary loss without “smarter” semantics; still simple | Redundant index + higher embedding/storage cost; can inflate retrieval duplicates | When boundary misses hurt, but you still need deterministic/cheap |
| 3 | Sentence-based | Pinecone “Sentence/Paragraph splitting”; Chonkie SentenceChunker | Split on sentence boundaries (then pack sentences until size limit) | Chunk (sentence-preserving) | tokenizer, max tokens, overlap, min sentences, delimiter set | Cleaner “thought units”; fewer mid-sentence breaks | Sentence length varies; single sentences can be too context-poor | QA + summarization where sentence integrity matters |
| 4 | Recursive delimiter chunking | Weaviate “Recursive chunking”; Pinecone “RecursiveCharacterTextSplitter” style; Chonkie RecursiveChunker | Try paragraph breaks first, then sentence breaks, then words, etc., recursing until under limit | Chunk | separators/rules list, max size, min chunk size | Solid “default”; respects common structure more than fixed-size | Still heuristic; weird formatting can defeat it; still “breaks text” | Articles, blog posts, papers, reports |
| 5 | Document-structure aware (sections/pages/elements) | Weaviate “Document-Based chunking” (split by Markdown/HTML/PDF structure, code blocks); Pinecone “Document structure-based chunking”; Unstructured element-based “smart chunking” + by_title / by_page / basic | Parse format → split on headers/tags/pages, or partition into document elements then pack elements into chunks while preserving section/page boundaries | Chunk (structure-aligned) | Unstructured: max_characters, new_after_n_chars, combine_text_under_n_chars, multipage_sections | Better topical coherence; less “random tearing”; robust across file types if partitioning works | Depends on extraction quality (PDF/OCR); mis-detected titles/sections can create tiny or wrong chunks | PDFs/HTML/Markdown/manuals/policies where structure matters |
| 6 | Table-aware | Unstructured Table / TableChunk behavior; Chonkie TableChunker | Identify tables and keep them isolated; split oversized tables as special table-chunks | Table chunk (special type) | size limit; table serialization (text vs HTML), etc. | Prevents mangling rows/columns; keeps table semantics intact | Large tables can dominate retrieval and prompt budget; sometimes needs summarization or cell-level indexing | Financials, specs, legal tables, evidence tables |
| 7 | Code-aware | Weaviate notes code splitting by functions/classes; Chonkie CodeChunker uses ASTs | Parse code syntax and split by logical code blocks (functions/classes/modules) | Code chunk | language detection, chunk size, preserve docstrings/import context | Preserves runnable/semantic units; improves retrieval for code QA | Cross-file dependencies and “global context” still tricky; large functions still need splitting | Repos, SDK docs, notebooks, code-heavy KBs |
| 8 | Semantic similarity chunking (meaning-based) | Weaviate “Semantic chunking (Context-Aware)”; Unstructured by_similarity (packs similar sequential elements); Chonkie SemanticChunker | Embed sentences/elements → measure similarity → choose breakpoints (or merge only if similarity above threshold) | Chunk (topic-coherent) | embedding model, similarity threshold, window size, max size; Unstructured: similarity_threshold | Better topical purity; fewer “topic salad” chunks | More compute at ingest; brittle if topics drift gradually; may fail when important context is far away | Dense legal/academic/technical narrative text |
| 9 | Neural boundary detection | Chonkie “Neural Chunker” | A learned model predicts “good” boundaries from patterns of semantic coherence | Chunk | model choice; min chunk size | Less hand-tuning; can outperform naive separators when formatting is messy | Opaque decisions; model/domain mismatch can create odd splits; harder to debug | Mixed corpora; when heuristics underperform |
| 10 | LLM-based chunking | Weaviate “LLM-Based Chunking” | Ask an LLM to propose boundaries; may also create propositions/summaries/extra context | Chunk (often enriched) | prompts, target size, cost/latency budget | High semantic quality on complex text; can align chunks to task (QA/compliance) | Slow + expensive; nondeterministic; risk of “summary drift” if you store rewrites | High-value docs where quality > cost |
| 11 | Agentic chunking (strategy-selection) | Weaviate “Agentic chunking”; naming collision: the Chonkie changelog calls SlumberChunker “agentic” but the API doc describes it as a sliding window | An “agent” inspects each document and chooses a strategy (or mix), possibly adding metadata tags | Chunk (often enriched + tagged) | model choice, allowed tools/strategies, budget | Flexibility across wildly different docs; can tailor to structure/density | Highest complexity; expensive; can be inconsistent across runs; needs monitoring | Regulatory/compliance corpora, multi-format enterprise KBs |
| 12 | Adaptive chunking (parameter-tuning) | Weaviate “Adaptive chunking” | Keep one approach, but dynamically adjust size/overlap depending on content density/structure | Chunk | rules/ML for “density”, dynamic size/overlap ranges | Avoids one-size-fits-all across a single long doc; better balance of precision vs context | Hard to predict index stats; more moving parts; can be tough to reproduce exactly | Long docs with mixed sections (dense + sparse) |
| 13 | Late chunking (embed first, split later) | Weaviate “Late chunking”; Chonkie LateChunker | Embed whole doc with full context → then split → derive each chunk embedding from the already-contextual token embeddings (e.g., average) | Chunk (but with document-aware embeddings) | long-context embedding model; split rules; pooling method | Reduces “context loss” from isolated chunk embeddings; improves cross-reference understanding | Needs long-context embedding; more compute; still returns cut text (generation can still see fragments) | Technical/legal docs with lots of cross-references |
| 14 | Hierarchical chunking (multi-level) | Weaviate “Hierarchical chunking” | Create multiple layers: big section-level chunks, then smaller sub-chunks; retrieve coarse → fine | Multi-level chunks (parent/child) | levels, sizes per level, parent/child retrieval/merge logic | Answers both broad and specific questions; good “bridge” from structure to precision | More indexing + retrieval logic; still fundamentally breaks text | Textbooks, handbooks, manuals, long contracts |
| 15 | POMA Chunksets + Cheatsheets (non-breaking unit) | “Chunksets”, “hierarchical chunksets”, “breadcrumb context”, “cheatsheets” | Parse sentence-by-sentence → infer the document’s hierarchy (explicit + implicit) → group sentences into unbreakable root-to-leaf paths | Chunkset (path), later compiled into a per-doc “cheatsheet” | Minimal “knobs” (relies on structure inference) | Retrieved facts always arrive with their contextual lineage; optimally digestible context; often much smaller total context | Slower due to higher complexity; not feasible without integrated ingestion; requires a different retrieval mental model (covered by SDK) | High-stakes grounding (policies, legal, compliance, manuals) |