
The Ultimate Guide to RAG Chunking Strategies & Text Splitters

How to find the right chunking strategy, chunk size and chunk overlap

Retrieval-augmented generation (RAG) has been used as a way to turbocharge large language models (LLMs) since the early 2020s. By allowing LLMs to draw on information from sources not included in their training data, RAG solves the problems inherent in a static knowledge base.

But just as you can’t instantly absorb all the information in a book by glancing at it, RAG can’t magically transfer all the relevant information from Outside Source X into an LLM pipeline. The solution is called “chunking.”

Chunking is splitting large text into smaller units (“chunks”) so (1) embedding models don’t truncate your input, and (2) retrieval returns self-contained pieces that are actually useful for search and answering.

The sweet spot is chunks small enough for precise retrieval, but complete enough to read sensibly on their own. What confuses a human also confuses the model. Producing perfectly-sized, perfectly-composed chunks is a puzzle that has stumped researchers for years.

However, the Great Chunking Conundrum might have just been solved.


Chunking strategies: which one is the best?

Ever since RAG was first developed, researchers have experimented with a wide range of chunking strategies. Predictably, they’ve also argued over which strategy works best. Some approaches have been simplistic (e.g. character count), others have been more advanced (e.g. modality-specific), and all have come with both advantages and downsides.

And theoretically one could simply not chunk, so let’s start by explaining that option:

No chunking

Also known as: full-document embedding; “don’t chunk”; “each doc is a chunk”, “whole-document embedding”, “single chunk”

What it is: You embed an entire document as one vector and retrieve whole documents (or you only store already-small “documents”, like single sentences, so chunking is unnecessary).

Upsides

  • Simplest pipeline: no boundary bugs, no overlap tuning.
  • Works when “documents” are naturally tiny (e.g., sentence-level corpora).

Downsides

  • Often impossible: embedding models have token limits. Exceeding them can truncate input and lose important context.
  • Whole-document vectors dilute fine-grained facts. Retrieval gets coarse.

Fixed-size chunking

Also known as: fixed-size chunking; token chunking; TokenChunker (Chonkie); split-by-token splitters; character/word splitters; “length-based”; “naive”

What it is: Split every N tokens (or characters/words), optionally with overlap to reduce boundary loss. “Overlap” is when parts of the same piece of information appear in multiple chunks.
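
As a rough illustration, here is a minimal sketch in plain Python (whitespace-split words stand in for real tokens; the chunk_size and overlap defaults are illustrative, not recommendations):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Cut every `chunk_size` tokens, sharing `overlap` tokens between neighboring chunks."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```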

Upsides

  • Fast, deterministic baseline. Great for a first RAG prototype.
  • Often good enough when documents are messy or unreliably structured.

Downsides

  • Can cut mid-sentence or mid-idea. Overlap helps but increases duplication and index size.

Chunk size & overlap

Now that we’ve established the concepts of chunk size and overlap, let’s find their sweet spots:

  • A common rule-of-thumb baseline is 512 tokens with a 50–100 token overlap (i.e., roughly 10–20% overlap).
  • Others suggest testing a range from 128/256 tokens (granular) up to 512/1024 (more context); see the small sweep sketch after this list.
  • A 2025 multi-dataset study reports that “best” fixed chunk size can swing: 64–128 tokens wins on concise fact-style tasks, while 512–1024 helps when broader context is required (and embedding models differ in sensitivity).
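
If you want to test those ranges rather than trust a single rule of thumb, a small parameter sweep is enough. This is only a sketch: build_index and evaluate are placeholders for your own ingestion pipeline and retrieval metric (e.g., recall@k or answer accuracy on a held-out query set).

```python
# Candidate (chunk_size, overlap) pairs covering the ranges discussed above.
CANDIDATES = [(128, 16), (256, 32), (512, 64), (512, 100), (1024, 128)]

def sweep_chunk_params(documents, queries, build_index, evaluate):
    """Return the best-scoring (chunk_size, overlap) pair and all scores."""
    scores = {}
    for chunk_size, overlap in CANDIDATES:
        index = build_index(documents, chunk_size=chunk_size, overlap=overlap)
        scores[(chunk_size, overlap)] = evaluate(index, queries)
    best = max(scores, key=scores.get)
    return best, scores
```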

Important: Chunk size and overlap remain relevant for all of the chunking methods below (though exactly what counts as “overlap” differs between them).


Sliding-window chunking

Also known as: sliding window; “overlap + stride”; Slumber Chunker (Chonkie), “windowed chunking”, “stride-based overlap”

What it is: Fixed-size windows marched across the text (high overlap by design).
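
In code this is just the fixed-size splitter with a stride smaller than the window; a sketch with whitespace “tokens” again (window and stride values are illustrative):

```python
def sliding_window_chunks(text: str, window: int = 256, stride: int = 128) -> list[str]:
    """Windows of `window` tokens advanced by `stride` tokens; stride < window means heavy overlap."""
    tokens = text.split()  # stand-in for a real tokenizer
    if not tokens:
        return []
    # March the window forward until it has covered the end of the token list.
    return [" ".join(tokens[start:start + window])
            for start in range(0, max(len(tokens) - window + stride, 1), stride)]
```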

Upsides

  • Great continuity. Facts near boundaries appear in multiple chunks.

Downsides

  • Lots of near-duplicates. Retrieval gets noisy unless you dedupe or rerank.
  • Still “blind cutting,” just repeated.

Sentence / paragraph chunking

Also known as: sentence splitting; paragraph splitting; Sentence Chunker (Chonkie); NLTK/spaCy sentence segmentation pipelines, “sentence splitter”, “passage splitter”

What it is: Split on sentence/paragraph boundaries. Frequently paired with a max-size cap.
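
A minimal sketch, using a naive regex in place of a real sentence segmenter (NLTK/spaCy in practice) and packing whole sentences up to a word-count cap:

```python
import re

def sentence_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Pack whole sentences into chunks of at most `max_tokens` whitespace-split words."""
    # Naive sentence boundary: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```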

Upsides

  • Natural boundaries. Avoids mid-sentence cuts.

Downsides

  • Sentences are often too small to answer questions. You end up retrieving many fragments (higher top‑k, more prompt stuffing).

Recursive delimiter chunking

Also known as: recursive chunking; RecursiveCharacterTextSplitter (LangChain); Recursive Chunker (Chonkie), “recursive character splitter”, “multi-separator splitting”

What it is: Try higher-level separators first (paragraph → sentence → word…), and only fall back when chunks are still too large.
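
The gist in a sketch, loosely inspired by LangChain’s RecursiveCharacterTextSplitter but written from scratch here, so the separator list and behavior are illustrative:

```python
def recursive_chunks(text: str, max_chars: int = 1000,
                     separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer separators only for oversized pieces."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate                     # keep merging small pieces
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) > max_chars:                 # piece alone is still too big: go one level finer
            chunks.extend(recursive_chunks(piece, max_chars, finer))
        else:
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks
```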

Upsides

  • A solid general-purpose default. More structure-respecting than fixed-size, simpler than semantic/LLM methods.

Downsides

  • Separator lists become a maintenance chore across formats (PDF dumps vs Markdown vs HTML).

Structure-aware chunking

Also known as: document structure-based chunking; header-based; Markdown/HTML splitting; “by_title” / “by_page” / “basic” / “by_similarity” strategies (Unstructured.io); format-aware chunking, “document-based”, “header/title/page-aware”, “element packing”

What it is: Use the document’s native structure (headings, pages, elements) to decide safe boundaries, instead of generic separators.

Format structure chunking (Markdown headings, HTML tags, code blocks)

Split Markdown by headings, HTML by tags, code by functions/classes—so units align with what the author meant.
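
For the Markdown case, a minimal sketch that keeps each heading attached to the text beneath it (real pipelines typically use a dedicated Markdown/HTML-aware splitter):

```python
import re

def markdown_sections(markdown_text: str) -> list[tuple[str | None, str]]:
    """Group a Markdown document into (heading, body) sections, splitting on '#' heading lines."""
    sections, heading, body = [], None, []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):           # a heading line starts a new section
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    return sections
```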

Partition-then-pack (“element-based chunking”)

Instead of splitting raw text, first parse the document into semantic elements (paragraphs, list items, titles, tables…), then pack consecutive elements into chunks up to a max. Only “text-split” when a single element is too large.

You’ll see parameters like:

  • hard max (max_characters)
  • soft max (new_after_n_chars)
  • overlap used when text-splitting oversized elements (not for every boundary)
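
A sketch of the packing step, borrowing the hard-max/soft-max parameter names above purely as illustration (this is not the Unstructured.io API, just the idea):

```python
def pack_elements(elements: list[str], max_characters: int = 1000,
                  new_after_n_chars: int = 800) -> list[str]:
    """Pack consecutive parsed elements (paragraphs, titles, list items...) into chunks:
    never exceed the hard max, and start a new chunk once the soft max has been passed."""
    chunks, current, size = [], [], 0
    for element in elements:
        over_hard_max = size + len(element) > max_characters
        past_soft_max = size >= new_after_n_chars
        if current and (over_hard_max or past_soft_max):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(element)   # a single oversized element would be text-split here instead
        size += len(element)
    if current:
        chunks.append("\n".join(current))
    return chunks
```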

Upsides

  • Fewer nonsense splits. Keeps tables/titles/lists coherent.
  • More “universal” across doc types once parsing is good, because you’re not manually curating separator lists.

Downsides

  • Depends on extraction quality. PDFs are visually structured, so text extraction can be unreliable. Scanned PDFs need OCR.

Semantic similarity chunking

Also known as: semantic chunking; similarity-based chunking; Semantic Chunker (Chonkie); “by_similarity”; “context-aware chunking”

What it is: Use embeddings to detect topic shifts and cut where meaning changes, not where characters hit N. A common recipe: split into sentences → group locally → embed groups → compute semantic distance between neighbors → choose boundaries at big jumps.

Some platforms also use similarity to decide which sequential elements are safe to combine into a chunk.
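
A sketch of that recipe (with a group size of one sentence for simplicity); embed is a placeholder for any sentence-embedding model, and the percentile threshold is one common way to define a “big jump”:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, percentile: float = 90.0) -> list[str]:
    """Cut where the cosine distance between neighboring sentence embeddings jumps.
    `embed(list_of_str) -> np.ndarray of shape (n, d)` is a placeholder for your embedding model."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vectors = embed(sentences)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - np.sum(vectors[:-1] * vectors[1:], axis=1)   # neighbor-to-neighbor distance
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:                   # big semantic jump -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```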

Upsides

  • Typically improves coherence vs delimiter-only methods. Helps retrieval precision.

Downsides

  • More compute (more embeddings).
  • Still assumes the “right unit” is a contiguous slice of text; it just picks smarter cut points.

LLM-based / agentic chunking

Also known as: LLM-based chunking; agentic chunking; “proposition extraction”; “summarize-then-embed chunks”, “propositional chunking”, “LLM decides boundaries”

What it is: An LLM chooses boundaries and/or rewrites text into retrieval-friendly units (propositions, summaries, key points). Agentic variants choose among strategies per-document.
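
A sketch of the proposition-extraction flavor; call_llm is a placeholder for whichever model client you use, and the prompt wording is only one plausible framing:

```python
import json

PROPOSITION_PROMPT = """Rewrite the passage below as a JSON array of short, self-contained propositions.
Each proposition must stand on its own: resolve pronouns and keep names, dates, and conditions explicit.

Passage:
{passage}
"""

def llm_propositions(passage: str, call_llm) -> list[str]:
    """`call_llm(prompt: str) -> str` is a placeholder for your LLM client; each returned
    proposition becomes its own retrieval unit (embedded and indexed separately)."""
    raw = call_llm(PROPOSITION_PROMPT.format(passage=passage))
    return json.loads(raw)
```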

Upsides

  • Can align chunks to what QA actually needs (claims, procedures, requirements).
  • Potentially less “junk text”, better grounding.

Downsides

  • Cost + latency + nondeterminism.
  • You’re trusting an LLM to preprocess your truth.

Neural chunking

Also known as: Neural Chunker (Chonkie), “learned boundary detection”

What it is: A trained model predicts “good boundaries” based on learned coherence patterns.

Upsides

  • Can outperform hand-built heuristics in domains it matches.

Downsides

  • Harder to debug. Can fail silently under domain shift.

Late chunking

Also known as: Late Chunker (Chonkie), “embed first, split second”

What it is: Embed the full document with a long-context embedding model to get token-level embeddings, then pool token embeddings into chunk embeddings after you decide chunk spans. This keeps each chunk’s vector “aware” of surrounding context.
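
The pooling step in a sketch: it assumes you already have token-level embeddings for the whole document (from a long-context embedding model) plus the (start, end) token spans you chose for the chunks.

```python
import numpy as np

def late_chunk_embeddings(token_embeddings: np.ndarray,
                          spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool the already context-aware token embeddings over each chunk's token span.
    `token_embeddings` has shape (num_tokens, dim); returns an array of shape (num_chunks, dim)."""
    return np.stack([token_embeddings[start:end].mean(axis=0) for start, end in spans])
```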

Upsides

  • Directly attacks “context loss” from independently-embedded chunks (pronouns, cross-references, definitions earlier in the doc).

Downsides

  • Requires token-level embedding outputs and long-context embedding infra.

Hierarchical chunking

Also known as: hierarchical chunking; HierarchicalNodeParser (LlamaIndex); parent/child nodes; auto-merging retrieval, “multi-level chunks”, “parent/child”, “multi-granularity”

What it is: Create multiple layers: big chunks for sections/themes, smaller chunks for details. Retrieval can start granular and then “roll up” to parents when needed.

One common default hierarchy (example): 2048 → 512 → 128 token chunk sizes.
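
A sketch of the parent/child bookkeeping using the example sizes above (word-level “tokens” again; real implementations such as LlamaIndex’s HierarchicalNodeParser add proper node objects and metadata):

```python
def hierarchical_chunks(text: str, sizes=(2048, 512, 128)) -> list[dict]:
    """Build multi-level chunks: each level re-splits its parent chunks into smaller pieces,
    and every node remembers its parent so retrieval can 'roll up' from leaves."""
    tokens = text.split()                          # stand-in for a real tokenizer
    root = {"id": 0, "span": (0, len(tokens)), "parent": None, "level": 0}
    nodes, parents = [root], [root]
    for level, size in enumerate(sizes, start=1):
        children = []
        for parent in parents:
            start, end = parent["span"]
            for s in range(start, end, size):
                children.append({"id": len(nodes) + len(children),
                                 "span": (s, min(s + size, end)),
                                 "parent": parent["id"],
                                 "level": level})
        nodes.extend(children)
        parents = children
    return nodes
```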

Upsides

  • Handles both “high-level” and “needle detail” questions without committing to one chunk size.

Downsides

  • More moving parts: extra indexing, metadata, retrieval logic.

Problems with these common chunking strategies

No matter how fancy these strategies get (even LLM-based or late chunking), they’re all still variations of the same primitive approach: choose cut points and slice the text into chunks.

Fixed-size chunking makes that obvious, but the more advanced methods just choose “better” boundaries.

That means you keep running into the same failure modes:

  • Orphaned facts: the line you retrieve often depends on context that lives “just outside the chunk”.
  • Overlap inflation: overlap reduces boundary loss, but bloats storage and retrieval noise.
  • Context dilution: stuffing more chunks into the prompt can degrade model performance (“lost in the middle” / attention dilution).

You can patch this by attaching extra context (e.g., adding a generated “context description” to each chunk before embedding), but that’s still chunking. You’re just stapling a helper note onto a slice.


POMA AI Chunksets: the non-breaking, hierarchical alternative

POMA changes the retrieval unit entirely, rather than just cutting at a different point. Instead of returning a chunk that might start mid-thought, POMA returns a chunkset: a complete, unbreakable root-to-leaf path through the document’s hierarchy.

A chunkset contains leaf sentences you actually care about and the full breadcrumb trail that tells you what those sentences mean in context.

What a chunkset looks like

Traditional chunking can split a section header away from its content:

  • Chunk 1: “…end of paragraph. 3. Health Insurance”
  • Chunk 2: “Employees are eligible for…”
  • Chunk 3: “…enrollment deadline is December 15. 4. Dental Coverage”

A POMA chunkset keeps the “breadcrumbs” attached:

  • Chunkset: Employee Handbook → Benefits → Health Insurance → “Employees are eligible for… enrollment deadline is December 15.”

How POMA builds chunksets (in plain terms)

POMA’s pipeline is (a conceptual sketch follows these steps):

  1. Parse the document into a clean sentence-by-sentence structure.
  2. Identify hierarchy by assigning each sentence a depth in a tree representation—based on both explicit structure (headings/formatting) and implicit “this sentence elaborates that one” structure.
  3. Group sentences into chunksets: complete, unbreakable root-to-leaf paths.
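
To make “root-to-leaf paths” concrete, here is a purely conceptual sketch (not POMA’s actual implementation), assuming each sentence has already been assigned a parent in the tree:

```python
def root_to_leaf_chunksets(sentences: list[str], parent: dict) -> list[list[str]]:
    """Conceptual illustration only: `parent` maps each sentence index to its parent index
    (or None for a root). Emit one chunkset per leaf: the full path root -> ... -> leaf."""
    children = {i: [] for i in range(len(sentences))}
    for i, p in parent.items():
        if p is not None:
            children[p].append(i)
    chunksets = []
    for leaf in (i for i in range(len(sentences)) if not children[i]):
        path, node = [], leaf
        while node is not None:
            path.append(sentences[node])
            node = parent[node]
        chunksets.append(list(reversed(path)))    # root -> ... -> leaf
    return chunksets
```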

As a result, retrieved text never arrives without its lineage (section → subsection → procedure → requirement), so the model doesn’t have to guess what a retrieved sentence was “under.”

Cheatsheets: query-time compression without losing lineage

At query time, POMA assembles the relevant chunksets and compiles them into a per-document “cheatsheet”: a single, deduplicated, structured block of text optimized for LLM consumption.

As a result, the LLM requires far fewer tokens to produce a far more relevant and accurate output.

To illustrate with a (very) niche example: a legal-document query about Andorra’s personalized license-plate law (a notoriously tough document for RAG pipelines) needed 1,542 tokens of retrieved context with traditional RAG, versus 337 tokens with POMA (a roughly 80% reduction), with zero information loss.


Comparison table: All chunking strategies compared

| Complexity | Chunker strategy | Names you’ll see (mapping across sources) | How boundaries are chosen | Retrieval unit | Typical knobs (most common) | Advantages | Disadvantages / common failure modes | Good fit |
|---|---|---|---|---|---|---|---|---|
| 0 | No chunking | “No chunking”, “whole document”, “document-as-unit” | Don’t split; store each record/document as-is | Whole doc / record | Mostly retrieval knobs (top‑k, filters) | Simplest; preserves full context; minimal ingest logic | Coarse embeddings; “lost-in-the-middle” risk; higher latency/cost if you stuff whole docs into prompts (even if they fit) | FAQs, short tickets, short posts, already atomic content |
| 1 | Fixed-size (tokens/chars) | Weaviate “Fixed-Size (Token) Chunking”; Pinecone “Fixed-size chunking”; Chonkie TokenChunker | Cut every N tokens (or chars), regardless of meaning | Chunk | chunk_size, overlap / chunk_overlap | Fast, cheap, deterministic baseline; easy to A/B test | Breaks sentences/ideas; can “orphan” facts at boundaries | Quick baseline; messy text; speed-first ingestion |
| 2 | Sliding window (heavy overlap) | “Sliding window”, “windowed chunking”, “overlap chunking”; Chonkie API “Slumber Chunker” page describes sliding window; Unstructured overlap / overlap_all | Like fixed-size, but step size < chunk size so chunks heavily overlap | Chunk | overlap ratio; step size; (Unstructured: overlap chars + overlap_all) | Reduces boundary loss without “smarter” semantics; still simple | Redundant index + higher embedding/storage cost; can inflate retrieval duplicates | When boundary misses hurt, but you still need deterministic/cheap |
| 3 | Sentence-based | Pinecone “Sentence/Paragraph splitting”; Chonkie SentenceChunker | Split on sentence boundaries (then pack sentences until size limit) | Chunk (sentence-preserving) | tokenizer, max tokens, overlap, min sentences, delimiter set | Cleaner “thought units”; fewer mid-sentence breaks | Sentence length varies; single sentences can be too context-poor | QA + summarization where sentence integrity matters |
| 4 | Recursive delimiter chunking | Weaviate “Recursive chunking”; Pinecone “RecursiveCharacterTextSplitter” style; Chonkie RecursiveChunker | Try paragraph breaks first, then sentence breaks, then words, etc., recursing until under limit | Chunk | separators/rules list, max size, min chunk size | Solid “default”; respects common structure more than fixed-size | Still heuristic; weird formatting can defeat it; still “breaks text” | Articles, blog posts, papers, reports |
| 5 | Document-structure aware (sections/pages/elements) | Weaviate “Document-Based chunking” (split by Markdown/HTML/PDF structure, code blocks); Pinecone “Document structure-based chunking”; Unstructured element-based “smart chunking” + by_title / by_page / basic | Parse format → split on headers/tags/pages, or partition into document elements then pack elements into chunks while preserving section/page boundaries | Chunk (structure-aligned) | Unstructured: max_characters, new_after_n_chars, combine_text_under_n_chars, multipage_sections | Better topical coherence; less “random tearing”; robust across file types if partitioning works | Depends on extraction quality (PDF/OCR); mis-detected titles/sections can create tiny or wrong chunks | PDFs/HTML/Markdown/manuals/policies where structure matters |
| 6 | Table-aware | Unstructured Table / TableChunk behavior; Chonkie TableChunker | Identify tables and keep them isolated; split oversized tables as special table-chunks | Table chunk (special type) | size limit; table serialization (text vs HTML), etc. | Prevents mangling rows/columns; keeps table semantics intact | Large tables can dominate retrieval and prompt budget; sometimes needs summarization or cell-level indexing | Financials, specs, legal tables, evidence tables |
| 7 | Code-aware | Weaviate notes code splitting by functions/classes; Chonkie CodeChunker uses ASTs | Parse code syntax and split by logical code blocks (functions/classes/modules) | Code chunk | language detection, chunk size, preserve docstrings/import context | Preserves runnable/semantic units; improves retrieval for code QA | Cross-file dependencies and “global context” still tricky; large functions still need splitting | Repos, SDK docs, notebooks, code-heavy KBs |
| 8 | Semantic similarity chunking (meaning-based) | Weaviate “Semantic chunking (Context-Aware)”; Unstructured by_similarity (packs similar sequential elements); Chonkie SemanticChunker | Embed sentences/elements → measure similarity → choose breakpoints (or merge only if similarity above threshold) | Chunk (topic-coherent) | embedding model, similarity threshold, window size, max size; Unstructured: similarity_threshold | Better topical purity; fewer “topic salad” chunks | More compute at ingest; brittle if topics drift gradually; may fail when important context is far away | Dense legal/academic/technical narrative text |
| 9 | Neural boundary detection | Chonkie “Neural Chunker” | A learned model predicts “good” boundaries from patterns of semantic coherence | Chunk | model choice; min chunk size | Less hand-tuning; can outperform naive separators when formatting is messy | Opaque decisions; model/domain mismatch can create odd splits; harder to debug | Mixed corpora; when heuristics underperform |
| 10 | LLM-based chunking | Weaviate “LLM-Based Chunking” | Ask an LLM to propose boundaries; may also create propositions/summaries/extra context | Chunk (often enriched) | prompts, target size, cost/latency budget | High semantic quality on complex text; can align chunks to task (QA/compliance) | Slow + expensive; nondeterministic; risk of “summary drift” if you store rewrites | High-value docs where quality > cost |
| 11 | Agentic chunking (strategy-selection) | Weaviate “Agentic chunking”; naming collision: the Chonkie changelog calls SlumberChunker “agentic” but the API doc describes it as a sliding window | An “agent” inspects each document and chooses a strategy (or mix), possibly adding metadata tags | Chunk (often enriched + tagged) | model choice, allowed tools/strategies, budget | Flexibility across wildly different docs; can tailor to structure/density | Highest complexity; expensive; can be inconsistent across runs; needs monitoring | Regulatory/compliance corpora, multi-format enterprise KBs |
| 12 | Adaptive chunking (parameter-tuning) | Weaviate “Adaptive chunking” | Keep one approach, but dynamically adjust size/overlap depending on content density/structure | Chunk | rules/ML for “density”, dynamic size/overlap ranges | Avoids one-size-fits-all across a single long doc; better balance of precision vs context | Hard to predict index stats; more moving parts; can be tough to reproduce exactly | Long docs with mixed sections (dense + sparse) |
| 13 | Late chunking (embed first, split later) | Weaviate “Late chunking”; Chonkie LateChunker | Embed whole doc with full context → then split → derive each chunk embedding from the already-contextual token embeddings (e.g., average) | Chunk (but with document-aware embeddings) | long-context embedding model; split rules; pooling method | Reduces “context loss” from isolated chunk embeddings; improves cross-reference understanding | Needs long-context embedding; more compute; still returns cut text (generation can still see fragments) | Technical/legal docs with lots of cross-references |
| 14 | Hierarchical chunking (multi-level) | Weaviate “Hierarchical chunking” | Create multiple layers: big section-level chunks, then smaller sub-chunks; retrieve coarse → fine | Multi-level chunks (parent/child) | levels, sizes per level, parent/child retrieval/merge logic | Answers both broad and specific questions; good “bridge” from structure to precision | More indexing + retrieval logic; still fundamentally breaks text | Textbooks, handbooks, manuals, long contracts |
| 15 | POMA Chunksets + Cheatsheets (non-breaking unit) | “Chunksets”, “hierarchical chunksets”, “breadcrumb context”, “cheatsheets” | Parse sentence-by-sentence → infer the document’s hierarchy (explicit + implicit) → group sentences into unbreakable root-to-leaf paths | Chunkset (path), later compiled into a per-doc “cheatsheet” | Minimal “knobs” (relies on structure inference) | Retrieved facts always arrive with their contextual lineage; optimally digestible context; often much smaller total context | Slower due to higher complexity; not feasible without integrated ingestion; requires a different retrieval mental model (covered by SDK) | High-stakes grounding (policies, legal, compliance, manuals) |