# Chunking strategy comparison
This table maps the chunker names you will see across different vendors and sources, shows how boundaries are chosen, and summarizes the most common tradeoffs.
| Complexity | Chunker strategy | Names you'll see | How boundaries are chosen | Retrieval unit | Typical knobs | Advantages | Disadvantages / common failure modes | Good fit |
|---|---|---|---|---|---|---|---|---|
| 0 | No chunking | "No chunking", "whole document", "document-as-unit" | Don't split; store each record or document as-is | Whole doc or record | Mostly retrieval knobs such as top-k and filters | Simplest; preserves full context; minimal ingest logic | Coarse embeddings; lost-in-the-middle risk; higher latency or cost if you stuff whole docs into prompts | FAQs, short tickets, short posts, already atomic content |
| 1 | Fixed-size (tokens or chars) | Weaviate fixed-size chunking, Pinecone fixed-size chunking, Chonkie TokenChunker | Cut every N tokens or characters regardless of meaning | Chunk | chunk_size, overlap, chunk_overlap | Fast, cheap, deterministic baseline; easy to A/B test | Breaks sentences and ideas; can orphan facts at boundaries | Quick baseline; messy text; speed-first ingestion |
| 2 | Sliding window | Sliding window, overlap chunking, windowed chunking | Like fixed-size, but the step size is smaller than the chunk size, so consecutive chunks overlap heavily | Chunk | overlap ratio, step size | Reduces boundary loss without smarter semantics | Redundant index; higher embedding and storage cost; duplicate-heavy retrieval | When boundary misses hurt but you still need a deterministic path |
| 3 | Sentence-based | Sentence splitting, paragraph splitting, Chonkie SentenceChunker | Split on sentence boundaries and then pack to a size limit | Chunk | tokenizer, max tokens, overlap, min sentences | Cleaner thought units; fewer mid-sentence breaks | Sentence length varies; a single sentence can still be too context-poor | QA and summarization where sentence integrity matters |
| 4 | Recursive delimiter chunking | Recursive chunking, RecursiveCharacterTextSplitter, Chonkie RecursiveChunker | Try paragraph breaks first, then sentence breaks, then words until the chunk fits | Chunk | separator list, max size, min chunk size | Solid default; respects common structure more than fixed-size | Still heuristic; weird formatting can defeat it; still returns broken text | Articles, blog posts, papers, reports |
| 5 | Document-structure aware | Document-based chunking, header-based chunking, element packing, Unstructured smart chunking | Parse the format and split on structure such as headers, tags, pages, or semantic elements | Chunk | max_characters, new_after_n_chars, section rules | Better topical coherence; less random tearing; robust across file types when partitioning works | Depends on extraction quality; bad title detection can create tiny or wrong chunks | PDFs, HTML, Markdown, manuals, policies |
| 6 | Table-aware | Table chunking, specialized table chunkers | Identify tables and keep them isolated; split oversized tables separately | Table chunk | size limit, table serialization | Preserves row and column structure | Large tables can dominate retrieval and prompt budget | Financials, specs, legal tables |
| 7 | Code-aware | Code chunking, AST chunkers | Parse code syntax and split by logical code blocks such as functions, classes, or modules | Code chunk | language detection, chunk size, docstring or import preservation | Preserves runnable units; improves code retrieval | Cross-file dependencies remain hard; large functions still need splitting | Repos, SDK docs, notebooks, code-heavy knowledge bases |
| 8 | Semantic similarity chunking | Semantic chunking, context-aware chunking, Unstructured by_similarity, Chonkie SemanticChunker | Embed local sentences or elements and place boundaries where similarity drops or merge only if similarity remains high | Chunk | embedding model, similarity threshold, window size, max size | Better topical purity; fewer mixed-topic chunks | More ingest-time compute; brittle if topic drift is gradual | Dense legal, academic, or technical narrative text |
| 9 | Neural boundary detection | Neural chunking, learned boundary detection | A learned model predicts good boundaries from semantic-coherence patterns | Chunk | model choice, min chunk size | Less hand-tuning; can outperform naive separator lists | Opaque decisions; domain mismatch can create odd splits | Mixed corpora where heuristics underperform |
| 10 | LLM-based chunking | LLM-based chunking, proposition extraction | Ask an LLM to propose boundaries or rewrite text into retrieval-friendly units | Chunk | prompts, target size, cost and latency budget | High semantic quality on complex text | Slow, expensive, nondeterministic; risk of summary drift | High-value docs where quality beats cost |
| 11 | Agentic chunking | Agentic chunking, strategy-selection chunking | An agent inspects each document and chooses a strategy or mix, possibly adding metadata | Chunk | model choice, allowed tools, budget | Flexible across wildly different document types | Highest complexity; inconsistent across runs without strong monitoring | Regulatory and multi-format enterprise corpora |
| 12 | Adaptive chunking | Adaptive chunking | Keep one approach but dynamically adjust size or overlap depending on content density or structure | Chunk | dynamic size ranges, density rules | Avoids one-size-fits-all behavior within a document | Harder to predict index stats and reproduce exactly | Long docs with mixed dense and sparse sections |
| 13 | Late chunking | Late chunking, embed-first-split-later chunking | Embed the whole document with full context and derive chunk embeddings afterward | Chunk with document-aware embeddings | long-context embedding model, split rules, pooling method | Reduces context loss in embeddings; improves cross-reference understanding | Requires long-context embeddings and more compute | Technical or legal docs with cross-references |
| 14 | Hierarchical chunking | Hierarchical chunking, parent-child chunking, multi-level chunking | Create multiple layers of chunks and retrieve coarse-to-fine | Multi-level chunks | levels, sizes per level, parent-child retrieval logic | Answers broad and specific questions more gracefully | More indexing and retrieval logic; still breaks text | Textbooks, handbooks, manuals, long contracts |
| 15 | POMA chunksets plus cheatsheets | Chunksets, hierarchical chunksets, breadcrumb context, cheatsheets | Parse sentence-by-sentence, infer hierarchy, and group sentences into unbreakable root-to-leaf paths | Chunkset, later compiled into a cheatsheet | Minimal exposed knobs; relies on structure inference | Facts arrive with contextual lineage attached; structured, token-efficient context | Requires an integrated ingestion and retrieval design; different mental model from standard chunk retrieval | High-stakes grounding in policies, legal, compliance, and manuals |
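The first two splitting strategies in the table (fixed-size and sliding window) are really one algorithm with two settings. A minimal sketch, assuming a pre-tokenized input; `fixed_size_chunks` is an illustrative name, and `chunk_size` / `step` correspond to the knobs named in the table:

```python
def fixed_size_chunks(tokens, chunk_size=200, step=None):
    """Cut a token list every `chunk_size` tokens.

    With step == chunk_size this is plain fixed-size chunking; with
    step < chunk_size consecutive chunks overlap (the sliding-window case).
    """
    if step is None:
        step = chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

Note how a smaller `step` multiplies index size: halving the step roughly doubles the number of chunks to embed and store, which is the redundancy cost the table flags.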
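Recursive delimiter chunking (row 4) tries coarse separators first and falls back to finer ones only for oversized pieces. A simplified sketch of the idea; unlike production splitters such as RecursiveCharacterTextSplitter, it does not re-pack small pieces back up to the size limit, and it drops the separators themselves:

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present; recurse into oversized parts."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for part in text.split(sep):
                # Parts no longer contain `sep`, so only finer separators remain.
                chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
            return chunks
    # No separator left: hard-cut as a last resort (the fixed-size fallback).
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]
```

The hard-cut fallback at the bottom is why the table calls this "still heuristic": a long run with no matching separator degrades to fixed-size behavior.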
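Semantic similarity chunking (row 8) places boundaries where the similarity between adjacent units drops. The sketch below uses a toy bag-of-words vector as a stand-in for a real embedding model (`toy_embed` is purely illustrative); real implementations plug in a neural embedder and often compare windows of sentences rather than single sentences:

```python
import math
from collections import Counter

def toy_embed(sentence):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever similarity to the previous sentence drops."""
    chunks, current = [], [sentences[0]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append(current)  # similarity dropped: close the chunk
            current = []
        current.append(sent)
        prev = vec
    chunks.append(current)
    return chunks
```

The single `threshold` knob is also the failure mode the table notes: gradual topic drift never produces a sharp similarity drop, so no boundary fires.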
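Late chunking (row 13) inverts the usual order: the whole document is run through a long-context embedding model first, and chunk vectors are derived afterward by pooling token embeddings over each span. A minimal mean-pooling sketch, assuming `token_embeddings` came from such a model (the function name and list-of-lists representation are illustrative):

```python
def late_chunk_embeddings(token_embeddings, spans):
    """Mean-pool full-document token embeddings over each chunk span.

    token_embeddings: one vector per token, produced by embedding the WHOLE
    document at once, so each vector already carries document-wide context.
    spans: (start, end) token-index pairs marking chunk boundaries.
    """
    dim = len(token_embeddings[0])
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        pooled = [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        chunk_vectors.append(pooled)
    return chunk_vectors
```

Because pooling happens after the full-document pass, a chunk containing "it" or "the policy" inherits embedding signal from the antecedent elsewhere in the document, which is the cross-reference benefit the table describes.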
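Hierarchical (parent-child) chunking, row 14, indexes small child chunks for precise matching but returns their larger parent for context. A toy sketch with keyword matching standing in for vector search; `build_hierarchy` and `retrieve_parent` are hypothetical names, and real systems match children by embedding similarity:

```python
def build_hierarchy(sections):
    """sections: list of (title, [sentences]). Builds parent texts and
    child chunks, each child keeping a pointer to its parent index."""
    parents, children = [], []
    for pid, (title, sentences) in enumerate(sections):
        parents.append(title + ": " + " ".join(sentences))
        for sent in sentences:
            children.append((sent, pid))
    return parents, children

def retrieve_parent(query_word, parents, children):
    """Match a small child chunk, then return its parent for fuller context.
    (Keyword containment stands in for similarity search.)"""
    for sent, pid in children:
        if query_word in sent:
            return parents[pid]
    return None
```

This is the "more retrieval logic" cost in the table: the index must store the child-to-parent mapping and the query path must dereference it.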
For the narrative framing around this table, go back to the RAG chunking guide.
## Continue reading
- Strategy landscape — detailed explanation of each strategy
- Common failure modes — why most strategies fail in similar ways
- POMA chunksets — the non-breaking alternative
- The full chunking guide — complete narrative guide