# Chunking strategy comparison
This table maps the chunker names you will see across different vendors and sources, shows how boundaries are chosen, and summarizes the most common tradeoffs.
| Complexity | Chunker strategy | Names you'll see | How boundaries are chosen | Retrieval unit | Typical knobs | Advantages | Disadvantages / common failure modes | Good fit |
|---|---|---|---|---|---|---|---|---|
| 0 | No chunking | "No chunking", "whole document", "document-as-unit" | Don't split; store each record or document as-is | Whole doc or record | Mostly retrieval knobs such as top-k and filters | Simplest; preserves full context; minimal ingest logic | Coarse embeddings; lost-in-the-middle risk; higher latency or cost if you stuff whole docs into prompts | FAQs, short tickets, short posts, already atomic content |
| 1 | Fixed-size (tokens or chars) | Weaviate fixed-size chunking, Pinecone fixed-size chunking, Chonkie TokenChunker | Cut every N tokens or characters regardless of meaning | Chunk | chunk_size, overlap, chunk_overlap | Fast, cheap, deterministic baseline; easy to A/B test | Breaks sentences and ideas; can orphan facts at boundaries | Quick baseline; messy text; speed-first ingestion |
| 2 | Sliding window | Sliding window, overlap chunking, windowed chunking | Like fixed-size, but the step size is smaller than the chunk size, so consecutive chunks overlap heavily | Chunk | overlap ratio, step size | Reduces boundary loss without smarter semantics | Redundant index; higher embedding and storage cost; duplicate-heavy retrieval | When boundary misses hurt but you still need a deterministic path |
| 3 | Sentence-based | Sentence splitting, paragraph splitting, Chonkie SentenceChunker | Split on sentence boundaries and then pack to a size limit | Chunk | tokenizer, max tokens, overlap, min sentences | Cleaner thought units; fewer mid-sentence breaks | Sentence length varies; a single sentence can still be too context-poor | QA and summarization where sentence integrity matters |
| 4 | Recursive delimiter chunking | Recursive chunking, RecursiveCharacterTextSplitter, Chonkie RecursiveChunker | Try paragraph breaks first, then sentence breaks, then words until the chunk fits | Chunk | separator list, max size, min chunk size | Solid default; respects common structure more than fixed-size | Still heuristic; weird formatting can defeat it; still returns broken text | Articles, blog posts, papers, reports |
| 5 | Document-structure aware | Document-based chunking, header-based chunking, element packing, Unstructured smart chunking | Parse the format and split on structure such as headers, tags, pages, or semantic elements | Chunk | max_characters, new_after_n_chars, section rules | Better topical coherence; less random tearing; robust across file types when partitioning works | Depends on extraction quality; bad title detection can create tiny or wrong chunks | PDFs, HTML, Markdown, manuals, policies |
| 6 | Table-aware | Table chunking, specialized table chunkers | Identify tables and keep them isolated; split oversized tables separately | Table chunk | size limit, table serialization | Preserves row and column structure | Large tables can dominate retrieval and prompt budget | Financials, specs, legal tables |
| 7 | Code-aware | Code chunking, AST chunkers | Parse code syntax and split by logical code blocks such as functions, classes, or modules | Code chunk | language detection, chunk size, docstring or import preservation | Preserves runnable units; improves code retrieval | Cross-file dependencies remain hard; large functions still need splitting | Repos, SDK docs, notebooks, code-heavy knowledge bases |
| 8 | Semantic similarity chunking | Semantic chunking, context-aware chunking, Unstructured by_similarity, Chonkie SemanticChunker | Embed local sentences or elements and place boundaries where similarity drops or merge only if similarity remains high | Chunk | embedding model, similarity threshold, window size, max size | Better topical purity; fewer mixed-topic chunks | More ingest-time compute; brittle if topic drift is gradual | Dense legal, academic, or technical narrative text |
| 9 | Neural boundary detection | Neural chunking, learned boundary detection | A learned model predicts good boundaries from semantic-coherence patterns | Chunk | model choice, min chunk size | Less hand-tuning; can outperform naive separator lists | Opaque decisions; domain mismatch can create odd splits | Mixed corpora where heuristics underperform |
| 10 | LLM-based chunking | LLM-based chunking, proposition extraction | Ask an LLM to propose boundaries or rewrite text into retrieval-friendly units | Chunk | prompts, target size, cost and latency budget | High semantic quality on complex text | Slow, expensive, nondeterministic; risk of summary drift | High-value docs where quality beats cost |
| 11 | Agentic chunking | Agentic chunking, strategy-selection chunking | An agent inspects each document and chooses a strategy or mix, possibly adding metadata | Chunk | model choice, allowed tools, budget | Flexible across wildly different document types | Highest complexity; inconsistent across runs without strong monitoring | Regulatory and multi-format enterprise corpora |
| 12 | Adaptive chunking | Adaptive chunking | Keep one approach but dynamically adjust size or overlap depending on content density or structure | Chunk | dynamic size ranges, density rules | Avoids one-size-fits-all behavior within a document | Harder to predict index stats and reproduce exactly | Long docs with mixed dense and sparse sections |
| 13 | Late chunking | Late chunking, embed-first-split-later chunking | Embed the whole document with full context and derive chunk embeddings afterward | Chunk with document-aware embeddings | long-context embedding model, split rules, pooling method | Reduces context loss in embeddings; improves cross-reference understanding | Requires long-context embeddings and more compute | Technical or legal docs with cross-references |
| 14 | Hierarchical chunking | Hierarchical chunking, parent-child chunking, multi-level chunking | Create multiple layers of chunks and retrieve coarse-to-fine | Multi-level chunks | levels, sizes per level, parent-child retrieval logic | Answers broad and specific questions more gracefully | More indexing and retrieval logic; still breaks text | Textbooks, handbooks, manuals, long contracts |
| 15 | POMA chunksets plus cheatsheets | Chunksets, hierarchical chunksets, breadcrumb context, cheatsheets | Parse sentence-by-sentence, infer hierarchy, and group sentences into unbreakable root-to-leaf paths | Chunkset, later compiled into a cheatsheet | Minimal exposed knobs; relies on structure inference | Facts arrive with contextual lineage attached; structured, token-efficient context | Requires an integrated ingestion and retrieval design; different mental model from standard chunk retrieval | High-stakes grounding in policies, legal, compliance, and manuals |
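The first two splitting strategies in the table (fixed-size and sliding window) are really one algorithm with two settings. A minimal sketch, assuming a pre-tokenized input; `fixed_size_chunks` is an illustrative name, and `chunk_size` / `step` correspond to the knobs named in the table:

```python
def fixed_size_chunks(tokens, chunk_size=200, step=None):
    """Cut a token list every `chunk_size` tokens.

    With step == chunk_size this is plain fixed-size chunking; with
    step < chunk_size consecutive chunks overlap (the sliding-window case).
    """
    if step is None:
        step = chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

Note how a smaller `step` multiplies index size: halving the step roughly doubles the number of chunks to embed and store, which is the redundancy cost the table flags.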
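Recursive delimiter chunking (row 4) tries coarse separators first and falls back to finer ones only for oversized pieces. A simplified sketch of the idea; unlike production splitters such as RecursiveCharacterTextSplitter, it does not re-pack small pieces back up to the size limit, and it drops the separators themselves:

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present; recurse into oversized parts."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for part in text.split(sep):
                # Parts no longer contain `sep`, so only finer separators remain.
                chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
            return chunks
    # No separator left: hard-cut as a last resort (the fixed-size fallback).
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]
```

The hard-cut fallback at the bottom is why the table calls this "still heuristic": a long run with no matching separator degrades to fixed-size behavior.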
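Semantic similarity chunking (row 8) places boundaries where the similarity between adjacent units drops. The sketch below uses a toy bag-of-words vector as a stand-in for a real embedding model (`toy_embed` is purely illustrative); real implementations plug in a neural embedder and often compare windows of sentences rather than single sentences:

```python
import math
from collections import Counter

def toy_embed(sentence):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever similarity to the previous sentence drops."""
    chunks, current = [], [sentences[0]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append(current)  # similarity dropped: close the chunk
            current = []
        current.append(sent)
        prev = vec
    chunks.append(current)
    return chunks
```

The single `threshold` knob is also the failure mode the table notes: gradual topic drift never produces a sharp similarity drop, so no boundary fires.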
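Late chunking (row 13) inverts the usual order: the whole document is run through a long-context embedding model first, and chunk vectors are derived afterward by pooling token embeddings over each span. A minimal mean-pooling sketch, assuming `token_embeddings` came from such a model (the function name and list-of-lists representation are illustrative):

```python
def late_chunk_embeddings(token_embeddings, spans):
    """Mean-pool full-document token embeddings over each chunk span.

    token_embeddings: one vector per token, produced by embedding the WHOLE
    document at once, so each vector already carries document-wide context.
    spans: (start, end) token-index pairs marking chunk boundaries.
    """
    dim = len(token_embeddings[0])
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        pooled = [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        chunk_vectors.append(pooled)
    return chunk_vectors
```

Because pooling happens after the full-document pass, a chunk containing "it" or "the policy" inherits embedding signal from the antecedent elsewhere in the document, which is the cross-reference benefit the table describes.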
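Hierarchical (parent-child) chunking, row 14, indexes small child chunks for precise matching but returns their larger parent for context. A toy sketch with keyword matching standing in for vector search; `build_hierarchy` and `retrieve_parent` are hypothetical names, and real systems match children by embedding similarity:

```python
def build_hierarchy(sections):
    """sections: list of (title, [sentences]). Builds parent texts and
    child chunks, each child keeping a pointer to its parent index."""
    parents, children = [], []
    for pid, (title, sentences) in enumerate(sections):
        parents.append(title + ": " + " ".join(sentences))
        for sent in sentences:
            children.append((sent, pid))
    return parents, children

def retrieve_parent(query_word, parents, children):
    """Match a small child chunk, then return its parent for fuller context.
    (Keyword containment stands in for similarity search.)"""
    for sent, pid in children:
        if query_word in sent:
            return parents[pid]
    return None
```

This is the "more retrieval logic" cost in the table: the index must store the child-to-parent mapping and the query path must dereference it.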
For the narrative framing around this table, go back to the RAG chunking guide.
## Continue reading
- Strategy landscape — detailed explanation of each strategy
- Common failure modes — why most strategies fail in similar ways
- POMA chunksets — the non-breaking alternative
- The full chunking guide — complete narrative guide