
Cheatsheets: Structure-Preserving RAG Context

Cheatsheets turn retrieved chunksets into concise, structure-aware prompt context.

Why cheatsheets exist: the chunking/retrieval alignment problem

Hierarchical chunking creates coherent boundaries — chapter → section → paragraph — but those boundaries only pay off if the retrieval layer knows the structure exists. Parent-document retrieval, metadata filters, and structured search tied to a chunk hierarchy all require the vector store, the reranker, and the prompt assembler to be hierarchy-aware. Plug a hierarchical chunker into a flat retrieval pipeline and you still get context windows that miss the surrounding signal: the structure is in the index but invisible at query time.

Cheatsheets are POMA's answer to that alignment cost. Instead of asking the retrieval layer to understand hierarchy, we push structural awareness into a doc-level artefact:

  1. The retrieval layer can stay simple — it just returns chunksets relevant to the query.
  2. generate_cheatsheets(...) assembles those chunksets back into the document's lineage at query time, deduplicating and ordering so each cheatsheet reads as a coherent, prompt-ready block.
  3. The LLM sees structure in the text, not in the index — no parent-doc joins, no metadata filters, no hierarchy-aware reranker required.

Net result: you keep the precision wins of hierarchical chunking without paying the integration tax of a hierarchy-aware retrieval stack.

Usage

generate_cheatsheets(...) is a query-time call: you give it the chunksets your retrieval layer flagged as relevant, plus the chunks of the documents they come from, and it returns one structured cheatsheet per document. The overall workflow has two phases: ingest, then query.

```python
from poma import PrimeCut, generate_cheatsheets

# --- Phase 1: ingest once, persist the artefacts -----------------------
client = PrimeCut()
result = client.ingest("example.pdf")

# Persist `result.chunksets` (with embeddings) to your vector DB, and
# keep `result.chunks` somewhere addressable by file_id: typically the
# same store, or the original `.poma` archive on disk. See the
# integrations pages for end-to-end Qdrant / LangChain / LlamaIndex
# examples.

# --- Phase 2: at query time, build a cheatsheet from the hits ---------
# `relevant_chunksets` is whatever your retrieval returned for the
# user's query. It could be one chunkset or dozens, spanning one
# document or many. `all_chunks` is the full chunk list for every
# document those chunksets reference; `generate_cheatsheets` uses it
# to expand hierarchy and stitch siblings together.
cheatsheets = generate_cheatsheets(
    relevant_chunksets=hits_from_your_vector_db,
    all_chunks=chunks_for_those_docs,
)

for sheet in cheatsheets:
    print(sheet["file_id"], "→", sheet["content"][:200], "…")
```

If you just want to see what the function does without wiring up a vector DB, pass the full result back in — it reconstructs the whole document as a single cheatsheet:

```python
cheatsheets = generate_cheatsheets(
    relevant_chunksets=result.chunksets,   # use every chunkset
    all_chunks=result.chunks,
)
print(cheatsheets[0]["content"])
```

generate_cheatsheets(...) accepts either raw dicts or SDK model objects — anything implementing to_dict() is coerced internally. Pass the model objects straight through, or load dicts from a .poma archive's chunks.json / chunksets.json; both work.
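To picture the coercion described above, here is a minimal sketch. It is not POMA's internal code: `as_plain_dict` and `DemoChunk` are hypothetical names invented for illustration, and the `content` field name is an assumption. The only behaviour taken from the text is that dicts pass through unchanged while objects exposing `to_dict()` are converted.

```python
from typing import Any


def as_plain_dict(item: Any) -> dict:
    """Return a dict for either a raw dict or an object exposing to_dict()."""
    if isinstance(item, dict):
        return item
    to_dict = getattr(item, "to_dict", None)
    if callable(to_dict):
        return to_dict()
    raise TypeError(f"cannot coerce {type(item).__name__} to dict")


class DemoChunk:
    """Hypothetical stand-in for an SDK model object (not a real POMA class)."""

    def __init__(self, file_id: str, chunk_index: int, content: str) -> None:
        self.file_id = file_id
        self.chunk_index = chunk_index
        self.content = content

    def to_dict(self) -> dict:
        return {
            "file_id": self.file_id,
            "chunk_index": self.chunk_index,
            "content": self.content,
        }


# A mixed list: one model-like object, one raw dict.
mixed = [
    DemoChunk("doc-a", 0, "Title"),
    {"file_id": "doc-a", "chunk_index": 1, "content": "Body"},
]
normalised = [as_plain_dict(item) for item in mixed]
```

After this step every entry in `normalised` is a plain dict, which is the shape the rest of this page assumes when it talks about fields such as file_id and chunk_index.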

When the relevant chunksets span multiple documents, generate_cheatsheets(...) returns one cheatsheet per file_id in the result list.

PrimeCut.create_cheatsheet(...) and PrimeCut.create_cheatsheets(...) still exist for compatibility, but both are deprecated.

What to embed, what to store

A common question: "what exactly do I put into the vector DB?" The rule is short:

Embed chunkset.to_embed and nothing else. Store the rest of the chunkset and the chunks wherever fits your stack.

to_embed is a str on every PomaChunkSet — a normalised, embedding-ready text produced by PrimeCut from the chunkset's members. It is the canonical embedding input. Don't synthesise your own from the chunk contents; don't embed the chunkset metadata; don't embed the raw doc. Using anything other than to_embed undoes the work PrimeCut did to keep the hierarchical signal embedding-friendly.
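As a rough sketch of that rule, the snippet below builds one record per chunkset and computes the vector from to_embed only. Everything beyond the field names mentioned on this page is an assumption: `build_vector_records`, `fake_embed`, and the `{"vector", "payload"}` record layout are hypothetical and do not mirror any particular vector DB's schema. Swap `fake_embed` for your real embedding model.

```python
from typing import Callable, Dict, List, Sequence


def build_vector_records(
    chunksets: Sequence[dict],
    embed_fn: Callable[[str], List[float]],
) -> List[Dict]:
    """Build one record per chunkset; the vector comes from to_embed only."""
    records = []
    for cs in chunksets:
        records.append(
            {
                "vector": embed_fn(cs["to_embed"]),
                # Payload keeps the join fields, not the embedding text.
                "payload": {
                    "file_id": cs["file_id"],
                    "chunkset_index": cs["chunkset_index"],
                    "chunks": list(cs["chunks"]),
                },
            }
        )
    return records


def fake_embed(text: str) -> List[float]:
    # Placeholder so the sketch runs without a model: [length, space count].
    return [float(len(text)), float(text.count(" "))]


demo_chunksets = [
    {"file_id": "doc-a", "chunkset_index": 0, "chunks": [0, 1, 2], "to_embed": "Intro overview"},
    {"file_id": "doc-a", "chunkset_index": 1, "chunks": [0, 3], "to_embed": "Intro details here"},
]
records = build_vector_records(demo_chunksets, fake_embed)
```

Note that nothing except to_embed reaches `embed_fn`: the chunk indices and file_id travel as payload so they can be read back after a hit.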

Two storage decisions are independent and depend on your stack:

| Artefact | Where it lives | When you need it |
| --- | --- | --- |
| Embedding of chunkset.to_embed | Vector DB, with one vector per chunkset | Every query: semantic search runs against this |
| The chunkset itself (chunkset_index, chunks, file_id) | VDB payload alongside the vector, or object storage, filesystem, the original .poma archive | After a hit, to know which chunks to assemble |
| The chunks (result.chunks for each doc) | VDB payload, object storage, filesystem, or the .poma archive | When generate_cheatsheets(...) runs: it needs them to expand hierarchy and stitch siblings |

Two common architectures:

  • Single-store (the Qdrant pattern). Each chunkset is one point in the collection: vector = to_embed embedding, payload = chunkset fields, and — when store_chunk_details=True — the chunks for that chunkset inlined as well. One round trip per query, no second lookup. See Qdrant integration.
  • Split-store. Chunksets (with vectors + minimal metadata) in the VDB, chunks held externally — disk, S3, or just keeping the .poma archive around. Lower VDB cost, but every query needs a second fetch to load the chunks for the matched docs before calling generate_cheatsheets(...). The LangChain and LlamaIndex helpers support this shape too.

Either way, the embedding input is always chunkset.to_embed, and the chunk corpus has to be reachable by file_id at query time so generate_cheatsheets(...) can find the parents and siblings the cheatsheet needs.

How file_id ties it together

Every chunkset and every chunk carries a file_id. That field is the join key between the two — and the reason generate_cheatsheets(...) can match retrieved chunksets back to their source document. So whatever you store, keep file_id reachable on both sides.

At query time, when your retrieval returns N chunksets from possibly multiple documents:

  1. Group the hits by file_id.
  2. For each file_id, load the chunks for that document. Two valid workflows:
    • Minimum payload — fetch only the chunks whose chunk_index appears in one of that doc's retrieved chunksets' chunks arrays. Smallest fetch, but generate_cheatsheets(...) has nothing extra to walk; the resulting cheatsheet will be exactly the leaves you returned, without ancestor or sibling enrichment.
    • Easier workflow — fetch all chunks for that file_id (the original result.chunks from ingest). Slightly more data per query, but generate_cheatsheets(...) gets to walk the full hierarchy and emit a richer cheatsheet with parents, siblings, and proper depth-based ordering. This is what the bundled Qdrant integration does by default when store_chunk_details=True — it stores enough of the chunks alongside each chunkset point that one query returns the whole assembly set.
  3. Pass the union of those chunks as all_chunks= to generate_cheatsheets(...). It sorts internally by (file_id, chunk_index) and returns one cheatsheet per document.
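The steps above can be sketched with plain dicts. This is an illustration under stated assumptions, not the bundled integrations' code: `group_hits_by_file_id`, `select_chunks`, and the in-memory `chunk_store` keyed by file_id are hypothetical names, and the `content` field is assumed. The `minimum_payload` flag switches between the two workflows described in step 2.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_hits_by_file_id(hits: Iterable[dict]) -> Dict[str, List[dict]]:
    """Step 1: group retrieved chunksets by their source document."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["file_id"]].append(hit)
    return dict(grouped)


def select_chunks(
    hits: Iterable[dict],
    chunk_store: Dict[str, List[dict]],
    minimum_payload: bool = False,
) -> List[dict]:
    """Step 2: load chunks per document, either all of them or only
    those whose chunk_index is referenced by a retrieved chunkset."""
    selected: List[dict] = []
    for file_id, doc_hits in group_hits_by_file_id(hits).items():
        doc_chunks = chunk_store[file_id]
        if minimum_payload:
            wanted = {idx for hit in doc_hits for idx in hit["chunks"]}
            doc_chunks = [c for c in doc_chunks if c["chunk_index"] in wanted]
        selected.extend(doc_chunks)
    return selected


# Hypothetical stand-in for wherever the chunks live (disk, S3, VDB payload).
chunk_store = {
    "doc-a": [{"file_id": "doc-a", "chunk_index": i, "content": f"a{i}"} for i in range(4)],
    "doc-b": [{"file_id": "doc-b", "chunk_index": i, "content": f"b{i}"} for i in range(2)],
}
hits = [
    {"file_id": "doc-a", "chunks": [0, 2]},
    {"file_id": "doc-b", "chunks": [1]},
    {"file_id": "doc-a", "chunks": [0, 3]},
]

minimal = select_chunks(hits, chunk_store, minimum_payload=True)
full = select_chunks(hits, chunk_store)
```

Either list can then be passed as all_chunks= together with `hits` as relevant_chunksets=. With this demo data, `minimal` holds four chunks and `full` holds six, which is the payload-versus-lineage trade-off in miniature.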

The choice between minimum-payload and full-doc is a knob, not a contract — both produce a valid cheatsheet. Pick "all chunks per file_id" when storage cost or the extra fetch hop isn't a worry; pick "only the referenced indices" when payload size matters more than lineage richness.

Input rules

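  • Each entry in relevant_chunksets must contain a chunks list (the chunk_index values it references)
  • all_chunks must contain the chunk content for the indices those lists reference
  • file_id should be present on both chunksets and chunks so each can be matched to its source document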
  • Duplicate chunk_index values within one document will fail validation
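If you want to catch these problems before the call, a pre-flight check along the following lines may help. It is a hedged sketch: `check_cheatsheet_inputs` is a hypothetical helper, not part of the SDK, and it does not reproduce the library's own validation or error messages. It only checks that referenced indices are present, not that the chunk content itself is well-formed.

```python
from collections import Counter
from typing import List, Sequence


def check_cheatsheet_inputs(
    relevant_chunksets: Sequence[dict],
    all_chunks: Sequence[dict],
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []

    # Rule: no duplicate chunk_index values within one document.
    counts = Counter((c.get("file_id"), c.get("chunk_index")) for c in all_chunks)
    for (file_id, idx), n in counts.items():
        if n > 1:
            problems.append(f"{file_id}: chunk_index {idx} appears {n} times")

    for pos, cs in enumerate(relevant_chunksets):
        file_id = cs.get("file_id")
        if not file_id:
            problems.append(f"chunkset {pos}: missing file_id")
        refs = cs.get("chunks")
        if not isinstance(refs, list):
            problems.append(f"chunkset {pos}: missing 'chunks' list")
            continue
        # Rule: every referenced index needs a matching chunk in all_chunks.
        missing = [i for i in refs if (file_id, i) not in counts]
        if missing:
            problems.append(f"chunkset {pos}: no chunk for indices {missing}")
    return problems


good_chunks = [{"file_id": "doc-a", "chunk_index": i, "content": f"c{i}"} for i in range(3)]
good_sets = [{"file_id": "doc-a", "chunks": [0, 2]}]
bad_chunks = good_chunks + [{"file_id": "doc-a", "chunk_index": 2, "content": "dup"}]
bad_sets = [{"file_id": "doc-a"}, {"file_id": "doc-a", "chunks": [0, 7]}]
```

Calling `check_cheatsheet_inputs(good_sets, good_chunks)` returns an empty list, while the `bad_*` inputs produce one message per violated rule.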

This same cheatsheet logic is also used by the bundled Qdrant, LangChain, and LlamaIndex integrations.