Concepts / Overview

POMA AI is a context engine for retrieval-augmented generation (RAG). It turns documents into prompt-ready context for large language models, without you having to design a chunking strategy, run a vector store, or hand-write retrieval glue.

This page is a high-level orientation. Each idea links down to a deeper concept page when you want detail.

The two-line summary

Ingestion parses your document, walks its structure, and emits a hierarchy of typed units called chunks and chunksets.
Retrieval finds the chunksets relevant to a query and renders them as a single, prompt-ready text block — a cheatsheet (PrimeCut) or a RetrievalContext (Grill).

That is the entire mental model. Everything else is implementation detail.

What POMA gives you

POMA's pipeline takes a source file and produces three things in sequence: chunks, chunksets, and prompt-ready context.

source file  →  chunks      →  chunksets    →  prompt-ready context
                (units)        (paths)         (drop into an LLM)

Artifact	What it is	Why it exists
Chunk	The smallest typed unit: a sentence or paragraph with a `depth` integer, page number, and a `to_embed` field.	The atom of retrieval.
Chunkset	A root-to-leaf path through the document's hierarchy — a chunk together with the ancestor breadcrumbs that make it self-explanatory.	The retrieval unit. Sentences carry their structural context with them.
`.poma` archive	A ZIP file bundling the chunks, chunksets, images, page renders, and metadata.	The portable on-disk format. PrimeCut hands it back to you; Grill keeps it server-side.
Cheatsheet / RetrievalContext	A single string of XML + Markdown, deduplicated and budget-fit, that drops straight into an LLM prompt.	What the model actually sees at query time.

Full ingestion architecture: /sdk/concepts/ingestion and /grill/concepts/ingestion.

Why hierarchical chunking

Most RAG stacks chunk by cutting text into fixed-size windows, then embed each slice independently. That works for short documents and breaks for long ones — section headers get separated from their content, tables get shredded, and the LLM sees orphaned facts without their lineage.
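The failure mode is easy to reproduce. The sketch below is a naive fixed-size character window, not anything POMA does; the function name `fixed_size_windows`, the window size of 40, and the sample text are illustrative assumptions.

```python
def fixed_size_windows(text: str, size: int) -> list[str]:
    """Cut text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative document: a section header followed by one sentence.
doc = (
    "Health Insurance\n"
    "Employees are eligible for the standard plan after 30 days."
)

windows = fixed_size_windows(doc, 40)
for window in windows:
    print(repr(window))
```

The second window holds the end of the sentence with no trace of the "Health Insurance" header, which is exactly the orphaned-fact problem described above.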

POMA never cuts text. Instead, it preserves the document's hierarchy and groups sentences into chunksets — complete root-to-leaf paths.

A chunkset for an HR-policy section looks like:

Employee Handbook → Benefits → Health Insurance →
  "Employees are eligible for the standard plan after 30 days.
   Open enrollment runs annually from November 1 to November 15."

The model reads the leaf sentence and every breadcrumb that gives it meaning. No "section header lost to a different chunk" failure mode, no overlap inflation, no token budget wasted on duplicate context.
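To make the idea of a root-to-leaf path concrete, here is a minimal sketch that derives such paths from a flat list of depth-annotated chunks. It is not POMA's implementation: the `Chunk` class, its `text` field, and `build_chunksets` are hypothetical names, and only the `depth` integer comes from the chunk description above. The "Dental" section is invented sample data.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    depth: int  # 0 = document title; larger values are nested deeper


def build_chunksets(chunks: list[Chunk]) -> list[tuple[list[str], list[str]]]:
    """Return (breadcrumbs, leaf_texts) pairs, one per root-to-leaf path."""
    chunksets: list[tuple[list[str], list[str]]] = []
    stack: list[Chunk] = []  # current ancestors, shallowest first
    for i, chunk in enumerate(chunks):
        # Discard ancestors that are not strictly above this chunk.
        while stack and stack[-1].depth >= chunk.depth:
            stack.pop()
        nxt = chunks[i + 1] if i + 1 < len(chunks) else None
        if nxt is not None and nxt.depth > chunk.depth:
            stack.append(chunk)  # has children, so it acts as a breadcrumb
            continue
        breadcrumbs = [c.text for c in stack]
        if chunksets and chunksets[-1][0] == breadcrumbs:
            chunksets[-1][1].append(chunk.text)  # sibling leaf on the same path
        else:
            chunksets.append((breadcrumbs, [chunk.text]))
    return chunksets


chunks = [
    Chunk("Employee Handbook", 0),
    Chunk("Benefits", 1),
    Chunk("Health Insurance", 2),
    Chunk("Employees are eligible for the standard plan after 30 days.", 3),
    Chunk("Open enrollment runs annually from November 1 to November 15.", 3),
    Chunk("Dental", 2),  # invented section, for illustration only
    Chunk("Dental coverage is optional.", 3),
]

chunksets = build_chunksets(chunks)
for breadcrumbs, leaves in chunksets:
    print(" → ".join(breadcrumbs), "→", leaves)
```

Each leaf sentence ends up paired with the headers above it, so a retrieved sentence never arrives without its section lineage.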

Deeper read: /learn/chunking/chunksets and the long-form Ultimate guide to RAG chunking strategies.

Cheatsheets — the retrieval primitive

Hierarchical chunking only pays off if the retrieval side knows the structure exists. Plug a hierarchical chunker into a flat retrieval pipeline and you keep the indexing overhead while losing the structural signal at query time.

POMA's answer is the cheatsheet: a per-document artifact assembled at query time. The retrieval layer can stay simple, because it only has to return the chunksets relevant to a query. `generate_cheatsheets(...)` then re-assembles them along the document's lineage, deduplicates and orders them, and fits the result to a budget as a single prompt-ready block.

The LLM sees structure in the text, not in the index. No parent-document retrieval, no metadata filters, no hierarchy-aware reranker required.
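The three assembly steps (dedupe, order, budget-fit) can be sketched in a few lines. This is an illustration of the idea only, not the behavior of `generate_cheatsheets(...)`: the `Hit` class, `assemble_context`, the whitespace token count, and the choice to restore document order are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    index: int    # position of the chunk in the source document
    text: str
    score: float  # relevance score from whatever retriever you use


def assemble_context(hits: list[Hit], max_tokens: int) -> str:
    # Dedupe: keep only the best-scoring hit for each chunk index.
    best: dict[int, Hit] = {}
    for hit in hits:
        if hit.index not in best or hit.score > best[hit.index].score:
            best[hit.index] = hit

    # Budget-fit: admit hits by descending score while they still fit.
    kept: list[Hit] = []
    used = 0
    for hit in sorted(best.values(), key=lambda h: h.score, reverse=True):
        cost = len(hit.text.split())  # crude whitespace token count (assumption)
        if used + cost <= max_tokens:
            kept.append(hit)
            used += cost

    # Order: restore document order so the text reads naturally (assumption).
    kept.sort(key=lambda h: h.index)
    return "\n".join(h.text for h in kept)


hits = [
    Hit(2, "Open enrollment runs annually", 0.9),
    Hit(0, "Benefits overview", 0.5),
    Hit(2, "Open enrollment runs annually", 0.7),  # duplicate of chunk 2
    Hit(1, "Plans are reviewed every single year", 0.1),
]
context = assemble_context(hits, max_tokens=6)
print(context)
```

With a budget of six tokens the duplicate is collapsed, the lowest-scoring chunk is dropped, and the two survivors come back in document order.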

Deeper read: /sdk/concepts/cheatsheets.

The two products, side by side

POMA ships the same pipeline behind two product surfaces:

	Grill	PrimeCut
What it returns	A prompt-ready `RetrievalContext` string (XML + Markdown)	A `.poma` archive with raw chunks, chunksets, images, and metadata
Who runs retrieval	POMA (server-side: hybrid search, sandwich ordering, token budgeting)	You (use the chunks however you like — Qdrant, LangChain, LlamaIndex, your own stack)
You need a vector store	No	Yes (for embeddings)
API version	v3 (project-scoped)	v2 + v3 (account-scoped)
Best for	"I want a managed RAG endpoint that returns context for my prompt."	"I need raw chunks because I run my own retrieval stack."
Use it from	Grill product docs, Grill in the SDK, `poma-grill-mcp`, or the hosted MCP endpoint	PrimeCut product page, PrimeCut in the SDK, `poma-mcp`, or the CLI

Both are powered by the same ingestion pipeline. The difference is what happens after the chunks are produced: Grill indexes them into a project namespace and serves search results, while PrimeCut hands the `.poma` archive back to you.

The standard lifecycle

Every POMA workflow follows the same three stages:

1. Ingest

You submit a source file (PDF, DOCX, HTML, Markdown, image, …). The server runs conversion → indentation → chunking → chunkset assembly → archiving. The job is asynchronous: you get a `job_id` back and poll until its status is `done`.

Grill: /grill/concepts/ingestion (also see Quickstart)
PrimeCut: /sdk/concepts/ingestion (also see Quickstart)
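The submit-then-poll pattern from the Ingest stage can be sketched as a small loop. The helper below deliberately does not call any POMA API: `wait_for_job` is a hypothetical name, `fetch_status` is a callable you would supply around your own HTTP client or SDK call, and the `"processing"` and `"failed"` status names are assumptions. Only the `done` status comes from this page.

```python
import time
from typing import Callable


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until fetch_status(job_id) returns "done"."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status == "done":
            return
        if status == "failed":  # assumed name for a terminal failure status
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Usage with a fake status source standing in for the real status call.
statuses = iter(["processing", "processing", "done"])
calls: list[str] = []


def fake_fetch(job_id: str) -> str:
    calls.append(job_id)
    return next(statuses)


wait_for_job("job-123", fake_fetch, sleep=lambda _seconds: None)
print(f"polled {len(calls)} times")
```

Passing `fetch_status` and `sleep` in as parameters keeps the loop testable without network access and leaves the choice of client to you.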

2. Retrieve

You issue a query and POMA returns the chunksets that match. With Grill, that's a single `POST /grill/search` call returning a prompt-ready context block. With PrimeCut, you embed the chunksets yourself (in Qdrant, pgvector, …) and run retrieval; POMA's `generate_cheatsheets(...)` helper assembles the result back into a context block at the end.

Grill: /grill/concepts/retrieval and RetrievalContext format
PrimeCut: /sdk/concepts/cheatsheets and the integration walkthroughs at Qdrant / LangChain / LlamaIndex
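For the PrimeCut path, the "embed and retrieve" step can be pictured with a toy example. The sketch below swaps in word-count vectors and an in-memory list for a real embedding model and vector store, so it only illustrates the ranking idea; `embed`, `cosine`, and `top_k` are hypothetical helpers, and the two strings stand in for text you would take from each chunkset.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, texts: list[str], k: int) -> list[str]:
    """Return the k texts most similar to the query."""
    q = embed(query)
    return sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]


texts = [
    "Employees are eligible for the standard plan after 30 days.",
    "Open enrollment runs annually from November 1 to November 15.",
]
results = top_k("when is open enrollment", texts, k=1)
print(results)
```

In a real stack the list and the cosine loop are replaced by your vector store's search call, and the returned chunksets are what you pass on for assembly.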

3. Prompt

You drop the resulting cheatsheet or RetrievalContext into your LLM call. POMA's output is structured XML + Markdown:

```xml
<query>How did operating margin change year over year?</query>
<doc id="annual-report-2025" title="Annual Report 2025" pages="84">
  [p84]
  ## Operating margin

  Operating margin rose from **18.4%** in FY24 to **21.1%** in FY25, …

  […]

  Cost-of-goods-sold improvements contributed roughly 1.6 pts …
</doc>
```

The block is sandwich-ordered (most relevant passages at top and bottom for best LLM recall), skip-marked ([…] between non-consecutive chunks of the same document), token-budgeted, and citation-ready.
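Two of these properties are simple enough to sketch. The code below shows one plausible reading of sandwich ordering (alternate ranked passages between the front and the back, so the weakest land in the middle) and of skip marking (insert `[…]` wherever chunk indices are not consecutive). Both functions are hypothetical illustrations; the exact rules POMA applies are not specified on this page.

```python
def sandwich_order(ranked: list[str]) -> list[str]:
    """Place items ranked best-first so the best sit at both ends."""
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


def with_skip_marks(chunks: list[tuple[int, str]]) -> list[str]:
    """Join (index, text) chunks of one document, marking gaps with […]."""
    out: list[str] = []
    prev = None
    for index, text in sorted(chunks):
        if prev is not None and index != prev + 1:
            out.append("[…]")
        out.append(text)
        prev = index
    return out


ordered = sandwich_order(["r1", "r2", "r3", "r4", "r5"])
marked = with_skip_marks([(3, "a"), (4, "b"), (9, "c")])
print(ordered)
print(marked)
```

Here `r1` and `r2`, the two strongest passages, end up first and last, while `r5` sits in the middle where recall is weakest.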

What POMA doesn't do

POMA's surface ends at "you have prompt-ready context." It does not:

Call the LLM for you. You prompt, you stream, you handle errors.
Run agent loops. POMA returns text; agent logic is yours.
Manage your vector store (PrimeCut workflow). Grill manages a namespace internally; PrimeCut hands you chunks and you embed.
Render citations in the UI. The context block has doc id, title, and pages attributes — your UI maps those to whatever citation style you want.
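Since citation rendering is left to you, a small parser is usually the first thing to write. The sketch below pulls the `id`, `title`, and `pages` attributes out of each `<doc …>` tag with regular expressions. It assumes the attribute syntax matches the example context block on this page (double-quoted values, no escaped quotes); `doc_citations` is a hypothetical helper, not part of any POMA package.

```python
import re

DOC_TAG = re.compile(r"<doc\s+([^>]*)>")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def doc_citations(context: str) -> list[dict[str, str]]:
    """Return one attribute dict per <doc …> tag found in the context."""
    return [dict(ATTR.findall(m.group(1))) for m in DOC_TAG.finditer(context)]


context = """<query>How did operating margin change year over year?</query>
<doc id="annual-report-2025" title="Annual Report 2025" pages="84">
  [p84]
  ## Operating margin
</doc>"""

citations = doc_citations(context)
print(citations)
```

Regular expressions are used instead of an XML parser because the block mixes XML tags with Markdown and has no single root element.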

This scope is deliberate. Every layer above retrieval is shaped by your choice of LLM provider and by your product surface. POMA's job is to give you the cleanest possible input to that layer.

Where to go next

If you want to…	Go here
Ship a managed-RAG endpoint in 10 minutes	Grill quickstart
Run your own retrieval stack with POMA chunks	PrimeCut quickstart
Drive POMA from an MCP-aware agent	MCP servers overview
Drive POMA from the shell	CLI
Understand chunking strategies more deeply	Learn / Chunking
Read the long-form ingestion guide	Document ingestion + chunking for RAG
Read the architectural deep dive	The internal `Explain_concept.md` (covers chunks, chunksets, archive format, retrieval modes, and the three implementation paths)

Reference glossary

A one-line definition for each term you'll see in the rest of the docs.

Term	Definition
Chunk	The smallest typed unit. A sentence or paragraph with `depth`, `to_embed`, `page`, and optional `image_name` / `table` / `code` references.
Chunkset	A root-to-leaf path through the document's hierarchy. The retrieval unit.
`.poma` archive	A ZIP file bundling `chunks.json`, `chunksets.json`, image renders, page renders, and metadata. The portable on-disk format.
Cheatsheet	A per-document context block assembled at query time from relevant chunksets. PrimeCut term.
RetrievalContext	Grill's equivalent of a cheatsheet: a single prompt-ready string returned by `POST /grill/search`.
Sandwich order	Placement strategy where the most-relevant passages sit at the top and bottom of the context block — LLMs recall those positions best.
`job_id`	An ingestion job's identifier. Polled via `/jobs/{job_id}/status`.
`doc_id`	The indexed document's stable identifier, available once a Grill ingest reaches `done`. Equal to the `job_id` of the ingest.
Project (Grill)	A namespace for ingested documents and vectors. Created with `POST /projects` (`product: "grill"`). Has its own API key (prefix `poma_prod_gr_…`).
Eco vs Pro	Two ingestion modes. Pro uses the full pipeline; Eco trades some quality for cost. See Eco ingestion.
