Concepts / Overview
POMA AI is a context engine for retrieval-augmented generation (RAG). It turns documents into prompt-ready context for large language models, without you having to design a chunking strategy, run a vector store, or hand-write retrieval glue.
This page is a high-level orientation. Each idea links down to a deeper concept page when you want detail.
The two-line summary
- Ingestion parses your document, walks its structure, and emits a hierarchy of typed units called chunks and chunksets.
- Retrieval finds the chunksets relevant to a query and renders them as a single, prompt-ready text block — a cheatsheet (PrimeCut) or a RetrievalContext (Grill).
That is the entire mental model. Everything else is implementation detail.
What POMA gives you
POMA's pipeline takes a source file and produces three things:
source file → chunks (units) → chunksets (paths) → prompt-ready context (drop into an LLM)

| Artifact | What it is | Why it exists |
|---|---|---|
| Chunk | The smallest typed unit: a sentence or paragraph with a depth integer, page number, and a to_embed field. | The atom of retrieval. |
| Chunkset | A root-to-leaf path through the document's hierarchy — a chunk together with the ancestor breadcrumbs that make it self-explanatory. | The retrieval unit. Sentences carry their structural context with them. |
| .poma archive | A ZIP file bundling the chunks, chunksets, images, page renders, and metadata. | The portable on-disk format. PrimeCut hands it back to you; Grill keeps it server-side. |
| Cheatsheet / RetrievalContext | A single string of XML + Markdown, deduplicated and budget-fit, that drops straight into an LLM prompt. | What the model actually sees at query time. |
Full ingestion architecture: /sdk/concepts/ingestion and /grill/concepts/ingestion.
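To make the artifact table concrete, here is a minimal sketch of how chunks and chunksets could be modelled in client code. The class names, the `index` field, the `chunk_indices` field, and the page numbers are assumptions made for illustration; only `depth`, `to_embed`, and the page number come from the table, and the real archive schema may differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    """Hypothetical mirror of one chunk record."""
    index: int                  # assumed: position in document order
    depth: int                  # nesting level in the hierarchy (0 = root)
    to_embed: str               # text intended for embedding
    page: Optional[int] = None  # source page, when known


@dataclass(frozen=True)
class Chunkset:
    """Hypothetical root-to-leaf path, stored as chunk indices."""
    chunk_indices: Tuple[int, ...]


def render_path(chunkset: Chunkset, chunks: Dict[int, Chunk]) -> str:
    """Join the breadcrumbs and the leaf into one readable line."""
    return " → ".join(chunks[i].to_embed for i in chunkset.chunk_indices)


# Made-up sample data; page numbers are illustrative only.
chunks = {
    0: Chunk(0, 0, "Employee Handbook", page=1),
    1: Chunk(1, 1, "Benefits", page=12),
    2: Chunk(2, 2, "Health Insurance", page=12),
    3: Chunk(3, 3, "Employees are eligible for the standard plan after 30 days.", page=12),
}
path = Chunkset(chunk_indices=(0, 1, 2, 3))
print(render_path(path, chunks))
```

Storing a chunkset as indices rather than copied text keeps shared ancestors in one place, which is one plausible reason a path-based unit stays cheap to deduplicate later.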
Why hierarchical chunking
Most RAG stacks chunk by cutting text into fixed-size windows, then embed each slice independently. That works for short documents and breaks for long ones — section headers get separated from their content, tables get shredded, and the LLM sees orphaned facts without their lineage.
POMA never cuts text. Instead, it preserves the document's hierarchy and groups sentences into chunksets — complete root-to-leaf paths.
A chunkset for an HR-policy section looks like:
Employee Handbook → Benefits → Health Insurance →
"Employees are eligible for the standard plan after 30 days.
Open enrollment runs annually from November 1 to November 15."

The model reads the leaf sentence and every breadcrumb that gives it meaning. No "section header lost to a different chunk" failure mode, no overlap inflation, no token budget wasted on duplicate context.
Deeper read: /learn/chunking/chunksets and the long-form Ultimate guide to RAG chunking strategies.
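The idea of a root-to-leaf path can be sketched in a few lines. The function below is not POMA's chunker: it assumes the document has already been reduced to `(depth, text)` pairs in reading order, and it emits one path per leaf, whereas POMA may group sibling sentences differently. The function name and the second sample section are made up for illustration.

```python
from typing import List, Tuple


def root_to_leaf_paths(units: List[Tuple[int, str]]) -> List[Tuple[str, ...]]:
    """Return one ancestor path per leaf for depth-annotated units."""
    paths = []
    stack: List[str] = []  # current ancestors, shallowest first
    for pos, (depth, text) in enumerate(units):
        del stack[depth:]  # drop anything at this depth or deeper
        stack.append(text)
        # A unit is a leaf when the next unit is not nested beneath it.
        is_leaf = pos + 1 == len(units) or units[pos + 1][0] <= depth
        if is_leaf:
            paths.append(tuple(stack))
    return paths


units = [
    (0, "Employee Handbook"),
    (1, "Benefits"),
    (2, "Health Insurance"),
    (3, "Employees are eligible for the standard plan after 30 days."),
    (3, "Open enrollment runs annually from November 1 to November 15."),
    (1, "Conduct"),       # made-up section
    (2, "Be kind."),      # made-up leaf
]
for path in root_to_leaf_paths(units):
    print(" → ".join(path))
```

No text is cut at any point: every sentence is emitted whole, and the only thing that varies between paths is which ancestors travel with it.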
Cheatsheets — the retrieval primitive
Hierarchical chunking only pays off if the retrieval side knows the structure exists. Plug a hierarchical chunker into a flat retrieval pipeline and you keep the indexing overhead while losing the structural signal at query time.
POMA's answer is the cheatsheet: a per-document artifact assembled at query time. The retrieval layer can stay simple — it just returns chunksets relevant to a query. generate_cheatsheets(...) re-assembles them into the document's lineage, dedupes, orders, and budget-fits into a single prompt-ready block.
The LLM sees structure in the text, not in the index. No parent-document retrieval, no metadata filters, no hierarchy-aware reranker required.
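The dedupe, order, and budget-fit steps can be illustrated with a small stand-in. This is not `generate_cheatsheets(...)`, whose signature this page does not show; `assemble_context`, its parameters, and the word-count budget are assumptions, and a real implementation would count tokens rather than words.

```python
from typing import Dict, List, Tuple


def assemble_context(
    hits: List[Tuple[float, Tuple[int, ...]]],
    texts: Dict[int, str],
    budget_words: int,
) -> str:
    """Merge scored chunksets of one document into a single block."""
    selected = set()
    used = 0
    # Consider the best-scoring chunksets first.
    for _score, indices in sorted(hits, key=lambda h: h[0], reverse=True):
        new = [i for i in indices if i not in selected]  # dedupe shared ancestors
        cost = sum(len(texts[i].split()) for i in new)
        if used + cost > budget_words:
            continue  # skip chunksets that would overflow the budget
        selected.update(new)
        used += cost
    # Restore document order so the lineage reads top to bottom.
    return "\n".join(texts[i] for i in sorted(selected))


texts = {
    0: "Handbook",
    1: "Benefits",
    2: "Plan starts after 30 days.",
    3: "Enrollment runs in November.",
}
hits = [(0.9, (0, 1, 2)), (0.7, (0, 1, 3))]
print(assemble_context(hits, texts, budget_words=20))
```

Because both hits share the ancestors `0` and `1`, those breadcrumbs appear once in the output, which is the effect the paragraph above describes.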
Deeper read: /sdk/concepts/cheatsheets.
The two products, side by side
POMA ships the same pipeline behind two product surfaces:
| | Grill | PrimeCut |
|---|---|---|
| What it returns | A prompt-ready RetrievalContext string (XML + Markdown) | A .poma archive with raw chunks, chunksets, images, and metadata |
| Who runs retrieval | POMA (server-side: hybrid search, sandwich ordering, token budgeting) | You (use the chunks however you like — Qdrant, LangChain, LlamaIndex, your own stack) |
| You need a vector store | No | Yes (for embeddings) |
| API version | v3 (project-scoped) | v2 + v3 (account-scoped) |
| Best for | "I want a managed RAG endpoint that returns context for my prompt." | "I need raw chunks because I run my own retrieval stack." |
| Use it from | Grill product docs, Grill in the SDK, poma-grill-mcp, or the hosted MCP endpoint | PrimeCut product page, PrimeCut in the SDK, poma-mcp, or the CLI |
Both are powered by the same ingestion pipeline. The difference is what happens after the chunks are produced — Grill indexes them into a project namespace and serves search results; PrimeCut hands the .poma archive back to you.
The standard lifecycle
Every POMA workflow follows the same three stages:
1. Ingest
You submit a source file (PDF, DOCX, HTML, Markdown, image, …). The server runs conversion → indentation → chunking → chunkset assembly → archiving. The job is asynchronous — you get a job_id back and poll until status: done.
- Grill: /grill/concepts/ingestion (also see Quickstart)
- PrimeCut: /sdk/concepts/ingestion (also see Quickstart)
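The submit-then-poll pattern described above can be sketched independently of any client library. The helper below takes a caller-supplied `fetch_status` function instead of calling POMA directly, so no endpoint or SDK method is assumed. The `"failed"` status and the intermediate statuses in the demo are assumptions; only `done` is confirmed by this page.

```python
import time
from typing import Callable


def wait_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it reports "done", or give up."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "done":
            return status
        if status == "failed":  # assumed failure value, not confirmed by this page
            raise RuntimeError("ingestion job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("ingestion job did not finish in time")
        sleep(interval_s)


# Stand-in for a real status call: replays made-up statuses.
fake_statuses = iter(["pending", "processing", "done"])
result = wait_until_done(lambda: next(fake_statuses), sleep=lambda s: None)
print(result)
```

In a real integration, `fetch_status` would wrap whatever call returns the job's status for your `job_id`; injecting `sleep` keeps the helper easy to test without waiting.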
2. Retrieve
You issue a query and POMA returns the chunksets that match. With Grill, that's a single POST /grill/search call returning a prompt-ready context block. With PrimeCut, you embed the chunksets yourself (in Qdrant, pgvector, …) and run retrieval — POMA's generate_cheatsheets(...) helper assembles the result back into a context block at the end.
- Grill: /grill/concepts/retrieval and RetrievalContext format
- PrimeCut: /sdk/concepts/cheatsheets and the integration walkthroughs at Qdrant / LangChain / LlamaIndex
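For the PrimeCut path, "run retrieval yourself" boils down to scoring chunkset text against the query and keeping the best matches. The sketch below uses a toy bag-of-words similarity from the standard library as a stand-in for a real embedding model and vector store; `embed`, `cosine`, and `top_k` are hypothetical helpers, not part of any POMA package.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real stack would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunksets: List[Tuple[int, str]], k: int = 2) -> List[Tuple[float, int]]:
    """Return (score, chunkset_id) pairs, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), cs_id) for cs_id, text in chunksets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunksets = [
    (0, "Employee Handbook Benefits Health Insurance "
        "Employees are eligible for the standard plan after 30 days."),
    (1, "Employee Handbook Benefits Health Insurance "
        "Open enrollment runs annually from November 1 to November 15."),
]
print(top_k("When does open enrollment run?", chunksets, k=1))
```

The chunkset ids returned here are what you would pass on to the assembly step; swapping `embed` and the linear scan for a vector database changes the scale, not the shape, of the flow.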
3. Prompt
You drop the resulting cheatsheet or RetrievalContext into your LLM call. POMA's output is structured XML + Markdown:
<query>How did operating margin change year over year?</query>
<doc id="annual-report-2025" title="Annual Report 2025" pages="84">
## Operating margin
Operating margin rose from **18.4%** in FY24 to **21.1%** in FY25, …
<gap pages="3" />
Cost-of-goods-sold improvements contributed roughly 1.6 pts …
</doc>

The block is sandwich-ordered (most relevant passages at top and bottom for best LLM recall), gap-marked, token-budgeted, and citation-ready.
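Sandwich ordering is easiest to see on a ranked list. This page does not state the exact rule POMA applies, so the function below is one common way to get the effect, under the assumption that the input is sorted from most to least relevant: items are dealt alternately to the front and the back, which leaves the weakest passages in the middle.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sandwich_order(ranked: Sequence[T]) -> List[T]:
    """Place the strongest items at both ends, the weakest in the middle."""
    front: List[T] = []
    back: List[T] = []
    for pos, item in enumerate(ranked):
        # Even ranks fill the top; odd ranks fill the bottom.
        (front if pos % 2 == 0 else back).append(item)
    return front + back[::-1]


# "A" is the most relevant passage, "E" the least.
print(sandwich_order(["A", "B", "C", "D", "E"]))
```

With five passages the two best ones, `A` and `B`, end up first and last, and `E` sits in the centre.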
What POMA doesn't do
POMA's surface ends at "you have prompt-ready context." It does not:
- Call the LLM for you. You prompt, you stream, you handle errors.
- Run agent loops. POMA returns text; agent logic is yours.
- Manage your vector store (PrimeCut workflow). Grill manages a namespace internally; PrimeCut hands you chunks and you embed.
- Render citations in the UI. The context block has doc id, title, and pages attributes; your UI maps those to whatever citation style you want.
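Since citation rendering is left to your UI, a small helper that pulls the `doc` attributes out of a context block is often the first thing to write. The sketch below uses regular expressions rather than an XML parser, on the assumption that a block mixing XML and Markdown is not guaranteed to be well-formed XML. `list_sources` is a hypothetical helper, and the sample block is shortened from the example earlier on this page.

```python
import re
from typing import Dict, List

DOC_TAG = re.compile(r"<doc\b([^>]*)>")       # opening <doc ...> tags only
ATTRIBUTE = re.compile(r'([\w-]+)="([^"]*)"')  # name="value" pairs


def list_sources(context: str) -> List[Dict[str, str]]:
    """Return the attributes of every <doc> element, in order of appearance."""
    return [dict(ATTRIBUTE.findall(m.group(1))) for m in DOC_TAG.finditer(context)]


context = """<query>How did operating margin change year over year?</query>
<doc id="annual-report-2025" title="Annual Report 2025" pages="84">
## Operating margin
<gap pages="3" />
</doc>"""

for source in list_sources(context):
    print(f'{source["title"]} [{source["id"]}]')
```

The helper returns attribute values as plain strings and leaves their interpretation, including what `pages` should mean in your citation style, to the calling code.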
This is a deliberate scope. Every layer above retrieval is opinionated by the LLM provider and the product surface. POMA's job is to give you the cleanest possible input to that layer.
Where to go next
| If you want to… | Go here |
|---|---|
| Ship a managed-RAG endpoint in 10 minutes | Grill quickstart |
| Run your own retrieval stack with POMA chunks | PrimeCut quickstart |
| Drive POMA from an MCP-aware agent | MCP servers overview |
| Drive POMA from the shell | CLI |
| Understand chunking strategies more deeply | Learn / Chunking |
| Read the long-form ingestion guide | Document ingestion + chunking for RAG |
| Read the architectural deep dive | The internal Explain_concept.md (covers chunks, chunksets, archive format, retrieval modes, and the three implementation paths) |
Reference glossary
A one-line definition for each term you'll see in the rest of the docs.
| Term | Definition |
|---|---|
| Chunk | The smallest typed unit. A sentence or paragraph with depth, to_embed, page, and optional image_name / table / code references. |
| Chunkset | A root-to-leaf path through the document's hierarchy. The retrieval unit. |
| .poma archive | A ZIP file bundling chunks.json, chunksets.json, image renders, page renders, and metadata. The portable on-disk format. |
| Cheatsheet | A per-document context block assembled at query time from relevant chunksets. PrimeCut term. |
| RetrievalContext | Grill's equivalent of a cheatsheet: a single prompt-ready string returned by POST /grill/search. |
| Sandwich order | Placement strategy where the most-relevant passages sit at the top and bottom of the context block — LLMs recall those positions best. |
| job_id | An ingestion job's identifier. Polled via /jobs/{job_id}/status. |
| doc_id | The indexed document's stable identifier, available once a Grill ingest reaches done. Equal to the job_id of the ingest. |
| Project (Grill) | A namespace for ingested documents and vectors. Created with POST /projects (product: "grill"). Has its own API key (prefix poma_prod_gr_…). |
| Eco vs Pro | Two ingestion modes. Pro uses the full pipeline; Eco trades some quality for cost. See Eco ingestion. |