Concepts / Overview

POMA AI is a context engine for retrieval-augmented generation (RAG). It turns documents into prompt-ready context for large language models, without you having to design a chunking strategy, run a vector store, or hand-write retrieval glue.

This page is a high-level orientation. Each idea links down to a deeper concept page when you want detail.

The two-line summary

  1. Ingestion parses your document, walks its structure, and emits a hierarchy of typed units called chunks and chunksets.
  2. Retrieval finds the chunksets relevant to a query and renders them as a single, prompt-ready text block — a cheatsheet (PrimeCut) or a RetrievalContext (Grill).

That is the entire mental model. Everything else is implementation detail.

What POMA gives you

POMA's pipeline takes a source file and produces three things:

source file  →  chunks      →  chunksets    →  prompt-ready context
                (units)        (paths)         (drop into an LLM)
| Artifact | What it is | Why it exists |
| --- | --- | --- |
| Chunk | The smallest typed unit: a sentence or paragraph with a depth integer, page number, and a to_embed field. | The atom of retrieval. |
| Chunkset | A root-to-leaf path through the document's hierarchy: a chunk together with the ancestor breadcrumbs that make it self-explanatory. | The retrieval unit. Sentences carry their structural context with them. |
| .poma archive | A ZIP file bundling the chunks, chunksets, images, page renders, and metadata. | The portable on-disk format. PrimeCut hands it back to you; Grill keeps it server-side. |
| Cheatsheet / RetrievalContext | A single string of XML + Markdown, deduplicated and budget-fit, that drops straight into an LLM prompt. | What the model actually sees at query time. |
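
To make the first two artifacts concrete, here is a minimal sketch of how a chunk and a chunkset could be modelled in Python. Only depth, page, and to_embed come from the table above; chunk_id, text, chunkset_id, chunk_ids, and the render_path helper are hypothetical names for illustration, and treating to_embed as a string is an assumption. The real schema may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """One typed unit. depth, page, and to_embed are named in the docs;
    chunk_id and text are hypothetical additions for this sketch."""
    chunk_id: int
    text: str
    depth: int
    page: Optional[int]
    to_embed: str  # assumed here to be the text that gets embedded


@dataclass
class Chunkset:
    """A root-to-leaf path, stored as chunk ids ordered root first
    (hypothetical layout)."""
    chunkset_id: int
    chunk_ids: List[int] = field(default_factory=list)


def render_path(chunkset: Chunkset, chunks_by_id: Dict[int, Chunk]) -> str:
    """Join the chunks on the path into one breadcrumb string."""
    return " → ".join(chunks_by_id[i].text for i in chunkset.chunk_ids)


chunks = [
    Chunk(0, "Employee Handbook", 0, 1, "Employee Handbook"),
    Chunk(1, "Benefits", 1, 12, "Benefits"),
    Chunk(2, "Health Insurance", 2, 12, "Health Insurance"),
]
chunks_by_id = {c.chunk_id: c for c in chunks}
path = Chunkset(chunkset_id=0, chunk_ids=[0, 1, 2])
print(render_path(path, chunks_by_id))
```

Storing ids rather than copies of the text keeps shared ancestors (the handbook title, the section header) in one place, which is one way to make later deduplication cheap.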

Full ingestion architecture: /sdk/concepts/ingestion and /grill/concepts/ingestion.

Why hierarchical chunking

Most RAG stacks chunk by cutting text into fixed-size windows, then embed each slice independently. That works for short documents and breaks for long ones — section headers get separated from their content, tables get shredded, and the LLM sees orphaned facts without their lineage.
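
The failure mode is easy to reproduce. The sketch below is a generic fixed-size splitter, not code from any particular RAG library; the sample text and the window size of 40 characters are chosen only to make the split visible.

```python
from typing import List


def fixed_windows(text: str, size: int) -> List[str]:
    """Cut text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


document = (
    "Health Insurance\n"
    "Employees are eligible for the standard plan after 30 days."
)

windows = fixed_windows(document, size=40)
for number, window in enumerate(windows):
    print(number, repr(window))

# Only the first window still carries the section header.
with_header = [w for w in windows if "Health Insurance" in w]
print(len(with_header), "of", len(windows), "windows contain the header")
```

The second window holds the tail of the sentence with no hint of which section it came from, which is the "orphaned fact" problem described above.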

POMA never cuts text at arbitrary boundaries. Instead, it preserves the document's hierarchy and groups sentences into chunksets, which are complete root-to-leaf paths.

A chunkset for an HR-policy section looks like:

Employee Handbook → Benefits → Health Insurance →
  "Employees are eligible for the standard plan after 30 days.
   Open enrollment runs annually from November 1 to November 15."

The model reads the leaf sentence and every breadcrumb that gives it meaning. No "section header lost to a different chunk" failure mode, no overlap inflation, no token budget wasted on duplicate context.
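
One way to picture how such paths fall out of a depth-annotated document is a single pass with an ancestor stack. This is an illustrative sketch, not POMA's actual assembly algorithm: the (depth, text) input shape and the function name are assumptions, the "Retirement" branch is invented sample data, and the sketch emits one path per leaf rather than grouping sibling sentences.

```python
from typing import List, Tuple


def root_to_leaf_paths(units: List[Tuple[int, str]]) -> List[List[str]]:
    """Return one root-to-leaf path per leaf, given (depth, text) pairs in document order."""
    paths: List[List[str]] = []
    stack: List[Tuple[int, str]] = []  # current ancestors, shallowest first
    for index, (depth, text) in enumerate(units):
        # Anything at the same depth or deeper is not an ancestor of this unit.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, text))
        next_depth = units[index + 1][0] if index + 1 < len(units) else -1
        if next_depth <= depth:  # nothing nested below, so this unit is a leaf
            paths.append([t for _, t in stack])
    return paths


units = [
    (0, "Employee Handbook"),
    (1, "Benefits"),
    (2, "Health Insurance"),
    (3, "Employees are eligible for the standard plan after 30 days."),
    (2, "Retirement"),                     # invented sample data
    (3, "Matching starts in year two."),   # invented sample data
]

paths = root_to_leaf_paths(units)
for p in paths:
    print(" → ".join(p))
```

Each printed line carries its full lineage, so either one can be embedded or shown to a model on its own.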

Deeper read: /learn/chunking/chunksets and the long-form Ultimate guide to RAG chunking strategies.

Cheatsheets — the retrieval primitive

Hierarchical chunking only pays off if the retrieval side knows the structure exists. Plug a hierarchical chunker into a flat retrieval pipeline and you keep the indexing overhead while losing the structural signal at query time.

POMA's answer is the cheatsheet: a per-document artifact assembled at query time. The retrieval layer can stay simple: it only returns the chunksets relevant to a query. generate_cheatsheets(...) then reassembles them along the document's lineage, deduplicates and orders them, and fits the result to a token budget as a single prompt-ready block.

The LLM sees structure in the text, not in the index. No parent-document retrieval, no metadata filters, no hierarchy-aware reranker required.
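
The three steps named above (deduplicate, order, fit to a budget) can be sketched in a few lines. This is not the implementation of generate_cheatsheets(...), whose signature and internals are not shown on this page. The function name assemble_context, the input shapes, and the whitespace-based token count are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def assemble_context(
    hits: List[Tuple[float, List[int]]],  # (relevance score, chunk ids root first)
    texts: Dict[int, str],                # chunk id -> text; ids follow document order
    max_tokens: int,
) -> str:
    """Merge retrieved paths into one block: dedupe, respect a budget, restore document order."""
    kept: List[int] = []
    used = 0
    # Spend the budget on the most relevant paths first.
    for _, path in sorted(hits, key=lambda hit: hit[0], reverse=True):
        new_ids = [i for i in path if i not in kept]  # shared ancestors count once
        cost = sum(len(texts[i].split()) for i in new_ids)  # crude whitespace token count
        if used + cost > max_tokens:
            continue  # this path does not fit; a cheaper one might
        kept.extend(new_ids)
        used += cost
    return "\n".join(texts[i] for i in sorted(kept))


texts = {
    0: "Employee Handbook",
    1: "Benefits",
    2: "Health Insurance",
    3: "Employees are eligible for the standard plan after 30 days.",
    4: "Open enrollment runs annually from November 1 to November 15.",
}
hits = [(0.9, [0, 1, 2, 3]), (0.7, [0, 1, 2, 4])]

print(assemble_context(hits, texts, max_tokens=30))
```

Both hits share the same three ancestors, so the second path costs only its leaf sentence. That is the saving the text refers to as avoiding duplicate context.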

Deeper read: /sdk/concepts/cheatsheets.

The two products, side by side

POMA ships the same pipeline behind two product surfaces:

| | Grill | PrimeCut |
| --- | --- | --- |
| What it returns | A prompt-ready RetrievalContext string (XML + Markdown) | A .poma archive with raw chunks, chunksets, images, and metadata |
| Who runs retrieval | POMA (server-side: hybrid search, sandwich ordering, token budgeting) | You (use the chunks however you like: Qdrant, LangChain, LlamaIndex, your own stack) |
| You need a vector store | No | Yes (for embeddings) |
| API version | v3 (project-scoped) | v2 + v3 (account-scoped) |
| Best for | "I want a managed RAG endpoint that returns context for my prompt." | "I need raw chunks because I run my own retrieval stack." |
| Use it from | Grill product docs, Grill in the SDK, poma-grill-mcp, or the hosted MCP endpoint | PrimeCut product page, PrimeCut in the SDK, poma-mcp, or the CLI |

Both are powered by the same ingestion pipeline. The difference is what happens after the chunks are produced — Grill indexes them into a project namespace and serves search results; PrimeCut hands the .poma archive back to you.
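
Because a .poma archive is a ZIP file, Python's standard zipfile module is enough to inspect one. The sketch below assumes that chunks.json (a member name taken from the glossary on this page) sits at the archive root and holds a JSON list of objects. Both points are assumptions, and the in-memory archive is a stand-in built only so the example runs without a real file.

```python
import io
import json
import zipfile


def load_archive_json(archive, member: str):
    """Read one JSON member from a .poma archive (path or file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as handle:
            return json.load(handle)


# Stand-in archive; a real one would come back from PrimeCut.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    # The record layout here is hypothetical.
    zf.writestr("chunks.json", json.dumps([{"depth": 0, "to_embed": "Employee Handbook"}]))
buffer.seek(0)

chunks = load_archive_json(buffer, "chunks.json")
print(len(chunks), "chunk(s); first depth:", chunks[0]["depth"])
```

With a file on disk, the same helper takes a path instead of a buffer, for example load_archive_json("report.poma", "chunks.json"), where the file name is a placeholder.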

The standard lifecycle

Every POMA workflow follows the same three stages:

1. Ingest

You submit a source file (PDF, DOCX, HTML, Markdown, image, …). The server runs conversion → indentation → chunking → chunkset assembly → archiving. The job is asynchronous — you get a job_id back and poll until status: done.
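
The poll-until-done step can be sketched independently of any HTTP client. In the version below, fetch_status is a hypothetical callable that you would implement against the job status endpoint. The only status value taken from this page is "done"; the "processing" value in the demo is made up, and failure states are left out because they are not described here.

```python
import time
from typing import Callable


def wait_until_done(
    fetch_status: Callable[[str], str],  # hypothetical: returns the job's status string
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> None:
    """Poll until the job reports "done"; raise TimeoutError if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_status(job_id) == "done":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not done after {timeout_s} s")
        time.sleep(interval_s)


# Demo with a fake status source instead of a network call.
responses = iter(["processing", "processing", "done"])
wait_until_done(lambda job_id: next(responses), "job-123", interval_s=0.0)
print("job-123 is done")
```

A production client would also want to stop early on whatever failure status the API reports, and might back off between polls rather than use a fixed interval.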

2. Retrieve

You issue a query and get back the chunksets that match. With Grill, that is a single POST /grill/search call that returns a prompt-ready context block. With PrimeCut, you embed the chunksets yourself (in Qdrant, pgvector, …) and run retrieval in your own stack; POMA's generate_cheatsheets(...) helper then assembles the results into a context block.
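
For the PrimeCut path, the retrieval step in the middle is yours to build. The sketch below shows its shape with a toy stand-in: word counts replace a real embedding model, and a sorted list replaces a vector store such as Qdrant or pgvector. The function names, the sample chunkset texts, and the "Retirement" entry are invented for illustration.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real stack would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunkset_texts: List[str], k: int) -> List[int]:
    """Return the indices of the k chunksets most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), index) for index, text in enumerate(chunkset_texts)]
    return [index for _, index in sorted(scored, reverse=True)[:k]]


chunkset_texts = [
    "Employee Handbook Benefits Health Insurance "
    "Employees are eligible for the standard plan after 30 days.",
    "Employee Handbook Benefits Retirement Matching starts in year two.",
]

print(top_k("when are employees eligible for health insurance", chunkset_texts, k=1))
```

Because each chunkset text already contains its breadcrumbs, the section names ("Health Insurance") contribute to the match even when the leaf sentence alone would not mention them.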

3. Prompt

You drop the resulting cheatsheet or RetrievalContext into your LLM call. POMA's output is structured XML + Markdown:

```xml
<query>How did operating margin change year over year?</query>
<doc id="annual-report-2025" title="Annual Report 2025" pages="84">
  ## Operating margin

  Operating margin rose from **18.4%** in FY24 to **21.1%** in FY25, …

  <gap pages="3" />

  Cost-of-goods-sold improvements contributed roughly 1.6 pts …
</doc>
```

The block is sandwich-ordered (most relevant passages at top and bottom for best LLM recall), gap-marked, token-budgeted, and citation-ready.
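
Sandwich ordering is simple to express on its own. The sketch below is one possible reading of the idea, not POMA's actual ordering logic: given passages ranked best-first, it alternates them between the front and the back so the strongest land at both ends and the weakest in the middle. The function name and the placeholder passage labels are assumptions.

```python
from typing import List, Sequence


def sandwich_order(ranked: Sequence[str]) -> List[str]:
    """Arrange items ranked best-first so that the best sit at both ends."""
    front: List[str] = []
    back: List[str] = []
    for rank, item in enumerate(ranked):
        # Even ranks fill from the top down, odd ranks from the bottom up.
        (front if rank % 2 == 0 else back).append(item)
    return front + back[::-1]


print(sandwich_order(["p1", "p2", "p3", "p4", "p5"]))
# → ['p1', 'p3', 'p5', 'p4', 'p2']
```

Here p1 and p2, the two most relevant passages, occupy the first and last positions, while p5, the least relevant, ends up in the middle.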

What POMA doesn't do

POMA's surface ends at "you have prompt-ready context." It does not:

  • Call the LLM for you. You prompt, you stream, you handle errors.
  • Run agent loops. POMA returns text; agent logic is yours.
  • Manage your vector store (PrimeCut workflow). Grill manages a namespace internally; PrimeCut hands you chunks and you embed.
  • Render citations in the UI. The context block has doc id, title, and pages attributes — your UI maps those to whatever citation style you want.

This scope is deliberate. Every layer above retrieval depends heavily on the LLM provider and the product surface, so POMA leaves those choices to you. Its job is to give you the cleanest possible input to that layer.
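
As an example of the citation mapping left to your UI, the sketch below pulls the id, title, and pages attributes out of a context block. It uses regular expressions rather than an XML parser because the block mixes several top-level elements with Markdown and may not parse as a single XML document. The function name and the output format are illustrative choices, and attribute values containing escaped quotes are not handled.

```python
import re
from typing import Dict, List

DOC_TAG = re.compile(r"<doc\b([^>]*)>")       # opening <doc ...> tags only
ATTRIBUTE = re.compile(r'([\w-]+)="([^"]*)"')  # name="value" pairs


def doc_citations(context: str) -> List[Dict[str, str]]:
    """Collect the attributes of every <doc ...> opening tag in a context block."""
    return [dict(ATTRIBUTE.findall(m.group(1))) for m in DOC_TAG.finditer(context)]


context = """<query>How did operating margin change year over year?</query>
<doc id="annual-report-2025" title="Annual Report 2025" pages="84">
  ## Operating margin
</doc>"""

for doc in doc_citations(context):
    print(f'{doc["title"]} ({doc["id"]}), pages: {doc["pages"]}')
```

From there, rendering footnotes, inline links, or hover cards is a UI decision that POMA does not make for you.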

Where to go next

| If you want to… | Go here |
| --- | --- |
| Ship a managed-RAG endpoint in 10 minutes | Grill quickstart |
| Run your own retrieval stack with POMA chunks | PrimeCut quickstart |
| Drive POMA from an MCP-aware agent | MCP servers overview |
| Drive POMA from the shell | CLI |
| Understand chunking strategies more deeply | Learn / Chunking |
| Read the long-form ingestion guide | Document ingestion + chunking for RAG |
| Read the architectural deep dive | The internal Explain_concept.md (covers chunks, chunksets, archive format, retrieval modes, and the three implementation paths) |

Reference glossary

A one-line definition for each term you'll see in the rest of the docs.

| Term | Definition |
| --- | --- |
| Chunk | The smallest typed unit. A sentence or paragraph with depth, to_embed, page, and optional image_name / table / code references. |
| Chunkset | A root-to-leaf path through the document's hierarchy. The retrieval unit. |
| .poma archive | A ZIP file bundling chunks.json, chunksets.json, image renders, page renders, and metadata. The portable on-disk format. |
| Cheatsheet | A per-document context block assembled at query time from relevant chunksets. PrimeCut term. |
| RetrievalContext | Grill's equivalent of a cheatsheet: a single prompt-ready string returned by POST /grill/search. |
| Sandwich order | Placement strategy where the most-relevant passages sit at the top and bottom of the context block; LLMs recall those positions best. |
| job_id | An ingestion job's identifier. Polled via /jobs/{job_id}/status. |
| doc_id | After a Grill ingest reaches done, the indexed document's stable identifier. Equal to the job_id of the ingest. |
| Project (Grill) | A namespace for ingested documents and vectors. Created with POST /projects (product: "grill"). Has its own API key (prefix poma_prod_gr_…). |
| Eco vs Pro | Two ingestion modes. Pro uses the full pipeline; Eco trades some quality for cost. See Eco ingestion. |