The `RetrievalContext` Format

Grill's search endpoints return a single field — context — containing a prompt-ready block of XML + Markdown. The shape is opinionated on purpose: it encodes everything an LLM needs to use the retrieved passages well, and everything you need to surface citations to the end user, without you assembling a prompt yourself.

This page documents the grammar, the ordering algorithm, and how to consume it on the client side.

The shape, end to end

```xml
<context>
<doc id="annual-report-2025" title="Annual Report 2025" pages="3-5">
  [p3]
  ## Operating margin

  Operating margin rose from **18.4%** in FY24 to **21.1%** in FY25 …

  […]

  [p5]
  Cost-of-goods-sold improvements contributed roughly 1.6 pts to the lift …
</doc>
<doc id="cfo-q4-call" title="CFO Q4 earnings call" pages="14">
  [p14]
  > "We expect operating margin to expand again in FY26, supported by …"
</doc>
</context>
```

Three things are happening here:

1. The <context> wrapper encloses the whole block: a single root element you can hand to a parser or drop into a prompt as-is.
2. <doc> wrappers make per-document boundaries explicit so the model can attribute statements correctly. Markdown content (headings, lists, blockquotes) is preserved from the original chunking so the model gets the same structural cues a reader would.
3. Inline markers: [pN] tags the start of content from page N of the source; […] is inserted between two non-consecutive chunks of the same document. Together they let the model attribute by page and know when chunks were skipped between passages.

Top-level structure

```text
CONTEXT  ::= "<context>" (DOC)* "</context>"
DOC      ::= "<doc id=\"…\" title=\"…\" pages=\"…\">" MARKDOWN "</doc>"
```

When nothing matches, the block is an empty <context></context>.

Attributes you can rely on:

| Attribute | Meaning | Always present? |
| --- | --- | --- |
| `id` on `<doc>` | The same `doc_id` you see in `DocInfo.doc_id` | Yes |
| `title` on `<doc>` | Document title (from PrimeCut metadata) | When known |
| `pages` on `<doc>` | The page range covered by the included passages, e.g. `"3-5"` or `"14"` | When the source has page numbers |
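As a rough illustration of the grammar above, the block can be split into per-document records with two regular expressions. This is a minimal sketch, not part of any Grill SDK: the `Doc` dataclass and `parse_context` function are hypothetical names, and the patterns assume that attribute values contain no `>` or unescaped `"` and that no body contains a literal `</doc>`. A regex is used instead of an XML parser because the Markdown body may contain characters an XML parser would reject.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    title: Optional[str]  # None when the attribute is absent
    pages: Optional[str]  # None when the attribute is absent
    body: str             # Markdown body, inline markers included


# Assumption: attribute values never contain ">" and bodies never contain "</doc>".
DOC_RE = re.compile(r"<doc\b([^>]*)>(.*?)</doc>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_context(context: str) -> list[Doc]:
    """Split a rendered <context> block into one Doc per <doc> element."""
    docs = []
    for attrs_text, body in DOC_RE.findall(context):
        attrs = dict(ATTR_RE.findall(attrs_text))
        docs.append(
            Doc(
                doc_id=attrs["id"],
                title=attrs.get("title"),
                pages=attrs.get("pages"),
                body=body.strip("\n"),
            )
        )
    return docs
```

An empty `<context></context>` yields an empty list, so callers can treat "nothing matched" as the ordinary zero-document case.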

Sandwich ordering

Inside the context block, passages are placed in sandwich order rather than strict relevance order. Concretely: the highest-relevance passage goes near the top, the second-highest near the bottom, and the rest fill in toward the middle.

```text
position   relevance
─────────  ───────────
top        ★★★★★      ← highest
…          ★★★★
…          ★★★
middle     ★★          ← lowest
…          ★★★
…          ★★★★
bottom     ★★★★★       ← second-highest
```

Why? Many LLMs show a U-shaped recall curve over long contexts (the "lost in the middle" effect): content at the very start and the very end of the prompt is used more reliably than content in the middle. Sandwich ordering puts the most important passages where the model is most likely to use them, without you having to think about it.

If you need a different ordering for a downstream tool, you can re-parse the <doc> blocks and reorder yourself — but for a default LLM call, sandwich order will out-perform plain relevance order on long-context recall benchmarks.
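One simple way to produce the shape described in this section is to deal a relevance-ranked list alternately to the front and the back. The sketch below is only an illustration of that idea; `sandwich_order` is a hypothetical helper, and Grill's server-side implementation may differ in its details.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def sandwich_order(ranked: Sequence[T]) -> list[T]:
    """Arrange items ranked best-first so the best sit at the two ends.

    Even ranks (1st, 3rd, 5th, ...) fill the block from the top down,
    odd ranks (2nd, 4th, ...) fill it from the bottom up, so the
    lowest-ranked items end up in the middle.
    """
    front: list[T] = []
    back: list[T] = []
    for index, item in enumerate(ranked):
        (front if index % 2 == 0 else back).append(item)
    return front + back[::-1]


# Ranks 1..5, best first: rank 1 lands on top, rank 2 at the bottom.
print(sandwich_order([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```

If a downstream tool needs strict relevance order instead, note that the rendered block does not carry scores, so the original ranking can only be approximated by reversing this interleaving.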

Inline markers

Within each <doc> body, literal markers appear inline with the Markdown:

[pN] — page anchor. Marks the start of content from page N of the source. It is a property of the chunk content (set during ingestion), not something the context-assembly layer inserts. Use it for per-page citations alongside the doc-level pages attribute.
[…] — chunk-skip marker. Inserted between two chunks of the same document when their chunk indices are not consecutive — one or more chunks were skipped (they fell below the relevance threshold or were dropped by the token budget). Purely binary: no count, no span, no page reference. Tells the model "chunks were dropped here, do not assume continuity" — so it won't write transitions like "as stated above" that would imply a flow that isn't in the source. Adjacent chunks get a plain newline; the marker only appears for actual skips.
[IMAGE: name] — figure placeholder. Appears only when return_assets: true and the passage references a figure. It names the asset inline at the point the figure occurs; the image bytes themselves live in the response's assets map (base64 data URIs), keyed by doc_id. With return_assets off, figures simply don't appear in the context.

[…] is intra-document only — it never crosses document boundaries. Across-document transitions are marked by the <doc> element itself, not by […].

[pN] and […] are independent. A [pN] does not imply a chunk skip; a […] does not imply a page break. They co-occur only when a skip happens to land at a page boundary.
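For per-page citations on the client, a `<doc>` body can be split at its `[pN]` anchors. This is a minimal sketch with hypothetical helper names (`split_by_page`, `has_skip`); it assumes page anchors are always written as `[p` followed by digits and `]`, and it would also match that pattern if it happened to occur in the source text itself.

```python
import re

PAGE_RE = re.compile(r"\[p(\d+)\]")
SKIP_MARKER = "[…]"  # chunk-skip marker, written with a single ellipsis character


def split_by_page(body: str) -> list[tuple[int, str]]:
    """Return (page_number, text) pairs, one per [pN] anchor in the body.

    Text that precedes the first anchor has no page and is not returned.
    """
    # re.split with one capture group yields [before, page, text, page, text, ...]
    parts = PAGE_RE.split(body)
    return [
        (int(parts[i]), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]


def has_skip(text: str) -> bool:
    """True when the text contains at least one chunk-skip marker."""
    return SKIP_MARKER in text
```

Because the skip marker carries no count or span, `has_skip` can only report that continuity was broken, not how much was omitted.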

Token budgeting

Two budgets bound the size of the rendered context (not the number of hits). Grill measures the rendered length — including all wrapper tags — and admits the highest-ranked hits first, dropping the rest (preserving the sandwich shape):

target_tokens (default 5000) is the soft target — the typical answer size. Hits are admitted best-first up to this budget. It can never exceed max_tokens: if you set a max_tokens lower than it, the target is clamped down to match.
max_tokens (default 15000) is the hard ceiling. Grill expands past target_tokens toward it only when extra hits are nearly as relevant as the best one (tight multi-document clusters); single-answer queries stop at the target.

This means:

The block you receive is always within budget, even when the underlying corpus has many high-relevance passages.
Hits dropped for budget reasons are silently discarded — they do not appear with a "truncated" marker. Use min_relevance (precision) and target_tokens (size) to influence which hits survive.
The block can come back empty (<context></context>) if every hit fell below min_relevance or if the query found nothing in the namespace.
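The interaction between the two budgets can be pictured with a small greedy loop. This is an illustrative sketch only, not Grill's implementation: the `Hit` type, the `near_best` ratio of 0.9, and the choice to stop at the first hit that does not fit are all assumptions made for the example. Only the defaults of 5000 and 15000 and the clamping rule come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    relevance: float  # hypothetical score, higher is better
    tokens: int       # rendered length of the hit, wrapper tags included


def admit(hits, target_tokens=5000, max_tokens=15000, near_best=0.9):
    """Admit hits best-first under a soft target and a hard ceiling."""
    # The soft target can never exceed the hard ceiling.
    target = min(target_tokens, max_tokens)
    ranked = sorted(hits, key=lambda h: h.relevance, reverse=True)
    admitted, used = [], 0
    for hit in ranked:
        # Assumption: only hits nearly as relevant as the best one
        # may push the block past the soft target.
        is_near_best = hit.relevance >= near_best * ranked[0].relevance
        limit = max_tokens if is_near_best else target
        if used + hit.tokens > limit:
            break  # dropped silently, no "truncated" marker
        admitted.append(hit)
        used += hit.tokens
    return admitted
```

Under these assumptions a single strong hit followed by weak ones stops near the target, while a tight cluster of similarly relevant hits grows toward the ceiling.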

If you need predictable lengths, set target_tokens to a value comfortably below your model's context window minus your prompt's other content (system prompt, user message, response budget). Lowering max_tokens below target_tokens also works — the soft target is clamped down to the hard ceiling, so the response can never exceed max_tokens.
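The arithmetic behind that advice can be written down directly. The helper below is a hypothetical sketch: `plan_budgets` and its parameters are illustrative names, and it assumes you can estimate the token counts of your own prompt parts. It caps the hard ceiling at the room left in the model's window and then clamps the soft target to that ceiling.

```python
def plan_budgets(
    context_window: int,
    system_tokens: int,
    user_tokens: int,
    response_tokens: int,
    target_tokens: int = 5000,
    max_tokens: int = 15000,
) -> dict:
    """Pick target_tokens / max_tokens that fit the model's context window."""
    room = context_window - system_tokens - user_tokens - response_tokens
    if room <= 0:
        raise ValueError("no room left for retrieved context")
    # The hard ceiling must fit in the remaining room...
    max_tokens = min(max_tokens, room)
    # ...and the soft target can never exceed the hard ceiling.
    target_tokens = min(target_tokens, max_tokens)
    return {"target_tokens": target_tokens, "max_tokens": max_tokens}


# An 8192-token window with 1700 tokens reserved leaves 6492 for context.
print(plan_budgets(8192, 500, 200, 1000))
# {'target_tokens': 5000, 'max_tokens': 6492}
```

Token counts from your model's tokenizer may not match the way Grill measures rendered length, so leaving some extra headroom in `response_tokens` is a reasonable precaution.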

Surfacing citations

Because every passage is wrapped in <doc id=…>, you can extract citations with a one-pass regex over the response:

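```python
import re

# Matches the opening <doc …> tag; the title and pages groups are optional.
DOC_TAG = re.compile(
    r'<doc\s+id="([^"]+)"(?:\s+title="([^"]*)")?(?:\s+pages="([^"]*)")?'
)

def cite(context: str) -> list[dict]:
    """Return one citation dict per <doc> in the rendered context."""
    return [
        # title / pages are None when the attribute is absent
        {"doc_id": m.group(1), "title": m.group(2), "pages": m.group(3)}
        for m in DOC_TAG.finditer(context)
    ]
```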

Pair that with a system prompt like:

"Answer using ONLY the context. After each claim, cite the matching doc.id in square brackets."

…and you have a citation-grounded assistant without writing a reranker, a re-formatter, or a prompt assembler.
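A natural companion check is to verify that the ids the model cites actually occur in the context it was given. The sketch below assumes the model follows the prompt and writes citations as `[doc-id]`, and that doc ids contain no whitespace or square brackets; `check_citations` is a hypothetical helper, not part of Grill.

```python
import re

DOC_ID_RE = re.compile(r'<doc\s+id="([^"]+)"')
# Assumption: a citation is any bracketed token without spaces or brackets.
BRACKET_RE = re.compile(r"\[([^\[\]\s]+)\]")


def check_citations(answer: str, context: str) -> dict:
    """Split the ids cited in an answer into known and unknown ones."""
    known = set(DOC_ID_RE.findall(context))
    cited = set(BRACKET_RE.findall(answer))
    return {
        "valid": sorted(cited & known),
        "unknown": sorted(cited - known),  # possible hallucinated sources
    }


context = '<context>\n<doc id="annual-report-2025" pages="3-5">\n…\n</doc>\n</context>'
answer = "Margin rose to 21.1% [annual-report-2025]. Guidance is upbeat [cfo-q4-call]."
print(check_citations(answer, context))
# {'valid': ['annual-report-2025'], 'unknown': ['cfo-q4-call']}
```

Anything in `unknown` is worth flagging before the answer reaches the end user, since it names a document that was not in the retrieved context.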

What is not in the response

No raw scores (in the default prompt_ready block). Relevance scores are used server-side for filtering and ordering and then dropped from the rendered context. The set of returned passages is the contract; the underlying scores are not.
No chunk ids. The granularity you can address is the document, not the chunk. If you need per-chunk addressability, use PrimeCut and own the retrieval layer.

This is deliberate: the default contract guarantees prompt-readiness, not the internals of how that prompt was assembled.

If you do want structured hits

Set format: "json" on the search request and Grill returns structured ranked hits instead of the rendered prompt_ready block — useful when you're building your own prompt assembler or want to inspect the ranking. The default is prompt_ready (the XML+Markdown context string documented above); leave format unset to keep it.

Working with assets and page images

When you set return_assets: true, Grill returns the figures (and, where available, tables) belonging to the cited documents in a separate assets field on the response, keyed by doc_id. Inside the context string, each figure is referenced inline by an [IMAGE: name] marker (see Inline markers); the matching bytes live in assets[doc_id]. Images come back as base64 data URIs (not signed URLs), and assets are resolved at the document level — you get the figures from the documents the answer draws on, not a per-passage subset. The assets field is omitted when no cited document has any.

Use it when you need to render visual citations alongside the text. Skip it otherwise — base64 figures add payload size.
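To render a figure, the client needs the names referenced in the context and the decoded bytes of the matching data URI. The sketch below covers only those two generic steps; `image_names` and `decode_data_uri` are hypothetical helpers, and the inner layout of each `assets[doc_id]` entry is not specified here, so looking up the URI for a given name is left to the caller.

```python
import base64
import re

IMAGE_RE = re.compile(r"\[IMAGE: ([^\]]+)\]")


def image_names(context: str) -> list[str]:
    """List the figure names referenced by [IMAGE: name] markers, in order."""
    return IMAGE_RE.findall(context)


def decode_data_uri(uri: str) -> tuple[str, bytes]:
    """Decode a base64 data URI into (mime_type, raw_bytes)."""
    header, sep, payload = uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    # The media type is the first segment after "data:"; parameters are ignored.
    mime = header[len("data:"):].split(";")[0] or "text/plain"
    return mime, base64.b64decode(payload)
```

Decoding lazily, only for figures you actually display, keeps the cost of large base64 payloads limited to the transfer itself.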

Next

Retrieval: the request side, including min_relevance / max_tokens tuning.
Document management: keeping the namespace tidy.
