
What is Context Engineering? Why It Starts Before the Prompt

Context engineering has become one of the most important concepts in applied AI. This article explains why it matters, how it saves teams money and improves accuracy, and how you can get started. As Google DeepMind researcher Philipp Schmid put it, "The new skill in AI is not prompting, it's context engineering."

We have to be honest: this trend is validating for us. When we first started building POMA PrimeCut, we had to spend a lot of time explaining what a "context engine" was, and even more time explaining why it was useful. People liked the idea of reducing token usage and reducing hallucinations, but they tended to imagine a more trustworthy ChatGPT.

Luckily, we're in more enlightened times now.

As LLMs move from chatbots into production systems — like RAG pipelines, autonomous agents, and enterprise search — the question is no longer "which model should I use?" but "what context is the model working with?"

The term gained traction through voices like Andrej Karpathy, founder of Eureka Labs, who described prompt engineering as evolving into something broader: the engineering of the entire information environment that surrounds a model call. Anthropic defined it as "the art and science of ensuring that models have exactly the right information at the right time." LangChain, Gartner, and others have written extensively about it from the perspective of agent architectures and prompt construction.

Their framing is correct as far as it goes. But it does not go far enough.

The dominant narrative treats context engineering as a problem that begins at the prompt layer — how you select, arrange, and inject information into a context window. This misses the more fundamental problem: where does that information come from, and what shape is it in when it arrives?

TL;DR

Context engineering is a full-stack discipline. It starts at the document, not the prompt. If your documents are poorly parsed, naively chunked, and stripped of structural meaning before retrieval, no amount of prompt engineering will recover what was lost.

Why Context Engineering Matters Now

Three converging trends have made context engineering a first-class engineering discipline.

Context windows are large but not infinite. Claude and Gemini claim to support 1M tokens. But "support" does not mean "use well." Research consistently shows that model performance degrades in the middle of long contexts — the "lost in the middle" problem. Throwing more tokens at a model is not a strategy; it is a liability. Every irrelevant token dilutes the tokens that matter.

RAG is now the default architecture for grounding LLMs in private or current data. But RAG quality is bottlenecked by retrieval quality, and retrieval quality is bottlenecked by how documents were chunked and indexed. A bad chunking strategy poisons every downstream step.

Agents amplify context mistakes. Agentic systems — where an LLM makes sequential decisions, calls tools, and builds up context across turns — are extremely sensitive to context quality. An irrelevant chunk retrieved in step 2 can cascade into a hallucinated tool call in step 5. When your system has autonomy, the cost of bad context is non-linear.

Context Engineering vs. Prompt Engineering

This distinction matters because conflating the two leads teams to optimize the wrong layer.

Prompt engineering operates within a fixed information environment. You have a set of retrieved documents, a conversation history, and a system prompt. The engineering challenge is how to arrange these elements — what instructions to give, what format to request, how to phrase the task. Prompt engineering is important and, for simple applications, may even be sufficient.

Context engineering is a broader concept that operates on the information environment itself. It asks: what should be in the context window, and how should it get there?

| Concern | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Scope | The instruction text | The entire information pipeline |
| Primary question | "How do I phrase this?" | "What should the model see?" |
| Failure mode | Model misunderstands the task | Model lacks the information to do the task |
| Operates on | Static context | Dynamic context construction |
| Includes document processing | No | Yes |
| Includes retrieval design | Partially (query formulation) | Fully (ingestion, chunking, indexing, ranking) |

Prompt engineering is the last mile of context engineering. But most RAG failures happen in the first mile — at the ingestion and chunking layer — where decisions are made long before a prompt is ever written.

No Context Engineering, Mo' Problems: A Case Study

Consider a legal team building a RAG system to answer questions about commercial contracts. The corpus contains 10,000 PDFs. A user asks: "What are the termination rights under the Acme Corp MSA, including any modifications from subsequent amendments?"

Scenario A: Naive chunking. The PDF is split into 512-token chunks using a recursive text splitter. The termination clause spans chunks 47 and 48. Chunk 47 contains the beginning of Section 12.3 ("Termination for Convenience") but not its conditions, which overflow into chunk 48. Chunk 48 starts mid-sentence and contains the conditions but no section header. Neither chunk references the 2023 amendment that modified this section, because that amendment is a separate document whose chunks have no structural link to the original MSA.

The retrieval system returns chunk 47 because it matches "termination." The LLM receives a clause fragment without its conditions and without its amendments. It generates an answer that is fluent, confident — and wrong.

Scenario B: Hierarchy-preserving chunking. The PDF is parsed with document structure awareness. Section 12.3 is identified as a complete logical unit with its header, body, conditions, and subsections. The chunk preserves the section's position in the document hierarchy (Part IV → Section 12 → 12.3). Metadata links it to the amendment that modified this section.

The answer is accurate because the context was accurate. The difference between these scenarios is not in the prompt, the retrieval model, or the LLM. It is in the ingestion pipeline. The quality of context was determined before any prompt was written.

Context Engineering: The Full Stack

Context engineering is a pipeline with four layers. Each layer transforms information, and each can introduce or destroy signal.

Layer 1: Document Ingestion

Raw documents — PDFs, Word files, HTML, spreadsheets, slide decks — must be converted into machine-readable text while preserving meaningful structure. This is harder than it sounds.

A PDF is a visual format devised for printers, not a semantic one for computers, never mind LLMs. It encodes where glyphs appear on a page, not what a section header is or where a table begins. Most PDF parsers lose table structure, merge columns incorrectly, or silently drop content. A financial table with revenue figures becomes meaningless when converted to a flat string: "Revenue 2023 $4.2M 2024 $5.1M" tells the model nothing about which number belongs to which year.

Engineering decisions at this stage — parser selection, table handling, metadata extraction, format normalization — directly determine the ceiling for everything downstream.
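
To make the table problem concrete, here is a minimal sketch of table-aware extraction using pdfplumber, one of several open-source parsers that expose table structure. The file name and column layout are assumptions for illustration, not a recommendation of a specific tool:

```python
# Sketch: extract a table as rows instead of a flat string.
# Assumes a one-page PDF ("report.pdf") containing a simple revenue table.
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    flat_text = page.extract_text()   # "Revenue 2023 $4.2M 2024 $5.1M" -- structure lost
    table = page.extract_table()      # [["Year", "Revenue"], ["2023", "$4.2M"], ["2024", "$5.1M"]]

# Keep the table as structured rows (or render it as Markdown/JSON) so the
# year-to-figure mapping survives into the chunking layer.
rows = [dict(zip(table[0], r)) for r in table[1:]] if table else []
print(rows)  # [{"Year": "2023", "Revenue": "$4.2M"}, {"Year": "2024", "Revenue": "$5.1M"}]
```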

Layer 2: Chunking

Chunking is where the most consequential context engineering decisions are made, and where most RAG pipelines go wrong.

Fixed-size chunking splits text into N-token blocks with M-token overlap. It breaks sentences, paragraphs, tables, and logical sections arbitrarily. The overlap is a band-aid that creates redundancy without solving the structural problem.

Recursive character splitting (popularized by LangChain) tries paragraph boundaries first, then sentence boundaries, then character boundaries. Better than fixed-size, but still has no awareness of document structure. A section heading might end up at the bottom of one chunk instead of the top of the next.
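
For reference, the baseline most teams start from looks something like this. A minimal sketch; the file name and chunk sizes are arbitrary:

```python
# Sketch: recursive character splitting, the common baseline.
# It tries paragraph, then sentence, then character boundaries, but has no
# notion of sections, headings, or tables.
from langchain_text_splitters import RecursiveCharacterTextSplitter

contract_text = open("acme_msa.txt").read()  # parsed output from Layer 1 (example file)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters here; token-based variants exist
    chunk_overlap=64,   # the overlap "band-aid" described above
)
chunks = splitter.split_text(contract_text)
# A heading like "12.3 Termination for Convenience" can still end up at the
# bottom of one chunk while its conditions start the next.
```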

Semantic chunking uses embedding similarity to detect topic shifts. It is clever but expensive and brittle, because similarity thresholds are hard to tune and domain-dependent. And the core problem remains: you still have to cut somewhere, even if that cut point is chosen more cleverly.

Hierarchical or structure-aware chunking preserves the document's own organization — headings, sections, subsections, tables, lists — as the primary chunking signal. Instead of imposing an external splitting strategy on the text, it uses the structure that the document author already created.
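
A minimal sketch of the idea, using Markdown-style headings as the structural signal. The heading levels and the output shape are illustrative assumptions, not any particular product's format:

```python
# Sketch: heading-aware chunking that keeps each section's place in the
# document hierarchy instead of cutting at a fixed size.
import re

def chunk_by_headings(markdown_text: str) -> list[dict]:
    chunks, path = [], []
    current = {"section_path": [], "text": ""}
    for line in markdown_text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            if current["text"].strip():
                chunks.append(current)
            level, title = len(m.group(1)), m.group(2)
            path = path[: level - 1] + [title]  # e.g. ["Part IV", "Section 12", "12.3 Termination"]
            current = {"section_path": list(path), "text": ""}
        else:
            current["text"] += line + "\n"
    if current["text"].strip():
        chunks.append(current)
    return chunks

# Each chunk now carries its ancestry, so "12.3 Termination" can be retrieved
# together with the context of Section 12 and Part IV.
```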

This is the approach that PrimeCut takes. Rather than producing flat chunks, PrimeCut generates "chunksets" — hierarchical retrieval units that preserve parent-child relationships between sections, maintain table integrity, and carry structural metadata. When you retrieve a chunk about "Section 3.2 conditions," you also get the context of what Section 3 is about and where it sits in the document. In benchmarks, this achieves 100% context recall while using 77% fewer tokens than traditional chunking — because you retrieve precisely the structure you need instead of overlapping text blocks.

Layer 3: Retrieval and Ranking

Chunks are embedded into vector space and indexed for search. This layer has received the most attention from the AI community, and for good reason — the engineering frontier is moving rapidly beyond single-vector similarity search.

But retrieval quality is bounded by chunk quality. A retrieval system can only return what exists in the index. If the index is full of decontextualized fragments because the chunking strategy stripped structural information, even a perfect retrieval system will return incomplete answers.

Hybrid retrieval combines dense vector search (for semantic similarity) with sparse keyword search (BM25, for exact matches). This covers both the "what does this mean" and "does this contain this exact term" retrieval modes. Production systems that rely on semantic search alone miss queries that depend on specific identifiers, clause numbers, or proper nouns.
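
The two result lists are often merged with reciprocal rank fusion. A sketch, where the document IDs are placeholders and k=60 is the commonly used smoothing constant:

```python
# Sketch: reciprocal rank fusion (RRF) over a dense and a sparse result list.
# Each input is a list of doc IDs ordered best-first, e.g. from your vector
# store and from BM25.
def rrf(dense_ranked: list[str], sparse_ranked: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a clause number that only BM25 finds still surfaces near the top.
print(rrf(["doc_12", "doc_7", "doc_3"], ["doc_98", "doc_12", "doc_44"]))
```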

Reranking applies a cross-encoder model to re-score the top-K results from initial retrieval. Cross-encoders like Cohere Rerank or bge-reranker-v2 are more accurate than bi-encoders but too expensive to run over the full index, so they serve as a second-pass precision filter.
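
A minimal second-pass rerank with sentence-transformers might look like the sketch below. The model choice and candidate passages are examples:

```python
# Sketch: rerank the top-K retrieved chunks with a cross-encoder.
# Scoring (query, passage) pairs jointly is more accurate than bi-encoder
# similarity, but too slow to run over the full index.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # model name is an example

query = "What are the termination rights under the Acme Corp MSA?"
candidates = [
    "12.3 Termination for Convenience. Either party may terminate upon ...",
    "4.1 Fees. Client shall pay the fees set out in Exhibit B ...",
]  # in practice: the top 50-100 chunks from hybrid retrieval

scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
top = reranked[:5]  # only the best few go into the context window
```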

Query transformation rewrites the user's query to improve retrieval. Techniques include HyDE (hypothetical document embeddings), query decomposition, and step-back prompting (asking a more general version of the query first to establish context).
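
As one example, HyDE can be sketched like this. The `llm` and `embed` arguments are placeholders for whatever generation and embedding clients you use; the prompt wording is an assumption:

```python
# Sketch: HyDE (hypothetical document embeddings).
# Instead of embedding the raw question, embed a hypothetical answer to it,
# which tends to land closer to real answer passages in vector space.
def hyde_query_vector(question: str, llm, embed) -> list[float]:
    hypothetical = llm(
        f"Write a short passage that plausibly answers the question:\n{question}"
    )
    return embed(hypothetical)

# The resulting vector is used for the nearest-neighbour search in place of
# (or alongside) the embedding of the original question.
```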

Contextual retrieval (introduced by Anthropic) prepends each chunk with a short LLM-generated summary of where it sits in the document before embedding. This gives the embedding model more context about the chunk's role, improving retrieval precision significantly. Notably, this technique works best when chunks already have good structural boundaries — if your chunking strategy already preserves section paths and heading hierarchy, you get much of this benefit without the additional LLM call at indexing time.
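
In sketch form, using the structure-aware chunks from the chunking layer above; the wording of the prefix is illustrative:

```python
# Sketch: give the embedding model the chunk's place in the document.
# With structure-aware chunks the prefix can be built from metadata alone;
# Anthropic's variant generates it with an LLM at indexing time instead.
def contextualize(chunk: dict, doc_title: str) -> str:
    breadcrumb = " > ".join(chunk["section_path"])
    return f"[{doc_title} | {breadcrumb}]\n{chunk['text']}"

text_to_embed = contextualize(
    {"section_path": ["Part IV", "Section 12", "12.3 Termination"],
     "text": "Either party may terminate upon thirty (30) days' notice ..."},
    doc_title="Acme Corp MSA (2021)",
)
```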

Layer 4: Context Assembly and Window Management

Retrieved chunks must be assembled into a coherent context window alongside system instructions, conversation history, and tool outputs. This is where context engineering meets prompt engineering.

Ordering matters. Place the most relevant information at the beginning and end of the context, not in the middle. The "lost in the middle" effect is real and well-documented — models attend more strongly to the edges of their input.
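
One simple way to respect this is to alternate the ranked chunks between the front and the back of the assembled context, as in the sketch below (the interleaving scheme is one option among several):

```python
# Sketch: place the best-ranked chunks at the edges of the context,
# pushing the weakest ones toward the middle where attention is lowest.
def edge_order(chunks_best_first: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # strongest at the start and at the end

print(edge_order(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']  -- r1 and r2 sit at the edges
```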

Token budgeting. Decide how to allocate your context window deliberately: how many tokens for system instructions? For retrieved context? For conversation history? For the model's response? Most teams don't think about this explicitly and end up with context windows dominated by conversation history, leaving insufficient room for retrieved documents. A general guideline: reserve 20–30% for instructions and output space, and use the rest for retrieved content prioritized by relevance.
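
A sketch of an explicit budget, counting tokens with tiktoken. The window size and the exact split are illustrative and mirror the guideline above:

```python
# Sketch: spend the context window deliberately instead of by accident.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
count = lambda text: len(enc.encode(text))

WINDOW = 128_000  # example window size
budget = {
    "system_and_output": int(WINDOW * 0.25),  # ~20-30% reserved
    "history": int(WINDOW * 0.15),
    "retrieved": int(WINDOW * 0.60),
}

def fit_retrieved(chunks_best_first: list[str], limit: int) -> list[str]:
    kept, used = [], 0
    for chunk in chunks_best_first:
        n = count(chunk)
        if used + n > limit:
            break
        kept.append(chunk)
        used += n
    return kept
```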

Deduplication. If you used overlap-based chunking, your retrieved chunks may contain repeated text. This wastes tokens and can confuse the model. Structure-aware chunking avoids this because chunks nest instead of overlapping.

Citation formatting. If you need the model to cite its sources, include source metadata (document name, page number, section) in a format the model can reference. This is trivially easy when chunks carry structural metadata; painful when they don't.

For multi-turn conversations and agentic systems, context accumulates over time. Managing it requires conversation summarization (compress older turns to free tokens for fresh retrievals), sliding windows (keep recent turns verbatim, older turns as summaries), and dynamic re-retrieval (re-retrieve relevant chunks each turn rather than carrying forward stale context).
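
Sketched below with a placeholder `summarize` call standing in for an LLM compression step; the turn limit and helper are assumptions:

```python
# Sketch: sliding window over conversation history.
# Recent turns stay verbatim; older turns are compressed into one summary,
# and relevant chunks are re-retrieved fresh on every turn.
def build_history(turns: list[str], summarize, keep_verbatim: int = 6) -> list[str]:
    if len(turns) <= keep_verbatim:
        return list(turns)
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    return ["Summary of earlier conversation: " + summarize(older)] + recent
```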

Context Engineering's Compounding Effects

Context engineering decisions compound. Good ingestion produces clean, structured text. Good structure enables structure-aware chunking. Good chunks produce better embeddings. Better embeddings improve retrieval. Better retrieval means the model sees more relevant and less irrelevant information.

The net effect is that the model generates better answers, hallucinates less, and cites more accurately.

The reverse is also true. Bad ingestion produces garbled text. Fixed-size chunking breaks structure. Broken chunks produce noisy embeddings. Noisy embeddings return irrelevant results. The model hallucinates to fill the gaps. And you blame the model — when the real problem was in the first stage of the pipeline.

Practical Checklist: Auditing Your RAG System's Context Engineering

A good context engineering audit is one of our personal favorite rabbit holes to go down, and there's quite a lot to be said on the topic. For readers who prefer the boxes-to-check version:

Ingestion

  • Are tables preserved as structured data, not linearized text?
  • Are section headings extracted and linked to their content?
  • Are page numbers, document metadata, and section paths preserved?
  • Does your parser handle your actual document formats correctly? (Test with real documents, not toy examples.)

Chunking

  • Do your chunks respect document structure (sections, paragraphs, tables)?
  • Can you retrieve a chunk and understand its context without reading the surrounding chunks?
  • Are you carrying unnecessary token overhead from overlap-based strategies?
  • Do chunks carry metadata (section path, document name, page number)?

Retrieval

  • Are you using hybrid retrieval (dense + sparse)?
  • Have you tested reranking on your actual queries?
  • Are you measuring retrieval recall — not just end-to-end answer quality?

Assembly

  • Do you have explicit token budgets for each context component?
  • Are retrieved chunks ordered by relevance, with the most relevant at the edges?
  • Are you deduplicating retrieved content?

Where the Context Engineering Field Is Heading

Ingestion-time intelligence. More teams are investing in smarter ingestion — using vision models to parse documents, LLMs to generate chunk summaries, and structure-aware algorithms to preserve hierarchy. The upfront cost is justified by the downstream quality improvement.

Smaller, better contexts over larger, worse ones. The trend is toward precision over volume. Instead of retrieving 20 chunks and hoping the model finds the answer, retrieve 3 chunks that contain exactly the answer. This requires better chunking and retrieval, but produces faster, cheaper, and more accurate results. PrimeCut's 77% token reduction with 100% context recall is an example of this direction. You don't need more tokens — you need the right tokens.

Standardized evaluation. Frameworks for evaluating context quality (not just answer quality) are emerging. Metrics like context precision, context recall, and context relevance (as defined by RAGAS) let you measure and optimize each pipeline stage independently.

Context engineering as a distinct discipline. Just as "data engineering" emerged as distinct from "data science," context engineering is emerging as distinct from ML engineering. The skills are different: parsing, chunking, and retrieval optimization are closer to information retrieval and systems engineering than to model training.

Summary

Context engineering is the discipline of designing every stage of the pipeline that determines what an LLM sees. It starts at document ingestion, runs through chunking and retrieval, and ends at context assembly. Prompt engineering is the final layer, but the ingestion and chunking layers are where context quality is fundamentally determined.

If your RAG system is underperforming, don't start by rewriting your prompt. Start by looking at your chunks. Open ten retrieved results for a representative query and read them. Are they coherent? Do they contain the answer? Do they carry enough context to be understood in isolation? If not, no prompt will save you. Fix the context, and the model will follow.

Ready to try structure-preserving context engineering?
