Context Engineering: The Skill That Replaced Prompt Hacking
Managing the context window is the real craft now. What to put in, retrieval vs stuffing, ordering, caching, compaction, token budgets, and multi-turn memory.
A few years ago the highest-leverage skill in building with language models was wording. You learned the magic phrases, the "think step by step," the role-play framings, the threats and bribes that nudged a stubborn model into compliance. That era is mostly over. Models got better at following plain instructions, and context windows got enormous. What decides output quality now is not how cleverly you phrase the ask. It is what you choose to put in front of the model, in what order, and what you leave out. That is context engineering, and it is the craft that quietly replaced prompt hacking.
The context window is the model's entire working memory for a single call. Everything it knows in that moment, your instructions, the conversation so far, retrieved documents, tool results, lives in that window and nowhere else. The window is finite and it is not free. Managing it deliberately is the whole game.
What to actually put in
The instinct, especially now that windows are huge, is to throw everything in and let the model sort it out. That instinct is wrong, and learning why was the thing that most improved my results.
A model does not pay equal attention to everything in its window. Bury the three sentences that matter under thirty pages of loosely related background and the answer gets worse, not better, even though technically all the information was "there." More context is not more signal; past a point it is more noise. So the job is curation. For each call I ask: what is the minimum set of things the model genuinely needs to answer this well? That set goes in. The rest stays out.
This reframes the work. You are not writing a prompt so much as assembling a briefing. Instructions, the relevant facts, a couple of examples if the task is fuzzy, and the actual question. Tight and relevant beats sprawling and complete almost every time.
Retrieval versus stuffing
The clearest place this shows up is how you handle a large body of knowledge, a documentation set, a product catalog, a year of support tickets.
Stuffing means dumping the whole corpus into the context and asking your question at the end. Retrieval means keeping the corpus outside the model, fetching only the few passages relevant to the current question, and putting just those in the window. Retrieval almost always wins. It keeps the context small and high-signal, it costs less per call, and it scales past whatever the window size happens to be. A million-token window does not change this calculus; it just raises the ceiling on how much you can afford to be sloppy before it bites.
There is real nuance in when long context beats retrieval and when fine-tuning beats both, and I work through that tradeoff in depth in RAG vs fine-tuning vs long context. The short version for this guide: default to retrieving the relevant slice, not stuffing the whole thing, and reserve giant-context stuffing for when the task genuinely needs the model to reason across an entire document at once.
Ordering and recency
Where something sits in the context matters, not just whether it is present.
The two strongest positions are the beginning and the end. Stable, foundational material, your system instructions, the durable rules, goes near the front. The most immediately relevant material, the actual question and the freshest, most pertinent facts, goes near the end, closest to where the model starts generating. The vast, less-critical middle is where attention is weakest, so it is the worst place to hide something load-bearing.
Recency is a real effect. The model leans on what is most recent and most proximate to the answer. So if there is one fact the response absolutely must use, do not leave it stranded in the middle of a long block; pull it close to the end. Ordering is a lever, and it costs nothing to pull.
Caching: pay for stable context once
Here is where context engineering and cost engineering meet. Most applications send a large chunk of identical context on every single call: the same system prompt, the same tool definitions, the same reference preamble. Reprocessing that from scratch every time is wasteful.
Prompt caching fixes it. The model processes a stable prefix of your prompt once, stores the result, and reuses it on later calls that share that exact prefix. The repeated portion comes back cheaper and faster; you pay full price only for what is new.
Caching is a prefix match, and that one fact dictates how you structure everything. Any change anywhere in the prefix invalidates the cache from that point on. So the architecture is: stable content first, volatile content last. The frozen system prompt and the deterministic tool list go at the front, behind the cache boundary. The per-request stuff, the user's new question, a timestamp, a session id, goes after it. The classic mistake is interpolating something like "current time: 14:32:07" into the top of your system prompt; that one moving value changes the prefix every call and silently destroys your cache. Keep the front of the context byte-for-byte identical across calls and caching mostly takes care of itself.
Compaction and summarization
A different problem shows up in long conversations and long agent runs: the context keeps growing. Every turn, every tool call and its result, piles on. Eventually you approach the window limit, and now you have to decide what to do with the accumulated history.
Two moves handle this. Compaction (or summarization) condenses the older part of the conversation into a short summary that preserves the thread, then continues from there. The model keeps the gist of what happened without carrying every verbatim word. Clearing is more surgical: you drop stale content that is no longer relevant, like old tool results you have already acted on, while keeping the conversation's structure intact. Summarizing replaces; clearing prunes. Long-running agents often use both, plus caching, together.
The thing to internalize is that you do not have to carry the entire history forever just because the conversation is long. You carry the live, relevant slice and a condensed memory of the rest. A four-hour agent run does not mean a four-hour context; with compaction it means a small working window plus a running summary.
Token budgets
All of this rolls up into a budget. The window is finite, and every token you spend on one thing is a token you cannot spend on another. So I think of the context as a budget to allocate: this much for instructions, this much for retrieved context, this much for conversation history, this much headroom reserved for the model's actual answer. Run out of headroom and the response truncates mid-thought.
Counting tokens is part of the job, and you count them with the model's own tokenizer, not an estimate borrowed from a different model, because the counts differ enough to matter. When a request is too big, the answer is not to silently truncate the input and hope. It is to retrieve less, compact the history, or split the work. Budgeting is what turns "we ran out of context" from a production incident into a design decision you made on purpose.
Multi-turn memory
The last piece is the difference between context and memory, and it trips people up.
Context is what is in the window for one call. It evaporates the instant the call returns. Memory is what survives across calls and sessions. The model itself does not remember your last conversation; anything it "remembers" is something your system stored outside the model and chose to pull back into the context when relevant.
So building a system that feels like it remembers is a context engineering job. You decide what is worth persisting, a user's preferences, key facts from earlier, decisions already made, you store it somewhere durable, and on a new request you retrieve the relevant bits and place them into the window. The model gets handed the right memories at the right moment and behaves as though it never forgot. Done well, this is invisible. Done poorly, you either drown every call in stale history or the assistant has amnesia. The middle path, store outside, retrieve selectively, place deliberately, is context engineering in its purest form.
That is the whole discipline. Not magic words, but deliberate decisions about what the model sees, where it sees it, what it costs, and what carries over. The clever-phrasing era was fun, but this is the one that actually scales. For the wording layer that still matters on top of all this, see prompt engineering for production.
FAQ
What is context engineering? It's the practice of deciding what goes into the model's context window on each call, in what order, and what gets left out. As windows grew huge, this replaced clever prompt wording as the skill that decides output quality.
If the context window is a million tokens, why not just stuff everything in? Because relevance beats volume. Burying the few things that matter under a mountain of marginally-related text degrades the answer and costs more. Retrieve and curate; don't dump.
What is prompt caching and when does it help? Caching reuses an already-processed prefix of your prompt so repeated calls are cheaper and faster. It helps when many requests share a large stable preamble, like a fixed system prompt or tool set.
How do I stop a long conversation from overflowing the context window? Summarize or compact older turns into a condensed form, or clear stale tool results you no longer need. The goal is to keep the live, relevant context small while preserving the thread.
What's the difference between context and memory? Context is what's in the window for a single call and vanishes after it. Memory is what persists across calls or sessions, usually stored outside the model and pulled back in when relevant.
Use the free, no-API prompt generators to put it into practice.
Multimodal AI: A Builder's Guide to Vision, Images, and Audio
What multimodal models actually do, where they earn their keep, and how to ship vision and structured extraction in production without surprises.
GuidePrompting Claude vs GPT: What Actually Differs
The prompting habits that carry between Claude and GPT, the ones that don't, and how each family wants to be steered in production.
GuideHow to Cut Your LLM Costs (Without Cutting Quality)
Prompt caching, batching, model routing, leaner context, output caps — the levers that drop your AI bill without touching output quality.