
Building With LLMs: An Operator's Field Guide

How I actually build with large language models: model tiers, prompting as spec, structured output, evals, guardrails, and what breaks in production.

By Matt Goren · Updated June 25, 2026 · 13 min read

Most people building with large language models treat the model like a feature you bolt on: add a chat box, call an API, ship it. That framing is why so many AI features feel like toys. The model is not a feature. It is a new kind of capability, closer to "a tireless junior employee who reads fast and never remembers anything" than to a function that returns a value. Once you build for what it actually is, everything downstream gets easier.

I build with these models every day. RunOctopus, the content engine I run, is thousands of orchestrated model calls reading a merchant's whole store, reasoning about what to write, judging its own output, and publishing. Along the way I have shipped plenty of things that broke in ways I did not predict. This is the field guide I wish someone had handed me: the mental model, the real decisions, and the parts that bite you in production.

The mental model: capabilities, not features

A traditional function is deterministic and narrow. You give it the same input, you get the same output, and it does exactly one thing. A language model is the opposite on every axis. It is probabilistic, absurdly general, and it will attempt literally anything you ask whether or not it should. That generality is the whole point, and it is also the trap.

The shift that makes you good at this is to stop asking "what feature do I want" and start asking "what capability am I renting, and what is the smallest reliable unit of work I can hand it." A summarizer, a classifier, an extractor, a router, a drafter. Each of those is a capability. You compose them into a product the same way you compose functions into a program, except each one is fuzzy at the edges and needs its own guardrails.

The corollary: do not hand the model a giant vague job and hope. "Write me a marketing campaign" is a product. "Given these five product descriptions, produce a JSON list of three headline candidates each under sixty characters" is a capability. The second one you can build, test, and trust. The first one you can only cross your fingers over. Almost all of production AI work is decomposing the impressive-sounding product into boring reliable capabilities.
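To make the headline capability concrete, here is a minimal sketch of the check that makes it testable. The `HeadlineSet` shape and the function name are my own illustration, not a fixed contract; the limits come straight from the example prompt above.

```typescript
// Hypothetical output contract for the headline capability described above.
type HeadlineSet = { product: string; headlines: string[] };

const MAX_HEADLINE_CHARS = 60; // "under sixty characters"
const HEADLINES_PER_PRODUCT = 3;

// Returns a list of problems; an empty list means the output met the contract.
function checkHeadlineOutput(raw: string, products: string[]): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["output is not valid JSON"];
  }
  if (!Array.isArray(parsed)) return ["output is not a JSON list"];

  const problems: string[] = [];
  const sets = parsed as HeadlineSet[];
  if (sets.length !== products.length) {
    problems.push(`expected ${products.length} entries, got ${sets.length}`);
  }
  for (const set of sets) {
    if (!Array.isArray(set?.headlines) || set.headlines.length !== HEADLINES_PER_PRODUCT) {
      problems.push(`${set?.product ?? "unknown"}: expected ${HEADLINES_PER_PRODUCT} headlines`);
      continue;
    }
    for (const h of set.headlines) {
      if (typeof h !== "string" || h.length >= MAX_HEADLINE_CHARS) {
        problems.push(`${set.product}: headline is not a string under ${MAX_HEADLINE_CHARS} characters`);
      }
    }
  }
  return problems;
}
```

A capability you can write this kind of check for is one you can put in a test suite. The vague version gives you nothing to assert on.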

Choosing a model tier

You do not pick one model and use it everywhere. You pick a model per call based on how hard the call is and how much a wrong answer costs. Anthropic's current lineup makes the tiers concrete: Haiku is the cheap, fast tier; Sonnet is the balanced middle; Opus is the frontier reasoning tier; and Fable sits above that for the most demanding long-horizon work. The names will keep changing. The tiering logic will not.

Here is the decision I actually make. Start at the cheapest tier that could plausibly do the task. Run it against an eval set. If it passes, you are done, and you just saved yourself a multiple on cost and latency. If it fails, look at how it fails before you reach for a bigger model. Half the time the failure is a bad prompt or missing context, not insufficient intelligence, and a bigger model just buys you a more expensively wrong answer.

Move up a tier when the task needs genuine multi-step reasoning, when it has to hold a long context coherently, when it is driving an agentic loop that compounds its own mistakes, or when the cost of being wrong is high enough that the price difference is noise. A classification call that runs ten thousand times a day belongs on the cheap tier. The one call that decides whether to publish to a customer's live store belongs on the frontier tier. Same product, different tiers, chosen deliberately.
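One way to keep that choice deliberate is to write it down as code. This is a sketch under my own assumptions: the `CallProfile` fields mirror the four triggers above, but which trigger maps to which tier is illustrative, and your eval results should overrule it.

```typescript
// Tier names follow the lineup above; the mapping rules are illustrative assumptions.
type Tier = "haiku" | "sonnet" | "opus";

interface CallProfile {
  needsMultiStepReasoning: boolean;
  longContext: boolean;
  drivesAgentLoop: boolean;
  wrongAnswerIsExpensive: boolean;
}

function pickTier(profile: CallProfile): Tier {
  // Expensive mistakes and compounding loops: the price difference is noise.
  if (profile.wrongAnswerIsExpensive || profile.drivesAgentLoop) return "opus";
  // Real reasoning or long context: step up from the cheapest tier.
  if (profile.needsMultiStepReasoning || profile.longContext) return "sonnet";
  // Everything else starts cheap until an eval says otherwise.
  return "haiku";
}

// The two calls from the paragraph above, in the same product.
const bulkClassification: CallProfile = {
  needsMultiStepReasoning: false,
  longContext: false,
  drivesAgentLoop: false,
  wrongAnswerIsExpensive: false,
};
const publishDecision: CallProfile = { ...bulkClassification, wrongAnswerIsExpensive: true };
```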

A practical pattern: use a cheap model for the fan-out and an expensive model for the judgment. Read a hundred documents with Haiku, then hand the distilled result to Opus for the decision that matters. You get most of the intelligence where it counts and pay cheap-tier prices for the bulk volume.
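A rough sketch of that pattern, assuming a `callModel` wrapper you supply yourself (it is hypothetical here, as are the prompt strings):

```typescript
// `CallModel` is a hypothetical wrapper around whatever client you use.
type CallModel = (tier: "cheap" | "frontier", prompt: string) => Promise<string>;

async function fanOutThenJudge(
  docs: string[],
  question: string,
  callModel: CallModel,
): Promise<string> {
  // Fan-out: one cheap call per document, run concurrently.
  const notes = await Promise.all(
    docs.map((doc) => callModel("cheap", `Summarize what this says about "${question}":\n${doc}`)),
  );
  // Judgment: a single frontier call over the distilled notes.
  const digest = notes.map((note, i) => `[${i + 1}] ${note}`).join("\n");
  return callModel("frontier", `Question: ${question}\nNotes:\n${digest}\nDecide and explain briefly.`);
}
```

However many documents you read, you pay for exactly one frontier call.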

Prompting is specification, not magic

Early on, prompting felt like superstition. People traded "tricks." With current models that era is mostly over, and good prompting now looks like good spec-writing. You are telling a competent, literal-minded worker exactly what you want.

The prompt that works in production states four things plainly: the task, the constraints, the output format, and the edge cases. What is the job. What must always or never be true. What shape should the answer take. And what should happen in the weird situations, because the weird situations are most of real traffic. If you have ever written a ticket detailed enough that a contractor could do the work without a follow-up call, you already know how to do this.
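If it helps to make the four parts hard to forget, you can encode them as a type. This is only a sketch: the `PromptSpec` name, the section labels, and the example content are my own invention, not a required format.

```typescript
interface PromptSpec {
  task: string;
  constraints: string[];
  outputFormat: string;
  edgeCases: string[];
}

// Renders the four parts in a fixed order so every prompt has the same skeleton.
function renderPrompt(spec: PromptSpec): string {
  const bullets = (items: string[]) => items.map((item) => `- ${item}`).join("\n");
  return [
    `Task:\n${spec.task}`,
    `Constraints:\n${bullets(spec.constraints)}`,
    `Output format:\n${spec.outputFormat}`,
    `Edge cases:\n${bullets(spec.edgeCases)}`,
  ].join("\n\n");
}

// Illustrative spec for a support-ticket classifier.
const ticketPrompt = renderPrompt({
  task: "Classify the support ticket into exactly one category.",
  constraints: ["Use only these categories: billing, shipping, other.", "Never invent a category."],
  outputFormat: "A single lowercase word with no punctuation.",
  edgeCases: ["If the ticket is empty or unreadable, answer: other.", "If two categories apply, pick the one mentioned first."],
});
```

The compiler will not let you ship a prompt that has no edge cases section, which is the part most people skip.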

Two things changed with newer models that trip people up. First, the aggressive cajoling that older models needed now backfires. Writing "CRITICAL: YOU MUST ALWAYS USE THE SEARCH TOOL" makes a current model overtrigger and reach for the tool when it should not. Dial it down to "use the search tool when the answer depends on current information." The model follows instructions closely now, so you do not have to shout. Second, examples beat adjectives. Showing one example of the output you want does more than three sentences describing it. If you care about format or tone, demonstrate it.

For the deeper version of this, including how I structure system prompts and keep them stable for caching, see prompt engineering for production. The one-line version: write the prompt like a spec, test it like code, and stop reaching for tricks.

Structured output and tool use

The single highest-leverage technique for making LLM features reliable is to stop parsing prose. If your code needs to act on the model's answer, do not ask for free text and then regex your way to the fields. Constrain the output to a schema.

Structured output lets you hand the model a JSON schema and get back data that validates against it. In the Anthropic SDK that is an output_config with a format, and the model is constrained to fill in exactly the shape you defined. Suddenly the model's answer is a typed object your program can trust instead of a paragraph you have to interpret.

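import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// Example input; in production this is the review you want classified.
const reviewText = "Arrived late, but the quality is excellent.";

const res = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  // Constrain the reply to this JSON schema instead of free text.
  output_config: {
    format: {
      type: "json_schema",
      schema: {
        type: "object",
        properties: {
          sentiment: { type: "string", enum: ["positive", "neutral", "negative"] },
          score: { type: "number" },
        },
        required: ["sentiment", "score"],
        // Reject any field the schema does not name.
        additionalProperties: false,
      },
    },
  },
  messages: [{ role: "user", content: reviewText }],
});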

Tool use is the same idea pointed outward. Instead of constraining the final answer, you give the model a set of functions it can call, each with a typed input schema, and let it decide when to call them. The model emits a structured request, your code runs the actual function, and you feed the result back. This is how the model reaches out of the chat box and into your database, your APIs, the real world. Design those tools with the same care you would design an internal API: clear names, tight schemas, and a description that says when to call it, not just what it does. The newer models reach for tools more conservatively, so a description that states the trigger condition makes them noticeably more likely to call the tool when they should.
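Here is a sketch of the "your code runs the actual function" half. The `name` / `description` / `input_schema` shape and the `tool_use` / `tool_result` blocks follow Anthropic's tool-use format as I understand it; the `lookup_order` tool, its data, and the `runTool` helper are made up for illustration.

```typescript
// Tool definition: note the description states WHEN to call it.
const lookupOrderTool = {
  name: "lookup_order",
  description:
    "Look up an order's status. Call this when the user asks about a specific order and gives an order ID.",
  input_schema: {
    type: "object",
    properties: { order_id: { type: "string" } },
    required: ["order_id"],
  },
};

type ToolUse = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type ToolResult = { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

// Local implementations keyed by tool name (hypothetical in-memory data).
const handlers: Record<string, (input: Record<string, unknown>) => string> = {
  lookup_order: (input) => (input["order_id"] === "A100" ? "shipped" : "not found"),
};

// Runs the function the model asked for and packages the result to send back.
function runTool(call: ToolUse): ToolResult {
  const handler = handlers[call.name];
  if (!handler) {
    // A tool name the model made up is reported back to it rather than thrown.
    return { type: "tool_result", tool_use_id: call.id, content: `unknown tool: ${call.name}`, is_error: true };
  }
  return { type: "tool_result", tool_use_id: call.id, content: handler(call.input) };
}
```

The model never touches your database. It only asks, and `runTool` decides what actually runs.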

Structured output and tool use together are what turn a chatbot into software. Most of the reliability I have in production comes from never letting the model speak in prose when it should be speaking in JSON.

Context engineering vs RAG vs fine-tuning

The model only knows what is in its training data and what you put in the prompt. Everything specific to your task, your user, your documents, has to get into that prompt somehow. There are three ways to do it, and they are not interchangeable. Reach for them in order of increasing cost and decreasing flexibility.

Context engineering is first and covers most cases. It just means assembling the right information into the prompt at request time: the relevant records, the user's history, the instructions, the few examples that pin down format. With million-token context windows on current models, you can fit a startling amount directly in the prompt. Often the answer to "how do I make the model know X" is simply "put X in the prompt," and people skip past that looking for something fancier.

RAG, retrieval-augmented generation, is a specific way to do context engineering when you have more knowledge than fits or than is relevant at once. You store your documents, retrieve the handful relevant to the current request, and inject those into the prompt. It is context engineering with a search step in front. You reach for it when your knowledge base is large and only a slice matters per query.
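The "search step in front" can be sketched in a few lines. The word-overlap scoring here is a deliberate toy that I chose to keep the example self-contained; a real system would more likely use embeddings or a search index, and the `<doc>` wrapper is just one way to delimit the injected text.

```typescript
// Lowercased unique words in a string.
function uniqueWords(text: string): string[] {
  return Array.from(new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []));
}

// Toy retriever: rank documents by how many query words they contain.
function retrieve(query: string, docs: string[], k: number): string[] {
  const queryWords = uniqueWords(query);
  return docs
    .map((doc) => {
      const docWords = uniqueWords(doc);
      const overlap = queryWords.filter((w) => docWords.indexOf(w) !== -1).length;
      return { doc, overlap };
    })
    .filter((scored) => scored.overlap > 0)
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, k)
    .map((scored) => scored.doc);
}

// Inject only the retrieved slice into the prompt.
function buildRagPrompt(query: string, docs: string[], k = 3): string {
  const context = retrieve(query, docs, k)
    .map((doc, i) => `<doc id="${i + 1}">${doc}</doc>`)
    .join("\n");
  return `Answer using only these documents:\n${context}\n\nQuestion: ${query}`;
}
```

Everything after `retrieve` is ordinary context engineering, which is the point: RAG changes how you choose what goes in the prompt, not how the model reads it.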

Fine-tuning is last and most people reach for it too early. It bakes behavior into the model's weights through training. It is powerful for teaching a consistent style or a narrow specialized skill, but it is expensive, slow to iterate, and it locks you in: every prompt change is now a retraining run, and you cannot easily update facts. As a rule, if context engineering or RAG can do it, do that instead. Fine-tuning is for when you have exhausted the cheaper options and have a stable, well-defined behavior worth committing to weights.

I go deep on this tradeoff, including when long context beats retrieval outright, in RAG vs fine-tuning vs long context. The headline: start by putting things in the prompt, add retrieval when the prompt overflows, and only fine-tune when you genuinely have to.

Evals: how you actually know it works

This is the part nobody wants to hear and the part that separates real systems from demos. You cannot tell whether your LLM feature works by trying it a few times and feeling good. Models are probabilistic; the three inputs you tried are not the thousand your users will throw at it. You need evals.

An eval set is a collection of real inputs paired with what a good outcome looks like. You run your prompt-and-model against the whole set and score the results. Sometimes the scoring is exact (did it return the right category), sometimes it is a check (does the JSON validate, is the summary under the length limit), and sometimes it is another model acting as a judge against a rubric you wrote. The point is that you get a number, and the number tells you whether a change made things better or worse.
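For the exact-match case, the whole harness is small. This is a sketch, not my production setup: `EvalCase`, `runEval`, and the `run` callback (your prompt plus model, wrapped however you like) are names I made up for the example.

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

interface EvalReport {
  passRate: number;
  failures: { input: string; expected: string; got: string }[];
}

// Runs every case through the system under test and scores by exact match.
async function runEval(
  cases: EvalCase[],
  run: (input: string) => Promise<string>,
): Promise<EvalReport> {
  const failures: EvalReport["failures"] = [];
  for (const c of cases) {
    const got = (await run(c.input)).trim();
    if (got !== c.expected) failures.push({ input: c.input, expected: c.expected, got });
  }
  const passRate = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { passRate, failures };
}
```

The `failures` list matters as much as the number. The pass rate tells you whether a change helped; the failures tell you what to fix next.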

This is the unlock that makes everything else safe. Want to drop from the frontier tier to a cheaper model to save money? Run the eval and find out if quality held. Want to rewrite the prompt? Run the eval. Want to refactor the whole pipeline? Run the eval. Without it, every change is a leap of faith and every "it seems fine" is a guess. I treat the eval set like a test suite, because that is exactly what it is. When I build the product properly, the judge that scores production output and the eval that gates changes enforce the same quality contract, wired in from the first pass rather than bolted on later.

Build the eval set early, from real failures as they happen. Every time the model does something wrong in production, that input goes into the eval set with the correct answer attached. Over time your eval becomes a precise map of your product's hard cases, and your model stops regressing on things it already got right.

Guardrails, latency, cost, and failure modes

Three things break in production, predictably, every time. Knowing them in advance is most of the battle.

Latency is first. The model generates token by token, so a long output is a slow output, and the tail is worse than the average. The fix is to stream: start showing the answer as it generates instead of making the user stare at a spinner for fifteen seconds. Streaming does not make the model faster, but it makes the product feel responsive, and for anything user-facing that is the difference between usable and abandoned. For large outputs you have to stream anyway, because non-streaming requests can hit HTTP timeouts.
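The consuming side of streaming looks roughly like this. I am using a fake async generator as a stand-in for the model stream so the sketch runs on its own; with a real SDK you would pass whatever stream of text chunks it gives you.

```typescript
// Stand-in for a model stream: any AsyncIterable of text chunks works here.
async function* fakeModelStream(text: string): AsyncGenerator<string> {
  for (const word of text.split(" ")) yield word + " ";
}

// Forwards each chunk to the UI as it arrives and returns the full text at the end.
async function streamToUser(
  chunks: AsyncIterable<string>,
  onChunk: (text: string) => void,
): Promise<string> {
  let full = "";
  for await (const chunk of chunks) {
    onChunk(chunk); // the user sees this immediately, not after the last token
    full += chunk;
  }
  return full;
}
```

Total generation time is unchanged. What changes is that the first words reach the screen almost as soon as the model produces them.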

Cost is second and it sneaks up on you. You pay per token, input and output, and it is invisible until the bill arrives. The two levers that matter most: pick the right tier (a cheaper model on the bulk volume is a multiple of savings) and use prompt caching. If you send the same large system prompt or document on every request, caching lets you pay roughly a tenth of the price for the repeated prefix. The catch is that caching is a prefix match: change a single byte early in the prompt, a timestamp, a reordered field, and the cache misses. Keep the stable stuff first and the variable stuff last. I have watched cache hit rates go to zero because someone interpolated the current time into the top of a system prompt.
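Here is a sketch of a cache-friendly request shape. The `cache_control` marker on the system block follows Anthropic's prompt caching API as I understand it; the system prompt text and the `buildRequest` helper are placeholders of mine. The thing to notice is where the timestamp goes.

```typescript
// Placeholder for a large system prompt that is identical on every request.
const STABLE_SYSTEM_PROMPT = "You are a product copywriter. Follow the style guide below.";

function buildRequest(userQuestion: string, now: Date) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    // Stable prefix first, marked cacheable. Nothing request-specific goes in here.
    system: [
      {
        type: "text" as const,
        text: STABLE_SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" as const },
      },
    ],
    // Variable content last: the timestamp lives here, after the cached prefix.
    messages: [
      { role: "user" as const, content: `Current time: ${now.toISOString()}\n\n${userQuestion}` },
    ],
  };
}
```

Two requests built a second apart have byte-identical `system` blocks, which is what the prefix match needs. Move that timestamp into the system prompt and every request becomes a cache miss.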

Failure modes are third and the scariest, because the worst one is silent. The model does not crash when it is wrong. It confidently returns something plausible and incorrect, and your code happily acts on it. Guardrails are how you contain this: validate the structured output against its schema and reject what does not fit, check the answer against rules you can verify in code, and at steps that move money or do something destructive, tell the user what is actually true even when the convenient lie is easier to ship. A model that fabricates an answer under pressure is doing exactly what it was built to do; your job is to build the harness that catches it before it reaches the user.
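A minimal sketch of the first two guardrails, using the sentiment shape from the structured output example earlier. The 0 to 1 range on `score` is an assumption I am adding to show a rule checked in code; the schema itself does not state it.

```typescript
const SENTIMENTS = ["positive", "neutral", "negative"] as const;
type SentimentLabel = (typeof SENTIMENTS)[number];
type Sentiment = { sentiment: SentimentLabel; score: number };
type Checked = { ok: true; value: Sentiment } | { ok: false; reason: string };

function isSentimentLabel(x: unknown): x is SentimentLabel {
  return typeof x === "string" && (SENTIMENTS as readonly string[]).indexOf(x) !== -1;
}

// Validates the model's raw reply before any other code is allowed to act on it.
function checkSentiment(raw: string): Checked {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "not JSON" };
  }
  if (typeof data !== "object" || data === null) return { ok: false, reason: "not an object" };

  const { sentiment, score } = data as Record<string, unknown>;
  if (!isSentimentLabel(sentiment)) {
    return { ok: false, reason: "sentiment outside the allowed values" };
  }
  // Assumed business rule, verified in code: score must be a number in [0, 1].
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 1) {
    return { ok: false, reason: "score missing or out of range" };
  }
  return { ok: true, value: { sentiment, score } };
}
```

The caller has to handle `ok: false` explicitly, so a plausible-but-wrong reply turns into a visible rejection instead of a silent action.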

Agents are the next layer

Everything above composes into single calls and short pipelines. The next layer up is agents: you give the model tools, a memory, and a loop, and let it work toward a goal over many steps, deciding its own next move each time. That is where a lot of the frontier energy is right now, and it is genuinely powerful, because some tasks really are too open-ended to script in advance.

It is also where things break in new and exciting ways: loops that never terminate, context that drifts until the model forgets the original goal, hallucinated tool calls, costs that compound because every step is another model call. Agents are not a free upgrade. They are a different engineering problem with its own discipline, and the single biggest mistake is reaching for one when a single call or a simple scripted pipeline would have done the job more reliably and far cheaper.
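The cheapest defense against the first of those failures is a hard step cap. This is a bare-bones sketch: `nextStep` (one model call that picks the next move) and `runTool` are hypothetical callbacks you would supply, and real agents need more than this.

```typescript
type Step = { done: true; answer: string } | { done: false; toolCall: string };
// Hypothetical: one model call that decides the next move from the history so far.
type NextStep = (history: string[]) => Promise<Step>;
type RunTool = (toolCall: string) => Promise<string>;

async function runAgent(
  goal: string,
  nextStep: NextStep,
  runTool: RunTool,
  maxSteps = 8,
): Promise<string> {
  // The goal stays pinned as the first entry so every step sees it.
  const history = [`GOAL: ${goal}`];
  for (let step = 0; step < maxSteps; step++) {
    const move = await nextStep(history); // each iteration is another paid model call
    if (move.done) return move.answer;
    const result = await runTool(move.toolCall);
    history.push(`CALL: ${move.toolCall}`, `RESULT: ${result}`);
  }
  // Fail loudly instead of looping (and billing) forever.
  throw new Error(`agent stopped after ${maxSteps} steps without finishing`);
}
```

The cap also bounds cost: at most `maxSteps` model calls per run, no matter what the model decides to do.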

I wrote a full tactical guide on this (what an agent actually is, when you need one, and how to keep the loop from spiraling) in building AI agents that work. Read it before you build one, because the gap between an agent demo and an agent in production is the widest gap in this entire field.

Build with LLMs the way you would build any system: smallest reliable unit first, test everything, watch the edges. The model is a remarkable new capability, but it is still just a capability. The engineering is the same engineering. It is just pointed at something fuzzier than you are used to.

FAQ

Do I need a frontier model or will a cheap one work?

Start with the cheapest model that can plausibly do the task and only move up when an eval shows it failing. Most production work runs fine on a mid-tier model with a good prompt and structured output. Reserve the frontier tier for genuinely hard reasoning, long-horizon agentic loops, and the steps where a wrong answer is expensive.

Is prompt engineering still worth it, or is it going away?

It is not going away, but it changed shape. With current models you write less defensive cajoling and more precise specification: state the task, the constraints, the output format, and the edge cases. Treat the prompt like a spec a careful contractor would follow, not a magic incantation.

How do I actually know my LLM feature works?

You build an eval set: a collection of real inputs with known-good outcomes, scored automatically or by a model judge. Without it you are shipping on vibes. The eval is what lets you change models, change prompts, or refactor with confidence instead of hoping.

What's the difference between context engineering, RAG, and fine-tuning?

Context engineering means putting the right information in the prompt at request time. RAG is a way to do that by retrieving relevant documents and injecting them. Fine-tuning bakes behavior into the model weights through training. Reach for them in that order, because each one is more expensive and less flexible than the last.

What breaks most often when LLM features hit production?

Latency, cost, and silent wrong answers. Tail latency spikes on long outputs, cost balloons when you stop watching token counts, and the model confidently returns something plausible but wrong. Guardrails, streaming, caching, and evals are how you contain all three.
