How to Choose an LLM (and Switch Without Pain)

A practical decision process for picking an LLM: define the task, pick a tier, test on your data, and architect so switching is a config change.

By Matt Goren · Updated June 25, 2026 · 8 min read

Choosing an LLM feels like a high-stakes, one-way door. It isn't — if you set it up right. The mistake I see most often is treating model selection as a permanent identity decision ("we're a GPT shop," "we're a Claude shop") instead of what it actually is: a routine engineering call you should be able to revisit any time something better ships. This guide is the process I use. It's boring on purpose. Boring is what survives contact with production.

If you want the wider map of who makes what and how the families compare, read the frontier model landscape first. This guide is the how-to: a repeatable process from "I have a task" to "this model, wired so I can swap it."

Step 1: Define the task before you look at any model

You cannot choose a model for a task you haven't pinned down. Before I read a single model spec, I write down five things:

  • Inputs. What's going in? Plain text? Images, PDFs, audio? A 200-word email or a 200-page document? This alone eliminates whole families — if you're feeding it video, you need native multimodal support.
  • Outputs. What's coming out, and in what shape? A free-form paragraph is a different problem than strict JSON that has to validate every time. A classification label is different from a long reasoned analysis.
  • Volume. Ten requests a day or ten million? Volume turns small per-token differences into the whole budget. At low volume, price barely matters and you should optimize for quality. At high volume, price is the decision.
  • Latency tolerance. Does a human stare at a spinner waiting for this, or does it run overnight in a batch? Interactive needs speed; batch can use the slow heavyweight.
  • Quality bar — and how you'll measure it. What does "good enough" mean, concretely, and how will you know? This is the one people skip, and it's the one that makes every later step possible. "It should summarize well" is useless. "It should produce a summary under 100 words that a human rates 4/5 or better, with no invented facts" is testable.

Write these down first. The model choice largely falls out of them. Pick the model first and you're reverse-engineering a justification.
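
One way to make "write these down" literal is to keep the five items as a typed record next to the code. This is a minimal sketch, not a standard schema: the `TaskSpec` shape, its field names, and the example volume are assumptions I made up for illustration.

```typescript
// Hypothetical shape for the five things from Step 1. All names are illustrative.
type Modality = "text" | "image" | "pdf" | "audio" | "video";

interface TaskSpec {
  name: string;
  inputs: { modalities: Modality[]; typicalSize: string };
  outputs: { shape: "free_text" | "json" | "label"; constraints: string[] };
  requestsPerDay: number;
  latency: "interactive" | "batch";
  qualityBar: string; // must be testable, not "summarizes well"
}

// Example built from the summarization quality bar above.
// The volume figure is an assumed placeholder.
const emailSummary: TaskSpec = {
  name: "email-summary",
  inputs: { modalities: ["text"], typicalSize: "about 200 words" },
  outputs: {
    shape: "free_text",
    constraints: ["under 100 words", "no invented facts"],
  },
  requestsPerDay: 5000,
  latency: "interactive",
  qualityBar: "human rating of 4/5 or better, with no invented facts",
};

// Which non-text modalities must a candidate model support natively?
function nonTextModalities(spec: TaskSpec): Modality[] {
  return spec.inputs.modalities.filter((m) => m !== "text");
}
```

A record like this can double as the header of the eval set: the `qualityBar` string is the thing each later score is measured against.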

Step 2: Pick a tier, not a model

With the task defined, don't jump straight to a specific model — pick a tier first. Every family offers roughly three: a fast/cheap tier, a balanced tier, and a frontier tier. (For Claude, that's Haiku 4.5, Sonnet 4.6, and Opus 4.8, with Fable 5 above for the most demanding work.)

The tier follows from the task:

  • Simple, high-volume, latency-sensitive — classification, extraction, routing, short Q&A, tagging. Start at the fast/cheap tier. These tasks don't need a genius; they need a competent, fast, cheap worker, and the savings at volume are enormous.
  • Most real application work — content generation, moderate reasoning, summarization with judgment, tool-using assistants. Start at the balanced tier. It's the default for a reason: enough intelligence for real tasks at a price you can run at volume.
  • Hard, open-ended, or high-stakes — complex code, deep multi-step reasoning, long autonomous agent runs, anything where a wrong answer is expensive. Start at the frontier tier. Pay for the ceiling when the task actually demands it.

Notice the word start. The tier is a hypothesis, not a verdict. The right move is often to start one tier higher than you think you need to confirm the task is solvable at all, then push down a tier and see if the cheaper model still clears your quality bar. You'll be surprised how often it does, and every tier you drop cuts your bill by a multiple.
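
The tier-as-hypothesis idea can be sketched as a lookup plus an evaluation order. Treat this as a rough illustration under my own assumptions: the `TaskKind` categories are just the examples from the bullets above turned into labels, and `startingTier` and `evaluationOrder` are hypothetical helper names.

```typescript
type Tier = "fast" | "balanced" | "frontier";

// Hypothetical task labels, mirroring the three bullets above.
type TaskKind =
  | "classification" | "extraction" | "routing" | "short_qa" | "tagging"
  | "content_generation" | "summarization" | "tool_assistant"
  | "complex_code" | "deep_reasoning" | "autonomous_agent";

const TIER_ORDER: Tier[] = ["fast", "balanced", "frontier"];

const START_TIER: Record<TaskKind, Tier> = {
  classification: "fast", extraction: "fast", routing: "fast",
  short_qa: "fast", tagging: "fast",
  content_generation: "balanced", summarization: "balanced",
  tool_assistant: "balanced",
  complex_code: "frontier", deep_reasoning: "frontier",
  autonomous_agent: "frontier",
};

// First guess at a tier. High-stakes work goes to the frontier tier
// regardless of how simple the task looks.
function startingTier(kind: TaskKind, highStakes = false): Tier {
  return highStakes ? "frontier" : START_TIER[kind];
}

// Tiers to evaluate, in order: one above the starting tier first (to
// confirm the task is solvable at all), then downward to the cheapest.
function evaluationOrder(start: Tier): Tier[] {
  const i = TIER_ORDER.indexOf(start);
  const top = Math.min(i + 1, TIER_ORDER.length - 1);
  return TIER_ORDER.slice(0, top + 1).reverse();
}
```

For a tagging task this yields balanced first, then fast: prove it works, then try to make it cheaper.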

Step 3: Test on YOUR data — this is the whole ballgame

Here is the step that separates people who choose well from people who cargo-cult a leaderboard: test the candidate models on your own data.

Public benchmarks tell you how a model does on someone else's tasks. They tell you almost nothing about how it does on yours. Your prompts, your edge cases, your input shapes, your quality bar — none of that is on a benchmark. I've watched a model that "wins" on paper lose badly on a specific extraction task, and a cheaper model quietly nail it. You only find that out by running it yourself.

Build an eval set. It does not need to be fancy:

  1. Collect 20 to 50 real cases. Actual inputs you expect in production. Include the easy ones, the weird ones, and the ones that have burned you before.
  2. Write down the known-good answer for each, or at least what a good answer looks like. This is your answer key.
  3. Run every candidate model through the whole set with the same prompt.
  4. Score the outputs. For some tasks (classification, valid JSON) scoring is exact and you can automate it. For judgment tasks, rate them yourself, or — a trick I lean on hard — use a strong frontier model as a judge to score the outputs against your criteria at scale.
  5. Record cost and latency alongside quality for each model. You're not choosing on quality alone.

Now you have a real table: model, quality score, cost per request, latency. That table makes the decision for you, on evidence, in an afternoon. And the eval set keeps paying off — it's exactly what lets you re-evaluate when a new model ships (more on that below). If you do nothing else from this guide, do this.
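
The five steps above fit in a few dozen lines. What follows is a minimal sketch, assuming each candidate is wrapped as an async function that returns its text along with cost and latency; the `Candidate`, `EvalCase`, and `runEval` names are hypothetical, and the exact-match scorer only suits tasks like classification where scoring is exact. For judgment tasks you would pass in a different `Scorer`, for example one backed by human ratings or a judge model.

```typescript
interface EvalCase { input: string; expected: string }
interface ModelResult { text: string; costUsd: number; latencyMs: number }

// Hypothetical: any async function from prompt to result. In a real
// project this would wrap a provider call and measure cost and latency.
type Candidate = (prompt: string) => Promise<ModelResult>;
type Scorer = (output: string, expected: string) => number; // 0 to 1

interface EvalRow {
  model: string;
  quality: number;           // mean score across cases
  costPerRequestUsd: number; // mean cost across cases
  avgLatencyMs: number;      // mean latency across cases
}

const exactMatch: Scorer = (out, exp) => (out.trim() === exp.trim() ? 1 : 0);

async function runEval(
  candidates: Record<string, Candidate>,
  cases: EvalCase[],
  buildPrompt: (input: string) => string,
  score: Scorer = exactMatch,
): Promise<EvalRow[]> {
  if (cases.length === 0) throw new Error("eval set is empty");
  const rows: EvalRow[] = [];
  for (const model of Object.keys(candidates)) {
    const call = candidates[model];
    if (!call) continue;
    let quality = 0;
    let cost = 0;
    let latency = 0;
    for (const c of cases) {
      // Same prompt template for every model, so the comparison is fair.
      const r = await call(buildPrompt(c.input));
      quality += score(r.text, c.expected);
      cost += r.costUsd;
      latency += r.latencyMs;
    }
    const n = cases.length;
    rows.push({
      model,
      quality: quality / n,
      costPerRequestUsd: cost / n,
      avgLatencyMs: latency / n,
    });
  }
  return rows;
}
```

The returned rows are the table: one line per model with quality, cost per request, and latency side by side.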

Step 4: Weigh cost, latency, and quality together

With the table in hand, choose deliberately across all three axes. The instinct is to pick the highest quality score and stop. Resist it. Ask:

  • Is the quality difference real and relevant? If the frontier model scores a hair higher but the balanced model already clears your bar, the extra quality is something you're paying for and not using. Clearing the bar is binary on many tasks — past "good enough," more capability is wasted spend.
  • What does the cost difference become at your volume? A few cents per request is invisible at a hundred requests a day and is your entire budget at ten million. Multiply it out before you decide.
  • Does the latency fit the experience? A model that's smarter but noticeably slower can be the wrong choice for a chat UI and the right choice for a batch job. Same model, different verdict, depending on where it runs.
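
"Multiply it out" is worth doing literally. A small worked example, assuming a hypothetical price of 2 cents per request and a 30-day month (both numbers are placeholders, not real pricing):

```typescript
// Monthly spend in US dollars. Prices are in cents to keep the arithmetic exact.
function monthlyCostUsd(
  centsPerRequest: number,
  requestsPerDay: number,
  daysPerMonth = 30,
): number {
  return (centsPerRequest * requestsPerDay * daysPerMonth) / 100;
}

// Assumed price: 2 cents per request.
const lowVolume = monthlyCostUsd(2, 100);          // 60
const highVolume = monthlyCostUsd(2, 10_000_000);  // 6000000
```

The same per-request price is $60 a month at a hundred requests a day and $6,000,000 a month at ten million, which is why volume decides how much the price column matters.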

Often the winner is the cheapest model that clears your quality bar with latency you can live with — not the smartest model on the table. Spend the capability budget where the task is hard; bank the savings everywhere else.
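
That rule is simple enough to write down as code. A sketch, assuming an eval table with one row per model; the `pickModel` name, the row shape, and every number in the example table are made up for illustration:

```typescript
interface EvalRow {
  model: string;
  quality: number;           // 0 to 1, from your eval set
  costPerRequestUsd: number;
  avgLatencyMs: number;
}

// Cheapest model that clears the quality bar within the latency budget.
// Ties on cost go to the higher-quality model. Returns undefined if
// nothing qualifies, which is a signal to revisit the bar or the tier.
function pickModel(
  rows: EvalRow[],
  minQuality: number,
  maxLatencyMs: number,
): EvalRow | undefined {
  const passing = rows.filter(
    (r) => r.quality >= minQuality && r.avgLatencyMs <= maxLatencyMs,
  );
  passing.sort(
    (a, b) => a.costPerRequestUsd - b.costPerRequestUsd || b.quality - a.quality,
  );
  return passing[0];
}

// Illustrative table with placeholder model names and invented numbers.
const table: EvalRow[] = [
  { model: "fast-tier", quality: 0.78, costPerRequestUsd: 0.001, avgLatencyMs: 400 },
  { model: "balanced-tier", quality: 0.93, costPerRequestUsd: 0.006, avgLatencyMs: 900 },
  { model: "frontier-tier", quality: 0.96, costPerRequestUsd: 0.03, avgLatencyMs: 2500 },
];

const choice = pickModel(table, 0.9, 2000); // balanced-tier row
```

With a bar of 0.9 and a 2-second budget, the frontier row loses on latency and the fast row loses on quality, so the balanced row wins even though it does not have the top score.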

A note on the data substrate: if your task needs the model to know things it wasn't trained on — your docs, your products, current facts — the model choice is only half the decision. How you feed it that knowledge matters just as much. That's a separate axis worth understanding on its own; see RAG vs fine-tuning vs long context for how to choose. Pick the knowledge strategy and the model together, because they interact.

Step 5: Architect so switching is a config change

You've chosen. Now build so the choice is never a trap. This is the part people skip and regret, and it's cheap if you do it from the start.

The landscape moves every few months — better models, lower prices, new fast tiers. Every one of those is either a free upgrade or a painful migration, and which one it is depends entirely on how you wired things now.

Four habits make switching a config change instead of a project:

  • One config value for the model ID. Never scatter the model string across your codebase. Put it in one place — an environment variable, a config file — and read it everywhere. Switching becomes editing one line.
  • Wrap the provider call. Don't call the vendor SDK directly from all over your app. Write a thin function of your own, generate(prompt, options), and route everything through it. When you want to try another provider, or send different tiers to different models, you change the wrapper, not the app. (The official SDKs make this easy; for Claude in TypeScript or JavaScript it's @anthropic-ai/sdk, and a one-function wrapper around it is all you need.)
  • Keep prompts portable. Some provider-specific tuning is unavoidable, but keep your core instructions clean and model-agnostic. The more your prompts lean on one model's undocumented quirks, the harder you're glued to it.
  • Keep the eval set alive. That test set from Step 3 is your switching superpower. When a new model lands, you don't agonize — you run it through your evals and know, in an hour, whether it's better for your actual task. Decisions on evidence, every time, forever.
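
The first two habits together look roughly like this. It is a sketch under stated assumptions, not a reference implementation: `LLM_MODEL`, `Provider`, `modelFromEnv`, and `makeGenerate` are names I invented, and the provider here is an offline stub so the example runs without any SDK installed. A real provider would make the vendor SDK call inside that one function.

```typescript
// Hypothetical names throughout; nothing here is a vendor API.
interface GenerateOptions { maxTokens?: number; model?: string }

// The only seam that knows about a vendor. Swap this to swap providers.
type Provider = (req: {
  model: string;
  prompt: string;
  maxTokens: number;
}) => Promise<string>;

// The one place the model ID is read. In a Node app, pass process.env.
function modelFromEnv(env: Record<string, string | undefined>): string {
  const model = env["LLM_MODEL"];
  if (!model) throw new Error("LLM_MODEL is not set");
  return model;
}

// Builds the app-wide generate(prompt, options) around a provider.
function makeGenerate(provider: Provider, defaultModel: string) {
  return async function generate(
    prompt: string,
    options: GenerateOptions = {},
  ): Promise<string> {
    return provider({
      model: options.model ?? defaultModel, // per-call override for routing
      prompt,
      maxTokens: options.maxTokens ?? 1024, // assumed default
    });
  };
}

// Offline stub so the sketch runs as-is; replace with a real SDK call.
const echoProvider: Provider = async ({ model, prompt }) =>
  `[${model}] ${prompt}`;

const generate = makeGenerate(
  echoProvider,
  modelFromEnv({ LLM_MODEL: "example-model-id" }),
);
```

Application code only ever calls `generate("...")`. Changing the model is editing `LLM_MODEL`; changing the vendor is passing a different `Provider`; and the same `generate` function can be handed to the eval set as a candidate.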

I build this layer on day one of every project. It costs almost nothing up front and saves a rewrite later. With it in place, model selection stops being a one-way door and becomes what it should be: a routine call you make on evidence and revisit whenever the landscape gives you a reason to.

The whole process in one breath

Define the task: inputs, outputs, volume, latency, quality bar. Pick a tier from the task, confirm the task is solvable one tier up if you're unsure, then push down a tier and see whether the cheaper model still clears your bar. Build a small eval set from your own data and run the candidates through it. Choose the cheapest model that clears your bar at acceptable latency. And wire the whole thing so the model ID is one config value behind your own wrapper, with your evals ready to re-run the day something better ships.

That's it. No leaderboard-chasing, no permanent identity, no one-way doors. A repeatable process and an architecture that turns every new model release from a threat into an opportunity.

FAQ

What's the first step in choosing an LLM? Define the task precisely — inputs, outputs, volume, latency tolerance, and how you'll know an answer is good. The model choice falls out of the task. Pick the model first and you're guessing.

Should I just use the most powerful model for everything? No. The frontier tier costs more and runs slower. Most production traffic is simple enough for a fast, cheap tier. Use the heavyweight only where the task is genuinely hard, and route by difficulty.

How do I actually test models against each other? Build a small eval set from your own real tasks — 20 to 50 cases with known-good answers. Run each candidate model through it, score quality, and measure cost and latency. Decide on your data, not on benchmarks.

How do I switch models later without a rewrite? Put the model ID behind one config value, wrap the provider call in your own thin function, and keep your prompts portable. Then switching is changing one string and re-running your evals.
