Models & Capabilities: Frequently Asked Questions
Straight answers to the questions builders actually ask about LLMs: tokens, context windows, cost, hallucination, multimodality, and more.
These are the questions I get asked most often by people starting to build with large language models — from teammates, clients, and anyone who's poked at this stuff and hit a wall of jargon. I've tried to answer each one straight, in plain language, the way I'd explain it across a desk. No hype, no hand-waving. Where the specifics move fast, I'll say so rather than pretend a number is permanent. If you want the bigger picture after this, the frontier model landscape is the map and how to choose an LLM is the process.
Which model is best?
There's no single best model, and anyone who tells you otherwise is selling something. It depends entirely on the task. The frontier tier — Anthropic's Claude Opus 4.8, or its equivalents from OpenAI and Google — is best for hard reasoning, complex code, and long autonomous agent runs. A fast tier like Claude Haiku 4.5 is best for high-volume, simple, latency-sensitive work, where it's enormously cheaper and quicker. The right move is to match the model's strengths to what your task actually needs, then confirm it by testing candidates on your own data. "Best" is always "best for this job."
What is a context window?
The context window is the maximum amount of text a model can hold in mind for a single request. It's measured in tokens, and it covers everything in that exchange: your instructions, any documents or history you include, and the model's own response. Think of it as the model's working memory for one conversation. Frontier models have pushed this a long way: many now handle hundreds of thousands of tokens, and some reach a million, which is enough to drop entire codebases or large document sets into one request. Once you exceed the window, something has to give: depending on the API or app, the request is rejected, or the oldest content gets dropped or summarized to make room.
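Here is a minimal sketch of the "drop the oldest content" approach, assuming you manage the history yourself. The names `trim_history` and `rough_tokens` are mine, not any provider's API, and the token count is a crude estimate of about four characters per token; a real system would use the provider's own tokenizer or token-counting endpoint.

```python
import math

def rough_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token for English."""
    return math.ceil(len(text) / 4)

def trim_history(messages: list[str], window: int, reserve_for_reply: int,
                 count=rough_tokens) -> list[str]:
    """Keep the newest messages that fit in the window after reserving reply space."""
    budget = window - reserve_for_reply
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so the oldest content is what gets dropped.
    for message in reversed(messages):
        cost = count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Reserving space for the reply matters because the model's response counts against the same window as everything you send.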
What is a token?
A token is the unit a model reads and writes in — a chunk of text it treats as one piece. It's usually a short word or a fragment of a longer word; common words are often a single token, while rarer words split into several. The rough rule of thumb for English is about four characters per token, which makes roughly 1,000 tokens equal to about 750 words. Why care? Because models don't bill or count in words — they count in tokens. Your costs, your context-window limits, and your rate limits are all measured this way, so it pays to think in tokens once you're building seriously.
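The rule of thumb is easy to turn into a back-of-envelope estimator. This is a sketch, not a tokenizer: the two constants are the approximations quoted above, the function names are my own, and real counts vary by model and by language.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text, not exact
WORDS_PER_TOKEN = 0.75   # roughly 750 words per 1,000 tokens

def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens from character count, rounding up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from a word count, rounding up."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

# A 750-word article comes out at about 1,000 tokens.
article_tokens = estimate_tokens_from_words(750)
```

Use an estimate like this for budgeting and sanity checks only. For anything billed or limit-sensitive, count with the provider's actual tokenizer.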
How much do models cost?
LLMs are priced per token, and the bill splits into two parts: input tokens (what you send) and output tokens (what the model generates). Output almost always costs several times more than input, so a model that "thinks out loud" at length costs more than a terse one. The spread across tiers is large — a fast model can cost a small fraction of what a frontier model costs for the same text, which is exactly why you don't use the heavyweight for everything. I'm deliberately not quoting exact numbers here, because per-token prices change and often drop over time. Always check the provider's current pricing before you budget.
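The arithmetic behind a bill is simple enough to sketch. The prices below are made-up placeholders chosen only to show the shape of the calculation, and the `Pricing` class and tier names are hypothetical; plug in the provider's current published rates before trusting any number.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens

def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output are billed at separate rates."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000

# PLACEHOLDER prices, not real quotes from any provider.
frontier = Pricing(input_per_million=10.0, output_per_million=50.0)
fast = Pricing(input_per_million=1.0, output_per_million=5.0)

# Same request on both tiers: 2,000 tokens in, 500 tokens out.
frontier_cost = request_cost(frontier, 2_000, 500)
fast_cost = request_cost(fast, 2_000, 500)
```

Multiply the per-request figure by your expected daily volume and the gap between tiers stops being abstract.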
What's the difference between Opus, Sonnet, and Haiku?
Those are the three tiers of Anthropic's Claude family, and they're a clean illustration of how tiering works across every provider. Opus (4.8 as of early 2026) is the most capable of the three: the one for the hardest reasoning, complex coding, and long agentic work. Sonnet (4.6) is the balanced default: strong enough for the vast majority of real tasks, at a price you can run at volume. It's the one I reach for first. Haiku (4.5) is the fastest and cheapest, built for high-volume, simple work where speed and cost matter more than raw brilliance. There's also Fable 5 above Opus for the most demanding work. The same idea repeats in every family: a big one, a balanced one, a fast one.
Are bigger models always better?
No, and believing they are is an expensive habit. Bigger models have a higher ceiling — they can do harder things — but they're slower, cost more, and are flat-out overkill for simple tasks. A smaller model that clears your quality bar is the better choice on easy work: faster responses, a fraction of the cost, and usually no meaningful drop in reliability on tasks within its range. The skill isn't "always go bigger," it's "use the smallest model that does the job well." Past the point where a task is solved, extra capability is money you're spending and not using.
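"Use the smallest model that does the job well" can be written down as a selection rule. This sketch assumes you already have an eval score between 0 and 1 for each candidate on your own data; the function name, model names, and scores are all hypothetical.

```python
from typing import Optional

def pick_smallest_passing(candidates_cheapest_first: list[str],
                          eval_scores: dict[str, float],
                          quality_bar: float) -> Optional[str]:
    """Return the cheapest model whose eval score clears the bar, else None."""
    for name in candidates_cheapest_first:
        # A model with no recorded score is treated as failing.
        if eval_scores.get(name, 0.0) >= quality_bar:
            return name
    return None

# Hypothetical scores from running your own eval set.
scores = {"small": 0.82, "medium": 0.93, "large": 0.97}
choice = pick_smallest_passing(["small", "medium", "large"], scores, quality_bar=0.90)
```

The ordering is the design choice: because the list runs cheapest first, extra capability is only paid for when a smaller model falls below the bar.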
What does multimodal mean?
A multimodal model can handle more than just text. It takes images as native input — and increasingly audio and video too — and reasons over them alongside any text you give it. A text-only model can read and write text and nothing else. This matters the moment your inputs aren't purely textual: screenshots, photos, PDFs with layout, charts, diagrams, recordings. If that's your world, you need a multimodal model, and it's one of the first filters that narrows which families you can use. Google's Gemini is especially known for treating mixed media as first-class, but multimodality is broadly available across the frontier now.
Can models browse the internet?
Not by themselves. A base model is a snapshot — it knows only what was in its training data, up to a cutoff date, and it has no live connection to the web. It can't check today's news or look up a current price on its own. It can browse when you give it a tool to do so: a web search or web fetch capability that you wire into the request. Many providers ship these as built-in, server-side tools, so the model can issue a search and use the results without you running the search yourself. But it's always an added capability, never a default — out of the box, the model is working from memory.
Do models hallucinate?
Yes — all of them, and it's important to expect it. A hallucination is when the model produces text that's fluent, confident, and wrong: an invented fact, a citation to a paper that doesn't exist, a plausible-sounding answer with no basis. It happens because these models generate likely-sounding text, and "likely-sounding" and "true" usually overlap but not always. You reduce hallucination by grounding the model in real source material it can quote from, asking it to cite, and keeping tasks within what it actually knows — and you reduce the damage by verifying anything that matters before you act on it. What you can't do is assume it's gone. Design as if the model can be confidently wrong, because sometimes it will be.
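One cheap verification step, if you've asked the model to quote its sources, is to check that each quote really appears in the material you supplied. This is a sketch under that assumption: `unsupported_quotes` is my own helper name, and an exact-substring match only catches invented quotes, not a real quote used to support a wrong conclusion.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return " ".join(text.split()).lower()

def unsupported_quotes(quotes: list[str], source: str) -> list[str]:
    """Return the quotes that do not appear verbatim in the source text."""
    haystack = normalize(source)
    return [q for q in quotes if normalize(q) not in haystack]

source_text = "The window is 200,000 tokens.\nPrices change often."
model_quotes = ["Prices change often", "The window is one million tokens"]
suspect = unsupported_quotes(model_quotes, source_text)
```

Anything that lands in `suspect` should go to a human or be dropped before it reaches a user.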
What's the difference between open and closed models?
Closed models — Claude, GPT, Gemini — run on the provider's servers, and you use them through an API. You get the highest available capability with zero infrastructure to manage, but you're sending data to the provider and paying per token. Open-weight models — Llama, Mistral, and others — have downloadable weights, so you can run them on your own hardware or in your own cloud. That buys you control, privacy (data never leaves your environment), the ability to fine-tune deeply, and no per-token bill. The cost is that you're now running and scaling the infrastructure yourself, which is real work, and open models still trail the closed frontier on the very hardest tasks — though the gap narrows constantly. Choose closed for capability and convenience, open for control and privacy.
What is fine-tuning?
Fine-tuning is taking a pre-trained base model and training it further on your own examples, so it learns a specific behavior — a house style, an exact output format, a specialized task pattern. The key thing to understand: fine-tuning changes how the model behaves, not what facts it knows. If you need the model to know your current data, fine-tuning is the wrong tool — that's a job for retrieval. Fine-tuning shines when you want consistent style or structure that's hard to get with prompting alone. It's also heavier than the alternatives — you need good training data and a training step — so most needs are met first with better prompting or by feeding the model the right context at request time. Reach for fine-tuning when those genuinely fall short.
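To make "feeding the model the right context at request time" concrete, here is a toy retrieval sketch. It ranks documents by shared words with the question, which is a deliberately naive stand-in for real search or embeddings, and the function names and sample documents are invented for illustration.

```python
def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Put the retrieved text in front of the question so the model can use it."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

docs = [
    "Refunds are processed within 5 days.",
    "Our office is in Lisbon.",
    "Shipping takes 2 weeks.",
]
prompt = build_prompt("how long do refunds take", docs)
```

When the facts change, you update the documents and the next request sees them; no training step is involved, which is why retrieval is the tool for current data.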
How fast are models improving?
Fast — faster than almost any other technology I've built on. Roughly every few months a major family ships a model that's materially better than its predecessor, with smaller updates landing in between. The pace can feel like a treadmill if you try to evaluate every release. The healthier response, and the one I practice, is architectural: build your system so the model is one swappable component — its ID behind a single config value, your provider call wrapped in your own function, a small eval set ready to test newcomers. Then you don't chase releases. When a better or cheaper model lands, you run it through your evals and adopt it in an afternoon if it wins. The speed of improvement becomes an advantage you collect, not a pressure you fight.
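The three pieces of that architecture fit in a few lines. This is a sketch with a fake provider so it runs offline: the model IDs, the canned answers, and every function name here are hypothetical, and in a real system `fake_provider` would be replaced by your actual SDK call.

```python
from typing import Callable

# Piece 1: the model ID lives in one config value.
MODEL_ID = "example-model-v1"  # hypothetical ID

def fake_provider(model_id: str, prompt: str) -> str:
    """Stand-in for a real provider SDK call, with canned answers."""
    canned = {
        ("example-model-v1", "What is 2 + 2?"): "5",
        ("example-model-v2", "What is 2 + 2?"): "4",
    }
    return canned.get((model_id, prompt), "Paris")

# Piece 2: your own wrapper is the only place that touches the provider.
def complete(prompt: str, model_id: str = MODEL_ID,
             provider: Callable[[str, str], str] = fake_provider) -> str:
    return provider(model_id, prompt)

# Piece 3: a small eval set of (prompt, expected substring) pairs.
EVAL_SET = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def pass_rate(model_id: str,
              provider: Callable[[str, str], str] = fake_provider) -> float:
    """Fraction of eval cases whose output contains the expected text."""
    passed = sum(
        1 for prompt, expected in EVAL_SET
        if expected in complete(prompt, model_id, provider)
    )
    return passed / len(EVAL_SET)

current_score = pass_rate("example-model-v1")
candidate_score = pass_rate("example-model-v2")
```

If the candidate wins on your evals, adopting it is a one-line change to `MODEL_ID`. Substring matching is the simplest possible grader; real eval sets usually need task-specific checks.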