Multimodal AI: A Builder's Guide to Vision, Images, and Audio
What multimodal models actually do, where they earn their keep, and how to ship vision and structured extraction in production without surprises.
I build with these models every day at RunOctopus, and "multimodal" is the capability people most consistently underestimate. They think of it as a demo feature — drop in a photo, get a cute description. In practice it is the thing that lets you delete whole categories of brittle, hand-rolled infrastructure: OCR pipelines, layout parsers, screenshot diff tools, audio transcription glue. A single model now does the part that used to take five libraries duct-taped together.
This is a builder's guide. I am going to tell you what multimodal models actually do, where they earn their keep, how I wire them up, and where the sharp edges are.
What "multimodal" actually means
A multimodal model takes more than text as input, and sometimes produces more than text as output. The two halves are different and worth separating in your head.
Understanding is the strong half. The frontier chat models — Anthropic's Claude (Opus 4.8, Sonnet 4.6, Haiku 4.5), OpenAI's GPT family, Google's Gemini — natively accept images alongside your text prompt. You hand them a photo, a scanned invoice, a screenshot, a chart, a multi-page PDF, and they reason over it in the same turn as your instructions. Increasingly the same models take audio and video too. This part has gotten genuinely good, and it moves fast — capabilities that were shaky six months ago are reliable now.
Generation is the other half, and it usually lives in a separate model. Producing an image or synthesizing speech is typically a dedicated model you call on its own, even within the same provider's lineup. So the mental model is: one model to read the world, a different model to draw or speak. Don't assume the chat model that describes your image can also produce one — check the specific model card.
For a fuller picture of how I think about models as building blocks, see my field guide to building with LLMs.
The use cases that actually pay off
Forget the party tricks. Here is where vision and audio understanding earn real money in production.
Document extraction. This is the big one. Invoices, receipts, purchase orders, contracts, insurance forms, lab results, shipping manifests. The old way was OCR plus a forest of regex and positional heuristics that broke every time a vendor changed their template. A vision model reads the document the way a person does — it knows the total is the total even when it moved to the other corner. You ask for the fields you want and it pulls them.
Screenshot and UI understanding. Feed the model a screenshot of an app, a dashboard, an error state. It can tell you what's on screen, what's wrong, what a user would click next. This powers QA automation, support triage ("here's a photo of my screen, what's broken"), and agent workflows that operate software.
Charts, diagrams, and photos of the physical world. A model can read a chart's trend and the actual numbers, interpret an architecture diagram, or look at a photo of a product, a shelf, a damaged package, a whiteboard, and turn it into structured data. Retail, logistics, insurance, and field-service teams live on this.
Audio understanding. Transcription, but smarter — the model can transcribe and summarize and pull action items in one pass, and it keeps speaker context that a raw transcription service drops. Meeting notes, call QA, voice interfaces.
Generation, used sparingly. Image generation for marketing assets and product mockups, speech synthesis for voice agents. Useful, but a different tool with its own model — treat it as a separate integration, not a free add-on.
How I build a vision feature
The pattern is almost always the same: send an image plus a prompt that demands structured output, then validate. Here's the shape with the Anthropic SDK (@anthropic-ai/sdk), which is what I reach for first.
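import { readFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// "invoice.png" is a placeholder path; load your own image as base64.
const imgBase64 = readFileSync("invoice.png").toString("base64");

const msg = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      // Image first, then the instruction that defines the output shape.
      { type: "image", source: {
          type: "base64", media_type: "image/png", data: imgBase64 } },
      { type: "text", text:
          "Extract the invoice. Return JSON: vendor, invoice_no, " +
          "date (ISO), line_items[{desc, qty, unit_price}], total. " +
          "If a field is missing, use null. Output only JSON." }
    ]
  }]
});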
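One way to make the schema binding rather than merely requested is tool calling. The sketch below builds the same invoice request as a plain object, with a single tool whose `input_schema` is the target shape and a `tool_choice` that forces the model to call it. The tool name `record_invoice`, the helper names, and the schema details are illustrative assumptions. The object is what would be passed to `client.messages.create`, and the extracted data comes back as the `input` of a `tool_use` content block.

```typescript
// Illustrative sketch: "record_invoice" and the helper names are hypothetical.
// The request shape (tools, input_schema, tool_choice, tool_use blocks) follows
// Anthropic's Messages API; check the current docs before relying on details.

const invoiceTool = {
  name: "record_invoice",
  description: "Record the fields extracted from an invoice image.",
  input_schema: {
    type: "object",
    properties: {
      vendor: { type: ["string", "null"] },
      invoice_no: { type: ["string", "null"] },
      date: { type: ["string", "null"], description: "ISO 8601 date, YYYY-MM-DD" },
      line_items: {
        type: "array",
        items: {
          type: "object",
          properties: {
            desc: { type: "string" },
            qty: { type: "number" },
            unit_price: { type: "number" },
          },
          required: ["desc", "qty", "unit_price"],
        },
      },
      total: { type: ["number", "null"] },
    },
    required: ["vendor", "invoice_no", "date", "line_items", "total"],
  },
};

// Build the request body; pass the result to client.messages.create(...).
function buildInvoiceRequest(imgBase64: string, model = "claude-sonnet-4-6") {
  return {
    model,
    max_tokens: 1024,
    tools: [invoiceTool],
    // Force a call to the tool so the reply is structured data, not prose.
    tool_choice: { type: "tool", name: invoiceTool.name },
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imgBase64 } },
        { type: "text", text: "Extract the invoice and record it with record_invoice. Use null for any missing field." },
      ],
    }],
  };
}

// Minimal view of a response content block, enough to find the tool call.
type ContentBlock = { type: string; name?: string; input?: unknown };

// Return the structured input of the first matching tool_use block, if any.
function findToolInput(content: ContentBlock[], toolName: string): unknown {
  const block = content.find((b) => b.type === "tool_use" && b.name === toolName);
  return block ? block.input : undefined;
}
```

Tool calling steers the output strongly toward the schema, but the result is still worth validating before it touches anything downstream.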
Three things I do every single time:
Demand a schema. Never accept free-form prose when you want data. Tell the model the exact JSON shape, including what to do with missing fields (null, not invention). Better still, use the provider's structured-output or tool-calling feature so the model is constrained to your schema rather than asked nicely. This is the single biggest reliability lever, and it's the same discipline I cover in prompt engineering for production.
Validate the output. Parse the JSON, check it against a real validator (I use a schema library), and have a fallback path when validation fails — retry with a sharper prompt, escalate to a bigger model, or flag for a human. The model is good; it is not a database constraint.
Right-size the model. I don't send every image to the biggest model. A high-volume, well-defined extraction job runs fine on a smaller, faster, cheaper model like Haiku. I reserve the heavy models for documents that are messy, ambiguous, or high-stakes. Tiered routing (cheap model first, escalate when the output fails validation or looks uncertain) is how you keep the bill sane.
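The validate-then-escalate loop can be sketched without any dependencies. This is a minimal illustration, not a definitive implementation: the field names follow the invoice prompt, while `callModel`, the tier list, and the hand-written checks are hypothetical stand-ins for whatever client, models, and schema library you actually use. Escalation here triggers on validation failure, which is one simple proxy for low confidence.

```typescript
// Sketch only. Field names follow the invoice prompt; callModel and the tier
// list are hypothetical stand-ins for whatever client and models you use.

interface LineItem { desc: string; qty: number; unit_price: number }
interface Invoice {
  vendor: string | null;
  invoice_no: string | null;
  date: string | null;
  line_items: LineItem[];
  total: number | null;
}

// invoice is null whenever errors is non-empty.
interface Validation { invoice: Invoice | null; errors: string[] }

const isNum = (v: unknown): v is number => typeof v === "number" && Number.isFinite(v);

// Parse the model's raw text and check it against the expected shape.
function validateInvoice(raw: string): Validation {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { invoice: null, errors: ["output is not valid JSON"] };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { invoice: null, errors: ["top level must be a JSON object"] };
  }
  const d = data as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of ["vendor", "invoice_no", "date"]) {
    if (d[key] !== null && typeof d[key] !== "string") errors.push(`${key}: expected string or null`);
  }
  if (typeof d.date === "string" && !/^\d{4}-\d{2}-\d{2}$/.test(d.date)) {
    errors.push("date: expected YYYY-MM-DD");
  }
  if (d.total !== null && !isNum(d.total)) errors.push("total: expected number or null");
  if (!Array.isArray(d.line_items)) {
    errors.push("line_items: expected array");
  } else {
    d.line_items.forEach((item: unknown, i: number) => {
      const li = (item ?? {}) as Record<string, unknown>;
      if (typeof li.desc !== "string" || !isNum(li.qty) || !isNum(li.unit_price)) {
        errors.push(`line_items[${i}]: expected {desc, qty, unit_price}`);
      }
    });
  }
  return errors.length > 0 ? { invoice: null, errors } : { invoice: d as unknown as Invoice, errors: [] };
}

type CallModel = (model: string, imgBase64: string) => Promise<string>;
type Attempt = { model: string; errors: string[] };
type Outcome =
  | { status: "ok"; model: string; invoice: Invoice }
  | { status: "needs_human_review"; attempts: Attempt[] };

// Try the cheapest tier first; move up a tier only when validation fails.
async function extractWithEscalation(
  imgBase64: string,
  tiers: string[],
  callModel: CallModel
): Promise<Outcome> {
  const attempts: Attempt[] = [];
  for (const model of tiers) {
    const result = validateInvoice(await callModel(model, imgBase64));
    if (result.invoice) return { status: "ok", model, invoice: result.invoice };
    attempts.push({ model, errors: result.errors });
  }
  // Every tier failed validation: hand the document to a person.
  return { status: "needs_human_review", attempts };
}
```

Injecting `callModel` keeps the routing logic testable with a stub, and the collected `attempts` give a reviewer the reason each tier was rejected.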
Costs and limits, honestly
The numbers move constantly, so I'll give you the durable principles instead of figures that will be stale by the time you read this.
Images are billed as tokens. The model converts each image into a token count that grows with its pixel dimensions. The practical consequence: a big, high-DPI scan can cost as much as several thousand words of text per image. If you're processing thousands of documents, resolution is your cost dial. Downscale to the smallest size where the text is still legible to a human and you'll often cut image cost by more than half with no accuracy loss. I rarely send an image at a higher resolution than it takes to read the smallest detail that matters.
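To see how resolution acts as the cost dial, here is a rough estimator. It assumes token cost scales with pixel area divided by a provider-specific constant. The default of 750 pixels per token is the approximate figure Anthropic's vision documentation has given for Claude and is used here only as an example; the function names are hypothetical, and the estimate applies to the size the provider actually processes after any automatic resizing.

```typescript
// Sketch under stated assumptions: pixelsPerToken is a provider-specific
// constant. 750 is a rough figure cited for Claude; treat it as an example
// value and check your provider's current documentation.

interface Size { width: number; height: number }

// Approximate token cost of an image from its pixel dimensions.
function estimateImageTokens(size: Size, pixelsPerToken = 750): number {
  return Math.ceil((size.width * size.height) / pixelsPerToken);
}

// Scale a size down so its long edge is at most maxLongEdge, keeping the
// aspect ratio. Sizes already within the limit are returned unchanged.
function fitWithin(size: Size, maxLongEdge: number): Size {
  const longEdge = Math.max(size.width, size.height);
  if (longEdge <= maxLongEdge) return { width: size.width, height: size.height };
  const scale = maxLongEdge / longEdge;
  return {
    width: Math.round(size.width * scale),
    height: Math.round(size.height * scale),
  };
}

// Example: a 1200 x 1500 px page versus the same page capped at 1000 px.
const original: Size = { width: 1200, height: 1500 };
const reduced = fitWithin(original, 1000);
console.log(estimateImageTokens(original), estimateImageTokens(reduced));
```

Under the 750 assumption this prints 2400 and 1067: trimming the long edge by a third cuts the estimated cost by more than half, because cost follows area rather than edge length.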
There are resolution and size ceilings. Every model has a max image dimension and file size; oversized images are typically downscaled automatically or rejected outright, and the automatic downscale can blur fine print right when you need it. For dense documents, I rasterize each page separately at a controlled DPI rather than sending one giant image and hoping.
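Choosing a "controlled DPI" is simple arithmetic once you know the pixel ceiling you want to stay under. The sketch below assumes the ceiling is expressed as a maximum long edge in pixels; the 1500 px value and the function names are placeholders, so look up the real limit for the model you use.

```typescript
// Sketch: pick a rasterization DPI so a page's long edge stays under a pixel
// ceiling. maxLongEdgePx is an assumption supplied by the caller.

interface PageInches { widthIn: number; heightIn: number }

// Largest whole-number DPI at which the page's long edge fits the ceiling.
function dpiForCeiling(page: PageInches, maxLongEdgePx: number): number {
  const longEdgeIn = Math.max(page.widthIn, page.heightIn);
  return Math.floor(maxLongEdgePx / longEdgeIn);
}

// Pixel dimensions the page will have when rendered at a given DPI.
function pixelsAtDpi(page: PageInches, dpi: number): { width: number; height: number } {
  return {
    width: Math.round(page.widthIn * dpi),
    height: Math.round(page.heightIn * dpi),
  };
}

const letter: PageInches = { widthIn: 8.5, heightIn: 11 };
const dpi = dpiForCeiling(letter, 1500); // 1500 px is an example ceiling only
console.log(dpi, pixelsAtDpi(letter, dpi));
```

For a US Letter page and a 1500 px ceiling this gives 136 DPI and a 1156 x 1496 px image. The DPI then goes to whatever rasterizer you use, for example the `-r` option of `pdftoppm`.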
PDFs: native first, rasterize as fallback. Several models accept PDFs directly now, which is the least-effort path and preserves text layers. When a PDF is huge, scanned, or the native path misbehaves, I convert pages to images myself so I control resolution and ordering.
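A minimal sketch of the native-first decision follows. The `document` and `image` block shapes follow Anthropic's Messages API; everything else (the `PdfInput` fields, the `maxNativePages` threshold, the per-page labels) is a hypothetical convention for illustration, and the rasterization itself is assumed to happen elsewhere.

```typescript
// Sketch: choose between a native PDF block and per-page image blocks.
// PdfInput and maxNativePages are hypothetical; set the threshold from your
// provider's documented limits.

interface PdfInput {
  pdfBase64: string;
  pageCount: number;
  isScanned: boolean;        // true when the PDF has no usable text layer
  pagePngsBase64?: string[]; // pre-rasterized pages, in reading order
}

type Block =
  | { type: "document"; source: { type: "base64"; media_type: "application/pdf"; data: string } }
  | { type: "image"; source: { type: "base64"; media_type: "image/png"; data: string } }
  | { type: "text"; text: string };

function pdfToBlocks(pdf: PdfInput, maxNativePages: number): Block[] {
  // Native path: small PDFs that still carry a text layer.
  if (!pdf.isScanned && pdf.pageCount <= maxNativePages) {
    return [{
      type: "document",
      source: { type: "base64", media_type: "application/pdf", data: pdf.pdfBase64 },
    }];
  }
  // Fallback path: one image per page, which needs the pages rasterized first.
  const pages = pdf.pagePngsBase64 ?? [];
  if (pages.length === 0) {
    throw new Error("fallback path needs rasterized pages");
  }
  const blocks: Block[] = [];
  pages.forEach((data, i) => {
    // Label each page so the model can keep the ordering straight.
    blocks.push({ type: "text", text: `Page ${i + 1}:` });
    blocks.push({ type: "image", source: { type: "base64", media_type: "image/png", data } });
  });
  return blocks;
}
```

The returned array drops into the `content` field of a user message, followed by the text block that carries the extraction instructions.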
Latency scales with input. More pixels and more pages mean slower responses. For interactive features, downscale and consider processing pages in parallel. For batch jobs, latency matters less than throughput and cost.
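Processing pages in parallel usually wants a cap on in-flight requests so you stay inside rate limits. The helper below is a generic sketch, not tied to any SDK: `mapWithConcurrency` is a hypothetical name, and the worker stands in for whatever per-page extraction call you make.

```typescript
// Sketch: run an async worker over items with a cap on in-flight calls.
// Results come back in input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all lanes

  // Each lane pulls the next unclaimed index until none are left.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let n = 0; n < laneCount; n++) lanes.push(runLane());

  // If any worker rejects, the whole call rejects with that error.
  await Promise.all(lanes);
  return results;
}
```

A call such as `mapWithConcurrency(pages, 4, (page) => extractPage(page))`, where `extractPage` is your own function, keeps at most four requests open at once.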
Hallucination still applies. A vision model can confidently misread a smudged digit or invent a plausible field. This is why the schema-plus-validation-plus-human-review loop is non-negotiable for anything financial or legal. The model gets you 95% of the way for a fraction of the old cost — your job is to engineer the last 5% honestly.
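Beyond shape validation, a cheap arithmetic cross-check catches many misread digits. The sketch below flags invoices whose line items do not add up to the stated total. It assumes the total equals the sum of `qty * unit_price`, which tax, shipping, or discounts will break, so a mismatch here means "route to human review", not "the model is wrong". The function name and tolerance are hypothetical.

```typescript
// Sketch: flag invoices whose line items do not add up to the stated total.
// Assumes total = sum(qty * unit_price); tax, shipping, or discounts break
// that assumption, so a mismatch is a review flag, not proof of an error.

interface LineItem { desc: string; qty: number; unit_price: number }

function reviewReasons(
  lineItems: LineItem[],
  total: number | null,
  toleranceCents = 1
): string[] {
  const reasons: string[] = [];
  if (total === null) reasons.push("total is missing");
  if (lineItems.length === 0) reasons.push("no line items extracted");

  if (total !== null && lineItems.length > 0) {
    // Work in integer cents to avoid floating-point drift.
    const sumCents = lineItems.reduce(
      (acc, li) => acc + Math.round(li.qty * li.unit_price * 100),
      0
    );
    const diff = Math.abs(sumCents - Math.round(total * 100));
    if (diff > toleranceCents) {
      reasons.push(
        `line items sum to ${(sumCents / 100).toFixed(2)}, total says ${total.toFixed(2)}`
      );
    }
  }
  return reasons;
}
```

An empty result means the document passed this particular check; any non-empty result is a reason to queue it for a person.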
Picking a model family
The providers are close and they leapfrog each other constantly, so I won't hand you fake benchmark numbers. Qualitatively: all three frontier families — Claude, GPT, Gemini — are strong at image and document understanding, and all are improving fast. Gemini has historically pushed hard on long video and large context. Claude is my default for document extraction where I want careful, instruction-following structured output. GPT is a strong generalist. The honest answer is that for most tasks any of them will work, and the right call depends on your existing stack, your latency budget, and a quick bake-off on your documents. I walk through how I actually choose in Claude vs GPT vs Gemini for builders.
Run the bake-off on your own data. Twenty real documents from your domain will tell you more than any leaderboard.
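Scoring a bake-off can be as simple as per-field accuracy against hand-labeled answers. This is a minimal sketch: `scoreByField` is a hypothetical helper, and exact match after JSON serialization is a crude comparison that assumes you have already normalized dates, whitespace, and number formats.

```typescript
// Sketch: score a model's extractions against hand-labeled answers, field by
// field. Exact match after JSON serialization is a deliberate simplification.

type Fields = Record<string, unknown>;

interface FieldScore { correct: number; total: number; accuracy: number }

function scoreByField(labeled: Fields[], predicted: Fields[]): Record<string, FieldScore> {
  if (labeled.length !== predicted.length) {
    throw new Error("labeled and predicted must align one-to-one");
  }
  const scores: Record<string, FieldScore> = {};
  labeled.forEach((truth, i) => {
    for (const key of Object.keys(truth)) {
      if (!scores[key]) scores[key] = { correct: 0, total: 0, accuracy: 0 };
      scores[key].total += 1;
      // A missing predicted field serializes to undefined and never matches.
      if (JSON.stringify(truth[key]) === JSON.stringify(predicted[i][key])) {
        scores[key].correct += 1;
      }
    }
  });
  for (const key of Object.keys(scores)) {
    scores[key].accuracy = scores[key].correct / scores[key].total;
  }
  return scores;
}
```

Run it once per candidate model over the same labeled documents and compare the per-field numbers; the fields where models disagree are usually the ones worth a closer look.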
The takeaway
Multimodal understanding is not a novelty; it's a quiet replacement for a stack of fragile tooling you were probably maintaining by hand. Treat it like any other production capability: constrain the output to a schema, validate it, right-size the model, control your image resolution to control cost, and keep a human on the high-stakes path. Do that and you'll ship in an afternoon features that would have taken a quarter of engineering work two years ago.
FAQ
Do I need a special model for images, or can a normal LLM read them?
The frontier chat models from Anthropic, OpenAI, and Google are already multimodal: you pass an image in the same request as your text prompt. You only reach for a dedicated vision or OCR model when you have a narrow, high-volume task where a smaller specialist is cheaper.
Can these models generate images and audio, or only understand them?
Understanding and generation are usually separate models. The big chat LLMs are strong at understanding images, documents, and increasingly audio. Generating images or speech typically means calling a different model built for that, even from the same provider.
How accurate is document extraction with vision models?
Good enough to replace brittle OCR pipelines for most semi-structured documents, but not perfect. Always extract into a defined schema, validate the output, and keep a human in the loop for anything that touches money or legal commitments.
What drives the cost of multimodal requests?
Images are billed as tokens based on their resolution, so a full-page scan can cost as much as a few thousand words of text. Downscale images to the smallest size that keeps the detail you need, and you cut cost dramatically.
Should I send a PDF directly or convert pages to images first?
Several models accept PDFs natively now, which is the simplest path. If a model chokes on a big or messy PDF, rasterize each page to an image and send them as a sequence; you get more control over resolution and ordering.