llms.txt, robots, and AI Crawlers: The Technical AEO Setup
The copy-pasteable technical setup for AI answer engines: llms.txt, robots.txt crawler rules, JSON-LD schema, canonicals, and freshness signals.
Most "AEO" advice stops at "write good content." Fine, but there is a technical layer underneath that decides whether an AI crawler can even read, trust, and cite your good content. This is that layer: the files and tags you control directly. None of it is exotic. It is llms.txt, robots.txt, JSON-LD schema, canonicals, and sitemaps wired together so an answer engine has the easiest possible time understanding your site.
I build this for a living. The pattern below is the same skeleton I drop onto every site I touch before I worry about a single word of body copy. Get the plumbing right once and every page you publish after that inherits it. For the strategy that sits on top of this plumbing, read the AEO playbook. This piece is the wrench-and-pipe version.
What llms.txt actually is
llms.txt is a plain Markdown file you put at the root of your domain, at /llms.txt. The idea is simple: HTML pages are noisy for a model. Navigation, cookie banners, related-post widgets, ad slots, and tracking scripts all dilute the actual content. llms.txt is a clean, curated map that says "here is what this site is, and here are the canonical pages worth reading, in Markdown, with short descriptions."
It is a convention, not a standard with teeth. No answer engine is obligated to read it, and I have not seen evidence that any of them rank you higher for having one. So why ship it? Because it is cheap, it documents your own information architecture, and it is exactly the kind of low-friction artifact that gets used more as agents proliferate. When an LLM-powered tool wants the short version of your site, you have handed it the short version instead of making it scrape and guess.
Here is a realistic structure. Keep it human-readable and link to Markdown versions of pages where you have them:
# Matt Goren — AI Knowledge Hub
> Practical guides on answer-engine optimization, building with LLMs,
> and running AI content engines. Operator voice, no fluff.
## Core guides
- [AEO Playbook](/ai/answer-engine-optimization-playbook.md): How to get cited by AI search engines.
- [Get Cited by AI Search](/ai/get-cited-by-ai-search.md): The citation mechanics, step by step.
- [Building with LLMs Field Guide](/ai/building-with-llms-field-guide.md): Architecture for real LLM apps.
- [Prompt Engineering for Production](/ai/prompt-engineering-for-production.md): Prompts as specifications.
## About
- [About Matt](/about.md): Who I am and what I build.
## Optional
- [Full sitemap](/sitemap.xml)
A few rules I follow. Keep the intro blockquote tight, because that line is the most likely thing a model lifts as a description of your whole site. List your genuinely best pages, not all of them; this is a curated index, not a dump. If you can serve a .md version of each page (many static-site setups can), link to those, because that is the whole point: clean text. And keep it current. A stale llms.txt that points at dead pages is worse than none.
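If your page list already lives in a build script, you can generate the file instead of hand-editing it, which helps with the "keep it current" rule. Below is a minimal sketch, assuming a hypothetical `Page` record and `render_llms_txt` helper (neither is part of any llms.txt tooling), that renders the same shape as the example above: a title, a blockquote summary, then sections of links.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record for one curated page."""
    title: str
    path: str         # site-relative path to the Markdown version
    description: str  # one-line description shown after the link


def render_llms_txt(site_title: str, summary: str, sections: dict[str, list[Page]]) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 sections of links."""
    lines = [f"# {site_title}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        for page in pages:
            lines.append(f"- [{page.title}]({page.path}): {page.description}")
        lines.append("")
    # Exactly one trailing newline, no trailing blank lines.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    sections = {
        "Core guides": [
            Page("AEO Playbook", "/ai/answer-engine-optimization-playbook.md",
                 "How to get cited by AI search engines."),
        ],
        "About": [Page("About Matt", "/about.md", "Who I am and what I build.")],
    }
    print(render_llms_txt("Example Site", "Practical guides, no fluff.", sections))
```

Feeding this from the same source of truth as your sitemap is one way to keep the file from pointing at dead pages.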
How AI crawlers work, and how to allow or deny them
There are two different jobs an AI company's crawler can do, and conflating them is where people lock themselves out of citations by accident.
The first job is training: pulling content to help train or update a model. The second is retrieval (sometimes called live fetch or browsing): grabbing a page right now because a user asked a question and the system wants a current source to cite. These often use different user-agent names. If you block everything, you stay out of training data, but you also lose the retrieval fetch that would have cited you in a live answer. Decide deliberately.
Here are the user-agents worth knowing as of early 2026:
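- GPTBot: OpenAI's training crawler.
- OAI-SearchBot: OpenAI's search/retrieval fetcher (the one tied to being surfaced in answers).
- ChatGPT-User: fires when a ChatGPT user action triggers a live fetch.
- ClaudeBot: Anthropic's crawler. Anthropic's documentation describes it as collecting content that may be used for training.
- Claude-User / Claude-SearchBot: Anthropic's user-triggered fetch and search agents. Claude-Web and anthropic-ai appear to be older names that still show up in many robots.txt files.
- PerplexityBot: Perplexity's crawler.
- Google-Extended: Google's token for Gemini/Vertex training. It is a robots.txt control token rather than a separate crawler, and it does NOT affect normal Googlebot indexing.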
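Before deciding who to allow, it helps to see which of these bots already visit you. The sketch below counts hits per token in access-log lines. It assumes the user-agent string appears somewhere in each line (true of common combined log formats), and the "purpose" labels are this article's shorthand rather than official classifications. Google-Extended is left out because it is a robots.txt token and will not show up as a user-agent in logs.

```python
from collections import Counter

# Tokens taken from the list above; labels are informal shorthand.
AI_AGENTS = {
    "GPTBot": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",
    "ClaudeBot": "crawler",
    "PerplexityBot": "crawler",
}


def tally_ai_hits(log_lines):
    """Count requests per AI user-agent token found anywhere in each log line."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_AGENTS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # at most one token counted per request
    return counts


if __name__ == "__main__":
    # Made-up log lines for illustration; real user-agent strings will differ.
    sample = [
        '203.0.113.5 "GET /ai/ HTTP/1.1" 200 "compatible; GPTBot/1.0"',
        '203.0.113.9 "GET /about HTTP/1.1" 200 "compatible; OAI-SearchBot/1.0"',
        '198.51.100.2 "GET / HTTP/1.1" 200 "Mozilla/5.0 Firefox"',
    ]
    for token, hits in tally_ai_hits(sample).items():
        print(f"{token} ({AI_AGENTS[token]}): {hits}")
```

User-agent strings can be spoofed, so treat these counts as a rough signal, not an audit.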
A robots.txt that allows retrieval and citation while being deliberate about training looks like this:
# Allow normal search
User-agent: Googlebot
Allow: /
# AI answer engines — allow so we can be cited
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
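User-agent: ClaudeBot
Allow: /
# Anthropic's search and user-triggered fetch agents
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /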
# Opt out of model training specifically (your call)
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Everyone else
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Two honest caveats. robots.txt is a request, not a wall; well-behaved bots from major companies respect it, but it does not stop anyone who chooses to ignore it. And bot names change over time, so this list is a snapshot, not scripture. My default recommendation for almost everyone who wants visibility: allow the retrieval bots. The whole point of AEO is being the source an answer engine pulls from. You cannot be cited from a page a crawler was told to skip.
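It is easy to ship a robots.txt that does not say what you think it says, so I like to test the rules before deploying. Here is a minimal sketch using Python's standard-library `urllib.robotparser`. The shortened rules, the example URL, and the `allowed_agents` helper are illustrative assumptions, and the standard-library parser's matching may differ in edge cases from how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the rules above, just for the check.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user-agent, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/ai/some-page",
    ["OAI-SearchBot", "GPTBot", "SomeOtherBot"],
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Running a check like this in CI against your real file catches the common accident of a blanket Disallow swallowing the retrieval bots.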
The schema that helps answer engines
JSON-LD is structured data you embed in a <script type="application/ld+json"> tag. It tells a machine, in unambiguous terms, what a page is, who wrote it, when, and what questions it answers. Answer engines parse this to map your page to entities and to lift clean Q&A pairs. It is some of the highest-leverage markup you can add because it removes guesswork.
Four types cover the bulk of the value:
- Article (or BlogPosting) — identifies the piece, author, publisher, and dates.
- FAQPage — exposes question/answer pairs an engine can quote directly.
- BreadcrumbList — shows where the page sits in your hierarchy.
- Organization — defines you as an entity, with name, logo, and profiles.
Here is a compact Article + FAQ block. Keep the visible page text and the schema in sync; mismatches read as manipulation:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Article",
      "headline": "llms.txt, robots, and AI Crawlers: The Technical AEO Setup",
      "datePublished": "2026-06-25",
      "dateModified": "2026-06-25",
      "author": { "@type": "Person", "name": "Matt Goren" },
      "publisher": {
        "@type": "Organization",
        "name": "Matt Goren",
        "logo": { "@type": "ImageObject", "url": "https://example.com/logo.png" }
      }
    },
    {
      "@type": "FAQPage",
      "mainEntity": [{
        "@type": "Question",
        "name": "Do I need an llms.txt file for AI search?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "No, it is not required and no major answer engine treats it as mandatory. It is a low-cost courtesy file that points crawlers and agents at your best Markdown content. Treat it as a nice-to-have on top of clean HTML, good schema, and a real sitemap."
        }
      }]
    }
  ]
}
</script>
Rules I enforce on myself. Every FAQ in the schema must match a real, visible FAQ on the page word for word; do not invent answers that only exist in the markup. Dates in schema must match the dates you actually changed the content. Author and publisher should be consistent across the site so engines build a stable picture of you as an entity. And validate. A schema typo silently fails, and a silently failing block helps nobody.
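The "validate" and "match the visible FAQ" rules can be partly automated. The sketch below uses only the standard library to pull every JSON-LD block out of a page, fail loudly on invalid JSON, and list schema questions that never appear in the visible text. The helper names are hypothetical, it assumes `mainEntity` is a list as in the block above, and it is no substitute for a full schema validator.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def faq_questions(html: str) -> list[str]:
    """Parse each JSON-LD block (raising on invalid JSON) and return FAQ question names."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    questions = []
    for raw in extractor.blocks:
        data = json.loads(raw)  # json.JSONDecodeError here means a broken block
        nodes = data.get("@graph", [data]) if isinstance(data, dict) else data
        for node in nodes:
            if node.get("@type") == "FAQPage":
                questions += [q["name"] for q in node.get("mainEntity", [])]
    return questions


def missing_from_page(html: str, visible_text: str) -> list[str]:
    """Return schema questions that do not appear verbatim in the visible text."""
    return [q for q in faq_questions(html) if q not in visible_text]
```

An empty list from `missing_from_page` means every question in the markup also exists on the page; the same idea extends to checking answer text.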
Canonicals, sitemaps, and freshness
The last layer is about telling engines which URL is the real one, where they all live, and how fresh each is.
Canonical tags
If the same content is reachable at multiple URLs — with and without a trailing slash, with tracking parameters, paginated variants — pick one canonical and declare it on every variant:
<link rel="canonical" href="https://example.com/ai/llms-txt-and-ai-crawlers" />
This consolidates signals onto one URL so an answer engine cites the right address and does not split your authority across duplicates. Self-reference the canonical on the canonical page too. It is not optional housekeeping; without it you are quietly competing with yourself.
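Picking one canonical works best when the choice is made by a single function rather than by hand on each template. A minimal sketch with `urllib.parse` follows; the tracking-parameter list and the "https, lowercase host, no trailing slash" policy are assumptions to adapt, not a standard.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters; extend it for whatever your analytics adds.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid",
}


def canonical_url(url: str) -> str:
    """Map a URL variant to one form: https, lowercase host, no tracking
    parameters, no trailing slash, no fragment."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", parts.netloc.lower(), path, urlencode(query), ""))


def canonical_tag(url: str) -> str:
    """Render the <link rel="canonical"> tag for any variant of a page URL."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}" />'


print(canonical_tag("http://Example.com/ai/llms-txt-and-ai-crawlers/?utm_source=x#top"))
```

Non-tracking parameters such as a page number are kept on purpose here; whether paginated variants should collapse onto one URL is a per-site decision.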
Sitemaps
Your XML sitemap is the machine-readable list of every canonical URL you want crawled, each with an honest lastmod. Keep it generated automatically from your real content so it never drifts. Reference it from robots.txt (shown above) and submit it in your search console of choice.
<url>
<loc>https://example.com/ai/llms-txt-and-ai-crawlers</loc>
<lastmod>2026-06-25</lastmod>
</url>
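Generating the sitemap from real content data can be as small as this sketch, which uses the standard library's `xml.etree.ElementTree`. The `build_sitemap` helper and its list of (URL, date) pairs are assumptions standing in for whatever your site generator exposes.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, date]]) -> str:
    """Render a sitemap <urlset> from (canonical_url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    ET.indent(urlset)  # pretty-print; requires Python 3.9+
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


print(build_sitemap([
    ("https://example.com/ai/llms-txt-and-ai-crawlers", date(2026, 6, 25)),
]))
```

The important part is where the dates come from: pass in the date the content actually changed, not the build time.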
Freshness signals
Answer engines lean toward current sources, especially for anything time-sensitive. Send freshness through four aligned channels: the lastmod in your sitemap, a visible "Updated [date]" line on the page, dateModified in your Article schema, and — this is the part people skip — actual content changes when you bump those dates. If you touch the date without touching the content, you are teaching crawlers that your dates are noise. Update for real, then update the signals to match.
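One way to keep dates honest is to let the build decide: store a hash of each page's content and only move the modified date when that hash changes. This is a sketch of that idea, not a description of how any crawler evaluates freshness. The function names and the whitespace-collapsing rule are assumptions.

```python
import hashlib
from datetime import date


def content_fingerprint(body: str) -> str:
    """Hash the body with whitespace collapsed, so formatting-only edits do not count."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_modified_date(
    body: str, stored_hash: str, stored_date: date, today: date
) -> tuple[str, date]:
    """Return (hash, date_modified); the date only moves when the content changed."""
    new_hash = content_fingerprint(body)
    if new_hash == stored_hash:
        return stored_hash, stored_date
    return new_hash, today


old_hash = content_fingerprint("Original body text.")
print(next_modified_date("Original  body text.\n", old_hash, date(2026, 1, 10), date(2026, 6, 25)))
print(next_modified_date("Rewritten body text.", old_hash, date(2026, 1, 10), date(2026, 6, 25)))
```

Feed the resulting date into all three machine-visible places (sitemap lastmod, the visible "Updated" line, and dateModified) so they cannot drift apart.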
Wire these five layers together — llms.txt, per-bot robots rules, JSON-LD, canonicals, and honest freshness — and you have removed every technical reason an answer engine might fail to read or trust your page. What is left is the content itself, which is exactly where you want the work to be. From here, go deep on the strategy in the AEO playbook and the citation mechanics in get cited by AI search.
FAQ
Do I need an llms.txt file for AI search?
No, it is not required and no major answer engine treats it as mandatory. It is a low-cost courtesy file that points crawlers and agents at your best Markdown content. Treat it as a nice-to-have on top of clean HTML, good schema, and a real sitemap.
Will blocking GPTBot or ClaudeBot remove me from AI answers?
It can. Blocking a model's training crawler is different from blocking its live retrieval fetcher, and the bot names differ. If you want to be cited in answers, allow the retrieval-oriented bots even if you block training crawlers. Decide per-bot, not all-or-nothing.
What schema types matter most for getting cited?
Article, FAQPage, BreadcrumbList, and Organization cover most of what answer engines parse. They map your page to entities and Q&A pairs the model can lift directly. Add Product or HowTo only when the page actually is that thing.
Does llms.txt replace robots.txt or my sitemap?
No. robots.txt controls crawler access, the sitemap lists canonical URLs with last-modified dates, and llms.txt is an optional curated index for LLMs. They do different jobs. Ship all three rather than picking one.
How do I signal freshness to AI crawlers?
Use accurate lastmod dates in your sitemap, a visible updated date on the page, dateModified in your Article schema, and real content changes when you bump those dates. Faking freshness without changing anything trains crawlers to ignore your dates.