Oleksii Turovskyi

Why Your Site Doesn't Show Up in ChatGPT or Perplexity — and How to Fix It in robots.txt

8 min read

The audit comes back red. Three or more AI crawlers blocked. Or — worse — a global Disallow: / under User-agent: * quietly sitting in the root of your robots.txt, untouched since someone "ported it over" from a 2018 staging file. The result is the same regardless of cause: Perplexity won't read you, ChatGPT can't see you, Claude routes around your domain. The AI traffic that's growing fastest in 2026 walks right past your brand.

Let's break the problem down — and fix it.

1. AI bots are not Googlebot. Different infrastructure, different job.

A classic SEO crawler and an AI crawler look similar in the logs but solve very different problems. Googlebot builds a search index. AI bots build something else: training corpora for models, real-time answer indexes, or fetchers that execute one specific user query in one specific session.

The big conceptual mistake is thinking "I allowed Googlebot, so I'm fine." Not fine. Modern AI vendors (OpenAI, Anthropic) have split their crawling infrastructure into three distinct categories, each with its own User-agent string:

  1. Training crawlers: GPTBot, ClaudeBot, CCBot, Google-Extended. They collect content for model training. Blocking them prevents your content from reaching future model versions but does not affect real-time citation in answers.
  2. Search indexers: OAI-SearchBot, Claude-SearchBot, PerplexityBot. These are your ticket into the answer. OpenAI explicitly tells publishers that sites blocking OAI-SearchBot will not appear in ChatGPT search results, even if regular navigation links are still allowed.
  3. User-triggered fetchers: ChatGPT-User, Claude-User, Perplexity-User. They visit a page when a real user makes a specific request. OpenAI and Perplexity note that for user-initiated fetches, robots.txt rules may not be applied in the standard way — that's a separate discussion about server-side controls.

Bottom line: "block all AI" is no longer a strategy. It's a matrix of decisions.
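That decision matrix can be sketched as a small lookup table. The bot names come from the three categories above; the grouping helper itself is illustrative, not any vendor's API:

```python
# Bot → role mapping, following the three categories above.
BOT_ROLES = {
    # 1. Training crawlers: feed future model versions
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Google-Extended": "training",
    # 2. Search indexers: your ticket into the answer
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    # 3. User-triggered fetchers: one real user, one request
    "ChatGPT-User": "fetcher",
    "Claude-User": "fetcher",
    "Perplexity-User": "fetcher",
}

def bots_with_role(role: str) -> list[str]:
    """Return every known bot in one category, e.g. all search indexers."""
    return sorted(b for b, r in BOT_ROLES.items() if r == role)
```

Each of the three strategies below is just a different Allow/Disallow policy applied per role rather than per bot.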

2. Two reasons your site is falling out of AI answers

Reason A: a global Disallow: / under the wildcard

The classic trap. Someone wrote robots.txt years ago, added explicit rules for Googlebot and Bingbot, and set the rest as Disallow:

robots.txt — DO NOT DO THIS (anti-pattern)
User-agent: Googlebot
Allow: /
 
User-agent: Bingbot
Allow: /
 
User-agent: *
Disallow: /

Looks logical. Works catastrophically. Every AI agent not named explicitly falls under the wildcard Disallow — meaning ChatGPT Search, Claude, Perplexity, and Apple Intelligence are all effectively blocked. You optimized your site for two 2010s-era search engines and cut yourself off from five 2020s AI ecosystems.
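You can reproduce the fallthrough with Python's standard-library parser. This is a quick sanity check, not how any vendor's crawler is implemented:

```python
from urllib.robotparser import RobotFileParser

# The anti-pattern file from above.
ANTI_PATTERN = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ANTI_PATTERN.splitlines())

# Explicitly listed bots are fine...
print(rp.can_fetch("Googlebot", "https://example.com/pricing"))      # True
# ...but every AI agent falls through to the wildcard Disallow.
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))         # False
print(rp.can_fetch("PerplexityBot", "https://example.com/pricing"))  # False
```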

Reason B: explicitly blocking 3+ AI bots out of inertia

In 2023–2024, many brands added Disallow: / for GPTBot and CCBot as a reaction to the scraping discourse. At the time it felt cautious. Today it's a self-inflicted wound.

robots.txt — 2023 SCANDAL REACTION (also an anti-pattern)
User-agent: GPTBot
Disallow: /
 
User-agent: CCBot
Disallow: /
 
User-agent: ClaudeBot
Disallow: /
 
User-agent: PerplexityBot
Disallow: /

Why is this a fail? Because PerplexityBot is your only path into Perplexity. And ClaudeBot was the catch-all for both search and training until Anthropic officially split its crawling into separate agents (ClaudeBot, Claude-SearchBot, Claude-User); if you still have an old Disallow rule for ClaudeBot, that's no longer "opt out of training" — it's a loss of visibility in Claude's web tool.

3. Three working robots.txt templates with pros/cons

Pick one. Don't mix.

Strategy 1: maximum AI visibility

Recommended for marketing sites, blogs, and documentation.

robots.txt — Strategy 1: maximum visibility
# Fits: content marketing, blog, product docs, media
 
# --- OpenAI / ChatGPT ---
User-agent: GPTBot
Allow: /
 
User-agent: OAI-SearchBot
Allow: /
 
User-agent: ChatGPT-User
Allow: /
 
# --- Anthropic / Claude ---
User-agent: ClaudeBot
Allow: /
 
User-agent: Claude-SearchBot
Allow: /
 
User-agent: Claude-User
Allow: /
 
# --- Perplexity ---
User-agent: PerplexityBot
Allow: /
 
User-agent: Perplexity-User
Allow: /
 
# --- Common Crawl (used by many LLMs) ---
User-agent: CCBot
Allow: /
 
# --- Google Gemini / AI Overviews ---
User-agent: Google-Extended
Allow: /
 
# --- Apple Intelligence ---
User-agent: Applebot
Allow: /
 
User-agent: Applebot-Extended
Allow: /
 
# --- Classic search engines ---
User-agent: Googlebot
Allow: /
 
User-agent: Bingbot
Allow: /
 
# --- Default ---
User-agent: *
Allow: /
 
Sitemap: https://example.com/sitemap.xml

Pros: maximum visibility across every AI ecosystem. Highest chance of citation.

Cons: your content does feed GPT and Claude training corpora. If you have regulated data or paid content, this isn't for you.

Strategy 2: differentiated (allow search, block training) — the industry default

This is the configuration most brands settle on. Logic: we want to appear in answers, but we don't want our content fed into someone else's model.

robots.txt — Strategy 2: differentiated
# Fits: B2B SaaS, edtech, publishing brands with proprietary content
 
# --- OpenAI: allow search, block training ---
User-agent: GPTBot
Disallow: /
 
User-agent: OAI-SearchBot
Allow: /
 
User-agent: ChatGPT-User
Allow: /
 
# --- Anthropic: allow search, block training ---
User-agent: ClaudeBot
Disallow: /
 
User-agent: Claude-SearchBot
Allow: /
 
User-agent: Claude-User
Allow: /
 
# --- Perplexity: allow (no separate training crawler exists) ---
User-agent: PerplexityBot
Allow: /
 
User-agent: Perplexity-User
Allow: /
 
# --- Common Crawl: block (indirect training pipeline) ---
User-agent: CCBot
Disallow: /
 
# --- Google: block training token, keep search ---
User-agent: Google-Extended
Disallow: /
 
User-agent: Googlebot
Allow: /
 
# --- Apple: block training, keep search ---
User-agent: Applebot-Extended
Disallow: /
 
User-agent: Applebot
Allow: /
 
# --- Default ---
User-agent: *
Allow: /
 
Sitemap: https://example.com/sitemap.xml

Pros: you stay in ChatGPT, Claude, Perplexity, and Gemini answers. Your content does not go into training. Aligned with where the industry is settling.

Cons: OpenAI documents that GPTBot and OAI-SearchBot share data to avoid duplicate crawls when both are allowed. If you block GPTBot, OAI-SearchBot crawls independently — meaning your effective crawl budget on OpenAI's side is lower.

Strategy 3: hybrid with protected sections (e-commerce, portals)

robots.txt — Strategy 3: hybrid
# Fits: e-commerce, SaaS portals, sites with member areas
 
User-agent: GPTBot
Disallow: /account/
Disallow: /checkout/
Disallow: /cart/
Disallow: /api/
Disallow: /admin/
Allow: /
 
# ... apply the same pattern to OAI-SearchBot, ClaudeBot,
# Claude-SearchBot, PerplexityBot, and *.
 
Sitemap: https://example.com/sitemap.xml

Pros: marketing, product, and content pages are visible to AI. Sensitive URLs are not.

Cons: longer file, harder to maintain, easy to forget a rule for a new bot.
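One way to keep Strategy 3 maintainable is to generate the file instead of hand-editing it, so a new bot never gets a half-copied block. A minimal sketch, using the bot list and protected paths from the example above; `render_robots` is a hypothetical helper, not an existing tool:

```python
PROTECTED = ["/account/", "/checkout/", "/cart/", "/api/", "/admin/"]
BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot",
        "PerplexityBot", "*"]

def render_robots(bots: list[str], protected: list[str], sitemap: str) -> str:
    """Emit one identical block per bot so no agent gets forgotten."""
    blocks = []
    for bot in bots:
        lines = [f"User-agent: {bot}"]
        lines += [f"Disallow: {path}" for path in protected]
        lines.append("Allow: /")
        blocks.append("\n".join(lines))
    blocks.append(f"Sitemap: {sitemap}")
    return "\n\n".join(blocks) + "\n"

print(render_robots(BOTS, PROTECTED, "https://example.com/sitemap.xml"))
```

Adding a new bot then means appending one name to `BOTS` and regenerating, instead of pasting five Disallow lines by hand.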

4. "Content is open" ≠ "the model can read it"

This is the crucial distinction even experienced marketers miss.

Imagine https://example.com/article-x returns 200 OK, renders in a browser, sits in your sitemap, is indexed by Googlebot, and scores perfectly on Lighthouse. Looks fully open. But if your robots.txt has User-agent: PerplexityBot followed by Disallow: /, then PerplexityBot will physically never make an HTTP request to that page. Not out of misplaced civility — bots that respect robots.txt filter URLs at the crawl-planning stage. The page never loads, the HTML never gets parsed, the content never gets vectorized, and your text never makes it into the model's answer.
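That crawl-planning filter fits in a few lines: a polite crawler drops disallowed URLs before any socket is opened. This is an illustrative sketch, not any vendor's actual pipeline:

```python
from urllib.robotparser import RobotFileParser

# The explicit block described above.
ROBOTS = """\
User-agent: PerplexityBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

queue = [
    "https://example.com/article-x",
    "https://example.com/pricing",
]

# URLs failing can_fetch() are dropped here, at planning time.
# No HTTP request is ever issued for them.
fetchable = [u for u in queue if rp.can_fetch("PerplexityBot", u)]
print(fetchable)  # []
```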

Anthropic, OpenAI, and Perplexity all state that their official bots respect robots.txt. This isn't marketing. This is the engineering reality: the bot fetches robots.txt first, parses it, and any URL under a Disallow rule is dropped from the queue.

Four practical consequences people forget:

  1. OpenGraph, schema.org, hreflang, and every other on-page SEO signal are irrelevant if the AI bot doesn't fetch the page in the first place. You're optimizing something nothing reads.
  2. The sitemap is not a backdoor. If a URL is in the sitemap but blocked in robots.txt, robots.txt wins.
  3. HTTPS, performance, Core Web Vitals — all irrelevant to a bot that got "forbidden" before connecting.
  4. Common Crawl archives from past years can still be valuable to a model, but if you block CCBot today, future model versions will not refresh their knowledge of your site.

One more uncomfortable technical detail: none of OpenAI's crawlers execute JavaScript — they fetch .js files but don't run them. If your content renders client-side (CSR-only React/Vue with no SSR), even a perfect robots.txt leaves you invisible to AI. That's not a robots.txt problem — it's a rendering architecture problem. Next.js App Router with Server Components solves it by default; a pure Vite SPA does not.
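A quick way to test your rendering architecture is to fetch the raw HTML exactly as a non-JS crawler would and look for a phrase that should be in the content. `server_rendered` is an illustrative helper; it assumes the page is publicly reachable:

```python
import urllib.request

def server_rendered(url: str, marker: str) -> bool:
    """Fetch raw HTML with no JavaScript execution and check for a phrase.

    If the phrase only appears after client-side rendering, this returns
    False, and an AI crawler will not see it either.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return marker in html

# Usage: server_rendered("https://example.com/article-x", "your headline here")
```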

5. Verification and deployment: 4 concrete steps

  1. Audit current state. Open https://yourdomain.com/robots.txt directly. Scan every User-agent block. Look for: a global Disallow: /, explicit blocks for GPTBot/ClaudeBot/PerplexityBot/CCBot, deprecated identifiers (anthropic-ai, claude-web are both retired).
  2. Pick a strategy from the three above. Don't improvise. Mixing them produces conflicting directives whose precedence is hard to predict.
  3. Deploy the new robots.txt at the domain root (literally /robots.txt, not /static/robots.txt or /public/robots.txt). Verify with curl -I https://yourdomain.com/robots.txt — must return 200 OK with Content-Type: text/plain.
  4. Re-indexing latency. OpenAI documents that robots.txt updates can take roughly 24 hours to apply across their systems. Don't expect an instant change — let the cycle complete.
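Step 1 can be scripted. A sketch that parses a robots.txt body and reports which AI agents it blocks; the bot list mirrors the agents discussed above, and you would feed it the text of your own file:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Google-Extended", "Applebot-Extended",
]

def blocked_ai_bots(robots_txt: str,
                    probe_url: str = "https://example.com/") -> list[str]:
    """Return every AI agent that cannot fetch probe_url under this file."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not rp.can_fetch(bot, probe_url)]

# The Reason A anti-pattern: every AI bot falls under the wildcard.
sample = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""
print(blocked_ai_bots(sample))
```

An empty list from your production file means the robots.txt layer is clean; anything else maps directly back to the strategies in section 3.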

Separate check for Cloudflare/Akamai/WAF: make sure your bot-management rules don't block AI agents at the network layer. Allow: / in robots.txt doesn't help if the WAF returns 403 before the file is ever read.
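To probe the WAF layer rather than robots.txt, send a request that identifies itself as an AI agent and compare status codes. A sketch; thorough verification would also whitelist by the vendor's published IP ranges, since the User-Agent string alone proves nothing:

```python
import urllib.error
import urllib.request

def fetch_status(url: str, user_agent: str) -> int:
    """Return the HTTP status the server gives this User-Agent string."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

# Usage: if a browser UA gets 200 but an AI UA gets 403, the block is
# in the WAF or bot-management layer, not in robots.txt.
# fetch_status("https://example.com/", "Mozilla/5.0")
# fetch_status("https://example.com/", "GPTBot/1.2")
```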


Got questions or need help?

Follow me on LinkedIn for more AEO architecture write-ups. Need a deep audit of your platform or SSR configuration tuned for AI crawlers? Get in touch and we'll work through your case.