---
title: Why Your Site Doesn't Show Up in ChatGPT or Perplexity — and How to Fix It in robots.txt
description: Three robots.txt strategies for AI crawlers, and the two configuration mistakes that quietly cut you out of ChatGPT, Claude, and Perplexity answers.
date: 2026-05-02
tags: [aeo, seo, robots, ai]
---

The audit comes back red. Three or more AI crawlers blocked. Or — worse — a global `Disallow: /` under `User-agent: *` quietly sitting in the root of your `robots.txt`, untouched since someone "ported it over" from a 2018 staging file. The result is the same regardless of cause: Perplexity won't read you, ChatGPT can't see you, Claude routes around your domain. The AI traffic that's growing fastest in 2026 walks right past your brand.

Let's break the problem down — and fix it.

## 1. AI bots are not Googlebot. Different infrastructure, different job.

A classic SEO crawler and an AI crawler look similar in the logs but solve very different problems. Googlebot builds a search index. AI bots build something else: training corpora for models, real-time answer indexes, or fetchers that execute one specific user query in one specific session.

The big conceptual mistake is thinking *"I allowed Googlebot, so I'm fine."* Not fine. Modern AI vendors (OpenAI, Anthropic) have split their crawling infrastructure into three distinct categories, each with its own `User-agent` string:

1. **Training crawlers** — `GPTBot`, `ClaudeBot`, `CCBot`, `Google-Extended`. They collect content for model training. Blocking them prevents your content from reaching future model versions but **does not affect** real-time citation in answers.
2. **Search indexers** — `OAI-SearchBot`, `Claude-SearchBot`, `PerplexityBot`. These are your ticket into the answer. OpenAI explicitly tells publishers that sites blocking `OAI-SearchBot` will not appear in ChatGPT search results, even if regular navigation links are still allowed.
3. **User-triggered fetchers** — `ChatGPT-User`, `Claude-User`, `Perplexity-User`. They visit a page when a real user makes a specific request. OpenAI and Perplexity note that for user-initiated fetches, robots.txt rules may not be applied in the standard way — that's a separate discussion about server-side controls.

Bottom line: *"block all AI"* is no longer a strategy. It's a matrix of decisions.
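
For reference, here's that matrix as a small data structure you can drop into a script. A sketch mirroring the list above; `None` marks roles for which this post names no agent:

```python title="crawler_matrix.py (reference sketch)"
# The vendor/role matrix described above, as a reusable constant.
# Agent names are the documented user-agent tokens covered in this post.
CRAWLER_MATRIX = {
    "OpenAI":       {"training": "GPTBot",            "search": "OAI-SearchBot",    "fetcher": "ChatGPT-User"},
    "Anthropic":    {"training": "ClaudeBot",         "search": "Claude-SearchBot", "fetcher": "Claude-User"},
    "Perplexity":   {"training": None,                "search": "PerplexityBot",    "fetcher": "Perplexity-User"},
    "Google":       {"training": "Google-Extended",   "search": "Googlebot",        "fetcher": None},
    "Apple":        {"training": "Applebot-Extended", "search": "Applebot",         "fetcher": None},
    "Common Crawl": {"training": "CCBot",             "search": None,               "fetcher": None},
}
```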

## 2. Two reasons your site is falling out of AI answers

### Reason A: a global `Disallow: /` under wildcard

The classic trap. Someone wrote `robots.txt` years ago, added explicit rules for Googlebot and Bingbot, and set the rest as `Disallow`:

```text title="robots.txt — DO NOT DO THIS (anti-pattern)"
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
```

Looks logical. Works catastrophically. Every AI agent not named explicitly falls under the wildcard `Disallow`, which means ChatGPT Search, Claude, Perplexity, and Apple Intelligence are all effectively blocked. You optimized your site for two 2010s-era search engines and cut yourself off from every major AI ecosystem of the 2020s.
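
You don't have to deploy anything to see the fallback behavior. Here's a minimal sketch using Python's standard `urllib.robotparser`, with the anti-pattern file above pasted in as a string; any agent without its own group inherits the wildcard block:

```python title="wildcard_demo.py (sketch)"
from urllib.robotparser import RobotFileParser

# The anti-pattern file from above, inlined for the demo.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Agents with their own group get their own rules; everyone else falls
# through to the `*` group and its Disallow: /.
for agent in ["Googlebot", "Bingbot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]:
    verdict = "allowed" if parser.can_fetch(agent, "/article-x") else "BLOCKED"
    print(f"{agent:18} {verdict}")
```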

### Reason B: explicitly blocking 3+ AI bots out of inertia

In 2023–2024, many brands added `Disallow: /` for `GPTBot` and `CCBot` as a reaction to the scraping discourse. At the time it felt cautious. Today it's a self-inflicted wound.

```text title="robots.txt — 2023-ERA BLANKET BLOCK (also an anti-pattern)"
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Why is this a fail? Because `PerplexityBot` is your only path into Perplexity: that one block alone removes you from its answers. The rest of the file is a decision made against a bot landscape that no longer exists. When those rules were written, `ClaudeBot` was Anthropic's catch-all crawler; Anthropic has since officially split its agents into `ClaudeBot` (training), `Claude-SearchBot` (search indexing), and `Claude-User` (user-triggered fetches). A blanket `Disallow: /` for `ClaudeBot` written before that split was never a scoped "opt out of training" policy, and a file that doesn't even mention the newer agents says nothing deliberate about your visibility in Claude's web search.

## 3. Three working `robots.txt` templates with pros/cons

Pick one. Don't mix.

### Strategy 1: maximum AI visibility

Recommended for marketing sites, blogs, and documentation.

```text title="robots.txt — Strategy 1: maximum visibility"
# Fits: content marketing, blog, product docs, media

# --- OpenAI / ChatGPT ---
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# --- Anthropic / Claude ---
User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# --- Perplexity ---
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Common Crawl (used by many LLMs) ---
User-agent: CCBot
Allow: /

# --- Google Gemini / AI Overviews ---
User-agent: Google-Extended
Allow: /

# --- Apple Intelligence ---
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

# --- Classic search engines ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- Default ---
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

**Pros:** maximum visibility across every AI ecosystem. Highest chance of citation.

**Cons:** your content does feed GPT and Claude training corpora. If you have regulated data or paid content, this isn't for you.

### Strategy 2: differentiated (allow search, block training) — the industry default

This is the configuration most brands settle on. Logic: we want to appear in answers, but we don't want our content fed into someone else's model.

```text title="robots.txt — Strategy 2: differentiated"
# Fits: B2B SaaS, edtech, publishing brands with proprietary content

# --- OpenAI: allow search, block training ---
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# --- Anthropic: allow search, block training ---
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# --- Perplexity: allow (no separate training crawler exists) ---
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Common Crawl: block (indirect training pipeline) ---
User-agent: CCBot
Disallow: /

# --- Google: block training token, keep search ---
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

# --- Apple: block training, keep search ---
User-agent: Applebot-Extended
Disallow: /

User-agent: Applebot
Allow: /

# --- Default ---
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

**Pros:** you stay in ChatGPT, Claude, Perplexity, and Gemini answers. Your content does **not** go into training. Aligned with where the industry is settling.

**Cons:** OpenAI documents that `GPTBot` and `OAI-SearchBot` share data to avoid duplicate crawls when both are allowed. If you block `GPTBot`, `OAI-SearchBot` crawls independently — meaning your effective crawl budget on OpenAI's side is lower.

### Strategy 3: hybrid with protected sections (e-commerce, portals)

```text title="robots.txt — Strategy 3: hybrid"
# Fits: e-commerce, SaaS portals, sites with member areas

User-agent: GPTBot
Disallow: /account/
Disallow: /checkout/
Disallow: /cart/
Disallow: /api/
Disallow: /admin/
Allow: /

# ... apply the same pattern to OAI-SearchBot, ClaudeBot,
# Claude-SearchBot, PerplexityBot, and *.

Sitemap: https://example.com/sitemap.xml
```

**Pros:** marketing, product, and content pages are visible to AI. Sensitive URLs are not.

**Cons:** longer file, harder to maintain, easy to forget a rule for a new bot.
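
If the maintenance cost is the main objection, one option is to stop hand-editing the file and generate it in your build step instead. A minimal sketch; the bot list, protected paths, and sitemap URL are illustrative and should be adapted to your own setup:

```python title="generate_robots.py (sketch)"
# Renders the Strategy 3 file so every bot gets identical protected-path
# rules, and adding a new bot is a one-line change instead of copy-paste.
BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Google-Extended", "Applebot", "Applebot-Extended",
    "Googlebot", "Bingbot", "*",
]
PROTECTED_PATHS = ["/account/", "/checkout/", "/cart/", "/api/", "/admin/"]
SITEMAP = "https://example.com/sitemap.xml"


def render_robots_txt() -> str:
    blocks = []
    for bot in BOTS:
        lines = [f"User-agent: {bot}"]
        lines += [f"Disallow: {path}" for path in PROTECTED_PATHS]
        lines.append("Allow: /")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + f"\n\nSitemap: {SITEMAP}\n"


if __name__ == "__main__":
    # Print to stdout; pipe into /robots.txt as part of your deploy.
    print(render_robots_txt())
```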

## 4. "Content is open" ≠ "the model can read it"

This is the crucial distinction even experienced marketers miss.

Imagine `https://example.com/article-x` returns `200 OK`, renders in a browser, sits in your sitemap, is indexed by Googlebot, and scores perfectly on Lighthouse. Looks fully open. But if your `robots.txt` has `User-agent: PerplexityBot` followed by `Disallow: /`, then **PerplexityBot will never make an HTTP request** to that page. This isn't a soft preference the bot might override on a whim: crawlers that respect robots.txt filter URLs at the crawl-planning stage. The page never loads, the HTML never gets parsed, the content never gets vectorized, and your text never makes it into the model's answer.

Anthropic, OpenAI, and Perplexity all state that their official bots respect robots.txt. This isn't marketing. This is the engineering reality: the bot fetches `robots.txt` first, parses it, and any URL under a `Disallow` rule is dropped from the queue.

Four practical consequences people forget:

1. **OpenGraph, schema.org, hreflang, and every other on-page SEO signal are irrelevant** if the AI bot doesn't fetch the page in the first place. You're optimizing something nothing reads.
2. **The sitemap is not a backdoor.** If a URL is in the sitemap but blocked in robots.txt, robots.txt wins.
3. **HTTPS, performance, Core Web Vitals — all irrelevant** to a bot that got "forbidden" before connecting.
4. **Common Crawl archives from past years can still be valuable to a model**, but if you block `CCBot` today, future model versions will not refresh their knowledge of your site.

One more uncomfortable technical detail: none of OpenAI's crawlers execute JavaScript — they fetch `.js` files but don't run them. If your content renders client-side (CSR-only React/Vue with no SSR), even a perfect `robots.txt` leaves you invisible to AI. That's not a robots.txt problem — it's a rendering architecture problem. Next.js App Router with Server Components solves it by default; a pure Vite SPA does not.
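
A quick way to check whether you have this problem: fetch the raw HTML without executing any JavaScript and look for a phrase that only appears in the article body. A minimal sketch; the URL and marker phrase are placeholders for your own page:

```python title="render_check.py (sketch)"
import urllib.request

# Placeholders: use a real article URL and a phrase that appears only in the
# rendered body, not in the <head> metadata.
URL = "https://example.com/article-x"
MARKER = "a distinctive phrase from the article body"

req = urllib.request.Request(URL, headers={"User-Agent": "render-check/0.1"})
with urllib.request.urlopen(req, timeout=10) as resp:
    html = resp.read().decode("utf-8", errors="replace")

if MARKER in html:
    print("Marker found in the initial HTML: crawlers that don't run JS can read this content.")
else:
    print("Marker missing: the content is likely injected client-side and invisible to non-JS crawlers.")
```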

## 5. Verification and deployment: 4 concrete steps

1. **Audit current state.** Open `https://yourdomain.com/robots.txt` directly. Scan every `User-agent` block. Look for: a global `Disallow: /`, explicit blocks for `GPTBot`/`ClaudeBot`/`PerplexityBot`/`CCBot`, deprecated identifiers (`anthropic-ai` and `claude-web` are both retired). A scripted version of this check is sketched after this list.
2. **Pick a strategy** from the three above. Don't improvise. Mixing them produces conflicting groups whose precedence is hard to predict.
3. **Deploy** the new `robots.txt` at the domain root (literally `/robots.txt`, not `/static/robots.txt` or `/public/robots.txt`). Verify with `curl -I https://yourdomain.com/robots.txt` — must return `200 OK` with `Content-Type: text/plain`.
4. **Re-indexing latency.** OpenAI documents that robots.txt updates can take roughly 24 hours to apply across their systems. Don't expect an instant change — let the cycle complete.
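
Here's a scripted version of the step 1 audit, using Python's standard `urllib.robotparser` against your live file. The domain and test path are placeholders:

```python title="audit_robots.py (sketch)"
from urllib.robotparser import RobotFileParser

# Placeholders: point these at your own domain and a representative page.
DOMAIN = "https://example.com"
TEST_PATH = "/"

AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Google-Extended", "Applebot", "Applebot-Extended",
]

parser = RobotFileParser()
parser.set_url(f"{DOMAIN}/robots.txt")
parser.read()  # fetches and parses the live file

for agent in AI_AGENTS:
    verdict = "allowed" if parser.can_fetch(agent, f"{DOMAIN}{TEST_PATH}") else "BLOCKED"
    print(f"{agent:18} {verdict}")
```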

Separate check for Cloudflare/Akamai/WAF: make sure your bot-management rules don't block AI agents at the network layer. `Allow: /` in robots.txt doesn't help if the WAF returns `403` before the file is ever read.
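
To test that layer separately from robots.txt, request a page while presenting AI-bot `User-Agent` strings and watch the status codes. A minimal sketch; the UA strings are simplified placeholders (real crawlers send fuller strings and arrive from published IP ranges, so a UA-only test catches UA-based rules but not IP verification, and some bot-management products deliberately block spoofed crawler UAs):

```python title="waf_check.py (sketch)"
import urllib.request
from urllib.error import HTTPError

URL = "https://example.com/"  # placeholder: use your own homepage or a key page

# Simplified placeholder UA strings; real crawlers send longer ones.
BOT_UAS = {
    "OAI-SearchBot": "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}

for name, ua in BOT_UAS.items():
    req = urllib.request.Request(URL, headers={"User-Agent": ua}, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{name:15} HTTP {resp.status}")
    except HTTPError as err:
        # A 403 here suggests the WAF rejects the UA before robots.txt matters.
        print(f"{name:15} HTTP {err.code}  (blocked at the network layer?)")
```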

---

**Got questions or need help?**

Follow me on [LinkedIn](https://linkedin.com/in/alexturik) for more AEO architecture write-ups. Need a deep audit of your platform or SSR configuration tuned for AI crawlers? [Get in touch](mailto:alexturik@gmail.com) and we'll work through your case.
