Key takeaway
AI crawlers identify themselves through user-agent strings, and keeping those strings current in your robots.txt lets you guide how language models interact with your work. Most LLM-based AI search crawlers announce themselves with a user-agent string: a short bit of text that tells your server “who” is making the request. When you spot GPTBot, ClaudeBot, PerplexityBot, or any of the newer strings below in your server access logs, you know an AI model is indexing, scraping, or quoting your page. Keep your robots.txt file and firewall rules up to date so the right agents can read your content while the wrong ones stay out.
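To make that concrete, here is roughly what a GPTBot request looks like on the wire. The path, host, and version number are illustrative; the exact string varies by crawler release, so treat this as an example rather than the canonical value:
GET /blog/ HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot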
Quick definitions
- AI crawler: A bot that copies public web pages so a large language model can learn from them.
- AI user-agent: The string that identifies that crawler in HTTP requests. You use it in robots.txt rules.
- Robots.txt: A plain-text file at the root of your site that tells crawlers what they may fetch. Add one block for each User-agent you want to allow or block.
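For example, the smallest useful block names one agent and gives it one directive. GPTBot is used here purely as an illustration; this pair of lines blocks it from the entire site:
User-agent: GPTBot
Disallow: /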
Why you should care
Server logs show AI search bots, which now account for a growing share of your brand's impressions. Knowing which agents the frontier LLMs (and a few others) use helps you encourage or discourage that traffic responsibly.
- AI search bots (ChatGPT, Claude, Bing Copilot, and Perplexity) send measurable referral traffic to websites.
- Clear robots.txt rules let helpful agents in and keep abusive scrapers out.
- If you have access to server log files, you can see how often AI/LLM bots are hitting your website so you can create a baseline.
Most AI crawlers can access your content by default. But with how fast this space is moving, it's super helpful to know exactly which crawlers are out there and verify they can actually see your site.
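If you want to put a number on that baseline right away, a quick log one-liner works. This is a sketch that assumes a combined-format access log named access.log; adjust the path and the bot list to match your stack:
grep -Eio "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bingbot|amazonbot|bytespider" access.log | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
The output is a per-bot hit count you can re-run monthly to see whether AI crawler traffic is trending up or down.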
Complete AI crawler list
This list was updated in November 2025; the previous version was from April 2025. A growing number of site owners use the information below in their firewall allow-lists.
<div class="tbl-wrap"><table style="border-radius:12px;"><thead><tr><th>User-agent token</th><th>Vendor</th><th>Bot description</th><th>robots.txt snippet</th></tr></thead><tbody><tr><td><code>GPTBot</code></td><td>OpenAI</td><td>Crawler that collects data for training GPT models. Block this to prevent your content from being used in model training.</td><td><code>User-agent: GPTBot<br>Disallow: /</code></td></tr><tr><td><code>OAI-SearchBot</code></td><td>OpenAI</td><td>Indexes pages for ChatGPT's search and citation features. Used for retrieval-augmented generation.</td><td><code>User-agent: OAI-SearchBot<br>Allow: /</code></td></tr><tr><td><code>ChatGPT-User</code></td><td>OpenAI</td><td>Fetches URLs when a ChatGPT user requests specific pages or when ChatGPT needs to cite sources during conversations.</td><td><code>User-agent: ChatGPT-User<br>Allow: /</code></td></tr><tr><td><code>ChatGPT-User/2.0</code></td><td>OpenAI</td><td>Updated version of ChatGPT-User with enhanced capabilities for on-demand content fetching.</td><td><code>User-agent: ChatGPT-User/2.0<br>Allow: /</code></td></tr><tr><td><code>anthropic-ai</code></td><td>Anthropic</td><td>Collects web data for training Claude models. Primary training data crawler.</td><td><code>User-agent: anthropic-ai<br>Disallow: /</code></td></tr><tr><td><code>ClaudeBot</code></td><td>Anthropic</td><td>Retrieves URLs for citations and real-time information during Claude chat sessions.</td><td><code>User-agent: ClaudeBot<br>Allow: /</code></td></tr><tr><td><code>claude-web</code></td><td>Anthropic</td><td>Undocumented crawler from Anthropic. Purpose unclear but appears to fetch web content for Claude.</td><td><code>User-agent: claude-web<br>Allow: /</code></td></tr><tr><td><code>PerplexityBot</code></td><td>Perplexity</td><td>Indexes websites to build Perplexity AI's search engine database.</td><td><code>User-agent: PerplexityBot<br>Allow: /</code></td></tr><tr><td><code>Perplexity-User</code></td><td>Perplexity</td><td>Fetches pages when users click citations in Perplexity results. Treated as human-triggered traffic.</td><td><code>User-agent: Perplexity-User<br>Allow: /</code></td></tr><tr><td><code>Google-Extended</code></td><td>Google</td><td>Controls access for Gemini AI training. NOTE: This is only a robots.txt token - it uses existing Googlebot user agents, not a separate crawler.</td><td><code>User-agent: Google-Extended<br>Disallow: /</code></td></tr><tr><td><code>Googlebot</code></td><td>Google</td><td>Primary crawler for Google Search indexing.</td><td><code>User-agent: Googlebot<br>Allow: /</code></td></tr><tr><td><code>Bingbot</code></td><td>Microsoft</td><td>Crawler for Bing Search and Bing Chat (Copilot).</td><td><code>User-agent: Bingbot<br>Allow: /</code></td></tr><tr><td><code>Amazonbot</code></td><td>Amazon</td><td>Crawls sites for Alexa, Fire OS AI features, and product recommendations.</td><td><code>User-agent: Amazonbot<br>Allow: /</code></td></tr><tr><td><code>Applebot</code></td><td>Apple</td><td>Indexes content for Siri and Spotlight search.</td><td><code>User-agent: Applebot<br>Allow: /</code></td></tr><tr><td><code>Applebot-Extended</code></td><td>Apple</td><td>Collects data for Apple's AI model training. 
Opt-in only.</td><td><code>User-agent: Applebot-Extended<br>Allow: /</code></td></tr><tr><td><code>FacebookBot</code></td><td>Meta</td><td>Generates link previews for Facebook and Instagram.</td><td><code>User-agent: FacebookBot<br>Allow: /</code></td></tr><tr><td><code>meta-externalagent</code></td><td>Meta</td><td>Backup fetcher for Meta platforms when FacebookBot fails.</td><td><code>User-agent: meta-externalagent<br>Allow: /</code></td></tr><tr><td><code>LinkedInBot</code></td><td>LinkedIn</td><td>Extracts preview data for links shared on LinkedIn.</td><td><code>User-agent: LinkedInBot<br>Allow: /</code></td></tr><tr><td><code>Bytespider</code></td><td>ByteDance</td><td>Powers TikTok search, content recommendations, and AI features across ByteDance products.</td><td><code>User-agent: Bytespider<br>Allow: /</code></td></tr><tr><td><code>DuckAssistBot</code></td><td>DuckDuckGo</td><td>Gathers data for DuckAssist AI answer feature.</td><td><code>User-agent: DuckAssistBot<br>Allow: /</code></td></tr><tr><td><code>cohere-ai</code></td><td>Cohere</td><td>Collects training data for Cohere's language models.</td><td><code>User-agent: cohere-ai<br>Allow: /</code></td></tr><tr><td><code>AI2Bot</code></td><td>Allen Institute</td><td>Academic crawler for Semantic Scholar and AI research projects.</td><td><code>User-agent: AI2Bot<br>Allow: /</code></td></tr><tr><td><code>CCBot</code></td><td>Common Crawl</td><td>Creates open datasets used by many AI projects and researchers.</td><td><code>User-agent: CCBot<br>Allow: /</code></td></tr><tr><td><code>Diffbot</code></td><td>Diffbot</td><td>Converts web pages into structured data for machine learning pipelines.</td><td><code>User-agent: Diffbot<br>Allow: /</code></td></tr><tr><td><code>omgili</code></td><td>Omgili</td><td>Specializes in indexing forums, comments, and discussion boards.</td><td><code>User-agent: omgili<br>Allow: /</code></td></tr><tr><td><code>Timpibot</code></td><td>Timpi</td><td>Decentralized search crawler with lower traffic volume.</td><td><code>User-agent: Timpibot<br>Allow: /</code></td></tr><tr><td><code>YouBot</code></td><td>You.com</td><td>Powers You.com's AI search and browser assistant features.</td><td><code>User-agent: YouBot<br>Allow: /</code></td></tr><tr><td><code>MistralAI-User</code></td><td>Mistral</td><td>Fetches content for citations in Mistral's Le Chat assistant.</td><td><code>User-agent: MistralAI-User<br>Allow: /</code></td></tr><tr><td><code>GoogleAgent-Mariner</code></td><td>Google</td><td>Agentic browser from Google's Project Mariner. Available to AI Ultra subscribers ($249.99/month).</td><td><code>User-agent: GoogleAgent-Mariner<br>Allow: /</code></td></tr><tr><td>Standard Chrome UA</td><td>OpenAI</td><td>ChatGPT Atlas browser - uses identical user agent to Chrome, making it indistinguishable from regular browsers. Cannot be blocked via robots.txt user agent alone.</td><td>N/A - Use IP blocking</td></tr><tr><td>No known UA</td><td>xAI</td><td>Grok crawler - Documented user agents (GrokBot, xAI-Grok, Grok-DeepSearch) are rarely seen. Grok confirmed it uses iPhone user-agent strings to avoid blocks.</td><td>Cannot be reliably blocked</td></tr></tbody></table></div>
Robots.txt examples for different AI use cases
The examples below cover two common goals: maximizing discovery and selectively blocking crawlers for privacy. Use them as starting points and build block/allow patterns that align with your own content strategy and business requirements.
# You can paste any of these blocks into robots.txt or a firewall rule. grouped by company to make things readable for y'all.

# ——— OPENAI ———
# Search (shows my webpages as links inside ChatGPT search). NOT used for model training.
User-agent: OAI-SearchBot
Allow: /

# User-driven browsing from ChatGPT and Custom GPTs. Acts after a human click.
User-agent: ChatGPT-User
User-agent: ChatGPT-User/2.0
Allow: /

# Model-training crawler. Opt-out here if I don’t want content in GPT-4o or GPT-5.
User-agent: GPTBot
Disallow: /private/ # example private folder
Allow: / # everything else

# ——— ANTHROPIC (Claude) ———
User-agent: anthropic-ai # bulk model training
Allow: /
User-agent: ClaudeBot # chat citation fetch
User-agent: claude-web # web-focused crawl
Allow: /

# ——— PERPLEXITY ———
User-agent: PerplexityBot # index builder
Allow: /
User-agent: Perplexity-User # human-triggered visit
Allow: /

# ——— GOOGLE (Gemini) ———
User-agent: Google-Extended
Allow: /

# ——— MICROSOFT (Bing / Copilot) ———
User-agent: Bingbot
Allow: /

# ——— AMAZON ———
User-agent: Amazonbot
Allow: /

# ——— APPLE ———
User-agent: Applebot
User-agent: Applebot-Extended
Allow: /

# ——— META ———
User-agent: FacebookBot
User-agent: meta-externalagent
Allow: /

# ——— LINKEDIN ———
User-agent: LinkedInBot
Allow: /

# ——— BYTEDANCE ———
User-agent: Bytespider
Allow: /

# ——— DUCKDUCKGO ———
User-agent: DuckAssistBot
Allow: /

# ——— COHERE ———
User-agent: cohere-ai
Allow: /

# ——— ALLEN INSTITUTE / COMMON CRAWL / OTHER RESEARCH ———
User-agent: AI2Bot
User-agent: CCBot
User-agent: Diffbot
User-agent: omgili
Allow: /

# ——— EMERGING SEARCH START-UPS ———
User-agent: Timpibot
User-agent: YouBot
Allow: /
The example above groups user agents by company and labels each with its purpose (search, user-triggered browsing, or model training) so you can control the buckets that matter to your use case.
Robots.txt best‑practice checklist
- List every AI agent you care about. Use the table above or search your logs for tokens such as gptbot, bingbot, claudebot, perplexitybot, google-extended, amazonbot, and duckassistbot (see the quick check after this list).
- Add a directive after every User-agent: line, at least one Allow or Disallow. A lone name won't do anything.
- Use blank lines between blocks to avoid merge errors.
- Re-test after big LLM releases and updates. New versions sometimes ignore older rules or ship entirely new user-agent names.
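For the quick check mentioned above, you can list which agents your live robots.txt currently names. The domain here is a placeholder; swap in your own:
curl -s https://example.com/robots.txt | grep -iE "^user-agent"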
Testing tips & code snippets
Why test? To confirm that every AI user agent above can (or cannot) reach the website exactly as I intend.
Oh, and this nugget from Jori Ford:
There's 226 crawlers that Cloudflare found out. They found them out because they did IP mining and reverse lookups. Because here's the beautiful new world of AI that you're not going to be used to. AI lies. AI doesn't have any standards. They just want to get what they need. So guess what? You have 226 unknown bots traversing your planet, and you don't know who they are unless you have their IP address. Guess where you can get that? Your log files.

Web-based “all bots” check (UI)
<div class="tbl-wrap"><table style="border-radius:12px;"><thead><tr><th>Tool</th><th>What it checks</th><th>How to use it</th></tr></thead><tbody><tr><td><a href="https://knowatoa.com">Knowatoa AI Search Console</a></td><td>Tests your robots.txt against 24 different AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) and shows which ones you're blocking vs. allowing.</td><td>Enter your URL, run the audit, fix anything marked as blocked that should be allowed (or vice versa).</td></tr><tr><td><a href="https://technicalseo.com/tools/robots-txt/">Merkle robots.txt Tester</a></td><td>Tests individual crawler behavior when you need to verify a specific user-agent string that Knowatoa doesn't cover.</td><td>Paste the user-agent name, check if your robots.txt is allowing or blocking it.</td></tr></tbody></table></div>
Checking server logs for AI crawler activity
Run this on your Nginx or Apache logs to see which AI crawlers have been hitting your website:
grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|google-extended|bingbot" access.log \| awk '{print $1,$4,$7,$12}' | head
You will see hits like:
203.0.113.10 - - [25/Apr/2025:08:14:22 -0600] "GET /blog/ HTTP/1.1" 200 15843 "-" "GPTBot/1.1"
Next: if a bot I expect is absent, either I blocked it in robots.txt or it simply hasn't crawled the site yet. I re-run the Knowatoa audit or the Merkle robots.txt Tester to figure out which it is.
Firewall rules (when robots.txt isn't enough)
Tip: Use firewall rules sparingly; start with robots.txt and escalate only if abuse appears in your logs.
Here's how to navigate there in Cloudflare: Rules → Firewall Rules → Create rule
<div class="tbl-wrap"><table style="border-radius:12px;"><thead><tr><th>Purpose</th><th>Expression</th><th>Action</th></tr></thead><tbody><tr><td>Block GPT model training</td><td>(http.user_agent contains "GPTBot")</td><td>Block</td></tr><tr><td>Allow ChatGPT user traffic</td><td>(http.user_agent contains "ChatGPT-User")</td><td>Allow</td></tr></tbody></table></div>
Nginx
# Block Perplexity indexer but allow user clicks
if ($http_user_agent ~* "PerplexityBot") {
    return 403;
}
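For more than one bot, a map is more idiomatic in Nginx than stacking if blocks. This is a sketch that assumes you can edit the http {} context; the patterns below are examples, so swap in your own block list:
# Put the map in the http {} context.
map $http_user_agent $block_ai_bot {
    default          0;
    ~*GPTBot         1;
    ~*anthropic-ai   1;
    ~*Bytespider     1;
}

server {
    # ... your existing server config ...
    if ($block_ai_bot) {
        return 403;
    }
}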
Robots.txt template for AI bots and crawlers
# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Allow user-triggered agents
User-agent: ChatGPT-User
Allow: /

User-agent: ChatGPT-User/2.0
Allow: /

User-agent: Perplexity-User
Allow: /

# Allow everything else
User-agent: *
Allow: /
Each User-agent: line identifies which crawler the rules apply to. The Allow or Disallow directives that follow tell the bot what it can access. Without at least one directive, the block does nothing.
Two common robots.txt patterns
Allow a bot to crawl everything:
User-agent: GPTBot
Allow: /
Block a bot from everything:
User-agent: PerplexityBot
Disallow: /
You can group several user-agents that share the same rule:
User-agent: ChatGPT-User
User-agent: ChatGPT-User/2.0
Allow: /
Or give a crawler different rules for different paths:
User-agent: ClaudeBot
Allow: /public/
Disallow: /private/
Use blank lines between blocks so the file is easier to read.
Agentic browser crawlers
<div class="tbl-wrap"><table><thead><tr><th>Agent</th><th>User Agent</th><th>Status</th></tr></thead><tbody><tr><td>ChatGPT Atlas</td><td>Standard Chrome user agent string</td><td>Available for Mac (Windows/iOS/Android coming soon). Requires ChatGPT Plus, Pro, or Business. Uses standard Chrome user agent, making it indistinguishable from regular Chrome traffic.</td></tr><tr><td>OpenAI Operator</td><td>No known user agent</td><td>Integrated into ChatGPT as "agent mode" (July 2025). Runs in a remote browser that appears like Chrome.</td></tr><tr><td>Google Project Mariner</td><td>GoogleAgent-Mariner</td><td>Available to AI Ultra subscribers ($249.99/month) in the U.S. Runs on virtual machines in the cloud.</td></tr><tr><td>Anthropic Computer Use</td><td>Claude-Web (separate crawler)</td><td>API capability for Claude 3.5+. Also available as "Claude for Chrome" extension for Max plan users. Uses screenshots to interact with desktops/browsers in virtualized environments.</td></tr><tr><td>xAI Grok</td><td>GrokBot/1.0<br>xAI-Grok/1.0<br>Grok-DeepSearch/1.0</td><td>User agents are documented but webmasters report rarely seeing them in practice. Grok reportedly uses iPhone user-agent strings in some cases.</td></tr></tbody></table></div>
These agents operate differently from traditional crawlers. They use real browsers (or headless browsers) to interact with websites, making them harder to identify and control through standard robots.txt rules alone. Atlas is particularly notable because it's completely indistinguishable from Chrome in server logs. If you need to manage access, use IP-based restrictions or Cloudflare rules in addition to robots.txt.
OpenAI publishes IP ranges for its crawlers, which you can use to build IP-based allow or block rules when the user-agent string alone isn't reliable.
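For crawlers that support it (Googlebot, Bingbot, and Applebot document this method), you can also verify an IP with a reverse-DNS lookup plus a forward confirmation. A minimal sketch; the IP and hostname suffixes below are illustrative, and most AI-only crawlers are better checked against their published IP range files instead:
import socket

def verify_crawler_ip(ip, expected_suffixes):
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm it."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)           # reverse DNS
        if not hostname.endswith(expected_suffixes):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]   # forward confirmation
    except (socket.herror, socket.gaierror):
        return False

# Example: does this IP really belong to Googlebot? (illustrative IP)
print(verify_crawler_ip("66.249.66.1", (".googlebot.com", ".google.com")))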
FAQs
What is an AI crawler in robots.txt?
A bot that requests your pages for model training or AI-powered answers. The User-agent: line tells it what it can access.
Is User-agent: * enough?
Not always. The wildcard catches most crawlers, but some AI bots ignore it. List the specific crawlers you care about.
What's the best open-data web crawler?
I'm not the person to ask about best, but Common Crawl (CCBot) releases monthly snapshots that anyone can access.
What do you mean by "top user agents"?
The crawlers in this guide account for roughly 95% of AI crawler traffic, based on log data from production sites.
Do bots have to follow robots.txt?
No. Most reputable crawlers do, but it's not legally binding. Anthropic got criticized in 2024 for ignoring robots.txt rules. If you need to block something for real, use firewall rules or password-protect the content.
How do AI crawlers fit into my audience funnel?
GPTBot ──► Your Website
                │
                ▼
           LLM Training
                │
                ▼
           Human Prompt
                │
                ▼
      Agent UI (e.g. ChatGPT)
          ╱     │     ╲
         ╱      │      ╲
  Search-GPT    │    Operator (unknown user agent)
      │         │         │
      │         ▼         ▼
      │      Response  ChatGPT-User
      │         ╲         │
      │          ▼        │
      └────────►(merge)◄──┘
                │
                ▼
           Human Click
                │
                ▼
           Your Website
Ending Tip
Even with the correct robots.txt configuration, your web server or firewall might still block AI crawlers. I recommend using Knowatoa's AI Search Console to validate your setup: it'll check your site against 24 different AI user agents and flag any access issues.

Otherwise you can use Merkle's robots.txt tester to audit user agents one-by-one.

<div class="post-note">Questions or missing agents? Let me know! I update this guide whenever new data rolls in. Drop me a comment on LinkedIn if you spot any I've missed!</div>
Originally published: February 7, 2025 | Last updated: November 2025