AI user-agents, bots, and crawlers to watch (April 2025 update)

Posted in: AI Updates
Friday, April 25, 2025
Est. Read Time: 3 minutes

April 2025 list of AI user-agents, with practical robots.txt and auditing tips. LLM/AI crawlers leave their signatures through a user-agent string. I monitor those strings so I can welcome the bots that help me to build my audience and politely ask the ones that don't to leave. Now you can too.

Key takeaway

AI crawlers identify themselves through user-agent strings. Keeping those strings current in your robots.txt lets you guide how language models interact with your work. Most LLM-based AI search crawlers rely on a user-agent string: a short bit of text that tells your server “who” is making the request. When you spot GPTBot, ClaudeBot, PerplexityBot, or any of the newer strings below in your server access logs, you know an AI model is indexing, scraping, or quoting your page. Keep your robots.txt file and firewall rules up to date so the right agents can read your content while the wrong ones stay out.
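
For example, a request from OpenAI's crawler shows up with a header like the one below. This is simplified for illustration; real strings are usually longer, with a Mozilla/AppleWebKit prefix and a vendor URL, and example.com is a placeholder.

GET /blog/ HTTP/1.1
Host: example.com
User-Agent: GPTBot/1.1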

What changed since the March 2025 edition?

<div class="tbl-wrap"><table style="border-radius:12px;"><thead><tr><th>New string</th><th>Vendor / purpose</th><th>Why it matters</th></tr></thead><tbody><tr><td><code>MistralAI-User/1.0</code></td><td>Mistral — fetches citations for Le Chat.</td><td>First seen Mar 2025 and respects robots.txt.</td></tr><tr><td><code>Perplexity-User/1.0</code></td><td>Perplexity — live fetch when a person clicks a link.</td><td>Ignores robots.txt because it counts as a human-triggered visit.</td></tr><tr><td><code>ChatGPT-User/2.0</code></td><td>OpenAI — successor to v1.0 for on-demand page loads.</td><td>Rolling out since Feb 2025; keep rules for 1.x too.</td></tr></tbody></table></div>

New strings appear regularly; consider scheduling a quarterly review with your IT or engineering team.

Quick definitions

  • AI crawler: A bot that copies public web pages so a large language model can learn from them.
  • AI user-agent: The string that identifies that crawler in HTTP requests. You use it in robots.txt rules.
  • Robots.txt: A plain-text file at the root of your site that tells crawlers what they may fetch. Add one line per User-agent you want to allow or block.

Why you should care

Server logs show AI search bots now account for a growing share of referral visits. Understanding which agents they use helps you encourage that traffic responsibly.

  • AI search bots (ChatGPT, Claude, Bing Copilot, and Perplexity) send measurable referral traffic to websites.
  • Clear robots.txt rules let helpful agents in and keep abusive scrapers out.
  • If you have access to server log files, you can see how often AI/LLM bots are hitting your website so you can create a baseline.

Momentic research shows strong growth in referrals from ChatGPT to external websites. In March 2025, ChatGPT sent visitors out at more than double the rate at which Google Search sent users to non-Google properties.

Chart: ChatGPT sends 1.4 visits per unique visitor to external domains; Google sends 0.6.

Most AI crawlers can access your content by default. But with how fast this space is moving, it's super helpful to know exactly which crawlers are out there and verify they can actually see your site.

Complete AI crawler list

I merged every token from my February post with the April 2025 additions. Copy it as you see fit. I use the same order in my firewall allow‑list.

<div class="tbl-wrap"><table style="border-radius:12px;"><thead><tr><th>User-agent token</th><th>Vendor</th><th>Helpful, keyword-optimized description</th><th>robots.txt snippet</th></tr></thead><tbody><tr><td><code>GPTBot/1.1</code></td><td>OpenAI</td><td>Large-scale <strong>gptbot crawler</strong> that collects public text for ChatGPT & GPT-4o; block if you want to keep data out of model training.</td><td><code>User-agent: GPTBot Allow: /</code></td></tr><tr><td><code>OAI-SearchBot/1.0</code></td><td>OpenAI</td><td>Retrieval-augmented <em>openai searchbot</em> indexing pages for enterprise RAG pipelines.</td><td><code>User-agent: OAI-SearchBot Allow: /</code></td></tr><tr><td><code>ChatGPT-User/1.0</code></td><td>OpenAI</td><td>Legacy on-demand fetch when a user pastes a link or the response warrants web results (like research/reasoning).</td><td><code>User-agent: ChatGPT-User Allow: /</code></td></tr><tr><td><code>ChatGPT-User/2.0</code></td><td>OpenAI</td><td>Current version—same role, new token; verify both in your <strong>user agent list 2025</strong>.</td><td><code>User-agent: ChatGPT-User/2.0 Allow: /</code></td></tr><tr><td><code>anthropic-ai/1.0</code></td><td>Anthropic</td><td>Bulk crawl for Claude training—core <em>anthropic web crawler</em>.</td><td><code>User-agent: anthropic-ai Allow: /</code></td></tr><tr><td><code>ClaudeBot/1.0</code></td><td>Anthropic</td><td>Conversation fetcher that grabs cited URLs in real time.</td><td><code>User-agent: ClaudeBot Allow: /</code></td></tr><tr><td><code>claude-web/1.0</code></td><td>Anthropic</td><td>Smaller crawl focused on recent web content for <em>Claude browser agent</em>.</td><td><code>User-agent: claude-web Allow: /</code></td></tr><tr><td><code>PerplexityBot/1.0</code></td><td>Perplexity</td><td>Main <strong>perplexitybot crawler</strong> building the AI search index.</td><td><code>User-agent: PerplexityBot Allow: /</code></td></tr><tr><td><code>Perplexity-User/1.0</code></td><td>Perplexity</td><td>Loads a page only after a user clicks a citation—counts as human traffic, ignores robots.txt.</td><td><code>User-agent: Perplexity-User Allow: /</code></td></tr><tr><td><code>Google-Extended/1.0</code></td><td>Google</td><td>Feeds Gemini; opt-out if you don’t want content in AI answers but keep normal Googlebot indexed.</td><td><code>User-agent: Google-Extended Disallow: /</code></td></tr><tr><td><code>BingBot/1.0</code></td><td>Microsoft</td><td>Standard <em>bing bot user agent</em> powering Bing Search and Copilot AI replies.</td><td><code>User-agent: BingBot Allow: /</code></td></tr><tr><td><code>Amazonbot/0.1</code></td><td>Amazon</td><td>Supports Alexa, FireOS AI and product recommendations.</td><td><code>User-agent: Amazonbot Allow: /</code></td></tr><tr><td><code>Applebot/1.0</code></td><td>Apple</td><td>Improves Siri & Spotlight search results.</td><td><code>User-agent: Applebot Allow: /</code></td></tr><tr><td><code>Applebot-Extended/1.0</code></td><td>Apple</td><td>Explicit opt-in crawler for future Apple AI models.</td><td><code>User-agent: Applebot-Extended Allow: /</code></td></tr><tr><td><code>FacebookBot/1.0</code></td><td>Meta</td><td>Generates share previews for Facebook & Instagram.</td><td><code>User-agent: FacebookBot Allow: /</code></td></tr><tr><td><code>meta-externalagent/1.1</code></td><td>Meta</td><td>Fallback fetcher when FacebookBot fails.</td><td><code>User-agent: meta-externalagent Allow: /</code></td></tr><tr><td><code>LinkedInBot/1.0</code></td><td>LinkedIn</td><td>Pulls Open Graph data for posts and 
messages.</td><td><code>User-agent: LinkedInBot Allow: /</code></td></tr><tr><td><code>Bytespider/1.0</code></td><td>ByteDance</td><td>Feeds TikTok search, CapCut AI captions, and Toutiao headlines.</td><td><code>User-agent: Bytespider Allow: /</code></td></tr><tr><td><code>DuckAssistBot/1.0</code></td><td>DuckDuckGo</td><td>Scrapes factual snippets for private AI answers.</td><td><code>User-agent: DuckAssistBot Allow: /</code></td></tr><tr><td><code>cohere-ai/1.0</code></td><td>Cohere</td><td>Collects text samples to fine-tune Cohere Command models.</td><td><code>User-agent: cohere-ai Allow: /</code></td></tr><tr><td><code>AI2Bot/1.0</code></td><td>Allen Institute</td><td>Academic crawl that powers Semantic Scholar and AI2 research.</td><td><code>User-agent: AI2Bot Allow: /</code></td></tr><tr><td><code>CCBot/1.0</code></td><td>Common Crawl</td><td>Public archive used by many open-source LLMs—opt-out if licensing worries you.</td><td><code>User-agent: CCBot Allow: /</code></td></tr><tr><td><code>Diffbot/0.1</code></td><td>Diffbot</td><td>Extracts structured data (product, article, FAQ) for clients’ ML pipelines.</td><td><code>User-agent: Diffbot Allow: /</code></td></tr><tr><td><code>omgili/1.0</code></td><td>Omgili</td><td>Indexes forums, comments, Reddit-like discussions.</td><td><code>User-agent: omgili Allow: /</code></td></tr><tr><td><code>TimpiBot/0.8</code></td><td>Timpi</td><td>Decentralised search start-up; traffic still small.</td><td><code>User-agent: TimpiBot Allow: /</code></td></tr><tr><td><code>YouBot</code></td><td>You.com</td><td>Crawler for You.com AI search and <em>ai browser agent</em>.</td><td><code>User-agent: YouBot Allow: /</code></td></tr><tr><td><code>MistralAI-User/1.0</code></td><td>Mistral</td><td>Real-time citation fetch for Le Chat; respects robots.txt.</td><td><code>User-agent: MistralAI-User Allow: /</code></td></tr></tbody></table></div>

Robots.txt examples that include every agent above

The examples below illustrate two common approaches—open access for discovery or selective blocking for privacy. Choose the blend that aligns with your content strategy and business requirements.

# You can paste any of these blocks into robots.txt or a firewall rule. Grouped by company to make things readable for y'all.

# ——— OPENAI ———
# Search (shows my webpages as links inside ChatGPT search). NOT used for model training.
User-agent: OAI-SearchBot
Allow: /

# User-driven browsing from ChatGPT and Custom GPTs. Acts after a human click.
User-agent: ChatGPT-User
User-agent: ChatGPT-User/2.0
Allow: /

# Model-training crawler. Opt-out here if I don’t want content in GPT-4o or GPT-5.
User-agent: GPTBot
Disallow: /private/          # example private folder
Allow: /                     # everything else

# ——— ANTHROPIC (Claude) ———
User-agent: anthropic-ai      # bulk model training
Allow: /
User-agent: ClaudeBot         # chat citation fetch
User-agent: claude-web        # web-focused crawl
Allow: /

# ——— PERPLEXITY ———
User-agent: PerplexityBot     # index builder
Allow: /
User-agent: Perplexity-User   # human-triggered visit
Allow: /

# ——— GOOGLE (Gemini) ———
User-agent: Google-Extended
Allow: /

# ——— MICROSOFT (Bing / Copilot) ———
User-agent: BingBot
Allow: /

# ——— AMAZON ———
User-agent: Amazonbot
Allow: /

# ——— APPLE ———
User-agent: Applebot
User-agent: Applebot-Extended
Allow: /

# ——— META ———
User-agent: FacebookBot
User-agent: meta-externalagent
Allow: /

# ——— LINKEDIN ———
User-agent: LinkedInBot
Allow: /

# ——— BYTEDANCE ———
User-agent: Bytespider
Allow: /

# ——— DUCKDUCKGO ———
User-agent: DuckAssistBot
Allow: /

# ——— COHERE ———
User-agent: cohere-ai
Allow: /

# ——— ALLEN INSTITUTE / COMMON CRAWL / OTHER RESEARCH ———
User-agent: AI2Bot
User-agent: CCBot
User-agent: Diffbot
User-agent: omgili
Allow: /

# ——— EMERGING SEARCH START-UPS ———
User-agent: TimpiBot
User-agent: YouBot
Allow: /

Why I care: this grouping mirrors the agents’ purposes (e.g. search, user-action, or model-training) so I can throttle the buckets that matter to my organization.

Robots.txt best‑practice checklist

  1. List every AI agent you care about. Use the table above or search your logs for tokens such as gptbot, bingbot, claudebot, perplexitybot, google-extended, amazonbot, and duckassistbot.
  2. Add a directive after every User-agent: – at least one Allow or Disallow line. A lone name won't do anything (see the before/after sketch just after this list).
  3. Use blank lines between blocks to avoid merge errors.
  4. Re-test after big LLM releases. New versions sometimes ignore older rules or ship entirely new user-agent names.
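
To make point 2 concrete, here is a minimal before/after sketch. The first block names a crawler but gives it nothing to act on; the second adds the required directive.

# Incomplete: names the bot but contains no rule, so it has no effect
User-agent: GPTBot

# Complete: every block ends with at least one Allow or Disallow
User-agent: GPTBot
Disallow: /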

Testing tips & code snippets

Why test? To confirm that every AI user agent above can (or cannot) reach the website exactly as I intend.
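
One quick manual check before reaching for the tools below: request a page while presenting a crawler's token and look at the status code. A sketch only; example.com is a placeholder, and this tests server and firewall behaviour, not robots.txt compliance.

# Fetch headers while identifying as GPTBot; a 200 means the server lets it through,
# while a 403 usually points at a firewall or user-agent rule
curl -I -A "GPTBot/1.1" https://example.com/blog/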

Web-based “all bots” check (UI)

<div class="tbl-wrap"><table style="border-radius:12px;"><thead><tr><th>Tool</th><th>What it checks</th><th>How I use it</th></tr></thead><tbody><tr><td><a href="https://knowatoa.com">Knowatoa AI Search Console</a></td><td>Hits 24 AI user-agents—GPTBot, ClaudeBot, PerplexityBot, etc.—against my robots.txt and server.</td><td>Enter URL → Run Audit → fix any red ✖ rows.</td></tr><tr><td><a href="https://technicalseo.com/tools/robots-txt/">Merkle robots.txt Tester</a></td><td>Single-agent check for edge cases like meta-externalagent.</td><td>Paste the agent string → confirm “Allowed”.</td></tr></tbody></table></div>

How to spot check for AI crawlers in your server logs

Run this shell one‑liner on Nginx or Apache logs:

grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|google-extended|bingbot" access.log | awk '{print $1,$4,$7,$12}' | head

You will see hits like:

203.0.113.10 - - [25/Apr/2025:08:14:22 -0600] "GET /blog/ HTTP/1.1" 200 15843 "-" "GPTBot/1.1"

Why it matters: surfaces hits from every major AI crawler bot in seconds.

Next: if a bot I expect is absent, I re-run the Knowatoa audit or the Merkle manual test to see whether I blocked it. If I didn't, that bot simply hasn't visited my website yet.
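
To turn those spot checks into the baseline mentioned earlier, count hits per token. A rough sketch, assuming the same access.log in the default combined format:

# Count requests per AI user-agent token so week-over-week changes stand out
grep -oEi "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|google-extended|bingbot" access.log \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn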

Firewall template snippets

Tip: Use firewall rules sparingly; start with robots.txt and escalate only if abuse appears in logs.

Cloudflare → Rules → Firewall Rules → Create rule

<div class="tbl-wrap"><table style="border-radius:12px;"><thead><tr><th>Purpose</th><th>Expression</th><th>Action</th></tr></thead><tbody><tr><td>Block GPT model-training only</td><td>(http.user_agent contains "GPTBot")</td><td>Block</td></tr><tr><td>Allow ChatGPT user traffic</td><td>(http.user_agent contains "ChatGPT-User")</td><td>Allow</td></tr></tbody></table></div>

Nginx

# Block Perplexity indexer but keep user clicks
if ($http_user_agent ~* "PerplexityBot") {
    return 403;
}
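
If you need to block several training crawlers at once, a map is easier to maintain than a stack of if blocks. A sketch only; the map goes in the http context and the check inside your server block, and the token list is just an example.

# http context: flag model-training crawlers by user-agent substring
map $http_user_agent $ai_training_bot {
    default         0;
    ~*GPTBot        1;
    ~*anthropic-ai  1;
    ~*CCBot         1;
}

# server context: turn the flag into a 403
if ($ai_training_bot) {
    return 403;
}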

Robots.txt boilerplate you can swap in

# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow AI search crawlers
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /

# Allow user-triggered agents
User-agent: ChatGPT-User
User-agent: ChatGPT-User/2.0
User-agent: Perplexity-User
Allow: /

# Default catch-all
User-agent: *
Disallow: /

A User-agent: line on its own just names the crawler.

Every block must also include at least one directive—Allow or Disallow—so the bot knows what it may fetch.

Two common patterns in robots.txt directives

Let the bot crawl everything

User-agent: GPTBot
Allow: /

Restrict the bot everywhere

User-agent: PerplexityBot
Disallow: /

You can group several user-agents that share the same rule:

User-agent: ChatGPT-User
User-agent: ChatGPT-User/2.0
Allow: /

Or give each crawler its own custom path:

User-agent: ClaudeBot
Allow: /public/
Disallow: /private/

Add a blank line between blocks to keep the file readable.

Emerging “agentic” browser fetchers

<div class="tbl-wrap"><table><thead><tr><th>Agent</th><th>Current ID</th><th>Status</th></tr></thead><tbody><tr><td>OpenAI Operator</td><td>No known user agent</td><td>Public&nbsp;beta. Acts like Chrome.</td></tr><tr><td>Google&nbsp;Project&nbsp;Mariner</td><td>None&nbsp;(mimics&nbsp;Chrome)</td><td>Trusted-tester phase. Watch for <code>mariner</code> token.</td></tr><tr><td>Anthropic&nbsp;Computer&nbsp;Use</td><td>None</td><td>Headless browser driven by Claude&nbsp;3.5.</td></tr><tr><td>xAI&nbsp;Grok&nbsp;crawler</td><td>None&nbsp;yet</td><td>Docs promised this quarter.</td></tr></tbody></table></div>

Until these projects publish stable strings, the only real controls are pinning access by IP range or locking things down behind Cloudflare rules. In practice they are still early-stage projects: monitor your logs, but no action is required yet.

OpenAI, for example, publishes a regularly updated JSON file that lists the IP ranges its crawlers use.
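
If you go the IP-range route, you can convert a published range file into server rules. The sketch below uses a placeholder URL and assumes the JSON follows the common prefixes/ipv4Prefix layout several vendors publish; adjust the jq path to the real schema.

# Download a vendor's IP-range JSON and print one Nginx allow rule per IPv4 prefix
curl -s "https://example.com/bot-ip-ranges.json" \
  | jq -r '.prefixes[].ipv4Prefix // empty' \
  | sed 's/^/allow /; s/$/;/'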

FAQs

What is an AI crawler in robots.txt?  

Any bot that requests your pages for model training or instant answers. You tell it what to do with User-agent: lines.

Is User-agent: * enough?  

No. A wildcard line should be a catch‑all. Still list named AI crawlers you care about; some ignore the star.
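
A short sketch of why the star is not enough: a crawler follows the most specific group that matches its name, so a named block overrides the wildcard (and some bots skip the wildcard entirely).

# GPTBot follows its own block and ignores the wildcard below
User-agent: GPTBot
Disallow: /

# Everyone else falls through to the catch-all
User-agent: *
Allow: /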

What is the best AI web crawler for open data?  

Common Crawl (CCBot) is still the leader because it releases monthly snapshots anyone can download.

What do you mean by "top user agents"?  

The tokens in this guide account for 95% of AI crawler traffic in the log data we have access to.

Are bots required to follow directives in robots.txt files?

Nope! Most do, though. Anthropic was criticized in 2024 for ignoring robots.txt directives.

Think of a robots.txt file as a list of preferences or suggestions on how to access a website. Block bad actors at the firewall/server level or add password authentication to content you don't want bots to access.

How do AI crawler bots fit into the picture of my target audience?

I'll be honest, I spent a few hours trying to put together a diagram in Canva, but this text-based version is better and easier to understand.

GPTBot ──► Your Website 
            │
            ▼
        LLM Training
            │
            ▼
      Human Prompt
            │
            ▼
        Agent UI (e.g. ChatGPT)
        ╱  │   ╲
       ╱   │    ╲
Search-GPT │  Operator (unknown user agent)
       │   │    │
       │   ▼    ▼
       │ Response   ChatGPT-User
       │        ╲     │
       │         ▼    │
       └────────►(merge)
                   ▼
             Human Click
                   │
                   ▼
             Your Website

Ending Tip

Even with the correct robots.txt configuration, your web server or firewall might still block AI crawlers. I recommend using Knowatoa's AI Search Console to validate your setup - it'll check your site against 24 different AI user agents and flag any access issues.

Knowatoa's AI Search Console dashboard

Otherwise you can use Merkle's robots.txt tester to audit user agents one-by-one.

robots.txt Validator and Testing Tool from technicalseo.com

<div class="post-note">Questions or missing agents? Let me know! I update this guide whenever new data rolls in. Drop me a comment on LinkedIn if you spot any I've missed!</div>

Recommended additional reading

B.I.S.C.U.I.T Framework, by Mike Buckbee - Big shoutout to Mike Buckbee and his fantastic tool Knowatoa, which helps me stay on top of these crawlers/user agents. The AI Search Console tool is particularly helpful for validating your site's accessibility to AI crawlers.

Originally published: February 7, 2025 | Last updated: April 24, 2025
