List of AI Search Crawlers & User Agents

Posted in: AI Updates // February 7, 2025 // Est. Read Time: 3 minutes

A list of AI search crawlers and how to verify they can access your website. Includes a validation tool and example robots.txt configurations.

Oh hey! AI search is changing how people find our content. ChatGPT, Claude, Perplexity - these tools are a growing source of website traffic (Semrush just reported a 300% jump in domains getting ChatGPT traffic in the second half of last year).

Most AI crawlers can access your content by default. But with how fast this space is moving, it's super helpful to know exactly which crawlers are out there and verify they can actually see your site. I've put together a complete list and found this rad tool called Knowatoa that makes checking access by user agent very simple.

Top AI Web Crawlers to Know About

Here are the major AI crawlers you should have on your radar:

OpenAI Family

GPTBot (ChatGPT's main crawler)

  • Gathers text data to improve ChatGPT’s language model.
  • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot)

ChatGPT-User

  • Handles user prompt interactions in ChatGPT.
  • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot)

OAI-SearchBot

  • Indexes online content to advance ChatGPT’s research and retrieval.
  • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)

Anthropic

Anthropic AI Bot

  • Collects information for Anthropic’s AI development.
  • Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html)

ClaudeBot

  • Processes and retrieves web data for conversation-based AI.
  • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ClaudeBot/1.0; +claudebot@anthropic.com)

Claude Web

  • Acquires site data to refine Anthropic’s web-focused models.
  • Mozilla/5.0 (compatible; claude-web/1.0; +http://www.anthropic.com/bot.html)

Major Tech Companies

Google-Extended (used for Gemini)

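  • Controls whether content Google crawls can be used for its AI programs (such as Gemini) beyond standard search.
  • Google-Extended is a robots.txt control token, not a separate crawler: Google documents no distinct HTTP user-agent string for it, and crawling is done with Google's existing user agents. Reference it by name in robots.txt rather than looking for it in server logs.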

Applebot & Applebot-Extended (Siri)

  • Crawls webpages to improve results for Siri and Spotlight.
  • Applebot: Mozilla/5.0 (compatible; Applebot/1.0; +http://www.apple.com/bot.html)
  • Applebot-Extended: Mozilla/5.0 (compatible; Applebot-Extended/1.0; +http://www.apple.com/bot.html)

BingBot

  • Indexes sites for Microsoft Bing’s search engine.
  • Mozilla/5.0 (compatible; BingBot/1.0; +http://www.bing.com/bot.html)

FacebookBot & Meta External Fetcher

  • Fetches content for Facebook and other Meta services.
  • FacebookBot: Mozilla/5.0 (compatible; FacebookBot/1.0; +http://www.facebook.com/bot.html)
  • Meta External Fetcher: Mozilla/5.0 (compatible; meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler))

LinkedInBot

  • Collects site data for LinkedIn’s platform features.
  • LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)

Amazonbot

  • Crawls sites to enhance Amazon’s web-related services.
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)

Bytespider (ByteDance/TikTok)

  • Surveys webpages to support TikTok’s content discovery.
  • Mozilla/5.0 (compatible; Bytespider/1.0; +http://www.bytedance.com/bot.html)

Other Search Engines

PerplexityBot

  • Examines websites to inform Perplexity’s AI-powered search.
  • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)

YouBot

  • Powers AI-based search functionality on You.com.
  • Mozilla/5.0 (compatible; YouBot (+http://www.you.com))

DuckAssistBot

  • Collects data to deliver AI-backed answers on DuckDuckGo.
  • Mozilla/5.0 (compatible; DuckAssistBot/1.0; +http://www.duckduckgo.com/bot.html)

Research & Development

AI2Bot (Allen Institute)

  • Crawls websites for the Allen Institute’s AI research.
  • Mozilla/5.0 (compatible; AI2Bot/1.0; +http://www.allenai.org/crawler)

CCBot (Common Crawl)

  • Gathers open web data for the Common Crawl archive.
  • Mozilla/5.0 (compatible; CCBot/1.0; +http://www.commoncrawl.org/bot.html)

Cohere AI

  • Collects text samples to refine Cohere’s language models.
  • Mozilla/5.0 (compatible; cohere-ai/1.0; +http://www.cohere.ai/bot.html)

Omgili Bot

  • Indexes discussion-focused data for research and analysis.
  • Mozilla/5.0 (compatible; omgili/1.0; +http://www.omgili.com/bot.html)

Timpi

  • Uses distributed crawling to compile datasets for AI applications.
  • Timpibot/0.8 (+http://www.timpi.io)

DiffBot

  • Scrapes webpages to produce structured data for AI systems.
  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com)
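
If you want to spot these crawlers in your own server logs, the product tokens in the strings above are the useful part. Here's a minimal sketch, assuming case-insensitive substring matching on the tokens listed in this post. The function name `identify_ai_crawler` and the token table are illustrative, not an official registry, and user-agent strings can be changed or spoofed, so treat a match as a hint rather than proof.

```python
from typing import Optional

# Product tokens taken from the user-agent strings listed in this post.
# Order matters: a longer token must come before a shorter token it
# contains (e.g. "Applebot-Extended" before "Applebot").
# Note: some tokens (e.g. Google-Extended) are robots.txt controls and
# may never show up in real request headers.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "anthropic-ai", "ClaudeBot", "claude-web",
    "Google-Extended", "Applebot-Extended", "Applebot", "BingBot",
    "FacebookBot", "meta-externalagent", "LinkedInBot", "Amazonbot",
    "Bytespider", "PerplexityBot", "YouBot", "DuckAssistBot",
    "AI2Bot", "CCBot", "cohere-ai", "omgili", "Timpibot", "Diffbot",
]


def identify_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the first known AI crawler token found in user_agent, else None."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None
```

Feed it the user-agent field from each log line and count the non-None results to see which AI crawlers are actually visiting.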

Example robots.txt Allow Snippets

Here are some robots.txt configuration snippets that allow major AI crawlers while maintaining standard SEO best practices:

# Example robots.txt entries to allow specific AI Crawlers

# Allen Institute (AI2Bot)
User-agent: AI2Bot
Allow: /

# Amazon (Amazonbot)
User-agent: Amazonbot
Allow: /

# Anthropic (Anthropic AI Bot)
User-agent: anthropic-ai
Allow: /

# Anthropic (ClaudeBot)
User-agent: ClaudeBot
Allow: /

# Anthropic (Claude Web)
User-agent: claude-web
Allow: /

# Apple (Applebot)
User-agent: Applebot
Allow: /

# Apple (Applebot-Extended)
User-agent: Applebot-Extended
Allow: /

# Microsoft (BingBot)
User-agent: BingBot
Allow: /

# ByteDance (Bytespider)
User-agent: Bytespider
Allow: /

# Common Crawl (CCBot)
User-agent: CCBot
Allow: /

# OpenAI (ChatGPT-User)
User-agent: ChatGPT-User
Allow: /

# OpenAI (GPTBot)
User-agent: GPTBot
Allow: /

# OpenAI (OAI-SearchBot)
User-agent: OAI-SearchBot
Allow: /

# Cohere (cohere-ai)
User-agent: cohere-ai
Allow: /

# Diffbot (DiffBot)
User-agent: DiffBot
Allow: /

# DuckDuckGo (DuckAssistBot)
User-agent: DuckAssistBot
Allow: /

# Meta (FacebookBot)
User-agent: FacebookBot
Allow: /

# Meta (Meta External Fetcher)
User-agent: meta-externalagent
Allow: /

# Google (Google-Extended)
User-agent: Google-Extended
Allow: /

# LinkedIn (LinkedInBot)
User-agent: LinkedInBot
Allow: /

# Omgili (omgili)
User-agent: omgili
Allow: /

# Perplexity (PerplexityBot)
User-agent: PerplexityBot
Allow: /

# Timpi (Timpibot)
User-agent: Timpibot
Allow: /

# You.com (YouBot)
User-agent: YouBot
Allow: /
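
To sanity-check a robots.txt like the one above before deploying it, you can parse it locally. Here's a rough sketch using Python's standard-library `urllib.robotparser`; the helper name `audit_robots` and the sample rules are made up for illustration, and Python's parser may differ in edge cases from how each crawler interprets the file.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt: str, agents: List[str], path: str = "/") -> Dict[str, bool]:
    """Report, per user-agent token, whether robots_txt allows fetching path."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Hypothetical rules: allow GPTBot, block CCBot, keep /private/ off limits
# for everyone without a more specific group.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for agent, allowed in audit_robots(SAMPLE_ROBOTS, ["GPTBot", "CCBot", "PerplexityBot"]).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

An agent with no group of its own (PerplexityBot in the sample) falls back to the `*` group, which is an easy way to block a crawler without meaning to.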

Quick Tip

Even with the correct robots.txt configuration, your web server or firewall might still block AI crawlers. I recommend using Knowatoa's AI Search Console to validate your setup - it'll check your site against 24 different AI user agents and flag any access issues.

Knowatoa's AI Search Console dashboard
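
If you'd like a quick do-it-yourself version of that check, here's a minimal sketch that requests a page while sending different User-Agent headers and compares the HTTP status codes. The function names (`http_status`, `probe_access`, `blocked_agents`) are hypothetical, and this is not how Knowatoa works under the hood. It also only catches rules keyed on the user-agent string: a firewall that filters by IP address will treat your machine differently from the real crawler.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def http_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch url with the given User-Agent header and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        # Connection-level failures (URLError, timeouts) are left to propagate.
        return err.code


def probe_access(
    url: str,
    user_agents: Dict[str, str],
    fetch: Callable[[str, str], int] = http_status,
) -> Dict[str, int]:
    """Map each crawler name to the status code the server returned for it."""
    return {name: fetch(url, ua) for name, ua in user_agents.items()}


def blocked_agents(statuses: Dict[str, int]) -> List[str]:
    """Names whose request came back with a client or server error."""
    return [name for name, status in statuses.items() if status >= 400]
```

Call `probe_access("https://your-site.example/", {...})` with a few of the user-agent strings from the list in this post (the URL here is a placeholder), and go easy on request volume so you don't trip your own rate limits.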

Otherwise you can use Merkle's robots.txt tester to audit user agents one-by-one.

robots.txt Validator and Testing Tool from technicalseo.com

As AI search continues to mature, this list will keep growing. I'll update this post as new crawlers emerge. Drop me a comment if you spot any I've missed!

Shoutout to Mike Buckbee

Big shoutout to Mike Buckbee and his fantastic tool Knowatoa, which helps me stay on top of these crawlers/user agents. The AI Search Console tool is particularly helpful for validating your site's accessibility to AI crawlers.


Originally published: February 7, 2025 | Last updated: February 7, 2025
