How to block ChatGPT, GPTBot and other AI chatbots from scraping your website

Learn how to block ChatGPT, GPTBot and other AI crawlers using robots.txt, server-level rules, and rate limiting to protect your content.

AI chatbots like ChatGPT can crawl and collect content from your website without prior permission. This practice, known as scraping, is a growing concern for website owners who want to protect their original content, preserve their traffic, and keep server costs under control.

The good news is that there are several ways to prevent AI tools from accessing your site. One of the most common starting points is configuring your robots.txt file. But as you'll see in this guide, robots.txt alone is not always enough. We'll cover the full picture: which AI bots exist, how to block them at different levels, the limitations of each approach, and how to decide which bots are actually worth blocking.

Why blocking AI crawlers matters

Before getting into the technical steps, it's worth understanding what's at stake. AI crawlers affect your website in several ways beyond just copying your content.

Impact on traffic and revenue

When AI models like ChatGPT or Perplexity scrape your content and use it to generate answers directly inside the chatbot interface, users often get what they need without ever visiting your site. This reduces organic traffic to your pages, which directly cuts into advertising revenue, newsletter sign-ups, and subscription conversions. Publishers, bloggers, and content-driven businesses feel this most acutely. The more useful your content is, the more likely it is to be absorbed into AI responses, and the less likely readers are to click through to the source.

Server load and bandwidth costs

Aggressive AI crawlers don't just read a page or two. They systematically crawl entire websites, often at high frequency. This puts a real load on your server and increases bandwidth consumption. For sites on shared hosting or those paying for bandwidth by volume, this translates directly into higher costs. In some cases, a surge of bot traffic can slow down your site for real users, degrading the experience and hurting your SEO.

Security and compliance risks

AI bots also introduce risks that go beyond content scraping. Some crawlers may inadvertently expose sensitive data indexed on your site, probe for application vulnerabilities, or collect personally identifiable information in violation of privacy regulations like GDPR. If your site includes member-only pages, internal documents, or user-generated content, an unchecked crawler could access material you never intended to be public. Intellectual property theft is another concern: once your content is baked into a model's training data, it's extremely difficult to have it removed.

The legal landscape around AI training data is still evolving, but several high-profile lawsuits have established that scraping copyrighted content without consent is legally contestable. As a website owner, you have a reasonable claim over how your content is used. Blocking crawlers is one practical way to assert that right while the courts and regulators catch up with the technology. Questions of data provenance, content ownership, and consent are central to ongoing debates about how AI companies should acquire training data.

Main AI bots accessing your website

Almost every company building a large language model runs its own web crawler. Here are the most active ones, along with their user agent strings.

GPTBot (OpenAI)

GPTBot is a web crawler developed by OpenAI. Its main function is to browse the web and collect information from websites, which can then be used to improve future AI models. GPTBot identifies itself with a specific user agent string and is designed to avoid paywalled content and pages containing personally identifiable information. That said, compliance with these stated policies depends entirely on OpenAI following its own rules.

  • User agent: GPTBot
  • Full user agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot

ChatGPT-User (OpenAI)

ChatGPT-User is a separate OpenAI user agent used by plugins and browsing features inside ChatGPT. Unlike GPTBot, it does not automatically crawl the web on a schedule. Instead, it fetches pages in real time to answer specific queries from ChatGPT users. This means it appears in your logs as a direct response to user requests, not as a background crawler.

  • User agent: ChatGPT-User

ClaudeBot (Anthropic)

ClaudeBot is the web crawler used by Anthropic to gather training data for its Claude AI models. It is one of the more active crawlers and has been reported to make a high volume of requests in short periods.

  • User agent: ClaudeBot
  • Related user agent: anthropic-ai (a separate Anthropic token worth blocking alongside ClaudeBot)

PerplexityBot (Perplexity AI)

Perplexity is an AI-powered search engine that summarises web content in response to user queries. PerplexityBot crawls your pages to feed those summaries. Because Perplexity presents answers directly to users, it can divert traffic that would otherwise land on your site.

  • User agent: PerplexityBot

Google-Extended

Google-Extended is a specific token that Google introduced to let website owners opt out of having their content used to train Gemini (formerly Bard) and other Google AI products, separately from the standard Googlebot crawler. Blocking Google-Extended does not affect your Google Search rankings. It only controls whether your content is used for AI training purposes.

  • User agent: Google-Extended

Meta-ExternalAgent (Meta)

Meta uses this crawler to collect data for its AI research and products, including the models that power Meta AI on Facebook, Instagram, and WhatsApp.

  • User agent: Meta-ExternalAgent

CCBot (Common Crawl)

Common Crawl is a non-profit that maintains a freely available archive of web content. Its data is widely used by AI companies to train language models, including early versions of GPT. Blocking CCBot can help prevent your content from entering this open dataset.

  • User agent: CCBot

Blocking AI bots with robots.txt

The robots.txt file sits at the root of your website (e.g., https://yoursite.com/robots.txt) and tells compliant crawlers which pages they are allowed to access. It is the simplest and most widely recommended starting point.

To block all the crawlers listed above, add the following to your robots.txt file:

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

    User-agent: CCBot
    Disallow: /

You can also block everything and whitelist only the crawlers you trust, using User-agent: * with Disallow: / as a catch-all rule. Just be careful: this will block all bots including Googlebot unless you explicitly allow it.
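As a sketch, a deny-by-default file that still admits Google's search crawler might look like this (Googlebot is just an example of a crawler you might choose to trust):

    # Deny everything by default
    User-agent: *
    Disallow: /

    # Explicitly allow trusted crawlers; a bot follows the most
    # specific group that matches it, so this overrides the rule above
    User-agent: Googlebot
    Allow: /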

The limitations of robots.txt

This is the most important caveat in this entire guide. Robots.txt is a voluntary protocol. Compliant, well-behaved crawlers like Googlebot will respect it. Many AI crawlers from established companies also follow it. But there is no technical enforcement mechanism. Any crawler can simply ignore your robots.txt file entirely and scrape your site anyway.

Scrapers built by smaller or less scrupulous actors almost certainly will not respect it. Even some AI companies have faced accusations of ignoring opt-out signals. For this reason, robots.txt should be treated as a first layer of defence, not the only one.

If protecting your content is a serious priority, you need to combine robots.txt with server-level controls.

Server-level blocking: A more robust approach

Server-level blocking works by rejecting requests from known bot user agents or IP ranges before they even reach your content. Unlike robots.txt, these rules are enforced by the server rather than left to the crawler's goodwill, although a bot that spoofs its user agent can still slip past simple filters. Here are the main techniques.

Blocking bots via .htaccess (Apache servers)

If your site runs on an Apache server, you can add rules to your .htaccess file to block specific user agents. For example, to block GPTBot and ClaudeBot, you would add a RewriteCond block that matches those user agent strings and returns a 403 Forbidden response. This is effective and does not require any additional software.
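A minimal sketch of such a block, assuming mod_rewrite is enabled on your host (the list of user agents is illustrative and can be extended):

    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Match requests whose user agent contains a known AI crawler name ([NC] = case-insensitive)
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|CCBot) [NC]
    # Serve a 403 Forbidden response and stop processing further rules
    RewriteRule .* - [F,L]
    </IfModule>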

Blocking bots via Nginx configuration

On Nginx servers, you can add if ($http_user_agent ~ "GPTBot|ClaudeBot|PerplexityBot|CCBot") rules inside your server block to return a 403 or 444 (no response) status. Returning no response at all (444) is particularly efficient because it closes the connection immediately without consuming resources to generate a reply.
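Here is a minimal sketch of that rule, placed inside the server block (the ~* operator makes the match case-insensitive):

    server {
        # ... your existing configuration ...

        # Drop the connection silently for known AI crawlers
        if ($http_user_agent ~* "GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|CCBot") {
            return 444;
        }
    }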

Rate limiting

Rate limiting restricts how many requests a single IP address can make within a given time window. Even if a bot gets through your user agent filter by spoofing its identity, aggressive crawling behaviour (many requests per second from one IP) will trigger the rate limit. Most web servers and CDNs (including Cloudflare, Nginx, and Apache) support rate limiting natively.
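As an illustration, this is what a basic per-IP rate limit looks like in Nginx using the built-in limit_req module; the rate of 10 requests per second and the burst of 20 are assumptions you should tune against your own traffic:

    # In the http block: track clients by IP and allow 10 requests/second on average
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        location / {
            # Permit short bursts of up to 20 extra requests, then reject with 429
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
        }
    }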

Using Fail2ban

Fail2ban is an open-source tool that monitors server logs and automatically bans IP addresses that match suspicious patterns. You can configure it to detect crawler-like behaviour (e.g., a high volume of requests, specific user agent strings) and temporarily or permanently block those IPs at the firewall level. This is particularly effective against crawlers that ignore robots.txt and cycle through requests at scale.
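A sketch of what this might look like, assuming an Nginx access log in the default location; the jail and filter name (ai-bots) and the ban duration are illustrative:

    # /etc/fail2ban/filter.d/ai-bots.conf
    [Definition]
    # Match any log line from <HOST> whose user agent contains a known AI crawler name
    failregex = ^<HOST> .*(GPTBot|ClaudeBot|PerplexityBot|CCBot)

    # /etc/fail2ban/jail.local
    [ai-bots]
    enabled  = true
    port     = http,https
    filter   = ai-bots
    logpath  = /var/log/nginx/access.log
    maxretry = 1
    bantime  = 86400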

Using a CDN or WAF

Content delivery networks like Cloudflare offer bot management features that can detect and block AI crawlers automatically. A Web Application Firewall (WAF) can be configured with custom rules to block known bot user agents. These solutions sit in front of your server and handle blocking before requests even reach your hosting infrastructure, which also reduces the server load caused by bot traffic.
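As one example, a custom WAF rule in Cloudflare's expression language could match the crawlers covered earlier, with the rule's action set to Block in the dashboard:

    (http.user_agent contains "GPTBot") or
    (http.user_agent contains "ClaudeBot") or
    (http.user_agent contains "PerplexityBot") or
    (http.user_agent contains "CCBot")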

Blocking by IP range

Some AI companies publish the IP ranges their crawlers use. OpenAI, for example, provides a list of GPTBot IP addresses. You can block these ranges at the server or firewall level for a stronger guarantee that those specific crawlers cannot access your site. The downside is that IP ranges can change, so this approach requires occasional maintenance.
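A sketch of what IP-range blocking looks like in an Nginx configuration; the CIDR ranges below are documentation placeholders, not OpenAI's real ranges, so substitute the current list published by the crawler's operator:

    # Placeholder ranges for illustration only; replace with the
    # published ranges for the crawler you want to block
    deny 192.0.2.0/24;
    deny 198.51.100.0/24;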

Should you block all AI bots?

Blocking every AI crawler is not automatically the right decision for every website. It's worth thinking through the trade-offs for each bot before applying a blanket rule.

Reasons to block

  • You publish original content that generates ad revenue or subscriptions, and AI summaries are diverting your audience.
  • Your site is experiencing performance issues or unexpected bandwidth costs due to crawler traffic.
  • You have concerns about IP theft, sensitive data exposure, or GDPR compliance.
  • You simply do not want your work used to train commercial AI products without compensation.

Reasons to allow (or be selective)

  • Google-Extended controls only AI training use, not search indexing. Blocking it has no effect on your Google rankings.
  • Some AI tools may increase your visibility by citing your site as a source in responses, driving referral traffic.
  • If your content is freely available and your goal is maximum reach, being included in AI training data could be seen as a form of distribution.
  • Blocking all bots with a catch-all rule risks accidentally excluding legitimate crawlers if not configured carefully.

The most sensible approach for most site owners is to block the crawlers that pose the clearest risk (CCBot, training-focused bots, aggressive scrapers) while being more selective about others. Review each user agent individually and make a decision based on your specific goals.

Pros and cons of blocking AI bots

To summarise the broader picture:

  • Pro: Protects original content from being used in AI training without consent.
  • Pro: Reduces server load and bandwidth costs from aggressive crawling.
  • Pro: Helps preserve organic traffic and revenue by preventing AI-generated summaries from replacing your pages.
  • Pro: Reduces exposure to security and compliance risks.
  • Con: Robots.txt rules can be ignored by non-compliant bots, making server-level measures necessary for real protection.
  • Con: Blocking certain AI tools may reduce your content's reach or citation potential.
  • Con: Requires ongoing maintenance as new crawlers emerge and IP ranges change.

Keeping your blocklist up to date

The AI industry moves fast. New models launch regularly, and with them come new crawlers and new user agent strings. It's a good idea to review your robots.txt and server-level rules every few months. Check your server access logs for unfamiliar user agents, and cross-reference them with updated crawler databases. Several open-source projects and community-maintained lists track new AI bots as they appear, and these can save you a lot of manual research.
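A quick way to surface unfamiliar user agents is to count them in your access log. This one-liner assumes the standard combined log format, where the user agent is the sixth double-quote-delimited field, and an Nginx log path you may need to adjust:

    # List the 20 most frequent user agents hitting the site
    awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20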

Combining a well-maintained robots.txt file with at least one layer of server-level protection gives you a solid, practical defence against the vast majority of AI scraping activity.