How to block ChatGPT, GPTBot and other AI Chatbots from scraping your website content?


Today, AI chatbots like ChatGPT can collect and use content from your website without prior permission.

This practice, known as “scraping”, can be a concern for many website owners who want to protect their original and exclusive content.

The good news is that there are ways to prevent these AI tools from accessing your website.

One of the most effective strategies to achieve this is by configuring your website’s robots.txt file. This file acts as a gatekeeper, dictating which bots can interact with your site and to what extent.

In this article, we’ll show you what types of bots exist and how you can use the robots.txt file to specifically block AI bots like ChatGPT, as well as other common bots in the digital landscape.

We will also explore the pros and cons of this decision, helping you better understand how this action can influence your site’s visibility, SEO, and most importantly, protection of your content.

What are the main AI chatbots that access your website?

Practically all companies with large language models have their own bots to comb the web and collect information.

Below is a list of the most popular ones:

GPTBot: what it is and what functions it has


GPTBot is a web crawler developed by OpenAI.

This bot’s main function is to navigate the web and collect information from websites, which can be used to improve future artificial intelligence models.

GPTBot identifies pages through a specific user agent, making sure not to access content protected by paywalls or containing identifiable personal information.
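Because GPTBot announces itself through its user-agent string, you can check your own server logs for its visits. Below is a minimal sketch in Python, assuming a typical access-log line; the exact user-agent string OpenAI sends may vary, but it is documented to contain the token "GPTBot":

```python
import re

# Hypothetical access-log line (the IP, path, and exact user-agent string
# are illustrative assumptions).
log_line = ('203.0.113.7 - - [01/Jan/2024:00:00:00 +0000] '
            '"GET /article HTTP/1.1" 200 1234 '
            '"-" "Mozilla/5.0; compatible; GPTBot/1.0; +https://openai.com/gptbot"')

def is_gptbot(line: str) -> bool:
    """Return True if the log line's user-agent mentions GPTBot."""
    return re.search(r"gptbot", line, re.IGNORECASE) is not None

print(is_gptbot(log_line))  # True
```

Running something like this over your access log tells you whether GPTBot is already visiting your site before you decide to block it.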

ChatGPT-User: what it is and what functions it has

On the other hand, ChatGPT-User is another OpenAI user agent, used by plugins in ChatGPT.

Unlike GPTBot, ChatGPT-User does not automatically crawl the web.

Instead, it is used to perform direct actions requested by ChatGPT users.

Therefore, it collects information from web pages to respond to real-time queries made by users through ChatGPT.

What are the differences between GPTBot and ChatGPT-User?

The main differences between GPTBot and ChatGPT-User lie in their purpose and method of operation:

GPTBot is designed to crawl the web extensively and automatically, collecting data to feed and improve AI models, much like a traditional search-engine crawler.

ChatGPT-User, by contrast, is activated only to fetch the information needed to answer user queries in real time, without performing extensive automatic crawling.

Anthropic-ai: what it is and what functions it has


Anthropic-ai is a web crawler operated by Anthropic.

It is focused on downloading data to train large language models (LLMs) like those powering Claude.

Its main task is to collect web content, functioning as an “AI Data Scraper”.

That said, the specific details of how it selects sites to crawl are generally unclear.

Google Extended: what it is and what functions it has


Google-Extended is a web crawler operated by Google, primarily used to download training content for AI products like Bard and Vertex AI’s generative APIs.

Other AI chatbot crawler: Cohere-ai

Cohere-ai is a bot operated by Cohere, mainly used in its AI chat products.

This bot is activated in response to user prompts when it is necessary to retrieve content from the internet.

Unlike traditional web crawlers, cohere-ai does not automatically navigate the web, but rather makes specific visits to websites based on individual user requests.

How to block AI Bots from using my content

To block these bots’ access to your website, you can use the robots.txt file.

It is the standard tool for all webmasters to control crawler access.

It is a plain text file, inserted at the root of the domain:

yourdomain.com/robots.txt

Below, we explain the code you should add to block each AI Bot:

How to block OpenAI bots

Block GPTBot:

Add the following lines to your robots.txt file:

User-agent: GPTBot
Disallow: /

Block ChatGPT-User:

Similarly, to prevent access by ChatGPT-User, add:

User-agent: ChatGPT-User
Disallow: /

How can I block Anthropic / Claude bots?

To block anthropic-ai access to your website, you must modify the robots.txt file at the root of your domain with the following lines:

User-agent: anthropic-ai
Disallow: /

How can I block Google Bard / Vertex AI bots?

To control or block Google-Extended access to your website, you can configure the robots.txt file including the following lines:

User-agent: Google-Extended
Disallow: /
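If you want to block all of the AI crawlers mentioned in this article at once, the rules can be combined into a single group in your robots.txt (grouping several User-agent lines over one Disallow rule is valid under the Robots Exclusion Protocol, though you can also repeat the Disallow line per bot for clarity):

```
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: cohere-ai
User-agent: CCBot
Disallow: /
```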

How effective is using Robots.txt to block AI?

Blocking AI bots via the robots.txt file is the most effective method we currently have, but it is not entirely reliable.

The first problem is that you need to specify each bot you want to block, but who can keep track of all the AI bots that come to market?

The next drawback is that the directives in your robots.txt file are advisory, not enforceable. While bots like CCBot (Common Crawl) and OpenAI's crawlers respect these directives, many others do not.
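You can see how a compliant crawler interprets these directives using Python's standard urllib.robotparser module. A minimal sketch, using the same GPTBot rule shown earlier:

```python
from urllib.robotparser import RobotFileParser

# The same rule recommended above for blocking GPTBot.
rules = """User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler checks itself against the rules before fetching.
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

The second call illustrates the limitation described above: any crawler not named in the file, and any crawler that simply skips this check, can fetch the page anyway.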

Another major drawback is that you can only block AI bots to prevent future scraping.

You cannot remove data from previous crawls or send requests to companies like OpenAI to delete all your data.
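Because robots.txt is only advisory, some site owners pair it with a server-side hard block that refuses requests based on the User-Agent header. Here is a minimal, framework-agnostic sketch in Python; the token list and function name are illustrative assumptions, and a real deployment would return HTTP 403 from your web server or framework:

```python
# Illustrative list of user-agent tokens to refuse (lowercase substrings).
BLOCKED_TOKENS = ("gptbot", "chatgpt-user", "anthropic-ai", "ccbot")

def should_block(user_agent: str) -> bool:
    """Return True if the request comes from a crawler we refuse to serve."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)

# A web server or framework would return HTTP 403 when this is True.
print(should_block("Mozilla/5.0; compatible; GPTBot/1.0"))   # True
print(should_block("Mozilla/5.0 (Windows NT 10.0) Chrome"))  # False
```

Unlike robots.txt, a check like this is enforced on your side, although bots can still evade it by spoofing their user-agent string.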

What websites are blocking AI bots? Some examples:

Top 10 websites

The following table shows which of the top websites worldwide are blocking different bots related to artificial intelligence.


As you can see, Pinterest and Amazon are the only ones that have taken action so far.

Top 1,000 websites: how many have blocked AI bot access?

Among the world’s 1,000 most popular websites, blocking percentages are very uneven.


The chart shows that of the top 1,000 websites:

  1. GPTBot is the most blocked, at 29%.
  2. CCBot is second, at 16%.
  3. Google-Extended is third, at 7.7%.
  4. anthropic-ai is fourth, at 1.4%.

This indicates greater concern about content scraping by GPTBot, while anthropic-ai seems to be the least worrisome for website administrators, or perhaps simply the least well known.

Pros and Cons of Blocking AI Bots

The use of AI bots like ChatGPT, Google-Extended, cohere-ai, and others has increased significantly, and it is normal for website owners to have doubts about whether or not to block them.

Below, we will explore some of the benefits and drawbacks of blocking these AI bots.

Benefits of blocking ChatGPT Bot and other AI bots

– Protects Exclusive Content:

Blocking these bots helps protect your original content from being scraped and used without your permission, maintaining exclusivity and value of your work.

– Control over Information Distribution:

You have greater control over how and where your content is distributed, which is particularly important for sites with sensitive or proprietary information.

– Reduces Server Load:

By preventing constant access by these bots, you can reduce load on your servers, which is crucial for sites with limited resources.

– Prevents Unauthorized Uses:

Preventing access by these bots can also avoid unauthorized or unethical uses of your content, such as creating AI models based on scraped data without consent.

Disadvantages of blocking ChatGPT Bot and other AI bots

– Possible Impact on Visibility and SEO:

Blocking certain bots, especially those related to search engines like Google, can negatively impact your site’s visibility and SEO performance.

– Limitation on Innovation and Collaboration:

By restricting access by these bots, you may be limiting indirect opportunities for innovation and collaboration that these bots could facilitate by interacting with your content.

– Challenges in Implementation:

Implementing effective blocking requires technical knowledge of how robots.txt files work and may require constant updates to be effective.

– Possible Isolation in the Digital Ecosystem:

Excessive blocking can lead to isolation in the digital ecosystem, where your content is placed out of reach from technological advances and emerging opportunities.

Conclusion

Around 25.9% of the thousand most visited websites have implemented measures to block GPTBot access.

Among the most prominent sites that have recently added restrictions are Pinterest, Amazon, and Quora.

A significant number of large news publishers and media outlets, such as The New York Times, The Guardian and CNN, have also decided to block GPTBot.

The first six major websites that led this trend include Amazon, Quora, The New York Times, Shutterstock, Wikihow and CNN.

On the other hand, the CCBot, despite being older than GPTBot, has only been blocked by 13.9% of websites since August 1, 2023.

Regarding attempts to block Anthropic's crawler, only two cases have been observed, with Reuters among the sites that have applied restrictions to anthropic-ai.

Concern is clearly growing among large companies and media outlets about how AI companies use their content as raw material.

Could these misgivings condition the evolution of AI models? Are there legal grounds that could lead to lawsuits for intellectual property violations?

Frequently Asked Questions

Is it worth blocking every AI bot?

The short answer is that it’s almost certainly not worth it. Manually blocking each AI bot is practically impossible, and even if you succeed, there is no guarantee that all of them will obey your robots.txt instructions. Additionally, blocking these bots may prevent you from collecting meaningful data to determine whether tools like Bard are benefiting or hurting your search marketing strategy.

How can I block GPTBot?

You can block GPTBot and other OpenAI crawlers using plugins or by adjusting your site’s settings to limit their access. Restricting access via your website’s robots.txt file is also possible.

How can I prevent GPTBot from scraping my content?

You can implement measures such as establishing paywalls, requiring user registration, or limiting access to certain sections of your site to prevent unauthorized GPTBot scraping.

Can GPTBot slow down my website?

Yes, continuous scraping by GPTBot or other OpenAI bots can consume server resources and slow down your website’s performance, although such cases are exceptional.

Need help adapting your web positioning strategies to the new AI paradigm?

We know how to leverage AI to increase organic and paid traffic to RETAIN your customers and get steady returns from SEO, PPC, and CRO.

Let us help you dominate search engine result pages!

Tell us about your case!

Agency specialized in digital marketing engineering. Traffic acquisition, analysis and optimization of results.
