How to Block ChatGPT From Using Your Website Content

A common concern is that there is no easy way to opt out of having your content used to train large language models (LLMs) such as ChatGPT. There is a way to do it, but it is neither straightforward nor guaranteed to work.

How does artificial intelligence learn from your content?

Large language models (LLMs) are trained on data that originates from multiple sources. Many of these datasets are open source and freely used to train AI.

Some of the sources used are:

  • Wikipedia
  • Government court records
  • Books
  • Email messages
  • Crawled websites

There are actually portals and websites offering datasets that provide huge amounts of information.

One of these portals is hosted by Amazon: the Registry of Open Data on AWS, which offers thousands of datasets.

Amazon’s portal is just one of many that host large numbers of datasets.

Wikipedia lists 28 portals for downloading datasets, including the Google Dataset Search and Hugging Face portals, where thousands of datasets can be found.

Web content datasets

OpenWebText

A common dataset of web content is called OpenWebText. OpenWebText consists of URLs found in Reddit posts that received at least three upvotes.

The idea is that these URLs are trustworthy and will contain quality content. I couldn’t find information about a user agent for their crawler; it may simply identify itself as Python, but I’m not sure.

However, we do know that if your site is linked from Reddit with at least three upvotes, there is a good chance it is in the OpenWebText dataset.

More information about OpenWebText is available here.

Common Crawl

One of the most widely used datasets for internet content is offered by a non-profit organization called Common Crawl.

The Common Crawl data comes from a bot that crawls the entire internet.

The data is downloaded by organizations that want to use it and is then cleaned of unwanted sites and other noise.

The name of the Common Crawl bot is CCBot.

CCBot obeys the robots.txt protocol, so it is possible to block Common Crawl with robots.txt and prevent your website data from ending up in another dataset.

However, if your site has already been crawled, it is likely already included in multiple datasets.
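
If you want to check whether your pages are already in there, Common Crawl publishes a public CDX index API at index.commoncrawl.org. The following is a minimal sketch using Python’s standard library, assuming that API’s current layout; example.com is a placeholder for your own domain:

import json
import urllib.error
import urllib.request

DOMAIN = "example.com"  # placeholder: replace with your own domain

# collinfo.json lists every Common Crawl index collection, newest first
with urllib.request.urlopen("https://index.commoncrawl.org/collinfo.json") as resp:
    newest_api = json.load(resp)[0]["cdx-api"]

# query the CDX index API for captures of any URL on the domain;
# results come back as one JSON record per line
try:
    with urllib.request.urlopen(f"{newest_api}?url={DOMAIN}/*&output=json") as resp:
        records = [json.loads(line) for line in resp.read().splitlines()]
    print(f"{len(records)} captured page(s) from {DOMAIN} in the latest crawl")
    for record in records[:5]:
        print(record["timestamp"], record["url"])
except urllib.error.HTTPError as err:
    if err.code == 404:  # the index answers 404 when there are no captures
        print(f"No captures of {DOMAIN} in the latest crawl")
    else:
        raise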

However, by blocking Common Crawl, it is possible to opt your website content out of new datasets built from more recent Common Crawl data.

The CCBot user-agent string is:

CCBot/2.0

Add the following to your robots.txt file to block the Common Crawl bot:

User-agent: CCBot
Disallow: /
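
Once the rule is live, you can check that it parses the way you intended. Here is a small sketch using Python’s standard-library robotparser, with example.com standing in for your own site:

from urllib.robotparser import RobotFileParser

# example.com is a placeholder: point this at your own robots.txt
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

# with the rule above in place, can_fetch() returns False for CCBot,
# confirming the whole site is disallowed for that user agent
print(parser.can_fetch("CCBot/2.0", "https://example.com/some-page/"))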

An additional way to confirm whether a CCBot user agent is legitimate is to check that it crawls from Amazon AWS IP addresses.
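
One rough way to run that check is a reverse DNS lookup on the visiting IP: traffic from an AWS-hosted crawler should typically resolve to an amazonaws.com hostname. A sketch, with a made-up IP address standing in for one pulled from your server logs:

import socket

suspect_ip = "203.0.113.7"  # made-up address: take this from your access logs

try:
    hostname, _, _ = socket.gethostbyaddr(suspect_ip)
    # AWS-hosted crawlers typically reverse-resolve to an *.amazonaws.com host
    verdict = "AWS hostname" if hostname.endswith(".amazonaws.com") else "not an AWS hostname"
    print(f"{suspect_ip} -> {hostname} ({verdict})")
except socket.herror:
    print(f"{suspect_ip} has no reverse DNS record")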

CCBot also obeys the nofollow directive in the robots meta tag.

Use this in your robots meta tag:

<meta name="robots" content="nofollow">
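
To confirm a page is actually serving that tag, here is a short sketch using Python’s standard html.parser; the URL is a placeholder for one of your own pages:

import urllib.request
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append(attrs.get("content") or "")

# example.com is a placeholder: fetch one of your own pages instead
with urllib.request.urlopen("https://example.com/") as resp:
    page = resp.read().decode("utf-8", errors="replace")

finder = RobotsMetaFinder()
finder.feed(page)
print(finder.directives)  # e.g. ['nofollow'] once the tag above is in place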

Prevent artificial intelligence from using your content

Search engines allow websites to opt out of being crawled. Common Crawl also allows opting out. But there is currently no way to remove website content from existing datasets.

Furthermore, research scientists do not appear to offer website publishers a way to opt out of crawling.

The article, Is Using ChatGPT for Web Content Fair?, explores whether it is ethical to use website data without permission or a means to opt out.

Many publishers would appreciate it if, in the near future, they were given more say over how their content is used, especially by AI products like ChatGPT.

Whether that will happen is unknown at this time.

Featured image by Shutterstock / ViDI Studio
