
Is ChatGPT’s Use Of Web Content Fair?

Large Language Models (LLMs) such as ChatGPT train on multiple sources of data, including web content. That data becomes the basis for output such as summaries and articles, produced without attribution or benefit to those who published the original content used for training.

Search engines download website content (a process called crawling and indexing) in order to provide answers in the form of links to websites.

Website publishers can opt out of having their content crawled and indexed by search engines through the Robots Exclusion Protocol, commonly referred to as robots.txt.
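
As an illustration, a publisher who wanted to keep a particular crawler out of an entire site could add rules like the following to robots.txt (the user-agent name used here, ExampleBot, is a placeholder rather than a real crawler):

User-agent: ExampleBot
Disallow: /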

The Robots Exclusion Protocol is not an official Internet standard, but it is one that legitimate web crawlers adhere to.

Should web publishers be able to use the Robots.txt protocol to prevent large language models from using the content of their websites?

Large language models use website content without attribution

Some of those involved in search marketing are uncomfortable with how website data is used to train machines without giving anything back, such as an acknowledgment or traffic.

Hans Peter Blindheim (LinkedIn profile), a senior expert at Curamando, shared his opinions with me.

Hans Peter commented:

“When an author writes something after having learned something from an article on your site, they will more often than not link to your original work, because it lends credibility and is a professional courtesy.

It’s called a citation.

But the scale at which ChatGPT ingests content without giving anything back is what sets it apart from both Google and people.

A website is generally built with a commercial purpose in mind.

Google helps people find content and provides traffic in return, which is mutually beneficial.

But it’s not like large language models asked your permission to use your content; they simply use it in a broader sense than was anticipated when your content was published.

And if AI language models don’t offer value in return, why should publishers allow them to crawl and use their content?

Does their use of your content comply with fair use standards?

When ChatGPT, and Google’s own ML/AI models, train on your content without permission, spin what they learn there, and use that while keeping people away from your websites – shouldn’t the industry, and lawmakers as well, try to take back control over the Internet by forcing them to transition to an ‘opt-in’ model?”

The concerns expressed by Hans Peter are reasonable.

In light of the rapid development of technology, should laws related to fair use be reviewed and updated?

I asked John Rizvi, a registered patent attorney (LinkedIn profile) who is board certified in intellectual property law, whether internet copyright laws are outdated.

John answered:

“Yes, without a doubt.

One of the main points of contention in such cases is the fact that law inevitably evolves much more slowly than technology.

In the nineteenth century this might not have mattered so much because progress was relatively slow and so legal mechanisms were more or less fitted to match.

Today, however, runaway technological progress has far outstripped the law’s ability to keep up.

There are simply too many developments and too many moving parts for the law to keep up.

Because the law is currently shaped and administered, in large part, by people who are not experts in the areas of technology we’re discussing here, it is poorly equipped or structured to keep pace with technology…and we must consider that this isn’t entirely a bad thing.

So, in one regard, yes, intellectual property law does need to evolve if it even purports, let alone hopes, to keep pace with technological progress.

The fundamental problem is striking a balance between keeping pace with the ways various forms of technology can be used while holding back from blatant overreach or outright censorship for political gain cloaked in benevolent intentions.

The law must also be careful not to legislate against potential uses of the technology so broadly as to stifle any potential benefit that might result from it.

You could easily run afoul of the First Amendment and any number of settled cases that circumscribe how, why, and to what degree intellectual property can be used and by whom.

And trying to anticipate every conceivable use of technology years or decades before the framework exists to make it viable or even feasible would be an exceedingly dangerous fool’s errand.

In situations like this, the law really cannot help but be reactive to how technology is used…not necessarily how it was intended to be used.

This is not likely to change anytime soon, unless we reach a massive and unexpected technological plateau that allows the law time to catch up with current events.”

So the issue of copyright law involves many considerations to balance when it comes to how AI is trained, and there is no simple answer.

OpenAI and Microsoft Sued

An interesting case filed recently is one in which OpenAI and Microsoft are alleged to have used open source code to create their Copilot product.

The problem with using open source code is that the Creative Commons license requires attribution.

According to an article published in a scholarly journal:

“The plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various ‘open source’-style licenses, many of which include an attribution requirement.

As GitHub says, ‘…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.’

The resulting product allegedly omitted any credit to the original creators.”

The author of that article, a legal expert on the subject of copyright, writes that many consider open source Creative Commons licenses to be a “free-for-all.”

Some may also consider the phrase “free-for-all” a fair description of the datasets of Internet content that are scraped and used to create AI products such as ChatGPT.

Background on LLMs and datasets

Large language models train on multiple datasets of content. Datasets can consist of emails, books, government data, Wikipedia articles, and even datasets created from websites linked to from Reddit posts that have at least three upvotes.

Many of the datasets built from Internet content have their origins in a crawl created by a non-profit organization called Common Crawl.

Their dataset, the Common Crawl dataset, is free to download and use.

The Common Crawl dataset is the starting point for many other datasets that are built from it.

For example, GPT-3 used a filtered version of Common Crawl (Language Models are Few-Shot Learners PDF).

This is how the GPT-3 researchers described using the website data contained in the Common Crawl dataset:

“Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset… constituting nearly a trillion words.

This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice.

However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets.

Therefore, we took 3 steps to improve the average quality of our datasets:

(1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,

(2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and

(3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.”
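
The filtering and deduplication steps described above can be sketched in a few lines of Python. This is not the GPT-3 pipeline itself, just a minimal illustration of the idea: word-shingle Jaccard similarity stands in for the paper’s classifier-based quality filter and its fuzzy-matching deduplication, and each document is assumed to be a plain string.

import re

def shingles(text, n=5):
    """Break a document into overlapping word n-grams ("shingles")."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_and_dedupe(documents, reference_text, min_quality=0.01, dup_threshold=0.8):
    """Keep documents that loosely resemble a high-quality reference text,
    then drop documents that are near-duplicates of ones already kept."""
    reference = shingles(reference_text)
    kept_docs, kept_shingles = [], []
    for doc in documents:
        doc_shingles = shingles(doc)
        # Crude quality filter: require some overlap with the reference set.
        if jaccard(doc_shingles, reference) < min_quality:
            continue
        # Fuzzy deduplication: skip documents too similar to one already kept.
        if any(jaccard(doc_shingles, kept) >= dup_threshold for kept in kept_shingles):
            continue
        kept_docs.append(doc)
        kept_shingles.append(doc_shingles)
    return kept_docs

In the paper itself, the quality filter is a classifier trained to separate the high-quality reference corpora from raw Common Crawl, and the fuzzy deduplication uses MinHash, but the overall flow is the same: filter for quality, then drop near-duplicate documents.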

Google’s C4 dataset (Colossal Clean Crawled Corpus), which was used to create the Text-To-Text Transfer Transformer (T5), also has its roots in the Common Crawl dataset.

Their research paper (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer PDF) explains:

“Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on.

We also introduce our approach for treating every problem as a text-to-text task and describe the “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based dataset we created as a source of unlabeled text data.

We refer to our model and framework as the ‘Text-to-Text Transfer Transformer’ (T5).”

Google published an article on its AI blog that also explains how Common Crawl data (which contains content scraped from the Internet) was used to create C4.

They wrote:

“An important ingredient for transfer learning is the unlabeled dataset used for pre-training.

To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive.

Previous pre-training datasets don’t meet all three of these criteria – for example, text from Wikipedia is high quality, but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse, but fairly low quality.

To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia.

Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.

This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training.”
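
The kind of heuristic cleaning described in that quote can also be illustrated with a short sketch. The rules below (keep only lines that end in terminal punctuation, require a minimum word count, drop duplicate lines, and discard pages containing flagged words) are a simplified approximation of C4-style filters rather than Google’s actual implementation, and the flagged-word list is a placeholder.

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
FLAGGED_WORDS = {"viagra", "casino"}  # placeholder list, not the real filter

def clean_page(text, min_words=3):
    """Apply simplified C4-style cleaning heuristics to one page of text."""
    kept_lines, seen = [], set()
    for line in text.splitlines():
        line = line.strip()
        # Discard incomplete sentences: keep only lines ending in terminal punctuation.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        # Drop very short lines, which are usually menus, buttons, or other boilerplate.
        if len(line.split()) < min_words:
            continue
        # Deduplicate repeated lines within the page.
        if line in seen:
            continue
        seen.add(line)
        kept_lines.append(line)
    page = "\n".join(kept_lines)
    # Remove the entire page if it contains offensive or noisy flagged terms.
    if any(word in page.lower() for word in FLAGGED_WORDS):
        return ""
    return page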

Google, OpenAI, and even Oracle’s Open Data use the content of the Internet, your content, to create datasets that are then used to build AI applications such as ChatGPT.

Common Crawl can be blocked

It is possible to block Common Crawl and thereby opt out of all the datasets that are based on the Common Crawl data.

But if the site has already been crawled, then the website data is already in the datasets. There is no way to remove your content from the Common Crawl dataset or from any of the other derived datasets such as C4 and Open Data.

Using the robots.txt protocol will only block future crawls by Common Crawl; it will not prevent researchers from using content that is already in the dataset.

How to block Common Crawl from crawling your data

Common Crawl can be blocked through the use of the robots.txt protocol, within the limitations discussed above.

The Common Crawl bot is called CCBot.

It is identified using the most up-to-date CCBot user-agent string: CCBot/2.0

Blocking CCBot is implemented with robots.txt in the same way as for any other bot.

Below is the code to block CCBot with Robots.txt.

User-agent: CCBot
Disallow: /

CCBot crawls from Amazon AWS IP addresses.

CCBot also follows the nofollow Robots meta tag:

<meta name="robots" content="nofollow">
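
To confirm that a robots.txt rule actually excludes CCBot, you can test it with Python’s built-in robots.txt parser. Below is a minimal check using the rules shown above and a hypothetical example.com domain.

from urllib.robotparser import RobotFileParser

# The robots.txt rules shown above, for a hypothetical example.com site.
robots_txt = """User-agent: CCBot
Disallow: /"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# CCBot is blocked from every URL; other crawlers are unaffected.
print(parser.can_fetch("CCBot/2.0", "https://example.com/any-page.html"))   # False
print(parser.can_fetch("Googlebot", "https://example.com/any-page.html"))   # True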

What if you don’t block Common Crawl?

Web content can be downloaded without permission; that is how browsers work: they download content.

Neither Google nor anyone else needs permission to download and use content that is published publicly.

Website publishers have limited options

Considering whether it is ethical to train AI on web content does not seem to be part of any conversation about the ethics of how AI technology is developed.

It seems to be taken for granted that Internet content can be downloaded, summarized, and turned into a product called ChatGPT.

Does this sound fair? The answer is complicated.

Featured image by Shutterstock / Krakenimages.com
