Google Bard AI – What Sites Were Used To Train It?

Google’s Bard is based on the LaMDA language model, which is trained on a dataset of Internet content called Infiniset, about which very little is known regarding where the data came from and how it was obtained.

The 2022 LaMDA paper lists the percentages of the different kinds of data used to train LaMDA, but only 12.5% comes from a general dataset of content crawled from the web, and another 12.5% comes from Wikipedia.

Google is intentionally vague about where the rest of the scraped data came from, but there are hints as to what sites are in those datasets.

Google’s Infiniset dataset

Google Bard is based on a language model called LaMDA, which is short for Language Model for Dialogue Applications.

LaMDA is trained on a dataset called Infiniset.

Infiniset is a mixture of Internet content intentionally chosen to enhance the model’s ability to engage in dialogue.

The LaMDA research paper (PDF) explains why they chose this composition of content:

“…This composition was chosen to achieve more robust performance on dialogue tasks…while still keeping its ability to perform other tasks such as code generation.

As future work, we can study how the choice of this composition may affect the quality of some of the other NLP tasks performed by the model.”

The research paper refers to “dialog” and “dialogs,” which is the spelling of those words used in this context, within the realm of computer science.

In all, LaMDA was pre-trained on 1.56 trillion words of “public dialog data and web text.”

The dataset consists of the following mix:

  • 12.5% data based on C4
  • 12.5% English-language Wikipedia
  • 12.5% code documents from programming Q&A websites, tutorials, and others
  • 6.25% English web documents
  • 6.25% non-English web documents
  • 50% dialogue data from public forums
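Because the paper reports a 1.56 trillion-word total, the per-slice percentages can be turned into rough word counts. The sketch below is simple arithmetic on the published figures; Google has not released per-slice counts directly, so the results are only approximations.

```python
# Approximate word counts per Infiniset slice, derived from the
# 1.56 trillion-word total and percentage mix in the LaMDA paper.
TOTAL_WORDS = 1.56e12  # 1.56 trillion pre-training words

mix = {
    "C4": 0.125,
    "English Wikipedia": 0.125,
    "Code documents (Q&A sites, tutorials)": 0.125,
    "English web documents": 0.0625,
    "Non-English web documents": 0.0625,
    "Dialogue data from public forums": 0.50,
}

for source, share in mix.items():
    # Express each slice in billions of words for readability.
    print(f"{source}: ~{share * TOTAL_WORDS / 1e9:.0f} billion words")
```

By this arithmetic, the forum-dialogue slice alone would be on the order of 780 billion words.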

The first two parts of Infiniset (C4 and Wikipedia) consist of known data.

The C4 dataset, which is explored below, is a specially filtered version of the Common Crawl dataset.

Only 25% of the data comes from a named source (the C4 dataset and Wikipedia).

The rest of the data, the bulk of the Infiniset dataset at 75%, consists of words that were scraped from the Internet.

The paper doesn’t say how the data was obtained from websites, what sites it was obtained from, or any other details about the scraped content.

Google only uses generic descriptions such as “non-English web documents”.

Murky, meaning something unexplained and mostly hidden, is the word that best describes the 75% of data that Google used to train LaMDA.

There are some clues that may give a general idea of the sites whose content makes up that 75%, but we can’t know for certain.

C4 dataset

C4 is a dataset developed by Google in 2020. C4 stands for “Colossal Clean Crawled Corpus.”

This dataset is based on Common Crawl data, which is an open-source dataset.

About Common Crawl

Common Crawl is a registered non-profit organization that crawls the Internet on a monthly basis to create free datasets that anyone can use.

The Common Crawl organization is currently run by people who have worked for the Wikimedia Foundation, former Googlers, and a founder of Blekko, and counts among its advisors Peter Norvig, Director of Research at Google, and Danny Sullivan (also of Google).

How C4 was developed from Common Crawl

The raw Common Crawl data is cleaned up by removing things like thin content, obscene words, lorem ipsum placeholder text, and navigational menus, along with deduplication, in order to limit the dataset to the main content.

The goal of filtering out unnecessary data was to remove nonsense and retain examples of natural English.
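As an illustration only, the cleaning steps described above can be sketched as a few simple heuristics. This is not the actual C4 pipeline; the blocklist, word threshold, and function name below are invented for demonstration.

```python
# Illustrative C4-style cleaning: drop placeholder text and blocklisted
# terms, drop very short "thin" pages, and deduplicate exact matches.
def clean_corpus(pages, blocklist=("lorem ipsum",), min_words=5):
    seen = set()   # lowercased pages already kept, for deduplication
    kept = []
    for page in pages:
        text = page.strip()
        lowered = text.lower()
        if any(term in lowered for term in blocklist):
            continue  # placeholder or obscene content
        if len(text.split()) < min_words:
            continue  # thin content (e.g., navigation menus)
        if lowered in seen:
            continue  # duplicate page
        seen.add(lowered)
        kept.append(text)
    return kept
```

Running it on a few toy pages keeps a natural sentence while dropping a lorem-ipsum page, a short navigation menu, and a duplicate.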

This is what the researchers who created C4 wrote:

“To assemble our base dataset, we downloaded the web-extracted text from April 2019 and applied the aforementioned filtering.

This produces a collection of text that is not only orders of magnitude larger than most datasets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text.

We dub this dataset the ‘Colossal Clean Crawled Corpus’ (or C4 for short) and release it as part of TensorFlow Datasets…”

There are also other unfiltered versions of C4.

The paper describing the C4 dataset is titled Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (PDF).

Another research paper from 2021 (Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus – PDF) examined the makeup of the sites included in the C4 dataset.

Interestingly, the second research paper discovered anomalies in the original C4 dataset that resulted in the removal of webpages aligned with Hispanic and African American content.

Hispanic-aligned webpages were removed by the blocklist filter (for expletives, etc.) at a rate of 32% of pages.

African American-aligned webpages were removed at a rate of 42%.

Presumably, those shortcomings have since been addressed…

Another finding was that 51.3% of the C4 dataset consisted of web pages hosted in the United States.

Finally, the 2021 analysis of the original C4 dataset acknowledges that the dataset represents just a fraction of the total Internet.

The analysis states:

“Our analyses show that while this dataset represents a significant fraction of the public internet, it is by no means representative of the English-speaking world, and it spans a wide range of years.

When building a dataset from a scrape of the web, reporting the domains the text is scraped from is integral to understanding the dataset; the data collection process can lead to a significantly different distribution of internet domains than one would expect.”
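The kind of domain accounting the analysis calls for is straightforward to sketch: tally which hosts the scraped URLs come from. The example below uses made-up URLs, not actual C4 entries.

```python
# Tally the host (netloc) of each scraped URL, the basic bookkeeping
# needed to report which domains a web-scrape dataset draws from.
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    return Counter(urlparse(u).netloc for u in urls)

# Hypothetical example URLs for illustration only.
urls = [
    "https://en.wikipedia.org/wiki/Language_model",
    "https://en.wikipedia.org/wiki/Chatbot",
    "https://www.nytimes.com/2022/technology/ai.html",
]
print(domain_counts(urls).most_common())
```

Aggregating these counts over an entire crawl is what produces domain rankings like the C4 top-25 list below.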

The following statistics on the C4 dataset are from the second research paper linked above.

The top 25 websites (by number of tokens) in C4 are:

  1. patents.google.com
  2. en.wikipedia.org
  3. en.m.wikipedia.org
  4. www.nytimes.com
  5. www.latimes.com
  6. www.theguardian.com
  7. journals.plos.org
  8. www.forbes.com
  9. www.huffpost.com
  10. patents.com
  11. www.scribd.com
  12. www.washingtonpost.com
  13. www.fool.com
  14. ipfs.io
  15. www.frontiersin.org
  16. www.businessinsider.com
  17. www.chicagotribune.com
  18. www.booking.com
  19. www.theatlantic.com
  20. link.springer.com
  21. www.aljazeera.com
  22. www.kickstarter.com
  23. caselaw.findlaw.com
  24. www.ncbi.nlm.nih.gov
  25. www.npr.org

These are the top 25 represented top-level domains in the C4 dataset:

Screenshot from Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

If you are interested in learning more about the C4 dataset, I recommend reading Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (PDF), as well as the original 2020 research paper (PDF) for which C4 was created.

What could the dialogue data from public forums be?

50% of the training data comes from “dialogs data from public forums.”

That’s all Google’s LaMDA research paper says about this training data.

If one had to guess, Reddit and other top communities like Stack Overflow are safe bets.

Reddit is used in many important datasets, such as one developed by OpenAI called WebText2 (PDF), an open-source approximation of WebText2 called OpenWebText2, and Google’s own WebText-like (PDF) dataset from 2020.

Google also published details of another dataset of public dialogue sites a month before the publication of the LaMDA paper.

The dataset that contains public dialogue sites is called MassiveWeb.

To be clear, there is no indication that the MassiveWeb dataset was used to train LaMDA.

But it does contain a good example of what Google chose for another language model focused on dialogue.

MassiveWeb was created by DeepMind, which is owned by Google.

It was designed for use by a large language model called Gopher (link to PDF of the research paper).

MassiveWeb uses dialogue web sources beyond Reddit in order to avoid creating a bias toward Reddit-influenced data.

It still uses Reddit, but it also contains data scraped from many other sites.

The public dialogue sites listed on MassiveWeb are:

  • Reddit
  • Facebook
  • Quora
  • YouTube
  • Medium
  • StackOverflow

Again, this is not to suggest that LaMDA was trained on the above sites.

It’s just meant to show what Google could have used, by way of a dataset Google was working on around the same time as LaMDA, one that contains forum-type sites.

The remaining 37.5%

The final set of data sources are:

  • 12.5% code documents from sites related to programming, such as Q&A sites, tutorials, etc.
  • 12.5% Wikipedia (English)
  • 6.25% English web documents
  • 6.25% non-English web documents

Google does not specify which sites make up the programming Q&A sites category that accounts for 12.5% of the dataset LaMDA trained on.

So we can only speculate.

Stack Overflow and Reddit seem like obvious choices, especially since they’re both included in the MassiveWeb dataset.

What sites were crawled for the “tutorials” category? We can only speculate as to what those tutorial sites might be.

That leaves the final three categories of content, two of which are exceedingly vague.

English-language Wikipedia needs no discussion; we all know Wikipedia.

But the following two are not explained:

English and non-English language web documents are a general description of 13% of the sites included in the database.

That is all the information Google provides about this part of the training data.

Should Google be transparent about the datasets used for Bard?

Some publishers are uncomfortable with their sites being used to train AI systems because, in their opinion, those systems could eventually make their websites obsolete.

Whether or not this is true remains to be seen, but it is a real concern expressed by publishers and members of the search marketing community.

Google is frustratingly vague about the websites used to train LaMDA as well as the technology used to scrape websites for data.

As noted in the analysis of the C4 dataset, the methodology for selecting website content to be used to train large language models can influence the quality of the language model by excluding specific populations.

Should Google be more transparent about the sites used to train its AI or at least publish an easy-to-find transparency report about the data that was used?

Featured image by Shutterstock / Asier Romero
