The Unseen Data Harvesting by Tech Giants for AI Development

Gurpreet Singh

20 MIN TO READ

July 29, 2024

The New York Times has published a ground-breaking exposé detailing the extreme measures taken by digital behemoths such as OpenAI, Google, and Meta to gather the enormous volumes of data necessary to advance artificial intelligence.

In their race for AI supremacy, these companies have taken steps that violate corporate policies, straddle ethical lines, and may even skirt the law. Besides exposing these techniques, the report has sparked important debates about ethical technology use and privacy.

Tech organizations are pushing the boundaries of data collection as AI development services continue to advance and the demand for comprehensive data sets grows more pressing. This aggressive approach to data gathering highlights the risks involved with unrestricted data collection and the competitive nature of the AI sector.

The New York Times’ investigation is a sobering reminder that strict regulation and ethical standards are needed to ensure that the pursuit of AI advancement does not compromise people’s rights.

The Need To Train

In late 2021, OpenAI ran into a supply problem. While working on its latest AI system, the artificial intelligence lab had exhausted every credible English-language text source available online. To train the next iteration of its technology, it needed a huge amount of additional data.

So OpenAI researchers built Whisper, a speech recognition tool. It could transcribe the audio from YouTube videos, producing fresh conversational text that would make an AI system smarter.

Giants Cutting Corners

Three people with knowledge of the conversations said that some OpenAI staff members discussed how such a move might violate YouTube’s policies. YouTube, which is owned by Google, forbids the use of its videos for applications that are “independent” of the video platform.

According to the people, an OpenAI team ultimately transcribed more than a million hours of YouTube videos. Two of them said Greg Brockman, OpenAI’s president, personally helped collect the videos for the team. The transcripts were then fed into GPT-4, which was regarded as one of the most powerful AI models in the world and served as the foundation for the latest version of the ChatGPT chatbot.

The New York Times examined how internet giants such as Google, OpenAI, and Meta have advanced AI technology by sidestepping corporate policies and exploiting legal gray areas in a frantic rush for digital data.

Records of internal meetings acquired by The Times reveal that last year, management, attorneys, and engineers at Meta, the company that owns Facebook and Instagram, contemplated purchasing Simon & Schuster, a publishing house, to acquire long works. Additionally, they discussed the necessity of obtaining copyrighted information from the internet, even at the risk of legal action. They claimed it would take too long to negotiate licensing with news organizations, publishers, musicians, and artists.

Five people familiar with Google’s practices said the company also transcribed YouTube videos to gather text for its AI models, which may have violated the copyrights of the videos’ creators.

Gathering Data From Other Products

Google also expanded its terms of service last year. Members of the company’s privacy team and an internal memo seen by The Times both said that one reason for the change was to provide Google access to additional online content for its artificial intelligence products, such as restaurant ratings on Google Maps, publicly accessible Google Docs, and other online content.

The companies’ activities show how online content, including news items, fiction, message boards, Wikipedia entries, computer programs, images, podcasts, and movie clips, has become essential to the rapidly expanding AI industry. Building these systems depends on having enough data to train them to quickly produce text, images, sounds, and videos that mimic what a human creates.

Prominent chatbots have learned from digital text repositories containing up to three trillion words, roughly twice the number of words stored in Oxford University’s Bodleian Library, which has been collecting manuscripts since 1602. The most highly valued data for artificial intelligence is high-quality information, such as published books and articles that have been carefully written and edited by professionals.

For years, the internet, with sites like Reddit and Wikipedia, seemed an almost limitless source of data. However, as AI tools advanced, tech companies looked for additional repositories. Google and Meta, whose billions of users create search queries and social media posts every day, were largely prevented by privacy regulations and their own policies from using much of that content for artificial intelligence.

The Need For More Data

AI development companies need more data. According to research institute Epoch, tech corporations may exhaust the high-quality data available on the Internet as early as 2026. Businesses are utilizing data more quickly than it is being generated.

Sy Damle, a lawyer who represents the Silicon Valley venture capital firm Andreessen Horowitz, said in a public discussion about copyright law last year that AI models can only be useful if they are trained on vast amounts of data without having to license it. “Even collective licensing can’t work because the amount of data required is so enormous.”

Because tech businesses are always looking for fresh data, some are developing “synthetic” data: text, images, and code generated by AI models rather than created by humans. In other words, the systems learn from their own generated content.

Copyrights and Lawsuits

Companies’ growing use of creative works to train AI has resulted in lawsuits over copyright and licensing. Last year, The Times sued OpenAI and Microsoft for training AI chatbots on copyrighted news stories without authorization. OpenAI and Microsoft say their use of the articles is “fair use” under copyright law.

Last year, the Copyright Office received comments from over 10,000 trade associations, writers, and businesses regarding the use of creative works by AI models. Author, filmmaker, and former actress Justine Bateman called it the worst theft in US history, saying AI models were taking content without consent or payment.

Is Scaling All You Need?

The demand for internet data grew after Jared Kaplan, a theoretical physicist at Johns Hopkins University, published a groundbreaking paper on artificial intelligence in January 2020.

His conclusion was clear: a large language model, the technology that powers online chatbots, performs better the more data it is trained on. Large language models become more accurate and better at identifying patterns in text as they take in more material, much as a student gains knowledge by reading more books.

In the paper, published with nine other OpenAI researchers, Dr. Kaplan showed that the scaling laws, or trends, were as precise as those observed in physics or astronomy. The finding turned the phrase “Scale is all you need” into an AI catchphrase.
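To make the idea of a scaling trend concrete, here is a minimal Python sketch of a power-law relationship between the amount of training data and a model's error. It only illustrates the general shape of such a trend. The function name `predicted_loss` and the constants `scale` and `exponent` are illustrative assumptions, not values taken from Dr. Kaplan's paper.

```python
def predicted_loss(tokens: float, scale: float = 1e13, exponent: float = 0.1) -> float:
    """Return a power-law loss that falls smoothly as the training set grows.

    `scale` and `exponent` are hypothetical constants chosen for illustration.
    """
    return (scale / tokens) ** exponent


if __name__ == "__main__":
    # Each step multiplies the amount of training data by ten.
    for tokens in (3e9, 3e10, 3e11, 3e12):
        print(f"{tokens:.0e} tokens -> predicted loss {predicted_loss(tokens):.3f}")
```

Under these assumed constants, every tenfold increase in data multiplies the predicted loss by the same factor (about 0.79), which is the kind of regularity that makes "just add more data" look like a dependable strategy.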

Size is Relative

To build AI applications, researchers have long exploited massive public databases of digital information, such as Wikipedia and Common Crawl, a database of more than 250 billion web pages gathered since 2007. Before utilizing the data to train AI models, researchers frequently “cleaned” it by eliminating hate speech and other undesirable content.
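As a rough illustration of what this kind of cleaning can involve, the Python sketch below drops duplicate documents, very short documents, and documents containing blocked terms. It is a toy under stated assumptions: the names `BLOCKED_TERMS`, `is_clean`, and `clean_corpus` are hypothetical, the blocklist entries are placeholders, and the duplicate and length checks are added for illustration. Production pipelines typically rely on far larger lists and trained classifiers.

```python
import re

# Hypothetical placeholder blocklist; real pipelines use much larger lists
# and machine-learned classifiers rather than simple word matching.
BLOCKED_TERMS = {"blockedterm1", "blockedterm2"}


def is_clean(document: str, min_words: int = 5) -> bool:
    """Reject documents that are too short or contain a blocked term."""
    words = re.findall(r"[a-z0-9']+", document.lower())
    if len(words) < min_words:
        return False
    return not any(word in BLOCKED_TERMS for word in words)


def clean_corpus(documents):
    """Normalize whitespace, drop exact duplicates, and keep only clean documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        normalized = " ".join(doc.split())
        if normalized in seen:
            continue
        seen.add(normalized)
        if is_clean(normalized):
            cleaned.append(normalized)
    return cleaned
```

For example, passing in a list that contains the same sentence twice, a two-word fragment, and a sentence with a blocked term would return only the single clean sentence.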

By today’s standards, data sets in 2020 were small. At the time, one database with 30,000 images from the photo website Flickr was considered an essential resource.

That level of information was insufficient following Dr. Kaplan’s publication. According to Brandon Duderstadt, CEO of New York-based artificial intelligence company Nomic, the focus shifted to “just making things really big.”

GPT-3, which OpenAI unveiled in November 2020, was trained on the most data at the time: 300 billion “tokens,” which are essentially words or word fragments. After learning from the data, the system produced text with amazing precision; it was able to write poems, blog entries, and computer programs.
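To show what "words or word fragments" means in practice, here is a toy Python tokenizer. It is only a stand-in: production language models learn their subword vocabulary from data (for example with byte-pair encoding), whereas this sketch simply cuts long words into fixed-length pieces. The function name `toy_tokenize` and the `max_piece` parameter are assumptions made for illustration.

```python
import re


def toy_tokenize(text: str, max_piece: int = 4) -> list[str]:
    """Split text into words and punctuation, then cut long words into fragments."""
    tokens = []
    for piece in re.findall(r"\w+|[^\w\s]", text):
        # Slice each word into chunks of at most `max_piece` characters.
        for start in range(0, len(piece), max_piece):
            tokens.append(piece[start:start + max_piece])
    return tokens


if __name__ == "__main__":
    print(toy_tokenize("Transcribing videos!"))
```

The point of the example is that a token count is usually larger than a word count, because longer words are split into several pieces.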

DeepMind, Google’s AI research division, took things a step further in 2022. It tested 400 AI models, varying the quantity of training data and other factors. The best-performing models used even more data than Dr. Kaplan had predicted in his paper. One model, Chinchilla, was trained on 1.4 trillion tokens.

Pushing further, Chinese researchers in 2023 unveiled Skywork, an AI model trained on 3.2 trillion tokens of English and Chinese text. Google went further still with PaLM 2, an AI system trained on more than 3.6 trillion tokens.
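The growth is easier to appreciate as a ratio. The short Python sketch below takes the token counts quoted above and expresses each one as a multiple of GPT-3's training set. The dictionary `TRAINING_TOKENS` and the helper `growth_vs_baseline` are names introduced here for illustration.

```python
# Training-set sizes in tokens, as quoted in this article.
TRAINING_TOKENS = {
    "GPT-3": 300e9,
    "Chinchilla": 1.4e12,
    "Skywork": 3.2e12,
    "PaLM 2": 3.6e12,
}


def growth_vs_baseline(counts, baseline="GPT-3"):
    """Express every model's token count as a multiple of the baseline model's."""
    base = counts[baseline]
    return {name: tokens / base for name, tokens in counts.items()}


if __name__ == "__main__":
    for name, multiple in growth_vs_baseline(TRAINING_TOKENS).items():
        print(f"{name}: about {multiple:.1f}x GPT-3's training data")
```

On these figures, Chinchilla used roughly 4.7 times as much data as GPT-3, Skywork roughly 10.7 times, and PaLM 2 about 12 times.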

More Data, Video Data

OpenAI CEO Sam Altman acknowledged in May 2023 that AI companies would eventually consume all of the internet’s usable data. “That will run out,” he said during a talk at a tech conference.

Mr. Altman had seen the phenomenon up close. For years, researchers at OpenAI collected data, cleaned it, and fed it into a vast corpus of text used to train the company’s language models. They mined the computer code repository GitHub, collected databases of chess moves, and drew on data describing high school exams and homework assignments from the website Quizlet.

According to the Times report, those supplies had run out by late 2021, and OpenAI desperately needed more data to build its next-generation AI model, GPT-4. According to the people, staff members discussed transcribing YouTube videos, audiobooks, and podcasts, using AI systems to create data from scratch, and buying startups that had amassed large collections of digital data.

Eventually, OpenAI built Whisper, the speech recognition tool, to transcribe podcasts and YouTube videos, according to six sources. However, YouTube forbids accessing its videos by “any automated means (such as robots, botnets, or scrapers)” in addition to prohibiting their use for “independent” applications.

According to the sources, OpenAI employees knew they were entering a legal gray area. Still, they believed it was acceptable to use the videos to train AI. Mr. Brockman, the president of OpenAI, was listed in a research paper as one of Whisper’s creators. According to two sources, he personally helped gather YouTube videos and feed them into the system.

Last year, OpenAI released GPT-4, which drew on Whisper’s transcriptions of more than a million hours of YouTube videos. Mr. Brockman led the team that developed GPT-4.

Some Google staff members knew that OpenAI was using YouTube video transcripts but chose not to intervene, because Google had also transcribed YouTube videos to train its own AI models. That practice may have violated the copyrights of YouTube creators, and if Google raised a fuss about OpenAI, it might spark a backlash against its own methods.

Raising the Data at Meta

Mark Zuckerberg, the CEO of Meta, had been investing in artificial intelligence for years, but he suddenly found himself behind when OpenAI unveiled ChatGPT in 2022. According to three current and former employees, he immediately pushed to match and surpass ChatGPT, calling executives and engineers at all hours of the night to press them to develop a rival chatbot.

However, Meta faced the same obstacle as its competitors by the beginning of last year: the shortage of data.

Recordings of internal meetings show that Meta’s generative AI team had developed a model using many English-language books, essays, poems, and news stories. To match ChatGPT, the team needed more data. To tackle the problem, Meta’s engineers, lawyers, and business development staff met every day in March and April 2023.

They discussed buying Simon & Schuster, paying $10 a book for full licensing rights, and how works had been summarized without their authors’ consent. They also talked about gathering more copyrighted material from the internet, even at the risk of lawsuits. One lawyer raised ethical concerns about taking artists’ intellectual property, but no one responded.

Meta could not freely draw on its vast volume of user posts, in part because of privacy changes it made after a 2018 scandal over sharing user data with Cambridge Analytica. The company is also alleged to have hired contractors in Africa to compile summaries of fiction and nonfiction, including works protected by copyright. Meta executives asserted that OpenAI had used copyrighted material without permission and that it would take Meta too long to negotiate licenses with news organizations, publishers, musicians, and artists.

Meta’s lawyers contended that using data to train AI systems should be considered fair use, citing the 2015 ruling in Authors Guild v. Google, which allowed Google to scan, digitize, and catalogue books in an online database.

Two employees expressed concerns about the use of intellectual property and about not paying writers and artists fairly. In discussions with senior executives, including chief product officer Chris Cox, the ethics of using other people’s creative works was not addressed.

Going Synthetic in Data

To combat the impending data shortage, Sam Altman, the CEO of OpenAI, has suggested training AI systems on synthetic data, meaning text produced by AI. This would lessen the systems’ reliance on copyrighted data by letting them generate additional data from which to improve themselves. Mr. Altman has said that as long as a model is smart enough to produce high-quality synthetic data, everything will work out.

Building an AI system that can educate itself is difficult, though, because the system may reinforce its own flaws, errors, and limits. In response, researchers at OpenAI and others are examining the possibility of combining two distinct AI models to produce more trustworthy and valuable synthetic data.

While one system generates the data, another evaluates it to distinguish the good from the bad. Although experts are split on whether this approach would succeed, leaders in AI are moving forward.
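The generate-then-evaluate idea described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how OpenAI or any other lab implements it: `generate_candidates` stands in for a text-generating model, `judge` stands in for a second model that scores quality (here, crudely, by how little a sample repeats itself), and the `threshold` value is arbitrary.

```python
def generate_candidates(prompt: str, n: int) -> list[str]:
    """Stand-in generator: a real system would sample text from a language model."""
    fillers = ["clear", "detailed", "", "repetitive repetitive repetitive"]
    return [f"{prompt}: {fillers[i % len(fillers)]} answer {i}" for i in range(n)]


def judge(sample: str) -> float:
    """Stand-in evaluator: score a sample by its share of distinct words (0 to 1)."""
    words = sample.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def build_synthetic_set(prompt: str, n: int = 8, threshold: float = 0.8) -> list[str]:
    """Keep only the generated samples that the evaluator scores at or above threshold."""
    return [s for s in generate_candidates(prompt, n) if judge(s) >= threshold]


if __name__ == "__main__":
    for sample in build_synthetic_set("Explain tokens"):
        print(sample)
```

The design choice worth noting is the separation of roles: because the evaluator is independent of the generator, it can in principle filter out the generator's own repeated mistakes instead of reinforcing them.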

Conclusion

In conclusion, the covert data collection tactics that internet companies use to drive AI development raise major ethical and privacy concerns. Although these techniques advance AI, they also raise important questions about data security and user consent.

AI consulting companies like Debut Infotech support ethical and transparent data usage. As AI develops, companies and consumers alike must uphold responsible data practices that protect privacy and build trust in technological innovation.

Frequently Asked Questions

Q. What is data harvesting, and how do tech giants use it for AI development?

A. Data harvesting involves collecting large volumes of data from various sources, often without explicit user consent. Tech giants use this data to train AI algorithms, improve machine learning models, and enhance the performance of their AI applications.

Q. How do tech companies collect data for AI purposes?

A. Tech businesses utilize various methods to gather data, including internet tracking, app usage monitoring, social media analysis, cookie usage, and data from publicly accessible records, linked devices, and independent brokers.

Q. Is AI data harvesting by tech giants legal?

A. The legality of data collection depends on the specific procedures and jurisdiction, with some techniques adhering to privacy laws while others may violate them. New laws and increased scrutiny aim to prevent such activities and protect user privacy.

Q. What are the risks associated with data harvesting for AI development?

A. Potential privacy violations, unauthorized access to private data, and improper use of personal information are among the risks. Extensive data collection might also result in less accountability and openness regarding the usage and storage of user data.

Q. How can users protect their data from being harvested by tech giants?

A. Users can take several steps to safeguard their data, including adjusting privacy settings on their devices and platforms, using encryption software, regularly deleting cookies and browsing history, and exercising caution when granting permissions to websites and apps.

Q. What role do regulations play in controlling tech giants’ data harvesting?

A. The California Consumer Privacy Act (CCPA) in the US and the General Data Protection Regulation (GDPR) in Europe aim to protect user privacy and limit data collection, ensuring businesses disclose their methods openly and transparently.

Q. What are the ethical considerations surrounding data harvesting for AI development?

A. Ethical responsibilities include user consent, data transparency, privacy protection, and data exploitation prevention. Businesses must balance ethical norms and individual rights while utilizing AI advancements.