AI has a voracious hunger for data and is willing to do anything (absolutely anything) to satisfy it

Post by **tasnimsanika00** » Sat Dec 28, 2024 8:13 am

Data for training artificial intelligence (AI) models is the new oil of the internet, and paradoxically it is becoming increasingly scarce, because demand is outstripping the resources currently available. In fact, tech giants such as Meta, Google and OpenAI are desperate to obtain copyright-free data to train their AI systems. The desperation is such that Meta would reportedly even be willing to face lawsuits, and the fines arising from them, in order to feed (illegally) on copyright-protected data.

According to a recent study by AI specialist Epoch, the demand for quality data to train artificial intelligence models is so enormous that such data could run out by 2026. All the more reason for major technology companies to have desperate measures in the pipeline to deal with the data shortage.

Meta
At Facebook and Instagram's parent company, the hunger for data is so insatiable that some of its most prominent leaders are said to have held daily meetings in 2023 to address the data deficit, according to Business Insider.

One of the ideas that Meta is said to have put on the table to ease its data obsession is the purchase of the American publishing house Simon & Schuster, which the investment firm KKR acquired in October 2023 for $1.62 billion. Had Meta completed such an acquisition, it would have had legal access to the texts of the books published by Simon & Schuster (assuming that the corresponding agreements had previously been reached with the authors).

Meta has also reportedly considered paying $10 for each Simon & Schuster book used to train its AI systems.

Currently, the company led by Mark Zuckerberg is said to have several poorly paid employees in Africa working hard to write summaries of fiction and non-fiction books to train Meta's AI models.

This ploy is clearly dubious from a copyright standpoint, but Meta executives have reportedly argued that they simply have no other alternative and are even willing to deal with potential lawsuits.

OpenAI
Knowing that it needs massive amounts of data to train its AI models, ChatGPT's parent company has developed Whisper, a speech recognition tool that can transcribe the audio of videos and podcasts into text. Using this system, OpenAI has reportedly sourced more than a million hours of content from YouTube. Whether the company led by Sam Altman has actually used this content to train its AI models is currently the subject of heated debate. According to The New York Times, Mira Murati, OpenAI's chief technology officer, says she is not at all certain on this point.

For his part, Neal Mohan, CEO of YouTube, does not want to make specific accusations against OpenAI, but he does make it clear that, if ChatGPT's parent company had actually used content from YouTube to train its AI models, this would have constituted "a clear violation of our terms of use."

Google
The Mountain View company has been working since the middle of last year to secure the rights to use data generated by its own users to train its AI systems. Data from the free versions of Google Docs, Google Sheets and Google Slides, and even restaurant reviews on Google Maps, could end up feeding Google's AI models. However, the internet giant has not yet made the adjustments to its privacy policy, last updated in July 2023, that it would need in order to use this data in its AI models.

With the demand for data so overwhelming and the supply so meager, tech companies are also considering relying on artificially generated text to train their AI systems. It is not for nothing that OpenAI is working on so-called "synthetic data," as revealed by its CEO Sam Altman. "If an AI model was smart enough to generate quality synthetic data, that would be really fantastic," he says. The problem is that if AI models train on their own output, the errors that emerge from such systems could eventually multiply. For this reason, OpenAI is developing a system in which one AI generates synthetic data and a second AI checks the results.