According to reports, OpenAI scraped data from YouTube to train Whisper, its speech-to-text AI model.
Using YouTube
Some of the training data derived from Whisper ultimately contributed to the development of GPT-4, the language model behind ChatGPT.
According to a report in The Information, OpenAI “has secretly used data from the site (YouTube) to train some of its artificial intelligence models”.
AI models need tons of data for training and YouTube is the single biggest and richest source of imagery, audio and text transcripts on the web.
Google’s Gemini
Google researchers have also been using YouTube data to train and refine the company’s own large language model, Gemini.
Sundar Pichai, the CEO of Google, noted, “Gemini was created from the ground up to be multimodal, highly efficient at tool and API integrations, and built to enable future innovations, like memory and planning.”
He further added that it offers “impressive capabilities not seen in prior models.”
The value of video content for AI training purposes has also been acknowledged by Meta.
Using video data
Yann LeCun, the AI chief at Meta Platforms, has emphasized the significance of video training data in his work.
LeCun stated that a hierarchical Joint Embedding Predictive Architecture could potentially learn about the world by watching videos and interacting with its environment.
His point highlights the importance of video in enabling AI models to “think” more like humans, as opposed to relying solely on text data for training.
Violates rules
YouTube does not permit use of its data for such purposes.
Its terms of service ban using content for anything other than “personal, non-commercial use.”
Hence, training a commercially oriented AI model using such content could potentially violate the site’s rules.
Controversy
It’s an open secret in the AI industry that companies scrape the web, and OpenAI reportedly “scraped” YouTube data to train its AI models, which are now hugely popular around the world.
This has provoked debates and disputes as major technology companies increasingly move to improve their AI capabilities or offer AI-powered services.
Despite lawsuits accusing text-to-image generator firms of violating artists’ copyright, large language models continue to be developed in secrecy, with no transparency about the content of their training data.