OpenAI and Google trained AI models on YouTube videos

Both OpenAI and Google turned to transcribing YouTube videos to further train their AI models, which may violate creators’ copyrights, the New York Times reports. The report details how the two tech giants, along with Meta, cut corners to access as much data as possible to train their AI models.

According to the report, OpenAI used Whisper, a speech recognition tool, to transcribe more than one million hours of YouTube videos. It then fed the transcripts into GPT-4, the powerful AI system underlying the latest version of the ChatGPT chatbot. Google, which owns YouTube, also transcribed YouTube videos to train its AI models.

The transcription of videos by both companies may infringe on creators’ copyrights to their videos. Other uses of creator content to train AI have prompted copyright and licensing lawsuits.

OpenAI’s use of YouTube videos may also violate Google’s rules, which prohibit the use of its videos for “independent” applications and “automated means (such as robots, botnets or scrapers)” of accessing its videos.

Matt Bryant, a spokesperson for Google, told the New York Times that the company was unaware of any such use by OpenAI. But the report alleges that people at Google knew about OpenAI’s unauthorized use of YouTube videos and neglected to take action because Google was doing the same thing. Google also told the paper that it only trains its AI on videos from creators who have agreed to their content being used in this manner.

In July 2023, Google changed its terms of service to allow the use of public online material like Google Docs and Google Maps restaurant reviews to further train its AI models.