A YouTube creator based in Massachusetts, David Millette, has filed a class-action lawsuit against OpenAI, alleging that the company trained its generative AI models on transcripts of millions of YouTube videos without informing or compensating the videos’ owners. Millette, represented by the law firm Bursor & Fisher, is seeking a jury trial and more than $5 million in damages on behalf of all YouTube users whose data may have been used by OpenAI for training.
According to the complaint, filed in the U.S. District Court for the Northern District of California, OpenAI surreptitiously transcribed Millette’s and other creators’ videos to train its AI-powered chatbot platform, ChatGPT, along with other generative AI tools and products. The complaint alleges that OpenAI violated copyright law as well as YouTube’s terms of service, which prohibit using the platform’s videos in applications independent of the service, and accuses the company of profiting significantly from creators’ work without their consent, credit, or compensation.
Generative AI models like OpenAI’s rely on vast amounts of training data to learn and generate text based on patterns. These models are typically trained on data from public websites and data sets available on the web. While companies argue that fair use protects their right to scrape data for commercial model training, many copyright holders disagree and have filed lawsuits to stop these practices.
Video transcriptions have become increasingly important training data for AI models as other sources dry up. According to data from Originality.AI, more than 35% of the world’s top 1,000 websites now block OpenAI’s web crawler. A separate study by MIT’s Data Provenance Initiative found that roughly 25% of data from the highest-quality sources has been restricted from the major data sets used to train AI models. If this trend continues, developers could run out of data to train generative AI models sometime between 2026 and 2032, according to the research group Epoch AI.
In April, it was reported that OpenAI had used Whisper, its speech recognition model, to transcribe audio from videos and harvest additional training data. An OpenAI team, which included the company’s president, reportedly transcribed more than a million hours of YouTube video with Whisper and fed the resulting transcripts into the training of GPT-4, OpenAI’s text-generating and -analyzing model. Despite some OpenAI staff raising concerns that the practice might violate YouTube’s rules, it continued.
Other companies, including Anthropic, Apple, Salesforce, and Nvidia, have also trained generative AI models on data sets containing subtitles from hundreds of thousands of YouTube videos. Many of the creators whose subtitles appeared in those data sets were unaware of, and never consented to, that use. Apple later clarified that it did not intend to use the resulting models for any AI features in its products.
Notably, even YouTube’s parent company, Google, has sought to use transcripts to train its AI models. Google updated its terms of service last year to permit broader use of user data for generative AI training, a change that allows it to draw on YouTube data when building products beyond the video platform itself.
OpenAI and Google have not yet commented on the class-action lawsuit. Beyond this legal challenge, OpenAI faces another suit filed by Tesla CEO Elon Musk, which accuses the company of abandoning its original nonprofit mission and reserving its most advanced technology for commercial customers. Musk made similar claims in a February lawsuit, but the more recent filing also accuses OpenAI of engaging in racketeering activity.
The class-action suit underscores the potential copyright violations and breaches of YouTube’s terms of service at stake when AI models are trained on video transcripts without consent or compensation. More broadly, these cases sharpen the ongoing debate over whether fair use covers scraping data for AI training, and raise questions about transparency and consent in the use of user-generated content. As demand for training data grows and access to key sources narrows, companies and researchers will have to navigate these legal and ethical constraints to keep their data practices lawful.