OpenAI Trained GPT-4 on More Than a Million Hours of Transcribed YouTube Videos

The scarcity of high-quality training data has become a major challenge for AI companies. In a recent report, The Wall Street Journal highlighted this issue, and The New York Times explored some of the ways companies have been dealing with it. However, many of these approaches seem to be in a gray area of AI copyright law.

One example mentioned in The New York Times article is OpenAI, which faced a shortage of training data and developed its Whisper audio transcription model to overcome it. The company transcribed over a million hours of YouTube videos to train its GPT-4 language model. According to the Times, OpenAI knew the approach was legally ambiguous but believed it fell under fair use, and OpenAI President Greg Brockman was personally involved in collecting the videos used for training. OpenAI spokesperson Lindsay Held said that the company curates unique datasets for each of its models from various sources, including publicly available data and partnerships for non-public data, and that it is exploring the generation of synthetic data.

Google faced similar challenges in acquiring training data for its AI models. The New York Times reported that Google also used transcripts from YouTube. Google said that it trained its models on YouTube content in accordance with its agreements with YouTube creators. Google also said that its terms of use prohibit unauthorized scraping or downloading of YouTube content, and that it takes technical and legal measures to prevent such use when there is a clear basis for doing so. Separately, new policy language from Google's legal department enabled the company to expand its use of consumer data, such as data from office tools like Google Docs.

Meta, formerly known as Facebook, also encountered difficulties in obtaining training data. While trying to catch up with OpenAI, the company's AI team discussed the unauthorized use of copyrighted works. They also considered options like paying for book licenses or even acquiring a large publisher to access its content. Meta's use of consumer data was further limited by privacy-focused changes it made following the Cambridge Analytica scandal.

The scarcity of training data is a pressing issue for AI companies because their models rely heavily on data to improve their performance. The Journal suggested that AI companies' consumption of data may outpace the creation of new content by 2028, posing a significant challenge for the future of AI development. Potential solutions include training models on synthetic data generated by their own models, or curriculum learning, in which models are fed high-quality data in a structured order to help them make smarter connections between concepts. Neither approach has yet been proven effective.

In the absence of a robust alternative, AI companies are left with the option of using whatever data they can find, with or without permission. This approach carries significant legal risk, as evidenced by the numerous copyright infringement lawsuits filed against AI companies in recent years.

AI companies must navigate the complex landscape of AI copyright law and ethics to continue advancing their technology. Efficient and legal acquisition of training data is crucial for the development of AI models, and companies need to prioritize responsible data usage and compliance with copyright regulations. As the AI industry evolves, stakeholders must work together to establish clear guidelines and frameworks to address the challenges related to training data acquisition, copyright infringement, and the ethical use of AI. Collaborative efforts between AI companies, content creators, legal experts, and policymakers are necessary to strike a balance between innovation and the protection of intellectual property rights.
