Most developers of artificial intelligence models are secretive about the sources of their training data. They need vast amounts of high-quality text to build AI models that mimic human speech and writing. Books, blogs, art, original research, and other creative works are used, often without the knowledge of their creators.
Proof News examined a training dataset of YouTube video subtitles that was publicly available — but not easy to access without technical expertise — to find out whose work had been used for AI training.
Our investigation found subtitles from 173,536 YouTube videos taken from more than 48,000 channels that were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce, according to their own research papers and posts.
Representatives from Anthropic and Salesforce confirmed use of a compilation of training datasets called the Pile, which includes YouTube Subtitles, and denied wrongdoing. A representative for Nvidia declined to comment. Apple, Databricks, and Bloomberg representatives did not respond to requests for comment.
The dataset contained subtitles, but it took some work to identify which videos they came from. Proof News took the video IDs from the dataset, queried YouTube’s publicly accessible developer tool, and obtained the metadata for each video, including the title, channel, and category.
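The lookup step described above can be sketched with the YouTube Data API v3 `videos.list` endpoint, which accepts batches of video IDs and returns each video's snippet (title, channel, category). This is an illustrative sketch, not Proof News' actual code; the function names, the sample response, and the placeholder API key are assumptions.

```python
# Hypothetical sketch: given video IDs from the dataset, fetch metadata
# via the YouTube Data API v3 "videos" endpoint. Field choices and
# function names are illustrative assumptions.
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/videos"

def build_request_url(video_ids, api_key):
    """Build a videos.list request for up to 50 IDs per call (the API's batch limit)."""
    params = {"part": "snippet", "id": ",".join(video_ids[:50]), "key": api_key}
    return f"{API_URL}?{urlencode(params)}"

def extract_metadata(api_response):
    """Pull title, channel, and category ID from a parsed videos.list JSON response."""
    return [
        {
            "video_id": item["id"],
            "title": item["snippet"]["title"],
            "channel": item["snippet"]["channelTitle"],
            "category_id": item["snippet"]["categoryId"],
        }
        for item in api_response.get("items", [])
    ]

# Example with a mocked API response (no network call made here):
sample = {
    "items": [
        {
            "id": "abc123",
            "snippet": {
                "title": "Example video",
                "channelTitle": "Example channel",
                "categoryId": "27",
            },
        }
    ]
}
print(extract_metadata(sample)[0]["channel"])  # → Example channel
```

In practice the dataset's 173,536 IDs would be chunked into batches of 50 and each response's `items` merged into one table of titles, channels, and categories.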
We built a tool so you can search the data for yourself. Be advised that the search tool will occasionally return false negatives for channels and videos that are in the dataset. Make sure to spell your channel or video title correctly.