OpenAI transcribed over a million hours of YouTube videos to train GPT-4

Michael Ten @lemmy.world to Technology@lemmy.ml – 33 points –
theverge.com

Wouldn't the YouTube algorithm add an unintentional bias to the training data?

A lot of YouTubers talk about how they're having to adjust their content and style to maintain viewership numbers. Hence all the clickbait thumbnails & captions.

Probably, but that assumes that the transcribers went from video to video following the algorithm. I'd suspect that they would randomize the videos they chose somehow, or figure out some other distribution.

But that is just a guess; you could be right.