OpenAI transcribed over a million hours of YouTube videos to train GPT-4

Michael Ten @lemmy.world to Technology@lemmy.world – 152 points –
theverge.com


I fucking hate how this company is just taking data and metrics without any permission or repercussions. OpenAI and Sam Altman can fuck right off. Same with Microsoft and Copilot and every other company rushing into the AI/ML arms race. It's disgusting and irresponsible.

We joke about Skynet and Terminators and whatnot, but the reality is OpenAI is legitimately moving towards that end with no safety precautions and no thought put into the economic and humanitarian impacts they're going to cause. Capitalism in general (and yes, I'm going to be that guy and say it) simply cannot survive the AI/ML age of humanity without evolving.

Going to start keeping score. Marking you down in the "AI is going to be amazingly powerful" camp.

How clueless are you? Everything "taken" was available for free, provided for any web crawler to consume, and now you're acting like consuming it is a crime?

I get that you're really jealous because you didn't think of LLMs, but you don't get to claim something is a crime in one specific instance just because you don't like what they're doing after their program consumes content.

Google has done the same thing for years and no one said a peep. What does everyone think search results even are?

You completely miss my point. Are you saying data such as copyrighted published works and medical records are free? Because I did not in any way consent to sharing medical records with OpenAI: https://www.businessinsider.com/openai-chatgpt-generative-ai-stole-personal-data-lawsuit-children-medical-2023-6?op=1

Now I realize this is an alleged offense, but it's still fucked up. As for wanting to be the first to make an LLM, I have no desire to take on that amount of responsibility and liability. Sam Altman is chasing money and nothing more.

There's a distinct difference between quotation and plagiarism. A search engine does the former, LLMs do the latter.

No. If you write a truly unique combination of words then an LLM will be very unlikely to reproduce them.

An LLM is only likely to plagiarise you if your writing is similar to others.

[citation needed]

The differences between human and machine-generated text overlap support the image of LLMs as more "arrangers" than "creators" of text.

So plagiarism...

It only plagiarises you if you write something similar to lots of other people.

Write something original and, even if it is in their training dataset, LLMs are highly unlikely to reproduce it.