In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

Technology@lemmy.world – 487 points – 9 months ago

...with the prevalence of clickbaity bottom-feeder news sites out there, i've learned to avoid TFAs and await user summaries instead...

(clicks through)

...yep, ~~seven~~ nine ads plus another pop-over, about 15% of window real estate dedicated to the actual story...

The issue is that LLMs often do just spit out, verbatim, things they plagiarized from other sources. The deeper issue is that even if/when they stop that from happening, the technology is clearly going to make most people agree our current copyright laws are insufficient for the times.

The model in question, plus all of the others I've tried, will not give you copyrighted material

That's one example, plus I'm talking generally about why this is an important question for a CEO to answer, and why people think LLMs in general may infringe on copyright and be bad for creative people.

I'm talking generally about why this is an important question for a CEO to answer ...

Right, and your only evidence for that is "LLMs do often just verbatim spit out things they plagiarized from other sources" and that they aren't trying to prevent this from happening.

Which is demonstrably false, and I'll demonstrate it with as many screenshots/examples as you want. You're just wrong about that (at least about GPT). You can also demonstrate it yourself, and if you can prove me wrong I'll eat my shoe.

https://archive.is/nrAjc

Yep here you go. It's currently a very famous lawsuit.

I already talked about that lawsuit here (with receipts), but the long and short of it is, it's flimsy. There are blatant lies, exactly half of their examples omit the lengths they went to for the output they allegedly got (or any screenshots as evidence it happened at all), and none of the output they allegedly got was behind a paywall.

Also, using their prompts word for word doesn't give the output they claim they got. Maybe it did in the past, idk, but I've never been able to do it for any copyrighted text personally, and they've shown that they're committed to not letting that stuff happen.

OK but this is why people give a shit when a CEO is cagey about how their magic box works