What is a well known 'public secret' in the industry you work in that the majority of outsiders are unaware of?

NotSpez@lemm.ee to Ask Lemmy@lemmy.world – 629 points

Technically not my industry anymore, but: companies that sell human-generated AI training data to other companies are most often selling data that a) isn't 100% human-generated or b) was generated, to save money, by a group of people pretending to belong to a different demographic.

To give an example, let's say a company wants a training set of 50,000 text utterances of US English for chatbot training. More often than not, this data will be generated using contract workers in a non-US locale who have been told to try to sound as American as possible. The Philippines is a common choice at the moment, where workers are often paid between $1 and $2 an hour: more than an order of magnitude less than what it would generally cost to use real US English speakers.

In the last year or so, it's also become common to generate all of the utterances using a language model, like ChatGPT. Then, you use the same worker pool to perform a post-edit task (look at what ChatGPT came up with, edit it if it's weird, and then approve it). This reduces the time that the worker needs to spend on the project while also ensuring that each datapoint has "seen a set of eyes".

Obviously, this makes for bad training data -- for one, workers from the wrong locale will not be generating the locale-specific nuance that this kind of training data is supposed to capture. It's much worse when it's actually generated by ChatGPT, since it ends up being a kind of AI feedback loop. But every company I've worked for in that space has done it, and most of them would not be profitable at all if they actually produced the product as intended. The clients know this -- which is perhaps why it ends up being this strange facade of "yep, US English wink wink" on every project.

A couple decades ago I worked for a speech recognition company that developed tools for the telephony industry. Every week or two all the employees would be handed sheets of words or phrases with instructions to call a specific telephone extension and read them off. That’s how they collected training data…

I'm not surprised tbh. I've perused some of the text training datasets, and they were pretty bad. The classification is dodgy too. I ended up starting my own dataset because of this.