Techno Blender
Digitally Yours.

Google’s AI is Not a Pro in Data Labeling! But the Company Fails to Admit It


One of the biggest problems facing the AI industry: rubbish, exploitative data-labeling practices

A study published by Surge AI highlights one of the biggest problems facing the AI industry: rubbish, exploitative data-labeling practices. Google built a dataset called “GoEmotions”, the largest manually annotated dataset of its kind: 58k English Reddit comments, each labeled with one or more of 27 emotion categories or Neutral. The GoEmotions dataset consists of comments from Reddit users annotated with labels describing their emotional coloring. According to Surge AI, a whopping 30% of the dataset is severely mislabeled.

According to Google: “In ‘GoEmotions: A Dataset of Fine-Grained Emotions’, we describe GoEmotions, a human-annotated dataset of 58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories. As the largest fully annotated English-language fine-grained emotion dataset to date, we designed the GoEmotions taxonomy with both psychology and data applicability in mind.” The dataset is designed to train neural networks to perform deep analysis of the tonality of texts.
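For readers who want to see what “labeled for 27 emotion categories or Neutral” looks like in practice, here is a minimal sketch of the dataset’s multi-label row format. The three sample rows are invented for illustration; the 28-entry label list reflects the published GoEmotions taxonomy, but verify it against the official release before relying on it.

```python
from collections import Counter

# The 27 GoEmotions emotion categories plus "neutral" (28 labels total).
GOEMOTIONS_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

# Hypothetical rows in the dataset's format: each Reddit comment carries
# one or more numeric label ids pointing into the taxonomy above.
sample_rows = [
    {"text": "That looks amazing, congrats!", "labels": [0]},      # admiration
    {"text": "lol this sub never disappoints", "labels": [1]},     # amusement
    {"text": "I guess that's one way to do it.", "labels": [27]},  # neutral
]

def label_names(row):
    """Map a row's numeric label ids back to readable label names."""
    return [GOEMOTIONS_LABELS[i] for i in row["labels"]]

counts = Counter(name for row in sample_rows for name in label_names(row))
print(counts)
```

Because each comment can carry several labels at once, any audit of the dataset has to judge every assigned label, not just one per comment.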

Google’s data-labeling practices:

Surge AI took a look at a sample of 1,000 labeled comments from the GoEmotions dataset and found that a significant portion of them were mislabeled. This kind of data can’t be properly labeled. It’s not that the particular labelers didn’t do a good job, it’s that they were given an impossible task. This particular kind of AI development is a grift. It’s a scam. And it’s one of the oldest in the book.
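The 30% figure comes from exactly this kind of audit: re-label a random sample by hand and extrapolate to the full dataset. A back-of-the-envelope sketch of that extrapolation, with an illustrative mislabel count rather than Surge AI’s raw numbers:

```python
import math

def mislabel_estimate(sample_size, mislabeled, z=1.96):
    """Estimate a dataset-wide mislabel rate from an audited random sample,
    with a normal-approximation 95% confidence interval."""
    p = mislabeled / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    return p, (p - half_width, p + half_width)

# Illustrative numbers: 308 of 1,000 audited comments judged mislabeled.
rate, (low, high) = mislabel_estimate(1000, 308)
print(f"{rate:.1%} mislabeled, 95% CI ({low:.1%}, {high:.1%})")
```

Even with a sample of only 1,000 out of 58k comments, the interval stays within a few percentage points of the point estimate, which is why a sample audit can credibly indict the whole dataset.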

Google used data labelers unfamiliar with US English and US culture, despite Reddit being a US-centric site with particularly specialized memes and jargon. In Surge AI’s words: “When we relabeled the dataset, our technical infrastructure and human-AI algorithms allowed us to leverage our labeling marketplace to build a team of Surgers who aren’t only native US English speakers, but also heavy Reddit and social media users who understand all of Reddit’s in-jokes and the nuances of US politics.”

The researchers took an impossible problem, how to determine human emotion in text at massive scale without context, and used the magic of bullshit to turn it into a relatively simple one that any AI can tackle: how to match keywords to labels. The reason it’s a grift is that you don’t need AI to match keywords to labels.
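That point can be made concrete in a few lines: keyword-to-label matching is plain dictionary lookup, no neural network required. The keyword lists below are invented for illustration, not taken from any real emotion lexicon:

```python
import re

# A trivial keyword-to-label matcher: the kind of mapping the article argues
# a context-free, mislabeled dataset reduces emotion "AI" to.
KEYWORD_LABELS = {
    "gratitude": {"thanks", "thank", "grateful"},
    "anger": {"furious", "hate", "angry"},
    "joy": {"happy", "glad", "yay"},
}

def match_labels(text):
    """Return every label whose keyword set intersects the text's words."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(label for label, kws in KEYWORD_LABELS.items() if words & kws)

print(match_labels("Thanks, I'm so happy right now"))  # ['gratitude', 'joy']
```

This matcher is exactly as blind to sarcasm, in-jokes, and cultural context as the article says keyword matching must be, which is the problem: “Thanks a lot” from an annoyed Redditor still matches “gratitude”.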

If the AI’s output can be used to affect human outcomes, like surfacing every resume in a stack that registers “positive sentiment”, we have to assume that some of the documents it didn’t surface were unjustly suppressed. It is our position here at Neural that it is altogether unethical to train an AI on human-made content without the expressed individual consent of the people who made it. Furthermore, it is likewise our position that it is unethical to deploy AI models trained on such data.

Google’s scientists know that a conventional “keyword search and match” algorithm can’t turn an AI model into a human-level expert in psychology, sociology, pop culture, and semantics, especially when they feed it a dataset full of haphazardly mislabeled Reddit posts. No amount of talent and technology can turn a bag full of baloney into a useful AI model when human outcomes are at stake.






Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
