
The Key to Creating a High-Quality Labeled Data Set
by Leah Berg and Ray McLendon | November 2022



How to provide the best experience for the people annotating your data


Almost every data science course begins with a conveniently labeled data set (I’m looking at you, Titanic and Iris). However, the real world isn’t so convenient. As a Data Scientist, at some point in your career, you’ll likely be asked to create a machine learning model from an unlabeled data set. Sure, you can use unsupervised learning techniques, but oftentimes, the most powerful models are created from a data set labeled by subject matter experts.

You may be lucky enough to outsource your labeling to a system like Amazon Mechanical Turk, but if you’re dealing with highly sensitive data, you may need to find in-house annotators. Unfortunately, it can be difficult to convince subject matter experts to manually label data for a project. In this article, I’ll provide several suggestions to help you secure and retain valuable annotators for your next data science project.

The most foundational step in securing annotators for a project is discussing its desirability with them. For a deep dive on determining desirability, check out my three-part series on machine learning proof of concepts here. You can gauge desirability in a number of ways, but one of my favorites is building a demo. Even if the demo doesn’t use machine learning, sharing it will tell you whether people actually want the product. In fact, your annotators are sometimes the very people who would benefit from your solution, and it may take “seeing” its value for them to prioritize time for labeling data.

I ran into this exact situation on a recent project — my annotators were actually the target audience for the product. I created a five-minute video that showcased a simple demo of my rules-based model (no machine learning required) and explained how the efforts of my potential annotators would eventually eliminate a task that everyone dreaded.

Prior to the demo, I struggled to get a single person to prioritize time for labeling, and after the demo, I had 20 people volunteering to annotate data. This was a huge deal! The video was a great success, and it made its way around the organization quickly without me having to give multiple presentations.

Find what motivates your annotators, and you’ll have a much easier time convincing them to contribute to your product.


Once you’ve convinced your annotators to label data for you, it’s critical that you are as efficient with their time as possible. Don’t waste their time labeling unimportant data. If you randomly sample data for your annotators to label, they may end up labeling too many similar items, which can lead to over-representation in your data set. Not only is this bad for your model, but your annotators may also get annoyed and/or fatigued.


There are various techniques for combating this, but my favorite is clustering. After you cluster your data, you can randomly sample a fixed number of data points from each cluster to help ensure diversity in the data your annotators label.
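
Here is a minimal sketch of that cluster-then-sample step in Python, assuming your unlabeled data has already been converted into numeric feature vectors. The cluster count, per-cluster quota, and function name are illustrative choices, not a prescription.

```python
import pandas as pd
from sklearn.cluster import KMeans


def sample_for_annotation(features: pd.DataFrame,
                          n_clusters: int = 10,
                          samples_per_cluster: int = 25,
                          seed: int = 42) -> pd.DataFrame:
    """Cluster the unlabeled data, then draw a fixed number of points per cluster."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(features)

    batches = []
    for cluster_id in range(n_clusters):
        members = features[cluster_ids == cluster_id]
        # If a cluster has fewer points than the quota, take everything it has.
        n = min(samples_per_cluster, len(members))
        batches.append(members.sample(n=n, random_state=seed))

    # The combined sample is the batch that goes to your annotators.
    return pd.concat(batches)
```

Drawing the same quota from every cluster keeps any single region of the data from dominating the batch your annotators see.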

To take this one step further, ask your annotators to track the average amount of time it takes them to annotate a single data point. Once my annotators tell me how much time they can dedicate to labeling, I work backward from that budget to the number of samples I can pull from each cluster.
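
As a rough illustration of that back-of-the-envelope calculation (all of the numbers below are made up):

```python
def quota_per_cluster(minutes_available: float,
                      minutes_per_item: float,
                      n_clusters: int) -> int:
    """Split an annotator's time budget evenly across clusters."""
    total_items = int(minutes_available // minutes_per_item)
    return max(total_items // n_clusters, 1)


# Example: 3 hours of annotation time at ~2 minutes per item across 10 clusters
# gives 90 items in total, or 9 items sampled from each cluster.
print(quota_per_cluster(minutes_available=180, minutes_per_item=2, n_clusters=10))
```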

Another way to be efficient with your annotators’ time is to ensure you have high-quality annotation guidelines in place. This is vital to providing a smooth and seamless experience for your annotators.

In my experience, high-quality annotation guidelines include the following key sections:

  • Definitions of labels — Spend time defining and documenting each label your annotators will be using. These definitions are critical to making sure your annotators have a common understanding of the labeling task.
  • Difficult examples — As your annotators begin labeling, they will encounter some tricky labeling situations. Document these in your annotation guidelines and ensure you include an explanation of how the annotators came to their decision.
  • How to use the annotation interface — Don’t assume it’s obvious how to interact with your interface and provide annotations. Document everything from accepting a task in the system to correcting an annotation after it’s already been submitted.

Annotation guidelines should be a living document and continuously updated. Additionally, they can serve as an excellent way to train new annotators.

Before you document your annotation interface in your guidelines, put some serious time and thought into its details. A poor annotation interface creates a frustrating experience for your annotators and ultimately leads to fewer and lower-quality labels. Set aside ample time to test what it’s actually like to label data with your system. If you find it painful, you can guarantee your annotators will have a similar experience.

In some industries, it’s beneficial to build your own annotation system. This was the case at my first job where I worked with large time series data sets.


Thankfully, there are multiple annotation systems out there today, so you may not need to build a custom system. But don’t just focus on what the system will provide you as a data scientist. Think deeply about the experience your annotators will have with the system since they’ll be working in that system the most.

If possible, include your annotators in the selection of the system. Being inclusive will lead to a better experience for all.

As you begin receiving annotations, you may identify gaps in the data your model trains on. To help with this issue, you can generate synthetic data by taking a small set of annotations and expanding them to help your model grasp various relationships in the data set. Synthetic data will allow you to generate more examples for your model while utilizing the labels your annotators have already provided.

This can be as simple as perturbing the data or as sophisticated as using Generative Adversarial Networks (GANs). Now I’ve never gone the GAN route, but I have played around with word2vec for text data. You can use it to generate “synonyms” that you substitute in your text. Another technique is to use translation models to translate text into another language, and then translate it back into the original language. Oftentimes, the resulting text is slightly different from the original.
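
Here is a rough sketch of the word2vec-style substitution, using a pretrained embedding from gensim. The model name, substitution rate, and example sentence are assumptions for illustration only, and you would still want a human to sanity-check the output.

```python
import random

import gensim.downloader as api

# Any pretrained KeyedVectors model works here; this one downloads on first use.
word_vectors = api.load("glove-wiki-gigaword-100")


def augment_text(text: str, substitution_rate: float = 0.2, seed: int = 0) -> str:
    """Replace a fraction of tokens with their nearest neighbor in embedding space."""
    rng = random.Random(seed)
    augmented = []
    for token in text.split():
        if token.lower() in word_vectors and rng.random() < substitution_rate:
            neighbor, _ = word_vectors.most_similar(token.lower(), topn=1)[0]
            augmented.append(neighbor)
        else:
            augmented.append(token)
    return " ".join(augmented)


# The label from the original example carries over to the perturbed copy.
print(augment_text("the shipment arrived late and the customer was upset"))
```

Back-translation works the same way conceptually: the original label is reused for the round-tripped text.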

The world of possibilities in expanding your data set without having to spend more annotator time is worth exploring.

So how do you keep your annotators motivated throughout the annotation process? I’m no master of this, but one of the key mistakes I made early on was not sharing my model’s performance results with my annotators. I would get new labels, build a new model, and get so excited to see my model incrementally improve. It didn’t even cross my mind that my annotators might be just as excited. I’ve since learned that you should show your annotators how their work has directly impacted and improved the performance of your model.

When annotators feel more connected to the work that they’re doing, they’ll be more motivated and provide better-quality annotations.

Let’s be real: labeling data may not be the highlight of your annotators’ day. To make the process more enjoyable for everyone involved, consider having a Labeling Party. Literally bring in some pizza and drinks, like you would if friends were helping you move.


Throw the annotation system up on the screen and label some examples together. Talk through the nuances of difficult examples to help transfer knowledge from your subject matter experts to your data scientists. The insights you gather may even help you build the next great feature for your model.

And don’t forget to update your annotation guidelines with your findings!

Behind every labeled data set are multiple annotators who spent countless hours labeling data. To keep annotators engaged, think deeply about what will motivate them and help them see the value before, during, and after the labeling process. Once they commit to labeling data, respect the effort that goes into it. An efficient annotation system and high-quality guidelines can help with this. Don’t forget to sprinkle in some fun along the way!

If you enjoyed this article and want to learn more about how to implement the concepts discussed, check out my workshop here.

Lean Machine Learning: Running Your Proof of Concept the Lean Startup Way (Part 1), https://medium.com/towards-data-science/lean-machine-learning-running-your-proof-of-concept-the-lean-startup-way-part-1-f9dbebb74d63

R. Monarch, Human-in-the-Loop Machine Learning (2021), https://www.manning.com/books/human-in-the-loop-machine-learning

