Can Weak Labeling Replace Human-Labeled Data? | by Walid Amamou | May, 2022



A step-by-step comparison between weak and full supervision

Photo by Scott Graham on Unsplash

In recent years, Natural Language Processing (NLP) has advanced significantly thanks to deep learning models. Real-world NLP applications, ranging from intelligent chatbots to automated data extraction from unstructured documents, are becoming more prevalent and bringing real business value to many companies. However, these models still require hand-labeled training data to fine-tune them to specific business use cases. Gathering this data can take many months, and labeling it even longer, especially when a domain expert is needed and multiple classes must be identified within the text. As you can imagine, this can become a real adoption barrier for many businesses, as subject matter experts are hard to find and expensive.

To address this problem, researchers have adopted weaker forms of supervision, such as heuristically generated labeling functions and external knowledge bases, to programmatically label the data. While this approach holds a lot of promise, its impact on model performance compared with full supervision remains unclear.

In this tutorial, we will generate two training datasets from job descriptions: one produced with weak labeling and a second produced by hand labeling using UBIAI. We will then compare model performance on an NER task that aims to extract skills, experience, diploma, and diploma major from job descriptions. The data and the notebook are available in my GitHub repo.

With weak supervision, the user defines a set of functions and rules that assign a noisy label (that is, a label that may not be correct) to unlabeled data. The labeling functions may take the form of patterns such as regular expressions, dictionaries, ontologies, pre-trained machine learning models, or crowd annotations.

Weak supervision pipelines have three components: (1) user-defined labeling functions and heuristics, (2) a statistical model that takes the labels produced by these functions as input and outputs probabilistic labels, and (3) a machine learning model that is trained on the probabilistic labels produced by the statistical model.

Image by Author

In this tutorial, we will perform the weak labeling using the Skweak library. Per the library creators [1]:

Skweak is a Python-based software toolkit that provides a concrete solution to this problem using weak supervision. skweak is built around a very simple idea: Instead of annotating texts by hand, we define a set of labelling functions to automatically label our documents, and then aggregate their results to obtain a labelled version of our corpus.

To learn more about the skweak library, please read the original paper, “skweak: Weak Supervision Made Easy for NLP”.

To perform the weak labeling, we will write a set of functions that encode dictionaries, patterns, knowledge bases, and rules related to the corpus we would like to label. In this tutorial, we will add functions that auto-label the entities SKILLS, EXPERIENCE, DIPLOMA, and DIPLOMA_MAJOR in job descriptions. After applying these functions to the unlabeled data, the results will be aggregated into a single, probabilistic annotation layer using a statistical model provided by the skweak library.

First, we create a dictionary of skills, Skills_Data.json, and use it in the labeling function lf3 to annotate the SKILLS entity. The dictionary was obtained from a publicly available dataset.

# Create the SKILLS labeling function from a skills dictionary (gazetteer)
import json
import re

import spacy
from skweak import heuristics, gazetteers, utils

# Build tries from the skills dictionary and wrap them in a gazetteer annotator
tries = gazetteers.extract_json_data('data/Skills_Data.json')
nlp = spacy.load('en_core_web_md', disable=['ner'])
lf3 = gazetteers.GazetteerAnnotator("SKILLS", tries)
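
As a quick sanity check, we can apply the gazetteer directly to a sample sentence; the matched spans are stored under doc.spans["SKILLS"]. This is a minimal sketch, and the actual output depends on which terms are present in Skills_Data.json:

# Hypothetical sanity check: results depend on the contents of Skills_Data.json
doc = nlp("We are looking for a developer with strong Python and SQL skills.")
doc = lf3(doc)
print(doc.spans["SKILLS"])   # e.g. [Python, SQL] if both terms are in the dictionary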

For the EXPERIENCE entity, we use a regex pattern to capture the number of years of experience:

# Create the EXPERIENCE labeling function (regex)
def experience_detector(doc):
    # Match patterns such as "5+ years"
    expression = r'[0-9][+] years'
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span:
            yield span.start, span.end, "EXPERIENCE"

lf1 = heuristics.FunctionAnnotator("experience", experience_detector)
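
For example, applying lf1 to a sentence that mentions "5+ years" should yield a single EXPERIENCE span; the spans are stored under the annotator name "experience" (a minimal check with a made-up sentence):

# Minimal check of the experience regex
doc = nlp("The ideal candidate has 5+ years of software engineering experience.")
doc = lf1(doc)
print(doc.spans["experience"])   # expected: [5+ years]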

For the DIPLOMA and DIPLOMA_MAJOR entities, we use a publicly available dataset from Kaggle together with regex patterns:

# Load the diploma and diploma-major dictionaries
with open('Diploma_Dic.json', 'r', encoding='UTF-8') as f:
    DIPLOMA = json.load(f)
print(len(DIPLOMA))

with open('Diploma_Major_Dic.json', encoding='UTF-8') as f:
    DIPLOMA_MAJOR = json.load(f)

# Create the DIPLOMA labeling function (dictionary lookup)
def Diploma_fun(doc):
    for key in DIPLOMA:
        for match in re.finditer(key, doc.text, re.IGNORECASE):
            start, end = match.span()
            span = doc.char_span(start, end)
            if span:
                yield span.start, span.end, "DIPLOMA"

lf4 = heuristics.FunctionAnnotator("Diploma", Diploma_fun)

# Create the DIPLOMA_MAJOR labeling function (dictionary lookup)
def Diploma_major_fun(doc):
    for key in DIPLOMA_MAJOR:
        for match in re.finditer(key, doc.text, re.IGNORECASE):
            start, end = match.span()
            span = doc.char_span(start, end)
            if span:
                yield span.start, span.end, "DIPLOMA_MAJOR"

lf2 = heuristics.FunctionAnnotator("Diploma_major", Diploma_major_fun)

# Create a second DIPLOMA_MAJOR labeling function (regex)
def diploma_major_detector(doc):
    expression = re.compile(r"(^.*(Ph\.D|MS|Master|BA|Bachelor|BS)\S*) in (\S*)")
    for match in re.finditer(expression, doc.text):
        # Group 3 captures the word following "in", e.g. "Statistics" in "MS in Statistics"
        start, end = match.span(3)
        span = doc.char_span(start, end)
        if span:
            yield span.start, span.end, "DIPLOMA_MAJOR"

lf5 = heuristics.FunctionAnnotator("Diploma_major", diploma_major_detector)
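
To see how the regex-based function behaves, we can run lf5 on its own on a sample requirement (the dictionary-based functions need the Kaggle files, so they are omitted here). Since group 3 of the pattern captures only the first word after "in", a phrase like "MS in Statistics" yields the span "Statistics":

# Quick check of the regex-based DIPLOMA_MAJOR annotator with a made-up sentence
doc = nlp("MS in Statistics or a related quantitative field is required.")
doc = lf5(doc)
print(doc.spans["Diploma_major"])   # expected: [Statistics]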

We now apply all the labeling functions and use skweak’s statistical model to aggregate their outputs, resolving disagreements, and auto-label the data.

# Build the annotated corpus used to train the model
docs = []
with open('Corpus.txt', 'r') as f:
    data = f.readlines()

for text in data:
    if len(text) != 1:  # skip empty lines
        doc = nlp(str(text))
        # Apply all five labeling functions to the document
        doc = lf1(lf2(lf3(lf4(lf5(doc)))))
        print(doc.spans)
        docs.append(doc)

# Aggregate the noisy labels with a hidden Markov model
from skweak import aggregation
model = aggregation.HMM("hmm", ["DIPLOMA", "DIPLOMA_MAJOR", "EXPERIENCE", "SKILLS"])
docs = model.fit_and_aggregate(docs)
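
Before training, it is worth inspecting what the aggregation produced; the HMM stores its aggregated spans under the "hmm" key of each document. A minimal sketch, assuming the corpus was processed as above:

# Print the aggregated entities of the first document in the corpus
for span in docs[0].spans["hmm"]:
    print(span.text, span.label_)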

We are finally ready to train the model! We chose to train a spaCy model since it integrates easily with the skweak library, but we could of course use any other model, such as a transformer. The annotated datasets are available in the GitHub repo.

# Copy the aggregated spans into doc.ents and export the corpus for spaCy training
for doc in docs:
    doc.ents = doc.spans["hmm"]
utils.docbin_writer(docs, "train.spacy")
!python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev train.spacy
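
Once training finishes, the command above writes its checkpoints under ./output. As a minimal sketch (assuming the default output layout of spacy train, with a hypothetical sample text), the best checkpoint can be loaded and applied to a new job description like this:

import spacy

# Load the best model saved by "spacy train" (default layout: output/model-best)
nlp_trained = spacy.load("output/model-best")

text = ("We are hiring a data scientist with 5+ years of experience, "
        "an MS in Statistics, and strong Python and SQL skills.")
doc = nlp_trained(text)
for ent in doc.ents:
    print(ent.text, ent.label_)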

We are now ready to run the training on both datasets, the fully hand-labeled one and the weakly labeled one, each containing the same number of documents:

Hand-labeled dataset model performance:

================================== Results ==================================

TOK      100.00
NER P     74.27
NER R     80.10
NER F     77.08
SPEED      4506


=============================== NER (per type) ===============================

                    P       R       F
DIPLOMA         85.71   66.67   75.00
DIPLOMA_MAJOR   33.33   16.67   22.22
EXPERIENCE      81.82   81.82   81.82
SKILLS          74.05   83.03   78.29

Weakly-labeled dataset model performance:

================================== Results ==================================

TOK      100.00
NER P     31.78
NER R     17.80
NER F     22.82
SPEED      2711


=============================== NER (per type) ===============================

                    P       R       F
DIPLOMA         33.33   22.22   26.67
DIPLOMA_MAJOR   14.29   50.00   22.22
EXPERIENCE     100.00   27.27   42.86
SKILLS          33.77   15.76   21.49

Interestingly, the model trained on the hand-labeled dataset outperforms the weakly supervised one by a wide margin: an overall F-score of 0.77 for full supervision versus 0.22 for weak supervision. Digging deeper, the performance gap also holds at the entity level (with the exception of the EXPERIENCE entity, where the weakly supervised model reaches higher precision but much lower recall).

By adding more labeling functions, such as crowd annotations, model-based labeling, rules, and dictionaries, we would expect the model performance to improve, but it is unclear whether it would ever match that of data labeled by subject matter experts. Moreover, figuring out the correct auto-labeling functions is an iterative and ad hoc process. This issue is exacerbated when dealing with highly technical datasets such as medical notes, legal documents, or scientific articles, where labeling functions can fail to properly capture the domain knowledge that users want to encode.

In this tutorial, we demonstrated a step-by-step comparison of models trained on weakly labeled data and on hand-labeled data. We have shown that, in this specific use case, the performance of the model trained on the weakly labeled dataset is significantly lower than that of the fully supervised approach. This certainly does not mean weak labeling is not useful: we can use weak labeling to pre-annotate a dataset and bootstrap a labeling project, but we cannot rely on it for fully unsupervised labeling.

Follow us on Twitter @UBIAI5 or subscribe here!

References:

  1. https://github.com/NorskRegnesentral/skweak



