Metrics are not enough — you need behavioral tests for NLP

By Mateusz Bednarski, September 2022


Track systematic issues in your model with behavioral tests

One of the first things data science practitioners learn is the variety of evaluation metrics. Very early, you understand that accuracy cannot be used in every scenario, and that optimizing the wrong metric can do more harm than good. But is a good metric enough to know how a model behaves?

A canonical example of an incorrect metric. Image by the author.

Imagine the following scenario: your company receives thousands of job applications every month, and it is physically impossible to read every single one of them. Hiring a dozen HR employees just to screen them would also be too costly. So, you want to create an NLP model that screens CVs and discards those that obviously would not fit.

You collect a dataset (let’s assume you have access to historical applications and hiring decisions) and build the model. Evaluation on different metrics (precision, recall, etc.) shows decent performance.

Question 1: Do you see an issue with using the company’s internal data to build the model?

A few weeks after deployment, you notice something suspicious: the model strongly prefers candidates who graduated in computer science over those who graduated in computer engineering, even though both degrees are perfectly valid. You run a quick test: take a few random applications and replace computer engineering with computer science and vice versa. The results are clear: the same CV, with only computer engineering replaced by computer science, gets a much higher score. You have just discovered that your system has a systematic issue, and you start worrying: are there more of them?

The same job application, differing only in an equivalent education entry, gets a different verdict from the model. Image by the author.
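A quick check like this one can be scripted in a few lines. The sketch below is only an illustration: score_cv is a hypothetical stand-in for your model's scoring call, and the 0.05 margin is an arbitrary choice.

```python
# Minimal sketch of the "swap the degree name" check.
# score_cv is a hypothetical placeholder for the real model's scoring function.

def score_cv(text: str) -> float:
    """Return the model's score for a raw CV text. Replace with your model call."""
    raise NotImplementedError

def swap_degree(cv_text: str) -> str:
    """Swap 'computer engineering' and 'computer science' in a CV."""
    placeholder = "\u0000DEGREE\u0000"
    swapped = cv_text.replace("computer engineering", placeholder)
    swapped = swapped.replace("computer science", "computer engineering")
    return swapped.replace(placeholder, "computer science")

def degree_swap_failure_rate(cvs: list[str], margin: float = 0.05) -> float:
    """Fraction of CVs whose score moves by more than the margin after the swap."""
    failures = sum(
        abs(score_cv(cv) - score_cv(swap_degree(cv))) > margin for cv in cvs
    )
    return failures / len(cvs)
```

If the failure rate is far from zero, the model reacts to the degree name itself rather than to the candidate's actual qualifications.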

Just add more data, bro/sis.

The first idea for solving the problem is to add more data. But you do not have any. And even if you did, how would you know that the new data fixes the problem?

In reality, collecting and annotating new data can be very expensive or even impossible. As a data scientist, you must be aware of this and be able to propose other solutions.

We have just discovered a severe issue with the model. Before we plunge into fixing it, let's take a deep breath and think: what other problems might we have?

  1. Sex/ethnicity bias: does your model discriminate against men/women or against a specific nationality?
  2. Equivalent words/synonyms: if a candidate replaces “Good python knowledge” with “good python 3 knowledge”, how does your model react?
  3. Skill grading: does your model assign a higher score for “very good knowledge” than for “good knowledge” or “basic knowledge”? Are adjectives adequately understood? A candidate with “exceptional skill” should not be rated below one with “basic skill”.
  4. Sentence ordering: if we reverse the order of job experience entries, is the model prediction consistent?
  5. Typos: I’ve seen many models where a typo in a completely unimportant word changed the prediction entirely (a small probe for this is sketched after this list). We may argue that job applications should not contain typos, but we can all agree that, in general, this is an issue in NLP.
  6. Negations: I know they are difficult. But if your task requires understanding them, do you measure it? For example, “I have no criminal record” vs. “I have a criminal record”, or “I finished” vs. “I did not finish”. How about double negations?
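Here is the typo probe promised in point 5, in the same hedged style: the typo generator is deliberately naive, and score_cv is the same hypothetical scoring function used in the earlier sketch.

```python
import random

def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position: a crude typo generator."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def typo_failure_rate(cvs: list[str], margin: float = 0.05, seed: int = 0) -> float:
    """Fraction of CVs whose score moves by more than the margin after one typo."""
    rng = random.Random(seed)
    failures = sum(
        abs(score_cv(cv) - score_cv(add_typo(cv, rng))) > margin for cv in cvs
    )
    return failures / len(cvs)
```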

You may argue that these criteria are tough to satisfy. But on the other hand, imagine being rejected after an interview and receiving feedback such as “we expected your job history to be in reverse order”. I would be upset.

AI is going to have a bigger and bigger influence on our daily lives, and we, as practitioners, need to treat our job very seriously.

That’s why we need behavioral testing. Metrics show only part of model quality, and only as a highly aggregated number. They say nothing about the model's internal assumptions, biases, and reasoning. Even if you do cross-validation (which is often impractical for deep learning), you still do not have the whole picture of how reliable your model is. Behavioral testing is a systematic way to evaluate the factors that influence model predictions. Before going further, let's look at where the idea comes from.

The concept was introduced by Ribeiro et al. in the paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList”. The authors formalize our discussion about model capabilities and propose a systematic testing framework. Let's take a look at a selected subset of their ideas.

Methodology

The CheckList paper makes three main contributions:

  1. Capabilities: specific behaviors we might want to check (what to test)
  2. Test types: well-defined kinds of tests (how to test)
  3. Templates and abstractions: a Python package for performing behavioral testing

The nice thing is that we can use the methodology proposed by the authors without using their library (if, for any reason, we do not want to). In this article, I will focus on defining behavioral testing itself; how to use the CheckList library is a topic for the second part.

Today it is normal to take your dog to a licensed behaviorist. It is time to do the same with AI models. Photo by Destiny Wiens on Unsplash

Model capability

A capability is a specific aspect of model behavior that we want to probe, often one the model may be vulnerable to. Different capabilities matter for different tasks: the exact selection depends on the problem the model is solving, but some generic capabilities apply to most models.

  • Vocabulary/Taxonomy: how the model handles replacing words with synonyms/antonyms. In sentiment analysis, we expect that replacing “great” with “amazing” will not change the prediction considerably. On the other hand — replacing “great” with “terrible” should make an impact in a specific direction.
  • Robustness: adding small typos in minor words should not affect the prediction significantly. In our recruitment example, shuffling the order of previous workplaces should not have an effect either. This capability seems especially important when dealing with informal documents, where typos and irrelevant information are more common.
  • NER: switching a person’s name should not affect the prediction. For example, “Mike is a great waiter” vs. “Steve is a great waiter”: we do not want to prefer Mikes over Steves or the other way around. The same goes for locations. Consider “The ticket to New York was way too expensive” vs. “The ticket to Phoenix was way too expensive”.
  • Temporal: does the model understand a sequence of actions? Can it distinguish between the past and the future?
The model incorrectly assumes that the sequence of actions matches their order in the text. Image by the author.
  • Negations: This is a big one. “The food was bad” vs. “The food was not bad” is a simple case. But how about double negations? The truth value of p is equal to not(not(p)). Can the model handle that? It can be challenging even for humans!
  • Coreference: Does the model correctly link pronouns to the right subjects? “John is a plumber. Kate is an accountant. She earns $50 per hour. He makes $20 per hour.” Question: “How much does John earn?”
  • Logic: does the model understand basic logical rules? “An ape is a mammal. Mammals are animals. Is an ape an animal?”
  • There are some more but enough for now 🙂

Whoa, how can we even test all of these? They seem complicated! To the rescue: CheckList comes with predefined test types and a framework for building them. Let's take a closer look at the test types.

To help with systematic testing, the authors also propose three different types of tests. Each test type can be used with any capability (though, of course, some pairings are more suitable than others). Defining test types helps us decide how a capability should be tested; inadequate testing can do more harm than good!

Minimum functionality test (MFT)

MFTs can be compared to unit tests: a single, small piece of functionality exercised in an isolated environment. For behavioral testing, we can define an MFT as a small, synthetic, targeted test dataset. For each record, we know the expected answer, so we can easily compare it with the model's prediction.

Let’s see an example of vocabulary capability for sentiment analysis.

A simple MFT (vocab capability). Image by the author.

You can notice a few things. First, the test result is reported as the fraction of failed cases: the failure rate. In most realistic scenarios, we do not expect our models to be 100% correct. The amount of acceptable error can vary, so the framework does not force anything on us. Also, if we introduce behavioral testing for an existing model, there is a good chance that the initial results will be poor. In such a situation, we do not want to just report FAILED; if fixing the model takes time, it is better to track the failure rate over time.

Second, the test cases are generated solely for behavioral testing; they do not come from the train/val/test datasets. Of course, there can be some overlap by chance, but in principle, cases are created specifically for the MFT.

Third, there is a lot of redundancy: five out of six cases follow the template “Food was ____”. It may look like a lot of manual work, but the situation is not so bad: CheckList provides tooling for rapid case development without typing everything by hand.
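To make the idea concrete, here is a minimal MFT sketch in plain Python, written without the CheckList library (which we will cover in part two). predict_sentiment is a hypothetical model call returning “positive” or “negative”, and the adjective lists are just examples.

```python
# Minimal MFT sketch: generate cases from a template, compare predictions
# with expected labels, and report the failure rate.

def predict_sentiment(text: str) -> str:
    """Return 'positive' or 'negative'. Replace with your real model call."""
    raise NotImplementedError

POSITIVE_ADJS = ["good", "great", "amazing", "delicious"]
NEGATIVE_ADJS = ["bad", "terrible", "awful", "disgusting"]

# (text, expected label) pairs built from the "Food was ____" template.
CASES = [(f"Food was {adj}.", "positive") for adj in POSITIVE_ADJS] + [
    (f"Food was {adj}.", "negative") for adj in NEGATIVE_ADJS
]

def run_mft(cases=CASES):
    failures = [(text, exp) for text, exp in cases if predict_sentiment(text) != exp]
    print(f"Failure rate: {len(failures) / len(cases):.0%}")
    return failures
```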

Question 2:

What capabilities can be tested by an MFT? We already saw vocabulary. Can you think of more?

Invariance test (INV)

An Invariance test checks whether a modification introduced to a test case leaves the prediction unchanged. We start with a single input example (either synthetic or real) and introduce various perturbations. The simplest one might be inserting a random comma into the text; such a modification should not influence the model's prediction.

A lovely property of INV tests is that we do not need labeled data. We check model consistency and robustness: we do not care whether the prediction is actually correct, only whether it is stable. This is different from an MFT.

Let's see an example. We take a single sample (highlighted in blue) and compute a prediction. It is positive (once again, we do not care whether that is the true class). Now we introduce perturbations by replacing a country name with different values. It is clearly visible that the model changes its prediction for Korea and Iran, so it is not robust to the country of origin. We conclude that the model is biased against specific countries and, therefore, not fair.

Simple INV test. Replacing the country should not change the model prediction, because the sentiment should be carried by the phrase “is a classic”. Image by the author.
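A minimal INV sketch along these lines, again assuming the hypothetical predict_sentiment function from the MFT sketch, could look as follows: the country name is swapped and we count how often the predicted class flips.

```python
# Minimal INV sketch: perturb one example by swapping the country name and
# check that the predicted class stays the same.

TEMPLATE = "This movie from {country} is a classic."
COUNTRIES = ["France", "Brazil", "Korea", "Iran", "Canada", "Nigeria"]

def run_inv(template: str = TEMPLATE, countries=COUNTRIES):
    base = predict_sentiment(template.format(country=countries[0]))
    flips = [
        c for c in countries[1:]
        if predict_sentiment(template.format(country=c)) != base
    ]
    print(f"Failure rate: {len(flips) / (len(countries) - 1):.0%}, flips: {flips}")
    return flips
```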

Question 3:

What capabilities can be tested by an INV?

Directional Expectation test (DIR)

The last type of test is the Directional Expectation test. Here we expect the prediction to change in a specific direction. For example, if we append the phrase “I have seen better movies.” to a movie review, the predicted sentiment should not improve. Such tests are useful when we know that a specific modification should push the model in a particular direction.

Getting back to the job application example: let's consider a candidate with a 3/5 Python skill. The model gives them a score of 0.53. Suppose we replace the rating with 4/5; the prediction changes to 0.52. Technically it is lower, but the difference is so small that we may argue it is noise. That is why we define a margin: how much change we consider a real change. Let's say the margin is set to 0.1.

This is very similar to a delta parameter in early stopping.

Going further, replacing the rating with 5/5 changes the score to 0.67, so the model indeed prefers higher-skilled developers.

Setting the margin to 0.1 means we expect predictions to stay in the range [0.43, 1]. Image by the author.
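A hedged sketch of this DIR check is shown below. score_cv is the same hypothetical scoring function as before, and set_python_skill is an assumed helper that rewrites a “Python: x/5” rating inside the CV text; the expectation is that raising the rating never lowers the score by more than the margin.

```python
# Minimal DIR sketch: increasing the Python skill rating should not decrease
# the model's score by more than the margin.

import re

def set_python_skill(cv_text: str, rating: int) -> str:
    """Hypothetical helper that rewrites the 'Python: x/5' rating in a CV."""
    return re.sub(r"Python: \d/5", f"Python: {rating}/5", cv_text)

def run_dir(cv_text: str, margin: float = 0.1):
    failures = []
    for low, high in [(3, 4), (4, 5), (3, 5)]:
        score_low = score_cv(set_python_skill(cv_text, low))
        score_high = score_cv(set_python_skill(cv_text, high))
        if score_high < score_low - margin:  # wrong direction beyond the margin
            failures.append((low, high, score_low, score_high))
    return failures
```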

We have defined capabilities and different ways of testing them. As we already said, a specific capability can be tested with one or more test types, forming a capability-by-test-type matrix.

Question 4:

Try to define a test suite for the job application example. Consider which capabilities should be tested and which test types would be suitable. If you are not sure how to assign capabilities and test types, it is fine to just list specific topics for a start 😉 Also, think about acceptable error rates. You can start with the two items below; a small scaffold for organizing such a suite follows the list:

  • The model should not favor computer engineering over computer science (and vice versa). Capability: NER, Test types: INV
  • The model should reject all candidates with python skills 1/5 or 2/5. Capability: Vocabulary/Taxonomy, Test types: MFT, DIR
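One hedged way to keep such a suite organized is a plain list of test specifications, each with a capability, a test type, and an acceptable failure rate. The structure and thresholds below are illustrative assumptions, not part of CheckList.

```python
# Illustrative scaffold for a behavioral test suite; fields and thresholds
# are assumptions for this example, not part of any library.

from dataclasses import dataclass

@dataclass
class BehavioralTest:
    description: str
    capability: str       # e.g. "NER", "Vocabulary/Taxonomy", "Robustness"
    test_type: str        # "MFT", "INV", or "DIR"
    max_failure_rate: float

SUITE = [
    BehavioralTest("CE vs. CS degree should not change the score",
                   "NER", "INV", max_failure_rate=0.05),
    BehavioralTest("Candidates with Python 1/5 or 2/5 are rejected",
                   "Vocabulary/Taxonomy", "MFT", max_failure_rate=0.0),
    BehavioralTest("Typos in minor words do not flip the decision",
                   "Robustness", "INV", max_failure_rate=0.10),
]
```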

That’s all for today. In the next article, we will build a real test suite for a model using CheckList. Defining behavioral tests is very fun and satisfying and can significantly improve understanding of model performance. You can access the original paper here: https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf

Stay tuned!

