
Testing the Consistency of Reported Machine Learning Performance Scores by the mlscorecheck Package



AI (Dall-E) generated depiction of the topic

A small step towards the big leap of reproducible machine learning science

In this post, we explore how the Python package mlscorecheck can be used for testing the consistency between reported machine learning performance scores and the accompanying descriptions of experimental setups.

Disclaimer: the author of this post is the author of the mlscorecheck package.

What is the consistency testing of performance scores?

Assume you come across accuracy (0.8464), sensitivity (0.8100) and F1 (0.4894) scores reported for a binary classification problem on a testset consisting of 100 positive and 1000 negative samples. Can you trust these scores? How can you check whether they could truly be the outcome of the claimed experiment? This is where the mlscorecheck package can help you, by providing such consistency testing capabilities. In this particular example, one can run

from mlscorecheck.check.binary import check_1_testset_no_kfold

result = check_1_testset_no_kfold(
    testset={'p': 100, 'n': 1000},
    scores={'acc': 0.8464, 'sens': 0.81, 'f1': 0.4894},
    eps=1e-4
)
result['inconsistency']
# False

and the 'inconsistency' flag of the result being False indicates that the scores could have been yielded by the experiment. (Which is true, since the scores correspond to 81 true positive and 850 true negative samples.) What if the accuracy score 0.8474 was reported due to an accidental typo?

result = check_1_testset_no_kfold(
    testset={'p': 100, 'n': 1000},
    scores={'acc': 0.8474, 'sens': 0.81, 'f1': 0.4894},
    eps=1e-4
)
result['inconsistency']
# True

Testing the adjusted setup, the result signals inconsistency: the scores could not be the outcome of the experiment. Either the scores or the assumed experimental setup is incorrect.
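To see why, consider a minimal hand check in plain Python (a sketch independent of the package): with p = 100 and n = 1000, sens = 0.81 pins the number of true positives to 81; acc = 0.8464 then implies 850 true negatives and reproduces f1 ≈ 0.4894, whereas acc = 0.8474 would force 851 true negatives and f1 ≈ 0.4909, much further than eps from the reported 0.4894.

p, n = 100, 1000

def implied_f1(acc, sens):
    """Derive the F1-score implied by the reported acc and sens on a p/n testset."""
    tp = round(sens * p)               # sens = tp / p
    tn = round(acc * (p + n)) - tp     # acc = (tp + tn) / (p + n)
    fp, fn = n - tn, p - tp
    return 2 * tp / (2 * tp + fp + fn)

print(round(implied_f1(0.8464, 0.81), 4))  # 0.4894 -- matches the reported F1
print(round(implied_f1(0.8474, 0.81), 4))  # 0.4909 -- contradicts the reported F1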

In the rest of the post, we take a closer look at the main features and use cases of the mlscorecheck package.

Introduction

In both research and applications, supervised learning approaches are routinely ranked by performance scores calculated in some experiment (binary classification, multiclass classification, regression). Due to typos in publications, improperly used statistics, data leakage and the cosmetic adjustment of results, in many cases the reported performance scores are unreliable. Beyond contributing to the reproducibility crisis in machine learning and artificial intelligence, the effect of unrealistically high performance scores is usually further amplified by publication bias, eventually skewing entire fields of research.

The goal of the mlscorecheck package is to provide numerical techniques to test if a set of reported performance scores could be the outcome of an assumed experimental setup.

The operation of consistency tests

The idea behind consistency testing is that in a given experimental setup, performance scores cannot take any values independently:

  • For example, if there are 100 positive samples in a binary classification testset, the sensitivity score can only take the values 0.0, 0.01, 0.02, …, 1.0, but it cannot be 0.8543.
  • When multiple performance scores are reported, they need to be consistent with each other. For example, accuracy is the weighted average of sensitivity and specificity; hence, in a binary classification problem with a testset of 100 positive and 100 negative samples, the scores acc = 0.96, sens = 0.91, spec = 0.97 cannot be yielded together (see the worked check after this list).
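As a quick worked check of the second point (a minimal sketch in plain Python, independent of the package): with equally many positives and negatives, the accuracy must equal the plain average of sensitivity and specificity, which rules out the triplet above.

p, n = 100, 100
sens, spec, acc = 0.91, 0.97, 0.96

# accuracy is the prevalence-weighted average of sensitivity and specificity
acc_implied = (p * sens + n * spec) / (p + n)
print(acc_implied)                     # 0.94
print(abs(acc - acc_implied) <= 1e-4)  # False -> the reported triplet is infeasible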

In more complex experimental setups (involving k-fold cross-validation, the aggregation of the scores across multiple folds/datasets, etc.), the constraints become more advanced, but they still exist. The mlscorecheck package implements numerical tests to check if the scores assumed to be yielded from an experiment satisfy the corresponding constraints.

The tests are numerical: inconsistencies are identified conclusively, with certainty. Drawing an analogy with statistical hypothesis testing, the null hypothesis is that there are no inconsistencies; whenever an inconsistency is identified, it provides evidence against the null hypothesis, but, the test being numerical, this evidence is indisputable.

Various experimental setups impose various constraints on the performance scores, and these call for dedicated solutions. The tests implemented in the package are based on three principles: exhaustive enumeration expedited by interval computing; linear integer programming; and analytical relations between the scores. The sensitivity of the tests depends highly on the experimental setup and the numerical uncertainty: large datasets, large numerical uncertainty and a small number of reported scores all reduce the ability of the tests to recognize deviations from the assumed evaluation protocol. Nevertheless, as we will see later on, the tests are still applicable in many real-life scenarios. For further details on the mathematical background of the tests, refer to the preprint and the documentation.

Use cases

Now, we explore some examples illustrating the use of the package, but first, we discuss the general requirements of testing and some terms used to describe the experiments.

The requirements

Consistency testing has three requirements:

  • the collection of reported performance scores;
  • the estimated numerical uncertainty of the scores (when the scores are truncated to 4 decimal places, one can assume that the real values lie within 0.0001 of the reported values, and this is the numerical uncertainty of the scores) — this is usually the eps parameter of the tests, which can simply be inferred by inspecting the reported scores (see the sketch after this list);
  • the details of the experiment (the statistics of the dataset(s) involved, the cross-validation scheme, the mode of aggregation).
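For illustration, a small helper along the following lines (a hypothetical utility, not part of the package) could infer eps from the number of decimal places the scores are reported with:

def infer_eps(*reported_scores: str) -> float:
    """Infer the numerical uncertainty from the decimal places of the reported scores.

    The scores are passed as strings (e.g. '0.9494') so that trailing zeros are
    not lost; one unit of the least precise score is a conservative uncertainty.
    """
    decimals = min(len(s.split('.')[1]) if '.' in s else 0 for s in reported_scores)
    return 10.0 ** (-decimals)

print(infer_eps('0.9494', '0.8523', '0.9765'))  # 0.0001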

Glossary

The terms used in the specifications of the experiments:

  • mean of scores (MoS): the scores are calculated for each fold/dataset, and then averaged to gain the reported ones;
  • score of means (SoM): the fold/dataset level raw figures (e.g. confusion matrices) are averaged first, and the scores are calculated from the averaged figures (the difference between MoS and SoM is illustrated in the sketch after this list);
  • micro-average: the evaluation of a multiclass problem is carried out by measuring the performance on each class against all others (as a binary classification), and the class-level results are aggregated in the score of means fashion;
  • macro-average: the same as the micro-average, but the class-level scores are aggregated in the mean of scores fashion;
  • fold configuration: when k-fold cross-validation is used, the tests usually rely on linear integer programming, and knowing the number of samples of each class in each fold can be exploited in the formulation of the linear program. These fold-level class sample counts are referred to as the fold configuration.
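To make the first two terms concrete, the following minimal sketch (plain Python, with made-up fold-level confusion counts) contrasts the mean of scores and score of means aggregations for sensitivity:

# hypothetical true positive / false negative counts of two folds
folds = [
    {'tp': 10, 'fn': 2},   # fold 1: sens = 10/12
    {'tp': 50, 'fn': 30},  # fold 2: sens = 50/80
]

# mean of scores (MoS): compute the score on each fold, then average the scores
mos = sum(f['tp'] / (f['tp'] + f['fn']) for f in folds) / len(folds)

# score of means (SoM): pool the raw counts first, then compute a single score
tp = sum(f['tp'] for f in folds)
fn = sum(f['fn'] for f in folds)
som = tp / (tp + fn)

print(round(mos, 4), round(som, 4))  # 0.7292 0.6522 -- generally not equal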

Binary classification

At the beginning of the post, we already illustrated the use of the package for testing binary classification scores calculated on a single testset. Now, we look into some more advanced examples.

In addition to the two examples we investigate in detail, the package supports altogether 10 experimental setups for binary classification, the list of which can be found in the documentation with further examples in the sample notebooks.

N testsets, score-of-means aggregation

In this example, we assume that there are N testsets, k-folding is not involved, but the scores are aggregated in the score-of-means fashion, that is, the raw true positive and true negative figures are determined for each testset and the performance scores are calculated from the total (or average) number of true positive and true negative figures. The available scores are assumed to be the accuracy, negative predictive value and the F1-score.

For example, in practice, the evaluation of an image segmentation technique on N test images stored in one tensor usually leads to this scenario.

The design of the package is such that the details of the experimental setup are encoded in the names of the test functions, guiding the user to take all available details of the experiment into account when choosing the suitable test. In this case, the suitable test is the function check_n_testsets_som_no_kfold in the mlscorecheck.check.binary module, the token 'som' referring to the mode of aggregation (score of means):

from mlscorecheck.check.binary import check_n_testsets_som_no_kfold

scores = {'acc': 0.4719, 'npv': 0.6253, 'f1': 0.3091}
testsets = [
    {'p': 405, 'n': 223},
    {'p': 3, 'n': 422},
    {'p': 109, 'n': 404}
]

result = check_n_testsets_som_no_kfold(
    testsets=testsets,
    scores=scores,
    eps=1e-4
)
result['inconsistency']
# False

The result indicates that the scores could be the outcome of the experiment. No wonder: the scores were prepared by sampling true positive and true negative counts for the testsets and calculating the scores from them in the specified manner. However, if one of the scores is slightly changed, for example F1 is modified to 0.3191, the configuration becomes inconsistent:

scores['f1'] = 0.3191

result = check_n_testsets_som_no_kfold(
    testsets=testsets,
    scores=scores,
    eps=1e-4
)
result['inconsistency']
# True

Further details of the analysis, for example the evidence for feasibility, can be extracted from the dictionaries returned by the test functions. For the structure of the outputs, again, see the documentation.

1 dataset, k-fold cross-validation, mean of scores aggregation

In this example, we assume that there is a dataset on which a binary classifier is evaluated in a stratified repeated k-fold cross-validation manner (2 folds, 3 repetitions), and the mean of the scores yielded on the folds is reported.

This experimental setup is possibly the most commonly used in supervised machine learning.

We highlight the distinction between knowing and not knowing the fold configuration. Typically, MoS tests rely on linear integer programming, and the fold configuration is required to formulate the linear integer program. The fold configuration might be specified by listing the statistics of the folds, or one can refer to a folding strategy that leads to deterministic fold statistics, such as stratification (see the sketch below). Later on, we show that testing can also be carried out without knowing the fold configuration; however, in that case all possible fold configurations are tested, which might lead to enormous computational demands.
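For instance, assuming that the 'stratified_sklearn' folding strategy mirrors scikit-learn's StratifiedKFold (an assumption made for this sketch), the deterministic fold statistics of the dataset used below (21 positives, 500 negatives, 2 folds) can be reproduced as follows:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# the dataset of the example below: 21 positive and 500 negative samples
y = np.array([1] * 21 + [0] * 500)
X = np.zeros((len(y), 1))  # the features are irrelevant for the fold statistics

for fold, (_, test_idx) in enumerate(StratifiedKFold(n_splits=2).split(X, y)):
    p = int(y[test_idx].sum())
    print(f"fold {fold}: p={p}, n={len(test_idx) - p}")
# fold 0: p=11, n=250
# fold 1: p=10, n=250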

Again, the first step is to select the suitable test to be used. In this case, the correct test is the check_1_dataset_known_folds_mos function, where the token mos refers to the mode of aggregation, and known_folds indicates that the fold configuration is known (due to stratification). The test is executed as follows:

from mlscorecheck.check.binary import check_1_dataset_known_folds_mos

scores = {'acc': 0.7811, 'sens': 0.5848, 'spec': 0.7893}
dataset = {'p': 21, 'n': 500}
folding = {
    'n_folds': 2,
    'n_repeats': 3,
    'strategy': 'stratified_sklearn'
}

result = check_1_dataset_known_folds_mos(
    dataset=dataset,
    folding=folding,
    scores=scores,
    eps=1e-4
)
result['inconsistency']
# False

Similarly to the previous examples, there is no inconsistency, since the performance scores are prepared to constitute a consistent configuration. However, if one of the scores is slightly changed, the test detects the inconsistency:

scores['acc'] = 0.79

result = check_1_dataset_known_folds_mos(
    dataset=dataset,
    folding=folding,
    scores=scores,
    eps=1e-4,
    verbosity=0
)
result['inconsistency']
# True

In the previous examples, we supposed that the fold configuration is known. However, in many cases the exact fold configuration is not known and stratification is not specified. In these cases, one can rely on tests that systematically check all possible fold configurations, as shown in the example below (using the original, consistent scores again). This time, the suitable test has the 'unknown_folds' token in its name, indicating that all potential fold configurations are to be tested:

from mlscorecheck.check.binary import check_1_dataset_unknown_folds_mos

# restoring the original, consistent scores
scores = {'acc': 0.7811, 'sens': 0.5848, 'spec': 0.7893}
folding = {'n_folds': 2, 'n_repeats': 3}

result = check_1_dataset_unknown_folds_mos(
    dataset=dataset,
    folding=folding,
    scores=scores,
    eps=1e-4,
    verbosity=0
)
result['inconsistency']
# False

As before, the test correctly identifies that there is no inconsistency: while evaluating all possible fold configurations, it reaches the actual stratified configuration, which shows consistency, and with this evidence it stops testing the remaining ones.

In practice, prior to launching a test with unknown folds, it is advisable to estimate the number of possible fold configurations to be tested:

from mlscorecheck.check.binary import estimate_n_evaluations

estimate_n_evaluations(
    dataset=dataset,
    folding=folding,
    available_scores=['acc', 'sens', 'spec']
)
# 4096

In the worst case, solving 4096 small linear integer programming problems is still feasible on regular computing equipment; however, with larger datasets the number of potential fold configurations can quickly become intractable.

Multiclass classification

Testing multiclass classification scenarios is analogous to the binary case; therefore, we do not go into as much detail as before.

From the 6 experimental setups supported by the package we picked a commonly used one for illustration: we assume there is a multiclass dataset (4 classes), and repeated stratified k-fold cross-validation was carried out with 4 folds and 2 repetitions. We also know that the scores were aggregated in the macro-average fashion, that is, in each fold, the performance on each class was evaluated against all other classes in a binary classification manner, and the scores were averaged across the classes and then across the folds.
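To make the macro-average aggregation concrete, the following minimal sketch (plain Python with NumPy, using a made-up 4-class confusion matrix for a single fold) computes the class-level one-vs-rest scores and averages them across the classes; in the full experiment, these fold-level macro scores are then averaged across the folds:

import numpy as np

# hypothetical confusion matrix of one fold: conf[i, j] = true class i predicted as class j
conf = np.array([
    [30,  3,  2,  2],
    [ 4, 20,  3,  3],
    [ 2,  2, 15,  2],
    [ 3,  2,  2, 31],
])

sens_per_class, spec_per_class = [], []
for c in range(conf.shape[0]):
    tp = conf[c, c]
    fn = conf[c, :].sum() - tp   # class-c samples predicted as some other class
    fp = conf[:, c].sum() - tp   # other-class samples predicted as class c
    tn = conf.sum() - tp - fn - fp
    sens_per_class.append(tp / (tp + fn))
    spec_per_class.append(tn / (tn + fp))

# macro-average: the mean of the class-level (one-vs-rest) scores
print(round(np.mean(sens_per_class), 4), round(np.mean(spec_per_class), 4))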

Again, the first step is choosing the suitable test function, which in this case is check_1_dataset_known_folds_mos_macro from the mlscorecheck.check.multiclass module. Again, the tokens 'mos' and 'macro' in the name of the test refer to the aggregations used in the experiment.

from mlscorecheck.check.multiclass import check_1_dataset_known_folds_mos_macro

scores = {'acc': 0.626, 'sens': 0.2483, 'spec': 0.7509}
dataset = {0: 149, 1: 118, 2: 83, 3: 154}
folding = {
    'n_folds': 4,
    'n_repeats': 2,
    'strategy': 'stratified_sklearn'
}

result = check_1_dataset_known_folds_mos_macro(
    dataset=dataset,
    folding=folding,
    scores=scores,
    eps=1e-4,
    verbosity=0
)
result['inconsistency']
# False

Similarly to the previous cases, with the hand-crafted set of consistent scores the test detects no inconsistency. However, a small change, for example modifying the accuracy to 0.656, renders the configuration infeasible.

Regression

The last supervised learning task supported by the mlscorecheck package is regression. Testing regression problems is the most difficult, since the predictions on the testset can take any values; consequently, any score values could be yielded by an experiment. The only thing regression tests can rely on is the mathematical relations between the currently supported scores: mean absolute error (mae), mean squared error (mse) and r-squared (r2).

In the following example, we assume that the mae and r2 scores are reported for a testset, and we know its main statistics (the number of samples and the variance). Then, the consistency test can be executed as follows:

from mlscorecheck.check.regression import check_1_testset_no_kfold

var = 0.0831
n_samples = 100
scores = {'mae': 0.0254, 'r2': 0.9897}

result = check_1_testset_no_kfold(
    var=var,
    n_samples=n_samples,
    scores=scores,
    eps=1e-4
)
result['inconsistency']
# False

Again, the test correctly shows that there is no inconsistency (the scores were produced by an actual evaluation). However, if the r2 score is slightly changed, for example to 0.9997, the configuration becomes infeasible.
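One way to see why (a minimal sketch, assuming r2 is computed as 1 - mse/var, with var being the population variance of the testset targets): the implied mse follows from r2 and var, and since the mean absolute error can never exceed the root mean squared error, r2 = 0.9997 leaves no room for mae = 0.0254.

import math

var, mae = 0.0831, 0.0254

for r2 in (0.9897, 0.9997):
    mse = (1 - r2) * var   # from r2 = 1 - mse / var
    rmse = math.sqrt(mse)
    # by Jensen's inequality, mae <= rmse must hold for any set of residuals
    print(f"r2={r2}: rmse={rmse:.4f}, mae <= rmse? {mae <= rmse}")
# r2=0.9897: rmse=0.0293, mae <= rmse? True
# r2=0.9997: rmse=0.0050, mae <= rmse? False -> infeasible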

Test bundles

To make the consistency testing of scores reported for popular, widely researched problems more accessible, the mlscorecheck package includes the specifications of numerous experimental setups that are considered standard for certain problems.

Retinal vessel segmentation on the DRIVE dataset

In the field of retinal image analysis, there is an ambiguity in the evaluation of various segmentation techniques: authors have the freedom to account for pixels outside the circular field of view area, and this choice is rarely indicated in publications. This ambiguity can result in the ranking of algorithms based on incomparable performance scores. The functionalities implemented in the mlscorecheck package are suitable for identifying whether the authors used pixels outside the field of view for evaluation.

One of the most widely researched problems is the segmentation of vessels based on the DRIVE dataset. To spare the user the cumbersome task of looking up the statistics of the images and constructing the experimental setups, the package contains the statistics of the dataset and provides two high-level functions to test the ambiguity of image-level and aggregated scores. For example, having a triplet of image-level accuracy, sensitivity and specificity scores for the test image '03' of the DRIVE dataset, one can use the package as follows:

from mlscorecheck.check.bundles.retina import check_drive_vessel_image

scores = {'acc': 0.9323, 'sens': 0.5677, 'spec': 0.9944}

result = check_drive_vessel_image(
    scores=scores,
    eps=10**(-4),
    image_identifier='03',
    annotator=1
)
result['inconsistency']
# {'inconsistency_fov': False, 'inconsistency_all': True}

The result indicates that the scores for this image must have been obtained by using only the field of view (fov) pixels for evaluation: the scores are not inconsistent with this hypothesis, but they are inconsistent with the alternative hypothesis of using all pixels for evaluation.

Further test bundles

Beyond retinal vessel segmentation, the package provides test bundles for a number of further popular research problems and corresponding publicly available datasets; the full list can be found in the documentation.

Call for contribution

Experts from any field are welcome to submit further test bundles to facilitate the validation of machine learning performance scores in various areas of research!

Conclusions

Currently, the meta-analysis of machine learning research offers few techniques beyond the thorough assessment of papers and attempts to re-implement proposed methods in order to validate claimed results. The functionalities provided by the mlscorecheck package enable a more concise, numerical approach to the meta-analysis of machine learning research, contributing to maintaining the integrity of the affected research fields.

Further reading

For further information, we recommend checking the documentation of the package and the preprint describing the mathematical background of the tests.

