Techno Blender

The Open Loop of ML — Part 2. Why model accuracy is a deceptive metric | by Devesh Rajadhyax | Jun, 2022



Why model accuracy is a deceptive metric

The first part of this series was about the psychology of ML developers. The ‘accuracy’ metric associated with model development gives model builders a sense of mental closure. However, accuracy is an outcome of the model training process, and it has little bearing on the usefulness of the model in the real world. I have proposed that this artificial closure is a major reason for the discrepancy between the number of prototypes and the number of actually working ML systems.

After the first part was published, I received a lot of feedback from my friends who are experienced and knowledgeable in building industry-grade solutions. Some of those suggestions were extremely pertinent. I can divide them into five general categories:

  1. The relevance of model accuracy can be improved by selecting the training dataset carefully. The data should contain enough volume and variety to reflect the real world. Biases in the data should also be identified and corrected.
  2. Any system requires multiple iterations to become useful in the real world. ML systems can also improve once they are put in production through incremental changes.
  3. In order to build more robust systems, the design should be broken into many segments. Only the segments that strictly require uncertain inference should be built using models.
  4. We should redefine the criteria for closure, just like we do with any normal software project. We should incentivise the right outcome.
  5. The usefulness of an ML system also depends on the type of use. While an ‘assist’ requirement is relatively easy to meet, a ‘replace’ requirement is hard to satisfy.

I was really happy to get these suggestions. The response told me that the people who struggle with real problems are able to connect with the Open Loop of ML, and they are already thinking about solutions. Encouraged by this, I have decided to present the remainder of my thinking in two parts — Part 2 (this one) and Part 3 (next one).

  • In this part, I will use the first two suggestions. I will focus on model accuracy and why it does not reflect the correct picture. I will also talk about the challenges in implementing suggestions 1 and 2.
  • In the next part, I will use suggestions 3, 4 and 5. I will suggest a measure that makes ML systems more useful and that can serve as a closure criterion.

Let me start with the accuracy metric then. Unlike the previous part, which was entirely practical and based on my experience, this part contains some theoretical material. The material is a result of my experience and thinking, partly spurred by the book I am writing on the mathematics of intelligent machines.

A model is a method to make a guess. A better model makes a better guess. The guessing process is made up of three steps:

  1. Observe (past data collection)
  2. Detect pattern (model identification and training)
  3. Use pattern (inference)

Patterns can be of many types. Here, we will talk about one particular pattern — the relation pattern. I have selected this pattern for a reason: most of the popular models (regression, neural networks, Bayesian) use it. The ML algorithms based on the relation pattern are usually called ‘parametric methods’. A relation pattern means that a certain relation (or function) exists between the quantities of interest. For example, the two quantities repo rate and stock market index are connected by a relation pattern.

When we are trying to solve a guessing problem, we meet three different functions:

  • The Real World Function (RWF): This is the actual relationship that exists in the real world. An example is the relation between the number of vaccinations and the spread of an infectious disease. No one actually knows this function; if we knew it, we wouldn’t take the trouble of training a model.
  • The Observed Function (OF): This is the output of the Observe step of our ML efforts. We create data in the form of a record of the input and output variables. The data itself is the OF. This function is in ‘mapping’ form, which means that you only see pairs (or tuples) of numbers, not an explicit formula.
  • The Model Function (MF): This is our attempt to guess the RWF. It is an actual mathematical function. Though in some cases (such as neural networks) it is impractical to write the exact function down, one certainly exists. While training the model, we use the Observed Function and ML algorithms to guess the best Model Function.
Image by Rajashree Rajadhyax
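To make the three functions concrete, here is a minimal sketch in Python with NumPy. The specific functions and numbers are my own illustrative assumptions, not from this article: the RWF is a hidden curve we pretend not to know, the OF is a finite set of noisy observations, and the MF is a polynomial fitted to those observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# RWF: the hidden real-world relationship (assumed here only for the demo).
def rwf(x):
    return np.sin(x)

# OF: a finite record of (input, output) pairs -- just numbers, no formula.
x_obs = rng.uniform(0, 3, size=30)             # a sample of inputs from the domain
y_obs = rwf(x_obs) + rng.normal(0, 0.05, 30)   # outputs, with measurement noise

# MF: our guess at the RWF, here a cubic polynomial fitted to the OF.
mf = np.polynomial.Polynomial.fit(x_obs, y_obs, deg=3)

# The MF is judged against the OF during training...
print("mean error vs OF :", np.abs(mf(x_obs) - y_obs).mean())
# ...but what we actually care about is how well it tracks the RWF.
x_all = np.linspace(0, 3, 100)
print("mean error vs RWF:", np.abs(mf(x_all) - rwf(x_all)).mean())
```

Note that the second comparison is only possible here because we invented the RWF ourselves; in a real project it is exactly the quantity we cannot compute.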

Now you can easily understand why model accuracy is so ineffective:

Model accuracy indicates how close the MF is to OF. What makes the model useful is how close the MF is to RWF.

A little thought about the above will make you realize the prerequisite for usefulness:

The OF should be close to RWF.

Consider the nature of these two functions. The Real World Function is hidden and unknown. The Observed Function is the record you have kept of the real-world phenomenon. In fact, the RWF manifests itself through observations. To make this abstract point easier, we will take an example, one that is all too familiar to readers of ML literature — identifying a cat.

The real-world phenomenon here is that a picture of a cat contains certain unique shapes and a distinguishing arrangement of those shapes. The pictures of cats that you collect are the observed data. A neural network can learn the relation between these shapes/arrangements and the picture being that of a cat. This then becomes the learnt model function. In this example, pay attention to the nature of the RWF and the OF. The real-world phenomenon gives rise to the characteristics of a cat picture. What is the count of such pictures? Practically infinite. Every cat in the world, in each of its poses, in every surrounding and lighting, gives rise to a new picture. There is no way the OF is going to contain all these pictures. Therefore:

The OF will always be a subset of all manifestations of RWF.

Armed with this background, we can list the challenges of bringing the OF close to the RWF:

  • Lack of knowledge: Since the RWF is unknown, we do not actually know how much data, and which varieties of it, we have to collect.
  • Exponential nature of effort: The effort to collect the initial volume of data is reasonable. As we go after more volume and variety, the effort increases exponentially:
Fig 1: Why collecting more data is harder

There is one more challenge that I will describe a little later.

This discussion should be enough to highlight the deceptive nature of model accuracy. If the accuracy is 90%, there is a distance equivalent to 10% between the MF and the OF. But the distance between the RWF and the OF can be substantial, and therefore the usefulness of the model in the real world is simply not known at this point.
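This gap can be seen numerically with a toy sketch (again, the RWF and the sampling scheme are assumptions made up for the demo): a classifier is fitted on observations that happen to avoid the region near the true decision boundary. Its accuracy against the OF is perfect, while its accuracy against the RWF is noticeably lower.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_THRESHOLD = 1.3  # the RWF: label = 1 iff x > 1.3 (hidden in real life)

def rwf(x):
    return (x > TRUE_THRESHOLD).astype(int)

# OF: the collected observations happen to avoid the region near the
# true boundary -- a coverage gap we are not even aware of.
x_obs = np.concatenate([rng.uniform(0.0, 0.5, 50), rng.uniform(1.5, 2.0, 50)])
y_obs = rwf(x_obs)

# MF: a simple midpoint-threshold classifier learned from the OF.
learned_t = (x_obs[y_obs == 0].max() + x_obs[y_obs == 1].min()) / 2

def mf(x):
    return (x > learned_t).astype(int)

# Accuracy measured against the OF looks perfect...
print("accuracy vs OF :", (mf(x_obs) == y_obs).mean())
# ...but on inputs drawn from the full real-world range it is lower,
# because the learned threshold sits well away from the true one.
x_real = rng.uniform(0.0, 2.0, 10_000)
print("accuracy vs RWF:", (mf(x_real) == rwf(x_real)).mean())
```

The model never saw a single contradicting example, so no amount of held-out testing on the OF could have revealed the problem.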

How can we measure the usefulness of a model in the real world? I will talk about this in the next part, but one point should be clear:

‘The usefulness of a model in the real world can be measured only by putting it in the real world.’

It means that unless the model operates in an actual situation for a sufficient time, its usefulness will not be apparent. But there is a big hurdle in putting the model in the real world — the cost of errors!

A model can make many types of errors. Take, for instance, a model for the identification of cancer. It can make two types of errors — False Positives (FP) and False Negatives (FN). The cost can be very different for the two types. In the case of cancer, the cost of an FN can be huge, as it means missing an existing medical condition. Now consider the following points about putting the model in the real world:
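A rough sketch of why this asymmetry matters (the error counts and cost figures below are made-up illustrations, not clinical numbers): two models with identical accuracy can carry very different total error costs once an FN is priced higher than an FP.

```python
# Two hypothetical cancer-screening models, each evaluated on the same
# 1000 cases. The (fp, fn) counts and the unit costs are illustrative
# assumptions only.
COST_FP = 1    # e.g. an unnecessary follow-up test
COST_FN = 100  # e.g. a missed cancer

models = {
    "model_A": {"fp": 40, "fn": 10},  # 50 errors total -> 95% accuracy
    "model_B": {"fp": 10, "fn": 40},  # 50 errors total -> 95% accuracy
}

for name, errs in models.items():
    accuracy = 1 - (errs["fp"] + errs["fn"]) / 1000
    total_cost = errs["fp"] * COST_FP + errs["fn"] * COST_FN
    print(f"{name}: accuracy={accuracy:.1%}, total error cost={total_cost}")
```

Both models report 95% accuracy, yet model_B is roughly four times as costly to deploy, which is precisely the information the accuracy metric hides.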

Because of the RWF-OF gap, the model will make some errors.

The cost of some of the errors can be large.

This difficulty stands in the way of putting models into production. The system then gets stuck in a loop (call it the Cost of Errors Loop):

RWF-OF gap -> Errors -> Reluctance to put in production -> Unable to fill RWF-OF gap

The loop prevents incremental improvement of ML systems.

Our original problem was the discrepancy between experimental and production ML systems. In this part, we discussed why the model accuracy metric is so deceptive. While it indicates how close the model is to the available data, it says nothing about the proximity of the model to the real-world phenomenon. Worse, the Cost of Errors loop discourages putting the model into production and hampers further data collection.

We now know that we have to find answers to two questions:

How can we break the Cost of Errors loop described above?

What measure can we suggest for the usefulness of the model in the real world?

I will try to suggest answers to these two questions in the next part. I am hoping to receive some good feedback on this discussion. It will be really kind of you if you put your thoughts in the comments so that I can refer to them easily.

Previous: The Open Loop of ML — Part 1

