5 typical beginner mistakes in Machine Learning
by Agnis Liukis, July 2022


If you are about to build your first model, be aware and avoid these

Data Science and Machine Learning are becoming more and more popular, and the number of people entering the field grows every day. That means many data scientists are building their first Machine Learning models without much prior experience, and this is exactly where mistakes are likely to occur.

Photo by Kind and Curious on Unsplash

Let’s go through some of the most frequent beginner mistakes in Machine Learning solutions, so that you are aware of them and can avoid them.

1. Not using data normalization where it is needed

It is easy to take the features, throw them into the model, and expect it to produce predictions. But in some cases the results of this simple approach are disappointing, because an important step is missing.

Some types of models require data normalization. These include Linear Regression, classical Neural Networks, and a few others. Such models compute predictions from feature values multiplied by trained weights, and with non-normalized features the range of possible values for one feature can be very different from the range of another.

Suppose one feature takes values in the range [0, 0.001] and another in the range [100000, 200000]. For the model to treat both features as equally important, the weight of the first feature would have to be about 100,000,000 times larger than the weight of the second. Such huge weights can cause serious problems for the model, for example when an outlier value appears. They also make it very hard to judge feature importance: a large weight might mean the feature matters, or it might simply mean the feature has small values.

After normalization, all features fall into the same value range, typically [0, 1] or [-1, 1]. The weights then end up on similar scales and correspond much more closely to the real importance of each feature.

Overall, applying data normalization where it is needed results in better and more accurate predictions.
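As a minimal sketch of this step (the article itself shows no code here; the data is synthetic and the feature ranges simply mirror the example above), scikit-learn's MinMaxScaler brings both features onto the same [0, 1] scale before fitting a linear model:

# Minimal sketch, assuming scikit-learn; synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 0.001, size=1000),         # tiny-range feature
    rng.uniform(100_000, 200_000, size=1000), # huge-range feature
])
y = 1000 * X[:, 0] + X[:, 1] / 100_000 + rng.normal(scale=0.1, size=1000)

X_scaled = MinMaxScaler().fit_transform(X)   # every feature now lies in [0, 1]
model = LinearRegression().fit(X_scaled, y)
print(model.coef_)                           # weights end up on comparable scales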

2. Thinking that more features are better

One might think that throwing in all the possible features is a good idea, expecting the model to select and use only the best ones automatically.

In practice, this is rarely true. In most cases, a model with carefully engineered and selected features will significantly outperform a similar model with ten times as many features.

The more features a model has, the greater the risk of overfitting. Models can find an apparent signal even in completely random data, sometimes weaker, sometimes stronger. Of course, there is no real signal in random noise, but with enough noisy columns there is a good chance the model will latch onto some of them. When that happens, prediction quality drops, because the predictions are now partly based on random noise.

Various feature selection techniques can help in such situations; describing them is out of the scope of this article. The main thing to remember is this: for every feature you keep, you should be able to explain why you expect it to help your model.
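As a quick illustration (synthetic data and an arbitrary model choice, not taken from the original article), adding dozens of pure-noise columns to a small dataset typically lowers the cross-validated score:

# Illustrative sketch: pure-noise columns usually hurt out-of-sample accuracy.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
X_useful = rng.normal(size=(n, 3))                         # 3 informative features
y = X_useful @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)
X_noisy = np.hstack([X_useful, rng.normal(size=(n, 50))])  # plus 50 noise columns

model = RandomForestRegressor(n_estimators=100, random_state=0)
print(cross_val_score(model, X_useful, y, cv=5).mean())  # typically higher R^2
print(cross_val_score(model, X_noisy, y, cv=5).mean())   # typically lower R^2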

3. Using Tree-based models where extrapolation is required

Tree-based models are easy to use and powerful at the same time, which is the reason for their popularity. However, in some cases using them is a mistake.

Tree models can’t extrapolate. They will never predict a value larger than the largest target seen in the training data, and they will never predict one smaller than the smallest training target either.

Photo by Gilly Stewart on Unsplash

But in some tasks the ability to extrapolate is very important. For example, if the model predicts stock prices, it is entirely possible that future prices will climb higher than ever before. A tree-based model is not directly usable in this case, because its predictions will be capped at roughly the highest price seen in the historical data.

There are several ways around this. One option is to predict the change or the difference instead of the value itself. Another is to use a different model type for such tasks: Linear Regression and Neural Networks are capable of extrapolating.
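Here is a small sketch of the limitation (synthetic trend data and scikit-learn models, not code from the original article):

# Sketch: a decision tree's predictions are capped at the training target range,
# while a linear model follows the trend beyond it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()           # simple upward trend, max target = 198

X_future = np.array([[150.0], [200.0]])   # beyond anything seen in training

tree = DecisionTreeRegressor().fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

print(tree.predict(X_future))    # stays around 198, the largest training target
print(linear.predict(X_future))  # continues the trend: roughly 300 and 400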

4. Using data normalization where it is not needed

Yes, I just argued for having data normalization in place. But it is not always necessary.

Tree-based models don’t need data normalization: raw feature values are not multiplied by weights, only compared against split thresholds, so neither the feature scale nor outliers affect them.

Neural Networks may also not need explicit normalization, for example when the network already contains a layer that handles normalization internally (e.g. the BatchNormalization layer in Keras).

And in some cases even Linear Regression might not need normalization: when all the features are already in similar value ranges and carry the same meaning. For example, when the model is applied to time-series data and all the features are historical values of the same quantity.

In practice, applying unneeded normalization won’t necessarily hurt the model; in most cases the results will be very similar to those without it. However, an extra, unnecessary data transformation complicates the solution and increases the risk of introducing bugs.
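For instance, here is a quick check (synthetic data, scikit-learn; my own illustration) that scaling changes nothing for a tree model, precisely because trees only compare feature values against split thresholds:

# Sketch: a decision tree produces the same predictions on raw and scaled
# features, because monotonic rescaling preserves every split.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 100_000, size=(300, 4))
y = 0.001 * X[:, 0] + rng.normal(size=300)

pred_raw = DecisionTreeRegressor(random_state=0).fit(X, y).predict(X)

X_scaled = StandardScaler().fit_transform(X)
pred_scaled = DecisionTreeRegressor(random_state=0).fit(X_scaled, y).predict(X_scaled)

print(np.allclose(pred_raw, pred_scaled))  # True: identical predictions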

5. Leaking information between training and validation/test sets

Creating a data leak is easier than one might think. Consider the following code snippet:

Example features with data leak
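The original snippet is not reproduced above, but a hypothetical reconstruction of such leaky feature engineering might look like the following (the column names and aggregations are my assumptions, not the article's exact code):

# Hypothetical sketch: sum_feature and diff_feature aggregate over ALL rows
# before the train/test split, so training rows end up carrying information
# derived from test rows.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "group":  ["a", "a", "b", "b", "b", "c"],
    "value":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target": [0, 1, 0, 1, 1, 0],
})

# Leak: aggregates computed on the full dataset, before splitting.
df["sum_feature"]  = df.groupby("group")["value"].transform("sum")
df["diff_feature"] = df["value"] - df.groupby("group")["value"].transform("mean")

train_df, test_df = train_test_split(df, test_size=0.33, random_state=0)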

In reality, both features (sum_feature and diff_feature) are computed incorrectly. They leak information: after the train/test split, the training part still contains information derived from the test rows. This inflates the validation score, but on real data the model’s performance will be worse.

The correct way is to make the train/test split first and only then apply the feature generation function. Processing the train and test sets separately is a good general pattern for feature engineering.

There are cases where some information must be shared between the two, for example when the test set should be transformed with the same StandardScaler that was fitted on the training set. But such cases are the exception, and each one needs to be considered and validated individually.
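A minimal sketch of that fit-on-train, transform-both pattern (my own example, using scikit-learn):

# Sketch: the scaler's statistics come from the training rows only, and the
# already-fitted scaler is then reused on the test rows.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler().fit(X_train)     # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the same fitted scaler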

Final words

It is good to learn from your own mistakes, but it is even better to learn from someone else’s. Hopefully these examples will help you do that. Thanks for reading!

Follow me if you are interested in reading my future articles about Data Science, Machine Learning, and Python Programming.

