
Top 3 Non-Machine Learning Skills to Rise in Kaggle Competitions | by Pranay Dave | Jun, 2022



Data, creativity, and tactics will help you climb the leaderboard

Photo by Element5 Digital on Unsplash

This may sound counter-intuitive, but the key skill that gives you an edge over others in machine learning competitions like Kaggle may not be machine learning.

Your knowledge of machine learning algorithms is just the base skill that you need in Kaggle. Applying different algorithms, ensembling, and hyperparameter optimization are of course necessary, but they are largely automation. Anyone can copy and paste algorithm code from Stack Overflow or the scikit-learn documentation. It will help you get a decent score, but it will rarely get you into the top 10% or top 20%.

I will illustrate this article using the Kaggle Spaceship Titanic problem (called Space Titanic in the rest of this article), where I reached the top 20% of the leaderboard. (The reference and dataset citation are available at the end of the story.)

Getting in the Top 20% of the leaderboard (image by author)

The machine learning algorithm I used was XGBoost. However, it was not the algorithm that helped me reach the top 20%. What helped me was:

  • Focusing on data
  • Creative feature engineering
  • Selecting the right machine learning tactic
Winning takes more than just machine learning algorithm skills (image by author)

It is often said that about 80% of the time in a machine learning project is spent on data preparation, and the Kaggle Space Titanic problem is no different. Two of the main activities in preparing data are removing skew and handling missing values.

Removing the skew: a step that is often missed but very necessary

Skew can be a killer. You can think of it as a silent outlier, which we tend to forget. In the Space Titanic problem, there are many skewed numeric fields. Let us look at one of them: room service charges.

High right-skew (image by author)

As you can see, the distribution is highly right-skewed: most values are piled up at the left near zero, with a long tail stretching to the right. The reason is that most passengers did not spend anything on room service. However, there are some high room service values, which are barely visible in the histogram. If the skew is not treated, the machine learning algorithm will be biased towards the small values.

One way to remove the skew is to apply a power transform. The result after removing the skew is shown below.

Un-skewing data (image by author)

After applying the power transformer, the high values become more prominent, so the machine learning algorithm is able to pick them up. In the Space Titanic problem, un-skewing helps increase accuracy, because the service charges are important features in determining the fate of the passenger.
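As a rough illustration of un-skewing, here is a minimal sketch that assumes scikit-learn's `PowerTransformer` with the Yeo-Johnson method (chosen here because it accepts zeros; the article does not say which variant was used). The data is synthetic and only imitates a room-service-like column with many zeros and a long right tail.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic stand-in for a charges column (NOT the competition data):
# roughly 65% zeros, the rest drawn from a long-tailed distribution.
room_service = np.where(
    rng.random(1000) < 0.65,
    0.0,
    rng.exponential(scale=600.0, size=1000),
)

# Yeo-Johnson works with zero values; Box-Cox would require strictly
# positive inputs. By default the output is also standardized.
transformer = PowerTransformer(method="yeo-johnson")
transformed = transformer.fit_transform(room_service.reshape(-1, 1)).ravel()

skew_before = skew(room_service)
skew_after = skew(transformed)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

In a real workflow, the transformer should be fitted on the training data only and then applied to the test data with `transform`.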

Handling missing values: the most crucial step, and one to do like a pro

There are many missing values in the data. Though there are many ways to replace missing data, the KNN (K-Nearest Neighbour) approach worked very well in the Space Titanic problem.

For a complete guide on how to do KNN missing value replacement, you can see my article here.
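For readers who want a quick feel for KNN-based replacement, the following is a minimal sketch that assumes scikit-learn's `KNNImputer`; it is not necessarily the exact procedure from the linked guide. The small table and the choice of `n_neighbors=2` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Tiny illustrative table (values are made up, not competition data).
df = pd.DataFrame({
    "Age": [24.0, 58.0, np.nan, 33.0, 16.0],
    "RoomService": [0.0, 43.0, 0.0, np.nan, 303.0],
    "Spa": [0.0, 6715.0, 0.0, 3329.0, np.nan],
})

# Each missing value is replaced by the mean of that column over the
# n_neighbors most similar rows, with similarity computed on the
# columns that are present in both rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

Because the neighbours are found by distance, columns with large ranges can dominate the result, so it is worth considering scaling the numeric columns before imputing.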

Creativity is what differentiates leaders from the rest. Feature engineering is the place to think outside the box and gain those vital points to move up the Kaggle leaderboard. Though creativity cannot be documented, let me attempt to give it a structure based on the Space Titanic problem.

Look for compounded fields in the data

Compounded fields are a great way to start your creative feature engineering. A compounded field has multiple pieces of information packed into a single field. You can often identify one by a separator such as “_”, “-”, “/” or “:” in its values. A typical example is the date field. When you see something like 31/12/2022, it has three things compounded into one: the day, the month, and the year.

In the Space Titanic problem, there are two compounded fields. The first is PassengerId, which has values such as 0001_01, 0003_01, etc. The “_” is an indication of a compounded field, so we can create two features out of this field.

De-compounding passenger id field (image by author)

Similarly, you can de-compound the Cabin field, which has values such as G/3/S and C/1/S. As you may have guessed, you can split it into three features.
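De-compounding both fields can be sketched with pandas string splitting. The new column names (`Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side`) are names chosen for this illustration, and the rows are a small hand-made sample in the same format as the values quoted above.

```python
import pandas as pd

# Hand-made sample rows in the PassengerId and Cabin formats.
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "G/3/S", "C/1/S"],
})

# PassengerId: split on "_" into two features.
df[["Group", "NumberInGroup"]] = df["PassengerId"].str.split("_", expand=True)

# Cabin: split on "/" into three features.
df[["Deck", "CabinNum", "Side"]] = df["Cabin"].str.split("/", expand=True)

# The middle part of Cabin is numeric, so convert it from text.
df["CabinNum"] = pd.to_numeric(df["CabinNum"])

print(df)
```

Rows with a missing Cabin would produce missing values in all three new columns, so this split fits naturally before the missing-value step.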

Look for multiple numeric fields that can be added together

Numeric fields are your best friends for creative feature engineering. Combining them in different ways is easy and can help you win some extra accuracy points. One of the easiest yet most effective features to create is a Total feature from numeric fields. In general, machine learning algorithms find it hard to derive “Total” information if it is not explicitly provided as a feature.

In the Space Titanic dataset, you have multiple numeric fields such as RoomService charges, Spa charges, Shopping Mall charges (imagine shopping in space!), etc. This is a good opportunity to create a total field that reflects the total charges for a passenger.

Creating a Total feature (image by author)

The Total field goes a long way towards beating the competition in the Space Titanic problem!
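A Total feature is a one-liner in pandas. The sketch below uses only the three charge columns named above and a made-up sample; `TotalSpend` is a name chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with the three charge columns mentioned in the text.
df = pd.DataFrame({
    "RoomService": [0.0, 109.0, np.nan],
    "Spa": [0.0, 549.0, 10.0],
    "ShoppingMall": [0.0, 25.0, 5.0],
})

spend_cols = ["RoomService", "Spa", "ShoppingMall"]

# Row-wise sum; pandas skips missing values by default, so a missing
# charge counts as 0 unless it has been imputed beforehand.
df["TotalSpend"] = df[spend_cols].sum(axis=1)

print(df)
```

Note the interaction with missing values: whether the total is computed before or after imputation changes the result for rows with gaps.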

Look for Vertical Patterns with a curious eye!

When we look at the data, we generally tend to analyze it horizontally, one row at a time. However, you may find hidden nuggets when you analyze the data vertically, across rows. One of the hidden nuggets in the Space Titanic problem is whether the passenger is traveling alone or in a group.

Creating new features by analyzing data vertically (image by author)

On analyzing the data vertically, you will observe that there are passengers with the same passenger ID prefix (the part before the _) and similar surnames. This suggests that they are a family traveling as a group. So we can create a new feature that indicates whether the passenger is traveling alone or not.

However sophisticated the algorithm you use, it will not be able to deduce such a feature automatically. This is where the creativity of the human brain comes into play, and it will help you go a long way up the Kaggle leaderboard.
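The traveling-alone idea can be sketched by counting how many passengers share the same ID prefix. The column names `Group`, `GroupSize` and `TravelingAlone` are chosen for this illustration, the sample IDs are hand-made, and surnames are left out to keep the sketch short.

```python
import pandas as pd

# Hand-made sample in the PassengerId format.
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"],
})

# The part before "_" identifies the group.
df["Group"] = df["PassengerId"].str.split("_").str[0]

# "Vertical" analysis: count the rows that share each group prefix.
df["GroupSize"] = df.groupby("Group")["PassengerId"].transform("count")

# A passenger travels alone when nobody else shares the prefix.
df["TravelingAlone"] = df["GroupSize"] == 1

print(df)
```

If the group counts should reflect everyone on board, it may make sense to compute them over the training and test files together, since members of one group can be spread across both.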

A tactic is a sequence of actions that aims to achieve a certain goal. The word tactic has its origins in warfare. One of the classic tactics known to mankind is the Oblique Order, which was used in Greek warfare.

Solving a machine learning problem is in fact a matter of applying a sequence of algorithms. As with war tactics, if you get the sequence right, you win. If you get it wrong, even if you are using the right algorithms, you lose!

In the Space Titanic problem, you have multiple algorithms: KNN missing value replacement, power transformation, creating additional features, normalization, one-hot encoding, and XGBoost. Is there a specific sequence that helps achieve a better score? The answer is yes! And here it is.

Winning tactic (image by author)

If you use a different sequence, for example if you swap the order of the power transform and the new feature creation, accuracy will be lower.
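One way to check how much the sequence matters is to make the order an explicit parameter and cross-validate each candidate. The sketch below is built on several assumptions: the winning order itself is only shown in the image and is not reproduced here, the two orders compared are hypothetical, the data is synthetic, one-hot encoding is left out because the sample is numeric only, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to keep the example free of extra dependencies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer, StandardScaler


def add_total(X):
    """Append the row-wise sum of all columns as an extra 'Total' column."""
    return np.column_stack([X, X.sum(axis=1)])


# Factories for each preprocessing step, keyed by a short name.
STEPS = {
    "impute": lambda: KNNImputer(n_neighbors=5),
    "power": lambda: PowerTransformer(),
    "total": lambda: FunctionTransformer(add_total),
    "scale": lambda: StandardScaler(),
}


def build_pipeline(order):
    """Build a pipeline that applies the named steps in the given order."""
    steps = [(name, STEPS[name]()) for name in order]
    # Stand-in model; the article itself used XGBoost.
    steps.append(("model", GradientBoostingClassifier(random_state=0)))
    return Pipeline(steps)


# Synthetic data: three skewed "charges" columns with a few gaps.
rng = np.random.default_rng(0)
X = rng.exponential(scale=300.0, size=(200, 3))
X[rng.random(X.shape) < 0.05] = np.nan
y = (np.nansum(X, axis=1) > 800).astype(int)

# Two hypothetical orders that differ only in where the Total is created.
candidate_orders = [
    ["impute", "power", "total", "scale"],
    ["impute", "total", "power", "scale"],
]

scores = {}
for order in candidate_orders:
    pipeline = build_pipeline(order)
    scores[" > ".join(order)] = cross_val_score(pipeline, X, y, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The scores on this synthetic data say nothing about the competition; the point is only that wrapping the steps in a pipeline makes it cheap to test one sequence against another on your own data.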

To get high on the Kaggle leaderboard, you need more than just knowledge of machine learning algorithms. Here are the top 3 things:

  • Focus on understanding and preparing your data
  • Creative feature engineering will take you a long way
  • Think like a war tactician! The sequence of actions is very important

My no-code platform to learn data science

You can visit my platform to learn data science in a very easy way as well as apply some of the techniques described in this article without coding. https://experiencedatascience.com

My YouTube channel with data science demos

Here is a link to my YouTube channel, where I show various demos on data science, machine learning, and AI.
https://www.youtube.com/c/DataScienceDemonstrated

Please subscribe in order to stay informed whenever I release a new story.

You can also join Medium with my referral link

Spaceship Titanic reference and data source citation

The Spaceship Titanic problem and data are available here: https://www.kaggle.com/competitions/spaceship-titanic/overview

As specified in the rules (https://www.kaggle.com/competitions/spaceship-titanic/rules), in section 7 A, data can be used for any purpose.

A. Data Access and Use. You may access and use the Competition Data for any purpose, whether commercial or non-commercial, including for participating in the Competition and on Kaggle.com forums, and for academic research and education.


