
Bird Species Classification with Machine Learning

by Benedict Neo | May 2022



Predict a bird’s species based on genetics and location

Photo by Shannon Potter on Unsplash

Like birds? Like Data Science?

You’ll love this challenge!

Problem Statement

Scientists have determined that a known species of bird should be divided into 3 distinct and separate species. These species are endemic to a particular region of the country and their populations must be tracked and estimated with as much precision as possible.

As such, a non-profit conservation society has taken up the task. They need to be able to log which species they have encountered based on the characteristics that their field officers observe in the wild.

Using certain genetic traits and location data, can you predict the species of bird that has been observed?

This is a beginner-level practice competition, and your goal is to predict the bird species based on attributes or location.

Source

You now have a clear goal.

The goal 🥅

Predict the bird species (A, B, or C) based on attributes or location.

Let’s now look at the data.

The data 💾

Get the data by registering for this data science competition.

📂 train
├── training_target.csv
├── training_set.csv
└── solution_format.csv
📂 test
└── test_set.csv

The data has been conveniently split into train and test datasets.

In both the train and test sets, you’re given bird data for locations 1 to 3.

Here’s a look at the first five rows of training_set.csv

The training_set and the training_target can be joined on the ‘id’ column.

Below is a data dictionary for the given columns:

species     : animal species (A, B, C)
bill_length : bill length (mm)
bill_depth  : bill depth (mm)
wing_length : wing length (mm)
mass        : body mass (g)
location    : island type (Location 1, 2, 3)
sex         : animal sex (0: Male; 1: Female; NA: Unknown)

Then, look at solution_format.csv to see the layout your submission should follow.

Now that you have an idea about the goal and some information about the data given to you, it’s time to get your hands dirty.

Code for this article → Deepnote

Load Libraries

First, we load some essential libraries for visualization and machine learning.
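A minimal set of imports covering everything used below might look like this (the exact list in the original notebook may differ):

```python
# Core data wrangling and visualization libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used later for imputation, encoding,
# splitting, and modeling
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
```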

Missing data helper function
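The helper itself isn’t shown in this extract; a minimal sketch that reports per-column missing counts and percentages (the name and exact output are assumptions) could be:

```python
def missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values: count and percentage per column."""
    total = df.isnull().sum()
    percent = 100 * total / len(df)
    return (
        pd.DataFrame({"total": total, "percent": percent})
        .sort_values("percent", ascending=False)
    )
```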

Load the data

First, we load the train and test data using the read_csv function.

We also merge training_set.csv (containing the features) with training_target.csv (containing the target variable) to form the training data.
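Assuming the folder layout shown earlier, a sketch of the loading and merge:

```python
# Load the competition files (paths assumed from the folder tree above)
train_features = pd.read_csv("train/training_set.csv")
train_target = pd.read_csv("train/training_target.csv")
test_df = pd.read_csv("test/test_set.csv")

# Join features and target on the shared 'id' column
train_df = train_features.merge(train_target, on="id")
```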

Here I manually save the numerical and categorical column names, as well as the target column.

This lets me easily reference the columns I want later on.
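For example (groupings taken from the data dictionary above):

```python
# Column groups, per the data dictionary; used throughout what follows
num_cols = ["bill_length", "bill_depth", "wing_length", "mass"]
cat_cols = ["location", "sex"]
target = "species"
```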

It’s time for the fun part, visualizing the data.

From the info function, we can see there are missing values, and that location and sex should be categorical, so we’ll have to do some data type conversion later on.

Numerical columns

Plotting the histograms of the numerical variables (see the sketch after this list), we see that

  • bill_depth peaks around 15 and 19
  • bill_length peaks around 39 and 47
  • wing_length peaks around 190 and 216
  • mass is right-skewed
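A quick histogram grid is enough to see this, e.g.:

```python
# Histograms of the numerical features
train_df[num_cols].hist(bins=30, figsize=(10, 8))
plt.tight_layout()
plt.show()
```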

Categorical columns

Let’s first visualize our target class.
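A countplot per categorical column (target included) is one way, sketched here:

```python
# Bar charts for the target and the categorical features
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, col in zip(axes, [target] + cat_cols):
    sns.countplot(x=col, data=train_df, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```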

We see that location and species largely pair up (location 2 with species C, location 3 with species A).

We also see there are slightly more female (1) birds than male ones.

Based on the species plot, it appears we have an imbalanced-class problem on our hands, as species B has considerably fewer samples than species A and C.

Why is this a problem?

The model will be biased towards classes with a larger number of samples.

This happens because the classifier has more information on classes with more samples, so it learns to predict those classes better while it remains weak on the smaller classes.

In our case, species A and C will be predicted more often than species B.

Here’s a great article on how to deal with this issue.

Using the helper function, there appears to be a substantial amount of missing data for bill_length and wing_length.

Let’s also use a heatmap to visualize the missing data for those columns.
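For instance:

```python
# Light cells mark missing entries; striped columns need imputation
sns.heatmap(train_df.isnull(), cbar=False)
plt.title("Missing values in the training set")
plt.show()
```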

Impute categorical values

Let’s first see how many missing values there are in our categorical variables.

Let’s use scikit-learn’s SimpleImputer to deal with them, replacing the missing values with the most frequent value.
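A sketch, fitting on train and reusing the same imputer on test so the two stay consistent:

```python
cat_imputer = SimpleImputer(strategy="most_frequent")

# Replace missing categorical values with each column's mode
train_df[cat_cols] = cat_imputer.fit_transform(train_df[cat_cols])
test_df[cat_cols] = cat_imputer.transform(test_df[cat_cols])
```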

As you can see, the most_frequent strategy imputed the missing values with 1.0, the most common value in that column.

Impute Numerical columns
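The numeric strategy isn’t shown in this extract; a minimal sketch assuming median imputation of the numeric columns would be:

```python
num_imputer = SimpleImputer(strategy="median")

# Median-impute the numeric columns (bill_length and wing_length
# had the most missing values)
train_df[num_cols] = num_imputer.fit_transform(train_df[num_cols])
test_df[num_cols] = num_imputer.transform(test_df[num_cols])
```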

Next, we’ll have to convert the categorical features, including the target variable, to a numerical format.

Let’s use scikit-learn’s Label Encoder to do that.

Here’s an example of using LabelEncoder() on the target column.

By fitting it first, we can see what the mapping looks like; using fit_transform converts the column directly.
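A sketch of both approaches:

```python
le = LabelEncoder()

# Fit first to inspect the mapping, e.g. {'A': 0, 'B': 1, 'C': 2}
le.fit(train_df[target])
print(dict(zip(le.classes_, le.transform(le.classes_))))

# Or fit and convert in one step
train_df[target] = le.fit_transform(train_df[target])
```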

For other columns with string (non-numeric) values, we apply the same encoding.

We also convert the categorical features into the pd.Categorical dtype.
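For instance, assuming location is the remaining string column:

```python
# Encode the remaining string column the same way
loc_encoder = LabelEncoder()
train_df["location"] = loc_encoder.fit_transform(train_df["location"])
test_df["location"] = loc_encoder.transform(test_df["location"])

# Mark the categorical features with the pd.Categorical dtype
for col in cat_cols:
    train_df[col] = pd.Categorical(train_df[col])
    test_df[col] = pd.Categorical(test_df[col])
```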

Here’s the current data type of the variables.

Now we create some additional features by dividing some variables by others to form ratios.

We don’t know if they would help increase the predictive power of the model, but it doesn’t hurt to try.
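The exact ratios the article uses aren’t shown here; two plausible ones as an illustration:

```python
# Hypothetical ratio features -- the exact combinations are assumptions
for df in (train_df, test_df):
    df["bill_ratio"] = df["bill_length"] / df["bill_depth"]
    df["mass_per_wing"] = df["mass"] / df["wing_length"]
```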

Here’s what the train set looks like so far.

Train test split

Now it’s time to build the model. We first split the data into X (features) and y (target variable), and then split those into a training set and an evaluation set.

The training set is where we fit the model; the evaluation set is where we test it before predicting on the test set.

We use train_test_split to split our data into the training and evaluation sets.
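A sketch (the 80/20 split, random_state, and stratification are choices, not taken from the original):

```python
X = train_df.drop(columns=["id", target])
y = train_df[target]

# Hold out 20% for evaluation; stratify to preserve class proportions
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```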

Decision Tree Classifier

For this article, we choose a simple baseline model, the DecisionTreeClassifier.

Once we fit the training set, we can predict on the evaluation data.
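For example:

```python
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out evaluation set
y_pred = clf.predict(X_eval)
```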

Let’s see how our simple decision tree classifier did.
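Accuracy is the first sanity check:

```python
from sklearn.metrics import accuracy_score

print(f"Accuracy: {accuracy_score(y_eval, y_pred):.3f}")
```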

A 99% accuracy can be meaningless for an imbalanced dataset, so we need more suitable metrics like precision, recall, and a confusion matrix.

Confusion matrix

Let’s create a confusion matrix for our model predictions.

First, we need to get the class names and the labels that the label encoder gave so our plot can show the label names.

We then plot a non-normalized and normalized confusion matrix.
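A sketch using scikit-learn’s ConfusionMatrixDisplay, with class names recovered from the fitted label encoder:

```python
from sklearn.metrics import ConfusionMatrixDisplay

class_names = le.classes_  # ['A', 'B', 'C'] from the fitted encoder

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, norm, title in [(axes[0], None, "Counts"),
                        (axes[1], "true", "Normalized by true class")]:
    ConfusionMatrixDisplay.from_predictions(
        y_eval, y_pred, display_labels=class_names, normalize=norm, ax=ax
    )
    ax.set_title(title)
plt.tight_layout()
plt.show()
```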

The confusion matrix shows that the model predicts classes A and C more often, which is not surprising since we had more samples of them.

It also shows the model predicting class A in cases where the true class is B or C.

Classification Report

A classification report measures the quality of predictions from a classification algorithm.

It tells us how many predictions are right or wrong.

More specifically, it uses true positives, false positives, true negatives, and false negatives to compute precision, recall, and F1-score.

For a detailed calculation of these metrics, check out Multi-Class Metrics Made Simple, Part II: the F1-score by Boaz Shmueli

Intuitively, precision is the ability of the classifier not to label as positive (correct) a sample that is negative (wrong), and recall is the ability of the classifier to find all the positive (correct) samples.
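Producing the report takes one call:

```python
from sklearn.metrics import classification_report

print(classification_report(y_eval, y_pred, target_names=class_names))
```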

From the docs,

  • "macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.
  • "weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.

There is no single best metric — it depends on your application. The application, and the real-life costs associated with the different types of errors, will dictate which metric to use.

Let’s also plot the feature importance to see which features matter more.
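A sketch using the tree’s feature_importances_ attribute:

```python
# Sort importances so the most useful features appear at the top
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 5))
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
```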

From the feature importance, it seems mass is the best predictor of species, followed by bill_length.

Other variables seem to have zero importance in the classifier.

We can see these feature importances at work in a visualization of our decision tree classifier.
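A sketch with scikit-learn’s plot_tree (depth capped here to keep the figure readable):

```python
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(
    clf,
    feature_names=list(X_train.columns),
    class_names=list(class_names),
    filled=True,
    max_depth=3,
    fontsize=9,
)
plt.show()
```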

At the root node, if mass is below roughly 4600, the tree next checks bill_length; otherwise it checks bill_depth. The leaves are where the classes are finally predicted.

First, we perform the same preprocessing and feature generation on the test set.

Then we can use our model to make predictions and concatenate the id column to form the solution file.

Notice the species values are numerical; we have to convert them back to string labels. With the label encoder we fit earlier, we can do so using inverse_transform.
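Putting those steps together (the output filename is an assumption; check solution_format.csv for the expected layout):

```python
# Predict on the preprocessed test set (same columns as X_train)
test_preds = clf.predict(test_df.drop(columns=["id"]))

# Map numeric predictions back to species names and save
solution = pd.DataFrame({
    "id": test_df["id"],
    "species": le.inverse_transform(test_preds),
})
solution.to_csv("solution.csv", index=False)
```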

