
Feature Encoding Techniques in Machine Learning with Python Implementation | by Kay Jan Wong | Jan, 2023



Photo by Susan Holt Simpson on Unsplash

Feature Encoding converts categorical variables to numerical variables as part of the feature engineering step to make the data compatible with Machine Learning models. There are various ways to perform feature encoding, depending on the type of categorical variable and other considerations.

This article introduces tips to perform feature encoding in general, elaborating on 6 feature encoding techniques that you can consider in your Data Science workflows, with comments on when to use them, and finally how to implement them in Python.

Fig 1: Summary of Feature Encoding Techniques — Image by author

A cheat-sheet summary of the 6 feature encoding techniques is shown in Fig 1; read on for a detailed explanation and implementation of each method.

  1. Label / Ordinal Encoding
  2. One-Hot / Dummy Encoding
  3. Target Encoding
  4. Count / Frequency Encoding
  5. Binary / BaseN Encoding
  6. Hash Encoding

Tip 1: Prevent Data Leakage

Given that the purpose of feature encoding is to convert categorical variables to numerical variables, we can encode categories such as “cat”, “dog”, and “horse” to numbers such as 0, 1, 2, etc. However, we must keep in mind the problem of data leakage, which happens when information from the test data leaks into the training process.

The encoder must be fitted on only the training data, such that the encoder only learns the categories that exist in the training set, and then be used to transform the validation/test data. Do not fit the encoder on the whole dataset!

The question that naturally follows is “What if I have missing or new categories in the validation/test data?”. We can handle this in two ways: remove these unseen categories, since the model was not trained on them anyway, or encode them as -1 or another arbitrary value to indicate that they are unseen categories.
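As a minimal illustration of both points (fit on the training split only, and encode unseen categories as -1), here is a hand-rolled pandas sketch with a toy “type” column; the later sections show how the encoder classes handle this for you.

import pandas as pd

# Toy training and test splits with a categorical "type" column ("rabbit" is unseen in training)
train = pd.DataFrame({"type": ["cat", "dog", "horse", "dog"]})
test = pd.DataFrame({"type": ["dog", "rabbit"]})

# Fit: learn the category-to-code mapping from the training data only
mapping = {category: code for code, category in enumerate(sorted(train["type"].unique()))}

# Transform: apply the mapping to both splits; unseen test categories become -1
train["type_encoded"] = train["type"].map(mapping)
test["type_encoded"] = test["type"].map(mapping).fillna(-1).astype(int)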

Tip 2: Save your Encoders

As mentioned in the previous tip, encoders are fitted on training data (with method .fit) and used to transform validation/test data (with method .transform). It is best to save the encoder to transform the validation/test data later on.

Other benefits of saving encoders are the ability to retrieve the categories or to transform the encoded values back to their categories (with method .inverse_transform) if applicable and required.
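For example, a fitted encoder can be persisted with joblib (a minimal sketch, assuming encoder is any fitted sklearn-style encoder and a hypothetical file name):

import joblib

# Persist the fitted encoder to disk right after fitting on the training data
joblib.dump(encoder, "type_encoder.joblib")

# Later (e.g., at inference time), reload the same encoder and transform new data
encoder = joblib.load("type_encoder.joblib")
data_test["type_encoded"] = encoder.transform(data_test["type"])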

Now let’s dive into the feature encoding techniques!

Label Encoder and Ordinal Encoder encode categories into numerical values directly (refer to Fig 2).

Label Encoder is used for nominal categorical variables (categories without order, e.g., red, green, blue) while Ordinal Encoder is used for ordinal categorical variables (categories with order, e.g., small, medium, large).

Fig 2: Example of Label Encoding — Image by author

Given that the cardinality (number of categories) is n, Label and Ordinal Encoder encode the categories as values from 0 to n-1.

When to use Label / Ordinal Encoder

  • Nominal/Ordinal Variables: Label Encoder is used for nominal categorical variables, while Ordinal Encoder is used for ordinal categorical variables
  • Supports High Cardinality: Label and Ordinal Encoder can be used in cases where there are many different categories
  • Unseen Variables: Ordinal Encoder can encode unseen variables in the validation/test set (refer to the code example below); by default, it throws a ValueError

Cons of Label / Ordinal Encoder

  • Unseen Variables: Label Encoder does not encode unseen variables in the validation/test set and will throw a ValueError; special error handling must be done to avoid this
  • Categories interpreted as numerical values: Machine Learning models would read the encoded columns as numerical variables instead of interpreting them as distinct categories

Label Encoder can only encode one column at a time, so a separate Label Encoder must be initialized for each categorical column.

from sklearn.preprocessing import LabelEncoder

# Initialize Label Encoder
encoder = LabelEncoder()

# Fit encoder on training data
data_train["type_encoded"] = encoder.fit_transform(data_train["type"])

# Transform test data
data_test["type_encoded"] = encoder.transform(data_test["type"])

# Retrieve the categories (returns list)
list(encoder.classes_)

# Retrieve original values from encoded values
data_train["type2"] = encoder.inverse_transform(data_train["type_encoded"])

For Ordinal Encoder, it can encode multiple columns at once, and the order of the categories can be specified.

from sklearn.preprocessing import OrdinalEncoder

# Initialize Ordinal Encoder
encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
data_train["size_encoded"] = encoder.fit_transform(data_train[["size"]])
data_test["size_encoded"] = encoder.transform(data_test[["size"]])

# Retrieve the categories (returns list of lists)
encoder.categories_

# Retrieve original values from encoded values
data_train["size2"] = encoder.inverse_transform(data_train[["size_encoded"]])

In One-Hot Encoding and Dummy Encoding, the categorical column is split into multiple columns consisting of ones and zeros.

This addresses a drawback of Label and Ordinal Encoding: since the encoded data is represented as multiple boolean columns, Machine Learning models no longer interpret the categories as ordered numerical values.

Fig 3: Example of One-Hot Encoding — Image by author

Given that the cardinality (number of categories) is n, One-Hot Encoder encodes the data by creating n additional columns. In Dummy Encoding, one of these columns can be dropped because its value can be inferred from the others, resulting in n-1 columns.
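As a short illustration of the n-1 (Dummy Encoding) variant, both pandas and sklearn can drop one redundant column; a minimal sketch, assuming a hypothetical “type” column in data_train:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Dummy encoding with pandas: drop the first category so only n-1 columns remain
data_dummies = pd.get_dummies(data_train["type"], drop_first=True)

# The same idea with sklearn: drop the first category of each encoded feature
encoder = OneHotEncoder(drop="first")
data_dummies = encoder.fit_transform(data_train[["type"]]).toarray()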

When to use One-Hot / Dummy Encoder

  • Nominal Variables: One-Hot Encoder is used for nominal categorical variables
  • Low to Medium Cardinality: As new columns are created for each category, it is recommended to use One-Hot Encoding when there is a low to medium number of categories, so that the resulting data is not too sparse
  • Missing or Unseen Variables: One-Hot Encoder from the sklearn package can handle missing or unseen variables (with handle_unknown="ignore") by creating columns for missing variables and omitting columns for unseen variables so that the feature columns remain consistent; by default, it throws a ValueError

Cons of One-Hot / Dummy Encoder

  • Dummy Variable Trap: The full set of one-hot columns is perfectly correlated (each row's columns sum to 1), so the features are highly multicollinear; dropping one column (Dummy Encoding) avoids this
  • Large dataset: One-hot encoding increases the number of columns in the dataset, which may in turn affect the training speed and is not optimal for tree-based models

One-hot encoding can be done with OneHotEncoder from the sklearn package or using the pandas get_dummies method.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Initialize One-Hot Encoder (unseen categories are encoded as all zeros)
encoder = OneHotEncoder(handle_unknown="ignore")

# Fit encoder on training data (returns a separate DataFrame)
data_ohe = pd.DataFrame(
    encoder.fit_transform(data_train[["type"]]).toarray(),
    index=data_train.index,
)
data_ohe.columns = [col for cols in encoder.categories_ for col in cols]

# Join encoded data with original training data
data_train = pd.concat([data_train, data_ohe], axis=1)

# Transform test data (keep the test index so the concat aligns correctly)
data_ohe = pd.DataFrame(
    encoder.transform(data_test[["type"]]).toarray(),
    index=data_test.index,
)
data_ohe.columns = [col for cols in encoder.categories_ for col in cols]
data_test = pd.concat([data_test, data_ohe], axis=1)

When using the pandas built-in get_dummies method, missing and unseen variables in the validation/test data must be handled manually (see the sketch after the snippet below).

data_ohe = pd.get_dummies(data_train["type"])
data_train = pd.concat([data_train, data_ohe], axis=1)
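A minimal sketch of that manual handling, continuing from the snippet above: save the dummy columns created from the training data and reindex the test dummies against them, so categories unseen in training are dropped and categories missing from the test data become all-zero columns.

# Columns created from the training data
train_columns = data_ohe.columns

# Encode the test data and align it to the training columns
data_ohe_test = pd.get_dummies(data_test["type"]).reindex(columns=train_columns, fill_value=0)
data_test = pd.concat([data_test, data_ohe_test], axis=1)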

Target Encoding encodes each category with the mean of the (numerical) target variable for that category, which can be interpreted as a Bayesian posterior estimate. Smoothing techniques are applied to reduce overfitting and target leakage.

Compared to Label and Ordinal Encoding, Target Encoding encodes the data with values that explain the target instead of arbitrary numbers 0, 1, 2, etc. Similar alternatives encode categorical variables with Information Value (IV) or Weight of Evidence (WoE).

Fig 4: Example of Target Encoding — Image by author

There are two ways to implement target encoding:

  • Mean Encoding: The encoded values are the mean of the target values, with smoothing applied (see the sketch after this list)
  • Leave-One-Out Encoding: The encoded values are the mean of the target values, excluding the data point that we want to predict
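To make mean encoding and smoothing concrete, here is a minimal pandas sketch using one common additive-smoothing formula (the exact formula used by category_encoders differs slightly); the “type” and “label” column names mirror the snippets in this article, and m is the smoothing weight.

import pandas as pd

def mean_encode(train, col, target, m=1.0):
    """Mean (target) encoding: blend each category's target mean with the global mean."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Categories with few samples are pulled towards the global mean
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Fit on training data only, then map the learned values onto train and test;
# categories unseen in training fall back to the global target mean
encoding = mean_encode(data_train, "type", "label", m=1.0)
data_train["type_encoded"] = data_train["type"].map(encoding)
data_test["type_encoded"] = data_test["type"].map(encoding).fillna(data_train["label"].mean())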

When to use Target Encoder

  • Nominal Variables: Target Encoder is used for nominal categorical variables
  • Supports High Cardinality: Target Encoder can be used in cases where there are many different categories, and it is better if there are multiple data samples for each category
  • Unseen Variables: Target Encoder can handle unseen variables by encoding them with the mean of the target variable

Cons of Target Encoder

  • Target Leakage: Even with smoothing, this may result in target leakage and overfitting. Leave-One-Out Encoding and introducing Gaussian noise in the target variable can be used to address the overfitting problem
  • Uneven Category Distribution: The category distribution can differ in train and validation/test data and result in categories being encoded with incorrect or extreme values

Target Encoding requires the category_encoders Python package, which can be installed with the command pip install category_encoders.

import category_encoders as ce

# Target (Mean) Encoding - fit on training data, transform test data
encoder = ce.TargetEncoder(cols="type", smoothing=1.0)
data_train["type_encoded"] = encoder.fit_transform(data_train["type"], data_train["label"])
data_test["type_encoded"] = encoder.transform(data_test["type"], data_test["label"])

# Leave One Out Encoding
encoder = ce.LeaveOneOutEncoder(cols="type")
data_train["type_encoded"] = encoder.fit_transform(data_train["type"], data_train["label"])
data_test["type_encoded"] = encoder.transform(data_test["type"], data_test["label"])

Count and Frequency Encoding encodes categorical variables to the count of occurrences and frequency (normalized count) of occurrences respectively.

Fig 5: Example of Count and Frequency Encoding — Image by author

When to use Count / Frequency Encoder

  • Nominal Variables: Frequency and Count Encoder are effective for nominal categorical variables
  • Unseen Variables: Frequency and Count Encoder can handle unseen variables by encoding them with a 0 value

Cons of Count / Frequency Encoder

  • Similar encodings: If categories have similar counts, the encoded values will be (nearly) the same and the feature loses its discriminative power

import category_encoders as ce

# Count Encoding - fit on training data, transform test data
encoder = ce.CountEncoder(cols="type")
data_train["type_count_encoded"] = encoder.fit_transform(data_train["type"])
data_test["type_count_encoded"] = encoder.transform(data_test["type"])

# Frequency (normalized count) Encoding
encoder = ce.CountEncoder(cols="type", normalize=True)
data_train["type_frequency_encoded"] = encoder.fit_transform(data_train["type"])
data_test["type_frequency_encoded"] = encoder.transform(data_test["type"])

Binary Encoding encodes categorical variables into integers, then converts them to binary code. The output is similar to One-Hot Encoding, but fewer columns are created.

This addresses the drawback of One-Hot Encoding: a cardinality of n results in roughly log2(n) columns instead of n columns (for example, 100 categories need only 7 binary columns). BaseN Encoding follows the same idea but uses a base other than 2, resulting in roughly logN(n) columns.

Fig 6: Example of Binary Encoding — Image by author

When to use Binary Encoder

  • Nominal Variables: Binary and BaseN Encoder are used for nominal categorical variables
  • High Cardinality: Binary and BaseN encoding works well with a high number of categories
  • Missing or Unseen Variables: Binary and BaseN Encoder can handle unseen variables by encoding them with 0 values across all columns

import category_encoders as ce

# Binary Encoding - fit on training data, transform test data
encoder = ce.BinaryEncoder()
data_encoded = encoder.fit_transform(data_train["type"])
data_train = pd.concat([data_train, data_encoded], axis=1)

data_encoded = encoder.transform(data_test["type"])
data_test = pd.concat([data_test, data_encoded], axis=1)

# BaseN Encoding - fit on training data, transform test data
encoder = ce.BaseNEncoder(base=5)
data_encoded = encoder.fit_transform(data_train["type"])
data_train = pd.concat([data_train, data_encoded], axis=1)

data_encoded = encoder.transform(data_test["type"])
data_test = pd.concat([data_test, data_encoded], axis=1)

Hash Encoding encodes categorical variables into hash values using a hash function. The output is similar to One-Hot Encoding, but you can choose the number of columns created.

Hash Encoding is similar to Binary Encoding in that both are more space-efficient than One-Hot Encoding, but Hash Encoding uses a hash function instead of binary numbers.

Fig 7: Example of Hash Encoding using 2 columns — Image by author

Hash encoding can encode high-cardinality data into a fixed-size array since the number of new columns is manually specified. In this sense it resembles dimensionality reduction methods such as t-SNE or Spectral Embedding, which also map data into a fixed number of dimensions.

When to use Hash Encoder

  • Nominal Variables: Hash Encoder is used for nominal categorical variables
  • High Cardinality: Hash encoding works well with a high number of categories
  • Missing or Unseen Variables: Hash Encoder can handle unseen variables, since any value, seen or unseen, is hashed into the same fixed set of columns

Cons of Hash Encoder

  • Irreversible: Hashing functions are one-directional; the original input can be hashed into a hash value, but the original input cannot be recovered from the hash value
  • Information Loss or Collisions: If too few columns are created, hash encoding can lead to loss of information, and multiple different inputs may produce the same output from the hash function

Hash encoding can be done with FeatureHasher from the sklearn package or with HashingEncoder from the category_encoders package.

from sklearn.feature_extraction import FeatureHasher

# Hash Encoding - fit on training data, transform test data
encoder = FeatureHasher(n_features=2, input_type="string")
data_encoded = pd.DataFrame(encoder.fit_transform(data_train["type"]).toarray())
data_train = pd.concat([data_train, data_encoded], axis=1)

data_encoded = pd.DataFrame(encoder.transform(data_test["type"]).toarray())
data_test = pd.concat([data_test, data_encoded], axis=1)

Using category_encoders,

import category_encoders as ce

# Hash Encoding - fit on training data, transform test data
encoder = ce.HashingEncoder(n_components=2)
data_encoded = encoder.fit_transform(data_train["type"])
data_train = pd.concat([data_train, data_encoded], axis=1)

data_encoded = encoder.transform(data_test["type"])
data_test = pd.concat([data_test, data_encoded], axis=1)

Hopefully you have learned more about the different ways to encode your categorical data into numerical data. To choose which feature encoding technique to use, it is important to consider the type of categorical data (nominal or ordinal), the Machine Learning model used, and the pros and cons of each method.

It is also important to consider missing or unseen variables during testing due to changing patterns or trends so that the data science workflow does not fail in production!

