Automate the Feature Engineering Pipeline for Your Relational Dataset | by Satyam Kumar | Aug, 2022

By Jessie Hobb On Aug 25, 2022

Essential guide to an open-source Python framework for automated feature engineering

Feature engineering is an important and time-consuming component of the data science model development pipeline. The quality of the feature engineering pipeline largely determines the robustness and performance of the model.

There are various automated feature engineering packages that process and create features for a single dataset. But these packages fall short for use-cases that involve multiple relational datasets. Merging multiple relational datasets and computing features from them is a tedious and time-consuming task. In this article, we will discuss Featuretools, an open-source package that automatically creates features from temporal and relational datasets in a few lines of Python code.

Featuretools is an open-source Python framework that automates the feature engineering pipeline for predictive modeling use-cases with temporal and relational datasets. Some of the key features of the Featuretools library are:

Deep Feature Synthesis: Featuretools offers Deep Feature Synthesis (DFS) to automatically build meaningful features from a relational dataset.
Precise handling of time: Featuretools provides APIs to ensure that only valid data is used for calculations, keeping your feature vectors safe from common label leakage problems.
Reusable feature primitives: Featuretools offers low-level functions, called primitives, which can be stacked to create features. Custom primitives can be built once and reused on any dataset.

Featuretools is compatible with other popular packages such as Pandas, NumPy, and scikit-learn, and it creates meaningful features in a fraction of the time that manual feature engineering takes.

The Featuretools library can be installed from PyPI using pip install featuretools.

Reading the dataset:

We will be using a mock relational dataset consisting of transactions, sessions, and customers tables. The mock dataset can be generated using the load_mock_customer() function from the featuretools.demo module.

The relationships between the above tables are shown below:

(Image by Author), Relationship between relational tables

Specify the Relationship:

First, we specify a dictionary with all the entities (tables) in our dataset.
Second, we specify how the entities are related. When two entities have a one-to-many relationship, we call the ‘one’ entity the ‘parent’ and the ‘many’ entity the ‘child’. Each relationship between a parent and a child is defined as a tuple ( parent_table, parent_key, child_table, link_key ), and all relationships are passed together as a list of such tuples.

Generate Features:

Featuretools offers the dfs() function, which generates relevant features using Deep Feature Synthesis. Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.

Typically, without Featuretools, a data scientist would write code to aggregate the data for each customer and apply different statistical functions, resulting in features that quantify the customer’s behavior. The DFS algorithm can generate these features automatically when we specify target_dataframe_name="customers", where ‘customers’ is the table onto which all the data is aggregated.
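
For comparison, here is a sketch of the manual approach in plain pandas on a tiny hand-made dataset. The tables and values are invented for illustration, and the column names only mimic the naming style that Featuretools uses.

```python
import pandas as pd

# Toy tables (invented values): two customers, three sessions, five transactions.
sessions = pd.DataFrame({"session_id": [1, 2, 3], "customer_id": [10, 10, 20]})
transactions = pd.DataFrame(
    {
        "transaction_id": [100, 101, 102, 103, 104],
        "session_id": [1, 1, 2, 3, 3],
        "amount": [5.0, 15.0, 10.0, 7.0, 3.0],
    }
)

# Step 1: attach a customer_id to every transaction through its session.
merged = transactions.merge(sessions, on="session_id", how="left")

# Step 2: aggregate the transaction amounts per customer.
manual_features = merged.groupby("customer_id")["amount"].agg(
    ["sum", "mean", "max", "count"]
)
manual_features.columns = [
    f"{stat.upper()}(transactions.amount)" for stat in manual_features.columns
]
print(manual_features)
```

Every additional table, column, or statistic means another merge or aggregation written by hand, which is the repetitive work that DFS takes over.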

The dfs() function returns the engineered feature matrix together with the list of feature definitions. With the three relational tables above and ‘customers’ as the target, we get 75 features, with one row per customer.

Results:

The descriptions of the 75 automatically generated features can be seen in the snapshot below:

(GIF by Author), Feature description on the Featuretools generated features

In use-cases involving relational datasets, a data scientist otherwise has to aggregate data manually and create features using different statistical techniques. Featuretools is a handy package that extracts features from multiple tables and aggregates them into one final dataset. It saves data scientists a lot of time and energy, which they can then spend on more advanced feature engineering.

[1] Featuretools documentation: https://www.featuretools.com/

Thank You for Reading