
Automated Feature Engineering in Python | by David Farrugia | May, 2023



MACHINE LEARNING | PYTHON | DATA SCIENCE

Photo by Alina Grubnyak on Unsplash

One of the most vital skills of any data scientist or machine learning professional is the ability to extract deeper and more meaningful features from any given dataset. This concept, more commonly known as feature engineering, is perhaps one of the most powerful techniques to master when building machine learning models.

Learning from data involves a lot of engineering. Although most of its complexities have been abstracted away by modern high-level tools such as sklearn, there remains a critical need to fully understand the data and shape it to the problem you want to solve.

Extracting better features exposes additional (and potentially stronger) underlying relationships to the model regarding the business domain and its influencing factors.

Needless to say, feature engineering is incredibly time-consuming and exhausting. It requires a lot of creativity, technical expertise, and, in most cases, trial and error.

I recently came across a new tool, Upgini. In line with the current trend of Large Language Models (LLMs), Upgini exploits the power of OpenAI’s GPT LLM to automate the entire feature engineering process for our dataset.

In this article, we will go through the Upgini package and discuss its functionality.

For the purpose of this article, we will be using the Amazon Fine Food Review dataset (licensed under CC0: Public Domain).

For more information on the Upgini package, you can visit its GitHub page.

First things first, we can install Upgini directly through pip:

pip install upgini

We also load in our dataset:

import pandas as pd
import numpy as np

# read full data
df_full = pd.read_csv("/content/Reviews.csv")
# convert Time to datetime column
df_full['Time'] = pd.to_datetime(df_full['Time'], unit='s')
# re-order columns
df_full = df_full[['Time', 'ProfileName', 'Summary', 'Text', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score']]

Snippet of the result data — Image by author

We also filter our dataset to reviews with a helpfulness denominator greater than 10 that were published on or after 2011-01-01.

df_full = df_full[(df_full['HelpfulnessDenominator'] > 10) &
                  (df_full['Time'] >= '2011-01-01')]

We also transform helpfulness into a binary target, Helpful, which is 1 whenever the helpfulness ratio exceeds 0.50.

# mark a review as helpful when its helpfulness ratio exceeds 0.50
df_full.loc[:, 'Helpful'] = np.where(df_full.loc[:, 'HelpfulnessNumerator'] / df_full.loc[:, 'HelpfulnessDenominator'] > 0.50, 1, 0)
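
As a quick sanity check, we can inspect the class balance of the new binary target with plain pandas:

# proportion of helpful vs. not-helpful reviews
print(df_full['Helpful'].value_counts(normalize=True))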

Finally, we create a new column, combined, which concatenates the summary and text into a single field. We also take this opportunity to drop any duplicates.

df_full["combined"] = f"Title: {df_full['Summary'].str.strip()} ; Content: {df_full['Text'].str.strip()}"
df_full.drop(['Summary', 'Text', 'HelpfulnessNumerator', 'HelpfulnessDenominator' ], axis=1, inplace=True)
df_full.drop_duplicates(subset=['combined'], inplace=True)
df_full.reset_index(drop=True, inplace=True)

We are now ready to start searching for new features.

Following the Upgini documentation, we can start a feature search using the FeaturesEnricher object. Within that FeaturesEnricher, we specify a SearchKey (i.e., the column to use as the search key).

We can use the following column types as search keys:

  • email
  • HEM (hashed email)
  • IP
  • phone
  • date
  • datetime
  • country
  • postal code

Let us import these into Python.

from upgini import FeaturesEnricher, SearchKey

We can now start a feature search.

# search for external features, using the review date as the search key
enricher = FeaturesEnricher(search_keys={'Time': SearchKey.DATE})
# fit on our existing features, with 'Helpful' as the binary target
enricher.fit(df_full[['Time', 'ProfileName', 'Score', 'combined']], df_full['Helpful'])
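
As an aside, multiple search keys can be combined. Here is a hedged sketch, assuming our dataset also had a hypothetical Country column holding ISO country codes (ours does not):

# hypothetical: combine the review date and a country column as search keys
enricher_multi = FeaturesEnricher(
    search_keys={
        'Time': SearchKey.DATE,
        'Country': SearchKey.COUNTRY,  # hypothetical column, for illustration only
    }
)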

After some time, Upgini presents us with a list of search results — potentially relevant features to augment our dataset.

Snippet of the found features. Image by author

It seems that Upgini calculates the SHAP value for every found feature to measure the overall impact of that feature on the data and model quality.
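
If you want to inspect this report programmatically rather than through the rendered output, here is a sketch; it assumes, per the Upgini docs, a get_features_info() helper that returns the report as a DataFrame:

# assumption: get_features_info() returns the search report as a DataFrame
features_info = enricher.get_features_info()
print(features_info.head(10))  # peek at the first few returned features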

For every returned feature, we can also see and visit its source directly.

The package also evaluates the performance of a model on the original and enriched dataset.
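
Per the Upgini docs, this comparison can also be requested explicitly. A minimal sketch, assuming the calculate_metrics() method described there:

# assumption: calculate_metrics() compares model performance on the
# original versus the enriched dataset
print(enricher.calculate_metrics())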

Results obtained after enrichment. Image by author

Here we can see that by adding the enriched features, we managed to slightly improve the model’s performance. Admittedly, this performance gain is negligible.

Digging deeper into the documentation, it seems that the FeaturesEnricher also accepts another parameter — generate_features.

generate_features allows us to search for and generate feature embeddings for text columns. This sounds really promising. We do have text columns: combined and ProfileName.

“Upgini has two LLMs connected to a search engine — GPT-3.5 from OpenAI and GPT-J” — from the Upgini documentation

Let’s run this enrichment, shall we?

enricher = FeaturesEnricher(
    search_keys={'Time': SearchKey.DATE},
    generate_features=['combined', 'ProfileName']
)
enricher.fit(df_full[['Time', 'ProfileName', 'Score', 'combined']], df_full['Helpful'])

Upgini found us 222 relevant features. Again, for each feature we get a report on its SHAP value, source, and coverage over our data.

This time, we can also note that we have some generated features (i.e., the GPT text-embedding features).

Example of the text embeddings features that were generated. Image by author

And the evaluated performance?

Evaluation metrics. Image by author.

With the newly generated features we see a massive boost in predictive performance — an uplift of 0.1. And the best part is that all of it was fully automated!

We definitely want to keep these features, given the massive performance gain that we observed. We can do this as follows:

df_full_enrich = enricher.transform(df_full)

The resultant dataset. Image by author

The result is a dataset composed of 11 features. From this point onwards, we can proceed as we normally would with any other machine learning task.
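
For instance, here is a minimal sketch of that next step: training a simple scikit-learn baseline on the enriched numeric features. The model choice and column handling are my own assumptions, not part of Upgini:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# keep only numeric columns for this simple baseline; a fuller pipeline
# would also encode the remaining categorical/text columns
X = df_full_enrich.select_dtypes(include='number').drop(columns=['Helpful'], errors='ignore')
y = df_full['Helpful']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# HistGradientBoostingClassifier handles any NaNs in the enriched features natively
model = HistGradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))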

Upgini offers plenty of potential. I’m still trying out its features and getting familiar with its different functionalities, but so far it’s proving to be quite useful, especially that GPT feature generator!

Let me know your results!

Amazon Fine Food Reviews dataset by the Stanford Network Analysis Project on Kaggle, licensed under CC0: Public Domain.

