Handling Missing Values in Versatile Data Kit | by Angelica Lo Duca | Aug, 2022

By Jessie Hobb On Aug 31, 2022

Data Preprocessing

A tutorial on how to build data pipelines using VDK to handle missing values

VMware has recently released a new framework, Versatile Data Kit (VDK), which you can use for Data Ingestion and Data Processing. VDK helps you to easily perform complex operations, such as data ingestion from different sources, using either SQL or Python. In other words, you can use VDK to build data lakes, where you ingest raw data extracted from different sources, including structured, semi-structured, and unstructured data.

VDK also provides many plugins, which you can use to import data available in different formats, such as CSV, HTTP, and much more. The list of all the available plugins is available in the VDK official documentation.

In addition to a raw importing of data, you can use VDK to process data and transform them into a format that is directly readable by other people, such as data scientists or data analysts.

Data Preprocessing is the next step after importing your raw data into a data lake. Usually, Data Preprocessing includes different steps, such as dealing with missing values, data formatting, data normalization, data standardization, and data binning.

You can combine the power of SQL and Python to perform Data Preprocessing directly in VDK. This article will show you how to deal with missing values in VDK through a practical example.

The article is organized as follows:

Environment Setup in VDK
Description of the scenario
Ingesting raw data in VDK
Dealing with missing values in VDK.

You can install Versatile Data Kit through pip as follows:

pip install -U pip setuptools wheel
pip install quickstart-vdk

To make VDK work properly, you need to install Python 3.7+.

VDK supports different storage systems, including Postgres, Trino, and much more. For simplicity, in this example, we’ll use SQLite. Thus, we need to install the related plugin as follows:

pip install vdk-sqlite

In addition, we’ll use the vdk-csv plugin to ingest data, thus we need to also install it as follows:

pip install vdk-csv

As an example, we’ll use the hepatitis.csv dataset, available at this link, under the Public Domain Dedication and License.

The following figure shows the first 15 rows of the dataset:

The dataset contains many missing values.

The objective of this scenario is:

ingest the dataset as it is into an SQLite database through the vdk-csv plugin
create a new table without missing values, using VDK
define a strategy to replace missing values with other values, using VDK.

To create a new data job, we open a terminal and we run the following command:

vdk create -n missing-values -t my-team --local

the vdk create command creates a new data job, the -n parameter specifies the project name, the -t parameter specifies the team name, and the --local parameter specifies that the project will use the local file system to store data.

As an output, the command creates a directory, named missing-values, which contains the template files shown in the following figure.

Files with extensions .sql and .py contain ingesting or processing steps. VDK runs job steps by following alphabetical order, so 10_sql_step.sql is executed before 20_python_step.py. The file requirements.txt contains the list of Python packages required by the project. The config.ini file contains the configuration required to run the project. In our case, we can use the default configuration. If you want to use a different database, you should configure properly the following section of the config.ini file:

[vdk]db_default_type=SQLITE
ingest_method_default=SQLITE

To ingest raw data in VDK, we use the vdk-csv plugin. The objective is to ingest the hepatitis.csv into an SQLite database. Firstly, we configure an environment variable, which will be used by the vdk-csv plugin to understand how to ingest data:

export VDK_INGEST_METHOD_DEFAULT="sqlite"

Then, we ingest the dataset using the vdk-csv plugin:

vdk ingest-csv -f hepatitis.csv

We have used the -f parameter to specify the file to ingest.

We can check if we have imported the dataset correctly, by running a query. For example, we can count the total number of records directly in VDK, through the following command:

vdk sqlite-query -q "SELECT count(*) FROM hepatitis"

There are 155 records.

A missing value is a value that is not provided in the dataset. There are many strategies when dealing with missing values. In this article, we describe two ways:

removing all missing values
replacing missing values with other values, such as the average value or a constant value.

We will discuss how to implement both strategies using both SQL in VDK.

4.1 Removing all missing values

We will store all the non-null in a new table, named dropped_missing_values. Firstly, we define a new job step, named 10_drop_dropped_missing_values_table.sql, that drops the table if it exists:

DROP TABLE IF EXISTS dropped_missing_values;

Now, we build a new job step, named 20_drop_dropped_missing_values_table.sql which creates the dropped_missing_values table, which contains only non-null values. The following piece of code shows the SQL version of the job:

CREATE TABLE dropped_missing_values AS
SELECT *
FROM hepatitis
WHERE age IS NOT NULL AND
sex IS NOT NULL AND
steroid IS NOT NULL AND
antivirals IS NOT NULL AND
fatigue IS NOT NULL AND
malaise IS NOT NULL AND
anorexia IS NOT NULL AND
liver_big IS NOT NULL AND
liver_firm IS NOT NULL AND
spleen_palpable IS NOT NULL AND
spiders IS NOT NULL AND
ascites IS NOT NULL AND
varices IS NOT NULL AND 
bilirubin IS NOT NULL AND
alk_phosphate IS NOT NULL AND
sgot IS NOT NULL AND
albumin IS NOT NULL AND
protime IS NOT NULL AND 
histology IS NOT NULL AND
class IS NOT NULL
;

The previous code simply checks that all the fields do not contain null values.

We can run all the steps in the missing-values data job as follows:

vdk run missing-values

4.2 Replacing missing values

Alternatively, you can replace missing values with other values, such as the most frequent value or the average value.

In our case, we have 3 types of columns: numbers, booleans, and strings. We replace missing values in:

numbers with the average value
booleans with the constant value False
strings with a constant value.

In all the cases we use the COALESCE() function to search for and replace missing values.

In vdk, we define two steps in the data jobs. The first job, named 10_drop_replace_missing_values_table.sql, removes the replaced_missing_values tables, if it exists:

DROP TABLE IF EXISTS replaced_missing_values;

The second step, named 20_create_missing_values_table.sql, creates the replaced_missing values table:

CREATE TABLE replaced_missing_values AS
SELECT COALESCE(age, AVG(age) OVER()) AS age,
COALESCE(sex, "female") AS sex,
COALESCE(steroid, False) AS steroid,
COALESCE(antivirals, False) AS antivirals,
COALESCE(fatigue, False) AS fatigue,
COALESCE(malaise, False) AS malaise,
COALESCE(anorexia, False) AS anorexia,
COALESCE(liver_big, False) AS liver_big,
COALESCE(liver_firm,False) AS liver_firm,
COALESCE(spleen_palpable, False) AS spleen_palpable,
COALESCE(spiders, False) AS spiders,
COALESCE(ascites, False) AS ascites,
COALESCE(varices, False) AS varices,
COALESCE(bilirubin, AVG(bilirubin) OVER()) AS bilirubin,
COALESCE(alk_phosphate, AVG(alk_phosphate) OVER()) AS alk_phosphate,
COALESCE(sgot, AVG(sgot) OVER()) AS sgot,
COALESCE(albumin, AVG(albumin) OVER()) AS albumin,
COALESCE(protime, AVG(protime) OVER()) AS protime,
COALESCE(histology, False) AS histology,
COALESCE(class, "live") AS class
FROM hepatitis;

We can run all the jobs as follows:

vdk run missing-values

Congratulations! You have just learned how to deal with missing values using VDK! You can use VDK both for data ingestion and processing.

VDK is a very powerful framework, which permits you to build a data lake and merge multiple sources through its plugins. In this example, you have used the vdk-csv and vdk-sqlite plugins. You can view the full list of available plugins at this link.

For more information, you can read the VDK official documentation, as well as join the VDK slack channel.

You can read this article, which uses a Trino Database with a MySQL server to store data and VDK to ingest data in the database.

This article describes another example in VDK, which extracts data from REST API. Since access to the REST API requires a secret key, the example also describes how to configure the VDK Server to store the secret.