A Beginner’s Guide to Databricks

by Sadrach Pierre, Ph.D.

Image by Alexander Grey on Pexels

Databricks allows data scientists to easily create and manage notebooks for research, experimentation, and deployment. The appeal of platforms like Databricks includes seamless integration with cloud services, tooling for model maintenance, and scalability.

Databricks is very useful for model experimentation and maintenance. It ships with MLflow, an open-source machine learning lifecycle platform developed by Databricks, which provides useful tooling for model development and deployment. With MLflow, you can log models along with associated metadata such as performance metrics and hyperparameters. This makes it very straightforward to run experiments and analyze results.

Many Databricks features help scale steps within the machine learning workflow such as data loading, model training, and model logging. Koalas is a library developed by Databricks that exposes a pandas-like API on top of Apache Spark, making it a more scalable alternative to pandas. Pandas user-defined functions (UDFs) let you apply custom, typically computationally costly functions in a distributed manner, which can significantly reduce runtime. Databricks also lets you configure jobs on larger machines, which is useful for dealing with large data and heavy computation. Further, the model registry allows you to run and store experiment results for hundreds or even thousands of models. This helps scale the number of models that a researcher develops and eventually deploys.

In this article, we will cover some of the basics of Databricks. First, we will walk through a simple data science workflow where we build a churn classification model. We will then see how we can use tools like Koalas and Pandas UDFs to speed up specific operations. Finally, we will see how we can use MLflow to help us run experiments and inspect results.

Here, we will be working with the Telco churn data set. This data contains customer billing information for a fictional Telco company. It specifies whether a customer stopped or continued using the service, known as churning. The data is publicly available and is free to use, share and modify under the Apache 2.0 license.

Getting Started

To start, navigate to the Databricks website and click on “Get Started for Free”:

Screenshot Taken by Author

You should see the following:

Screenshot Taken by Author

Enter your information and click continue. Next, you will be prompted to select a cloud platform. We won’t be working with any external cloud platforms in this article. At the bottom of the right-hand panel, click on “Get Started with Community Edition”.

Screenshot Taken by Author

Next follow the steps to create a Community Edition Account.

Importing Data

Let’s start by navigating to the ‘data’ tab in the left-hand panel:

Screenshot Taken by Author

Next click on ‘data’ and then click on create table:

Screenshot Taken by Author

Next, drag and drop the churn CSV file into the space where it says “Drop files to upload, or click to browse”.

Screenshot Taken by Author

Upon uploading the CSV you should see the following:

Screenshot Taken by Author

Next click on “Create Table in Notebook”. An example notebook with logic for writing this file to the Databricks File System (DBFS) will pop up:

Screenshot Taken by Author

DBFS allows Databricks users to upload and manage data. Because the file system is distributed, it is well suited to storing and managing large amounts of data.
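As a quick check that the upload landed where we expect, we can list the FileStore directory with the dbutils utility available in Databricks notebooks (a minimal sketch; the path below is the default upload location used by the UI):

# List files uploaded through the "Create Table" UI
display(dbutils.fs.ls("/FileStore/tables/"))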

The first cell specifies logic for reading the Churn data we uploaded:

# File location and type
file_location = "/FileStore/tables/telco_churn-1.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)

If we run this cell we get the following result:

Screenshot Taken by Author

We see that the table includes column names that aren’t very useful (_c0, _c1, etc.). To fix this we need to specify first_row_is_header = "true":

first_row_is_header = "true"

When we run this cell, we now get:

Screenshot Taken by Author

If you click on the table you can scroll to the right and see the additional columns in the data:

Screenshot Taken by Author

Building a Classification Model

Let’s proceed by building a churn classification model using our uploaded data in Databricks. On the left hand panel click on ‘create’:

Screenshot Taken by Author

Next click on notebook:

Screenshot Taken by Author

Let’s name our notebook “churn_model”:

Screenshot Taken by Author

Now we can copy the logic from the DBFS example notebook allowing us to access the data:

Screenshot Taken by Author

Next let’s convert the spark dataframe into a Pandas dataframe:

df_pandas = df.toPandas()
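One thing to watch out for: because the CSV was read with inferSchema set to "false", every column in df_pandas arrives as a string. A minimal sketch of casting the numeric columns we plan to model with (the column names are those in the Telco data set):

import pandas as pd

# errors="coerce" turns any unparseable values into NaN rather than raising
for col in ["tenure", "MonthlyCharges"]:
    df_pandas[col] = pd.to_numeric(df_pandas[col], errors="coerce")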

Let’s build a CatBoost classification model. CatBoost is a tree-based ensemble machine learning algorithm that uses gradient boosting: each successive tree in the ensemble is trained to correct the errors of the trees before it.

Let’s pip install the CatBoost package. We do this in a cell at the top of the notebook:

Screenshot Taken by Author
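For reference, the install cell contains something along the lines of the following (catboost is the package name on PyPI):

%pip install catboost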

And let’s build a CatBoost churn classification model. We will use tenure, monthly charges, and contract to predict the churn outcome. First, let’s convert the churn column to binary values:

import numpy as np

df_pandas['churn_label'] = np.where(df_pandas['Churn'] == 'No', 0, 1)
X = df_pandas[["tenure", "MonthlyCharges", "Contract"]]
y = df_pandas['churn_label']

CatBoost allows us to handle categorical variables directly, without the need to convert them to machine-readable codes. To do this, we just define a list that contains the names of the categorical columns:

cats = ["Contract"]

When defining the CatBoost model object, we set the cat_features parameter equal to this list. Let’s split our data for training and testing:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

And we can train our CatBoost model. We’ll just use default parameter values:

from catboost import CatBoostClassifier

model = CatBoostClassifier(cat_features=cats, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

And we can evaluate performance:

from sklearn.metrics import accuracy_score, precision_score

# scikit-learn metrics expect (y_true, y_pred); the order matters for precision
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Precision: ", precision)

Screenshot Taken by Author
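Accuracy and precision only tell part of the story for an imbalanced target like churn. If you also want recall and F1, scikit-learn’s classification_report prints them per class; a minimal sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out set
print(classification_report(y_test, y_pred))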

Koalas

Here we converted a Spark dataframe to a pandas dataframe. This is fine for our small data set, but as data size grows, pandas becomes slow and inefficient. An alternative to pandas is the Koalas library. Koalas is a package developed by Databricks that provides a distributed, pandas-like API on top of Spark (in recent Spark releases this functionality has been folded into PySpark itself as pyspark.pandas). To use Koalas we can pip install the koalas package at the top of our notebook:

%pip install -U koalas

And we import Koalas from databricks:

from databricks import koalas as ks

And to convert our spark dataframe to a Koalas dataframe we do the following:

df_koalas = ks.DataFrame(df)
df_koalas.head()
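Once the data is in a Koalas dataframe, most pandas-style operations keep the same syntax but execute on Spark. As a small illustration (the column names are those of the Telco data set, and the cast is needed because we read the CSV without schema inference):

# Average monthly charges by contract type, computed in a distributed fashion
df_koalas["MonthlyCharges"] = df_koalas["MonthlyCharges"].astype(float)
print(df_koalas.groupby("Contract")["MonthlyCharges"].mean())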

Pandas UDF

Pandas UDFs are another useful tool in Databricks. They allow you to apply a function to a dataframe in a distributed manner, which is useful for speeding up calculations on large dataframes. For example, we can define a function that takes a dataframe and builds a CatBoost model, and then use a Pandas UDF to apply this function at a grouped, or categorical, level. Let’s build a model for each value of internet service.

To start, we need to define our function and the schema for the Pandas UDF. The schema simply specifies the column names and their data types:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, FloatType

churn_schema = StructType(
    [
        StructField("tenure", FloatType()),
        StructField("Contract", StringType()),
        StructField("InternetService", StringType()),
        StructField("MonthlyCharges", FloatType()),
        StructField("Churn", FloatType()),
        StructField("Predictions", FloatType()),
    ]
)

Next we will define our function. We will simply include the logic we defined earlier in a function called ‘build_model’. To use pandas UDF we add the decorator ‘@pandas_udf’:

@pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
def build_model(df: pd.DataFrame) -> pd.DataFrame:
    ...

And we can include the model building logic in our function. We’ll also store the predictions and the true churn values in our dataframe:

@pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
def build_model(df: pd.DataFrame) -> pd.DataFrame:
    # The CSV was read without schema inference, so cast the numeric columns first
    df['tenure'] = df['tenure'].astype(float)
    df['MonthlyCharges'] = df['MonthlyCharges'].astype(float)
    df['churn_label'] = np.where(df['Churn'] == 'No', 0, 1)
    X = df[["tenure", "MonthlyCharges", "Contract"]]
    y = df['churn_label']
    cats = ["Contract"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = CatBoostClassifier(cat_features=cats, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Return the held-out rows with their predictions and true labels,
    # matching the column names and types declared in churn_schema
    output = X_test.copy()
    output['Predictions'] = y_pred.astype(float)
    output['Churn'] = y_test.astype(float)
    output['InternetService'] = df.loc[X_test.index, 'InternetService']
    return output

Finally we can apply this function to our dataframe. Let’s convert our Koalas dataframe back to a spark dataframe:


df_spark = df_koalas.to_spark()
churn_results = df_spark.groupBy('InternetService').apply(build_model)

And we can convert the resulting Spark dataframe to a pandas dataframe (we could also convert back to Koalas) and display the first five rows:

churn_results = churn_results.toPandas()
churn_results.head()
Screenshot Taken by Author

Even though we only stored predictions here, you can use a Pandas UDF to store any information you compute on a dataframe. An interesting exercise is to include the accuracy and precision scores in the output Spark dataframe for each internet service value.
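One lightweight way to explore that exercise is to compute the metrics after collecting the predictions, rather than inside the UDF itself. A minimal sketch using the pandas dataframe we just created (the column names follow the schema defined above):

from sklearn.metrics import accuracy_score, precision_score

# Held-out accuracy and precision computed separately for each internet service group
for service, group in churn_results.groupby("InternetService"):
    acc = accuracy_score(group["Churn"], group["Predictions"])
    prec = precision_score(group["Churn"], group["Predictions"])
    print(service, "accuracy:", round(acc, 3), "precision:", round(prec, 3))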

Getting started with MLflow

Another useful tool in Databricks is MLflow. MLflow allows you to easily run, log, and analyze experiments. For this demonstration, we will work with the first model object we defined earlier in our notebook. Let’s pip install MLflow at the top of our notebook:

%pip install -U mlflow

and import MLflow:

import mlflow

Let’s proceed by setting an experiment name:

mlflow.set_experiment("/Users/[email protected]/churn_model")
Screenshot Taken by Author

One thing we can log is the CatBoost feature importance, which will allow us to analyze which features matter most for predicting churn:

feature_importance = pd.DataFrame(
    {"variable": model.feature_names_, "importance": model.feature_importances_}
)
feature_importance.to_csv("/feature_importance.csv")
Screenshot Taken by Author

We can then log our CatBoost model using the log_model method:

with mlflow.start_run(run_name="churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")

We get a notification stating “Logged 1 run to an experiment in Mlflow”:

Screenshot Taken by Author

We can click on the run and see the following:

Screenshot Taken by Author

This is where we can see metrics such as model performance and artifacts such as feature importance. We will show how to log both of these in MLflow shortly.

We can also click on the experiment:

Screenshot Taken by Author

This is where we see each run associated with the experiment. This is useful for keeping track of experiments, such as runs with modified CatBoost parameters, training data, or engineered features.

Finally, let’s log the feature importance as an artifact, the accuracy and precision scores as metrics, and the list of categorical inputs as a parameter:

with mlflow.start_run(run_name="churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")
    mlflow.log_artifact("/feature_importance.csv")
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Accuracy", accuracy)
    mlflow.log_param("Categories", cats)

If we click on the run, we see that we logged the feature importance, the accuracy and precision scores, and the categorical inputs:

Screenshot Taken by Author
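Once runs have been logged, you can also query them programmatically rather than through the UI. A minimal sketch, assuming the experiment we set as active earlier in the notebook (mlflow.search_runs returns a pandas dataframe with one row per run, with metric columns prefixed by "metrics."):

# Searches the active experiment set via mlflow.set_experiment above
runs = mlflow.search_runs()
print(runs[["run_id", "metrics.Accuracy", "metrics.Precision"]].head())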

The code in the Databricks notebook has been ported to an IPython notebook file and is available on GitHub.

Conclusion

In this post, we discussed how to get started with Databricks. First, we saw how to upload data to DBFS. We then created a notebook and showed how to access the uploaded file from it. We then discussed tools available in Databricks that help data scientists and researchers scale data science solutions. First, we saw how to convert Spark dataframes to Koalas dataframes, which are a more scalable alternative to pandas. We then saw how to apply custom functions to Spark dataframes using Pandas UDFs, which is very useful for heavy computational tasks that need to be performed on large dataframes. Finally, we saw how to log metrics, parameters, and artifacts associated with modeling experiments. Familiarity with these tools is important for anyone working in data science, machine learning, or machine learning engineering.

