
Unlock the Secret to Efficient Batch Prediction Pipelines Using Python, a Feature Store and GCS | by Paul Iusztin | May, 2023



Prepare Credentials

First of all, you have to create a .env file where you will add all your credentials.

I already showed you in Lesson 1 how to set up your .env file and explained how the variables from the .env file, which lives in your ML_PIPELINE_ROOT_DIR directory, are loaded into a SETTINGS Python dictionary that is used throughout the code.
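
As a rough idea of how this works, here is a minimal sketch (not the course's actual implementation) that reads a .env file from the root directory into a SETTINGS dictionary with python-dotenv; treating ML_PIPELINE_ROOT_DIR as an environment variable is my assumption here:

    import os
    from pathlib import Path

    from dotenv import dotenv_values  # pip install python-dotenv

    # Assumption: ML_PIPELINE_ROOT_DIR points to the directory that holds the .env file.
    ML_PIPELINE_ROOT_DIR = Path(os.environ.get("ML_PIPELINE_ROOT_DIR", "."))

    # Load every KEY=VALUE pair from the .env file into a plain dictionary.
    SETTINGS = dict(dotenv_values(ML_PIPELINE_ROOT_DIR / ".env"))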

Thus, if you want to replicate what I have done, I strongly recommend checking out Lesson 1.

If you only want a light read, you can completely skip the “Prepare Credentials” step.

In Lesson 3, you will use two services:

  1. Hopsworks
  2. GCP — Cloud Storage

Hopsworks (free)

We already showed you in Lesson 1 how to set up the Hopsworks credentials. Please visit the “Prepare Credentials” section of Lesson 1, which explains in detail how to set up your Hopsworks API key.
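
Once the API key is in your .env file, logging in to Hopsworks from Python looks roughly like the sketch below; the FS_API_KEY and FS_PROJECT_NAME key names are illustrative, not necessarily the exact names used in the course:

    import hopsworks
    from dotenv import dotenv_values

    SETTINGS = dotenv_values(".env")  # assumes the .env file sits in the current directory

    # Illustrative key names; use whatever names your .env file actually defines.
    project = hopsworks.login(
        api_key_value=SETTINGS["FS_API_KEY"],
        project=SETTINGS["FS_PROJECT_NAME"],
    )
    feature_store = project.get_feature_store()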

GCP — Cloud Storage (free)

While replicating this course, you will stick to the GCP — Cloud Storage free tier. You can store up to 5GB for free in GCP — Cloud Storage, which is far more than enough for our use case.

This configuration step will take a bit longer, but I promise it is not complicated. As a bonus, you will learn the basics of working with a cloud vendor such as GCP.

First, go to GCP and create a project called “energy_consumption”. Afterward, go to your GCP project’s “Cloud Storage” section and create a non-public bucket called “hourly-batch-predictions”. Pick any region, but remember which one you chose. Here are the official docs on creating a bucket on GCP [2].

Note: To keep things convenient, stick to our naming conventions. Otherwise, you will have extra configuration steps, and who likes configuration steps, right?

Screenshot of the GCP — Cloud Storage view, where you must create your bucket [Image by the Author].
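
If you prefer creating the bucket from Python instead of the console, a rough equivalent using the google-cloud-storage client is sketched below; it assumes you are already authenticated with GCP (for example, via application-default credentials or the service account key you will create in the next step), and the region is just an example:

    from dotenv import dotenv_values
    from google.cloud import storage  # pip install google-cloud-storage

    SETTINGS = dotenv_values(".env")

    # GOOGLE_CLOUD_PROJECT should hold your GCP project ID.
    client = storage.Client(project=SETTINGS["GOOGLE_CLOUD_PROJECT"])

    # Same bucket name as in the console walkthrough; the region is only an example.
    bucket = client.create_bucket("hourly-batch-predictions", location="europe-west3")
    print(f"Created bucket {bucket.name} in {bucket.location}.")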

Now you have finished creating all your GCP resources. The last step is to set up read & write access to the GCP bucket directly from your Python code.

You can easily do this using GCP service accounts. I don’t want to hijack the whole article with GCP configuration details, so here is the official GCP doc that shows you how to create a service account [3].

When creating the service account, be aware of one thing!

Service accounts have different roles attached to them. A role is a way to grant your service account various permissions.

Thus, you need to configure your service account to have read & write access to your “hourly-batch-predictions” bucket.

You can easily do that by choosing the “Storage Object Admin” role when creating your service account.

The final step is to find a way to authenticate with your newly created service account in your Python code.

You can easily do that by going to your service account and creating a JSON key. Again, here are the official GCP docs that will show you how to create a JSON key for your service account [4].

Again, keep in mind one thing!

When creating the JSON key, you will download a JSON file.

After you download your JSON file, put it in a safe place and go to your .env file. There, change the value of GOOGLE_CLOUD_SERVICE_ACCOUNT_JSON_PATH to the absolute path of your JSON file.

A screenshot of the .env.default file [Image by the Author].

NOTE: If you haven’t followed our naming conventions, you must change the GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_BUCKET_NAME variables accordingly.
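
For reference, the GCP-related part of your .env file should end up looking roughly like this (the path is a placeholder for your own absolute path; adjust the project and bucket values if you deviated from the naming conventions):

    GOOGLE_CLOUD_PROJECT=energy_consumption
    GOOGLE_CLOUD_BUCKET_NAME=hourly-batch-predictions
    GOOGLE_CLOUD_SERVICE_ACCOUNT_JSON_PATH=/absolute/path/to/your/service-account-key.json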

Congratulations! You are done configuring GCP — Cloud Storage.

Now you have created a GCP project and a bucket, and you have read & write access to the bucket from your Python code through your service account. You authenticate with your service account using the JSON key file.
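
To verify that everything is wired up correctly, here is a short sketch (not code from the course) that authenticates with the JSON key, then writes and reads back a small test object in your bucket:

    from dotenv import dotenv_values
    from google.cloud import storage  # pip install google-cloud-storage

    SETTINGS = dotenv_values(".env")

    # Authenticate with the service account JSON key created above.
    client = storage.Client.from_service_account_json(
        SETTINGS["GOOGLE_CLOUD_SERVICE_ACCOUNT_JSON_PATH"]
    )
    bucket = client.bucket(SETTINGS["GOOGLE_CLOUD_BUCKET_NAME"])

    # Write a small test object, then read it back.
    blob = bucket.blob("smoke_test.txt")
    blob.upload_from_string("Hello from the batch prediction pipeline!")
    print(blob.download_as_text())

If this runs without errors, your service account, its role, and the bucket are all configured correctly.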

If something isn’t working, let me know in the comments below or directly on LinkedIn.

