Big Data Fundamentals in Google Cloud Platform | by David Farrugia | June 2022


Part 2 — Road to Google Cloud Professional Data Engineer

Photo by Pawel Czerwinski on Unsplash

Welcome to the second part of the GCP Professional Data Engineer Certification series. In the first part, we introduced Google’s cloud platform and its resource hierarchy. You can find Part 1 here:

In this part, we will go over GCP’s service offerings for Big Data technologies and Machine Learning.

Product Recommendations using Cloud SQL and Spark

Product recommendations are perhaps one of the most common ML applications for modern businesses.

The idea of this use-case is to migrate an existing recommendation system from on-premises to the cloud.

When moving to the cloud, we also move from dedicated on-cluster storage to off-cluster cloud storage.

The core pieces of an ML task are data, a model, and the infrastructure to train the model and serve predictions to users.

As a use-case, let us pick the task of developing a recommender system for rental houses.

When it comes to infrastructure, we first need to decide how frequently we want to deliver predictions.

So the first decision is: should our ML application work with streaming data or as a batch process?

In our use-case, we do not need to continuously recommend houses to our users; rather, we can pre-compute the results every day and serve them to users when they come online. Therefore, a batch process works just fine in this case.

On the other hand, depending on the number of houses and users we have, we also need to consider computational resources. When dealing with large datasets, we need to perform this processing in a fault-tolerant way. Ideally, this means running our process on a cluster of machines rather than on a single one.

One such example of a fault-tolerant distributed processing framework is Apache Hadoop. The process would then look something like this (a code sketch follows the list):

  • every day, for each user, predict a score/rating for every house based on their previous ratings
  • store these predicted ratings
  • on user login, query the top N results (based on the predicted scores) and display them to the user
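
To make the first two steps concrete, below is a minimal PySpark sketch using the ALS collaborative-filtering model from Spark MLlib. The table names, column names, connection details, and the choice of N = 5 are all illustrative assumptions, not values from this article; note that ALS expects integer user and house IDs.

```python
# Minimal sketch of the daily batch job. All table/column names and the
# JDBC connection details are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("daily-house-recommendations").getOrCreate()

jdbc_url = "jdbc:mysql://<CLOUD_SQL_IP>:3306/recommendations"  # placeholder

# Step 1: load the users' previous ratings.
ratings = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "ratings")  # assumed columns: user_id, house_id, rating
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .load()
)

# Train a collaborative-filtering model (ALS needs integer IDs).
als = ALS(
    userCol="user_id",
    itemCol="house_id",
    ratingCol="rating",
    coldStartStrategy="drop",  # skip users/houses unseen during training
)
model = als.fit(ratings)

# Predict a score for every (user, house) pair; keep the top 5 per user.
recommendations = (
    model.recommendForAllUsers(5)
    .select("user_id", explode("recommendations").alias("rec"))
    .select("user_id", col("rec.house_id"), col("rec.rating").alias("predicted_rating"))
)

# Step 2: store the predicted ratings for the serving layer to read.
(
    recommendations.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "predicted_ratings")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .mode("overwrite")
    .save()
)
```

A job like this runs unchanged on a Hadoop/Spark cluster, which on GCP typically means Dataproc.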

As such, we require a transactional way to store the predictions: the nightly job needs to update the table while users may be reading from it.
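
For instance, assuming the predictions end up in a MySQL-compatible database, the nightly upsert and the per-login read could look like the sketch below. The connection object `conn` (a PyMySQL-style DB-API connection), the `predicted_ratings` table, and its unique key on (user_id, house_id) are all hypothetical.

```python
def store_predictions(conn, rows):
    """Upsert the day's (user_id, house_id, predicted_rating) rows
    inside a single transaction."""
    sql = """
        INSERT INTO predicted_ratings (user_id, house_id, predicted_rating)
        VALUES (%s, %s, %s)
        ON DUPLICATE KEY UPDATE predicted_rating = VALUES(predicted_rating)
    """
    with conn.cursor() as cur:
        cur.executemany(sql, rows)
    conn.commit()  # readers never observe a half-written batch

def top_n_for_user(conn, user_id, n=5):
    """Serve the pre-computed top-N houses on user login."""
    sql = """
        SELECT house_id, predicted_rating
        FROM predicted_ratings
        WHERE user_id = %s
        ORDER BY predicted_rating DESC
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (user_id, n))
        return cur.fetchall()
```

Because the whole batch is committed as one transaction, a user logging in mid-update sees either yesterday’s scores or today’s, never a partially written mixture.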

GCP offers multiple transactional storage solutions. Of course, different requirements call for different services. Below, we summarise the properties of some GCP storage services.

[Table: Google services and their access patterns. © Google Cloud Platform]
[Figure: GCP storage flowchart. Image by author.]

For our example use-case, Cloud SQL is the best service to use.

Cloud SQL is a fully managed relational database service. By assigning the instance a static IP address, we can also connect to it from anywhere.
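
As a quick sketch, a direct connection could look like this, assuming the PyMySQL driver; the IP address, credentials, and database name below are placeholders:

```python
import pymysql

# Connect directly to the Cloud SQL instance's static IP.
# 203.0.113.10 is a documentation placeholder, not a real address.
conn = pymysql.connect(
    host="203.0.113.10",
    user="recommender",
    password="<PASSWORD>",
    database="recommendations",
)

with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())
```

In practice, the client’s address must also be added to the instance’s authorized networks, or you can connect through the Cloud SQL Auth Proxy instead of exposing the database directly.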

We also need a service that manages data processing pipelines: something that can process our data in batches and streams, and also train machine learning models.

A good example of such software is Apache Spark and its machine learning library, MLlib (used in the ALS sketch above). Check out my other blog on running a Spark job in record time without the need for any infrastructure overhead.

