
Rapid Prototyping Using Terraform, GitHub Actions, Docker and Streamlit in GCP

By Ken Moriwaki | August 2022



Speed Up Insight-sharing Using CI/CD Tools

Photo by Matthew Brodeur on Unsplash

Data scientists and analysts are valuable resources in many organisations. They spend most of their time understanding business requirements and finding insights through data collection, wrangling, cleaning, and modelling. The tools that support such data specialists are abundant, e.g., notebooks like Jupyter and Colab, Python or R libraries, and other data technologies.

On the other hand, when it comes to sharing essential findings and valuable insights with stakeholders, the options are limited. Notebooks may not be the best choice for a non-technical audience, and sharing your findings in office documents like Word and PowerPoint often fails to convey their dynamic and interactive nature. It is usually a pain for data specialists to find the optimal way to present findings in a suitable format to decision-makers quickly and effectively.

In this article, we demonstrate a way to share data science findings with stakeholders using CI/CD (Continuous Integration and Continuous Delivery) tools such as Terraform, GitHub Actions, and Docker together with a Streamlit application, which can accelerate prototyping. We will deploy the code to a virtual machine in Google Cloud Platform (GCP), but you can also apply these techniques to other cloud service providers.

Workflow Overview

This article guides you through creating the following CI/CD pipeline.

1. Writing Terraform files locally and pushing them to GitHub.
2. Automating GCP provisioning with Terraform via GitHub Actions.
3. Writing the Streamlit app code and Dockerfile locally and pushing the code to GitHub.
4. Deploying the code to a Virtual Machine instance in GCP via GitHub Actions.

The image describes the workflow among the local, GitHub and GCP environments.
Figure 1. Workflow overview (Image by Author)

Prerequisites

There are four requirements to implement the solutions.

● GitHub Account
● Terraform Installation
● GCP Service Account
● Python libraries

GitHub

GitHub is a web-based version control service built on Git. It also provides a service called GitHub Actions, so you do not need to install third-party applications to automate workflows for testing, releasing and deploying code into production. You can create an account on the GitHub website.

Terraform

Terraform is an open-source infrastructure-as-code (IaC) tool provided by HashiCorp. IaC uses code to manage and provision infrastructure, eliminating the need to configure resources manually through the UI provided by the cloud vendors.

Using Terraform for prototyping has some advantages. For example:

  1. It avoids manual provisioning of the infrastructure.
  2. Data scientists and software engineers can help themselves.
  3. Changes to the current set-up become straightforward.
  4. It is easy to transfer knowledge and keep the configuration under version control.

It is possible to automate the process with GitHub Actions without installing Terraform locally, because GitHub executes the workflows in the cloud. However, it may be a good idea to install the software locally when testing new resources or just getting started with Terraform. In that case, you need to provide the Terraform user with credentials to access cloud resources, which may be a security concern in some organisations. You can find the installation instructions on the Terraform website.

GCP

A service account (SA) represents a Google Cloud service identity, e.g., code running on a Compute Engine instance or creating a bucket in Cloud Storage. SAs can be created from the IAM & Admin console. For prototyping purposes, select the Editor role for the account so it can view, create, update, and delete resources. You may like to refer to the official guide by Google Cloud. This article uses Compute Engine, Cloud Storage, and the VPC Network.

Python

We use just three libraries: pandas, yfinance and Streamlit. pandas is an open-source data manipulation tool. yfinance is a Python API for retrieving financial market data from Yahoo! Finance. Streamlit is a Python-based web application framework; you can develop a web application with less code than with frameworks like Flask or Django.

Provisioning GCP Resources with Terraform

Terraform is an IaC tool: you manage cloud infrastructure by writing code. Terraform can provision services and resources for many cloud providers; the provider plugins are written and maintained by HashiCorp and the Terraform community. A list of available providers, including AWS, Azure and GCP, can be found on the Terraform Registry website.

Terraform’s workflow consists of three steps: define, plan, and apply. In the define step, you write configuration files to declare the provider, services, and resources; the files can be formatted and validated with the terraform fmt and terraform validate commands. In the plan step, Terraform creates an execution plan based on the configuration files and the state of the existing infrastructure; the plan can create new resources or update/remove existing ones. Finally, the apply step performs the planned operations and records the state of the infrastructure provisioned by Terraform.
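
On the command line, this workflow typically looks like the following; run the commands in the directory that holds the .tf files:

```bash
terraform init      # download provider plugins (first run only)
terraform fmt       # format the configuration files
terraform validate  # check that the configuration is syntactically valid
terraform plan      # preview the changes Terraform would make
terraform apply     # provision the planned resources
```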

We will create three Terraform files: main.tf, variables.tf, and terraform.tfvars.

main.tf

main.tf is the file that contains the main set of configurations. It defines the required providers, provider variables and resources. We will walk through its contents piece by piece.

First, the provider is defined; we will use google in this article. The provider's variables are written as var.*, which imports their values from the separate file variables.tf. credentials_file should be the path to the service account key JSON downloaded from GCP.
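
A minimal sketch of this part of main.tf, assuming a recent google provider version and a GCS backend for the Terraform state:

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"   # version constraint is an assumption
    }
  }
  # Keep the Terraform state in a GCS bucket; the bucket name is passed
  # at `terraform init` time via -backend-config (see the workflow below).
  backend "gcs" {}
}

provider "google" {
  credentials = file(var.credentials_file)
  project     = var.project
  region      = var.region
  zone        = var.zone
}
```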

The next script creates a VPC network.
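
A sketch of the network resource (the resource and network names are assumptions):

```hcl
resource "google_compute_network" "vpc_network" {
  name = "streamlit-network"
}
```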

Then, we configure a VM instance.

We specify the VM instance name, machine type, disk image, and network interface. We also add tags and define the start-up script. Tags allow firewall rules to be applied to the VM instance; we create three tags corresponding to the three firewall rules defined below. metadata_startup_script runs a shell script at start-up; in our case, it pre-installs the Docker engine on the VM.
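
A possible version of the instance definition; the instance name, machine type, image and tag names below are assumptions:

```hcl
resource "google_compute_instance" "vm_instance" {
  name         = "streamlit-vm"
  machine_type = "e2-small"
  zone         = var.zone

  # The tags tie this instance to the three firewall rules defined below.
  tags = ["http-server", "ssh-server", "streamlit-server"]

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2004-lts"
    }
  }

  network_interface {
    network = google_compute_network.vpc_network.name
    access_config {
      # An empty block assigns an ephemeral external IP address.
    }
  }

  # Pre-install the Docker engine on first boot.
  metadata_startup_script = file("install_docker.sh")
}
```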

The next resource specifies the HTTP access firewall rule.
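
For example (the rule name and ranges are illustrative):

```hcl
resource "google_compute_firewall" "allow_http" {
  name    = "allow-http"
  network = google_compute_network.vpc_network.name

  allow {
    protocol = "tcp"
    ports    = ["80"]
  }

  # Applies only to instances tagged "http-server".
  target_tags   = ["http-server"]
  source_ranges = ["0.0.0.0/0"]
}
```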

You can limit access with source_ranges in CIDR format.

We may wish to access the VM instance via SSH.
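
A corresponding rule could look like this (again a sketch; consider narrowing source_ranges to your own IP address):

```hcl
resource "google_compute_firewall" "allow_ssh" {
  name    = "allow-ssh"
  network = google_compute_network.vpc_network.name

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  target_tags   = ["ssh-server"]
  source_ranges = ["0.0.0.0/0"]
}
```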

The third firewall rule is for the Streamlit web app; port 8501 needs to be opened.
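
A sketch of that rule:

```hcl
resource "google_compute_firewall" "allow_streamlit" {
  name    = "allow-streamlit"
  network = google_compute_network.vpc_network.name

  allow {
    protocol = "tcp"
    ports    = ["8501"]   # Streamlit's default port
  }

  target_tags   = ["streamlit-server"]
  source_ranges = ["0.0.0.0/0"]
}
```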

The final resource definition is for Cloud Storage. It creates a new bucket using variables read from the variables.tf file.
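
For example (the location is an assumption; the bucket name comes from variables.tf):

```hcl
resource "google_storage_bucket" "bucket" {
  name     = var.bucket_name
  location = var.region
}
```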

variables.tf

The values for var.* in main.tf are supplied from the variables.tf file.

Each variable can have a default value. In our example, some variables do not have a default value; because they are required arguments, Terraform needs them to be supplied. One way to supply them is via a *.tfvars file, which reduces the exposure of sensitive information.
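
A sketch of variables.tf consistent with the resources above (the descriptions and defaults are assumptions):

```hcl
variable "project" {
  description = "GCP project ID"
  # No default: must be supplied, e.g. via terraform.tfvars.
}

variable "credentials_file" {
  description = "Path to the service-account key JSON file"
}

variable "bucket_name" {
  description = "Globally unique name for the Cloud Storage bucket"
}

variable "region" {
  default = "us-central1"
}

variable "zone" {
  default = "us-central1-a"
}
```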

terraform.tfvars

We have three lines in the terraform.tfvars file. Please replace "***" with your own values.
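
For example:

```hcl
project          = "***"   # your GCP project ID
credentials_file = "***"   # path to the downloaded service-account key JSON
bucket_name      = "***"   # a globally unique bucket name
```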

GCP Machine Type and Image List

In main.tf, we specified the machine type and disk image. The accepted values may not always be identical to the descriptions you see on the GCP console. You can retrieve the lists of machine types and Compute Engine images using the gcloud commands below. (The Google Cloud SDK must be installed; if not, please refer to the official guides.)

```bash
gcloud compute images list
```
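
Machine types can be listed in the same way:

```bash
gcloud compute machine-types list   # optionally filter, e.g. --zones=us-central1-a
```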

We install Docker on the VM instance. Place the installation script, install_docker.sh, in the same directory as the main.tf file. The installation steps for Ubuntu are taken from the official Docker documentation.
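
A possible install_docker.sh, based on Docker's apt-repository instructions for Ubuntu; treat this as a sketch and compare it against the current official guide:

```bash
#!/bin/bash
# Install the Docker engine on Ubuntu following the official apt-based steps.
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release

# Add Docker's official GPG key and apt repository.
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg |
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" |
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install the Docker engine, CLI and containerd.
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
```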

Automating Provisioning with GitHub Actions

GitHub Actions allows you to build automated CI/CD pipelines. It is free of charge for public repositories, and 2,000 minutes per month are included for private repositories [1].

GitHub Actions reads YAML files saved under the .github/workflows/ directory in a repository and performs the workflows defined there. A workflow can have multiple jobs, and each job has one or more steps. A step can use an action, but not every step needs to.

The file content is as follows:

First, we define the workflow name and the trigger, which is a push to the main branch. Next, we add the information about the job.
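
A sketch of the top of the file; the workflow and job names and the environment variable are assumptions:

```yaml
name: terraform

on:
  push:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    env:
      # Example job-level variable; values set here are not visible to other jobs.
      TF_IN_AUTOMATION: "true"
```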

The job above runs on an Ubuntu machine on GitHub's servers. Each job runs on a fresh instance; therefore, an environment variable defined under one job cannot be used by another job.

Next, we specify steps and actions.

In the second step, we use secrets.GCP_SA_KEY. The secrets are stored in the repository's Actions secrets. GitHub supports base64-encoded secrets; you can encode the GCP service account key JSON in base64 using Python:
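
For example (the file name below is a placeholder for your downloaded key):

```python
import base64

# Print a base64 string that can be pasted into the repository's Actions secrets.
with open("service-account-key.json", "rb") as f:
    print(base64.b64encode(f.read()).decode("utf-8"))
```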

The Terraform init step prepares the working directory by downloading the provider plugins. We also specify the GCS bucket name in the backend config.
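
Continuing the job above, the steps could look like the following sketch; the secret names, credential path and action versions are assumptions:

```yaml
    steps:
      - uses: actions/checkout@v3

      # Decode the base64-encoded service-account key stored in the secrets
      # and write it to the path referenced by the credentials_file variable.
      - name: Write GCP credentials
        run: echo "${{ secrets.GCP_SA_KEY }}" | base64 -d > credentials.json

      - uses: hashicorp/setup-terraform@v2

      # Prepare the working directory and point the state backend at a GCS bucket.
      - name: Terraform init
        run: terraform init -backend-config="bucket=${{ secrets.BUCKET_NAME }}"

      - name: Terraform plan
        run: terraform plan

      - name: Terraform apply
        run: terraform apply -auto-approve
```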

If you commit the files prepared so far and push them to the main branch, the GitHub workflow runs and provisions the services automatically. You can check the execution log on the GitHub Actions page.

The image is a screenshot from GitHub Actions UI showing “terraform” workflow steps.
Figure 2. GitHub Action workflow log (Image by Author)

Building the Streamlit Web App

In the previous sections, we prepared the infrastructure using Terraform and GitHub Actions. We now prepare the web app code that will run on the VM instance in GCP, using Streamlit and Docker.

As a simple example, we will create a dashboard for time-series data from Yahoo! Finance, showing daily return rates so that the user can compare different indices or foreign exchange rates. A dashboard user can also modify the date range.

The dashboard we are going to create is shown below:

The image is a screenshot of the web application created with Streamlit. The screen has four parts: one for selecting a currency or index, one for the from_date value, one for the to_date value, and the last shows a line chart based on the user's selections.
Figure 3. Streamlit Web App UI (Image by Author)

Python scripts

The app shown above can be created using Streamlit.

Firstly, we import three libraries. If they are not installed, please install them using the pip (or pip3) install command.

We specify the name of the web app in st.title(‘text’).

Then, we prepare the multiple-choice options. st.multiselect() creates a drop-down selection.

Streamlit offers the date_input widget, with which a date can be selected from a calendar. Here we create two variables: from_date and to_date.

Finally, we design the chart with calculated return values.

Streamlit dynamically reads the values in the selector variable, then retrieves the closing price data through the yfinance API. The growth rate is calculated using pandas' pct_change(). The final step is to present the data frame as a line chart with the st.line_chart() function.
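
Putting the pieces together, a minimal sketch of the whole script might look like this; the ticker symbols, labels and default dates are illustrative assumptions:

```python
import datetime

import pandas as pd
import streamlit as st
import yfinance as yf

st.title("Daily Return Dashboard")

# Symbols understood by Yahoo! Finance; the selection below is only an example.
options = {
    "S&P 500": "^GSPC",
    "NASDAQ Composite": "^IXIC",
    "USD/JPY": "JPY=X",
    "EUR/USD": "EURUSD=X",
}
selector = st.multiselect(
    "Select indices or FX rates", list(options.keys()), default=["S&P 500"]
)

# Date range pickers shown as calendar widgets.
from_date = st.date_input("From", datetime.date(2022, 1, 1))
to_date = st.date_input("To", datetime.date.today())

if selector:
    tickers = [options[name] for name in selector]
    # Daily closing prices for the selected symbols.
    close = yf.download(tickers, start=from_date, end=to_date)["Close"]
    if isinstance(close, pd.Series):
        # A single ticker comes back as a Series; normalise it to a DataFrame.
        close = close.to_frame(name=tickers[0])
    close = close.rename(columns={symbol: name for name, symbol in options.items()})
    # Daily growth (return) rate, plotted as a line chart.
    st.line_chart(close.pct_change())
```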

We save the script as app.py.

Dockerfile

A Dockerfile is a text file that contains all the commands needed to create an image. You can build an image using docker build and run it as a container with the docker run command.
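
A minimal sketch of such a Dockerfile, assuming the dependencies are listed in a requirements.txt file:

```dockerfile
# Build on the latest official Python image.
FROM python:latest

WORKDIR /app

# Install the three dependencies (pandas, yfinance, streamlit).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code into /app.
COPY app.py .

# Streamlit listens on port 8501 by default.
EXPOSE 8501

CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```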

In the above Dockerfile, we build our own image on top of the latest official Python Docker image. As we require three libraries to run the Streamlit app, we list the dependencies in a text file and run the pip install command. Streamlit's default port is 8501. Then, we copy the app.py file into the /app directory of the container and run the Streamlit web app.
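
Built and run locally, the commands would look roughly like this (the image and container names are arbitrary):

```bash
docker build -t streamlit-app .
docker run -d --name streamlit-app -p 8501:8501 streamlit-app
```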

Deploying the Web App with GitHub Actions

In the earlier section, we prepared a YAML file for the infrastructure provisioning in a GitHub repository. We must also create a repository and define a workflow to deploy a Docker container that runs the Streamlit web app on the provisioned VM instance in GCP.

First, we create a new private GitHub repository for the web app code, then repeat the earlier step to add the GCP service account credential to the GitHub Actions secrets.

Next, we prepare a personal access token in GitHub. The token is used to clone this repository from the VM instance in GCP.

On your GitHub page, go to Settings -> Developer settings -> Personal access tokens, then click Generate new token. On the New personal access token page, repo and workflow must be ticked; the workflow option allows you to update GitHub Actions workflows. You will see the generated token only once, so copy it somewhere safe.

The image is a screenshot of the GitHub page where you create a personal access token. In the Select scopes section, we ticked "repo" and "workflow".
Figure 4. New Personal Access Token (Image by Author)

We need to insert your username and personal access token into the repository URL. For example, https://github.com/your-user-name/your-repository.git becomes https://your-user-name:your-access-token@github.com/your-user-name/your-repository.git. Save the complete URL in the GitHub Actions secrets so that we can refer to it in the workflow YAML file.

Under the .github/workflows directory, we create the deploy_docker.yaml file. The workflow builds a Docker image from the Dockerfile and deploys it to the GCP virtual machine instance.

Under the jobs section, we set some variables and permissions.

Under the steps section, we define the actions. To run bash commands on the VM instance via SSH, we set up the gcloud CLI. In the commands, we clone the GitHub repository, build a Docker image, and run a container from the image.
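
A sketch of what deploy_docker.yaml might contain; the action versions, VM name, zone, image name and secret names are all assumptions:

```yaml
name: Build and Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    env:
      VM_NAME: streamlit-vm   # example instance name
      ZONE: us-central1-a     # example zone
    steps:
      - uses: actions/checkout@v3

      # Authenticate with the service-account key stored in the Actions secrets.
      - uses: google-github-actions/auth@v0
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      # Install the gcloud CLI so we can SSH into the VM.
      - uses: google-github-actions/setup-gcloud@v0

      # Clone the repository on the VM, rebuild the image and restart the container.
      - name: Deploy container on the VM
        run: |
          gcloud compute ssh "$VM_NAME" --zone "$ZONE" --command '
            rm -rf your-repository &&
            git clone ${{ secrets.REPO_URL_WITH_TOKEN }} &&
            cd your-repository &&
            sudo docker build -t streamlit-app . &&
            (sudo docker rm -f streamlit-app || true) &&
            sudo docker run -d --name streamlit-app -p 8501:8501 streamlit-app'
```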

When you push the files to the GitHub repository, the workflow runs and deploys the container to the VM. You can check the workflow log in the GitHub Actions UI.

The image is a screenshot from GitHub Actions UI showing “Build and Deploy” workflow steps.
Figure 5. GitHub Action workflow log (Image by Author)

Now you can modify app.py, push the revised code to the repository and confirm that the changes are applied to the VM.

Conclusion

This article addressed the pain point data specialists face when sharing valuable insights from the data analysis phase and proposed a way to present findings using a Streamlit web application. For deployment, CI/CD tools and services such as Terraform and GitHub Actions help data specialists accelerate prototyping by automating the workflow.

The example we used was a simple use case; however, Streamlit can do much more. We recommend you visit the Streamlit website to learn what it offers. Similarly, the Terraform Registry is full of useful resources that are actively updated by HashiCorp and the providers; it is worth checking the documentation for the provider of your interest to find other opportunities for workflow automation. Finally, GitHub Actions allows you to design more complex workflows. If you would like to use GitHub Actions beyond prototyping, it is highly recommended to read the official documentation.

