Build a Robust Workflow to Visualize Trending GitHub Repositories in Python

by Khuyen Tran


Get Updated with the Latest Trends on GitHub in Your Local Machine

The GitHub feed is a great way to follow what's trending in the community. You can discover useful repositories by looking at what your connections have starred.

Image by Author

However, the feed also contains repositories you don't care about. For example, you might only be interested in Python repositories, while your feed shows repositories written in many other languages. As a result, it can take a while to find useful libraries.

Wouldn't it be nice if you could create a personal dashboard showing the repositories your connections have starred, filtered by your favorite language, like the one below?

Image by Author

In this article, we will learn how to do exactly that with the GitHub API, Streamlit, and Prefect.

At a high level, we will:

  • Use the GitHub API to write scripts that pull the data from GitHub
  • Use Streamlit to create a dashboard displaying the statistics of the processed data
  • Use Prefect to schedule the scripts that get and process the data to run daily
Image by Author

If you want to skip the explanations and go straight to creating your own dashboard, view this GitHub repository:

To pull data from GitHub, you need your username and a personal access token. Next, create a file called .env to store your GitHub credentials.

Add .env to your .gitignore file to ensure that .env is not tracked by Git.

In the file get_data.py under the development directory, we will write code to load the credentials from .env using python-dotenv:
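A minimal sketch of that step, assuming the .env file defines GITHUB_USERNAME and GITHUB_TOKEN (the variable names are an assumption, not taken from the original repository):

import os
from dotenv import load_dotenv

# Load the variables defined in .env into the environment
load_dotenv()

# The variable names are assumptions; use whatever names you put in .env
USERNAME = os.getenv("GITHUB_USERNAME")
TOKEN = os.getenv("GITHUB_TOKEN")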

Next, we will use the GitHub API to get general information about the public repositories that appear in our GitHub feed. The GitHub API makes it easy to pull data from your GitHub account, including events, repositories, and more.

Image by Author
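As a sketch, the received-events endpoint of the GitHub REST API returns the public activity of the people you follow; the function name and the lack of pagination below are simplifying assumptions:

import requests

def get_general_info_of_repos(username: str, token: str) -> list:
    # Received events contain the public activity of the people you follow
    url = f"https://api.github.com/users/{username}/received_events"
    response = requests.get(url, auth=(username, token))
    response.raise_for_status()
    return response.json()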

We will use pydash to get the URLs of all starred repositories from a list of repositories:
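A sketch of that step, assuming star activity shows up as WatchEvent entries in the feed:

import pydash

def get_starred_repo_urls(events: list) -> list:
    # Keep only star events, then pull out the API URL of each repository
    starred_events = pydash.filter_(events, {"type": "WatchEvent"})
    return pydash.map_(starred_events, lambda event: event["repo"]["url"])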

Get the specific information of each starred repository such as language, stars, owners, pull requests, etc:
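A minimal sketch of this step: call the repository endpoint for each URL collected above (error handling omitted):

def get_specific_info_of_repos(urls: list, username: str, token: str) -> list:
    # Each repository URL returns details such as language, stars, and owner
    return [requests.get(url, auth=(username, token)).json() for url in urls]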

The information extracted from each repo should look similar to this.

Save the data to the local directory:
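For example, the raw repository information can be dumped to a JSON file (the path is an assumption):

import json
from pathlib import Path

def save_data(data: list, path: str = "data/repos.json") -> None:
    # Write the raw repository information to a local JSON file
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as file:
        json.dump(data, file)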

Put everything together:
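Combining the steps above into a single get_data function might look like this sketch:

def get_data():
    events = get_general_info_of_repos(USERNAME, TOKEN)
    urls = get_starred_repo_urls(events)
    repos = get_specific_info_of_repos(urls, USERNAME, TOKEN)
    save_data(repos)

if __name__ == "__main__":
    get_data()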

The current script works, but to make it resilient against failures, we will use Prefect to add observability, caching, and retries to the workflow.

Prefect is an open-source library that enables you to orchestrate your data workflow in Python.

Add Observability

Because it takes a while to run the file get_data.py, we might want to know which code is being executed and approximately how much longer we need to wait. We can add Prefect’s decorators to our functions to get more insight into the state of each function.

Specifically, we add the decorator @task to the functions that do one thing and add the decorator @flow to the function that contains several tasks.

In the code below, we add the decorator @flow to the function get_data since get_data contains all tasks.
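A sketch of the decorated functions (the task bodies are elided and stay the same as before):

from prefect import flow, task

@task
def get_general_info_of_repos(username: str, token: str) -> list:
    ...

@task
def get_starred_repo_urls(events: list) -> list:
    ...

@task
def get_specific_info_of_repos(urls: list, username: str, token: str) -> list:
    ...

@flow
def get_data():
    events = get_general_info_of_repos(USERNAME, TOKEN)
    urls = get_starred_repo_urls(events)
    repos = get_specific_info_of_repos(urls, USERNAME, TOKEN)
    save_data(repos)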

You should see the below output when running this Python script:

From the output, we know which tasks are completed and which ones are in progress.

Caching

In our current code, the function get_general_info_of_repos takes a while to run. If the function get_specific_info_of_repos fails, we need to rerun the entire pipeline and wait for get_general_info_of_repos to run again.

Image by Author

To reduce the execution time, we can use Prefect’s caching to save the results of get_general_info_of_repos in the first run and then reuse the results in the second run.

Image by Author

In the file get_data.py, we add caching to the tasks get_general_info_of_repos, get_starred_repo_urls, and get_specific_info_of_repos because they take quite a bit of time to run.

To add caching to a task, specify values for the arguments cache_key_fn and cache_expiration.
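For instance, caching one of the slow tasks might look like this sketch (task_input_hash builds the cache key from the task's inputs):

from datetime import timedelta

from prefect import task
from prefect.tasks import task_input_hash

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(days=1))
def get_general_info_of_repos(username: str, token: str) -> list:
    ...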

In the code above,

  • cache_key_fn=task_input_hash tells Prefect to use the cached results unless the inputs change or the cache has expired.
  • cache_expiration=timedelta(days=1) tells Prefect to refresh the cache after one day.

We can see that the run that doesn’t use caching (lush-ostrich) finished in 27s while the run that uses caching (provocative-wildebeest) finished in 1s!

Image by Author

Note: To view the dashboard with all runs like above, run prefect orion start.

Retries

There is a chance that we fail to pull the data from GitHub through the API, and we need to run the entire pipeline again.

Image by Author

Instead of rerunning the entire pipeline, it's more efficient to retry only the failed task a specified number of times, waiting a specified period between attempts.

Prefect allows you to automatically retry on failure. To enable retries, add retries and retry_delay_seconds parameters to your task.
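For example, retrying the API-calling task could look like this sketch:

from prefect import task

@task(retries=3, retry_delay_seconds=60)
def get_specific_info_of_repos(urls: list, username: str, token: str) -> list:
    ...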

In the code above:

  • retries=3 tells Prefect to rerun the task up to 3 times
  • retry_delay_seconds=60 tells Prefect to retry after 60 seconds. This functionality is helpful since we might hit the rate limit if we call the GitHub API continuously in a short amount of time.

In the file process_data.py under the directory development, we will clean up the data so that we get a table showing only what we are interested in.

Start by loading the data and keeping only the repositories written in a specific language:
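A sketch of this step, assuming the raw data was saved to data/repos.json earlier:

import json

def load_data(path: str = "data/repos.json") -> list:
    with open(path) as file:
        return json.load(file)

def filter_by_language(repos: list, language: str = "Python") -> list:
    # Keep only the repositories written in the language we care about
    return [repo for repo in repos if repo.get("language") == language]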

Image by Author

Next, we will keep only the repository fields we are interested in, which include:

  • Full name
  • HTML URL
  • Description
  • Stargazers count

Create a DataFrame from the dictionary then remove the duplicated entries:
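A sketch that keeps the fields listed above and then builds the table (the column names follow the GitHub API field names):

import pandas as pd

FIELDS = ["full_name", "html_url", "description", "stargazers_count"]

def create_dataframe(repos: list) -> pd.DataFrame:
    # Keep only the fields listed above, then drop duplicated repositories
    records = [{field: repo.get(field) for field in FIELDS} for repo in repos]
    return pd.DataFrame(records).drop_duplicates(subset="full_name")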

Put everything together:
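Combining the processing steps into a Prefect flow might look like this sketch (the output path is an assumption):

from prefect import flow

@flow
def process_data():
    repos = load_data()
    python_repos = filter_by_language(repos, "Python")
    df = create_dataframe(python_repos)
    df.to_csv("data/processed_repos.csv", index=False)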

Now comes the fun part: creating a dashboard to view the repositories and their statistics.

The structure of the directory for our app code will look similar to this:
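A rough sketch of the layout (the file names under pages and data are assumptions; only Visualize.py and the pages directory are mentioned in the article):

app
├── Visualize.py
└── pages
    └── 1_Filter_Repos.py
data
└── processed_repos.csv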

Visualize The Statistics of Repositories

The file Visualize.py will create the home page of the app, while the files under the directory pages create the child pages.

We will use Streamlit to create a simple app in Python. Let’s start with writing the code to show the data and its statistics. Specifically, we want to see the following on the first page:

  • A table of repositories filtered by language
  • A chart of the top 10 most popular repositories
  • A chart of the top 10 most popular topics
  • A word cloud chart of topics

Code of Visualize.py:
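The full script is not reproduced here; the following is a shortened sketch of what Visualize.py could contain, covering the table and the top-10 chart. The data path, column names, and chart library are assumptions:

import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Trending Repositories on My GitHub Feed")

# Load the processed data created by process_data.py
df = pd.read_csv("../data/processed_repos.csv")

# Table of repositories filtered by language
st.dataframe(df)

# Top 10 most popular repositories by star count
top_repos = df.nlargest(10, "stargazers_count")
st.plotly_chart(
    px.bar(top_repos, x="full_name", y="stargazers_count", title="Top 10 repositories")
)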

To view the dashboard, type:

cd app
streamlit run Visualize.py

Go to http://localhost:8501/ and you should see the following dashboard!

Image by Author

Filter Repositories Based on Their Topics

We get repositories with different topics, but we often only care about specific topics such as machine learning and deep learning. Let’s create a page that helps users filter repositories based on their topics.
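A sketch of the filter page, assuming the processed data also keeps a topics column containing the list of topics for each repository (the topics column is an assumption, since the fields kept earlier do not include it):

import ast

import pandas as pd
import streamlit as st

df = pd.read_csv("../data/processed_repos.csv")

# Topics are stored as strings such as "['deep-learning', 'spark']" in the CSV
df["topics"] = df["topics"].apply(ast.literal_eval)

all_topics = sorted({topic for topics in df["topics"] for topic in topics})
selected = st.multiselect("Choose topics", options=all_topics)

# Show only the repositories that contain every selected topic
if selected:
    mask = df["topics"].apply(lambda topics: set(selected).issubset(topics))
    st.dataframe(df[mask])
else:
    st.dataframe(df)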

And you should see a second page that looks like this. In the GIF below, I only see the repositories with the tags deep-learning, spark, and mysql after applying the filter.

Image by Author

If you want a daily update of the repositories in your GitHub feed, you probably don't want to run the scripts to get and process the data manually every day. Wouldn't it be nice if you could schedule your scripts to run automatically every day?

Image by Author

Let’s schedule our Python scripts by creating a deployment with Prefect.

Use Subflows

Since we want to run the flow get_data before running the flow process_data, we can put them under another flow called get_and_process_data inside the file development/get_and_process_data.py.
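A sketch of the parent flow, assuming get_data and process_data can be imported from the sibling scripts:

from prefect import flow

from get_data import get_data
from process_data import process_data

@flow
def get_and_process_data():
    # Run the two subflows in order: pull the raw data first, then process it
    get_data()
    process_data()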

Next, we will write a script to deploy our flow. We use IntervalSchedule to run the deployment every day.
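A sketch of the deployment script, based on the Prefect 2.0 beta API the article relies on (DeploymentSpec and IntervalSchedule); the deployment name is an assumption, the dev tag matches the work queue created below, and the script can be saved as development.py to match the deployment command used later:

from datetime import timedelta

from prefect.deployments import DeploymentSpec
from prefect.orion.schemas.schedules import IntervalSchedule

from get_and_process_data import get_and_process_data

DeploymentSpec(
    flow=get_and_process_data,
    name="github-trending-daily",  # the deployment name is an assumption
    schedule=IntervalSchedule(interval=timedelta(days=1)),
    tags=["dev"],  # matches the dev work queue created below
)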

To run the deployment, we will:

  • Start a Prefect Orion server
  • Configure storage
  • Create a work queue
  • Run an agent
  • Create the deployment

Start a Prefect Orion Server

To start a Prefect Orion server, run:

prefect orion start

Configure Storage

Storage saves your task results and deployments. Later, when you run a deployment, Prefect will retrieve your flow from storage.

To create storage, type:

prefect storage create

And you will see the following options on your terminal.

In this project, we will use temporary local storage.

Create a Work Queue

A work queue collects the scheduled runs of deployments so that agents can pick them up for execution.

To create a work queue, type:

prefect work-queue create --tag dev dev-queue
Image by Author

Output:

UUID('e0e4ee25-bcff-4abb-9697-b8c7534355b2')

The --tag dev tells the dev-queue work queue to only serve deployments that include a dev tag.

Run an Agent

An agent picks up scheduled runs from a specific work queue and makes sure they are executed.

Image by Author

To run an agent, type prefect agent start <ID of dev-queue>. Since the ID of dev-queue is e0e4ee25-bcff-4abb-9697-b8c7534355b2, we type:

prefect agent start 'e0e4ee25-bcff-4abb-9697-b8c7534355b2'

Create a Deployment

To create a deployment from the file development.py, type:

prefect deployment create development.py

You should see the new deployment under the Deployments tab.

Image by Author

Then click Run in the top right corner:

Image by Author

Then click Flow Runs on the left menu:

Image by Author

And you will see that your flow is scheduled!

Image by Author

Now the scripts to pull and process the data will run every day, and your dashboard shows the latest repositories on your local machine. How cool is that?

In the current version, the app and the Prefect agent run on your local machine, so they will stop working as soon as you turn your machine off.

To prevent this from happening, we can use a cloud service such as AWS or GCP to run the agent, store the database, and serve the dashboard.

Image by Author

In the next article, we will learn how to do exactly that.

