Get Updated with the Latest Trends on GitHub on Your Local Machine
The GitHub feed is a great way to follow what’s trending in the community. You can discover useful repositories by looking at what your connections have starred.
However, there might be some repositories you don’t care about. For example, you might only be interested in Python repositories, while your GitHub feed contains repositories written in other languages. Thus, it may take you a while to find useful libraries.
Wouldn’t it be nice if you could create a personal dashboard showing the repositories your connections starred, filtered by your favorite language, like the one below?
In this article, we will learn how to do exactly that with the GitHub API, Streamlit, and Prefect.
At a high level, we will:
- Use the GitHub API to write scripts that pull data from GitHub
- Use Streamlit to create a dashboard displaying the statistics of the processed data
- Use Prefect to schedule the scripts that get and process the data to run daily
If you want to skip the explanations and go straight to creating your own dashboard, view this GitHub repository:
To pull data from GitHub, you need your username and an access token. Next, create a file called .env to save your GitHub credentials.
Add .env to your .gitignore file to ensure that .env is not tracked by Git.
In the file process_data.py under the development directory, we will write code to access the information in the file .env using python-dotenv:
Next, we will use the GitHub API to get general information about the public repositories in our GitHub feed. The GitHub API lets us easily pull data from a GitHub account, including events, repositories, and more.
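As a sketch, the feed can be queried through the `received_events` endpoint of the GitHub REST API (the function names here are illustrative, not the article’s exact code):

```python
import requests  # pip install requests


def received_events_url(username: str) -> str:
    """Endpoint listing public events from the accounts `username` follows."""
    return f"https://api.github.com/users/{username}/received_events"


def get_feed_events(username: str, token: str, per_page: int = 100) -> list:
    """Fetch one page of events from the authenticated user's feed."""
    resp = requests.get(
        received_events_url(username),
        auth=(username, token),
        params={"per_page": per_page},
    )
    resp.raise_for_status()
    return resp.json()
```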
We will use pydash to get the URLs of all starred repositories from a list of repositories:
Get the specific information of each starred repository such as language, stars, owners, pull requests, etc:
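A sketch of this step; the field names follow the GitHub REST API’s repository schema, while the helper names are illustrative:

```python
import requests

# Fields to keep from each repository (GitHub REST API repository schema)
FIELDS = [
    "full_name",
    "html_url",
    "description",
    "language",
    "stargazers_count",
    "topics",
]


def pick_fields(repo: dict) -> dict:
    """Keep only the fields we care about from a repository payload."""
    return {field: repo.get(field) for field in FIELDS}


def get_repo_info(repo_url: str, auth: tuple) -> dict:
    """Fetch one repository from the API and extract the fields above."""
    resp = requests.get(repo_url, auth=auth)
    resp.raise_for_status()
    return pick_fields(resp.json())
```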
The information extracted from each repo should look similar to this.
Save the data to the local directory:
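For example, the results can be dumped to a JSON file (the data/repos.json path is just an assumption):

```python
import json
from pathlib import Path


def save_data(repos: list, path: str = "data/repos.json") -> None:
    """Write the collected repository info to a local JSON file."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    out.write_text(json.dumps(repos, indent=2))
```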
Put everything together:
The current script works, but to make sure that it is resilient against failures, we will use Prefect to add observability, caching, and retries to our current workflow.
Prefect is an open-source library that enables you to orchestrate your data workflow in Python.
Add Observability
Because it takes a while to run the file get_data.py, we might want to know which code is being executed and approximately how much longer we need to wait. We can add Prefect’s decorators to our functions to get more insight into the state of each function.
Specifically, we add the @task decorator to functions that do one thing and the @flow decorator to the function that contains several tasks.
In the code below, we add the @flow decorator to the function get_data since get_data contains all of the tasks.
You should see output like the following when running this Python script:
From the output, we know which tasks are completed and which ones are in progress.
Caching
In our current code, the function get_general_info_of_repos takes a while to run. If the function get_specific_info_of_repos fails, we need to rerun the entire pipeline and wait for get_general_info_of_repos to run again.
To reduce the execution time, we can use Prefect’s caching to save the results of get_general_info_of_repos on the first run and reuse them on subsequent runs.
In the file get_data.py, we add caching to the tasks get_general_info_of_repos, get_starred_repo_urls, and get_specific_info_of_repos because they take quite a bit of time to run.
To add caching to a task, specify values for the cache_key_fn and cache_expiration arguments.
In the code above:
- cache_key_fn=task_input_hash tells Prefect to reuse the cached results unless the inputs change or the cache has expired
- cache_expiration=timedelta(days=1) tells Prefect to refresh the cache after one day
We can see that the run that doesn’t use caching (lush-ostrich) finished in 27s, while the run that uses caching (provocative-wildebeest) finished in 1s!
Note: To view the dashboard with all runs like above, run prefect orion start.
Retries
There is a chance that we fail to pull the data from GitHub through the API, forcing us to run the entire pipeline again. Instead of rerunning the entire pipeline, it’s more efficient to retry only the failed task, a set number of times, after a set delay.
Prefect allows you to automatically retry on failure. To enable retries, add the retries and retry_delay_seconds parameters to your task.
In the code above:
- retries=3 tells Prefect to rerun the task up to 3 times
- retry_delay_seconds=60 tells Prefect to wait 60 seconds between retries. This is helpful since we might hit the rate limit if we call the GitHub API repeatedly in a short amount of time
In the file process_data.py under the directory development, we will clean up the data so that we get a table that only shows what we are interested in.
Start with loading the data and only saving the repositories that are written in a specific language:
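A sketch of this step, assuming the raw data was saved as JSON and each record has a language field:

```python
import json


def load_data(path: str = "data/repos.json") -> list:
    """Load the raw repository data saved by get_data.py (assumed path)."""
    with open(path) as f:
        return json.load(f)


def filter_by_language(repos: list, language: str = "Python") -> list:
    """Keep only the repositories written in the language we care about."""
    return [repo for repo in repos if repo.get("language") == language]
```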
Next, we will keep only the repository information we are interested in, which includes:
- Full name
- HTML URL
- Description
- Stargazers count
Create a DataFrame from the dictionary then remove the duplicated entries:
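For example, with pandas:

```python
import pandas as pd

# Toy records; the real dictionary comes from the processing step above
records = [
    {"full_name": "a/x", "stargazers_count": 10},
    {"full_name": "a/x", "stargazers_count": 10},  # duplicate entry
    {"full_name": "b/y", "stargazers_count": 5},
]

# Build the DataFrame, then drop rows that repeat the same repository
df = pd.DataFrame(records).drop_duplicates(subset="full_name").reset_index(drop=True)
```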
Put everything together:
Now comes the fun part: creating a dashboard to view the repositories and their statistics.
The structure of the directory for our app code will look similar to this:
Visualize The Statistics of Repositories
The file Visualize.py creates the home page of the app, while the files under the pages directory create the child pages.
We will use Streamlit to create a simple app in Python. Let’s start with writing the code to show the data and its statistics. Specifically, we want to see the following on the first page:
- A table of repositories filtered by language
- A chart of the top 10 most popular repositories
- A chart of the top 10 most popular topics
- A word cloud chart of topics
Code of Visualize.py:
To view the dashboard, type:
cd app
streamlit run Visualize.py
Go to http://localhost:8501/ and you should see the following dashboard!
Filter Repositories Based on Their Topics
We get repositories with different topics, but we often only care about specific topics such as machine learning and deep learning. Let’s create a page that helps users filter repositories based on their topics.
And you should see a second page that looks like this. In the GIF below, I only see the repositories with the tags deep-learning, spark, and mysql after applying the filter.
If you want a daily update of the repositories in your GitHub feed, you probably don’t want to run the scripts to get and process the data manually every day. Wouldn’t it be nice if you could schedule your script to run automatically every day?
Let’s schedule our Python scripts by creating a deployment with Prefect.
Use Subflows
Since we want to run the flow get_data before running the flow process_data, we can put them under another flow called get_and_process_data inside the file development/get_and_process_data.py.
Next, we will write a script to deploy our flow. We use IntervalSchedule to run the deployment every day.
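A sketch of the deployment script, using the Prefect 2 beta (Orion) API that the prefect deployment create command below expects; note that later Prefect releases replaced DeploymentSpec, so treat this as era-specific:

```python
from datetime import timedelta

from prefect.deployments import DeploymentSpec
from prefect.orion.schemas.schedules import IntervalSchedule

from get_and_process_data import get_and_process_data

DeploymentSpec(
    flow=get_and_process_data,
    name="github-feed-daily",  # hypothetical deployment name
    schedule=IntervalSchedule(interval=timedelta(days=1)),
    tags=["dev"],  # matches the dev work queue created below
)
```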
To run the deployment, we will:
- Start a Prefect Orion server
- Configure storage
- Create a work queue
- Run an agent
- Create the deployment
Start a Prefect Orion Server
To start a Prefect Orion server, run:
prefect orion start
Configure Storage
Storage saves your task results and deployments. Later when you run a deployment, Prefect will retrieve your flow from the storage.
To create storage, type:
prefect storage create
And you will see the following options on your terminal.
In this project, we will use temporary local storage.
Create a Work Queue
A work queue collects the scheduled runs of matching deployments so that agents can pick them up for execution.
To create a work queue, type:
prefect work-queue create --tag dev dev-queue
Output:
UUID('e0e4ee25-bcff-4abb-9697-b8c7534355b2')
The --tag dev flag tells the dev-queue work queue to serve only deployments that include a dev tag.
Run an Agent
Each agent makes sure that the deployments in a specific work queue are executed.
To run an agent, type prefect agent start <ID of dev-queue>. Since the ID of the dev-queue is e0e4ee25-bcff-4abb-9697-b8c7534355b2, we type:
prefect agent start 'e0e4ee25-bcff-4abb-9697-b8c7534355b2'
Create a Deployment
To create a deployment from the file development.py, type:
prefect deployment create development.py
You should see the new deployment under the Deployments tab.
Then click Run in the top right corner:
Then click Flow Runs on the left menu:
And you will see that your flow is scheduled!
Now the script to pull and process data will run every day, and your dashboard will show the latest repositories right on your local machine. How cool is that?
In the current version, the app and the Prefect agent run on the local machine, so they will stop working if you turn off your machine.
To prevent this from happening, we can use a cloud service such as AWS or GCP to run the agent, store the database, and serve the dashboard.
In the next article, we will learn how to do exactly that.