
The Simplest Possible Analytics Architecture You Can Set Up in Less Than a Day

By Cameron Warren | September 2022

Set up and deploy a database, an ETL server, and automated reporting from scratch in a few hours all from your laptop.

The unbundling of analytics and database tools is creating massive confusion in the market.

Businesses want simple, impactful reporting and analytics but the journey from ‘I need analytics’ to actually getting it has become more challenging to navigate over the last several years.

Photo by Negative Space: https://www.pexels.com/photo/blue-and-green-pie-chart-97080/

Fortunately, I’ve personally built dozens of analytics programs across multiple companies. In this article, I’m going to describe the simplest possible approach that I know of to set up and deploy a database and reporting architecture.

This approach is so simple, even non-analytics/non-technical individuals could use it to get started with their own simple data stack.

By the end of this article you will…

  • Know how to set up a simple and complete analytics and database architecture from scratch.
  • Know how to save yourself, your team, or your business thousands of dollars in software and data processing costs.
  • Wildly impress your boss and your peers, especially if analytics isn’t even your job.

Step 1. Select a Cloud Provider

Step 1 is selecting a cloud provider to manage your data processing (your ETL server and database).

There are many options here, but it's easiest to stick with one of the three big ones: Google Cloud, AWS, or Azure.

My personal recommendation is Google Cloud. In my opinion it has the simplest UI and will make setup as painless as possible.

Head to cloud.google.com and click ‘Console’ in the upper right corner. If you’re not signed in yet, you’ll be asked to sign in with a Google account. You can use whatever account you want.

And that’s it. You’re done with step 1!

Step 2. Create a database

In the Google Cloud Console, open the navigation menu in the top left and scroll down to where it says ‘Databases’ — click the entry that says ‘SQL.’

There are other, possibly more popular options here — such as Google BigQuery — but we’re keeping things as simple as possible, so we’ll use the plain SQL option.

On the next screen you’ll be asked to add a billing account. Don’t stress too much here, as your costs will be extremely low: Google Cloud charges based on the tier you select and the resources you actually use, and at this scale both will be very small. Google Cloud also gives first-time users $300 of free credit.

Select ‘Create Instance’ and then choose PostgreSQL. Why Postgres and not MSSQL? It really doesn’t matter too much — but I’ve personally found Postgres the simplest to work with for data operations, writing SQL, doing analysis, and ingesting data. There are subtle differences between the two that give Postgres a slight edge.

On the next screen, enable the API — which will allow Google Cloud to manage the instance through the Cloud portal.

On the next page you’ll complete the setup by entering an Instance ID and Password. These can be whatever you want.

For this guide, select PostgreSQL 12 (rather than the latest version) and choose ‘Production’ as your ‘configuration to start with.’ For the region, pick the one you live in and will be connecting to the database from. Then select ‘Single Zone’ for now.

Under ‘Machine Type,’ change the setting to ‘Lightweight,’ since we’re keeping things small and simple. Then reduce the storage to just 10GB. This will save you significant money, and GCP will let you scale up as much as you need later.

Finally, hit ‘Create Instance’ on the bottom of the page.


After a few minutes your database will be live!
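
If you'd rather script this step than click through the console, the gcloud CLI can create an equivalent instance. Here's a minimal sketch: the instance name analytics-db and the us-central1 region are just illustrative choices, and db-g1-small is a small shared-core tier in the spirit of the console's 'Lightweight' preset.

gcloud sql instances create analytics-db \
    --database-version=POSTGRES_12 \
    --tier=db-g1-small \
    --region=us-central1 \
    --availability-type=zonal \
    --storage-size=10
# you can set (or reset) the default user's password afterwards:
gcloud sql users set-password postgres --instance=analytics-db --password=[YOUR PASSWORD]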

Now let’s connect.

I like to use a database tool called DBeaver. You can download it for free from the DBeaver website.

Open DBeaver and click the plug icon in the top left to add a new connection.

On the ‘Connect to a database’ screen, double-click PostgreSQL.

You can get the credentials you’ll need to connect to your new database from Google Cloud.

The overview page for the instance you just created shows its public IP address.

You’ll also need your username and password. The username is listed under the ‘Users’ tab on the left-hand side of the screen, and the password is the one you set during the database creation process. If you forgot the password, you can reset it on the same screen.

Before you can connect, you’ll need to whitelist your IP address. Under the ‘Connections’ tab, find the ‘Authorized networks’ section, click ‘Add Network,’ add your IP address, and give it a name. Then hit ‘Save.’ This allows you to log in to the database from that specific IP address, so I recommend doing it from your primary place of work or your home.
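
If you prefer the command line, the same whitelisting can be done with gcloud. A sketch, assuming a hypothetical instance named analytics-db and that your public IP is 203.0.113.7:

gcloud sql instances patch analytics-db --authorized-networks=203.0.113.7/32
# note: this flag replaces the whole authorized-networks list, so include every address you still need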

Once that’s done, enter the required credentials into DBeaver and hit ‘Test Connection.’ If DBeaver reports success, you’ve connected!
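
If you'd rather sanity-check the connection from the command line instead of DBeaver, psql works too. A sketch, substituting your instance's public IP (psql will prompt for the password you set):

psql "host=[YOUR INSTANCE IP] port=5432 dbname=postgres user=postgres"
# 'postgres' is the default database and admin user on a new Cloud SQL for PostgreSQL instance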

Now let’s get some data into your database.

Step 3. Set up an ETL Tool or ETL Server

There are dozens of options for getting data into your database. If you have very custom needs, you will probably want an ETL server where you can host and execute Python (or similar) scripts. If your needs are simpler (like pulling LinkedIn, Facebook, or similar marketing data), you can use something like Fivetran. I will walk through how to set up both.

Virtual Machine

Go back to Google Cloud and open the navigation menu in the top left of the screen. Then scroll down and hover over ‘Compute Engine.’ At the very top of the submenu, click ‘VM instances.’

Click on ‘Create Instance.’

On the next screen, give the VM a name and, under ‘Machine type,’ select e2-small. You can upgrade this at any time, so always start as small as possible.

Under ‘Firewall,’ check ‘Allow HTTP traffic’ and ‘Allow HTTPS traffic.’

Under ‘Boot Disk,’ you will see the image is set to Debian as the default operating system. You can change this by clicking ‘Change.’ I prefer Ubuntu out of familiarity, but use whichever you’re most comfortable with.

There will be several OS options with different versions; I recommend picking the latest version of whatever OS you decide to use.

Once that is set, scroll to the bottom and click ‘Create.’
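
The same VM can also be created from the command line. A sketch, assuming a hypothetical name etl-server, an Ubuntu 22.04 image, and the us-central1-a zone:

gcloud compute instances create etl-server \
    --zone=us-central1-a \
    --machine-type=e2-small \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --tags=http-server,https-server
# the two tags correspond to the 'Allow HTTP traffic' and 'Allow HTTPS traffic' checkboxes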

After a few seconds, you should see a green checkmark on the next screen indicating your server is now live.

To set up access to your new VM, click on the VM, then click the dropdown next to where it says ‘SSH’ and choose ‘Open in browser window.’

This will open a new browser window where you will be connected to the VM.

Connecting this way every time is inefficient, so I recommend creating SSH keys so you can connect using PowerShell on Windows or Terminal on a Mac.

To do this on a Mac, open Terminal and type

cd ~/.ssh/

This takes you to the folder where SSH keys live and where you’ll store the key pair for your new VM. Now type

ssh-keygen

The command will prompt you for a file name (and an optional passphrase). I usually use the name of the VM I just created, but you can use whatever you want.

Two files will be created: one called [name] (the name you just entered) and another called [name].pub

To let your local computer connect to the VM, you need to add the public key (the contents of the .pub file) to the new virtual machine.

You can do this by opening your new VM (per the instructions above) and typing

cd ~/.ssh/

You’ll now be in the SSH key folder of your virtual machine.

Add the public key (found inside the .pub file) to the authorized keys file by typing

echo [PUBLIC KEY] >> authorized_keys
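
Because the key is one long line containing spaces, it's safest to wrap it in quotes when you paste it. The key below is a made-up placeholder:

echo "ssh-ed25519 AAAAC3Nza...restofkey... you@your-laptop" >> ~/.ssh/authorized_keys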

Back on your local machine type in

vim ~/.ssh/config 

Here you’ll need to enter the host information for your new connection. This is especially important if you have more than one key-pair stored on your machine.

In the config file, add the following:

Host [whatever you want]
HostName [External IP Address]
User [username it shows on the VM]
IdentityFile ~/.ssh/[name of the private key - the file without .pub]
ServerAliveInterval 50

*Note that you can find the ‘External IP Address’ of your VM in the Google Cloud Console by going to Compute Engine > VM instances and selecting your VM.

Hit Shift+Z twice (ZZ) to save and exit vim.

To connect to your VM from your local machine, just type ‘ssh [Host]’ into Terminal, using the Host name you chose in the config file, and you should be connected.

Don’t fret if this doesn’t work on the first try. If you’re having issues feel free to message me or check out this article for additional guidance: https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys-on-ubuntu-1804

You now have a virtual machine that you can use to write and deploy Python scripts and set up automated cron jobs that move or use data in interesting ways.
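
For example, a cron entry on the VM could run a (hypothetical) ingestion script every morning and append its output to a log:

crontab -e
# then add a line like this to run the script daily at 06:00 server time:
0 6 * * * /usr/bin/python3 /home/your_user/pull_data.py >> /home/your_user/pull_data.log 2>&1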

For an example of what you could do, check out my article on How to Pull data from an API using Python Requests.

ETL Tool

There are dozens of ETL tools on the market with out-of-the-box connectors to many of the most popular applications, websites, and APIs (usually for around $150 a month or less to start).

In this example, I’ll pull some LinkedIn advertising metrics for my company using an ETL tool called Fivetran.

Tools like Fivetran make ETL largely trivial.

Pick a data source and a destination (in this case, Postgres), add your credentials, and follow the instructions. Don’t forget to whitelist the IP addresses on your database for whatever third-party tools you use. Tools like Fivetran have extensive help pages listing the IP addresses you need to allow for the software to function correctly. (You can whitelist IPs by following the instructions above, where I show how to whitelist your client IP for DBeaver.)

Tools like Fivetran will do a complete sync to your database without you even needing to set up merge criteria or configure a data model.

The data will automatically sync into pre-modeled tables in your Postgres database on whatever cadence you set.

Once the first sync completes, the new schemas and tables show up in your database alongside anything else you’ve created.

Step 4. Set up Reporting

In just three steps you’re already 80% of the way to having automated reporting and analytics.

You didn’t even need Hadoop, a Snowflake instance, dbt, or any other fancy enterprise-level tool to do it.

Now on to the fun part.

There are many different BI and reporting solutions on the market. Tableau is the most popular but is expensive and time consuming to set up.

MS Power BI is cheap, and your company is likely already paying for it through its Office 365 subscription.

But by far the best and simplest option for getting started is Google Data Studio.

Go to Google Data Studio and login using the same login you used to access the Google Cloud console. Hit ‘Create’ in the top left and select ‘Data Source.’

You’ll see dozens of different options. Type ‘postgres’ in the search box and select ‘PostgreSQL.’

In the top left you can give your data source a name. Just use something that you’ll remember or the same name you used for your database.

On the next screen you’ll need to once again add your database credentials and authenticate your database. First, whitelist the IP addresses Data Studio uses to connect (you can do this by following the same instructions from Step 2 for whitelisting your own PC/laptop).

After you input your credentials you’ll see your schemas and tables as reflected in your database.

From here you can select a table, or you can select the ‘CUSTOM QUERY’ option and write SQL to get the data pre-aggregated the way you want.
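
For example, you might test a pre-aggregation query from the command line before pasting it into the ‘CUSTOM QUERY’ box. A sketch, assuming a hypothetical linkedin_ads.campaign_stats table created by your ETL tool:

psql "host=[YOUR INSTANCE IP] dbname=postgres user=postgres" -c "
SELECT day, SUM(impressions) AS impressions, SUM(clicks) AS clicks, SUM(cost) AS cost
FROM linkedin_ads.campaign_stats
GROUP BY day
ORDER BY day;"

Only the aggregated rows travel to Data Studio, which keeps the dashboard fast even as the raw tables grow.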

On the next page, you’ll be able to format your data fields based on the custom query or table you’re pulling in. This is a good chance to convert certain fields to be text, numeric, currency, or date fields depending on the context of your data.

When you’re done formatting your fields, click ‘Create Report’ in the top right.

At this point you can start turning your freshly ingested data into actual reports.

Depending on the context of your data, there are literally hundreds of options for displaying, aggregating, and formatting your data.

I built a LinkedIn Ads dashboard in just a few minutes.

When you’re finished editing your reports you can click ‘View’ to see what it looks like. Then you can share it out to members of your team or any relevant stakeholders directly from the same screen.

Voila! You’ve just created an automated dashboard.

As long as your data source is being updated consistently in the database (by Fivetran or your custom ETL scripts), the report will automatically pull in fresh data whenever someone opens it.

What’s Next?

Congratulations! If you’ve followed this guide to completion, you’ve successfully set up a full-blown analytics architecture.

From here you can add new data sources with an ETL tool or Python scripts, build more and better dashboards, or do some analysis with SQL.

As your data grows, you can use Google Cloud (or whichever provider you chose) to quickly and easily scale your database.

Thanks for reading! If you enjoyed this guide, follow me here on Medium for more articles on analytics and getting value from data.

If you want to work with me on a project, or get help setting up single-source-of-truth analytics for your team or business, reach out to me directly.

