Techno Blender
Digitally Yours.

When Spreadsheets Aren’t Good Enough: A Lesson in Relational Databases | by Stephanie Lo | May, 2023

0 30


Now that we’ve gotten to grips with some definitions, let’s apply these normal form rules in a working example to let these concepts sink in.

Tutorial: Segmenting audiences in a Canadian-based retailer

Photo by charlesdeluvio on Unsplash

For this example, I used MySQL to carry out the queries, where each script should be adapted should you want to follow along with an alternate SQL server such as PostgreSQL or Oracle. If you are newer to MySQL I would suggest reading these articles here for installation instructions and here to get started. Just a caveat that the follow-along code assumes some foundational SQL knowledge.

The following data is not based on actual e-commerce/subscription data and should only be considered for demonstrative purposes only. Synthetic data has been created in order to explore the concepts discussed in the article (in a real life enterprise, you would be working with thousands of rows of data and not just twenty rows). To follow along with the tutorial please check out my Github repository here, which includes a complete transformation of the data including data creation, data transformation and queries.

Database Normalization in practice

Let’s say you’ve just been hired as an analyst at an e-commerce retailer. Your company is running an important sale and you’ve been assigned to carry out the following task:

Separate out our current customers and email subscribers so that we can direct different email offers to each segment in order to drive purchases during our sale period.

When you access the database, a few immediate things stand out. Most notably, email subscriptions and customer information are housed in separate tables without any established relationship, which poses a challenge in carrying out the assigned task. However, there is a silver lining in that you have relatively complete data, and with a few tweaks relationships can be established between the two to carry out the task. Here’s a breakdown of the current database:

The tables outlined above contain a multitude of variables relevant to the customer and email subscribers, including their geographic location, the emails they have previously received, as well as associated email metrics.

First Normal Form (1NF)

The first thing we can do is to take a look at the data and data types contained in each column after we have uploaded each file onto an SQL server. We can do this by performing the following queries:

It looks like we have a long way to go to conform to the first normal form! Recall that the requirements to achieve first normal form concerns proper row identification and grouping data correctly. Our current set up is in violation of all three rules of first normal form, which is mainly concentrated in the email_newsletter table:

  • Currently email_newsletter doesn’t have a primary key. This forces a database user to search for each subscriber using their email address, which is not considered good practice due to the possibility of users updating their email address and privacy concerns.
  • The data in email_newsletter is not in its most reduced form. Upon further analysis, column contents can be separated into 2 columns, such as subject_line and promo_code.
  • In email_newsletter, the repeating group is the email send and content. One user can receive many emails, so the best course is to separate this from the table.

To solve our first, we can add a unique primary key to email_newsletter in the first column of the table and run the following query, where email_sign_up_id is the column name and our data type is INTEGER :

With this query, we added a unique identifying primary key that automatically increments with each new data record.

For the second revision, we need to split out contents into their respective field of subject_line and promo_code, we see that as a general pattern each field is separated by a comma which we can use to split out each one:

Our result is that we have each column separated out into its most reduced form so we are able to see the categories of each column more clearly.

Lastly, we can remove repeating groups by organizing all of the email content-related data into a new table named email_distribution , don’t forget to add the data in from the original tables! (See GitHub here for full SQL including data loading):

With the query above, we have separated out the email contents into a separate table, to make sure that we don’t lose any information. We now need to connect this to email_newsletter table using a foreign key using the following queries:

This helps us conform to the first normal form, where as a final step we can drop any redundant columns and look at our layout before moving onto the second:

Second Normal Form (2NF)

To adhere to the rules of the second normal form, we need to achieve conformity of the first normal form ✓, establish relationships between tables and ensure non-key columns depend on the primary key. Here’s how we can approach transforming our set up into second normal form:

  • The customer table is currently disconnected from the other two tables, where there are missing customers who have given their email consent in the column email_opt_in who are currently not being reached.
  • There are unconnected columns to the primary key in email_newsletter, where email metrics such as email_open and click_through are not super relevant to their personal information, so we could consider separating these out.

We want to establish a relationship between the customer and email_distribution table in order to have our two segments of email prospects and current customers. However, if we drew a connection as it stands it may seem messy since one customer can receive many emails and disturb the primary key. Therefore, a logical intermediary step is to create a new table that can be placed between the two which we will name customer_newsletter_metrics to allow for these attributes to live in a table of their own:

This table helps to connect links between our entire database since we already stated the foreign keys in the creation of the table, with each column in the table being relevant and dependent on the primary key.

The set up we have now is almost complete but a little unbalanced — since we have an established metrics table for customers but not one for email prospects. By applying the same treatment to this group this could help mitigate future issues since if a prospect’s email changes it would require less revisions if this information lived in a separate table (again, please follow along with the full code on GitHub here).

After having renamed the email_newsletter to email_newsletter_metrics and created our email_prospect table here is our database after the 2NF transformation:

Third Normal Form (3NF)

The final step of our example is to follow third normal form rules which consist of completing first and second normal form rules ✓ and lastly, splitting out non-key columns that are dependent on one another.

Overall, we’ve done a good job so far to mitigate interdependency between non-key columns although, if we were to take a closer look at the customer table we can see city and postal_code are dependent on another since the postal code is related to where the customer lives, but is also related to where the city where the customer resides. This may present problems in the long run in situations where the customer moves, one column may be updated but maybe not the other.

With this being said we can create a new table named postal_code which separates out postal_code and city from the customer table and form connections between these two tables:

The last block of code above now completes our normalization process (yay!) where our final transformed database can be visualized below:

Our initial ask revisited: using a relational database

So we have our relational database ready through conforming to the first, second and third normal forms, why don’t we check out the original request that was given to us:

Separate out our current customers and email subscribers so that we can direct different email offers to each segment in order to drive purchases during our sale period.

Let’s form two simple query statements to grab the information of each segment:

The queries will result in two tables, and have an output like the one below:

Now you’re ready to send out your next sale email with confidence that there is no duplication or missed contacts between the tables. What’s more, we’ve structured the database in such a way that makes it easier for data analysis and ready and equipped for more customers.

Some closing thoughts

By recognizing pain points in your daily work routine such as a slow or complicated database, this article aims to offer implementation of a relational database as a solution. By using techniques like first to third normal form, you can break down your data into manageable chunks that are easier to work with and less prone to errors.

As a friendly final reminder, for a complete breakdown of the code used please visit my GitHub repository here. I hope this article may have inspired a newfound interest in relational databases ✨. Thanks for reading and following along!


Now that we’ve gotten to grips with some definitions, let’s apply these normal form rules in a working example to let these concepts sink in.

Tutorial: Segmenting audiences in a Canadian-based retailer

Photo by charlesdeluvio on Unsplash

For this example, I used MySQL to carry out the queries, where each script should be adapted should you want to follow along with an alternate SQL server such as PostgreSQL or Oracle. If you are newer to MySQL I would suggest reading these articles here for installation instructions and here to get started. Just a caveat that the follow-along code assumes some foundational SQL knowledge.

The following data is not based on actual e-commerce/subscription data and should only be considered for demonstrative purposes only. Synthetic data has been created in order to explore the concepts discussed in the article (in a real life enterprise, you would be working with thousands of rows of data and not just twenty rows). To follow along with the tutorial please check out my Github repository here, which includes a complete transformation of the data including data creation, data transformation and queries.

Database Normalization in practice

Let’s say you’ve just been hired as an analyst at an e-commerce retailer. Your company is running an important sale and you’ve been assigned to carry out the following task:

Separate out our current customers and email subscribers so that we can direct different email offers to each segment in order to drive purchases during our sale period.

When you access the database, a few immediate things stand out. Most notably, email subscriptions and customer information are housed in separate tables without any established relationship, which poses a challenge in carrying out the assigned task. However, there is a silver lining in that you have relatively complete data, and with a few tweaks relationships can be established between the two to carry out the task. Here’s a breakdown of the current database:

The tables outlined above contain a multitude of variables relevant to the customer and email subscribers, including their geographic location, the emails they have previously received, as well as associated email metrics.

First Normal Form (1NF)

The first thing we can do is to take a look at the data and data types contained in each column after we have uploaded each file onto an SQL server. We can do this by performing the following queries:

It looks like we have a long way to go to conform to the first normal form! Recall that the requirements to achieve first normal form concerns proper row identification and grouping data correctly. Our current set up is in violation of all three rules of first normal form, which is mainly concentrated in the email_newsletter table:

  • Currently email_newsletter doesn’t have a primary key. This forces a database user to search for each subscriber using their email address, which is not considered good practice due to the possibility of users updating their email address and privacy concerns.
  • The data in email_newsletter is not in its most reduced form. Upon further analysis, column contents can be separated into 2 columns, such as subject_line and promo_code.
  • In email_newsletter, the repeating group is the email send and content. One user can receive many emails, so the best course is to separate this from the table.

To solve our first, we can add a unique primary key to email_newsletter in the first column of the table and run the following query, where email_sign_up_id is the column name and our data type is INTEGER :

With this query, we added a unique identifying primary key that automatically increments with each new data record.

For the second revision, we need to split out contents into their respective field of subject_line and promo_code, we see that as a general pattern each field is separated by a comma which we can use to split out each one:

Our result is that we have each column separated out into its most reduced form so we are able to see the categories of each column more clearly.

Lastly, we can remove repeating groups by organizing all of the email content-related data into a new table named email_distribution , don’t forget to add the data in from the original tables! (See GitHub here for full SQL including data loading):

With the query above, we have separated out the email contents into a separate table, to make sure that we don’t lose any information. We now need to connect this to email_newsletter table using a foreign key using the following queries:

This helps us conform to the first normal form, where as a final step we can drop any redundant columns and look at our layout before moving onto the second:

Second Normal Form (2NF)

To adhere to the rules of the second normal form, we need to achieve conformity of the first normal form ✓, establish relationships between tables and ensure non-key columns depend on the primary key. Here’s how we can approach transforming our set up into second normal form:

  • The customer table is currently disconnected from the other two tables, where there are missing customers who have given their email consent in the column email_opt_in who are currently not being reached.
  • There are unconnected columns to the primary key in email_newsletter, where email metrics such as email_open and click_through are not super relevant to their personal information, so we could consider separating these out.

We want to establish a relationship between the customer and email_distribution table in order to have our two segments of email prospects and current customers. However, if we drew a connection as it stands it may seem messy since one customer can receive many emails and disturb the primary key. Therefore, a logical intermediary step is to create a new table that can be placed between the two which we will name customer_newsletter_metrics to allow for these attributes to live in a table of their own:

This table helps to connect links between our entire database since we already stated the foreign keys in the creation of the table, with each column in the table being relevant and dependent on the primary key.

The set up we have now is almost complete but a little unbalanced — since we have an established metrics table for customers but not one for email prospects. By applying the same treatment to this group this could help mitigate future issues since if a prospect’s email changes it would require less revisions if this information lived in a separate table (again, please follow along with the full code on GitHub here).

After having renamed the email_newsletter to email_newsletter_metrics and created our email_prospect table here is our database after the 2NF transformation:

Third Normal Form (3NF)

The final step of our example is to follow third normal form rules which consist of completing first and second normal form rules ✓ and lastly, splitting out non-key columns that are dependent on one another.

Overall, we’ve done a good job so far to mitigate interdependency between non-key columns although, if we were to take a closer look at the customer table we can see city and postal_code are dependent on another since the postal code is related to where the customer lives, but is also related to where the city where the customer resides. This may present problems in the long run in situations where the customer moves, one column may be updated but maybe not the other.

With this being said we can create a new table named postal_code which separates out postal_code and city from the customer table and form connections between these two tables:

The last block of code above now completes our normalization process (yay!) where our final transformed database can be visualized below:

Our initial ask revisited: using a relational database

So we have our relational database ready through conforming to the first, second and third normal forms, why don’t we check out the original request that was given to us:

Separate out our current customers and email subscribers so that we can direct different email offers to each segment in order to drive purchases during our sale period.

Let’s form two simple query statements to grab the information of each segment:

The queries will result in two tables, and have an output like the one below:

Now you’re ready to send out your next sale email with confidence that there is no duplication or missed contacts between the tables. What’s more, we’ve structured the database in such a way that makes it easier for data analysis and ready and equipped for more customers.

Some closing thoughts

By recognizing pain points in your daily work routine such as a slow or complicated database, this article aims to offer implementation of a relational database as a solution. By using techniques like first to third normal form, you can break down your data into manageable chunks that are easier to work with and less prone to errors.

As a friendly final reminder, for a complete breakdown of the code used please visit my GitHub repository here. I hope this article may have inspired a newfound interest in relational databases ✨. Thanks for reading and following along!

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
Leave a comment