
Digital Marketing Analysis simultaneously with Python and MySQL | by Gonçalo Guimarães Gomes | Mar, 2023



A digital marketing analytics exercise with explained step-by-step code in both SQL and Python environments

Photo by Zdeněk Macháček on Unsplash

Along this brief journey, we will explore a short and simple dataset containing basic website marketing metrics such as ‘users’, ‘sessions’ and ‘bounces’ over a five-month period.

The purpose of this exercise, rather than an in-depth assessment of website performance, is to build the basic but useful knowledge needed to answer a number of must-have operational marketing questions.

We’ll work with two powerful and widely used digital tools, exploring two routes that lead to the same results at the end of the day.

We will explore the syntax of MySQL Workbench through a diverse set of queries and, in parallel for each question, the equivalent Python syntax with graphic and visual resources. The two environments are labelled # MySQL and # Python, respectively, and each question includes notes and explanations of both code versions for a deeper understanding.

# MySQL

-- displaying dataset (case_sql.csv)
SELECT * FROM case_sql;
Image by author.

You can download the SQL dataset here.

# Python

# import python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
%matplotlib inline
color = sns.color_palette()
from pandas.plotting import table
from datetime import datetime
# load the data set
df = pd.read_csv("case.csv", sep=";")

# number of observations and columns
df.shape
(31507, 7)

# display rows sample
df.sample(15)

Image by author.
# SHAPE
# Create a function that prints the shape of the dataframe and some other basic info
# --> number of observations, features, duplicates, missing values (True, False) datatypes and its proportion.

def shape_df(df):
    print(f"Number of observations: {df.shape[0]}")
    print(f"Number of variables: {df.shape[1]}")
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print(f"Are there any missing values? {df.isnull().values.any()}\n-----")
    print(f"{df.dtypes.sort_values(ascending=True)}\n-----")
    print(f"Datatypes' proportion:\n{df.dtypes.value_counts(ascending=True)}")

# calling the function
shape_df(df)

Number of observations: 31507
Number of variables: 7
Number of duplicates: 4083
Are there any missing values? False
-----
date int64
users int64
sessions int64
bounces int64
brand object
device_category object
default_channel_grouping object
dtype: object
-----
Datatypes' proportion:
object 3
int64 4
dtype: int64

# lowering columns' capital letters for easy typing
df.columns = map(str.lower, df.columns)

# remove spaces in columns name
df.columns = df.columns.str.replace(' ','_')

Convert dates into datetime type

# make string version of original column 'date', call it 'date_'
df['date_'] = df['date'].astype(str)

# create the new columns using string indexing
df['year'] = df['date_'].str[0:4]
df['month'] = df['date_'].str[4:6]
df['day'] = df['date_'].str[6:]

# concatenate 'year', 'month' and 'day'
df["date"] = df["year"] + "-" + df["month"] + "-" + df["day"]

# convert to datetime
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# extract 'year', 'month' and 'weekday'
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.dayofweek.map({0 : "Mon", 1 : "Tue", 2 : "Wed", 3: "Thu", 4 : "Fri", 5 : "Sat", 6 : "Sun"})
# select columns to perform exploratory data analysis
cols = "date year month weekday brand device_category default_channel_grouping users sessions bounces".split()
df = df[cols].copy()

# display final dataset
df.head(10)

Image by author.

You can download the Python dataset here.
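As a side note, the multi-step string slicing above can be collapsed into a single call: pandas can parse the integer dates directly once they are cast to strings, given an explicit format. A minimal, self-contained sketch with hypothetical sample values (not the article's dataset):

```python
import pandas as pd

# Hypothetical sample of integer dates in YYYYMMDD form, as in the raw file
df = pd.DataFrame({"date": [20190901, 20191122, 20200131]})

# Parse directly with an explicit format instead of slicing substrings
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# Derived calendar columns, equivalent to the approach above
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name().str[:3]  # e.g. "Fri"

print(df)
```

Either route produces the same datetime column; the one-liner simply avoids the intermediate 'date_', 'year', 'month' and 'day' helper columns.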

# MySQL

-- Device distribution
SELECT
device_category, -- select the device_category column
ROUND(COUNT(users) / (SELECT
COUNT(users)
FROM
case_sql) * 100,
1) AS percent -- calculate the percentage of users in each category
FROM
case_sql -- select data from the case_sql table
GROUP BY 1 -- group the result by device_category
ORDER BY 1; -- order the result by device_category in ascending order
Image by author.

We can see that the distribution by device shows mobile and desktop side by side as the most frequent access types.

# Python

# device distribution
# counts the number of occurrences of each unique device category in the device_category column of the DataFrame df, including missing values (if any).
df.device_category.value_counts(dropna=False).plot(kind='pie', figsize=(8,4),
explode = (0.02, 0.02, 0.02),
autopct='%1.1f%%',
startangle=150);
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.title("Device distribution", fontsize=10, loc="right");
Image by author.

# MySQL

-- Brand distribution
SELECT
brand, -- select the brand column
COUNT(users) AS users, -- count the number of users for each unique brand and alias the result as "users"
ROUND(COUNT(users) / (SELECT
COUNT(users)
FROM
case_sql) * 100,
2) AS percent -- calculate the percentage of users for each brand out of the total number of users in the case_sql table and alias the result as "percent"
FROM
case_sql -- select data from the case_sql table
GROUP BY 1; -- group the result by the first column (brand)
Image by author.

At a brand level, Brand 2 has the highest number of visits, accounting for 56.28% of the total against Brand 1’s 43.72%.

# Python

# Brand distribution
absolut = df["brand"].value_counts().to_frame()

# Pie chart
absolut.plot(kind='pie', subplots=True, autopct='%1.2f%%',
explode= (0.05, 0.05), startangle=20,
legend=False, fontsize=12, figsize=(8,4))

# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.title("Brand's distribution", fontsize=10, loc="right");

display(absolut) # Table

Image by author.

# MySQL

SELECT 
date, -- select the date column
DAYNAME(date) AS day_name, -- calculate the day name corresponding to each date
SUM(users) AS users -- sum the number of users for each unique date where the brand is 'Brand 1'
FROM
case_sql -- select data from the case_sql table
WHERE
brand = 'Brand 1' -- filter rows where the brand is 'Brand 1'
GROUP BY 1 -- group the result by date
ORDER BY 3 DESC -- order the result by users in descending order
LIMIT 1; -- select only the first row of the result (the row with the highest number of users)
Image by author.

Of the 298,412 users recorded between Sep 2019 and Jan 2020, the day the most users arrived on the Brand 1 website was 2019-11-22, a Friday, with a total of 885 visits.

# Python

# filter users that arrived at 'Brand 1' only, assign it 'brand_1'
brand_1 = df[df["brand"] == "Brand 1"].copy()

''' sum total users that came from all "channelgrouping" for the same date,
assign it 'brandgroup' no matter the type of device '''

brandgroup = brand_1.groupby(["date","weekday"])[["default_channel_grouping","users"]].sum()

# filter the date by maximum users, assign it 'users'
users = brandgroup[brandgroup["users"] == brandgroup.users.max()].copy()

# reset index
users.reset_index(["date"], inplace=True)
users.reset_index(["weekday"], inplace=True)

# results
print(f"""Date: {users.date} \n\nTotal users: {users.users} \n\nDay of week: {users.weekday}""")

Date: 0 2019-11-22
Name: date, dtype: datetime64[ns]

Total users: 0 885
Name: users, dtype: int64

Day of week: 0 Fri
Name: weekday, dtype: object

# calling the variable
users

Image by author.

# MySQL

SELECT 
DATE(date) AS date, -- Select the date from the 'date' column and convert it to a date data type
DAYNAME(date) AS dayofweek, -- Select the day of the week from the 'date' column
SUM(CASE
WHEN brand = 'Brand 1' THEN users -- Sum the 'users' column for Brand 1
ELSE NULL
END) AS b1_users,
SUM(CASE
WHEN brand = 'Brand 2' THEN users -- Sum the 'users' column for Brand 2
ELSE NULL
END) AS b2_users
FROM
case_sql -- From the 'case_sql' table
GROUP BY 1, 2 -- Group the results by the first and second columns (date and dayofweek)
ORDER BY 3 DESC -- Order the results by b1_users in descending order
LIMIT 1; -- Limit the results to only the highest total number of Brand 1 users
Image by author.

Actually, both brands saw the highest number of visits on the same day.

# Python

# filter users that arrived at 'Brand 2', assign it 'brand_2'
brand_2 = df[df["brand"] == "Brand 2"].copy()

# rename the 'users' column from previous (above) Python code
brandgroup.rename(columns = {'users':'brand1_users'}, inplace = True)

# include a new column with the filtered users from 'Brand_2'
brandgroup["brand2_users"] = brand_2.groupby(["date","weekday"])[["default_channel_grouping","users"]].sum()

# filter the new column (brand2_users) by maximum users
users2 = brandgroup[brandgroup["brand2_users"] == brandgroup.brand2_users.max()].copy()

Image by author.

# MySQL

SELECT 
default_channel_grouping AS channels,
SUM(users) AS total_users,
ROUND(SUM(users) / (SELECT
SUM(users)
FROM
case_sql) * 100,
1) AS percent -- calculate the percentage of users for each channel
FROM
case_sql
GROUP BY 1
ORDER BY 2 DESC;
Image by author.

Organic Search is by far the channel that generates the most users (almost 141,000), representing almost half of total visits across both websites, followed by Paid Search and Direct. Display takes 4th position and Social 6th, contributing 6,722 users.

# Python

# sum users by all channel groups and plot bar chart 
ax = df.groupby("default_channel_grouping")["users"].sum().sort_values(ascending=True)\
.plot(kind="bar", figsize=(9,6), fontsize=12, linewidth=2,
color=sns.color_palette("rocket"), grid=False, table=False)

# show data labels
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')

# params
plt.xlabel("Channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Best channel group (highest number of users)", fontsize=10, loc="right");

Image by author.

# MySQL

SELECT 
default_channel_grouping AS channels,
SUM(CASE -- sum users by brand and map to new columns
WHEN brand = 'brand 1' THEN users -- if brand = 'brand 1', sum users and store in 'Brand_1' column
ELSE NULL -- if not 'brand 1', set value to null
END) AS Brand_1, -- create column for Brand 1 users
SUM(CASE
WHEN brand = 'brand 2' THEN users
ELSE NULL
END) AS Brand_2
FROM
case_sql
GROUP BY 1 -- group by channel
ORDER BY 3 DESC; -- order by Brand 2 users in descending order
Image by author.

# Python

# create pivot_table
# sum all users for each brand by channels
type_pivot = df.pivot_table(
columns="brand",
index="default_channel_grouping",
values="users", aggfunc=sum)

display(type_pivot)

#Display pivot_table with a bar chart
type_pivot.sort_values(by=["Brand 2"], ascending=True).plot(kind="bar", figsize=(12,8) ,fontsize = 15)
plt.xlabel("Channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Channel groups by brand (highest number of users)", fontsize=10, loc="right");

Image by author.
Image by author.

Organic Search contributed 105,062 users to Brand 2 and 35,911 users to Brand 1. With the exception of ‘Other’, where Brand 1 is ahead, Brand 2 delivers the higher number of users to the website across all channels.

# MySQL

SELECT 
brand,
default_channel_grouping AS channels,
ROUND(SUM(sessions) / (SELECT
SUM(sessions)
FROM
case_sql) * 100,
1) AS percent
FROM
case_sql
WHERE
default_channel_grouping IN ('Paid Search' , 'Paid Social', 'Display', 'Other Advertising') -- include only rows with these values
AND date < '2020-01-01' -- only date before '2020-01-01' will be included.
GROUP BY 1 , 2
HAVING percent > 5 -- filters the groups to only include values greater than 5%.
ORDER BY 1 , 3 DESC;
Image by author

# Python

# groupby dataframe by selected cols
df = df.groupby(["date","brand","default_channel_grouping"])["sessions"].sum().to_frame().copy()

# calculate percentages (new column)
df["percent"] = (df.apply(lambda x: x/x.sum())*100).round(2)

# reset index
df = df.reset_index().copy()

# display a 5 rows sample
df.sample(5)

Image by author.
# filter paid channels (boolean mask is simpler than the row-wise apply)
paid = df[df['default_channel_grouping'].isin(['Display','Paid Search','Paid Social','Other Advertising'])].copy()

# filter year 2019
paid = paid[paid['date'] < '2020-01-01']

# groupby channels by brand
paid = paid.groupby(["brand","default_channel_grouping"])[["sessions","percent"]].sum()

# filter sessions higher than 5%
paid[paid["percent"] > 5]

Image by author.

# MySQL

SELECT 
brand,
SUM(CASE
WHEN device_category = 'Desktop' THEN users
ELSE NULL
END) AS desktop,
SUM(CASE
WHEN device_category = 'Mobile' THEN users
ELSE NULL
END) AS mobile,
SUM(CASE
WHEN device_category = 'Tablet' THEN users
ELSE NULL
END) AS tablet
FROM
case_sql
GROUP BY 1
ORDER BY 1;
Image by author.

# Python

# pivot_table
type_pivot = df.pivot_table(
columns="device_category",
index="brand",
values="users", aggfunc=sum)

display(type_pivot)

# display pivot_table (chart)
ax = type_pivot.sort_values(by=["brand"], ascending=False).plot(kind="bar", figsize=(12,8) ,fontsize = 15);

# adding data labels
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')

plt.xlabel("Brands", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Brand by type of device", fontsize=10, loc="right")
plt.legend(["Desktop","Mobile","Tablet"]);

Image by author.

Mobile is the preferred type of device at Brand 2 whereas desktop is the most used device at Brand 1.

# MySQL

SELECT 
default_channel_grouping,
AVG(CASE
WHEN device_category = 'Desktop' THEN users
ELSE NULL
END) AS desktop,
AVG(CASE
WHEN device_category = 'Mobile' THEN users
ELSE NULL
END) AS mobile,
AVG(CASE
WHEN device_category = 'Tablet' THEN users
ELSE NULL
END) AS tablet
FROM
case_sql
GROUP BY 1
ORDER BY 1;
Image by author.

# Python

# pivot_table
type_pivot = df.pivot_table(
columns="device_category",
index="default_channel_grouping",
values="users", aggfunc=np.mean)

display(type_pivot)

Image by author.
# display pivot_table
type_pivot.sort_values(by=["default_channel_grouping"], ascending=False).plot(kind="bar", figsize=(12,8) ,fontsize = 15);

plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Average use of device types by channel grouping", fontsize=10, loc="right")
plt.legend(["Desktop","Mobile","Tablet"]);

Image by author.

On average, desktop is used more often for the Referral, Direct and Other channels. For the remaining channels, vertically oriented (mobile-first) content should always be taken into consideration.

The Bounce Rate is calculated as the total number of bounces divided by the total number of sessions.
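A quick sanity check of the formula with made-up numbers (not taken from the dataset):

```python
# Toy example: bounce rate = bounces / sessions * 100
sessions = 2000
bounces = 1100
bounce_rate = round(bounces / sessions * 100, 2)
print(bounce_rate)  # 55.0

# Note: a pooled rate (total bounces / total sessions) generally differs
# from the mean of per-group rates when group sizes are unequal, which is
# why reporting both can be informative.
```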

# MySQL

SELECT 
default_channel_grouping,
SUM(sessions) AS sessions,
SUM(bounces) AS bounces,
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r
FROM
case_sql
GROUP BY 1
ORDER BY 4 DESC;
Image by author.

Average Bounce Rate: 54.93% (avg_bounces_r)

SELECT 
SUM(sessions) AS sessions,
SUM(bounces) AS bounces,
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r,
AVG(ROUND(bounces/sessions*100, 2)) AS avg_bounces_r
FROM
case_sql;

# Python

# group individual channels by sum of users
dfbounce = df.groupby("default_channel_grouping")["users"].sum().to_frame()

# group individual channels by sum of sessions
dfbounce["sessions"] = df.groupby("default_channel_grouping")["sessions"].sum()

# group individual channels by sum of bounces
dfbounce["bounces"] = df.groupby("default_channel_grouping")["bounces"].sum()

# calculus of bounce rate for each individual channel
dfbounce["bounces_r"] = dfbounce.apply(lambda x: 0.0 if x["sessions"] == 0.0 else (x["bounces"] / x["sessions"])*100, axis=1).round(2)

dff = dfbounce.copy()

dfbounce.drop(["users"],axis=1,inplace=True)

# sort values by rate
dfbounce.sort_values(by="bounces_r", ascending=False)

Image by author.
# display bar chart with the bounce rate for each channel
ax = dfbounce.groupby("default_channel_grouping")["bounces_r"].sum().sort_values(ascending=True)\
.plot(kind="bar", figsize=(9,6), fontsize=12, linewidth=2, color=sns.color_palette("rocket"), grid=False, table=False)

for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')

plt.axhline(dfbounce.groupby("default_channel_grouping")["bounces_r"].mean().mean(), linewidth=1, color ="r")
plt.xlabel("channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Bounce rate by channelGrouping", fontsize=10, loc="right");

Image by author.

As expected, Display has the highest bounce rate of all channels, followed by Direct and Paid Social. Organic Search is level with the average. Below the threshold, with the lowest bounce rates, we find Referral, Native and Other Advertising.

# MySQL

SELECT 
YEAR(date) AS year, -- extract year
MONTH(date) AS month, -- extract month
DATE_FORMAT(date, '%b') AS month_, -- format the date column to display month name
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r -- calculate bounce rate
FROM
case_sql -- select data from the case_sql table
GROUP BY 1 , 2 , 3
ORDER BY 1 , 2 , 3;
Image by author.

# Python


df_date = df.groupby("date")[['sessions','bounces']].sum()

''' create function to assess the bounce rate, assign it as 'bounce_r'
Return 0 if session's value is 0, else divide the bounces by sessions
for each date and multiply it by 100 to get the percentage '''

def div(bounces, sessions):
    return lambda row: 0.0 if row[sessions] == 0.0 else float(row[bounces] / row[sessions]) * 100

# create column 'bounce_r' with the function results
df_date["bounce_r"] = (df_date.apply(div('bounces', 'sessions'), axis=1)).round(1)

# drop unnecessary columns
df_date.drop(["sessions","bounces"], axis=1, inplace=True)

# sum all bounces over time and plot chart
ax = df_date.plot(kind="line", figsize=(14,6), fontsize=12, linewidth=2)

plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Rate", fontsize=10)
plt.title("Evolution of the bounce rate over time", fontsize=10, loc="right");

Image by author.
# Smooth the line by resampling to monthly means
resampled = df_date["bounce_r"].resample("M").mean()

plt.figure(figsize = (12,6))
ax = sns.lineplot(data = resampled)
plt.title("Evolution of the bounce rate over time (smooth)", fontsize=10, loc="right")
plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Rate", fontsize=10);

Image by author.

The Bounce Rate on site is improving over time.

# Python

# filter by brand
b1 = df[df["brand"] == "Brand 1"]
b2 = df[df["brand"] == "Brand 2"]

# ** brand 1 **

# group individual channels by sum of sessions for brand 1
dfbrand = b1.groupby("default_channel_grouping")["sessions"].sum().to_frame()
dfbrand.rename(columns={"sessions":"sessions1"}, inplace=True)

# group individual channels by sum of bounces for brand 1
dfbrand["bounces1"] = b1.groupby("default_channel_grouping")["bounces"].sum()

# calculus of bounce rate for each individual channel for brand 1
dfbrand["1bounces_r"] = dfbrand.apply(lambda x: 0.0 if x["sessions1"] == 0.0 else (x["bounces1"] / x["sessions1"]*100), axis=1).round(2)

# ** brand 2 **

# group individual channels by sum of bounces for brand 2
dfbrand["sessions2"] = b2.groupby("default_channel_grouping")["sessions"].sum()

# group individual channels by sum of bounces for brand 2
dfbrand["bounces2"] = b2.groupby("default_channel_grouping")["bounces"].sum()

# calculus of bounce rate for each individual channel for brand 2
dfbrand["2bounces_r"] = dfbrand.apply(lambda x: 0.0 if x["sessions2"] == 0.0 else (x["bounces2"] / x["sessions2"]*100), axis=1).round(2)

# sort values by rate
dfbrand.sort_values(by="1bounces_r", ascending=False)

Image by author.
# clean dataframe
dfchannels = dfbrand.copy()
dfbrand_chart = dfbrand.copy()
dfbrand_chart.drop(["sessions1","sessions2","bounces1","bounces2"], axis=1, inplace=True)

# display bar chart with the average bounce rate for each channel
ax = dfbrand_chart.plot(kind="bar", figsize=(13,6), fontsize=12, linewidth=2, color=sns.color_palette("BrBG"), grid=False, table=False)

for p in ax.patches:
    ax.annotate("%.1f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')

plt.xlabel("channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Bounce rate by channelGrouping and by brand", fontsize=10, loc="right")
plt.legend(["Brand 1","Brand 2"]);

Image by author.

Brand 1 has a lower bounce rate than would be expected for the Display channel, which usually drives a high bounce rate. It is worth understanding its content strategy in detail and trying to adapt it to Brand 2.

Brand 2, in turn, shows a higher bounce rate than would be acceptable for the Referral channel.

# MySQL

SELECT 
brand,
CASE
WHEN
default_channel_grouping IN ('Paid Search',
'Paid Social',
'Display',
'Other Advertising')
THEN
'Paid'
WHEN
default_channel_grouping IN ('Direct',
'Native',
'Organic Search',
'Referral',
'Social',
'Email',
'(Other)')
THEN
'Organic'
ELSE NULL
END AS media,
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounce_r
FROM
case_sql
GROUP BY brand , media
ORDER BY 1;
Image by author.

# Python

# create dictionary
media_dict = {
'Display': 'paid',
'Paid Search': 'paid',
'Paid Social': 'paid',
'Other Advertising': 'paid',
'Direct': 'organic',
'Native': 'organic',
'Organic Search': 'organic',
'Referral': 'organic',
'Social': 'organic',
'Email': 'organic',
'(Other)': 'organic'
}

# mapping the dict into a new column
df['media'] = df['default_channel_grouping'].map(media_dict)

# define cols position in dataframe
cols = ['brand','media','sessions','bounces']

# reindex columns order
df = df.reindex(columns = cols)

# groupby dataframe by selected cols
df = df.groupby(["brand","media"])[["sessions","bounces"]].sum()

# bounce rate by channel
df["bounces_r"] = df.apply(lambda x: 0.0 if x["sessions"] == 0.0 else (x["bounces"] / x["sessions"])*100, axis=1).round(2)

Image by author.

As promised, we went through a step-by-step approach to conducting a simple digital marketing analysis working alongside MySQL Workbench and Python.

Both tools have their own specificities and demands, but the reasoning is relatively similar, leaving aside their graphic capabilities and limitations.

Feel free to download the datasets and explore by practicing some of the technical details covered here, implementing new code as you ask further analysis questions.


✅ Thanks for reading!


A digital marketing analytics exercise with explained step-by-step code in both SQL and Python environments

Photo by Zdeněk Macháček on Unsplash

Along this brief journey, we will explore a short and simple dataset with a collection of basic website marketing metrics such as ‘users’, ‘sessions’ and ‘bounces’, over a five months period.

The purpose of this setup, rather than focusing on understanding the website performance, is to gain some basic but somehow useful knowledge to answer a number of must-have operational marketing questions.

We’ll focus on two powerful and most-used digital tools exploring two ways leading us to the same results at the end of the day.

On the one hand, we will explore the syntax of MySQL Workbench with some diverse queries, in parallel, for each question, with the syntax of Python using graphic and visual resources. Both environments will be titled as # MySQL and # Python, respectively. For each question, with notes and explanations on both codes for a deeper understanding.

# MySQL

-- displaying dataset (case_sql.csv)
SELECT * FROM case_sql;
Image by author.

You can download the SQL dataset here.

# Python

# import python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
%matplotlib inline
color = sns.color_palette()
from pandas.plotting import table
from datetime import datetime
# load the data set
df = pd.read_csv("case.csv", sep=";")

# number of observations and columns
df.shape
(31507, 7)

# display rows sample
df.sample(15)

Image by author.
# SHAPE
# Create a function that prints the shape of the dataframe and some other basic info
# --> number of observations, features, duplicates, missing values (True, False) datatypes and its proportion.

def shape_df(df):
print(f"Number of observations: {df.shape[0]}")
print(f"Number of variables: {df.shape[1]}")
print(f"Number of duplicates: {df.duplicated().sum()}")
print(f"Are there any missing values? {df.isnull().values.any()}\n-----")
print(f"{df.dtypes.sort_values(ascending=True)}\n-----")
print(f"Datatypes' proportion:\n{df.dtypes.value_counts(ascending=True)}")

# calling the function
shape_df(df)

Number of observations: 31507
Number of variables: 7
Number of duplicates: 4083
Are there any missing values? False
-----
date int64
users int64
sessions int64
bounces int64
brand object
device_category object
default_channel_grouping object
dtype: object
------
Datatypes proportion:
object 3
int64 4
dtype: int64

# lowering columns' capital letters for easy typing
df.columns = map(str.lower, df.columns)

# remove spaces in columns name
df.columns = df.columns.str.replace(' ','_')

Convert dates into datetime type

# make string version of original column 'date', call it 'date_'
df['date_'] = df['date'].astype(str)

# create the new columns using string indexing
df['year'] = df['date_'].str[0:4]
df['month'] = df['date_'].str[4:6]
df['day'] = df['date_'].str[6:]

# concatenate 'year', 'month' and 'day'
df["date"] = df["year"] + "-" + df["month"] + "-" + df["day"]

# convert to datetime
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# extract 'year', 'month' and 'weekday'
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.dayofweek.map({0 : "Mon", 1 : "Tue", 2 : "Wed", 3: "Thu", 4 : "Fri", 5 : "Sat", 6 : "Sun"})
# select columns to perform exploratory data analysis
cols = "date year month weekday brand device_category default_channel_grouping users sessions bounces".split()
df = df[cols].copy()

# display final dataset
df.head(10)

Image by author.

You can download the Python dataset here.

# MySQL

-- Device distribution
SELECT
device_category, -- select the device_category column
ROUND(COUNT(users) / (SELECT
COUNT(users)
FROM
case_sql) * 100,
1) AS percent -- calculate the percentage of users in each category
FROM
case_sql -- select data from the case_sql table
GROUP BY 1 -- group the result by device_category
ORDER BY 1; -- order the result by device_category in ascending order
Image by author.

We can see that the distribution by device shows mobile and desktop side by side as the most frequent access type.

# Python

# device distribution
# counts the number of occurrences of each unique device category in the device_category column of the DataFrame df, including missing values (if any).
df.device_category.value_counts(dropna=False).plot(kind='pie', figsize=(8,4),
explode = (0.02, 0.02, 0.02),
autopct='%1.1f%%',
startangle=150);
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.title("Device distribution", fontsize=10, loc="right");
Image by author.

# MySQL

-- Brand distribution
SELECT
brand, -- select the brand column
COUNT(users) AS users, -- count the number of users for each unique brand and alias the result as "users"
ROUND(COUNT(users) / (SELECT
COUNT(users)
FROM
case_sql) * 100,
2) AS percent -- calculate the percentage of users for each brand out of the total number of users in the case_sql table and alias the result as "percent"
FROM
case_sql -- select data from the case_sql table
GROUP BY 1; -- group the result by the first column (brand)
Image by author.

At a brand level, Brand 2 has the highest number of visits with 56.28% against 43.72% of total visits on Brand 1.

# Python

# Brand distribution
absolut = df["brand"].value_counts().to_frame()

# Pie chart
absolut.plot(kind='pie', subplots=True, autopct='%1.2f%%',
explode= (0.05, 0.05), startangle=20,
legend=False, fontsize=12, figsize=(8,4))

# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.title("Brand's distribution", fontsize=10, loc="right");

display(absolut) # Table

Image by author.

# MySQL

SELECT 
date, -- select the date column
DAYNAME(date) AS day_name, -- calculate the day name corresponding to each date
SUM(users) AS users -- sum the number of users for each unique date where the brand is 'Brand 1'
FROM
case_sql -- select data from the case_sql table
WHERE
brand = 'Brand 1' -- filter rows where the brand is 'Brand 1'
GROUP BY 1 -- group the result by date
ORDER BY 3 DESC -- order the result by users in descending order
LIMIT 1; -- select only the first row of the result (the row with the highest number of users)
Image by author.

From the 298 412 users between Sep 2019 and Jan 2020, the day most users arrive on the Brand 1 website was on the 2019–11–22 with a total of 885 visits, it was a Friday.

# Python

# filter users that arrived at 'Brand 1' only, assign it 'brand_1'
brand_1 = df[df["brand"] == "Brand 1"].copy()

''' sum total users that came from all "channelgrouping" for the same date,
assign it 'brandgroup' no matter the type of device '''

brandgroup = brand_1.groupby(["date","weekday"])[["default_channel_grouping","users"]].sum()

# filter the date by maximum users, assign it 'users'
users = brandgroup[brandgroup["users"] == brandgroup.users.max()].copy()

# reseat index
users.reset_index(["date"], inplace=True)
users.reset_index(["weekday"], inplace=True)

# results
print(f"""Date: {users.date} \n\nTotal users: {users.users} \n\nDay of week: {users.weekday}""")

Date: 0 2019-11-22
Name: date, dtype: datetime64[ns]

Total users: 0 885
Name: users, dtype: int64

Day of week: 0 Fri
Name: weekday, dtype: object

# calling the variable
users

Image by author.

# MySQL

SELECT 
DATE(date) AS date, -- Select the date from the 'date' column and convert it to a date data type
DAYNAME(date) AS dayofweek, -- Select the day of the week from the 'date' column
SUM(CASE
WHEN brand = 'Brand 1' THEN users -- Sum the 'users' column for Brand 1
ELSE NULL
END) AS b1_users,
SUM(CASE
WHEN brand = 'Brand 2' THEN users -- Sum the 'users' column for Brand 2
ELSE NULL
END) AS b2_users
FROM
case_sql -- From the 'case_sql' table
GROUP BY 1, 2 -- Group the results by the first and second columns (date and dayofweek)
ORDER BY 3 DESC -- Order the results by b1_users in descending order
LIMIT 1; -- Limit the results to only the highest total number of Brand 1 users
Image by author.

Actually, both brands saw the highest number of visits on the same day.

# Python

# filter users that arrived at 'Brand 2', assign it 'brand_2'
brand_2 = df[df["brand"] == "Brand 2"].copy()

# rename the 'users' column from previous (above) Python code
brandgroup.rename(columns = {'users':'brand1_users'}, inplace = True)

# include a new column with the filtered users from 'Brand_2'
brandgroup["brand2_users"] = brand_2.groupby(["date","weekday"])[["default_channel_grouping","users"]].sum()

# filter the new column (brand2_users) by maximum users
users2 = brandgroup[brandgroup["brand2_users"] == brandgroup.brand2_users.max()].copy()

Image by author.

# MySQL

SELECT 
    default_channel_grouping AS channels,
    SUM(users) AS total_users,
    ROUND(SUM(users) / (SELECT SUM(users) FROM case_sql) * 100, 1) AS percent  -- share of total users per channel
FROM
    case_sql
GROUP BY 1
ORDER BY 2 DESC;
Image by author.

Organic Search is by far the channel that generates the most users (almost 141,000), accounting for nearly half of the total visits to both websites, followed by Paid Search and Direct. Display takes 4th position and Social 6th, contributing 6,722 users.
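The SQL query expresses each channel as a percentage of the total; the same column is easy to add in pandas. A minimal sketch with invented totals (not the article's real data):

```python
import pandas as pd

# invented figures standing in for the article's dataset
toy = pd.DataFrame({
    "default_channel_grouping": ["Organic Search", "Paid Search", "Direct",
                                 "Organic Search", "Paid Search"],
    "users": [100, 40, 20, 30, 10],
})

# total users per channel, then each channel's share of the grand total
share = toy.groupby("default_channel_grouping")["users"].sum()
share = (share / share.sum() * 100).round(1).sort_values(ascending=False)
print(share)
```

The shares always sum to 100, which is a handy sanity check before plotting.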

# Python

# sum users by all channel groups and plot bar chart 
ax = df.groupby("default_channel_grouping")["users"].sum().sort_values(ascending=True)\
    .plot(kind="bar", figsize=(9,6), fontsize=12, linewidth=2,
          color=sns.color_palette("rocket"), grid=False, table=False)

# show data labels
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')

# params
plt.xlabel("Channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Best channel group (highest number of users)", fontsize=10, loc="right");

Image by author.

# MySQL

SELECT 
    default_channel_grouping AS channels,
    SUM(CASE -- sum users by brand and map to new columns
        WHEN brand = 'Brand 1' THEN users   -- sum users for Brand 1
        ELSE NULL                           -- otherwise, set value to null
    END) AS Brand_1,
    SUM(CASE
        WHEN brand = 'Brand 2' THEN users   -- sum users for Brand 2
        ELSE NULL
    END) AS Brand_2
FROM
    case_sql
GROUP BY 1          -- group by channel
ORDER BY 3 DESC;    -- order by Brand 2 users in descending order
Image by author.

# Python

# create pivot_table
# sum all users for each brand by channels
type_pivot = df.pivot_table(
    columns="brand",
    index="default_channel_grouping",
    values="users", aggfunc=sum)

display(type_pivot)

#Display pivot_table with a bar chart
type_pivot.sort_values(by=["Brand 2"], ascending=True).plot(kind="bar", figsize=(12,8) ,fontsize = 15)
plt.xlabel("Channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Channel groups by brand (highest number of users)", fontsize=10, loc="right");

Image by author.
Image by author.

Organic Search contributed 105,062 users to Brand 2 and 35,911 to Brand 1. With the exception of ‘Other’, where Brand 1 is ahead, Brand 2 delivers the higher number of users in every channel.
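When reading a pivot table like this one, `margins=True` adds row and column totals, which makes claims like the one above easy to verify at a glance. A sketch using the two Organic Search figures from the text plus invented numbers for a hypothetical ‘Other’ channel:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["Brand 1", "Brand 1", "Brand 2", "Brand 2"],
    "default_channel_grouping": ["Organic Search", "Other",
                                 "Organic Search", "Other"],
    "users": [35911, 500, 105062, 300],   # 'Other' values are invented
})

# margins=True appends a 'Total' row and column
pivot = toy.pivot_table(index="default_channel_grouping", columns="brand",
                        values="users", aggfunc="sum",
                        margins=True, margins_name="Total")
print(pivot)
```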

# MySQL

SELECT 
    brand,
    default_channel_grouping AS channels,
    ROUND(SUM(sessions) / (SELECT SUM(sessions) FROM case_sql) * 100, 1) AS percent
FROM
    case_sql
WHERE
    default_channel_grouping IN ('Paid Search', 'Paid Social', 'Display', 'Other Advertising') -- include only paid channels
    AND date < '2020-01-01'   -- only dates before '2020-01-01'
GROUP BY 1, 2
HAVING percent > 5            -- keep only shares greater than 5%
ORDER BY 1, 3 DESC;
Image by author

# Python

# groupby dataframe by selected cols
df = df.groupby(["date","brand","default_channel_grouping"])["sessions"].sum().to_frame().copy()

# calculate percentages (new column)
df["percent"] = (df.apply(lambda x: x/x.sum())*100).round(2)

# reset index
df = df.reset_index().copy()

# display a 5 rows sample
df.sample(5)

Image by author.
# filter paid channels with a boolean mask
paid = df[df['default_channel_grouping'].isin(['Display','Paid Search','Paid Social','Other Advertising'])].copy()

# filter year 2019
paid = paid[paid['date'] < '2020-01-01']

# groupby channels by brand
paid = paid.groupby(["brand","default_channel_grouping"])[["sessions","percent"]].sum()

# filter sessions higher than 5%
paid[paid["percent"] > 5]

Image by author.

# MySQL

SELECT 
    brand,
    SUM(CASE
        WHEN device_category = 'Desktop' THEN users
        ELSE NULL
    END) AS desktop,
    SUM(CASE
        WHEN device_category = 'Mobile' THEN users
        ELSE NULL
    END) AS mobile,
    SUM(CASE
        WHEN device_category = 'Tablet' THEN users
        ELSE NULL
    END) AS tablet
FROM
    case_sql
GROUP BY 1
ORDER BY 1;
Image by author.

# Python

# pivot_table
type_pivot = df.pivot_table(
    columns="device_category",
    index="brand",
    values="users", aggfunc=sum)

display(type_pivot)

# display pivot_table (chart)
ax = type_pivot.sort_values(by=["brand"], ascending=False).plot(kind="bar", figsize=(12,8) ,fontsize = 15);

# adding data labels
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')

plt.xlabel("Brands", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Brand by type of device", fontsize=10, loc="right")
plt.legend(["Desktop","Mobile","Tablet"]);

Image by author.

Mobile is the preferred type of device at Brand 2 whereas desktop is the most used device at Brand 1.
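Because the two brands differ in size, absolute device counts can mislead; normalising each brand's row to percentages makes the device mix directly comparable. A minimal sketch with invented numbers:

```python
import pandas as pd

# invented users-per-device figures for the two brands
toy = pd.DataFrame({
    "brand": ["Brand 1"] * 3 + ["Brand 2"] * 3,
    "device_category": ["Desktop", "Mobile", "Tablet"] * 2,
    "users": [60, 30, 10, 25, 70, 5],
})

mix = toy.pivot_table(index="brand", columns="device_category",
                      values="users", aggfunc="sum")

# divide each row by its own total so every brand sums to 100%
mix_pct = mix.div(mix.sum(axis=1), axis=0).mul(100).round(1)
print(mix_pct)
```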

# MySQL

SELECT 
    default_channel_grouping,
    AVG(CASE
        WHEN device_category = 'Desktop' THEN users
        ELSE NULL
    END) AS desktop,
    AVG(CASE
        WHEN device_category = 'Mobile' THEN users
        ELSE NULL
    END) AS mobile,
    AVG(CASE
        WHEN device_category = 'Tablet' THEN users
        ELSE NULL
    END) AS tablet
FROM
    case_sql
GROUP BY 1
ORDER BY 1;
Image by author.

# Python

# pivot_table
type_pivot = df.pivot_table(
    columns="device_category",
    index="default_channel_grouping",
    values="users", aggfunc=np.mean)

display(type_pivot)

Image by author.
# display pivot_table
type_pivot.sort_values(by=["default_channel_grouping"], ascending=False).plot(kind="bar", figsize=(12,8) ,fontsize = 15);

plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Average use of device types by channel grouping", fontsize=10, loc="right")
plt.legend(["Desktop","Mobile","Tablet"]);

Image by author.

On average, desktop is used more often for the Referral, Direct and ‘Other’ channels. For the remaining channels, vertically-oriented (mobile-first) content should always be taken into consideration.

The Bounce Rate is calculated as the total number of bounces divided by the total number of sessions.
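The formula just stated is worth wrapping in a small helper that guards against a zero-session denominator, the same edge case the article's `lambda` expressions handle. A minimal sketch (the function name and example figures are my own):

```python
def bounce_rate(bounces, sessions):
    """Bounce rate in percent; returns 0.0 when there are no sessions."""
    if sessions == 0:
        return 0.0
    return round(bounces / sessions * 100, 2)

print(bounce_rate(55, 100))   # 55.0
print(bounce_rate(0, 0))      # guarded division: 0.0
```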

# MySQL

SELECT 
    default_channel_grouping,
    SUM(sessions) AS sessions,
    SUM(bounces) AS bounces,
    ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r
FROM
    case_sql
GROUP BY 1
ORDER BY 4 DESC;
Image by author.

Average Bounce Rate: 54.93% (avg_bounces_r)

SELECT 
    SUM(sessions) AS sessions,
    SUM(bounces) AS bounces,
    ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r,
    AVG(ROUND(bounces / sessions * 100, 2)) AS avg_bounces_r
FROM
    case_sql;

# Python

# group individual channels by sum of users
dfbounce = df.groupby("default_channel_grouping")["users"].sum().to_frame()

# group individual channels by sum of sessions
dfbounce["sessions"] = df.groupby("default_channel_grouping")["sessions"].sum()

# group individual channels by sum of bounces
dfbounce["bounces"] = df.groupby("default_channel_grouping")["bounces"].sum()

# calculate the bounce rate for each individual channel (guarding against zero sessions)
dfbounce["bounces_r"] = dfbounce.apply(lambda x: 0.0 if x["sessions"] == 0.0 else (x["bounces"] / x["sessions"])*100, axis=1).round(2)

dff = dfbounce.copy()

dfbounce.drop(["users"],axis=1,inplace=True)

# sort values by rate
dfbounce.sort_values(by="bounces_r", ascending=False)

Image by author.
# display bar chart with the bounce rate for each channel
ax = dfbounce.groupby("default_channel_grouping")["bounces_r"].sum().sort_values(ascending=True)\
.plot(kind="bar", figsize=(9,6), fontsize=12, linewidth=2, color=sns.color_palette("rocket"), grid=False, table=False)

for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')

plt.axhline(dfbounce.groupby("default_channel_grouping")["bounces_r"].mean().mean(), linewidth=1, color ="r")
plt.xlabel("channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Bounce rate by channelGrouping", fontsize=10, loc="right");

Image by author.

As expected, Display has the highest bounce rate of all channels, followed by Direct and Paid Social. Organic Search is level with the average. Below the threshold, with the lowest bounce rates, we find Referral, Native and Other Advertising.
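The per-channel rate above is computed with `apply`; on larger frames a vectorised division does the same work in one pass. A sketch with invented channel totals (the zero-session row is there to show the guard):

```python
import pandas as pd

# invented totals; 'Empty' exercises the zero-session edge case
ch = pd.DataFrame({
    "sessions": [1000, 400, 0],
    "bounces": [620, 180, 0],
}, index=["Display", "Referral", "Empty"])

# where() blanks zero-session rows so the division never hits 0/0
ch["bounces_r"] = (ch["bounces"]
                   .div(ch["sessions"].where(ch["sessions"] != 0))
                   .mul(100).fillna(0.0).round(2))
print(ch)
```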

# MySQL

SELECT 
    YEAR(date) AS year,                  -- extract year
    MONTH(date) AS month,                -- extract month
    DATE_FORMAT(date, '%b') AS month_,   -- month name (Jan, Feb, ...)
    ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r   -- monthly bounce rate
FROM
    case_sql
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
Image by author.

# Python


df_date = df.groupby("date")[['sessions','bounces']].sum()

''' create function to assess the bounce rate, assign it as 'bounce_r'
Return 0 if session's value is 0, else divide the bounces by sessions
for each date and multiply it by 100 to get the percentage '''

def div(bounces, sessions):
    return lambda row: 0.0 if row[sessions] == 0.0 else float(row[bounces] / row[sessions]) * 100

# create column 'bounce_r' with the function results
df_date["bounce_r"] = (df_date.apply(div('bounces', 'sessions'), axis=1)).round(1)

# drop unnecessary columns
df_date.drop(["sessions","bounces"], axis=1, inplace=True)

# sum all bounces over time and plot chart
ax = df_date.plot(kind="line", figsize=(14,6), fontsize=12, linewidth=2)

plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Rate", fontsize=10)
plt.title("Evolution of the bounce rate over time", fontsize=10, loc="right");

Image by author.
# Smooth the line by resampling to monthly means
resampled = df_date["bounce_r"].resample("M").mean()

plt.figure(figsize = (12,6))
ax = sns.lineplot(data = resampled)
plt.title("Evolution of the bounce rate over time (smooth)", fontsize=10, loc="right")
plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Rate", fontsize=10);

Image by author.

The Bounce Rate on site is improving over time.
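Monthly resampling collapses the series to one point per month; a rolling mean is an alternative smoother that keeps the daily index. A self-contained sketch on an invented daily bounce-rate series (the window size of 15 days is my choice):

```python
import numpy as np
import pandas as pd

# invented daily series standing in for df_date["bounce_r"]
idx = pd.date_range("2019-11-01", periods=60, freq="D")
rng = np.random.default_rng(0)
bounce_r = pd.Series(55 + rng.normal(0, 3, len(idx)), index=idx)

# 15-day centred rolling mean: smooths noise without losing daily resolution
smooth = bounce_r.rolling(window=15, center=True, min_periods=1).mean()
print(smooth.tail(3))
```

`min_periods=1` keeps the edges of the series defined instead of producing NaN for the first and last half-windows.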

# Python

# filter by brand
b1 = df[df["brand"] == "Brand 1"]
b2 = df[df["brand"] == "Brand 2"]

# ** brand 1 **

# group individual channels by sum of sessions for brand 1
dfbrand = b1.groupby("default_channel_grouping")["sessions"].sum().to_frame()
dfbrand.rename(columns={"sessions":"sessions1"}, inplace=True)

# group individual channels by sum of bounces for brand 1
dfbrand["bounces1"] = b1.groupby("default_channel_grouping")["bounces"].sum()

# calculate the bounce rate for each individual channel for brand 1
dfbrand["1bounces_r"] = dfbrand.apply(lambda x: 0.0 if x["sessions1"] == 0.0 else (x["bounces1"] / x["sessions1"]*100), axis=1).round(2)

# ** brand 2 **

# group individual channels by sum of sessions for brand 2
dfbrand["sessions2"] = b2.groupby("default_channel_grouping")["sessions"].sum()

# group individual channels by sum of bounces for brand 2
dfbrand["bounces2"] = b2.groupby("default_channel_grouping")["bounces"].sum()

# calculate the bounce rate for each individual channel for brand 2
dfbrand["2bounces_r"] = dfbrand.apply(lambda x: 0.0 if x["sessions2"] == 0.0 else (x["bounces2"] / x["sessions2"]*100), axis=1).round(2)

# sort values by rate
dfbrand.sort_values(by="1bounces_r", ascending=False)

Image by author.
# clean dataframe
dfchannels = dfbrand.copy()
dfbrand_chart = dfbrand.copy()
dfbrand_chart.drop(["sessions1","sessions2","bounces1","bounces2"], axis=1, inplace=True)

# display bar chart with the average bounce rate for each channel
ax = dfbrand_chart.plot(kind="bar", figsize=(13,6), fontsize=12, linewidth=2, color=sns.color_palette("BrBG"), grid=False, table=False)

for p in ax.patches:
    ax.annotate("%.1f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')

plt.xlabel("channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Bounce rate by channelGrouping and by brand", fontsize=10, loc="right")
plt.legend(["Brand 1","Brand 2"]);

Image by author.

Brand 1 has a lower bounce rate than would be expected for the Display channel, which usually drives a high bounce rate. It is worth understanding its content strategy in detail and trying to adapt it to Brand 2.

Brand 2, in turn, shows a higher bounce rate than would be acceptable for the Referral channel.

# MySQL

SELECT 
    brand,
    CASE
        WHEN default_channel_grouping IN ('Paid Search', 'Paid Social',
                                          'Display', 'Other Advertising')
            THEN 'Paid'
        WHEN default_channel_grouping IN ('Direct', 'Native', 'Organic Search',
                                          'Referral', 'Social', 'Email', '(Other)')
            THEN 'Organic'
        ELSE NULL
    END AS media,
    ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounce_r
FROM
    case_sql
GROUP BY brand, media
ORDER BY 1;
Image by author.

# Python

# create dictionary mapping each channel to paid/organic media
media_dict = {
    'Display': 'paid',
    'Paid Search': 'paid',
    'Paid Social': 'paid',
    'Other Advertising': 'paid',
    'Direct': 'organic',
    'Native': 'organic',
    'Organic Search': 'organic',
    'Referral': 'organic',
    'Social': 'organic',
    'Email': 'organic',
    '(Other)': 'organic'
}

# mapping the dict into a new column
df['media'] = df['default_channel_grouping'].map(media_dict)

# define cols position in dataframe
cols = ['brand','media','sessions','bounces']

# reindex columns order
df = df.reindex(columns = cols)

# groupby dataframe by selected cols
df = df.groupby(["brand","media"])[["sessions","bounces"]].sum()

# bounce rate by channel
df["bounces_r"] = df.apply(lambda x: 0.0 if x["sessions"] == 0.0 else (x["bounces"] / x["sessions"])*100, axis=1).round(2)

Image by author.
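The paid/organic split above relies on `Series.map`; any channel missing from the dictionary silently becomes `NaN`, which would quietly drop its sessions from the grouped totals. A minimal sketch (the trimmed dictionary and the `Newsletter` channel are invented) showing how to surface unmapped values before aggregating:

```python
import pandas as pd

# hypothetical, trimmed-down version of the article's media_dict
media_dict = {"Paid Search": "paid", "Organic Search": "organic"}

channels = pd.Series(["Paid Search", "Organic Search", "Newsletter"])
media = channels.map(media_dict)

# channels missing from the dict surface as NaN – list them explicitly
unmapped = channels[media.isna()].tolist()
print(unmapped)
```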

As promised, we went through a step-by-step approach to conducting a simple digital marketing analysis, working with MySQL Workbench and Python side by side.

Both tools have their own specificities and demands, but the reasoning behind them is quite similar, leaving aside their charting capabilities and limitations.

Feel free to download the datasets and practice some of the techniques covered here, implementing new code as you pose further analysis questions.

Explore other projects you might also like:

How to get in contact with me:

✅ Thanks for reading!
