Techno Blender
Digitally Yours.

Crack SQL Interview Question: Window Functions with Partition-By | by Aaron Zhu | Sep, 2022

0 42


Solving SQL questions with useful procedures

Photo by Marvin Meyer on Unsplash

In this article, we will go over a SQL question from an Amazon data science interview. Hope the procedure explained in this article would help you become more effective in writing SQL queries.

SQL Question:

Marketing Campaign SuccessYou have a table of in-app purchases by user. Users that make their first in-app purchase are placed in a marketing campaign where they see call-to-actions for more in-app purchases. Find the number of users that made additional in-app purchases due to the success of the marketing campaign.The marketing campaign doesn't start until one day after the initial in-app purchase so users that only made one or multiple purchases on the first day do not count, nor do we count users that over time purchase only the products they purchased on the first day.Source: stratascratch.com

Table: marketing_campaign

Image by author

Step 1: let’s check out the raw data first.

  • user_id: this field identifies a unique user. We definitely need this field to compute the number of distinct users.
  • created_at: this field indicates the date a transaction was made. This can be used to identify whether a given transaction was made on the first day (when the first in-app purchase is placed) or during the marketing campaign (starting one day after the date of the first in-app purchase)
  • product_id: this field identifies a unique product. Combined with created_at, they can be used to identify whether a given product was purchased on the first day or during the marketing campaign, or both.
  • quantity & price: these two fields represent the quantity and price associated with a given transaction. They are not relevant to this question.

Step 2: let’s brainstorm how to solve the problem.

The most important task for this exercise is to figure out how to identify

  • users who purchased new product(s) during the marketing campaign (i.e., Success)
  • users who purchased the same product(s) they purchased on the first day during the marketing campaign (i.e., Failure)
  • users who didn’t purchase any products during the marketing campaign (i.e., Failure)
Image by Author

Once we classify the users into the right categories, then we can easily count the number of distinct users that purchased new products due to the success of the marketing campaign.

Step 3: let’s prepare the data and get it ready for analysis.

Method 1: Using the Window function, MIN() with OVER-PARTITION-BY

We would need to construct two new variables

  • first_purchase_date: this field would give us the first day on which a given user made an in-app purchase.
  • first_product_purchase_date: this field would give us the first day on which a user purchased a given product

The following script can be run to produce these two variables.

SELECT *,MIN(created_at) OVER (PARTITION BY user_id) AS first_purchase_date,MIN(created_at) OVER (PARTITION BY user_id, product_id) AS first_product_purchase_dateFROM marketing_campaign

Let’s understand the code in detail:

  • MIN() : this is the window function we would use to compute the earliest purchase date in each partition. There are also other window functions, such as max, sum, row number, rank, etc.
  • OVER : this indicates the function we use here is a window function, not an aggregate function.
  • PARTITION BY : this partitions the rows in the data table, so that we can define which rows the window function would be applied to. For this exercise, “first_purchase_date” is computed at user_id partitions while “first_product_purchase_date” is computed at user_id and product_id partitions.

When we run the above code, we can produce a table like the following. Let’s check if the new variables make sense. For example, user_id 10 made the first in-app purchase on 1/1/2019. This user purchased three distinct products, 101, 111, and 119 with earliest purchase dates of 1/1/2019, 3/31/2019, and 1/2/2019 respectively. With these two variables, we can easily conclude that user_id 10 purchased two new products during the marketing campaign.

Image by author

After we create these two new variables, we are able to flag new product(s) users purchased during the marketing campaign (starting one day after the date of first in-app purchase) using WHERE first_purchase_date < first_product_purchase_date .

Method 2: Using the Window function, DENSE_RANK() with OVER-PARTITION-BY

Similarly, we would need to construct two new variables.

  • user_date_rank: this field would give us the order of purchase dates for a given user. For example, user_date_rank = 1 represents the first day an in-app purchase was made for a given user. user_date_rank = 2 represents the second earliest date and so on. Therefore, user_date_rank > 1 represents the purchase records during the marketing campaign.
  • product_date_rank: this field would give us the order of purchase dates for a given user and product. For example, product_date_rank = 1 represents the first day on which a given product was purchased by a user.

The following script can be run to produce these two variables.

SELECT  *,DENSE_RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS user_date_rank,DENSE_RANK() OVER (PARTITION BY user_id, product_id ORDER BY created_at) AS product_date_rankFROM marketing_campaign

Let’s understand the code in detail:

  • DENSE_RANK : this is a commonly used window function. This function gives the rank of each row within each partition, with consecutive ranking values even if there is a tie. For example, 1, 2, 2, 3, …. RANK is an alternation ranking function. The difference is the latter would produce a gap in ranking values if there is a tie. For example, 1, 2, 2, 4, … For the exercise, DENSE_RANK is more appropriate to use.
  • ORDER BY is used to sort the observations within each partition. In this exercise, we would sort purchase dates in each partition.

When we run the above code, we can produce a table like the following. Let’s check these new variables. For example, user_id 10 made purchases on three different dates, 1/1/2019, 1/2/2019, and 3/31/2019, which are sorted from earliest to latest with the order number (i.e., “user_date_rank”) of 1, 2, 3 respectively. This user purchased three distinct products, 101, 119, and 111. The order number in “product_date_rank”, 1, 1, and 1 mean this user only purchased these three products once.

Image by author

After we create these two new variables, we can check if a purchase was made during the marketing campaign using WHERE user_date_rank > 1 . We can also check if a given product was purchased for the first time using WHERE product_date_rank = 1 . Combing these two conditions, we can flag the records of the new product(s) purchased during a marking campaign for a given user.

Step 4: Once the data is prepared in step 3, we can compute the number of distinct users who purchased new products during the marketing campaign. We just need to use the aggregate function COUNT(DISTINCT) with the WHERE statement keeping the records that meet the criteria.

Final solution using Method 1: Using Window function, MIN() with OVER-PARTITION-BY

WITH cte AS (SELECTuser_id,MIN(created_at) OVER (PARTITION BY user_id) AS first_purchase_date,MIN(created_at) OVER (PARTITION BY user_id, product_id) AS first_product_purchase_dateFROM marketing_campaign)SELECT COUNT(DISTINCT(user_id))FROM cteWHERE first_purchase_date < first_product_purchase_date;

Final solution using Method 2: Using Window function, DENSE_RANK() with OVER-PARTITION-BY

WITH cte AS (SELECTuser_id,DENSE_RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS user_date_rank,DENSE_RANK() OVER (PARTITION BY user_id, product_id ORDER BY created_at) AS product_date_rankFROM marketing_campaign)SELECT COUNT(DISTINCT user_id)FROM cteWHERE user_date_rank > 1 AND product_date_rank = 1;

Answer: 23


Solving SQL questions with useful procedures

Photo by Marvin Meyer on Unsplash

In this article, we will go over a SQL question from an Amazon data science interview. Hope the procedure explained in this article would help you become more effective in writing SQL queries.

SQL Question:

Marketing Campaign SuccessYou have a table of in-app purchases by user. Users that make their first in-app purchase are placed in a marketing campaign where they see call-to-actions for more in-app purchases. Find the number of users that made additional in-app purchases due to the success of the marketing campaign.The marketing campaign doesn't start until one day after the initial in-app purchase so users that only made one or multiple purchases on the first day do not count, nor do we count users that over time purchase only the products they purchased on the first day.Source: stratascratch.com

Table: marketing_campaign

Image by author

Step 1: let’s check out the raw data first.

  • user_id: this field identifies a unique user. We definitely need this field to compute the number of distinct users.
  • created_at: this field indicates the date a transaction was made. This can be used to identify whether a given transaction was made on the first day (when the first in-app purchase is placed) or during the marketing campaign (starting one day after the date of the first in-app purchase)
  • product_id: this field identifies a unique product. Combined with created_at, they can be used to identify whether a given product was purchased on the first day or during the marketing campaign, or both.
  • quantity & price: these two fields represent the quantity and price associated with a given transaction. They are not relevant to this question.

Step 2: let’s brainstorm how to solve the problem.

The most important task for this exercise is to figure out how to identify

  • users who purchased new product(s) during the marketing campaign (i.e., Success)
  • users who purchased the same product(s) they purchased on the first day during the marketing campaign (i.e., Failure)
  • users who didn’t purchase any products during the marketing campaign (i.e., Failure)
Image by Author

Once we classify the users into the right categories, then we can easily count the number of distinct users that purchased new products due to the success of the marketing campaign.

Step 3: let’s prepare the data and get it ready for analysis.

Method 1: Using the Window function, MIN() with OVER-PARTITION-BY

We would need to construct two new variables

  • first_purchase_date: this field would give us the first day on which a given user made an in-app purchase.
  • first_product_purchase_date: this field would give us the first day on which a user purchased a given product

The following script can be run to produce these two variables.

SELECT *,MIN(created_at) OVER (PARTITION BY user_id) AS first_purchase_date,MIN(created_at) OVER (PARTITION BY user_id, product_id) AS first_product_purchase_dateFROM marketing_campaign

Let’s understand the code in detail:

  • MIN() : this is the window function we would use to compute the earliest purchase date in each partition. There are also other window functions, such as max, sum, row number, rank, etc.
  • OVER : this indicates the function we use here is a window function, not an aggregate function.
  • PARTITION BY : this partitions the rows in the data table, so that we can define which rows the window function would be applied to. For this exercise, “first_purchase_date” is computed at user_id partitions while “first_product_purchase_date” is computed at user_id and product_id partitions.

When we run the above code, we can produce a table like the following. Let’s check if the new variables make sense. For example, user_id 10 made the first in-app purchase on 1/1/2019. This user purchased three distinct products, 101, 111, and 119 with earliest purchase dates of 1/1/2019, 3/31/2019, and 1/2/2019 respectively. With these two variables, we can easily conclude that user_id 10 purchased two new products during the marketing campaign.

Image by author

After we create these two new variables, we are able to flag new product(s) users purchased during the marketing campaign (starting one day after the date of first in-app purchase) using WHERE first_purchase_date < first_product_purchase_date .

Method 2: Using the Window function, DENSE_RANK() with OVER-PARTITION-BY

Similarly, we would need to construct two new variables.

  • user_date_rank: this field would give us the order of purchase dates for a given user. For example, user_date_rank = 1 represents the first day an in-app purchase was made for a given user. user_date_rank = 2 represents the second earliest date and so on. Therefore, user_date_rank > 1 represents the purchase records during the marketing campaign.
  • product_date_rank: this field would give us the order of purchase dates for a given user and product. For example, product_date_rank = 1 represents the first day on which a given product was purchased by a user.

The following script can be run to produce these two variables.

SELECT  *,DENSE_RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS user_date_rank,DENSE_RANK() OVER (PARTITION BY user_id, product_id ORDER BY created_at) AS product_date_rankFROM marketing_campaign

Let’s understand the code in detail:

  • DENSE_RANK : this is a commonly used window function. This function gives the rank of each row within each partition, with consecutive ranking values even if there is a tie. For example, 1, 2, 2, 3, …. RANK is an alternation ranking function. The difference is the latter would produce a gap in ranking values if there is a tie. For example, 1, 2, 2, 4, … For the exercise, DENSE_RANK is more appropriate to use.
  • ORDER BY is used to sort the observations within each partition. In this exercise, we would sort purchase dates in each partition.

When we run the above code, we can produce a table like the following. Let’s check these new variables. For example, user_id 10 made purchases on three different dates, 1/1/2019, 1/2/2019, and 3/31/2019, which are sorted from earliest to latest with the order number (i.e., “user_date_rank”) of 1, 2, 3 respectively. This user purchased three distinct products, 101, 119, and 111. The order number in “product_date_rank”, 1, 1, and 1 mean this user only purchased these three products once.

Image by author

After we create these two new variables, we can check if a purchase was made during the marketing campaign using WHERE user_date_rank > 1 . We can also check if a given product was purchased for the first time using WHERE product_date_rank = 1 . Combing these two conditions, we can flag the records of the new product(s) purchased during a marking campaign for a given user.

Step 4: Once the data is prepared in step 3, we can compute the number of distinct users who purchased new products during the marketing campaign. We just need to use the aggregate function COUNT(DISTINCT) with the WHERE statement keeping the records that meet the criteria.

Final solution using Method 1: Using Window function, MIN() with OVER-PARTITION-BY

WITH cte AS (SELECTuser_id,MIN(created_at) OVER (PARTITION BY user_id) AS first_purchase_date,MIN(created_at) OVER (PARTITION BY user_id, product_id) AS first_product_purchase_dateFROM marketing_campaign)SELECT COUNT(DISTINCT(user_id))FROM cteWHERE first_purchase_date < first_product_purchase_date;

Final solution using Method 2: Using Window function, DENSE_RANK() with OVER-PARTITION-BY

WITH cte AS (SELECTuser_id,DENSE_RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS user_date_rank,DENSE_RANK() OVER (PARTITION BY user_id, product_id ORDER BY created_at) AS product_date_rankFROM marketing_campaign)SELECT COUNT(DISTINCT user_id)FROM cteWHERE user_date_rank > 1 AND product_date_rank = 1;

Answer: 23

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.

Leave a comment