Identifying New and Returning Customers in BigQuery using SQL | by Romain Granger | Jan, 2023

By Jessie Hobb On Jan 4, 2023

To better understand your customers’ interests and behaviors, and to improve your marketing strategy

Photo by Vincent van Zalinge on Unsplash

Classifying customers as new or returning can help you decide which marketing or sales strategy to use. Whether you prioritize acquisition, retention, or both will depend on the nature of your business.

Before delving into the SQL approach and steps, here is a simple definition and business example of the terms:

New customer: Someone who makes their first purchase
Returning customer: Someone who has made several purchases

Let’s take for example a car company and a coffee shop company:

For a car company, the share of returning customers is likely to be low: car purchases are infrequent because of the high price point, and a customer may only buy once every few years. The strategy might be to focus on acquisition and reaching out to new customers.

In contrast, online coffee shops sell consumable products that are purchased on a regular basis, such as coffee beans or ground coffee. The price point is much more affordable, which makes the likelihood of a customer returning higher.

The strategy can be adapted to each case. To acquire new customers, you might use free products, giveaways, upper-funnel advertising, or campaigns that drive brand discovery and awareness. To bring customers back, you might use tailored marketing and messaging, product recommendations, incentives and discounts, a loyalty program, and so on.

In SQL, labeling your customers can support several insight projects:

Understanding customer behavior: By examining returning customers’ behavior and the patterns derived from it, you might learn what motivates them to buy again
Personalization and custom messaging: By forwarding attributes (e.g. new or returning) to your marketing tools and creating dedicated marketing segments for each customer type
Customer experience and satisfaction: By running surveys or looking at customer service issues raised by each customer type
Product education or understanding: By looking at product onboarding and usage (some products may be more difficult to understand or use at first)

Let’s jump into the data and make a classification step by step!

To illustrate how to achieve this classification, we will use some data from the Google Merchandise Store, an online store that sells Google-branded products.

This data set covers three months of history (from November 1, 2020, to January 31, 2021) with many kinds of events (purchase, page_view, session_start, add_to_cart, etc.).

Our first step will be to build a basic table with three fields containing orders by customer and date.

The SQL query for Google Analytics 4 (GA4) data looks like this:

SELECT
  user_pseudo_id AS user_id,
  ecommerce.transaction_id AS order_id,
  -- event_date is stored as a 'YYYYMMDD' string in the GA4 export
  PARSE_DATE('%Y%m%d', event_date) AS order_date
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  event_name = 'purchase'
  AND ecommerce.transaction_id <> '(not set)'
-- Grouping by all three columns removes duplicate rows
GROUP BY
  1,
  2,
  3

The data is filtered to only include rows representing purchases with a valid transaction id.

It’s worth noting that Google Analytics will store a default (not set) value if the data isn’t accessible, isn’t being tracked properly, or for other reasons.

The results are then grouped across all three dimensions so that each row represents a unique order per client per date.

This yields the following data table, and we are done with our first step!

Each row represents an order placed by a customer on a specific date (Image by Author)

If you’d like to read more about not set values, you can refer to the Google Analytics documentation.
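As a quick sanity check on the filter described above, a query along these lines could show how many purchase events carry a usable transaction id. It is only a sketch against the same public table, and the transaction_id_status label is an illustrative name, not part of the GA4 schema. NULL is counted separately because a `<>` comparison never evaluates to true for NULL, so the filter drops those rows as well.

```sql
-- Count purchase events by the state of their transaction id.
SELECT
  CASE
    WHEN ecommerce.transaction_id IS NULL THEN 'null'
    WHEN ecommerce.transaction_id = '(not set)' THEN '(not set)'
    ELSE 'valid'
  END AS transaction_id_status,  -- illustrative label
  COUNT(*) AS purchase_events
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  event_name = 'purchase'
GROUP BY
  transaction_id_status
ORDER BY
  purchase_events DESC
```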

If you want to explore the Google Analytics 4 sample data set yourself, it is available in BigQuery as the public dataset bigquery-public-data.ga4_obfuscated_sample_ecommerce.
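The next step reads the orders from a stored table rather than from the raw events. One possible way to materialize the first query is sketched below. The destination `my_project.my_dataset.base_table` is a placeholder name chosen for illustration, so replace it with a project and dataset you can write to.

```sql
-- Placeholder destination: `my_project.my_dataset.base_table` is hypothetical.
CREATE OR REPLACE TABLE `my_project.my_dataset.base_table` AS
SELECT
  user_pseudo_id AS user_id,
  ecommerce.transaction_id AS order_id,
  PARSE_DATE('%Y%m%d', event_date) AS order_date
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  event_name = 'purchase'
  AND ecommerce.transaction_id <> '(not set)'
GROUP BY
  1,
  2,
  3
```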

Our goal is to determine whether a customer is new or returning based on their order history and add a column to our table to give them a label.

We use a window function, DENSE_RANK(), to find the first order placed by each customer.

When the order’s rank is 1, we consider the customer 'new'
When the order’s rank is greater than 1, then we consider the customer 'returning'

SELECT
  *,
  CASE
    -- Every order on the customer's earliest order date gets rank 1
    WHEN DENSE_RANK() OVER (PARTITION BY user_id ORDER BY order_date) = 1 THEN 'new'
    ELSE 'returning'
  END AS customer_type
FROM
  `datastic.new_existing.base_table`

Using DENSE_RANK() allows us to assign the same rank to multiple orders made on the same day, so every order placed on a customer’s first order date is labeled 'new'.

The new column, customer_type, contains the label, which depends on the order_date and the user_id.

Customer’s classification as new or returning based on their order history (Image by Author)

On September 12, 2020, customer 13285520 placed a first order and was given the label 'new'.

Then, on December 12, 2020, the same customer placed a second order, which was labeled 'returning'.
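To see why DENSE_RANK() fits this classification better than ROW_NUMBER(), the sketch below ranks three made-up orders for a single customer, two of them placed on the same day. The inline rows and the column aliases are assumptions for illustration only and do not come from the sample data set.

```sql
-- Made-up orders: o1 and o2 share a date, o3 comes later.
WITH orders AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_id, 'o1' AS order_id, DATE '2020-11-05' AS order_date),
    ('u1', 'o2', DATE '2020-11-05'),
    ('u1', 'o3', DATE '2020-12-01')
  ])
)
SELECT
  *,
  -- Ties on order_date share a rank, and the next rank has no gap
  DENSE_RANK() OVER (PARTITION BY user_id ORDER BY order_date) AS dense_rank_value,
  -- Ties on order_date still receive distinct, arbitrarily ordered numbers
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS row_number_value
FROM
  orders
ORDER BY
  order_id
```

With DENSE_RANK(), both same-day orders get rank 1 and the later order gets rank 2. With ROW_NUMBER(), one of the two same-day orders would get the value 2 and would be labeled 'returning', even though the customer had not come back on a later date.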

Visually analyzing new and returning customers

We can then plot over time the share of new vs returning customers:

New vs returning customers over time (Image by Author)

In this case, we observe that most of the customers are new, and few are actually returning over time. But let’s keep in mind that we do not have a lot of historical data.
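The numbers behind a chart like this could be produced with a query of the following shape. It is a minimal sketch that uses a few made-up rows as a stand-in for the classified table. The returning_share column and the inline data are assumptions, not the author's query.

```sql
-- Made-up stand-in for the classified orders table.
WITH classified_orders AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_id, 'o1' AS order_id, DATE '2020-11-05' AS order_date, 'new' AS customer_type),
    ('u2', 'o2', DATE '2020-11-05', 'new'),
    ('u1', 'o3', DATE '2020-12-01', 'returning'),
    ('u3', 'o4', DATE '2020-12-01', 'new')
  ])
)
SELECT
  order_date,
  COUNT(DISTINCT IF(customer_type = 'new', user_id, NULL)) AS new_customers,
  COUNT(DISTINCT IF(customer_type = 'returning', user_id, NULL)) AS returning_customers,
  -- Share of that day's customers who are returning
  ROUND(SAFE_DIVIDE(
    COUNT(DISTINCT IF(customer_type = 'returning', user_id, NULL)),
    COUNT(DISTINCT user_id)), 2) AS returning_share
FROM
  classified_orders
GROUP BY
  order_date
ORDER BY
  order_date
```

Counting distinct users rather than rows keeps a customer with several orders on one day from being counted more than once.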

Repeat purchase rate

One potential benefit of categorizing customers as new or returning is that we can compute a single metric that indicates what share of customers come back to buy again.

From the previous chart over time, we can assume that the Google Merchandise Store has a low repeat purchase rate. To compute this metric, we use the following formula:

Repeat Customer Rate (%) = (Number of returning customers / Total number of customers) × 100

In our case, over this 3-month period, we have 3713 customers (all of them were new at some point). Of these, 256 made more than one order.

This gives (256/3713) * 100 = 6.9% repeat customer rate
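The same formula could be expressed in SQL roughly as follows. The sketch runs on four made-up customers, one of whom returns, so the inline rows and the output column names are assumptions for illustration and are not the figures from the sample data set.

```sql
-- Made-up stand-in for the classified orders table: only u1 comes back.
WITH classified_orders AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_id, 'o1' AS order_id, 'new' AS customer_type),
    ('u1', 'o2', 'returning'),
    ('u2', 'o3', 'new'),
    ('u3', 'o4', 'new'),
    ('u4', 'o5', 'new')
  ])
)
SELECT
  COUNT(DISTINCT user_id) AS total_customers,
  COUNT(DISTINCT IF(customer_type = 'returning', user_id, NULL)) AS returning_customers,
  -- (returning customers / total customers) * 100
  ROUND(100 * SAFE_DIVIDE(
    COUNT(DISTINCT IF(customer_type = 'returning', user_id, NULL)),
    COUNT(DISTINCT user_id)), 1) AS repeat_customer_rate_pct
FROM
  classified_orders
```

On these rows the rate is 25.0, since one of the four customers has a 'returning' order. SAFE_DIVIDE returns NULL instead of raising an error if the table happens to be empty.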

Considering time resolution

In our example, we are looking at customer orders on a daily basis, with the idea that time plays a role and not only the number of orders.

If you ran a query that looked only at the number of orders per customer, with no notion of time (e.g. assuming that a customer with more than one order is returning), customers who placed several orders on the same day would be counted as returning even if they never came back later.
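The difference between the two definitions can be sketched on two made-up customers: u1 places two orders on the same day, while u2 orders on two different days. The inline rows and column names are illustrative assumptions.

```sql
-- Made-up orders: u1 orders twice on one day, u2 on two separate days.
WITH orders AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_id, 'o1' AS order_id, DATE '2020-11-05' AS order_date),
    ('u1', 'o2', DATE '2020-11-05'),
    ('u2', 'o3', DATE '2020-11-10'),
    ('u2', 'o4', DATE '2020-12-15')
  ])
)
SELECT
  user_id,
  COUNT(DISTINCT order_id) AS order_count,
  COUNT(DISTINCT order_date) AS order_day_count,
  -- No notion of time: any second order makes the customer returning
  COUNT(DISTINCT order_id) > 1 AS returning_by_order_count,
  -- Daily resolution: the customer must order on a later day
  COUNT(DISTINCT order_date) > 1 AS returning_by_order_day
FROM
  orders
GROUP BY
  user_id
ORDER BY
  user_id
```

Both customers are flagged as returning by order count, but only u2 is flagged when distinct order days are required.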

In addition, looking back on a monthly or yearly basis might lead to different numbers. The longer the time period, the higher the chance that a customer is both new and returning within it. In this case, it’s a matter of managing duplicates: either add an extra classification such as both, or prioritize either the new or the returning type.

To illustrate this idea, let’s look at the data by month, for the year 2020 only (we stored the previous query result in a table called customer_table):

SELECT
  user_id,
  DATE_TRUNC(order_date, MONTH) AS month,
  STRING_AGG(DISTINCT customer_type ORDER BY customer_type ASC) AS customer_type
FROM
  customer_table
WHERE
  order_date < '2021-01-01'
GROUP BY
  1,
  2

The STRING_AGG() function is used to concatenate the distinct customer_type values into a single string ordered alphabetically (it is then easier to use with a CASE WHEN statement as the value’s order will stay alphabetically sorted).

The query would return the following results:

A customer may be new and return within the same month (Image by Author)

As you can see, in December 2020, the customer (234561…901) placed a first order and, in the same month, returned and ordered again.

In this case, you might want to define a classification of either:

Consider these customers as new
Consider these customers as both

To change the label in SQL, you could do it like this:

WITH
  customer_month AS (
    SELECT
      user_id,
      DATE_TRUNC(order_date, MONTH) AS month,
      STRING_AGG(DISTINCT customer_type ORDER BY customer_type ASC) AS customer_type
    FROM
      customer_table
    WHERE
      order_date < '2021-01-01'
    GROUP BY
      1,
      2
  )
-- Main Query
SELECT
  *,
  CASE
    WHEN customer_type LIKE 'new,returning' THEN 'both'
    ELSE customer_type
  END AS c_type
FROM
  customer_month

This query could be simplified by grouping by month directly. However, keeping the daily resolution in the underlying table can be useful in a reporting tool, for example when the time period is selected with a dashboard filter.

The table with the new field would look like this:

Customers who are both new and returning are classified differently (Image by Author)
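For the other option listed earlier, treating these customers as new, only the CASE branch needs to change. The sketch below applies it to three made-up monthly rows that mimic the shape of customer_month. The inline data is an assumption for illustration.

```sql
-- Made-up rows shaped like the customer_month result.
WITH customer_month AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_id, DATE '2020-11-01' AS month, 'new' AS customer_type),
    ('u1', DATE '2020-12-01', 'returning'),
    ('u2', DATE '2020-12-01', 'new,returning')
  ])
)
SELECT
  *,
  CASE
    -- Prioritize the 'new' label when both labels occur in the same month
    WHEN customer_type = 'new,returning' THEN 'new'
    ELSE customer_type
  END AS c_type
FROM
  customer_month
ORDER BY
  user_id,
  month
```

Here u2 is labeled 'new' for December 2020, while the labels for u1 are left unchanged.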

Additional supporting metrics

While the repeat purchase rate can be an indicator of the loyalty or satisfaction customers may have towards your products, it does not indicate how much revenue they generate (are they bringing more value?) or whether customers are happy with their purchases.

To complement it, we could combine the repeat purchase rate with other metrics such as:

Customer lifetime value (CLV), for example, can be a significant growth metric that shows how long a customer will stay with you and how much they will spend on your products
Net Promoter Score (NPS) can be another interesting metric, to evaluate customer satisfaction and maybe understand with additional questions or surveys, why they are purchasing again

There are several other indicators we could add, including Average Order Value (AOV) or client retention rate using a monthly or yearly cohort-based approach.
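As one example, Average Order Value could be broken down by customer type along these lines. This is only a sketch: order_value is a hypothetical column, since the base table built earlier does not carry revenue, and the inline rows are made up.

```sql
-- Made-up orders with a hypothetical order_value column.
WITH orders AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_id, 'o1' AS order_id, 'new' AS customer_type, 40.0 AS order_value),
    ('u2', 'o2', 'new', 20.0),
    ('u1', 'o3', 'returning', 60.0),
    ('u1', 'o4', 'returning', 30.0)
  ])
)
SELECT
  customer_type,
  COUNT(DISTINCT order_id) AS order_count,
  -- AOV = total order value / number of orders
  ROUND(SAFE_DIVIDE(SUM(order_value), COUNT(DISTINCT order_id)), 2) AS average_order_value
FROM
  orders
GROUP BY
  customer_type
ORDER BY
  customer_type
```

On these rows the average is 30.0 for new customers and 45.0 for returning customers, which is the kind of gap that the repeat purchase rate alone would not reveal.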

We might also use RFM scores (Recency, Frequency, Monetary) or other methods to better understand the customer base.