Techno Blender
Digitally Yours.

The Best SQL Template for Customer Lifetime Value | by Josh Berry | Aug, 2022

0 86


Reusable SQL for RFM and CLV Analysis

Photo by energepic.com: https://www.pexels.com/photo/black-payment-terminal-2988232/

Introduction

Over the past 10 years of my career, I’ve worked on Customer Lifetime Value (CLV) models with a variety of clients, industries, and datasets. There are many different ways to solve the problem, but this article is not about that. Instead, I want to explain the data setup, which is often neglected in the blogosphere.

In my last blog post, Lets go further with SQL, I talked about how excited I am about the potential for combining jinja and SQL, in order to leverage semantic abstractions that sit above the SQL code itself. This post is the first of many ideas that have been floating around in my head recently. My goal is to (1) show you that the data prep for a CLV model can be done in SQL, but more importantly, (2) give you a template that you can apply elsewhere, so you don’t have to actually write the SQL from scratch.

Getting started

The most commonly used probabilistic model for predicting CLV is the Beta Geometric/Negative Binomial Distribution model (often abbreviated BG/NBD). It has a popular cousin, the Pareto/NBD model. These models try to predict the future transactions of a given customer. The Gamma-Gamma extension model focuses on the monetary aspect of how much those transactions will be.

This year, hundreds, if not thousands, of data scientists across the world are going to build this kind of model.

How will they start? How would you start? If you’re being honest, the first step is probably to Google it, read a few blogs, and then follow along with your own data. If you can find a good enough example online, you can methodically manipulate your data in a way that is similar to the blog post, until you’ve got it close enough that you’re off and running.

A familiar pattern emerges

If you read enough of these blogs, you’ll begin to realize that there is a familiar pattern. All of them start with manipulating the data to a specific shape — often called an RFM table.

There are other methods to calculate CLV, and I’m not going to cover those. My main focus is the RFM table because not only is it a very popular approach, it is also very repeatable. If you’re not familiar with an RFM Table, check out my friend Olubukunola Akinsola’s blog post about it.

The basics are:

R — the most recent date a customer purchased from the business
F — how frequently does a customer purchases from the business
M — the monetary value of the customer’s purchase.

Choosing SQL

If you’ve read any of my past blog posts, then you already know that I’m passionate about utilizing SQL in order to take advantage of the data warehouse’s processing power. In order to create the RFM table, we need to aggregate the data from the raw transaction level, which is likely one of the largest tables in our database. While it is true that there are plenty of python solutions available, why bother going through the effort to offload millions or billions of transactions, which might not even fit in my python memory, when I have a perfectly good data warehouse?

Making it repeatable

In the following sections, I’m going to walk through an example that creates an RFM table from a sales dataset that I found online. However, that means that anyone who stumbles upon this post in the future, is still going to rewrite the SQL carefully to match their own database.

This is where jinja comes into play. Jinja is a 14 year old templating engine that is mainly used to dynamically generate HTML, XML, and other markup formats. Recently, it has been shown to be a powerful partner to SQL, being leveraged by both dbt and Rasgo. If you want to read up on the basics, Julien Kervizic wrote a prophetic post back in 2019 called Jinja the SQL way of the Ninja.

So, instead of just sharing the SQL code and logic, I’m going to take it to the next level — I’m going to share a template that you can use to dynamically create the SQL for your own data.

Walking through an example

Let’s take some example sales transactions from the UCI Machine Learning Repository.

Example of the raw transactional data

The way I approached this problem was by first constructing the SQL. This part is obviously challenging, but I did it so that you don’t have to. Here is the working SQL.

The result of this query is a properly formatted RFM table, just like we would need to start our CLV modeling.

Result — RFM Summary of Transactions

Being a jinja ninja

In order to build our RFM table, we need a table of transactions to aggregate. Within that table, what we need to find is:

  • CustomerID — a column uniquely identifying a customer
  • TransactionID — a column uniquely identifying a transaction or purchase
  • TransactionDate— a column identifying the date/time of transaction
  • TransactionAmount— a column identifying the amount of monetary value for the transaction

Since our goal is to eventually build a model, we also have to consider a period of time that is for training, and a subsequent period of time that is validation/holdout.

  • ObservationEnd — the date at which point we consider the end of our data (end of holdout)
  • TrainingPeriodEnd — the date at which point we split the end of training from the beginning of holdout

From our example data, we know these:

The next step, is to create a SQL template that is jinja compatible, so I can re-use this for different projects and clients.

Because the code is long, I’ll explain how this is done with a short example. If our original query is:

SELECT CustomerID
FROM tblCustomers

And we want to create parameters like,

Then our jinja template would be:

SELECT {{ column_name }}
FROM {{ table_name }}

Thus, I parameterized the entire CLV-RFM query so that we can run this on any transaction table in the future.

To render jinja, there are plenty of online renderers, or you can use the python package jinja2. Personally, I use Rasgo to render the templates.

Rending the SQL template with Rasgo

Here is the full code, ready to go! Just plug in your own table name, column names, and this will dynamically render the SQL that is customized to your data.

Conclusion

I am really excited for the possibilities of SQL templates like these. In fact, I already have some other ideas. For example, time-series problems often require similar data preparation steps such as,

  • remove negative sales amounts
  • add rows for dates that had no sales
  • forward fill last-known-values for those missing rows
  • add lags and window-based aggregates

I think this sounds like a good candidate for another template.

Can you think of any? Want to contribute? Please let me know!


Reusable SQL for RFM and CLV Analysis

Photo by energepic.com: https://www.pexels.com/photo/black-payment-terminal-2988232/

Introduction

Over the past 10 years of my career, I’ve worked on Customer Lifetime Value (CLV) models with a variety of clients, industries, and datasets. There are many different ways to solve the problem, but this article is not about that. Instead, I want to explain the data setup, which is often neglected in the blogosphere.

In my last blog post, Lets go further with SQL, I talked about how excited I am about the potential for combining jinja and SQL, in order to leverage semantic abstractions that sit above the SQL code itself. This post is the first of many ideas that have been floating around in my head recently. My goal is to (1) show you that the data prep for a CLV model can be done in SQL, but more importantly, (2) give you a template that you can apply elsewhere, so you don’t have to actually write the SQL from scratch.

Getting started

The most commonly used probabilistic model for predicting CLV is the Beta Geometric/Negative Binomial Distribution model (often abbreviated BG/NBD). It has a popular cousin, the Pareto/NBD model. These models try to predict the future transactions of a given customer. The Gamma-Gamma extension model focuses on the monetary aspect of how much those transactions will be.

This year, hundreds, if not thousands, of data scientists across the world are going to build this kind of model.

How will they start? How would you start? If you’re being honest, the first step is probably to Google it, read a few blogs, and then follow along with your own data. If you can find a good enough example online, you can methodically manipulate your data in a way that is similar to the blog post, until you’ve got it close enough that you’re off and running.

A familiar pattern emerges

If you read enough of these blogs, you’ll begin to realize that there is a familiar pattern. All of them start with manipulating the data to a specific shape — often called an RFM table.

There are other methods to calculate CLV, and I’m not going to cover those. My main focus is the RFM table because not only is it a very popular approach, it is also very repeatable. If you’re not familiar with an RFM Table, check out my friend Olubukunola Akinsola’s blog post about it.

The basics are:

R — the most recent date a customer purchased from the business
F — how frequently does a customer purchases from the business
M — the monetary value of the customer’s purchase.

Choosing SQL

If you’ve read any of my past blog posts, then you already know that I’m passionate about utilizing SQL in order to take advantage of the data warehouse’s processing power. In order to create the RFM table, we need to aggregate the data from the raw transaction level, which is likely one of the largest tables in our database. While it is true that there are plenty of python solutions available, why bother going through the effort to offload millions or billions of transactions, which might not even fit in my python memory, when I have a perfectly good data warehouse?

Making it repeatable

In the following sections, I’m going to walk through an example that creates an RFM table from a sales dataset that I found online. However, that means that anyone who stumbles upon this post in the future, is still going to rewrite the SQL carefully to match their own database.

This is where jinja comes into play. Jinja is a 14 year old templating engine that is mainly used to dynamically generate HTML, XML, and other markup formats. Recently, it has been shown to be a powerful partner to SQL, being leveraged by both dbt and Rasgo. If you want to read up on the basics, Julien Kervizic wrote a prophetic post back in 2019 called Jinja the SQL way of the Ninja.

So, instead of just sharing the SQL code and logic, I’m going to take it to the next level — I’m going to share a template that you can use to dynamically create the SQL for your own data.

Walking through an example

Let’s take some example sales transactions from the UCI Machine Learning Repository.

Example of the raw transactional data

The way I approached this problem was by first constructing the SQL. This part is obviously challenging, but I did it so that you don’t have to. Here is the working SQL.

The result of this query is a properly formatted RFM table, just like we would need to start our CLV modeling.

Result — RFM Summary of Transactions

Being a jinja ninja

In order to build our RFM table, we need a table of transactions to aggregate. Within that table, what we need to find is:

  • CustomerID — a column uniquely identifying a customer
  • TransactionID — a column uniquely identifying a transaction or purchase
  • TransactionDate— a column identifying the date/time of transaction
  • TransactionAmount— a column identifying the amount of monetary value for the transaction

Since our goal is to eventually build a model, we also have to consider a period of time that is for training, and a subsequent period of time that is validation/holdout.

  • ObservationEnd — the date at which point we consider the end of our data (end of holdout)
  • TrainingPeriodEnd — the date at which point we split the end of training from the beginning of holdout

From our example data, we know these:

The next step, is to create a SQL template that is jinja compatible, so I can re-use this for different projects and clients.

Because the code is long, I’ll explain how this is done with a short example. If our original query is:

SELECT CustomerID
FROM tblCustomers

And we want to create parameters like,

Then our jinja template would be:

SELECT {{ column_name }}
FROM {{ table_name }}

Thus, I parameterized the entire CLV-RFM query so that we can run this on any transaction table in the future.

To render jinja, there are plenty of online renderers, or you can use the python package jinja2. Personally, I use Rasgo to render the templates.

Rending the SQL template with Rasgo

Here is the full code, ready to go! Just plug in your own table name, column names, and this will dynamically render the SQL that is customized to your data.

Conclusion

I am really excited for the possibilities of SQL templates like these. In fact, I already have some other ideas. For example, time-series problems often require similar data preparation steps such as,

  • remove negative sales amounts
  • add rows for dates that had no sales
  • forward fill last-known-values for those missing rows
  • add lags and window-based aggregates

I think this sounds like a good candidate for another template.

Can you think of any? Want to contribute? Please let me know!

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.

Leave a comment