
How to Build Popularity-Based Recommenders with Polars | by Dr. Robert Kübler | Apr, 2023



Created by me on dreamstudio.ai.

Recommender systems are algorithms designed to provide users with recommendations based on their past behavior, preferences, and interactions. Having become integral to various industries, including e-commerce, entertainment, and advertising, recommender systems improve user experience, increase customer retention, and drive sales.

While various advanced recommender systems exist, today I want to show you one of the most straightforward — yet often difficult to beat — recommenders: the popularity-based recommender. It is an excellent baseline recommender that you should always try out in addition to a more advanced model, such as matrix factorization.

We will create two different flavors of popularity-based recommenders using polars in this article. Don’t worry if you have not used the fast pandas-alternative polars before; this article is a great place to learn it along the way. Let’s start!

Popularity-based recommenders work by suggesting the most frequently purchased products to customers. This vague idea can be turned into at least two concrete implementations:

  1. Check which articles are bought most often across all customers. Recommend these articles to each customer.
  2. Check which articles are bought most often per customer. Recommend these per-customer articles to their corresponding customer.

We will now show how to implement these concretely using our own custom-created dataset.

If you want to follow along with a real-life dataset, the H&M Personalized Fashion Recommendations challenge on Kaggle provides you with an excellent example. For copyright reasons, I will not use this lovely dataset in this article.

The Data

First, we will create our own dataset. Make sure to install polars if you haven’t done so already:

pip install polars

Then, let us create random data consisting of (customer_id, article_id) pairs that you should interpret as “the customer with this ID bought the article with that ID”. We will use 1,000,000 customers that can buy any of 50,000 products.

import numpy as np
from tqdm import tqdm

np.random.seed(0)

N_CUSTOMERS = 1_000_000
N_PRODUCTS = 50_000
N_PURCHASES_MEAN = 100  # customers buy 100 articles on average

with open("transactions.csv", "w") as file:
    file.write("customer_id,article_id\n")  # header

    for customer_id in tqdm(range(N_CUSTOMERS)):
        n_purchases = np.random.poisson(lam=N_PURCHASES_MEAN)
        articles = np.random.randint(low=0, high=N_PRODUCTS, size=n_purchases)
        for article_id in articles:
            file.write(f"{customer_id},{article_id}\n")  # transaction as a row

Image by the author.

This medium-sized dataset has over 100,000,000 rows (transactions), since each of the 1,000,000 customers buys 100 articles on average. That is an amount you could find in a business context.
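
If you want to verify that number without loading the whole file into memory, polars’ lazy API can count the rows for you (a small sketch of my own, not part of the original code):

import polars as pl

# Scan the CSV lazily and only materialize the row count.
n_rows = pl.scan_csv("transactions.csv").select(pl.count()).collect()[0, 0]
print(n_rows)  # roughly 100,000,000 = 1,000,000 customers * 100 purchases on average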

The Task

We now want to build recommender systems that scan this dataset in order to recommend popular items in some sense. We will shed light on two variants of how to interpret this:

  • most popular across all customers
  • most popular per customer

Our recommenders should recommend ten articles for each customer.

Note: We will not assess the quality of the recommenders here. Drop me a message if you are interested in this topic, though, since it’s worth having a separate article about this.

Most Popular Articles Across All Customers

In this recommender, we don’t even care who bought the articles — all the information we need is in the article_id column alone.

At a high level, it works like this:

  1. Load the data.
  2. Count how often each article appears in the column article_id.
  3. Return the ten most frequent products as the recommendation for each customer.

Familiar Pandas Version

As a gentle start, let us check out how you could do this in pandas.

import pandas as pd

data = pd.read_csv("transactions.csv", usecols=["article_id"])
purchase_counts = data["article_id"].value_counts()
most_popular_articles = purchase_counts.head(10).index.tolist()

On my machine, this takes about 31 seconds. That might not sound like much, but the dataset still has only a moderate size; things get really ugly with larger datasets. To be fair, about 10 of those seconds are spent loading the CSV file. Using a better file format, such as Parquet, would decrease the loading time.
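
If you plan to rerun the experiments, a one-time conversion to Parquet pays off quickly. A minimal sketch (the file name transactions.parquet is my choice):

import polars as pl

# Convert once; later runs can use pl.read_parquet("transactions.parquet")
# and skip the expensive CSV parsing.
pl.read_csv("transactions.csv").write_parquet("transactions.parquet")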

Note: I used pandas 2.0.1, which is the latest and most optimized version.

Still, to prepare a little more for the polars version, let us redo the pandas version using method chaining, a technique I have grown to love.

most_popular_articles = (
    pd.read_csv("transactions.csv", usecols=["article_id"])
    .squeeze()  # turn the dataframe with one column into a series
    .value_counts()
    .head(10)
    .index
    .tolist()
)

This is lovely since you can read from top to bottom what is happening without the need for a lot of intermediate variables that people usually struggle to name (df_raw → df_filtered → df_filtered_copy → … → df_final anyone?). The run time is the same, however.

Faster Polars Version

Let us implement the same logic in polars using method chaining as well.

import polars as pl

most_popular_articles = (
    pl.read_csv("transactions.csv", columns=["article_id"])
    .get_column("article_id")
    .value_counts()
    .sort("counts", descending=True)  # value_counts does not sort automatically
    .head(10)
    .get_column("article_id")  # there are no indices in polars
    .to_list()
)

Things look pretty similar, except for the running time: 3 seconds instead of 31, which is impressive!

Polars is just SO much faster than pandas.

Unarguably, this is one of the main advantages of polars over pandas. Apart from that, polars also has a convenient syntax for creating complex operations that pandas does not have. We will see more of that when creating the other popularity-based recommender.

It is also important to note that pandas and polars produce the same output as expected.
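
If you want to double-check that claim, a quick assertion works. The variable names below are hypothetical; they assume you stored the pandas and polars results under separate names instead of overwriting most_popular_articles:

# Hypothetical names for the two results from the snippets above.
assert most_popular_articles_pd == most_popular_articles_pl

If two articles happen to tie in their purchase counts, the two libraries could in principle order them differently; apart from such ties, the lists agree.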

Most Popular Articles per Customer

In contrast to our first recommender, we now want to slice the dataframe per customer and get the most popular products for each one. This means that we need the customer_id as well as the article_id.

We illustrate the logic using a small dataframe consisting of only ten transactions from three customers A, B, and C buying four articles 1, 2, 3, and 4. We want to get the top two articles per customer. We can achieve this using the following steps:

Image by the author.
  1. We start with the original dataframe.
  2. We then group by customer_id and article_id and aggregate via a count.
  3. We then aggregate again over the customer_id and collect the article_ids into a list, just as in our last recommender. The twist is that we sort this list by the count column.

That way, we end up with precisely what we want.

  • A bought products 1 and 2 most frequently.
  • B bought products 4 and 2 most frequently. Products 4 and 1 would have been an equally correct answer, but the tie between products 1 and 2 just happened to be broken in favor of product 2.
  • C only bought product 3, so that’s all there is.
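
To make this concrete, here is the toy example as code (a minimal sketch of my own; since ties may break either way, B’s second article could come out as 1 instead of 2):

import polars as pl

# The ten toy transactions from the picture: customers A, B, C buying articles 1 to 4.
toy = pl.DataFrame({
    "customer_id": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "article_id": [1, 1, 2, 2, 4, 4, 1, 2, 3, 3],
})

top_two_per_customer = (
    toy
    .groupby(["customer_id", "article_id"])
    .agg(pl.count())  # first arrow: count how often each customer bought each article
    .groupby("customer_id")  # second arrow
    .agg(pl.col("article_id").sort_by("count", descending=True).head(2))
)
print(top_two_per_customer.sort("customer_id"))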

Step 3 of this procedure sounds especially difficult, but polars lets us handle this conveniently.

most_popular_articles_per_user = (
    pl.read_csv("transactions.csv")
    .groupby(["customer_id", "article_id"])  # first arrow from the picture
    .agg(pl.count())  # first arrow from the picture
    .groupby("customer_id")  # second arrow
    .agg(pl.col("article_id").sort_by("count", descending=True).head(10))  # second arrow
)

By the way: this version already runs for about a minute on my machine. I did not create a pandas version for this, and I’m definitely scared to do so and let it run. If you are brave, give it a try!
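
For the brave, here is a sketch of what such a pandas version might look like (my own construction, not from the article, and I have not timed it on the full dataset):

import pandas as pd

most_popular_articles_per_user_pd = (
    pd.read_csv("transactions.csv")
    .groupby(["customer_id", "article_id"])
    .size()  # count purchases per (customer, article) pair
    .rename("count")
    .reset_index()
    .sort_values(["customer_id", "count"], ascending=[True, False])
    .groupby("customer_id")["article_id"]
    .agg(lambda articles: articles.head(10).tolist())
)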

A Small Improvement

So far, some users might have fewer than ten recommendations, and some might even have none. An easy fix is to pad each customer’s recommendations to ten articles. For example,

  • using random articles, or
  • using the most popular articles across all customers from our first popularity-based recommender.

We can implement the second version like this:

improved_recommendations = (
    most_popular_articles_per_user
    .with_columns([
        pl.col("article_id").fill_null([]).alias("personal_top_<=10"),
        pl.lit([most_popular_articles]).alias("global_top_10"),
    ])
    .with_columns(
        pl.col("personal_top_<=10")
        .arr.concat(pl.col("global_top_10"))
        .arr.head(10)
        .alias("padded_recommendations")
    )
    .select(["customer_id", "padded_recommendations"])
)

Popularity-based recommenders hold a significant position in the realm of recommendation systems due to their simplicity, ease of implementation, and effectiveness as an initial approach and a difficult-to-beat baseline.

In this article, we have learned how to transform the simple idea of popularity-based recommendations into code using the fabulous polars library.

The main disadvantage, especially of the personalized popularity-based recommender, is that the recommendations are not inspiring in any way. People have seen all of the recommended things before, meaning they are stuck in an echo chamber. I expect recommenders that output previously unseen articles to the customers to perform much better since fashion is about creativity and new ideas. This is something worth trying out in a future article! 😉

I hope that you learned something new, interesting, and valuable today. Thanks for reading!

As a final point, if you

  1. want to support me in writing more about machine learning and
  2. plan to get a Medium subscription anyway,

why not do it via this link? This would help me a lot! 😊

To be transparent, the price for you does not change, but about half of the subscription fees go directly to me.

Thanks a lot if you consider supporting me!

If you have any questions, write me on LinkedIn!

