
Mean Average Precision at K (MAP@K) is one of the most commonly used evaluation metrics for recommender systems and other ranking-related classification tasks. Since this metric is a composition of different error metrics or layers, it may not be that easy to understand at first glance.

This article explains MAP@K and its components step by step. At the end of this article you will also find code snippets showing how to calculate the metric. But before diving into each part of the metric, let’s talk about the WHY first.

MAP@K is an error metric that can be used when the sequence or ranking of your recommended items plays an important role or is the objective of your task. By using it, you get answers to the following questions:

• Are my generated or predicted recommendations relevant?
• Are the most relevant recommendations on the first ranks?

Now that you know the WHY, let’s talk about the HOW. The following chapters explain MAP@K’s structure step by step in an “onion” style, from the inside (starting with precision P) to the outside (MAP@K).

To make the steps and their composition easier to understand, we work with the following example: We want to evaluate our recommender system, which recommends six items to potential customers when visiting a product detail page (figure 1).

You might have already heard of precision in books or articles when you learned about error metrics for classification models. Precision can be seen as a measure of quality. High precision means that our model returns more relevant than irrelevant results or recommendations.

Precision can be defined as the fraction of relevant items in all recommended items (relevant + irrelevant items).

The example below (figure 3) shows 6 recommended items. Out of these 6 recommendations, 2 are relevant.

By putting these values into our formula (figure 2), we get a precision of 0.33 (`2 relevant items / (2 relevant + 4 irrelevant items)`).

The precision metric (figure 2) itself does not consider the rank or order in which the relevant items appear. Time to add the ranks to our precision formula. Precision@K can be defined as the fraction of relevant items in the top K recommended items (figure 4).

The following figure (figure 5) shows our example from above (figure 3) in a ranked scenario.

The Precision@K column shows the P@K value for each rank (1 to 6). The K stands for the number of ranks (1, 2, …, 6) we consider.

## P@1

Assuming we consider only the first rank (K=1), we would have 0 relevant items divided by 1 (total items), which leads to 0 for P@1.

## P@3

Let’s assume we consider the first three ranks (K=3). Among the top 3 recommendations we then have 1 relevant item and 2 irrelevant ones. If we place these numbers in our formula (figure 4), we get 0.33 (`1 relevant item / (1 relevant + 2 irrelevant items)`).

## P@5

Last but not least, let’s consider the first five ranks (K=5). Among the first 5 recommendations we then have 2 relevant and 3 irrelevant ones. If we do the math again, we get a value of 0.4 (`2 relevant items / (2 relevant + 3 irrelevant items)`).
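The three P@K values above can be reproduced with a few lines of Python. This is a minimal sketch (not the article’s later `apk` function), using the same item labels and the same ranked list that appear in the code section of this article, with the relevant items sitting on ranks 2 and 4:

```python
def precision_at_k(relevant, ranked, k):
    """Fraction of relevant items among the top k recommended items."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

relevant = {"p_a", "p_b"}
ranked = ["p_d", "p_a", "p_c", "p_b", "p_e", "p_f"]  # relevant items on ranks 2 and 4

print(precision_at_k(relevant, ranked, 1))            # 0.0
print(round(precision_at_k(relevant, ranked, 3), 2))  # 0.33
print(precision_at_k(relevant, ranked, 5))            # 0.4
```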

As we have seen, precision and P@K are pretty straightforward. The next step has a bit more complexity.

The Average Precision@K or AP@K is the sum of P@k over all ranks k at which the item is relevant (`rel(k)`), divided by the total number of relevant items (r) in the top K recommendations (figure 6).

Confused? Let’s have a look at the following example for AP@K (figure 7).

The total number of relevant items (r) in this example is 2 (on ranks 2 and 4). Therefore, we can place the 2 in the fraction `1/r`.

Let’s look at rank 1: the P@1 is 0 and the item is not relevant (grey). Therefore, we multiply its P@1 value by 0, which leads to `0 * 0`.

On the 2nd rank, however, we have a P@2 value of 0.5 and a relevant item. This leads to `0.5 * 1`.

On the 3rd rank we again have an irrelevant item and a P@3 of 0.33. This results in `0.33 * 0`.

We proceed by going over each rank. If rank k contains a relevant item, we multiply its P@K by 1. If it is irrelevant, we multiply it by 0, which means it does not increase our sum.

The end result for this example is an AP@K of 0.5.
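The walkthrough above boils down to one short sum. A minimal sketch, with the P@K values read off the per-rank walkthrough (relevant items on ranks 2 and 4):

```python
# P@K values for ranks 1..6, and a relevance flag per rank
# (1 = relevant item on that rank, 0 = irrelevant):
p_at_k = [0.0, 0.5, 1/3, 0.5, 0.4, 1/3]
rel = [0, 1, 0, 1, 0, 0]

r = sum(rel)  # total number of relevant items: 2
ap = sum(p * flag for p, flag in zip(p_at_k, rel)) / r
print(ap)  # 0.5
```

Only the ranks holding relevant items contribute: `(0.5 + 0.5) / 2 = 0.5`.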

Before we move on to the last step, do you remember when we said at the beginning that MAP@K can answer the question:

Are the most relevant recommendations on the first ranks?

A great characteristic of this metric is that it penalizes relevant items placed in the lower ranks. To give you a better understanding, let’s look at the following example (figure 8).

Compared to the initial example (figure 7), the number of relevant items has not changed. What has changed is the ranks in which they are placed. AP@K (and therefore MAP@K) penalizes your recommendations or model if the relevant items are placed on lower ranks.
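This penalization is easy to see in code. A minimal sketch: the first list below is the ranking from the earlier walkthrough (relevant items on ranks 2 and 4); the second list is an illustrative reordering of my own (figure 8 is not reproduced here) that pushes the same relevant items to the last two ranks:

```python
def ap_at_k(relevant, ranked):
    """AP@K for one ranked list: sum of P@k over ranks holding relevant items, divided by r."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant)

relevant = {"p_a", "p_b"}
early = ["p_d", "p_a", "p_c", "p_b", "p_e", "p_f"]  # relevant items on ranks 2 and 4
late = ["p_d", "p_c", "p_e", "p_f", "p_a", "p_b"]   # same items, pushed to ranks 5 and 6

print(ap_at_k(relevant, early))           # 0.5
print(round(ap_at_k(relevant, late), 2))  # 0.27
```

Same items, same number of relevant hits, but the lower placement roughly halves the score.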

The previous steps and examples were all based on evaluating one single query or one single list of recommendations one visitor gets when browsing the product detail page of product X. But we have more than one visitor…

Mean Average Precision@K or MAP@K considers that. It averages the AP@K for recommendations shown to M users.

Please note: for the sake of simplicity I chose “users” in this example. However, depending on your case, M could also be, e.g., search queries.

To get a better idea of how MAP@K is calculated, let’s look at the following example (figure 10).

Based on 3 different users (or queries), the MAP@K is 0.59.
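The averaging step itself is just an arithmetic mean. Assuming the three per-user rankings match the lists shown in the code section of this article (the article states they correspond to figure 10), their AP@K values work out to 1.0, ≈0.27, and 0.5:

```python
# AP@K per user, computed with the AP@K formula above:
ap_values = [1.0, 0.2667, 0.5]

map_at_k = sum(ap_values) / len(ap_values)
print(round(map_at_k, 2))  # 0.59
```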

Now that we are familiar with the theory, let’s do some coding. In the following examples we will work with two lists, `actuals` and `predicted`, that contain product IDs.

The goal is to check whether our actual values appear in our predicted list and, if yes, at which rank they appear. That’s why the order of the items does not matter in our `actuals` list but does in our `predicted` list.

## AP@K

The following lists reflect the example we used earlier (figure 11).

```python
actuals = ['p_a', 'p_b']
predicted = ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f']
```

If we compare the actuals with the predicted, we can see that `p_a` shows up on the 2nd rank and `p_b` on the 4th one.

```python
def apk(y_true, y_pred, k_max=0):
    # Check if all elements in the lists are unique
    if len(set(y_true)) != len(y_true):
        raise ValueError("Values in y_true are not unique")
    if len(set(y_pred)) != len(y_pred):
        raise ValueError("Values in y_pred are not unique")

    if k_max != 0:
        y_pred = y_pred[:k_max]

    correct_predictions = 0
    running_sum = 0

    for i, yp_item in enumerate(y_pred):
        k = i + 1  # our rank starts at 1
        if yp_item in y_true:
            correct_predictions += 1
            running_sum += correct_predictions / k

    return running_sum / len(y_true)
```

If we place our two lists in our function

`apk(actuals, predicted)`

then we get 0.5 like in our manual example (figure 7).

## MAP@K

Since MAP@K averages over multiple queries, we adjust our lists to the following structure:

```python
actuals = ['p_a', 'p_b']
predicted = [
    ['p_a', 'p_b', 'p_c', 'p_d', 'p_e', 'p_f'],
    ['p_c', 'p_d', 'p_e', 'p_f', 'p_a', 'p_b'],
    ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f'],
]
```

Our `actuals` list stays the same, but our `predicted` list now contains several lists (one for each query). The `predicted` lists correspond to the ones from figure 10.

The code below shows the calculation of MAP@K:

```python
import numpy as np

def mapk(actuals, predicted, k=0):
    # Average the AP@K over all predicted lists, each scored against the same actuals
    return np.mean([apk(actuals, p, k) for p in predicted])
```

If we place our lists in this function

`mapk(actuals, predicted)`

then we get 0.59.

When evaluating recommender systems or ranking models, MAP@K is a great choice. It not only provides insights into whether your recommendations are relevant but also takes the rank of your correct predictions into account.

Due to its ranking considerations, it is a bit more challenging to understand than other standard error metrics like the F1 score. But I hope that this article gave you a comprehensive explanation of how this error metric is calculated and implemented.

If you want to measure not only if your recommendations are relevant but also how relevant they are, feel free to check out Normalized Discounted Cumulative Gain (NDCG).

Introduction to Information Retrieval — Evaluation, https://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-1-per.pdf
