Mean Average Precision at K (MAP@K) clearly explained


Photo by Joshua Burdick on Unsplash.

Mean Average Precision at K (MAP@K) is one of the most commonly used evaluation metrics for recommender systems and other ranking-related tasks. Since this metric is a composition of several error metrics or layers, it may not be that easy to understand at first glance.

This article explains MAP@K and its components step by step. At the end of this article you will also find code snippets showing how to calculate the metric. But before diving into each part of the metric, let’s talk about the WHY first.

MAP@K is an error metric that can be used when the sequence or ranking of your recommended items plays an important role or is the objective of your task. By using it, you get answers to the following questions:

  • Are my generated or predicted recommendations relevant?
  • Are the most relevant recommendations on the first ranks?

Now that you know the WHY, let’s talk about the HOW. The following sections explain MAP@K’s structure step by step in an “onion” style, from the inside (starting with Precision P) to the outside (MAP@K).

To make the steps and their composition easier to understand, we work with the following example: We want to evaluate our recommender system, which recommends six items to potential customers when visiting a product detail page (figure 1).

Figure 1. Recommendation example (image by author).

You might have already heard of precision in books or articles when you learned about error metrics for classification models. Precision can be seen as a measure of quality. High precision means that our model returns more relevant than irrelevant results or recommendations.

Precision can be defined as the fraction of relevant items in all recommended items (relevant + irrelevant items).

Figure 2. Precision formula (image by author).
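In formula form, this definition reads:

\text{Precision} = \frac{\text{number of relevant recommended items}}{\text{total number of recommended items}}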

The example below (figure 3) shows 6 recommended items. Out of these 6 recommendations, 2 are relevant.

Figure 3. Precision example (image by author).

By putting these values into our formula (figure 2), we get a precision of 0.33 (2 relevant items / (2 relevant + 4 irrelevant items)).

The precision metric (figure 2) itself does not consider the rank or order in which the relevant items appear. Time to include the ranks in our precision formula. Precision@K can be defined as the fraction of relevant items in the top K recommended items (figure 4).

Figure 4. Precision@K formula (image by author).
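As a formula (a plain-text version of figure 4):

\text{Precision@}K = \frac{\text{number of relevant items in the top } K \text{ recommendations}}{K}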

The following figure (figure 5) shows our example from above (figure 3) in a ranked scenario.

Figure 5. Precision@K example (image by author).

The Precision@K column shows for each rank (1 to 6) the Precision@K value. The K stands for the number of ranks (1, 2, …, 6) we consider.

Precision@1

If we consider only the first rank (K=1), we have 0 relevant items out of 1 item in total, which leads to a Precision@1 of 0.

Precision@3

Let’s assume we consider the first three ranks (K=3). Among the top 3 recommendations, we then have 1 relevant item and 2 irrelevant ones. If we place these numbers in our formula (figure 4), we get 0.33
(1 relevant item / (1 relevant + 2 irrelevant items)).

Precision@5

Last but not least, let’s consider the first five ranks (K=5). Among the first 5 recommendations, we then have 2 relevant and 3 irrelevant ones. If we do the math again, we get a value of 0.4
(2 relevant items / (2 relevant + 3 irrelevant items)).
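If you prefer to verify these numbers in code, the following minimal sketch reproduces the Precision@K column of figure 5. The relevance flags are an assumption based on our example, with the relevant items sitting at ranks 2 and 4 (as in figure 7 and the code section later on):

# Relevance per rank (1 = relevant, 0 = irrelevant); ranks 2 and 4 hold the relevant items
relevance = [0, 1, 0, 1, 0, 0]

for k in range(1, len(relevance) + 1):
    # Precision@K = relevant items in the top K divided by K
    precision_at_k = sum(relevance[:k]) / k
    print(f"Precision@{k} = {precision_at_k:.2f}")

# prints 0.00, 0.50, 0.33, 0.50, 0.40, 0.33 for K = 1 ... 6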

As we have seen, Precision and Precision@K are pretty straightforward. The next step has a bit more complexity.

Average Precision@K, or AP@K, is the sum of Precision@k over all ranks k at which the item is relevant (rel(k) = 1), divided by the total number of relevant items (r) in the top K recommendations (figure 6).

Figure 6. AP@K formula (image by author).
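In plain notation (mirroring figure 6), with rel(k) equal to 1 if the item at rank k is relevant and 0 otherwise:

\text{AP@}K = \frac{1}{r} \sum_{k=1}^{K} \text{Precision@}k \cdot \text{rel}(k)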

Confused? Let’s have a look at the following example for AP@6 (figure 7).

Figure 7. AP@K example 1 (image by author).

In this example, the total number of relevant items (r) is 2 (at ranks 2 and 4). Therefore, we can place the 2 in the fraction 1/r.

Looking at rank 1: the Precision@1 is 0 and the item is not relevant (grey). Therefore, we multiply its Precision@1 value by 0, which leads to 0 * 0.

On the 2nd rank, however, we have a Precision@2 value of 0.5 and a relevant item. This leads to 0.5 * 1.

On the 3rd rank we have again an irrelevant item and a Precision@3 of 0.33. This results in 0.33 * 0.

We proceed like this over every rank: if rank k contains a relevant item, we multiply its Precision@k by 1; if it is irrelevant, we multiply it by 0, which means it does not contribute to the summation.

The end result for this example is an AP@6 of 0.5.
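For completeness, here is the full calculation, with the Precision@k values taken from the walk-through above and the relevant items at ranks 2 and 4:

\text{AP@}6 = \frac{1}{2} \left( 0 \cdot 0 + 0.5 \cdot 1 + 0.33 \cdot 0 + 0.5 \cdot 1 + 0.4 \cdot 0 + 0.33 \cdot 0 \right) = \frac{1}{2}(0.5 + 0.5) = 0.5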

Before we move on to the last step, do you remember when we said at the beginning that MAP@K can answer the question:

Are the most relevant recommendations on the first ranks?

A great characteristic of this metric is that it penalizes relevant items that appear at lower ranks. To give you a better understanding, let’s look at the following example (figure 8).

Figure 8. Example for relevant items on different ranks. Best case (left), worst case (right). The higher the better (image by author).

Compared to the initial example (figure 7), the number of relevant items has not changed. What has changed is the rank at which they are placed. AP@K (and therefore MAP@K) penalizes your recommendations or model if the relevant items are placed at lower ranks.
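As a quick check of this behavior, assume the best case in figure 8 has the two relevant items at ranks 1 and 2, and the worst case has them at ranks 5 and 6:

\text{AP@}6_{\text{best}} = \frac{1}{2}\left(\frac{1}{1} + \frac{2}{2}\right) = 1.0, \qquad \text{AP@}6_{\text{worst}} = \frac{1}{2}\left(\frac{1}{5} + \frac{2}{6}\right) \approx 0.27

The same two relevant items can therefore score anywhere between 1.0 and roughly 0.27, depending only on where they are ranked.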

The previous steps and examples were all based on evaluating a single query, i.e., the single list of recommendations one visitor gets when browsing the product detail page of product X. But we have more than one visitor…

Mean Average Precision@K or MAP@K considers that. It averages the AP@K for recommendations shown to M users.

Figure 9. MAP@K formula (image by author).
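Written out, with AP@K_m denoting the AP@K of the m-th user (or query):

\text{MAP@}K = \frac{1}{M} \sum_{m=1}^{M} \text{AP@}K_m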

Please note: For the sake of simplicity, I chose “users” in this example. However, depending on your case, M could also be, for example, search queries.

To get a better idea of how MAP@K is calculated, have a look at the example below (figure 10).

Figure 10. Example of MAP@K (image by author).

Based on 3 different users (or queries), the MAP@6 is 0.59.
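The three ranked lists in figure 10 are the same ones used in the code example further below. Their individual AP@6 values are 1.0 (relevant items at ranks 1 and 2), roughly 0.27 (ranks 5 and 6) and 0.5 (ranks 2 and 4), which gives:

\text{MAP@}6 = \frac{1.0 + 0.27 + 0.5}{3} \approx 0.59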

Now that we are familiar with the theory, let’s do some coding. In the following examples, we will work with two lists, actuals and predicted, that contain product_ids.

The goal is to check whether our actual values appear in our predicted list and, if yes, at which rank/place they appear. That’s why the order of the items does not matter in our actuals list but does matter in our predicted list.

AP@K

The following lists reflect the example we used earlier (figure 11).

Figure 11. Recap example from the start (image by author).
actuals = ['p_a', 'p_b']
predicted = ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f']

If we compare the actuals vs. the predicted, we can see that p_a shows up at the 2nd rank and p_b at the 4th one.

def apk(y_true, y_pred, k_max=0):

    # Check if all elements in lists are unique
    if len(set(y_true)) != len(y_true):
        raise ValueError("Values in y_true are not unique")

    if len(set(y_pred)) != len(y_pred):
        raise ValueError("Values in y_pred are not unique")

    # Optionally truncate the predictions to the top k_max ranks
    if k_max != 0:
        y_pred = y_pred[:k_max]

    correct_predictions = 0
    running_sum = 0

    for i, yp_item in enumerate(y_pred):

        k = i + 1  # our rank starts at 1

        # Only relevant items add their Precision@k to the sum
        if yp_item in y_true:
            correct_predictions += 1
            running_sum += correct_predictions / k

    return running_sum / len(y_true)

If we place our two lists in our function

apk(actuals, predicted)

then we get 0.5, as in our manual example (figure 7).
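A note on the optional k_max parameter: it truncates the predicted list before scoring, while the denominator remains the total number of actual items, so relevant items that are cut off simply cannot contribute. For illustration, with our example lists:

apk(actuals, predicted, k_max=3)
# -> 0.25: after truncating to the top 3 ranks, only p_a (rank 2) is found,
#    so the running sum is 1/2 and the result is (1/2) / 2 = 0.25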

MAP@K

Since MAP@K averages over multiple queries, we adjust our lists to the following structure:

actuals = ['p_a', 'p_b']

predicted = [
    ['p_a', 'p_b', 'p_c', 'p_d', 'p_e', 'p_f'],
    ['p_c', 'p_d', 'p_e', 'p_f', 'p_a', 'p_b'],
    ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f'],
]

Our actuals stay the same, but our predicted list now contains several lists (one for each query). The predicted lists correspond to the ones from figure 10.

Figure 12. Recap example from figure 10 (image by author).

The code below shows the calculation of MAP@K:

import numpy as np

def mapk(actuals, predicted, k_max=0):
    # Average the AP@K over all recommendation lists (one per user or query)
    return np.mean([apk(actuals, p, k_max) for p in predicted])

If we place our lists in this function

mapk(actuals, predicted)

then we get 0.59.
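To see the individual contributions behind that average, you can also print the AP@6 of each ranked list before taking the mean:

for p in predicted:
    print(apk(actuals, p))
# -> 1.0, 0.2666..., 0.5; the mean of these AP@6 values is roughly 0.59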

When evaluating recommender systems or ranking models, MAP@K is a great choice. It not only tells you whether your recommendations are relevant but also considers the rank of your correct predictions.

Due to its ranking considerations, it is a bit more challenging to understand than other standard error metrics like the F1 score. But I hope that this article gave you a comprehensive explanation of how this error metric is calculated and implemented.

If you want to measure not only if your recommendations are relevant but also how relevant they are, feel free to check out Normalized Discounted Cumulative Gain (NDCG).

Introduction to Information Retrieval — Evaluation, https://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-1-per.pdf

