# Mean Average Precision at K (MAP@K) clearly explained

## One of the most popular evaluation metrics for recommender or ranking problems step by step explained

Mean Average Precision at K (MAP@K) is one of the most commonly used evaluation metrics for recommender systems and other **ranking** related classification tasks. Since this metric is a composition of different error metrics or layers, it may not be that easy to understand at first glance.

This article explains MAP@K and its components **step by step**. At the end of this article you will also find **code snippets** showing how to calculate the metric. But before diving into each part of the metric, let’s talk about the **WHY** first.

MAP@K is an error metric that can be used when the **sequence** or **ranking** of your recommended items **plays an important role or is the objective of your task**. By using it, you get answers to the following questions:

- Are my generated or predicted recommendations **relevant**?
- Are the **most relevant** recommendations on the **first ranks**?

Now that you know the WHY, let’s talk about the **HOW**. The following chapters explain MAP@K’s structure step by step in an “onion” style, from the inside (starting with Precision P) to the outside (MAP@K).

To make the steps and their composition easier to understand, we work with the following example: We want to evaluate our recommender system, which recommends six items to potential customers when visiting a product detail page (figure 1).

You might have already heard of precision in books or articles when you learned about error metrics for classification models. Precision can be seen as a measure of **quality**. High precision means that our model **returns more relevant than irrelevant results** or recommendations.

Precision can be defined as the fraction of **relevant items** in **all recommended items (relevant + irrelevant items)**.

The example below (figure 3) shows 6 recommended items. Out of these 6 recommendations, 2 are relevant.

By putting these values in our formula (figure 2), we get a precision of **0.33** (**2** relevant items / (**2** relevant **+** **4** irrelevant items)).
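As a quick sanity check, this calculation can be sketched in a few lines of Python (the variable names are mine, just for illustration):

```python
# Precision = relevant recommendations / all recommendations
relevant = 2
irrelevant = 4

precision = relevant / (relevant + irrelevant)
print(round(precision, 2))  # 0.33
```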

The precision metric (figure 2) itself does not consider the rank or order in which the relevant items appear. Time to include the ranks in our precision formula. Precision**@K** can be defined as the fraction of relevant items in the top **K** recommended items (figure 4).

The following figure (figure 5) shows our example from above (figure 3) in a **ranked** scenario.

The **Precision@K** column shows for each rank (1 to 6) the Precision@K value. The **K** stands for the **number of ranks** (1, 2, …, 6) we consider.

## Precision@1

Assuming we would consider **only the first rank** (**K=1**), we would then have **0 relevant items** divided by **1** (total items), which leads to **0** for Precision@1.

## Precision@3

Let’s assume we consider the **first three ranks** (K=3). Under the top 3 recommendations we then have **1 relevant item** and **2 irrelevant ones**. If we place these numbers in our formula (figure 4), we get **0.33** (**1** relevant item / (**1** relevant + **2** irrelevant items)).

## Precision@5

Last but not least, let’s consider the **first five ranks** (K=5). Under the first 5 recommendations we then have **2 relevant** and **3 irrelevant** ones. If we do the math again, we get a value of **0.4** (**2** relevant items / (**2** relevant + **3** irrelevant items)).
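These three values can be reproduced with a short sketch. The `relevance` list and the `precision_at_k` helper below are illustrative (not code from this article); the 0/1 flags mirror the ranked example, with the relevant items on ranks 2 and 4:

```python
# 1 = relevant, 0 = irrelevant, ordered by rank
relevance = [0, 1, 0, 1, 0, 0]

def precision_at_k(relevance, k):
    """Fraction of relevant items among the top k recommendations."""
    return sum(relevance[:k]) / k

print(precision_at_k(relevance, 1))            # 0.0
print(round(precision_at_k(relevance, 3), 2))  # 0.33
print(precision_at_k(relevance, 5))            # 0.4
```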

As we have seen, Precision and [email protected] are pretty straightforward. The next step has a bit more complexity.

The Average Precision@K, or AP@K, is the sum of the Precision@k values at every rank k where the item is **relevant** (`rel(k)`), divided by the **total number of relevant items** (r) in the top K recommendations (figure 6).
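Expressed as a formula (one common way to write the verbal definition above, with `rel(k)` being 1 if the item on rank k is relevant and 0 otherwise):

```latex
AP@K = \frac{1}{r} \sum_{k=1}^{K} P@k \cdot rel(k)
```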

Confused? Let’s have a look at the following example for AP@6 (figure 7).

The total number of relevant items (r) in this example is **2** (on ranks 2 and 4). Therefore, we can place the 2 in the fraction `1/r`.

Looking at rank 1: Precision@1 is 0 and the item is not relevant (grey). Therefore, we multiply its Precision@1 value by 0, which leads to `0 * 0`.

On the 2nd rank, however, we have a Precision@2 value of 0.5 and a relevant item. This leads to `0.5 * 1`.

On the 3rd rank we again have an irrelevant item and a Precision@3 of 0.33. This results in `0.33 * 0`.

We proceed like this over each rank: if rank k contains a relevant item, we multiply its Precision@k by 1; if it is irrelevant, we multiply it by 0, which means it does not increase our summation.

The **end result** for this example is an **AP@6** of **0.5**.

Before we move on to the last step, do you remember when we said at the beginning that MAP@K can answer the question:

> Are the **most relevant** recommendations on the **first ranks**?

A great characteristic of this metric is that it penalizes placing relevant items on lower ranks. To give you a better understanding, let’s look at the following example (figure 8).

Compared to the initial example (figure 7), the number of relevant items has not changed. But what has changed is the rank in which they are placed. AP@K (and therefore MAP@K) penalizes your recommendations or model if the relevant items are placed on lower ranks.
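The penalty is easy to see in numbers. The `average_precision` helper below is a hypothetical sketch (not code from this article); it compares relevant items on ranks 2 and 4 with the same two items pushed down to, e.g., the last two ranks:

```python
def average_precision(relevance):
    """AP over a list of 0/1 relevance flags ordered by rank."""
    hits, running_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            running_sum += hits / k  # Precision@k at each relevant rank
    return running_sum / sum(relevance)

print(average_precision([0, 1, 0, 1, 0, 0]))            # ranks 2 and 4 -> 0.5
print(round(average_precision([0, 0, 0, 0, 1, 1]), 2))  # ranks 5 and 6 -> 0.27
```

Same relevant items, lower ranks, noticeably lower score.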

The previous steps and examples were all based on evaluating **one single query** or one single list of recommendations one visitor gets when browsing the product detail page of product X. But we have more than one visitor…

Mean Average Precision@K, or MAP@K, considers that. It averages the AP@K for recommendations shown to *M* users.
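Expressed as a formula (again, one common way to write it), with M being the number of users or queries:

```latex
MAP@K = \frac{1}{M} \sum_{m=1}^{M} (AP@K)_m
```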

Please note: for the sake of simplicity, I chose “users” in this example. Depending on your case, M could also be, e.g., search queries.

To get a better idea of how MAP@K is calculated, let’s look at the following example (figure 10).

Based on 3 different users (or queries), the MAP@K is **0.59**.
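As a sanity check, the **0.59** can be reconstructed by hand from the three ranked lists in figure 10 (with `p_a` and `p_b` as the relevant items):

```python
# AP@6 per user: average the Precision@k values at the ranks
# where p_a and p_b appear (2 relevant items -> divide by 2)
ap_user_1 = (1/1 + 2/2) / 2  # p_a on rank 1, p_b on rank 2 -> 1.0
ap_user_2 = (1/5 + 2/6) / 2  # p_a on rank 5, p_b on rank 6 -> ~0.27
ap_user_3 = (1/2 + 2/4) / 2  # p_a on rank 2, p_b on rank 4 -> 0.5

map_at_6 = (ap_user_1 + ap_user_2 + ap_user_3) / 3
print(round(map_at_6, 2))  # 0.59
```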

Now that we are familiar with the theory, let’s do some coding. In the following examples we work with two lists, `actuals` and `predicted`, that contain product_ids.

The goal is to check if our actual values appear in our predicted list and, if yes, on which rank/place they appear. That’s why the order of the items does not matter in our `actuals` list but does matter in our `predicted` list.

## AP@K

The following lists reflect the example we used earlier (figure 11).

```python
actuals = ['p_a', 'p_b']
predicted = ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f']
```

If we compare the actuals vs. the predicted, we can see that `p_a` shows up on the **2nd** rank and `p_b` on the **4th** one.

```python
def apk(y_true, y_pred, k_max=0):
    # Check if all elements in the lists are unique
    if len(set(y_true)) != len(y_true):
        raise ValueError("Values in y_true are not unique")
    if len(set(y_pred)) != len(y_pred):
        raise ValueError("Values in y_pred are not unique")

    # Optionally cut off the predictions after the first k_max ranks
    if k_max != 0:
        y_pred = y_pred[:k_max]

    correct_predictions = 0
    running_sum = 0

    for i, yp_item in enumerate(y_pred):
        k = i + 1  # our rank starts at 1
        if yp_item in y_true:
            correct_predictions += 1
            running_sum += correct_predictions / k

    return running_sum / len(y_true)
```

If we place our two lists in our function `apk(actuals, predicted)`, we get **0.5**, like in our manual example (figure 7).

## MAP@K

Since MAP@K averages over multiple queries, we adjust our lists to the following structure:

```python
actuals = ['p_a', 'p_b']
predicted = [
    ['p_a', 'p_b', 'p_c', 'p_d', 'p_e', 'p_f'],
    ['p_c', 'p_d', 'p_e', 'p_f', 'p_a', 'p_b'],
    ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f'],
]
```

Our `actuals` stay the same, but our `predicted` list now contains several lists (one for each query). The `predicted` lists correspond to the ones from figure 10.

The code below shows the calculation of MAP@K:

```python
import numpy as np
from itertools import product

def mapk(actuals, predicted, k=0):
    # Pair the single actuals list with each predicted list, then average
    return np.mean([apk(a, p, k) for a, p in product([actuals], predicted)])
```

If we place our lists in this function `mapk(actuals, predicted)`, we get **0.59**.

When evaluating recommender systems or ranking models, MAP@K is a great choice. It not only provides insight into whether your recommendations are relevant but also considers the rank of your correct predictions.

Due to its ranking considerations, it is a bit more challenging to understand than other standard error metrics like the F1 score. But I hope that this article gave you a comprehensive explanation of how this error metric is calculated and implemented.

If you want to measure not only **if your recommendations are relevant** but also **how relevant they are**, feel free to check out Normalized Discounted Cumulative Gain (NDCG).
