RoIAlign Explained. With Python Implementation
by Alexey Kravets, December 2022


https://unsplash.com/@wansan_99

Introduction

In this tutorial we are going to reproduce and explain the RoIAlign function from torchvision.ops.roi_align in Python. I couldn’t find any code online that exactly reproduces the torchvision library’s results, so I went through the C++ implementation in torchvision (which you can find here) and translated it into Python.

Background

Region of Interest (RoI) in computer vision can be defined as a region in an image where a potential object might be located in an object detection task. An example of RoI proposals is shown in Figure 1 below.

https://web.eecs.umich.edu/~justincj/slides/eecs498/WI2022 — Figure 1

One of the object detection models where RoIs are involved is Faster R-CNN. Faster R-CNN can be described in two phases: a Region Proposal Network (RPN), which proposes RoIs and predicts whether each RoI contains an object or background, and a classification network, which predicts the object class contained in the RoI as well as offsets, i.e. transformations that move and resize the RoIs and hence turn them into final proposals that enclose the object better in the bounding box. The classification network also rejects negative proposals which do not contain objects; these negative proposals are classified as background.
It’s important to know that RoIs are predicted not in the original image space but in the feature space extracted by a vision model. The image below illustrates this idea:

https://web.eecs.umich.edu/~justincj/slides/eecs498/WI2022 — Figure 2

We pass the original image through a pretrained vision model and extract a 3D tensor of features, in the case above of spatial size 20×15. This size can differ, however, depending on which layer we extract features from and which vision model we use. As we can see, we can find the exact correspondence between the box in original image coordinates and in image feature coordinates. Now, why do we really need RoI pooling?
The problem with the RoIs is that they are all of different sizes, while the classification network requires fixed-size features.

https://web.eecs.umich.edu/~justincj/slides/eecs498/WI2022 — Figure 3

Thus, RoI pooling enables us to map all the RoIs to the same size, e.g. into 3×3 fixed-size features, and predict the classes they contain and the offsets. There are several variations of RoI pooling; in this article we will focus on RoIAlign. Let’s finally see how this is implemented!
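As a quick preview, this fixed-size mapping is what torchvision.ops.roi_align provides out of the box. Here is a minimal sketch, using a random feature map and two made-up boxes of different sizes, just to show that both come out as 3×3 features:

import torch
from torchvision.ops import roi_align

# random (batch, channels, height, width) feature map, as in Figure 2
feats = torch.randn(1, 256, 15, 20)

# two RoIs of different sizes: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0., 1., 2., 9., 13.],
                     [0., 4., 3., 18., 8.]])

pooled = roi_align(feats, rois, output_size=(3, 3))
print(pooled.shape)  # torch.Size([2, 256, 3, 3]): one fixed-size feature per RoI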

Set up

Let’s first define an example feature map to work with. We assume we are at the stage where we have extracted 7×7 features from an image of interest.

Image by Author — Figure 3

Now, let’s assume we extracted a RoI with the coordinates shown in red in Figure 4 (we omit the feature values in the boxes):

Image by Author — Figure 4

In Figure 4 we also divided our RoI into 4 sub-regions because we are pooling into a 2×2 feature. With RoIAlign we usually do average pooling.

Image by Author — Figure 5

Now the question is, how do we average pool these sub-regions? We can see they are misaligned with the grid, so we cannot simply average the cells within each sub-region. The solution is to sample regularly spaced points in each sub-region and compute their values with bilinear interpolation.

Bi-linear interpolation and pooling

First we need to come up with the points to interpolate in each sub-region of the RoI. Below we choose to pool into a 2×2 region and print the points we want to interpolate values for.

import numpy as np

# 7x7 image features
img_feats = np.array([
    [0.5663671 , 0.2577112 , 0.20066682, 0.0127351 , 0.07388048, 0.38410962, 0.2822853 ],
    [0.3358975 , 0.        , 0.        , 0.        , 0.        , 0.        , 0.07561569],
    [0.23596162, 0.        , 0.        , 0.        , 0.        , 0.        , 0.04612046],
    [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
    [0.        , 0.        , 0.        , 0.        , 0.        , 0.        , 0.        ],
    [0.        , 0.        , 0.        , 0.        , 0.        , 0.18630868, 0.        ],
    [0.        , 0.        , 0.        , 0.        , 0.        , 0.00289604, 0.        ]],
    dtype=np.float32)

# roi proposal
roi_proposal = [2.2821481227874756, 0.3001725673675537, 4.599632263183594, 5.58889102935791]
roi_start_w, roi_start_h, roi_end_w, roi_end_h = roi_proposal

# pooling regions size
pooled_height = 2
pooled_width = 2

# RoI width and height
roi_width = roi_end_w - roi_start_w
roi_height = roi_end_h - roi_start_h
# roi_height= 5.288, roi_width = 2.317

# we divide each RoI sub-region into roi_bin_grid_h x roi_bin_grid_w areas.
# These define the number of sampling points in each sub-region
roi_bin_grid_h = np.ceil(roi_height / pooled_height)
roi_bin_grid_w = np.ceil(roi_width / pooled_width)
# roi_bin_grid_h = 3, roi_bin_grid_w = 2
# Thus overall we have 6 sampling points in each sub-region

# raw height and width of each RoI sub-region
bin_size_h = roi_height / pooled_height
bin_size_w = roi_width / pooled_width
# bin_size_h = 2.644, bin_size_w = 1.158

# variable to be used to calculate pooled value in each sub-region
output_val = 0

# ph and pw index each square (sub-region) the RoI is divided into.
ph = 0
pw = 0
# iy and ix represent sampled points within each sub-region in RoI.
# In this example roi_bin_grid_h = 3 and roi_bin_grid_w = 2, thus we
# have overall 6 points for which we interpolate the values and then average
# them to come up with a value for each of the 4 areas in pooled RoI
# sub-regions
for iy in range(int(roi_bin_grid_h)):
    # ph * bin_size_h - which square in RoI to pick vertically (on y axis)
    # (iy + 0.5) * bin_size_h / roi_bin_grid_h - which of the roi_bin_grid_h
    # points vertically to select within the square
    yy = roi_start_h + ph * bin_size_h + (iy + 0.5) * bin_size_h / roi_bin_grid_h
    for ix in range(int(roi_bin_grid_w)):
        # pw * bin_size_w - which square in RoI to pick horizontally (on x axis)
        # (ix + 0.5) * bin_size_w / roi_bin_grid_w - which of the roi_bin_grid_w
        # points horizontally to select within the square
        xx = roi_start_w + pw * bin_size_w + (ix + 0.5) * bin_size_w / roi_bin_grid_w
        print(xx, yy)

# xx and yy values:
# 2.57 0.74
# 3.15 0.74
# 2.57 1.62
# 3.15 1.62
# 2.57 2.50
# 3.15 2.50
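Note that the snippet above keeps ph = pw = 0, i.e. it only generates the sampling points of the first sub-region. A small sketch of how the same logic extends over all four bins of the pooled 2×2 output (the sampling_points dictionary is just for illustration):

# collect the sampling points of every bin (ph, pw) of the pooled 2x2 output
sampling_points = {}
for ph in range(pooled_height):
    for pw in range(pooled_width):
        points = []
        for iy in range(int(roi_bin_grid_h)):
            yy = roi_start_h + ph * bin_size_h + (iy + 0.5) * bin_size_h / roi_bin_grid_h
            for ix in range(int(roi_bin_grid_w)):
                xx = roi_start_w + pw * bin_size_w + (ix + 0.5) * bin_size_w / roi_bin_grid_w
                points.append((xx, yy))
        sampling_points[(ph, pw)] = points

# sampling_points[(0, 0)] reproduces the six points printed above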

In Figure 6 we can see the corresponding 6 sample points for sub-region 1.

Image by Author — Figure 6

To do the bilinear interpolation of the value corresponding to the first point, with coordinates (2.57, 0.74), we find the box this point falls into. We take the floor of these values, (2, 0), which corresponds to the top-left point of the box (x_low, y_low), and then, adding 1 to these coordinates, we find the bottom-right point (x_high, y_high) of the box, (3, 1). This is represented in the figure below:

Image by Author

According to Figure 3, the feature value at point (2, 0) (column 2, row 0) is 0.2, at point (3, 0) it is 0.012, and so on. Continuing the previous code, inside the inner loop we find the interpolated value for the red point inside the sub-region:

# height and width of the feature map
height, width = img_feats.shape

x = xx; y = yy
if y <= 0: y = 0
if x <= 0: x = 0
y_low = int(y); x_low = int(x)

if y_low >= height - 1:
    y_high = y_low = height - 1
    y = y_low
else:
    y_high = y_low + 1

if x_low >= width - 1:
    x_high = x_low = width - 1
    x = x_low
else:
    x_high = x_low + 1

# compute weights and bilinear interpolation
ly = y - y_low; lx = x - x_low
hy = 1. - ly; hx = 1. - lx
w1 = hy * hx; w2 = hy * lx; w3 = ly * hx; w4 = ly * lx

output_val += w1 * img_feats[y_low, x_low] + w2 * img_feats[y_low, x_high] + \
              w3 * img_feats[y_high, x_low] + w4 * img_feats[y_high, x_high]

So for the red point we have the following result:

Image by Author — Figure 7
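As a sanity check, here is the same computation written out numerically for the red point (values rounded; the feature values come from Figure 3):

# red point (2.57, 0.74): x_low, y_low = 2, 0 and x_high, y_high = 3, 1
lx = 2.5718 - 2; ly = 0.7409 - 0   # fractional offsets inside the box
hx = 1 - lx;     hy = 1 - ly
w1 = hy * hx; w2 = hy * lx; w3 = ly * hx; w4 = ly * lx
# neighbouring feature values: (2,0) -> 0.2007, (3,0) -> 0.0127,
# (2,1) -> 0. and (3,1) -> 0.
val = w1 * 0.2007 + w2 * 0.0127 + w3 * 0. + w4 * 0.
# val ~= 0.0241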

If we then do it for all the 6 points in the sub-region, we get the following results:

# interpolated values for each point in the sub-region
[0.0241, 0.0057, 0., 0., 0., 0.]

# if we then take the average we get the pooled average value for
# the first region:
0.004973
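In code, this is just the final division of output_val by the number of sampling points once the double loop finishes; a minimal sketch:

count = int(roi_bin_grid_h * roi_bin_grid_w)  # 3 * 2 = 6 sampling points
output_val /= count
# output_val ~= 0.004973 for the first sub-region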

At the end we get the following average pooled results:

Image by Author — Figure 8

The full code:
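Below is a minimal sketch assembling the snippets above into a single function. The names roi_align_single and bilinear_interpolate are mine, not torchvision’s API, but the logic follows the C++ implementation discussed in this article, including the aligned and sampling_ratio options explained in the next section:

import numpy as np

def bilinear_interpolate(img_feats, y, x):
    # interpolate img_feats at the real-valued location (y, x)
    height, width = img_feats.shape
    if y < -1.0 or y > height or x < -1.0 or x > width:
        return 0.0  # the sampling point lies entirely outside the feature map
    y = max(y, 0.0); x = max(x, 0.0)
    y_low = int(y); x_low = int(x)
    if y_low >= height - 1:
        y_high = y_low = height - 1
        y = float(y_low)
    else:
        y_high = y_low + 1
    if x_low >= width - 1:
        x_high = x_low = width - 1
        x = float(x_low)
    else:
        x_high = x_low + 1
    ly = y - y_low; lx = x - x_low
    hy = 1.0 - ly; hx = 1.0 - lx
    return (hy * hx * img_feats[y_low, x_low] + hy * lx * img_feats[y_low, x_high] +
            ly * hx * img_feats[y_high, x_low] + ly * lx * img_feats[y_high, x_high])

def roi_align_single(img_feats, roi, pooled_height, pooled_width,
                     sampling_ratio=-1, aligned=False):
    # average-pool one RoI (x1, y1, x2, y2) from a single 2D feature map
    offset = 0.5 if aligned else 0.0  # pixel shift for better alignment
    roi_start_w = roi[0] - offset; roi_start_h = roi[1] - offset
    roi_end_w = roi[2] - offset; roi_end_h = roi[3] - offset
    roi_width = roi_end_w - roi_start_w
    roi_height = roi_end_h - roi_start_h
    if not aligned:  # legacy behaviour: force malformed RoIs to be at least 1x1
        roi_width = max(roi_width, 1.0)
        roi_height = max(roi_height, 1.0)
    bin_size_h = roi_height / pooled_height
    bin_size_w = roi_width / pooled_width
    # number of sampling points per bin, computed adaptively if not given
    if sampling_ratio > 0:
        roi_bin_grid_h = roi_bin_grid_w = sampling_ratio
    else:
        roi_bin_grid_h = int(np.ceil(roi_height / pooled_height))
        roi_bin_grid_w = int(np.ceil(roi_width / pooled_width))
    count = max(roi_bin_grid_h * roi_bin_grid_w, 1)
    output = np.zeros((pooled_height, pooled_width), dtype=np.float32)
    for ph in range(pooled_height):
        for pw in range(pooled_width):
            output_val = 0.0
            for iy in range(roi_bin_grid_h):
                yy = roi_start_h + ph * bin_size_h + (iy + 0.5) * bin_size_h / roi_bin_grid_h
                for ix in range(roi_bin_grid_w):
                    xx = roi_start_w + pw * bin_size_w + (ix + 0.5) * bin_size_w / roi_bin_grid_w
                    output_val += bilinear_interpolate(img_feats, yy, xx)
            output[ph, pw] = output_val / count
    return output

# pooled 2x2 output for the example RoI above
print(roi_align_single(img_feats, roi_proposal, 2, 2))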

Additional comments to the code

The code above contains some additional features we did not discuss that I will briefly explain here:

  • you can change the align variable (aligned in torchvision) to be either True or False. If True, the box coordinates are pixel-shifted by -0.5 for better alignment with the two neighboring pixel indices. This version is used in Detectron2.
  • sampling_ratio defines the number of sampling points in each sub-region of a RoI, as illustrated in Figure 6, where 6 sampling points were used. If sampling_ratio = -1, then it’s computed adaptively, as we saw in the first code snippet (see also the check against torchvision after this list):
roi_bin_grid_h = np.ceil(roi_height / pooled_height)
roi_bin_grid_w = np.ceil(roi_width / pooled_width)
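To check a reconstruction like the roi_align_single sketch above against the library itself, we can call torchvision directly on the same example; a minimal sketch, assuming img_feats and roi_proposal as defined earlier:

import torch
from torchvision.ops import roi_align

feats = torch.from_numpy(img_feats)[None, None]  # shape (1, 1, 7, 7)
boxes = torch.tensor([[0.0] + roi_proposal])     # (batch_index, x1, y1, x2, y2)

ours = roi_align_single(img_feats, roi_proposal, 2, 2,
                        sampling_ratio=-1, aligned=False)
theirs = roi_align(feats, boxes, output_size=(2, 2),
                   sampling_ratio=-1, aligned=False)
print(np.allclose(ours, theirs[0, 0].numpy(), atol=1e-6))  # True if they match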

Conclusions

In this article we have seen how RoIAlign works and how it is implemented in the torchvision library. RoIAlign can be seen as a layer in a neural network architecture, and like every layer you can propagate forward and backward through it, enabling you to train your models end-to-end.
After reading this article I would encourage you to also read about RoI pooling and why RoIAlign is preferred to it. If you understood RoIAlign, understanding RoI pooling shouldn’t be a problem.

