Point Net for Segmentation – 3D Deep Learning

Photo by Grant Porter on Unsplash

This is the fourth part of the Point Net Series:

  1. An Intuitive Introduction to Point Net
  2. Point Net from Scratch
  3. Point Net for Classification
  4. Point Net for Semantic Segmentation

In this tutorial we will learn how to train Point Net for semantic segmentation on the Stanford 3D Indoor Scene Data Set (S3DIS). S3DIS is a 3D data set containing point clouds of indoor spaces from several buildings and covers an area of more than 6000m² [1]. Point Net is a novel architecture that consumes entire point clouds and is capable of classification and segmentation tasks [2]. If you have been following the Point Net series, you already know how it works and how to code it.

In the previous tutorial, we learned how to train Point Net for classification on a mini version of the ShapeNet data set. In this tutorial we will train Point Net for Semantic Segmentation using the S3DIS data set. Code for this tutorial is located in this repository, and we will be working out of this notebook.

Here’s an overview of what’s to come:

  • Data Set Overview
  • Methodology
  • Model Training
  • Model Evaluation
  • Conclusion
  • References

Data Set Overview

The full S3DIS data set used in this tutorial can be downloaded by requesting access here. The data set is divided into six different areas that correspond to different buildings. Within each area there are different indoor spaces that correspond to different rooms, such as offices or conference rooms. There are two versions of this data set, the raw and the aligned; we choose to use the aligned version. The aligned version is the same as the raw except that each point cloud is rotated so that the x-axis is aligned along the entrance of the room, the y-axis is perpendicular to the entrance wall, and the z-axis remains the vertical axis. This alignment forms a canonical (i.e. common) coordinate system that allows us to exploit the consistent structures that are found in each of the point clouds [1].

Data Reduction

The data set is nearly 30GB on disk (6GB zipped), but we have a reduced version that only takes up ~6 GB unzipped. During the data reduction the true data point colors were removed and all data points were converted to float32, leaving us with Nx4 arrays containing the (x, y, z) points and a class label. Each space has been partitioned into subspaces of approximately 1×1 meter and saved as an HDF5 file. Going through the process is outside the scope of this tutorial, but here is the notebook used to generate the reduced data set.
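As a rough illustration of the reduced format, one of these partition files could be read back with h5py as sketched below; the file name and dataset key are hypothetical and may differ from what the reduction notebook actually writes.

import h5py
import numpy as np

# hypothetical file name; each file stores one ~1x1 m partition as an Nx4 float32 array
with h5py.File('area_1_office_1_partition_0.hdf5', 'r') as f:
    key = list(f.keys())[0]                      # grab whatever dataset key the file uses
    data = np.asarray(f[key], dtype=np.float32)  # shape (N, 4)

xyz = data[:, :3]                                # (x, y, z) coordinates
labels = data[:, 3].astype(np.int64)             # per-point class ids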

Data Hyper-parameters

We may not often think of hyper-parameters when it comes to data, but the types of augmentations (and even the normalization) are in fact hyper-parameters since they play an important role in the learning process [3]. Our data hyper-parameters can be broken down into two categories: the actual transforms themselves (e.g. image rotation vs. image warping) and the parameters that control the transforms (e.g. the image rotation angle). The model cannot learn either of these directly, and we typically adjust them based on validation performance just as we do for the model hyper-parameters (e.g. learning rate, batch size). It is also worth noting that data hyper-parameters can greatly increase a model’s capability to learn, and this can be verified empirically.

In the training and validation sets we add random Gaussian noise with a standard deviation of 0.01. Exclusively in the training set, we randomly rotate about the vertical axis with a probability of 0.25. Restricting the rotations to the vertical axis allows the base structure to vary, while the floor, walls, and ceiling (background) all maintain a similar relation across all partitions. For all splits, we perform min/max normalization so that each partition ranges from 0–1. Similar to [2], we randomly downsample each partition to 4096 points on the fly during training and validation. During test we prefer to use more points to obtain a better understanding of model performance, so we downsample to 15000 points.
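A minimal sketch of these transforms, written in plain NumPy under the assumption that each partition is an (N, 3) coordinate array, is shown below; the actual transforms in the data set script may be organized differently.

import numpy as np

def transform_partition(points, n_points=4096, rot_prob=0.25, noise_std=0.01, train=True):
    # points: (N, 3) array of (x, y, z) coordinates for one partition

    # random rotation about the vertical (z) axis, training only
    if train and np.random.rand() < rot_prob:
        theta = np.random.uniform(0, 2*np.pi)
        c, s = np.cos(theta), np.sin(theta)
        rot_z = np.array([[c, -s, 0.],
                          [s,  c, 0.],
                          [0., 0., 1.]])
        points = points @ rot_z.T

    # additive Gaussian noise (used for training and validation)
    points = points + np.random.normal(0.0, noise_std, size=points.shape)

    # min/max normalization so the partition spans 0-1 on each axis
    mins, maxs = points.min(axis=0), points.max(axis=0)
    points = (points - mins) / (maxs - mins + 1e-8)

    # random downsampling to a fixed number of points (4096 train/val, 15000 test)
    choice = np.random.choice(len(points), n_points, replace=len(points) < n_points)
    return points[choice], choice   # also return indices so the labels can be subsampled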

The PyTorch data set script is located here; please follow along with the notebook to see how to generate the data sets. For our splits we use areas 1–4 for training, area 5 for validation, and area 6 for test.

Data Exploration

An example of a full space is shown in figure 1, while an example of a regular vs. rotated partition is shown in figure 2.

Figure 1. A full space with color denoting different classes. Source: Author.
Figure 2. Regular vs. rotated training partition. Source: Author.

Now let’s explore the training class frequencies, which are displayed in figure 3. We can see that this data set is highly imbalanced, and some of the classes seem to constitute background classes (ceiling, floor, wall). We should note that the clutter class is actually a category for any miscellaneous object, such as a whiteboard or picture on a wall, or a printer on a desk.

Figure 3. Class Frequencies of S3DIS dataset. Source: Author.
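The counts behind figure 3 can be tallied directly from the labels. A quick sketch, assuming the training labels have been concatenated into a single array called all_train_labels (a hypothetical name) and that CATEGORIES is the class-name list used later in the training setup:

import numpy as np

# all_train_labels: 1D array of integer class ids gathered from every training partition
classes, counts = np.unique(all_train_labels, return_counts=True)
for c, n in zip(classes, counts):
    print(f'{CATEGORIES[c]:<10} {n:>10} ({100*n/counts.sum():.1f}%)')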

Problem Definition

When you hear Semantic Segmentation you may think of images, since it is the concept of recognizing every pixel in a given image [4]. Segmentation generalizes to higher-dimensional spaces, and for 3D point clouds it is the concept of assigning a class to every 3D point. To better understand what this problem consists of, we should have a good understanding of what a point cloud actually is. Let’s consider the classes that we want to segment. If you look at figure 2, you will notice that every class (except clutter) has unique and consistent structures: walls, floors, and ceilings are flat and continuous planes, and things like chairs and bookcases are expected to have consistent structures across many different areas. We want our model to be able to recognize the different structures of the different classes with some degree of accuracy, and we will need to construct a loss function that entices our model to learn these structures in a useful manner.

Loss Function

In figure 3, we can clearly see that this data set is imbalanced. We address this in a manner similar to that of the classification tutorial: we incorporate the Balanced Focal Loss, which is based on the Cross Entropy Loss with a couple of extra terms that scale it. The first scaling factor is a class weighting (alpha) that determines the importance of each class; this is where the “balanced” term comes from. We may either use inverse class weights or set it manually as a hyper-parameter. The second term is what transforms the Balanced Cross Entropy Loss into the Balanced Focal Loss. This term is deemed a modulating factor, and it forces the model to focus on the difficult classes, i.e. those that are predicted with low confidence [5]. The modulating factor is controlled via a hyper-parameter gamma, as shown in figure 4. Gamma typically ranges from 0 to 5, but the best value really depends on the situation.

Figure 4. Focal Loss for class t. The alpha is the class weight, the term raised to the gamma power is the modulating term, and the log term is the Cross Entropy Loss. Source: [5]
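To make the formula in figure 4 concrete, here is a minimal PyTorch sketch of a balanced focal loss built on top of the standard cross entropy; it is only meant to mirror the equation, and the PointNetSegLoss class in the repository may be structured differently.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=1.0):
    # logits: (M, C) raw class scores, targets: (M,) class ids,
    # alpha: (C,) tensor of class weights
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction='none')  # -alpha_t * log(p_t)
    pt = torch.exp(-F.cross_entropy(logits, targets, reduction='none'))    # p_t for the true class
    return ((1.0 - pt) ** gamma * ce).mean()                               # apply modulating factor

In our setup the total training loss is this focal term plus the Dice Loss described next, which is presumably what the dice=True flag in the later training setup enables.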

The authors of [1] suggest that the semantic segmentation problem is actually better approached as a detection problem rather than a segmentation problem. We won’t expand much on that here, but we will make an attempt to take overall class structure into account with our loss function. Our model needs to learn the basic representations of the class structures; it needs to learn that they are continuous rather than sparse. We can incorporate the Dice Loss to help account for this. The Dice Score quantifies how well our predicted classes overlap with the ground truth. The Dice Loss is just one minus the Dice Coefficient and is shown in figure 5; we add epsilon to avoid division by zero [6].

Figure 5. Dice Loss. Source: [6].

We incorporate the Dice Loss to discourage the model from predicting sparse structure classifications. That is, we would prefer an entire wall to be segmented, rather than a mix of wall and clutter. During training, we add the Focal Loss and Dice Loss and use that as our loss. The code for the loss function is available here, and the Dice Loss code in PyTorch is given below:

@staticmethod
def dice_loss(predictions, targets, eps=1):

    # flatten the batch so every point is treated the same
    targets = targets.reshape(-1)
    predictions = predictions.reshape(-1)

    # classes present in this batch
    cats = torch.unique(targets)

    top = 0   # running intersection (correctly predicted points, per class)
    bot = 0   # running size of the prediction and target sets, per class
    for c in cats:
        locs = targets == c

        y_tru = targets[locs]
        y_hat = predictions[locs]

        top += torch.sum(y_hat == y_tru)
        bot += len(y_tru) + len(y_hat)

    # Dice Loss = 1 - Dice Coefficient; eps (smoothing) avoids division by zero
    return 1 - 2*((top + eps)/(bot + eps))

Note that the authors, in their implementation, did not use feature matrix regularization for semantic segmentation, so we will not use it either.

Model Hyper-parameters

The model hyper-parameters are listed in the training setup code below; once again, the notebook is here.

import numpy as np
import torch
import torch.optim as optim
from point_net_loss import PointNetSegLoss

EPOCHS = 100
LR = 0.0001

# manually set alpha weights (CATEGORIES is the class list defined in the notebook)
alpha = np.ones(len(CATEGORIES))
alpha[0:3] *= 0.25 # balance background classes (ceiling, floor, wall)
alpha[-1] *= 0.75  # balance clutter class

gamma = 1

# seg_model and DEVICE are defined earlier in the notebook
optimizer = optim.Adam(seg_model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.0001, max_lr=0.01,
                                              step_size_up=2000, cycle_momentum=False)
criterion = PointNetSegLoss(alpha=alpha, gamma=gamma, dice=True).to(DEVICE)

We manually weight the background and clutter classes and set gamma equal to 1 for our Focal Loss. We use the Adam optimizer with a cyclic learning rate scheduler. The author of [7] notes that the learning rate is the most important hyper-parameter and suggests that a cyclic learning rate (CLR) may produce better results faster without the need to heavily tune the learning rate. We have taken the CLR approach, and most of the hyper-parameter tuning effort for this experiment was instead concentrated on the data hyper-parameters. We should also note that using a CLR led to better model performance than a static learning rate.
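Since CyclicLR varies the learning rate within an epoch, the scheduler is stepped once per batch rather than once per epoch. A stripped-down sketch of the training loop is shown below; train_dataloader is assumed to be defined in the notebook, and the model forward and criterion signatures are simplified assumptions (the real versions live in the notebook).

for epoch in range(EPOCHS):
    seg_model.train()
    for points, targets in train_dataloader:      # dataloader assumed from the notebook
        points, targets = points.to(DEVICE), targets.to(DEVICE)

        optimizer.zero_grad()
        preds = seg_model(points)                 # simplified: the real model may take a
                                                  # transposed input and return extra outputs
        loss = criterion(preds, targets)          # Balanced Focal Loss + Dice Loss
        loss.backward()
        optimizer.step()
        scheduler.step()                          # CyclicLR is stepped once per batch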

Training Results

During training, we tracked the loss, accuracy, Matthews Correlation Coefficient (MCC), and Intersection over Union (IOU). The training results are shown in figure 6.

Figure 6. Training metrics. Source: Author.

We see around epoch 30 that the validation loss starts to become unstable; despite this, the metrics still improve. The jaggedness of the metrics is typical of a cyclic learning rate, since the metrics tend to peak at the end of each cycle [7]. We know from the classification tutorial that the MCC is usually a better representation of classification performance than the F1 score or accuracy [8]. Even though we are training for segmentation, it is still a nice metric to observe. What we are really interested in is the IOU (or Jaccard Index). This is because the classes are not just categories; they are continuous structures contained within the point cloud. We want to see the percent overlap of our predictions with the ground truth, which is what the IOU quantifies. The figure below shows how to compute the IOU in terms of sets.

Figure 6. The Jaccard index (Intersection over Union). Source: [9]

We compute the IOU in PyTorch via:

def compute_iou(targets, predictions):

    # flatten so the IOU is computed over every point in the batch
    targets = targets.reshape(-1)
    predictions = predictions.reshape(-1)

    intersection = torch.sum(predictions == targets) # true positives
    union = len(predictions) + len(targets) - intersection

    return intersection / union

Test Results

From our training, we find that the model saved at the 68th epoch produces the best IOU performance on the test set. Test metrics for area 6 are shown below in figure 7; for this area we used 15000 points per partition rather than the 4096 used for training and validation. The weights learned by the model carry over to the denser test point clouds since all splits possess similar structures.

Figure 7. Test Metrics for model 68. Source: Author.

Segmentation Results

To really assess our model, we have made a special function in the data loader to obtain the partitions that make up a full space. We can then stitch these partitions together to get the full space. This way we can see how an entire predicted space compares with the ground truth. Once again the code for the data set is located here, and we can grab a random space with:

points, targets = s3dis_test.get_random_partitioned_space()
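A hedged sketch of running the model over every partition of this space and stitching the predictions back together follows; the tensor layout, the model outputs, and the use of argmax are assumptions, and the real version is in the notebook.

import torch

seg_model.eval()
with torch.no_grad():
    out = seg_model(points.to(DEVICE))           # may require a transposed (B, 3, N) input
    preds = out[0] if isinstance(out, tuple) else out
    pred_classes = preds.argmax(dim=-1).cpu()    # hard label per point, per partition

# flatten the partition dimension to recover the full space
full_points = points.reshape(-1, 3)
full_preds = pred_classes.reshape(-1)
full_targets = targets.reshape(-1)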

Results on a couple of full test spaces are shown in figure 8; these office layouts show good segmentation results. You can see that clutter (in black) appears to be somewhat randomly assigned in the predicted point clouds.

Figure 8. Segmentation results for two random test spaces. Source: Author.

The full view is nice to see, but it’s still important to check how the model performs on each partition. This allows us to really see how well the model learns the structures of the classes. The segmentation results of various partitions are shown in figure 9.

Figure 9. Segmentation results of various test partitions. Source: Author.

In the partition example in the top right of figure 9, you will see that the model has trouble defining the border between clutter (black) and table (aqua). A general observation is that any excessive perturbation tends to be labelled as clutter. Overall the model’s performance is fairly good, as it is able to obtain reasonable segmentation performance as quantified by the IOU. We can also observe fairly reasonable performance on the test spaces, as seen in figure 8.

Critical Sets

If you recall from the intro to Point Net article, Point Net is capable of learning what is essentially the skeleton of a point cloud structure, something that [2] refers to as the Critical Set. In the classification tutorial we were able to view the learned Critical Sets, and we will do the same for this tutorial. We use 1024 points for each partition since this is the dimension of the Global Features learned by the model. Code for stitching together and displaying the Critical Set for an entire space is given below. Please see the notebook for more details.

import numpy as np
import open3d as o3

# points, crit_idxs, and targets hold the partitions of one full space
# (gathered from the model in the notebook); move them to the CPU for Open3D
points = points.to('cpu')
crit_idxs = crit_idxs.to('cpu')
targets = targets.to('cpu')

pcds = []
for i in range(points.shape[0]):
    # one partition at a time
    pts = points[i, :]
    cdx = crit_idxs[i, :]
    tgt = targets[i, :]

    # keep only the critical points and color them by their true class
    # (v_map_colors is a vectorized class-to-color map defined in the notebook)
    critical_points = pts[cdx, :]
    critical_point_colors = np.vstack(v_map_colors(tgt[cdx])).T/255

    pcd = o3.geometry.PointCloud()
    pcd.points = o3.utility.Vector3dVector(critical_points)
    pcd.colors = o3.utility.Vector3dVector(critical_point_colors)

    pcds.append(pcd)

# o3.visualization.draw_plotly([pcds]) # works in Colab
draw(pcds, point_size=5) # non-Colab (draw comes from the notebook's Open3D setup)

We display the Critical Set using the ground truth labels for color, and a result is shown in the figure below.

Figure 9. Comparison between the Ground Truth and learned Critical Set for a random test space. Source: Author.

A GIF of another random Critical Set is presented in figure 10. From this it is clearer that the Critical Set maintains the base structure of the indoor space.

Figure 10. GIF of Critical Set of a random test space. Source: Author.

Conclusion

In this tutorial we learned about the S3DIS data set and how to train Point Net on it. We learned how to combine loss functions in order to achieve good segmentation performance. Even though we trained on partitions of the spaces, we were able to stitch those partitions back together and visualize the model’s performance on the test set, where we observed good results. We were also able to view the learned Critical Sets and confirm that the model is actually learning the underlying structures of the indoor spaces. Even though this model has performed fairly well, there is still room for improvement; here are some suggestions for future work.

Suggestions for future work

  • Use a different loss function
    – Apply different weightings to the Focal and Dice Losses
  • Use k-fold cross validation with areas 1–5 to tune hyper-parameters
    – Train on areas 1–5 once the hyper-parameters are found
    – Use area 6 for test
  • Increase the augmentation intensity as the training epochs increase
  • Try a different model
  • Implement the detection pipeline described in Appendix D of [2]

[1] Armeni, I., Sener, O., Zamir, A. R., Jiang, H., Brilakis, I., Fischer, M., & Savarese, S. (2016). 3D semantic parsing of large-scale indoor spaces. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.170

[2] Charles, R. Q., Su, H., Kaichun, M., & Guibas, L. J. (2017). PointNet: Deep Learning on point sets for 3D classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2017.16

[3] Ottoni, A. L., de Amorim, R. M., Novo, M. S., & Costa, D. B. (2022). Tuning of data augmentation hyperparameters in deep learning to building construction image classification with small datasets. International Journal of Machine Learning and Cybernetics. https://doi.org/10.1007/s13042-022-01555-1

[4] Wang, T. (n.d.). Semantic segmentation. www.cs.toronto.edu. Retrieved December 17, 2022, from https://www.cs.toronto.edu/~tingwuwang/semantic_segmentation.pdf

[5] Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2017.324

[6] Zhou, T., Ruan, S., & Canu, S. (2019). A review: Deep Learning for Medical Image segmentation using multi-modality fusion. Array, 3–4, 100004. https://doi.org/10.1016/j.array.2019.100004

[7] Smith, L. N. (2017). Cyclical learning rates for training neural networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). https://doi.org/10.1109/wacv.2017.58

[8] Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1). https://doi.org/10.1186/s12864-019-6413-7

[9] Wikimedia Foundation. (2022, October 6). Jaccard index. Wikipedia. Retrieved December 17, 2022, from https://en.wikipedia.org/wiki/Jaccard_index

