
Multimodal Data Augmentation in Detectron2 | by Faruk Cankaya | October 2022



A step-by-step guide to implementing a new data augmentation method that needs the image, masks, and bounding boxes at the same time, such as Simple Copy Paste

Photo by Sigmund on Unsplash

Table of Contents

Introduction
How do data augmentations work in Detectron2?
Implementing Multimodal Augmentations
Use case 1: Instance Color Jitter Augmentation
Use case 2: Copy Paste Augmentation

Detectron2 is one of the most powerful deep learning toolboxes for visual recognition tasks. It allows you to switch easily between recognition tasks such as object detection and panoptic segmentation. It also provides many built-in modules: dataloaders for popular datasets, an extensive model zoo, visualization utilities, data augmentation, and more. If you are not familiar with Detectron2, you can check my Detectron2 Starter Guide for Researchers article, where I give an overview of the Detectron2 API and mention some missing features that are not provided out of the box.

Detectron2 provides 13 data augmentation methods as of October 2022, including RandomFlip, Resize, and RandomCrop. All of these methods operate on a single image and are variously called ‘image manipulation methods’, ‘classic/traditional image augmentation methods’, or ‘geometric/color image augmentation methods’. While they may be sufficient for many deep learning tasks, the literature offers many other image data augmentation methods. For example, object-aware data augmentations copy instances from one image to another. In this way, we can achieve more robust models by increasing dataset size and diversity.

Figure 1: CopyPasteAugmentation + LargeScaleJittering (Dog image from Mattys Flicks, smiling balloon image from Timothy Tolle, green&orange balloons image from William Warby, white balloon image from Stewart Black on Flickr. All are licensed under CC BY 2.0)

For object-aware augmentation, we need object masks in addition to the image itself. Unfortunately, the current augmentation architecture of Detectron2 doesn’t allow implementing such multi-modal augmentations out of the box. In this article, I’ll first give an overview of the data flow and augmentation structure of Detectron2, highlighting important points and bottlenecks of the architecture. Then, I’ll show my way of extending Detectron2 to support multi-modal augmentations. Finally, we’ll implement two new object-aware augmentations step by step using the proposed concept. The first, ‘InstanceColorJitterAugmentation’, randomly changes the color of instances in the image. The second, ‘CopyPasteAugmentation’, is a simplified version of Simple Copy Paste (2021). Both augmentations are proofs of concept; I recommend you verify them before using them in production.

How do data augmentations work in Detectron2?

Augmentations in Detectron2 are implemented by extending Augmentation and Transform, and they are applied in DatasetMapper through AugInput. Since it might be hard to grasp the relations between these classes from this description alone, I illustrate them in Figure 2.

Figure 2: Image Data Augmentation Flow in Detectron2. (Illustration by Author)

Dataflow:

  • Data is loaded from files into memory by a dataset script. In most cases, a data item has a file path to the image, a mask in polygon or binary bitmask format, a bounding box as a list or numpy array, and other related metadata.
  • MapDataset selects an item from the dataset and forwards it to DatasetMapper. It is also responsible for handling error cases: if DatasetMapper cannot handle the selected item and returns None, MapDataset selects a different item from the dataset and retries.
  • DatasetMapper is the class where augmentation and all other data manipulations actually happen. It holds a set of augmentations and applies them to the data (image, masks, etc.) stored in AugInput.

Building Blocks:

  • Augmentation defines which transformation is applied in its get_transform method and returns that transformation. When an augmentation is executed, e.g. augmentations(aug_input), its Augmentation.__call__ method extracts the required arguments (e.g. image) from aug_input, creates the transformation to apply via get_transform, passes the created transform to AugInput to be executed, and returns it. It is important to note that the returned transformations are deterministic: they can be reused later to transform other data. For example, suppose you want to resize an image and, of course, its masks. By default, AugInput accepts only the image as an argument. When you apply the augmentation with transforms = augs(aug_input), the image is transformed in place inside aug_input. You can then apply the exact same transformation to the masks with transforms.apply_segmentation(mask), as shown in the sketch after this list.
  • Transform is responsible for actually executing the transformation operations. It has methods such as apply_image and apply_segmentation that define how to transform each data type.
  • AugInput stores the inputs needed by Augmentation. By default, it supports an image, bounding boxes, and a semantic segmentation mask. It transforms each data type by calling the corresponding Transform methods such as apply_image, apply_box, and apply_segmentation.
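
To make the flow above concrete, here is a minimal sketch of the standard single-image augmentation API using built-in augmentations; image, mask, and boxes are assumed to be numpy arrays you have already loaded:

```python
from detectron2.data import transforms as T

# image: HxWxC uint8 array; mask: HxW array; boxes: Nx4 array in XYXY format.
augs = T.AugmentationList([
    T.RandomFlip(prob=0.5),
    T.ResizeShortestEdge(short_edge_length=(640, 800), max_size=1333),
])

aug_input = T.AugInput(image)
transforms = augs(aug_input)  # random parameters are sampled exactly once
image = aug_input.image       # transformed in place inside aug_input

# The returned TransformList is deterministic, so the very same
# transformation can be replayed on the other modalities:
mask = transforms.apply_segmentation(mask)
boxes = transforms.apply_box(boxes)
```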

Limitations of the current architecture

In the current architecture, augmentations can only be applied to images, bounding boxes, and masks separately. For example, in the instance segmentation task, the given augmentations transform the image and return the applied transformations; object instance masks can only be transformed afterwards, via the apply_segmentation method of the returned transforms. For object-aware augmentations, we need the image and the masks at the same time, so that we can extract object instances from the image. To this end, we can add a new method that takes both images and masks to the Transform class.

The other missing feature for multi-modal augmentation is the ability to sample additional data points from the dataset. With it, we could implement augmentation methods like MixUp, CutMix, and Simple Copy Paste that need multiple images. This could be achieved by (1) manipulating MapDataset to pass multiple data points to DatasetMapper, (2) returning additional images and masks alongside the actual data in Dataset, or (3) passing the dataset instance to the augmentation method that needs it. The first two options seemed to require too much implementation work, and they are not flexible across scenarios: Simple Copy Paste requires 2 images, but Mosaic requires 4, so we would have to decide how many data points to return depending on the augmentations in use. Therefore, I went with the third option, which lets each augmentation method sample new data points from the dataset however it likes.

Implementing Multimodal Augmentations

I introduced the MultiModalAugmentation and MultiModalTransform abstractions to be able to detect when a multi-modal augmentation is applied. MultiModalAugmentation is an empty marker class that extends Augmentation. MultiModalTransform extends Transform and adds an .apply_multi_modal() method that every newly created multi-modal transform must implement.
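
A minimal sketch of these two abstractions, assuming apply_multi_modal operates on the image, the instance masks, and the boxes together (the exact signature is a design choice; see the notebook for the full version):

```python
from abc import abstractmethod
from detectron2.data.transforms import Augmentation, Transform

class MultiModalAugmentation(Augmentation):
    """Marker class so the data pipeline can detect multi-modal augmentations."""
    pass

class MultiModalTransform(Transform):
    """A Transform that needs several modalities at the same time."""

    @abstractmethod
    def apply_multi_modal(self, image, masks, boxes):
        """Transform image, instance masks, and boxes together.

        Returns the transformed (image, masks, boxes) tuple.
        """
        pass
```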

We also need to adapt DatasetMapper and AugInput to use the abstractions above. Since these classes belong to the Detectron2 library, I created new classes that extend them instead of modifying the library directly. You can see which parts of the code changed in Figure 3 below.

Figure 3: Multi-modal Image Data Augmentation Flow in Detectron2. (Illustration by Author)
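
On the input side, the key change is that the extended AugInput routes multi-modal transforms through the new method, while classic transforms keep their per-modality path. A sketch with illustrative names, assuming the extended class additionally carries instance masks:

```python
from detectron2.data import transforms as T

class MultiModalAugInput(T.AugInput):
    """AugInput that also carries instance masks."""

    def __init__(self, image, *, boxes=None, masks=None):
        super().__init__(image, boxes=boxes)
        self.masks = masks

    def transform(self, tfm):
        if isinstance(tfm, MultiModalTransform):
            # Multi-modal transforms see all modalities at once.
            self.image, self.masks, self.boxes = tfm.apply_multi_modal(
                self.image, self.masks, self.boxes
            )
        else:
            # Classic transforms are applied modality by modality.
            super().transform(tfm)
            if self.masks is not None:
                self.masks = [tfm.apply_segmentation(m) for m in self.masks]
```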

I’ll show how this abstraction can be used in the real world with two use cases.

Use case 1: Instance Color Jitter Augmentation

We use the publicly available balloon segmentation dataset, which has only one class: balloon. Its images were collected from Flickr by limiting the license type to “Commercial use & mods allowed”, as stated here. The goal is very simple: randomly change the color of the balloons in the images. For this task, we only need the additional object masks to be able to locate a particular balloon instance. To this end, I created a new transform that extends MultiModalTransform. The logic that changes the color is executed in the apply_multi_modal() method below:
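
The full listing lives in the notebook; the following is a minimal sketch of such a transform, with illustrative parameter names (instance_change_rate, color_factor):

```python
import numpy as np
from PIL import Image, ImageEnhance

class InstanceColorJitterTransform(MultiModalTransform):
    """Randomly recolors individual instances, given their binary masks."""

    def __init__(self, instance_change_rate=0.5, color_factor=10.0):
        super().__init__()
        self._set_attributes(locals())  # fvcore helper: stores args as attributes

    def apply_multi_modal(self, image, masks, boxes):
        # Color-jitter the whole image once, then paste the jittered pixels
        # back only where a randomly selected instance mask is set.
        jittered = np.asarray(
            ImageEnhance.Color(Image.fromarray(image)).enhance(self.color_factor)
        )
        for mask in masks:
            if np.random.rand() < self.instance_change_rate:
                image = np.where(mask[..., None].astype(bool), jittered, image)
        return image, masks, boxes

    # Geometry is untouched, so the classic per-modality hooks are no-ops.
    def apply_image(self, img):
        return img

    def apply_coords(self, coords):
        return coords
```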

Now, the only remaining step is to apply this augmentation. It can be plugged into Detectron2’s existing architecture like this:
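
A sketch of the wiring; InstanceColorJitterAugmentation (the Augmentation wrapper that returns the transform above) and MultiModalDatasetMapper (the extended DatasetMapper) are illustrative names for the classes from my notebook:

```python
from detectron2.data import build_detection_train_loader
from detectron2.data import transforms as T

augs = [
    InstanceColorJitterAugmentation(instance_change_rate=0.5),  # assumed wrapper
    T.RandomFlip(prob=0.5),
]
mapper = MultiModalDatasetMapper(cfg, is_train=True, augmentations=augs)
train_loader = build_detection_train_loader(cfg, mapper=mapper)
```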

Here I used the ImageEnhance.Color enhancer from Pillow to change the color by a factor of 10, applied directly to randomly selected balloon instances. You can use any function you like; the sky’s the limit 🙂 The final output will look like Figure 4:

Figure 4: InstanceColorJitterAugmentation with the instance change rate of 50%. (Dog image from Mattys Flicks, smiling balloon image from Timothy Tolle on Flickr. All are licensed under CC BY 2.0)

I trained Mask R-CNN with this augmentation method on the whole balloon dataset using Detectron2’s tutorial notebook. You can find all the code and training results in this notebook.

Use case 2: Copy Paste Augmentation

We’ll use the same balloon dataset for this example, too. The goal of CopyPasteAugmentation is to copy randomly selected balloons from one image into another, so this augmentation requires sampling additional images from the dataset. We achieve this by passing the dataset instance to CopyPasteAugmentation:

```python
copy_paste_aug = CopyPasteAugmentation(
    dataset=dataset, image_format=cfg.INPUT.FORMAT, pre_augs=pre_augs
)
```

Disclaimer: This is not a complete implementation of Simple Copy Paste, but a proof of concept showing that the proposed abstractions (MultiModalAugmentation & MultiModalTransform) can be used to implement various augmentations.

Since the full code of CopyPasteAugmentation is too long to include here, you can check it in this notebook.
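
Stripped of the scale jittering and blending details, the core pasting idea fits in a few lines. A sketch, assuming both images have already been resized to the same shape:

```python
import numpy as np

def copy_paste(dst_image, dst_masks, src_image, src_masks, paste_prob=0.5):
    """Paste randomly selected source instances onto the destination image.

    Assumes src_image and dst_image share the same HxWxC shape.
    """
    pasted_masks = []
    for mask in src_masks:
        if np.random.rand() < paste_prob:
            m = mask.astype(bool)
            dst_image = np.where(m[..., None], src_image, dst_image)
            pasted_masks.append(m)
    if pasted_masks:
        # Newly pasted instances occlude the originals underneath them.
        occlusion = np.any(np.stack(pasted_masks), axis=0)
        dst_masks = [np.logical_and(m.astype(bool), ~occlusion) for m in dst_masks]
    return dst_image, list(dst_masks) + pasted_masks
```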

Figure 5: CopyPasteAugmentation + LargeScaleJittering (Red balloon image from Blondinrikard Fröberg, green&orange balloons image from William Warby, white balloon image from Stewart Black on Flickr. All are licensed under CC BY 2.0)

Surprise: As in the previous use case, I trained Mask R-CNN with CopyPasteAugmentation on the whole balloon dataset. With just this augmentation method, we achieved better results than both the official baseline and InstanceColorJitterAugmentation. Check the training notebook here.

In this article, I gave some background on how data augmentations work in Detectron2, with illustrations. Building on that introduction, I explained how a new augmentation method that needs multiple modalities, such as image and mask, can be implemented. Then I showed how I implemented two such augmentations with concrete examples. I published all resources used in this article here, and you can test the augmentations shown in the use cases on Google Colab. Since I haven’t tested this abstraction in production yet, issues may surface around memory consumption, parallelism, multi-GPU training, etc. If you encounter a problem or use this abstraction in your work, let me know in the comments.

