Multimodal Data Augmentation in Detectron2 | by Faruk Cankaya | Oct, 2022
GETTING STARTED, DATA AUGMENTATION, DETECTRON2, TUTORIAL
A step-by-step guide to implementing a new data augmentation method that needs image, mask, and bounding boxes at the same time such as Simple Copy Paste
Table of Contents
— Introduction
— How do data augmentations work in Detectron2?
— Implementing Multimodal Augmentations
— Usecase 1: Instance Color Jitter Augmentation
— Usecase 2: Copy Paste Augmentation
Detectron2 is one of the most powerful deep learning toolboxes for visual recognition tasks. It allows easily switching between recognition tasks such as object detection and panoptic segmentation. It also has many built-in modules: dataloaders for popular datasets, extensive network models, visualization, data augmentation, etc. If you are not familiar with Detectron2, you can check my Detectron2 Starter Guide for Researchers article, where I gave an overview of the Detectron2 API and mentioned some missing features that are not provided out of the box.
Detectron2 provides 13 data augmentation methods as of October 2022, such as RandomFlip, Resize, and RandomCrop. All of these methods are applied to a single image, and they are called ‘image manipulation methods’, ‘classic/traditional image augmentation methods’, or ‘geometric/color image augmentation methods’. Although they may be sufficient for many deep learning tasks, there are many other image data augmentation methods in the literature. For example, object-aware data augmentations copy instances from one image to another. In this way, we can achieve more robust models by increasing dataset size and diversity.
For object-aware augmentation, we need object masks in addition to the image itself. Unfortunately, the current augmentation architecture of Detectron2 doesn’t allow implementing such multi-modal augmentations out of the box. In this article, I’ll first give an overview of the data flow and augmentation structure of Detectron2, highlighting the important points and bottlenecks of the architecture. Then, I’ll show my way of extending Detectron2 to support multi-modal augmentations. Finally, we’ll implement two new object-aware augmentations step by step using the proposed concept. The first, ‘InstanceColorJitterAugmentation’, randomly changes the color of instances in the image. The second, ‘CopyPasteAugmentation’, is a simplified version of Simple Copy Paste (2021). Both augmentations are proofs of concept; I recommend you verify them before using them in production.
Augmentations in Detectron2 are implemented by extending Augmentation and Transform, and they are applied in DatasetMapper through AugInput. Since it might be hard to understand the relation between classes from this description, I tried to illustrate the relation in Figure 2.
Dataflow:
- Data is loaded from files into memory by a dataset script. In most cases, a data point has a file path to the image, a mask in `polygon` or binary `bitmask` format, a bounding box in `list` or `numpy array` format, and other related metadata.
- MapDataset selects an item from the dataset and forwards it to DatasetMapper. It is also responsible for handling error cases: if DatasetMapper cannot handle the selected item and returns `None`, MapDataset selects a different item from the dataset and retries.
- DatasetMapper is the actual class where augmentation and all other data manipulations happen. It holds a set of augmentations and applies them to the data (image, masks, etc.) stored in AugInput.
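To make the dataset script’s output concrete, here is a minimal dataset dict in Detectron2’s standard format; the file name and coordinate values are made up for illustration:

```python
# A minimal dataset dict in Detectron2's standard format.
# File name and coordinates are illustrative values only.
record = {
    "file_name": "images/balloon_001.jpg",     # file path to the image
    "image_id": 1,
    "height": 480,
    "width": 640,
    "annotations": [
        {
            "bbox": [100.0, 120.0, 300.0, 380.0],  # bounding box as a list
            "bbox_mode": 0,                        # BoxMode.XYXY_ABS
            "category_id": 0,
            # mask in polygon format: a flat list of x, y coordinates
            "segmentation": [[110.0, 130.0, 290.0, 130.0, 200.0, 370.0]],
        }
    ],
}

# Sanity checks a dataset script would typically satisfy
assert all(k in record for k in ("file_name", "height", "width", "annotations"))
```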
Building Blocks:
- Augmentation defines which transformation is applied in its `get_transform` method and returns that transformation. When an augmentation is executed, e.g. `augmentations(aug_input)`, its `Augmentation.__call__` method extracts the required arguments (e.g. the image) from `aug_input` and creates the transformation to apply via `get_transform`. Finally, it passes the created transform to AugInput to be executed and returns it. It is important to mention that the returned transformations are deterministic: they can be used later to transform different data. For example, suppose you want to resize an image and, of course, its masks. By default, AugInput accepts only the image as an argument. When you apply the augmentation with `transforms = augs(aug_input)`, the image is transformed in place inside `aug_input`. You can then apply the same transformation to the masks with `transforms.apply_segmentation(mask)`.
- Transform is responsible for actually executing transformation operations. It has methods such as `apply_image`, `apply_segmentation`, etc. that define how to transform each data type.
- AugInput stores the inputs needed by Augmentation. By default, it supports the image, bounding boxes, and a mask for semantic segmentation. It transforms each data type by calling the corresponding Transform methods such as `apply_image`, `apply_box`, and `apply_segmentation`.
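The deterministic-replay behavior described above can be sketched without Detectron2 itself. The two stand-in classes below mirror the Augmentation/Transform split; apart from `get_transform`, `apply_image`, and `apply_segmentation`, the names are illustrative:

```python
import numpy as np

class ResizeTransform:
    """Stand-in for Detectron2's Transform: a deterministic operation
    with per-data-type apply_* methods."""
    def __init__(self, new_h, new_w):
        self.new_h, self.new_w = new_h, new_w

    def _resize(self, arr):
        # Nearest-neighbor resize via index sampling (no external deps)
        h, w = arr.shape[:2]
        rows = np.arange(self.new_h) * h // self.new_h
        cols = np.arange(self.new_w) * w // self.new_w
        return arr[rows][:, cols]

    def apply_image(self, image):
        return self._resize(image)

    def apply_segmentation(self, mask):
        # Masks undergo the same geometric operation as the image
        return self._resize(mask)

class ResizeShortestEdge:
    """Stand-in for Augmentation: get_transform builds a deterministic
    Transform from the concrete input image."""
    def __init__(self, short_edge_length):
        self.short_edge_length = short_edge_length

    def get_transform(self, image):
        h, w = image.shape[:2]
        scale = self.short_edge_length / min(h, w)
        return ResizeTransform(int(h * scale), int(w * scale))

    def __call__(self, aug_input):
        tfm = self.get_transform(aug_input.image)
        aug_input.image = tfm.apply_image(aug_input.image)  # in place
        return tfm  # deterministic: can be replayed on other data

class AugInput:
    """Stand-in for Detectron2's AugInput, holding only the image."""
    def __init__(self, image):
        self.image = image

image = np.zeros((480, 640, 3), dtype=np.uint8)
mask = np.zeros((480, 640), dtype=np.uint8)

aug_input = AugInput(image)
transforms = ResizeShortestEdge(240)(aug_input)      # image resized in place
resized_mask = transforms.apply_segmentation(mask)   # same transform, new data
```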
Limitations of the current architecture
In the current architecture, augmentations can only be applied to images, bounding boxes, and masks separately. For example, in the instance segmentation task, the image is transformed by the given augmentations and the applied transformations are returned. Object instance masks can only be transformed afterwards through the returned transforms, via the `transforms.apply_segmentation` method. For object-aware augmentations, we need the image and masks at the same time so that we can extract object instances from the image. To this end, we can add a new method to the Transform class that takes both images and masks.
The other missing feature for multi-modal augmentation is the ability to sample additional data points from the dataset. With it, we could implement augmentation methods like MixUp, CutMix, or Simple Copy Paste that need multiple images. This can be achieved by (1) manipulating MapDataset to pass multiple data points to DatasetMapper, (2) returning additional images and masks alongside the actual data in Dataset, or (3) passing the dataset instance to the augmentation method that needs it. The first two options seemed to require too much work to implement, and they are not flexible for different scenarios. For example, Simple Copy Paste requires 2 images but Mosaic requires 4, so in the first two approaches we would have to decide how many data points to return depending on the augmentations used. Therefore, I went with the third option, which lets augmentation methods sample new data points from the dataset however they like.
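The third option can be sketched as follows; the class and method names here are illustrative, not Detectron2 API:

```python
import random

class DatasetSamplingAugmentation:
    """Sketch of option (3): the augmentation holds a reference to the
    dataset and samples the extra data points it needs by itself."""
    def __init__(self, dataset, num_extra=1):
        self.dataset = dataset
        self.num_extra = num_extra  # Simple Copy Paste needs 1 extra, Mosaic 3

    def sample_extra(self, rng=random):
        # Each augmentation decides how many extra items to draw
        return [rng.choice(self.dataset) for _ in range(self.num_extra)]

dataset = [{"file_name": f"img_{i:03d}.jpg"} for i in range(10)]
copy_paste_like = DatasetSamplingAugmentation(dataset, num_extra=1)
mosaic_like = DatasetSamplingAugmentation(dataset, num_extra=3)
```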
I introduced the `MultiModalAugmentation` and `MultiModalTransform` abstractions to be able to detect when a multi-modal augmentation is applied. MultiModalAugmentation is an empty class that extends `Augmentation`. MultiModalTransform extends `Transform` and adds an `.apply_multi_modal()` method that every newly created multi-modal transform must implement.
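A minimal sketch of the two abstractions is shown below. The base classes are stand-ins for Detectron2’s `Augmentation` and `Transform`, and the exact signature of `apply_multi_modal` is my assumption; the article only names the method:

```python
from abc import ABC, abstractmethod

class Augmentation:      # stand-in for detectron2.data.transforms.Augmentation
    pass

class Transform(ABC):    # stand-in for detectron2.data.transforms.Transform
    pass

class MultiModalAugmentation(Augmentation):
    """Empty marker class: the mapper can check
    isinstance(aug, MultiModalAugmentation) to route the image and masks
    into the augmentation together."""
    pass

class MultiModalTransform(Transform):
    """Transforms that need several modalities at once."""
    @abstractmethod
    def apply_multi_modal(self, image, masks):
        """Transform the image and instance masks in a single call."""
        ...

# A concrete multi-modal transform must implement apply_multi_modal
class IdentityMultiModal(MultiModalTransform):
    def apply_multi_modal(self, image, masks):
        return image, masks

tfm = IdentityMultiModal()
```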
We also need to adapt DatasetMapper and AugInput to be able to use the abstractions above. Since these classes are from the Detectron2 library, I created new classes that extend them instead of directly manipulating the library. You can see which parts of the code are changed in Figure 3 below.
I’ll exemplify how this abstraction can be used in the real world with two use cases:
We use the publicly available balloon segmentation dataset, which has only one class: balloon. Its images were collected from Flickr by limiting the license type to “Commercial use & mods allowed”, as stated here. The goal is very simple: randomly change the color of the balloons in the images. For this task, we only need the object masks in addition to the image, so that we can pick out a particular balloon instance. To this end, I created a new augmentation that extends MultiModalTransform. The color-changing logic is executed in its `apply_multi_modal()` method below:
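The article’s actual implementation is in the linked notebook; as a rough standalone sketch of the same idea (the function name and parameters here are my own, not the article’s):

```python
import numpy as np
from PIL import Image, ImageEnhance

def color_jitter_instances(image, masks, factor=10.0, p=0.5, rng=None):
    """Boost the color of randomly selected instances.
    image: HxWx3 uint8 array; masks: list of HxW boolean arrays."""
    rng = rng or np.random.default_rng()
    # Color-enhanced copy of the whole image (Pillow's ImageEnhance.Color)
    enhanced = np.asarray(
        ImageEnhance.Color(Image.fromarray(image)).enhance(factor)
    )
    out = image.copy()
    for mask in masks:
        if rng.random() < p:            # pick instances at random
            out[mask] = enhanced[mask]  # paste jittered pixels inside the mask
    return out

# Toy example: jitter a reddish 4x4 image with one instance mask
image = np.full((4, 4, 3), 128, dtype=np.uint8)
image[..., 0] = 200
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
jittered = color_jitter_instances(image, [mask], p=1.0)  # p=1: always jitter
```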
Now, the only thing left is to apply this augmentation. It can be executed using Detectron2’s existing architecture like this:
Here, in the first row, I used the `ImageEnhance.Color` module from Pillow to change the color by a factor of 10. It is applied directly to randomly selected balloon instances. You can use any function you like; the sky’s the limit 🙂 The final output will look like Figure 4:
I trained Mask R-CNN with this augmentation method on the whole balloon dataset using Detectron2’s tutorial notebook. You can find all the code and training results in this notebook.
We’ll use the same balloon dataset for this example, too. The goal of CopyPasteAugmentation is to copy randomly selected balloons from one image to another. This augmentation therefore requires sampling additional images from the dataset, which we achieve by passing the dataset instance to CopyPasteAugmentation:
copy_paste_aug = CopyPasteAugmentation(dataset=dataset, image_format=cfg.INPUT.FORMAT, pre_augs=pre_augs)
Disclaimer: This is not a complete implementation of Simple Copy Paste, but a proof of concept showing that the proposed abstractions (`MultiModalAugmentation` & `MultiModalTransform`) can be used to implement various augmentations.
Since the code of CopyPasteAugmentation is too long to include here, I don’t share it in the article; you can check it in this notebook.
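For intuition, here is a rough standalone sketch of the pasting step only. The function name and parameters are mine; the real CopyPasteAugmentation in the notebook also updates the target image’s annotations and applies pre-augmentations:

```python
import numpy as np

def paste_instances(dst_image, src_image, src_masks, p=0.5, rng=None):
    """Paste randomly selected instances from src_image onto dst_image.
    src_masks: list of HxW boolean arrays, one per source instance."""
    rng = rng or np.random.default_rng()
    out = dst_image.copy()
    pasted = []
    for mask in src_masks:
        if rng.random() < p:             # randomly select instances to copy
            out[mask] = src_image[mask]  # overwrite destination pixels
            pasted.append(mask)          # caller must update dst annotations
    return out, pasted

# Toy example: paste one instance from a white source onto a black target
dst = np.zeros((4, 4, 3), dtype=np.uint8)
src = np.full((4, 4, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True
out, pasted = paste_instances(dst, src, [mask], p=1.0)  # p=1: always paste
```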
Surprise: As in the previous use case, I trained Mask R-CNN with CopyPasteAugmentation on the whole balloon dataset. With just this augmentation method, we achieved better results than both the official baseline and InstanceColorJitterAugmentation. Check the training notebook here.
In this article, I gave some background, with illustrations, on how data augmentations work in Detectron2. Building on that introduction, I explained how a new augmentation method that needs multiple modalities, such as image and mask, can be implemented. Then, I showed how I implemented such augmentations with two concrete examples. I published all the resources used in this article here. You can test the augmentations shown in the use cases on Google Colab. Since I haven’t tested this abstraction in production yet, issues may occur with memory consumption, parallelism, multi-GPU training, etc. If you encounter a problem or use this abstraction in your work, let me know in the comments.