YOLOv7: A deep dive into the current state-of-the-art for object detection
Everything you need to know to use YOLOv7 in custom training scripts

This article was co-authored by Chris Hughes & Bernat Puig Camps

At the time of its publication, YOLOv7 was the fastest and most accurate real-time object detection model for computer vision tasks. The official paper demonstrates how this improved architecture surpasses all previous YOLO versions — as well as all other object detection models — in terms of both speed and accuracy on the MS COCO dataset; achieving this performance without utilizing any pretrained weights. Additionally, following all of the controversy around the naming conventions of previous YOLO models, as YOLOv7 was released by the same authors that developed Scaled-YOLOv4, the machine learning community seems happy to accept this as the next iteration of the ‘official’ YOLO family!

At the point that YOLOv7 was released, we — as part of Microsoft’s Data & AI Service Line — were partway through a challenging object-detection based customer project, in a domain drastically different to COCO. Needless to say, both ourselves and the customer were very excited at the prospect of applying YOLOv7 to our problem. Unfortunately, when using the out-of-the-box settings, the results were … let’s just say, not great.

After reading the official paper, we found that, whilst it presents a comprehensive overview of the architectural changes, it omits many details around how the model was trained; for example, which data augmentation techniques were applied, and how the loss function measures whether the model is doing a good job! To understand these technicalities, we decided to debug the code directly. However, as the YOLOv7 repository is a modified fork of the YOLOR codebase — which itself is a fork of YOLOv5 — we found that it includes a lot of complex functionality, much of which is not needed when just training a model; for example, being able to specify custom architectures in Yaml format and have these translated into PyTorch models. Additionally, the codebase contains many custom components which have been implemented from scratch — such as a multi-GPU training loop, several data augmentations, samplers to preserve dataloader workers, and multiple learning rate schedulers — many of which are now available in PyTorch, or other libraries. As a result, there was a lot of code to dissect; it took us a long time to understand how everything worked, as well as the intricacies of the training loop which contribute to the model’s excellent performance! Eventually, with this understanding, we were able to set up our training recipe to obtain consistently good results on our task.

In this article, we intend to take a practical approach in demonstrating how to train YOLOv7 models in custom training scripts, as well as exploring areas such as data augmentation techniques, how to select and modify anchor boxes, and demystifying how the loss function works; (hopefully!) enabling you to build up an intuition of what is likely to work well for your own problems. As the YOLOv7 architecture is described in detail in the official paper, as well as in many other sources, we are not going to cover this here. Instead, we intend to focus on all of the other details which, whilst they contribute to YOLOv7’s performance, are not covered in the paper. This tends to be knowledge which has been accumulated over multiple versions of YOLO models but can be incredibly difficult to track down for someone just entering the field.

To illustrate these concepts, we shall be using our own implementation of YOLOv7, which utilises the official pretrained weights, but has been written with modularity and readability in mind. This project initially started as an exercise for us to improve our understanding of how YOLOv7 works under the hood — in order to better understand how to apply it — but after successfully using it on a few different tasks, we have decided to make it publicly available. Whilst we would recommend using the official implementation if you wish to exactly reproduce the published results on COCO, we find that this implementation is more flexible to apply, and extend, to custom domains. Hopefully, this implementation will provide a clean and clear starting point for anyone wishing to experiment with YOLOv7 in their own custom training scripts, as well as providing more transparency around the techniques that were used during training in the original implementation.

In this article, we shall cover:

Exploring all of the details along the way, such as:

Tl;dr: If you just want to see some working code that you can use directly, all of the code required to replicate this post is available as a notebook here. Whilst code snippets are used throughout the article, this is primarily for aesthetic purposes, please defer to the notebook, and the repo for working code.

We would like to thank British Airways, for without their consistently delayed flights, this post would probably not have happened.

First, let’s take a look at how to load our dataset in the format that YOLOv7 expects.

Selecting a dataset

Throughout this article, we shall use the Kaggle cars object detection dataset; however, as our aim is to demonstrate how YOLOv7 can be applied to any problem, this is really the least important part of this work. Additionally, as the images are quite similar to COCO, it will enable us to experiment with a pretrained model before we do any training.

The annotations for this dataset are in the form of a .csv file, which associates the image name with the corresponding annotations; where each row represents one bounding box. Whilst there are around 1000 images in the training set, only those with annotations are included in this file.

We can view the format of this by loading it into a pandas DataFrame.
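As a rough sketch, assuming the same data layout that we use in the training script later on (a data/cars folder containing training_images and annotations.csv), this looks like:

from pathlib import Path

import pandas as pd

# paths are an assumption, matching the layout used in the training script later on
data_path = Path("data/cars")
images_path = data_path / "training_images"
annotations_file_path = data_path / "annotations.csv"

# each row associates an image file name with a single bounding box
annotations_df = pd.read_csv(annotations_file_path)
print(annotations_df.head())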

As it is not usually the case that all images in our dataset contain instances of the objects that we are trying to detect, we would also like to include some images that do not contain cars. To do this, we can define a function to load the annotations which also includes 100 ‘negative’ images. Additionally, as the designated test set is unlabelled, let’s randomly take 20% of these images to use as our validation set.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/train_cars.py

import pandas as pd
import random


def load_cars_df(annotations_file_path, images_path):
    all_images = sorted(set([p.parts[-1] for p in images_path.iterdir()]))
    image_id_to_image = {i: im for i, im in enumerate(all_images)}
    image_to_image_id = {v: k for k, v in image_id_to_image.items()}

    annotations_df = pd.read_csv(annotations_file_path)
    annotations_df.loc[:, "class_name"] = "car"
    annotations_df.loc[:, "has_annotation"] = True

    # add 100 empty images to the dataset
    empty_images = sorted(set(all_images) - set(annotations_df.image.unique()))
    non_annotated_df = pd.DataFrame(list(empty_images)[:100], columns=["image"])
    non_annotated_df.loc[:, "has_annotation"] = False
    non_annotated_df.loc[:, "class_name"] = "background"

    df = pd.concat((annotations_df, non_annotated_df))

    class_id_to_label = dict(
        enumerate(df.query("has_annotation == True").class_name.unique())
    )
    class_label_to_id = {v: k for k, v in class_id_to_label.items()}

    df["image_id"] = df.image.map(image_to_image_id)
    df["class_id"] = df.class_name.map(class_label_to_id)

    file_names = tuple(df.image.unique())
    random.seed(42)
    validation_files = set(random.sample(file_names, int(len(df) * 0.2)))
    train_df = df[~df.image.isin(validation_files)]
    valid_df = df[df.image.isin(validation_files)]

    lookups = {
        "image_id_to_image": image_id_to_image,
        "image_to_image_id": image_to_image_id,
        "class_id_to_label": class_id_to_label,
        "class_label_to_id": class_label_to_id,
    }
    return train_df, valid_df, lookups

We can now use this function to load our data:
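For example, continuing with the paths defined above:

train_df, valid_df, lookups = load_cars_df(annotations_file_path, images_path)

print(train_df.head())
print(lookups["class_id_to_label"])  # {0: 'car'}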

To make it easier to associate predictions with an image, we have assigned each image a unique id; in this case it is just an incrementing integer count. Additionally, we have added an integer value to represent the classes that we want to detect, which is a single class — ‘car’ — in this case.

Generally, object detection models reserve 0 as the background class, so class labels should start from 1. This is not the case for YOLOv7, so we start our class encoding from 0. For images that do not contain a car, we do not require a class id. We can confirm that this is the case by inspecting the lookups returned by our function.

Finally, let’s see the number of images in each class for our training and validation sets. As an image can have multiple annotations, we need to make sure that we account for this when calculating our counts:
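One way of doing this, as a small sketch, is to count unique image names per class, so that images with multiple boxes are only counted once:

for split_name, split_df in [("train", train_df), ("validation", valid_df)]:
    # count unique images per class; images with multiple annotations are counted once
    print(split_name)
    print(split_df.groupby("class_name").image.nunique())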

Create a Dataset Adaptor

Usually, at this point, we would create a PyTorch dataset specific to the model that we shall be training.

However, we often use the pattern of first creating a dataset ‘adaptor’ class, with the sole responsibility of wrapping the underlying data sources and loading this appropriately. This way, we can easily switch out adaptors when using different datasets, without changing any pre-processing logic which is specific to the model that we are training.

Therefore, let’s focus for now on creating a CarsDatasetAdaptor class, which converts the specific raw dataset format into an image and corresponding annotations. Additionally, let’s load the image id that we assigned, as well as the height and width of our image, as they may be useful to us later on.

An implementation of this is presented below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/train_cars.py

from pathlib import Path

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class CarsDatasetAdaptor(Dataset):
    def __init__(
        self,
        images_dir_path,
        annotations_dataframe,
        transforms=None,
    ):
        self.images_dir_path = Path(images_dir_path)
        self.annotations_df = annotations_dataframe
        self.transforms = transforms

        self.image_idx_to_image_id = {
            idx: image_id
            for idx, image_id in enumerate(self.annotations_df.image_id.unique())
        }
        self.image_id_to_image_idx = {
            v: k for k, v in self.image_idx_to_image_id.items()
        }

    def __len__(self) -> int:
        return len(self.image_idx_to_image_id)

    def __getitem__(self, index):
        image_id = self.image_idx_to_image_id[index]
        image_info = self.annotations_df[self.annotations_df.image_id == image_id]
        file_name = image_info.image.values[0]
        assert image_id == image_info.image_id.values[0]

        image = Image.open(self.images_dir_path / file_name).convert("RGB")
        image = np.array(image)

        image_hw = image.shape[:2]

        if image_info.has_annotation.any():
            xyxy_bboxes = image_info[["xmin", "ymin", "xmax", "ymax"]].values
            class_ids = image_info["class_id"].values
        else:
            xyxy_bboxes = np.array([])
            class_ids = np.array([])

        if self.transforms is not None:
            transformed = self.transforms(
                image=image, bboxes=xyxy_bboxes, labels=class_ids
            )
            image = transformed["image"]
            xyxy_bboxes = np.array(transformed["bboxes"])
            class_ids = np.array(transformed["labels"])

        return image, xyxy_bboxes, class_ids, image_id, image_hw

Notice that, for our background images, we are just returning an empty array for our bounding boxes and class ids.

Using this, we can confirm that the length of our dataset is the same as the total number of training images that we calculated earlier.

Now, we can use this to visualise some of our images, as demonstrated below:
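A minimal usage sketch (the visualisation helper itself is omitted here; any plotting utility will do):

train_ds = CarsDatasetAdaptor(images_path, train_df)

print(len(train_ds))  # should match the number of unique training images

image, xyxy_bboxes, class_ids, image_id, image_hw = train_ds[0]
print(image.shape, xyxy_bboxes, class_ids, image_id, image_hw)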

Create a YOLOv7 dataset

Now that we have created our dataset adaptor, let’s create a dataset which preprocesses our inputs into the format required by YOLOv7; these steps should remain the same regardless of the adaptor that we are using.

An implementation of this is presented below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/dataset.py

import numpy as np
import torch
import torchvision
from torch.utils.data import Dataset


class Yolov7Dataset(Dataset):
    """
    A dataset which takes an object detection dataset returning (image, boxes, classes, image_id, image_hw)
    and applies the necessary preprocessing steps as required by Yolov7 models.

    By default, this class expects the image, boxes (N, 4) and classes (N,) to be numpy arrays,
    with the boxes in (x1,y1,x2,y2) format, but this behaviour can be modified by
    overriding the `load_from_dataset` method.
    """

    def __init__(self, dataset, transforms=None):
        self.ds = dataset
        self.transforms = transforms

    def __len__(self):
        return len(self.ds)

    def load_from_dataset(self, index):
        image, boxes, classes, image_id, shape = self.ds[index]
        return image, boxes, classes, image_id, shape

    def __getitem__(self, index):
        image, boxes, classes, image_id, original_image_size = self.load_from_dataset(
            index
        )

        if self.transforms is not None:
            transformed = self.transforms(image=image, bboxes=boxes, labels=classes)
            image = transformed["image"]
            boxes = np.array(transformed["bboxes"])
            classes = np.array(transformed["labels"])

        image = image / 255  # 0 - 1 range

        if len(boxes) != 0:
            # filter boxes with 0 area in any dimension
            valid_boxes = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
            boxes = boxes[valid_boxes]
            classes = classes[valid_boxes]

            boxes = torchvision.ops.box_convert(
                torch.as_tensor(boxes, dtype=torch.float32), "xyxy", "cxcywh"
            )
            boxes[:, [1, 3]] /= image.shape[0]  # normalized height 0-1
            boxes[:, [0, 2]] /= image.shape[1]  # normalized width 0-1
            classes = np.expand_dims(classes, 1)

            labels_out = torch.hstack(
                (
                    torch.zeros((len(boxes), 1)),
                    torch.as_tensor(classes, dtype=torch.float32),
                    boxes,
                )
            )
        else:
            labels_out = torch.zeros((0, 6))

        try:
            # image_id may be a sequence (e.g. empty for unlabelled images)
            if len(image_id) > 0:
                image_id_tensor = torch.as_tensor(image_id)
            else:
                image_id_tensor = torch.as_tensor([])
        except TypeError:
            # image_id is a scalar (e.g. an int)
            image_id_tensor = torch.as_tensor(image_id)

        return (
            torch.as_tensor(image.transpose(2, 0, 1), dtype=torch.float32),
            labels_out,
            image_id_tensor,
            torch.as_tensor(original_image_size),
        )

Let’s wrap our data adaptor using this dataset and inspect some of the outputs:
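For example, wrapping the training adaptor without any transforms:

train_yds = Yolov7Dataset(train_ds)

image_tensor, labels, image_id, image_hw = train_yds[0]
print(image_tensor.shape)  # (3, H, W) float32 tensor, scaled to [0, 1]
print(labels)              # (num_boxes, 6) tensor: [0, class_id, ncx, ncy, nw, nh]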

As we haven’t defined any transforms, the output is largely the same, with the main exception being that the boxes are now in normalized cxcywh format and all of our outputs have been converted into tensors. Note that cx and cy stand for centre x and y; that is, the coordinates correspond to the centre of the box.

One thing to note is that our labels take the form [0, class_id, ncx, ncy, nw, nh]. The zero space at the start of the tensor will be utilised by the collate function later on.

Transforms

Now, let’s define some transforms! For this, we shall use the excellent Albumentations library, which provides many options for transforming both images and bounding boxes.

Whilst the transforms that we select will largely be domain specific, here, let’s define similar transforms to those used in the original implementation.

These are:

  • Resize the image so that its longest side matches the given input size (640 here), whilst maintaining the aspect ratio
  • If the image is not square, apply padding. For this, we shall follow the paper in using grey padding; this is an arbitrary choice.

During training, we additionally:

  • Apply horizontal flips with a probability of 0.5; this is the default training transform used in the function below.

We can use the following function to create these transforms as demonstrated below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/dataset.py

import albumentations as A


def create_yolov7_transforms(
    image_size=(640, 640),
    training=False,
    training_transforms=(A.HorizontalFlip(p=0.5),),
):
    transforms = [
        A.LongestMaxSize(max(image_size)),
        A.PadIfNeeded(
            image_size[0],
            image_size[1],
            border_mode=0,
            value=(114, 114, 114),
        ),
    ]

    if training:
        transforms.extend(training_transforms)

    return A.Compose(
        transforms,
        bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
    )

Now, let’s re-create our dataset, this time passing the default transforms that will be used during evaluation. For our target image size, we shall use 640 which is the value that the smaller YOLOv7 models were trained on. In general, we can select any multiple of 8 for this.
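This looks like the following; here we mirror the image size used in the training script later on:

image_size = 640

train_yds = Yolov7Dataset(
    train_ds,
    create_yolov7_transforms(training=False, image_size=(image_size, image_size)),
)

image_tensor, labels, image_id, image_hw = train_yds[0]
print(image_tensor.shape)  # torch.Size([3, 640, 640])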

Using these transforms, we can see that our image has been resized to our target size and padding has been applied. The reason that padding is used is so that we can maintain the aspect ratio of the objects in the images, but have a common size for images in our dataset; enabling us to batch them efficiently!

Now that we have explored how to load and prepare our data, let’s move on to take a look at how we can leverage a pretrained model to make some predictions!

Loading the model

So that we can understand how to interface with the model, let’s load a pretrained checkpoint and use this for inference on some images in our dataset. As this checkpoint was trained on COCO, which contains images of cars, we can assume that the model should perform moderately well on this task out of the box. To see the models that are available, we can import the AVAILABLE_MODELS variable.

Here, we can see that the available models are the architectures defined in the original paper. Let’s create the standard yolov7 model, using the create_yolov7_model function.
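A minimal sketch of this is shown below; we pass num_classes=80 to match the COCO checkpoint, and note that the exact module from which AVAILABLE_MODELS can be imported is not shown here, so check the repo.

from yolov7 import create_yolov7_model

# 80 classes to match the COCO-pretrained checkpoint
model = create_yolov7_model(architecture="yolov7", num_classes=80, pretrained=True)
model = model.eval()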

Now, let’s take a look at the model’s predictions. The forward pass through the model will return the raw feature maps given by the FPN heads; to convert these into meaningful predictions, we can use the postprocess method.
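Continuing from the dataset created above, a sketch of this looks like the following (default postprocess arguments assumed; see the repo for the full signature):

import torch

image_tensor, labels, image_id, image_hw = train_yds[0]
images = image_tensor[None]  # add a batch dimension: (1, 3, 640, 640)

with torch.no_grad():
    fpn_heads_outputs = model(images)
    preds = model.postprocess(fpn_heads_outputs)

# one tensor per image; each row is [x1, y1, x2, y2, confidence, class_index]
print(preds[0].shape)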

Inspecting the shape, we can see that the model has made 25,200 predictions! Each prediction has an associated tensor of length 6 — the entries correspond to the bounding box coordinates in xyxy format, a confidence score, and a class index.

Often, object detection models tend to make a lot of similar, overlapping predictions. Whilst there are many ways of dealing with this, in the original paper, the authors used non-maximum-suppression (NMS) to solve this problem. We can apply NMS, as well as a secondary round of confidence thresholding, using the function below. In addition, during postprocessing, we often want to filter out any predictions with a confidence level below a predefined threshold, so let’s increase our confidence threshold here.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/trainer.py

from typing import List

import torchvision
from torch import Tensor


def filter_eval_predictions(
    predictions: List[Tensor],
    confidence_threshold: float = 0.2,
    nms_threshold: float = 0.65,
) -> List[Tensor]:
    nms_preds = []
    for pred in predictions:
        # discard low-confidence predictions
        pred = pred[pred[:, 4] > confidence_threshold]

        # class-aware non-maximum suppression on the remaining boxes
        nms_idx = torchvision.ops.batched_nms(
            boxes=pred[:, :4],
            scores=pred[:, 4],
            idxs=pred[:, 5],
            iou_threshold=nms_threshold,
        )
        nms_preds.append(pred[nms_idx])

    return nms_preds
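As a usage example, where the confidence threshold of 0.5 is an arbitrary choice for illustration:

nms_preds = filter_eval_predictions(preds, confidence_threshold=0.5, nms_threshold=0.65)
print(nms_preds[0].shape)  # far fewer rows, with overlapping boxes removed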

After applying NMS, we can see that now we only have a single prediction for this image. Let’s visualise how this looks:

We can see that this looks pretty good! The prediction from the model is actually tighter around the car than the ground truth!

Now that we have our prediction, the only thing to note is that the bounding box is relative to the resized image size. To scale our predictions back to the original image size, we can use the following function:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/trainer.py

def scale_bboxes_to_original_image_size(
    xyxy_boxes, resized_hw, original_hw, is_padded=True
):
    scaled_boxes = xyxy_boxes.clone()
    scale_ratio = resized_hw[0] / original_hw[0], resized_hw[1] / original_hw[1]

    if is_padded:
        # remove padding
        pad_scale = min(scale_ratio)
        padding = (resized_hw[1] - original_hw[1] * pad_scale) / 2, (
            resized_hw[0] - original_hw[0] * pad_scale
        ) / 2
        scaled_boxes[:, [0, 2]] -= padding[0]  # x padding
        scaled_boxes[:, [1, 3]] -= padding[1]  # y padding
        scale_ratio = (pad_scale, pad_scale)

    scaled_boxes[:, [0, 2]] /= scale_ratio[1]
    scaled_boxes[:, [1, 3]] /= scale_ratio[0]

    # clip xyxy bounding boxes to image shape (height, width)
    scaled_boxes[:, 0].clamp_(0, original_hw[1])  # x1
    scaled_boxes[:, 1].clamp_(0, original_hw[0])  # y1
    scaled_boxes[:, 2].clamp_(0, original_hw[1])  # x2
    scaled_boxes[:, 3].clamp_(0, original_hw[0])  # y2

    return scaled_boxes

Before we can start training, in addition to a model architecture, we need a loss function which will enable us to measure how well our model is performing, so that we can update our parameters accordingly. Since object detection is a difficult problem to teach a model, the loss functions of such models are usually quite complex and YOLOv7 is no exception. Here, we shall do our best to illustrate the intuitions behind it to facilitate its understanding.

Before we can delve deeper into the actual loss function, let’s cover a few background concepts that we need to understand.

Anchor boxes

One of the main difficulties of object detection is outputting detection boxes. That is, how do we train a model to create a bounding box and localize it correctly in an image?

There are a few different approaches, but the YOLOv7 family is what we call an anchor-based model. In these models, the general philosophy is to first create lots of potential bounding boxes, then select the most promising options to match to our target objects; slightly moving and resizing them as necessary to obtain the best possible fit.

The basic idea is that we draw a grid on top of each image and, at each grid intersection (anchor point), generate candidate boxes (anchor boxes) based on a number of anchor sizes. That is, the same set of boxes is repeated at each anchor point. This way, the task that the model has to learn, slightly relocating and resizing these boxes, is simpler than generating boxes from scratch.

An example of anchor boxes generated at a sample of anchor points.

However, one issue with this approach is that our target, ground truth, boxes can range in size — from tiny to huge! Therefore, it is usually not possible to define a single set of anchor sizes that can be matched to all targets. For this reason, anchor-based model architectures usually employ a Feature-Pyramid-Network (FPN) to assist with this; which is the case with YOLOv7.

Feature Pyramid Networks (FPN)

The main idea behind FPNs (introduced in Feature Pyramid Networks for Object Detection) is to leverage the nature of convolutional layers — which reduce the size of the feature space and increase the coverage of each feature in the initial image — to output predictions at different scales¹. FPNs are usually implemented as a stack of convolutional layers, as we can see by inspecting the detection head of our YOLOv7 model.

Whilst we could simply take the outputs of the final layer as predictions, as the deeper convolutional layers implicitly utilise the information from previous layers to learn more high-level features, they do not have access to the information of how to detect the lower-level features contained in earlier layers; this can result in poor performance when detecting smaller objects.

For this reason, a top-down pathway and lateral connections are added to the regular bottom-up pathway (normal flow of a convolution layer). The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. Then, these features are enhanced with features from the bottom-up pathway through the lateral connections. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times¹.

In summary, FPNs provide semantically strong features at multiple scales which make them extremely well suited for object detection. The connections that YOLOv7 implements in its FPN are illustrated in the figure below:

Representation of the YOLOv7 family Feature Pyramid Network architecture. Source: YOLOv7 paper.

Here, we can see that we have a “Normal model” and a “Model with auxiliary head”. This is because some of the larger models in the YOLOv7 family use deep supervision when training; that is, they leverage the outputs of deeper layers in the loss in order to try and better learn the task. We shall explore this further later on.

From the image, we can see that each layer in the FPN (also known as each FPN head), has a feature scale that is half the size of the previous one (the scale is the same for each Lead head and its corresponding Aux head). This can be understood as each subsequent FPN head “seeing” object scales twice as big as the previous one. We can leverage that by assigning grids with different strides (grid cell side size), and proportional anchor sizes, to each FPN head.

For instance, the anchor configuration for the basic yolov7 model looks like this:

Illustration of the anchor grid and the different (default) anchor box sizes for each fpn head in the main model in the YOLOv7 family

As we can see, we have anchor box sizes and grids that cover completely different scales: from tiny objects to objects that can occupy the whole image.

Now, that we understand these ideas conceptually, let’s take a look at the FPN outputs that come out of our model, which is what will be used to calculate our loss.

¹ These sections were directly taken from the original FPN paper as we felt that no further explanation was needed.

Breaking down the FPN outputs

Recall that, when we made our predictions earlier, we used the model’s postprocess method to convert the raw FPN outputs into usable bounding boxes. Now that we understand the intuition behind what the FPN is trying to do, let’s inspect these raw outputs.

The outputs of our model are always a List[Tensor], where each component corresponds to a head of the FPN. For models that use deep supervision, the Aux Head outputs come after the Lead Head outputs (there are always the same number of each, and both sets are ordered in the same way). For the rest, including the one we are using here, only the Lead Head outputs are present.

Inspecting the shape of each FPN output, we can see that each one has the following dimensions:

[n_images, n_anchor_sizes, n_grid_rows, n_grid_cols, n_features]

where:

  • n_images — The number of images in the batch (batch size).
  • n_anchor_sizes – The number of anchor sizes associated with the head (usually 3).
  • n_grid_rows – The number of anchors vertically, img_height / stride.
  • n_grid_cols – The number of anchors horizontally, img_width / stride.
  • n_features – 5 + num_classes, comprised of:
    cx – Horizontal correction for the anchor box centre.
    cy – Vertical correction for the anchor box centre.
    w – Width correction for the anchor box.
    h – Height correction for the anchor box.
    obj_score – Score proportional to the probability of an object being contained inside the anchor box.
    cls_score – One per class, score proportional to the probability of that being the class of the object.

When these outputs are mapped into useful predictions during post-processing, we apply the following operations (a short decoding sketch follows this list):

  • cx, cy : final = 2 * sigmoid(initial) - 0.5
    [(−∞, ∞), (−∞, ∞)] → [(−0.5, 1.5), (−0.5, 1.5)]
    – The model can only move the anchor centre from 0.5 cell behind to 1.5 cells forward. Note that for the loss (i.e., when we train) we use grid coordinates.
  • w, h : final = (2 * sigmoid(initial))**2
    [(−∞, ∞), (−∞, ∞)] → [(0, 4), (0, 4)]
    – The model can make the anchor box arbitrarily smaller but at most 4 times bigger. Larger objects, outside of this range, must be predicted by the next FPN head.
  • obj_score : final = sigmoid(initial)
    (−∞, ∞) → (0, 1)
    – Makes sure the score is mapped to a probability.
  • cls_score : final = sigmoid(initial)
    (−∞, ∞) → (0, 1)
    – Makes sure the score is mapped to a probability.
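To make this mapping concrete, here is a minimal, illustrative sketch of how the raw output of a single FPN head could be decoded; note that the model's own postprocess method handles this for us, together with offsetting by the grid coordinates and scaling by the anchor sizes, which are omitted here:

import torch

def decode_fpn_head_output(raw: torch.Tensor) -> torch.Tensor:
    """Apply the activations described above to one raw FPN head output.

    raw: (n_images, n_anchor_sizes, n_grid_rows, n_grid_cols, 5 + num_classes)
    """
    decoded = raw.clone()
    decoded[..., 0:2] = 2 * torch.sigmoid(raw[..., 0:2]) - 0.5   # cx, cy corrections in (-0.5, 1.5)
    decoded[..., 2:4] = (2 * torch.sigmoid(raw[..., 2:4])) ** 2  # w, h multipliers in (0, 4)
    decoded[..., 4:] = torch.sigmoid(raw[..., 4:])               # obj_score and cls_scores in (0, 1)
    return decoded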

Center Priors

Now, it is easy to see that if we put 3 anchor boxes in each anchor point of each of the grids, we end up with a lot of boxes: 3*80*80 + 3*40*40 + 3*20*20=25200 for each 640x640px image to be exact! The issue is that most of these predictions are not going to contain an object, which we classify as ‘background’. Depending on the sequence of operations that we need to apply to each prediction, computations can easily stack up and slow down the training!

To make the problem computationally cheaper, the YOLOv7 loss first finds the anchor boxes that are likely to match each target box and treats them differently — these are known as the center prior anchor boxes. This process is applied at each FPN head, for each target box, across all images in the batch at once.

Each anchor point — which is a coordinate in our grid — defines a grid cell; we consider the anchor to be at the top left of its corresponding grid cell. Subsequently, each cell (except cells on the border) has 4 adjacent cells (top, bottom, left, right). Each target box, for each FPN head, lies somewhere inside a grid cell. Imagine that we have the following grid, and the centre of a target box is represented by a *:

Based on the way the model is designed and trained, the x and y corrections that it can output are in the range of [-0.5, 1.5] grid cells. Thus, only a subset of the closest anchor boxes will be able to match the target centre. We select some of these anchor boxes to represent the center prior for the target box.

  • For the Lead Heads, we use a fine Center Prior, which is a more targeted selection. This is comprised of 3 anchors per head: the anchor associated with the cell containing the target box centre, alongside the anchors for the 2 closest grid cells to the target box centre. In the diagram, the Center Prior anchors are marked with an X.
Selected centre priors for lead detection heads
  • For the Auxiliary Heads (for models that use deep supervision), we use a coarse Center Prior, which is a less targeted selection. This is comprised of 5 anchors per head: the anchor of the cell containing the target box centre, alongside all 4 adjacent grid cells.
Selected centre priors for auxiliary detection heads

The reasoning behind this fine and coarse distinction is that the learning ability of the Auxiliary Heads is lower than that of the Lead Heads, because the Lead Heads are deeper in the network. Thus, we try not to restrict too much where the Auxiliary Heads can learn from, to make sure that we do not lose valuable information.

Similarly to the coordinate corrections, the model can only apply a multiplicative modifier to the width and height of each anchor box in the interval [0, 4]. This means that, at most, it can make the sides of the anchor boxes 4 times bigger. Therefore, from the anchor boxes selected as Center Prior, we filter those that are either 4 times bigger or smaller than the target box.

In summary, the Center Prior is comprised of the anchor boxes whose anchor is close enough to the target box centre and whose sides are not too far off from the target box side size.

Optimal Transport Assignment

One of the difficulties when evaluating object detection models is being able to match predicted boxes to target boxes in order to quantify if the model is doing a good job or not.

The simplest approach is to define an Intersection over Union (IoU) threshold and decide based on that. While this generally works, it becomes problematic when there are occlusions, ambiguity or when multiple objects are very close together. Optimal Transport Assignment (OTA) aims to solve some of these problems by considering label assignment as a global optimization problem for each image.

The main intuition consists in considering each target box a supplier of k positive label assignments and each predicted box a demander of either one positive label assignment or one background assignment. k is dynamic and depends on each target box. Then, transporting one positive label assignment from target box to predicted box has a cost based on classification and regression. Finally, the goal is to find a transportation plan (label assignment) that minimizes the total cost over the image.

This can be done using an off-the-shelf solver, but YOLOv7 implements simOTA (introduced in the YOLOX paper), a simplified version of the OTA problem. With the goal of reducing the computational cost of label assignment, it assigns, for each target, the k predicted boxes with the lowest transportation cost, instead of solving the global problem. The Center Prior boxes are used as candidates for this process.

This helps us to further filter the amount of model outputs that can potentially be matched to a ground truth target.

YOLOv7 Loss algorithm

Now that we have introduced the most complicated pieces used in the YOLOv7 loss calculation, we can break down the algorithm used into the following steps:

  1. For each FPN head (or each FPN head and Aux FPN head pair if Aux heads used):
  • Find the Center Prior anchor boxes.
  • Refine the candidate selection through the simOTA algorithm. Always use lead FPN heads for this.
  • Obtain the objectness loss score using Binary Cross Entropy Loss between the predicted objectness probability and the Complete Intersection over Union (CIoU) with the matched target as ground truth. If there are no matches, this is 0.
  • If there are any selected anchor box candidates, also calculate (otherwise they are just 0):
    – The box (or regression) loss, defined as the mean(1 - CIoU) between all candidate anchor boxes and their matched target.
    – The classification loss, using Binary Cross Entropy Loss between the predicted class probabilities for each anchor box and a one-hot encoded vector of the true class of the matched target.
  • If the model uses auxiliary heads, add each component obtained from the aux head to the corresponding main loss component (i.e., x = x + aux_wt*aux_x). The contribution weight (aux_wt) is defined by a predefined hyperparameter.
  • Multiply the objectness loss by the corresponding FPN head weight (predefined hyperparameter).

2. Multiply each loss component (objectness, classification, regression) by their contribution weight (predefined hyperparameter).

3. Sum the already weighted loss components.

4. Multiply the final loss value by the batch size.

As a technical detail, the loss reported during evaluation is made computationally cheaper by skipping simOTA and never using the auxiliary heads, even for the models that use deep supervision.

Whilst this process contains a lot of complexity, in practice, this is all encapsulated in a single class, which can be created as demonstrated below:

Now that we understand how to use a pretrained model to make predictions, and how our loss function measures the quality of these predictions, let’s look at how we can finetune a model to a custom task. To obtain the level of performance reported in the paper, YOLOv7 was trained using a variety of techniques. However, for our purposes, let’s start with the minimal possible training loop required, before gradually introducing different techniques.

To handle the boilerplate aspects of the training loop, let’s use PyTorch-accelerated. This will enable us to define only the parts of the training loop which are relevant to our use case, without having to manage all of the boilerplate. To do this, we can override parts of the default PyTorch-accelerated Trainer and create a trainer specific to our YOLOv7 model, as demonstrated below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/trainer.py

import torch
from pytorch_accelerated import Trainer


class Yolov7Trainer(Trainer):
    YOLO7_PADDING_VALUE = -2.0

    def __init__(
        self,
        model,
        loss_func,
        optimizer,
        callbacks,
        filter_eval_predictions_fn=None,
    ):
        super().__init__(
            model=model, loss_func=loss_func, optimizer=optimizer, callbacks=callbacks
        )
        self.filter_eval_predictions = filter_eval_predictions_fn

    def training_run_start(self):
        self.loss_func.to(self.device)

    def evaluation_run_start(self):
        self.loss_func.to(self.device)

    def train_epoch_start(self):
        super().train_epoch_start()
        self.loss_func.train()

    def eval_epoch_start(self):
        super().eval_epoch_start()
        self.loss_func.eval()

    def calculate_train_batch_loss(self, batch) -> dict:
        images, labels = batch[0], batch[1]

        fpn_heads_outputs = self.model(images)
        loss, _ = self.loss_func(
            fpn_heads_outputs=fpn_heads_outputs, targets=labels, images=images
        )

        return {
            "loss": loss,
            "model_outputs": fpn_heads_outputs,
            "batch_size": images.size(0),
        }

    def calculate_eval_batch_loss(self, batch) -> dict:
        with torch.no_grad():
            images, labels, image_ids, original_image_sizes = (
                batch[0],
                batch[1],
                batch[2],
                batch[3].cpu(),
            )
            fpn_heads_outputs = self.model(images)
            val_loss, _ = self.loss_func(
                fpn_heads_outputs=fpn_heads_outputs, targets=labels
            )

            preds = self.model.postprocess(fpn_heads_outputs, conf_thres=0.001)

            if self.filter_eval_predictions is not None:
                preds = self.filter_eval_predictions(preds)

            resized_image_sizes = torch.as_tensor(
                images.shape[2:], device=original_image_sizes.device
            )[None].repeat(len(preds), 1)

        formatted_predictions = self.get_formatted_preds(
            image_ids, preds, original_image_sizes, resized_image_sizes
        )

        gathered_predictions = (
            self.gather(formatted_predictions, padding_value=self.YOLO7_PADDING_VALUE)
            .detach()
            .cpu()
        )

        return {
            "loss": val_loss,
            "model_outputs": fpn_heads_outputs,
            "predictions": gathered_predictions,
            "batch_size": images.size(0),
        }

    def get_formatted_preds(
        self, image_ids, preds, original_image_sizes, resized_image_sizes
    ):
        """
        scale bboxes to original image dimensions, and associate image id with predictions
        """
        formatted_preds = []
        for i, (image_id, image_preds) in enumerate(zip(image_ids, preds)):
            # image_id, x1, y1, x2, y2, score, class_id
            formatted_preds.append(
                torch.cat(
                    (
                        scale_bboxes_to_original_image_size(
                            image_preds[:, :4],
                            resized_hw=resized_image_sizes[i],
                            original_hw=original_image_sizes[i],
                            is_padded=True,
                        ),
                        image_preds[:, 4:],
                        image_id.repeat(image_preds.shape[0])[None].T,
                    ),
                    1,
                )
            )

        if not formatted_preds:
            # if no predictions, create placeholder so that it can be gathered across processes
            stacked_preds = torch.tensor(
                [self.YOLO7_PADDING_VALUE] * 7, device=self.device
            )[None]
        else:
            stacked_preds = torch.vstack(formatted_preds)

        return stacked_preds

Our training step is quite straightforward, with the only modification being that we need to extract the total loss from the dictionary that is returned. For the evaluation step, we first calculate the losses, and then retrieve the detections.

Evaluation logic

To evaluate our model’s performance on this task, we can use Mean Average Precision (mAP); a standard metric for object detection tasks. Perhaps the most widely used (and trusted) implementation of mAP is the class that is included in the PyCOCOTools package, which is used to evaluate official COCO leaderboard submissions.

However, as this does not have the most intuitive interface, we have created a simple wrapper around it, to make it a little more user-friendly. Additionally, as for many cases outside of the COCO competition leaderboard it can be advantageous to evaluate predictions using a fixed IoU threshold — as opposed to the range of IoU thresholds that is used by default — we have added an option to do this to our evaluator.

To encapsulate our evaluation logic to use during training, let’s create a callback for this; which will be updated at the end of each evaluation step and then calculated at the end of each evaluation epoch.


# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/evaluation/calculate_map_callback.py

import json
from pathlib import Path

import pandas as pd
import torch
from pytorch_accelerated.callbacks import TrainerCallback

# COCOMeanAveragePrecision and the column name constants (XMIN_COL, YMIN_COL, ...)
# are provided by the repo's evaluation package, alongside this callback


class CalculateMeanAveragePrecisionCallback(TrainerCallback):
    """
    A callback which accumulates predictions made during an epoch and uses these to calculate the Mean Average Precision
    from the given targets.

    .. Note:: If using distributed training or evaluation, this callback assumes that predictions have been gathered
    from all processes during the evaluation step of the main training loop.
    """

    def __init__(
        self,
        targets_json,
        iou_threshold=None,
        save_predictions_output_dir_path=None,
        verbose=False,
    ):
        """
        :param targets_json: a COCO-formatted dictionary with the keys "images", "categories" and "annotations"
        :param iou_threshold: If set, the IoU threshold at which mAP will be calculated. Otherwise, the COCO default range of IoU thresholds will be used.
        :param save_predictions_output_dir_path: If provided, the path to which the accumulated predictions will be saved, in coco json format.
        :param verbose: If True, display the output provided by pycocotools, containing the average precision and recall across a range of box sizes.
        """
        self.evaluator = COCOMeanAveragePrecision(iou_threshold)
        self.targets_json = targets_json
        self.verbose = verbose
        self.save_predictions_path = (
            Path(save_predictions_output_dir_path)
            if save_predictions_output_dir_path is not None
            else None
        )

        self.eval_predictions = []
        self.image_ids = set()

    def on_eval_step_end(self, trainer, batch, batch_output, **kwargs):
        predictions = batch_output["predictions"]
        if len(predictions) > 0:
            self._update(predictions)

    def on_eval_epoch_end(self, trainer, **kwargs):
        preds_df = pd.DataFrame(
            self.eval_predictions,
            columns=[
                XMIN_COL,
                YMIN_COL,
                XMAX_COL,
                YMAX_COL,
                SCORE_COL,
                CLASS_ID_COL,
                IMAGE_ID_COL,
            ],
        )

        predictions_json = self.evaluator.create_predictions_coco_json_from_df(preds_df)
        self._save_predictions(trainer, predictions_json)

        if self.verbose and trainer.run_config.is_local_process_zero:
            self.evaluator.verbose = True

        map_ = self.evaluator.compute(self.targets_json, predictions_json)
        trainer.run_history.update_metric("map", map_)

        self._reset()

    @classmethod
    def create_from_targets_df(
        cls,
        targets_df,
        image_ids,
        iou_threshold=None,
        save_predictions_output_dir_path=None,
        verbose=False,
    ):
        """
        Create an instance of :class:`CalculateMeanAveragePrecisionCallback` from a dataframe containing the ground
        truth targets and a collection of all image ids in the dataset.

        :param targets_df: DF w/ cols: ["image_id", "xmin", "ymin", "xmax", "ymax", "class_id"]
        :param image_ids: A collection of all image ids in the dataset, including those without annotations.
        :param iou_threshold: If set, the IoU threshold at which mAP will be calculated. Otherwise, the COCO default range of IoU thresholds will be used.
        :param save_predictions_output_dir_path: If provided, the path to which the accumulated predictions will be saved, in coco json format.
        :param verbose: If True, display the output provided by pycocotools, containing the average precision and recall across a range of box sizes.
        :return: An instance of :class:`CalculateMeanAveragePrecisionCallback`
        """

        targets_json = COCOMeanAveragePrecision.create_targets_coco_json_from_df(
            targets_df, image_ids
        )

        return cls(
            targets_json=targets_json,
            iou_threshold=iou_threshold,
            save_predictions_output_dir_path=save_predictions_output_dir_path,
            verbose=verbose,
        )

    def _remove_seen(self, labels):
        """
        Remove any image id that has already been seen during the evaluation epoch. This can arise when performing
        distributed evaluation on a dataset where the batch size does not evenly divide the number of samples.
        """
        image_ids = labels[:, -1].tolist()

        # remove any image_idx that has already been seen
        # this can arise from distributed training where batch size does not evenly divide dataset
        seen_id_mask = torch.as_tensor(
            [False if idx not in self.image_ids else True for idx in image_ids]
        )

        if seen_id_mask.all():
            # no update required as all ids already seen this pass
            return []
        elif seen_id_mask.any():  # at least one True
            # remove predictions for images already seen this pass
            labels = labels[~seen_id_mask]

        return labels

    def _update(self, predictions):
        filtered_predictions = self._remove_seen(predictions)

        if len(filtered_predictions) > 0:
            self.eval_predictions.extend(filtered_predictions.tolist())
            updated_ids = filtered_predictions[:, -1].unique().tolist()
            self.image_ids.update(updated_ids)

    def _reset(self):
        self.image_ids = set()
        self.eval_predictions = []

    def _save_predictions(self, trainer, predictions_json):
        if (
            self.save_predictions_path is not None
            and trainer.run_config.is_world_process_zero
        ):
            with open(self.save_predictions_path / "predictions.json", "w") as f:
                json.dump(predictions_json, f)

Now, all that we have to do is plug our callback into our Trainer, and our mAP will be recorded at each epoch!

Run training

Now, let’s put everything we have seen so far into a simple training script. Here, we have used a simple training recipe that works well for a variety of tasks and have carried out minimal hyperparameter tuning.

As we noticed that the ground truth boxes for this dataset can contain quite a bit of space around the object, we decided to set the IoU threshold used for evaluation quite low; as it is likely that the boxes produced by the model will be tighter around the object.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/minimal_finetune_cars.py

import os
import random
from functools import partial
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from func_to_script import script
from PIL import Image
from pytorch_accelerated.callbacks import (
    EarlyStoppingCallback,
    SaveBestModelCallback,
    get_default_callbacks,
)
from pytorch_accelerated.schedulers import CosineLrScheduler
from torch.utils.data import Dataset

from yolov7 import create_yolov7_model
from yolov7.dataset import Yolov7Dataset, create_yolov7_transforms, yolov7_collate_fn
from yolov7.evaluation import CalculateMeanAveragePrecisionCallback
from yolov7.loss_factory import create_yolov7_loss
from yolov7.trainer import Yolov7Trainer, filter_eval_predictions


def load_cars_df(annotations_file_path, images_path):
    all_images = sorted(set([p.parts[-1] for p in images_path.iterdir()]))
    image_id_to_image = {i: im for i, im in enumerate(all_images)}
    image_to_image_id = {v: k for k, v in image_id_to_image.items()}

    annotations_df = pd.read_csv(annotations_file_path)
    annotations_df.loc[:, "class_name"] = "car"
    annotations_df.loc[:, "has_annotation"] = True

    # add 100 empty images to the dataset
    empty_images = sorted(set(all_images) - set(annotations_df.image.unique()))
    non_annotated_df = pd.DataFrame(list(empty_images)[:100], columns=["image"])
    non_annotated_df.loc[:, "has_annotation"] = False
    non_annotated_df.loc[:, "class_name"] = "background"

    df = pd.concat((annotations_df, non_annotated_df))

    class_id_to_label = dict(
        enumerate(df.query("has_annotation == True").class_name.unique())
    )
    class_label_to_id = {v: k for k, v in class_id_to_label.items()}

    df["image_id"] = df.image.map(image_to_image_id)
    df["class_id"] = df.class_name.map(class_label_to_id)

    file_names = tuple(df.image.unique())
    random.seed(42)
    validation_files = set(random.sample(file_names, int(len(df) * 0.2)))
    train_df = df[~df.image.isin(validation_files)]
    valid_df = df[df.image.isin(validation_files)]

    lookups = {
        "image_id_to_image": image_id_to_image,
        "image_to_image_id": image_to_image_id,
        "class_id_to_label": class_id_to_label,
        "class_label_to_id": class_label_to_id,
    }
    return train_df, valid_df, lookups


class CarsDatasetAdaptor(Dataset):
    def __init__(
        self,
        images_dir_path,
        annotations_dataframe,
        transforms=None,
    ):
        self.images_dir_path = Path(images_dir_path)
        self.annotations_df = annotations_dataframe
        self.transforms = transforms

        self.image_idx_to_image_id = {
            idx: image_id
            for idx, image_id in enumerate(self.annotations_df.image_id.unique())
        }
        self.image_id_to_image_idx = {
            v: k for k, v in self.image_idx_to_image_id.items()
        }

    def __len__(self) -> int:
        return len(self.image_idx_to_image_id)

    def __getitem__(self, index):
        image_id = self.image_idx_to_image_id[index]
        image_info = self.annotations_df[self.annotations_df.image_id == image_id]
        file_name = image_info.image.values[0]
        assert image_id == image_info.image_id.values[0]

        image = Image.open(self.images_dir_path / file_name).convert("RGB")
        image = np.array(image)

        image_hw = image.shape[:2]

        if image_info.has_annotation.any():
            xyxy_bboxes = image_info[["xmin", "ymin", "xmax", "ymax"]].values
            class_ids = image_info["class_id"].values
        else:
            xyxy_bboxes = np.array([])
            class_ids = np.array([])

        if self.transforms is not None:
            transformed = self.transforms(
                image=image, bboxes=xyxy_bboxes, labels=class_ids
            )
            image = transformed["image"]
            xyxy_bboxes = np.array(transformed["bboxes"])
            class_ids = np.array(transformed["labels"])

        return image, xyxy_bboxes, class_ids, image_id, image_hw


DATA_PATH = Path("/".join(Path(__file__).absolute().parts[:-2])) / "data/cars"


@script
def main(
    data_path: str = DATA_PATH,
    image_size: int = 640,
    pretrained: bool = True,
    num_epochs: int = 30,
    batch_size: int = 8,
):
    # Load data
    data_path = Path(data_path)
    images_path = data_path / "training_images"
    annotations_file_path = data_path / "annotations.csv"

    train_df, valid_df, lookups = load_cars_df(annotations_file_path, images_path)
    num_classes = 1

    # Create datasets
    train_ds = CarsDatasetAdaptor(
        images_path,
        train_df,
    )
    eval_ds = CarsDatasetAdaptor(images_path, valid_df)

    train_yds = Yolov7Dataset(
        train_ds,
        create_yolov7_transforms(training=True, image_size=(image_size, image_size)),
    )
    eval_yds = Yolov7Dataset(
        eval_ds,
        create_yolov7_transforms(training=False, image_size=(image_size, image_size)),
    )

    # Create model, loss function and optimizer
    model = create_yolov7_model(
        architecture="yolov7", num_classes=num_classes, pretrained=pretrained
    )

    loss_func = create_yolov7_loss(model, image_size=image_size)

    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.01, momentum=0.9, nesterov=True
    )

    # Create trainer and train
    trainer = Yolov7Trainer(
        model=model,
        optimizer=optimizer,
        loss_func=loss_func,
        filter_eval_predictions_fn=partial(
            filter_eval_predictions, confidence_threshold=0.01, nms_threshold=0.3
        ),
        callbacks=[
            CalculateMeanAveragePrecisionCallback.create_from_targets_df(
                targets_df=valid_df.query("has_annotation == True")[
                    ["image_id", "xmin", "ymin", "xmax", "ymax", "class_id"]
                ],
                image_ids=set(valid_df.image_id.unique()),
                iou_threshold=0.2,
            ),
            SaveBestModelCallback(watch_metric="map", greater_is_better=True),
            EarlyStoppingCallback(
                early_stopping_patience=3,
                watch_metric="map",
                greater_is_better=True,
                early_stopping_threshold=0.001,
            ),
            *get_default_callbacks(progress_bar=True),
        ],
    )

    trainer.train(
        num_epochs=num_epochs,
        train_dataset=train_yds,
        eval_dataset=eval_yds,
        per_device_batch_size=batch_size,
        create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(
            num_warmup_epochs=5,
            num_cooldown_epochs=5,
            k_decay=2,
        ),
        collate_fn=yolov7_collate_fn,
    )


if __name__ == "__main__":
    main()

Launching training as described here, using a single V100 GPU with fp16 enabled, after 3 epochs we obtained a mAP of 0.995, which suggests that the model has learned the task almost perfectly!

However, whilst this is a great result, it is largely expected as COCO contains images of cars.

Now that we have successfully finetuned a pretrained YOLOv7 model, let’s explore how we can train the model from scratch. Whilst this could be done using numerous different training recipes, let’s take a look at some of the key techniques that were used by the authors when training on COCO.

Mosaic Augmentation

Data augmentation is an important technique in deep learning where we synthetically expand our dataset by applying a series of augmentations to our data during training. Whilst common transforms in object detection tend to be augmentations such as flips and rotations, the YOLO authors take a slightly different approach by applying Mosaic augmentation; which was previously used by YOLOv4, YOLOv5 and YOLOX models.

The objective of mosaic augmentation is to overcome the observation that object detection models tend to focus on detecting items towards the centre of the image. The key idea is that, if we stitch multiple images together, the objects are likely to be in positions and contexts that are not normally observed in images seen in the dataset; which should force the features learned by the model to be more position invariant.

Whilst there are a couple of different implementations of mosaic, each with minor differences, here we shall present an implementation that combines four different images. This implementation has worked well for us in the past, with a variety of object detection models.

Although there is no requirement to resize images prior to creating a mosaic, it does result in the created mosaics being similar sizes. Therefore, we shall take that approach here. We can do this by creating a simple resizing transform and adding it to our dataset adaptor.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/dataset.py

import albumentations as A


def create_base_transforms(target_image_size):
    return A.Compose(
        [
            A.LongestMaxSize(target_image_size),
        ],
        bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
    )

To apply our augmentations, once again, we are using Albumentations, which supports many object detection transforms.

Whilst data augmentations are usually implemented as functions, which are passed to a PyTorch dataset and applied shortly after loading an image, as mosaic requires loading multiple images from the dataset, this approach will not work here. We decided to implement mosaic as a dataset wrapper class, to cleanly encapsulate this logic. We can import and use this as demonstrated below:
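A usage sketch along these lines is shown below; the wrapper lives in yolov7/mosaic.py in the accompanying repo, and we are assuming here that it is exposed as MosaicMixupDataset and mirrors the adaptor's output format, so check the repo for the exact names and arguments:

from yolov7.mosaic import MosaicMixupDataset

# wrap the dataset adaptor (with the base resizing transform applied)
mosaic_mixup_ds = MosaicMixupDataset(train_ds)

# indexing returns a mosaic built from the indexed image plus randomly selected others
image, xyxy_bboxes, class_ids, image_id, image_hw = mosaic_mixup_ds[0]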

Let’s take a look at some examples of the types of images that are produced. As we haven’t (yet) passed any resizing transforms to our mosaic dataset, these images are quite large.

Notice that, whilst the mosaic images appear quite different, they were all created by calling the dataset with the same index, and are therefore based on the same image! When a mosaic is created, it randomly selects 3 other images from the dataset and places them in random positions; this results in different looking images being produced each time. Therefore, applying this augmentation does break down our concept of a training epoch — where each image in the dataset is seen exactly once — as images can be seen multiple times!

As a result, when training with mosaic, our strategy is not to think too much about the number of epochs and train the model for as long as possible until it stops converging. After all, the notion of an epoch is only really useful to help us track our training — the model just sees a continuous stream of images either way!

Mixup Augmentation

Mosaic augmentation is often applied alongside another transform — Mixup. To visualise what this does, let’s disable mosaic for the moment and enable mixup on its own; we can do this as demonstrated below:

Interesting! We can see that it has combined two images together, which results in some ‘ghostly’ looking cars and backgrounds! Now, let’s enable both transforms and inspect our outputs.

Wow! There are quite a lot of cars to detect in our resulting image, in many different positions — which will definitely be a challenge for the model! Notice that when we apply mosaic and mixup together, a single image is mixed with a mosaic.

Post-mosaic affine transforms

As we noted earlier, the mosaics that we are creating are significantly bigger than the image sizes we will use to train our model, so we will need to do some sort of resizing here. The simplest way would be to simply apply a resize transform after creating the mosaic.

Whilst this would work, it is likely to result in some very small objects, as we are essentially resizing four images to the size of one – which is likely to become a problem where the domain already contains very small bounding boxes! Additionally, each of our mosaics is structurally quite similar, with an image in each quadrant. Recalling that our aim was to make the model more robust to position changes, this may not actually help that much; as the model is likely just to start looking in the middle of each quadrant.

To overcome this, one approach that we can take is to simply take a random crop from our mosaic. This will still provide the variability in positioning whilst preserving the size and aspect ratio of the target objects. At this point, it may also be a good opportunity to add in some other transforms such as scaling and rotation to add even more variability.

The exact transforms, and magnitudes, used will be heavily dependent on the images that you are using, so we would recommend experimenting with these settings first — to ensure that all objects are still visible and recognisable — prior to training a model!

We can define the transforms to apply to our mosaic images as demonstrated below. Here, we have chosen a selection of affine transforms — in sensible ranges for our target data — followed by a random crop. Following the original implementation, we are also applying mixup less frequently than mosaic.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/mosaic.py

def create_post_mosaic_transform(
    output_height,
    output_width,
    pad_colour=(0, 0, 0),
    rotation_range=(-10, 10),
    shear_range=(-10, 10),
    translation_percent_range=(-0.2, 0.2),
    scale_range=(0.08, 1.0),
    apply_prob=0.8,
):
    return A.Compose(
        [
            A.Affine(
                cval=pad_colour,
                rotate=rotation_range,
                shear=shear_range,
                translate_percent=translation_percent_range,
                scale=None,
                keep_ratio=True,
                p=apply_prob,
            ),
            A.HorizontalFlip(),
            A.RandomResizedCrop(height=output_height, width=output_width, scale=scale_range),
        ],
        bbox_params=A.BboxParams(
            format="pascal_voc", label_fields=["labels"], min_visibility=0.25
        ),
    )

Looking at these images, we can see a huge amount of variation and the images are now the correct size for training. As we have selected a random scale, we can also see that not every image looks like a mosaic, so these outputs should not be too dissimilar to the images that the model will see during inference. If more extreme augmentations are used — such that there is a notable difference between the training and inference images — it can be advantageous to disable these shortly before the end of training.

In the official implementation, the authors use mosaics of both 4 and 9 images during training. However, when inspecting the outputs of these augmentations combined with scaling and cropping, in many cases they looked very similar, so we have chosen to omit the 9-image mosaic here.

Applying weight decay to parameter groups

In our simple example earlier, we created our optimizer so that it would optimize all of the parameters of our model. However, if we would like to follow the authors in introducing weight decay regularization, following the guidance given in Bag of Tricks for Image Classification with Convolutional Neural Networks, this may not be optimal; the paper recommends that weight decay should be applied only to convolutional and fully connected layers.

To implement this in PyTorch, we will need to create two distinct parameter groups to be optimized; one containing our convolutional weights and the other with the remaining parameters. We can do this as demonstrated below:
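
In our implementation, this filtering is wrapped up in a get_parameter_groups method on the model, which returns a dictionary containing these groups:

param_groups = model.get_parameter_groups()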

Inspecting the method definition, we can see that this is a simple filter operation:
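
As a simplified sketch of the kind of filter this performs — not the exact repository code — we could separate the weights of convolutional and linear layers from everything else like this:

import torch.nn as nn

def get_parameter_groups(model):
    # separate the weights of convolutional and linear layers (which will receive
    # weight decay) from everything else (biases, batch norm parameters, ...)
    conv_weights, other_params = [], []
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            conv_weights.append(module.weight)
            if module.bias is not None:
                other_params.append(module.bias)
        elif not list(module.children()):  # other leaf modules, e.g. batch norm
            other_params.extend(module.parameters(recurse=False))
    return {"conv_weights": conv_weights, "other_params": other_params}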

Now we can simply pass these to the optimizer:

optimizer = torch.optim.SGD(
    param_groups["other_params"], lr=0.01, momentum=0.937, nesterov=True
)

optimizer.add_param_group(
    {"params": param_groups["conv_weights"], "weight_decay": weight_decay}
)

Learning rate scheduling

When training neural networks, we often wish to adjust the value of our learning rate during training; this is done using a learning rate scheduler. Whilst there are many popular schedules, the authors opt for a cosine learning rate schedule — with a linear warmup at the start of training. This has the following shape:

Cosine learning rate schedule (with warmup)

In practice, we find that a period of warmup, and cooldown — where the learning rate is held at its minimum value — is often a good strategy for this scheduler. Additionally, the scheduler implementation in PyTorch-accelerated supports a k-decay argument, which can be used to adjust how aggressive the annealing is.

For this problem, we found that using k-decay to hold the learning rate at a higher value for longer worked quite well. This schedule, along with warmup and cooldown epochs, can be seen below:

Cosine learning rate schedule (with warmup), set with k_decay = 2
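
For reference, this is how the schedule is configured when we launch training with PyTorch-accelerated later in this article; the warmup and cooldown periods are specified in epochs:

from pytorch_accelerated.schedulers import CosineLrScheduler

create_scheduler_fn = CosineLrScheduler.create_scheduler_fn(
    num_warmup_epochs=5,
    num_cooldown_epochs=5,
    k_decay=2,  # hold the learning rate at a higher value for longer before annealing
)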

Gradient accumulation, scaling weight decay

When training a model, the batch size we use is often determined by our hardware; as we want to try to maximise the amount of data that we can put on the GPU. However, some considerations must be made:

  • For very small batch sizes, we are unable to approximate the gradients of the whole dataset. This can result in unstable training.
  • Modifying the batch size can result in different settings being needed for hyperparameters such as the learning rate and weight decay. This can make it difficult to find a consistent set of hyperparameters.

To overcome this, the authors use a technique called gradient accumulation, in which the gradients from multiple steps are accumulated to simulate a bigger batch size. For example, suppose that the maximum batch size that we can fit on our GPU is 8. Instead of updating the parameters of the model at the end of each batch, we can save the gradient values, proceed to the next batch and add the new gradients. After a designated number of steps, we then perform the update; if we set our number of steps to 4, this is roughly equivalent to using a batch size of 32!

In PyTorch, this could be performed manually as follows:

num_accumulation_steps = 4

# loop through enumerated batches
for step, (inputs, labels) in enumerate(data_loader):

    model_outputs = model(inputs)
    loss = loss_fn(model_outputs, labels)

    # normalize loss to account for batch accumulation
    loss = loss / num_accumulation_steps

    # calculate gradients, these are summed automatically
    loss.backward()

    if ((step + 1) % num_accumulation_steps == 0) or (step + 1 == len(data_loader)):
        # perform weight update
        optimizer.step()
        optimizer.zero_grad()

In the original YOLOv7 implementation, the number of gradient accumulation steps is selected so that the total batch size (across all processes) is at least 64; which mitigates both of the issues discussed earlier. Additionally, the authors scale the weight decay used based on the batch size in the following way:

nominal_batch_size = 64
num_accumulate_steps = max(round(nominal_batch_size / total_batch_size), 1)

base_weight_decay = 0.0005
scaled_weight_decay = (
    base_weight_decay * total_batch_size * num_accumulate_steps / nominal_batch_size
)

We can visualise these relationships below:
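
If you wish to recreate these plots, the values can be computed directly from the formulas above; for example:

nominal_batch_size = 64
base_weight_decay = 0.0005

for total_batch_size in (8, 16, 32, 64, 128, 256):
    num_accumulate_steps = max(round(nominal_batch_size / total_batch_size), 1)
    scaled_weight_decay = (
        base_weight_decay * total_batch_size * num_accumulate_steps / nominal_batch_size
    )
    print(f"batch size: {total_batch_size}, accumulation steps: {num_accumulate_steps}, weight decay: {scaled_weight_decay}")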

Looking first at the number of accumulation steps, we can see that this decreases as our batch size increases, until we hit our nominal batch size; at this point, gradient accumulation is no longer needed.

Now looking at the amount of weight decay used, we can see that it is held at the base value until the nominal batch size is reached, and then is scaled linearly with the batch size; with more weight decay applied as the batch size gets bigger.

Model EMA

When training a model, it can be beneficial to set the values for the model weights by taking a moving average of the parameters that were observed across the entire training run, as opposed to using the parameters obtained after the last incremental update. This is often done by maintaining an exponential moving average (EMA) of the model parameters; in practice, this usually means maintaining another copy of the model to store these averaged weights. However, rather than updating all of the parameters of this model after every update step, we set these parameters using a linear combination of the existing parameter values and the updated values.

This is done using the following formula:

updated_EMA_model_weights = decay * EMA_model_weights + (1. - decay) * updated_model_weights

where the decay is a parameter that we set. For example, if we set decay=0.99, we have:

updated_EMA_model_weights = 0.99 * EMA_model_weights + 0.01 * updated_model_weights

which we can see is keeping 99% of the existing state and only 1% of the new state!

To understand why this may be beneficial, let's consider the case that our model, in an early stage of training, performs exceptionally poorly on a batch of data. This may result in a large update to our parameters, overcompensating for the high loss obtained, which will be detrimental for the upcoming batches. By incorporating only a small percentage of the latest parameters, large updates will be ‘smoothed’, and have less of an overall impact on the model’s weights. These averaged parameters can sometimes produce significantly better results during evaluation, and this technique has been employed in the training schemes for popular models such as MNASNet, MobileNet-V3 and EfficientNet; using the implementation included in TensorFlow.

The approach to EMA taken by the YOLOv7 authors is slightly different to other implementations as, instead of using a fixed decay, the amount of decay changes based on the number of updates that have been made. We can extend the ModelEma class included with PyTorch-accelerated to implement this behaviour, as defined below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/utils.py

import math

from pytorch_accelerated.utils import ModelEma


class Yolov7ModelEma(ModelEma):
    def __init__(self, model, decay=0.9999):
        super().__init__(model, decay)
        self.num_updates = 0
        self.decay_fn = lambda x: decay * (
            1 - math.exp(-x / 2000)
        )  # decay exponential ramp (to help early epochs)
        self.decay = self.decay_fn(self.num_updates)

    def update(self, model):
        super().update(model)
        self.num_updates += 1
        self.decay = self.decay_fn(self.num_updates)

Here, we can see that the decay is set by calling a function after each update. Let’s visualise how this looks:
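
A quick way to reproduce this curve is to evaluate the decay function directly; matplotlib is used here purely for plotting:

import math
import matplotlib.pyplot as plt

decay = 0.9999
decay_fn = lambda x: decay * (1 - math.exp(-x / 2000))

num_updates = range(3000)
plt.plot(num_updates, [decay_fn(x) for x in num_updates])
plt.xlabel("number of EMA updates")
plt.ylabel("decay")
plt.show()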

From this, we can see that the amount of decay increases with the number of updates, which happen once per epoch here.

Recalling the formulas above, this means that, initially, we favour using the updated model weights rather than a historical average. However, as training progresses, we start to incorporate more of the averaged weights from previous epochs. This is an interesting departure from the usual usage of this technique, and is designed to help the EMA model converge more quickly during the earlier epochs.

Selecting appropriate anchor box sizes

Recalling the earlier discussion on anchor boxes, and how these play an important part on how YOLOv7 is able to detect objects, let’s look at how we can evaluate whether our chosen anchor boxes are suitable for our problem and, if not, find some sensible choices for our dataset.

The approach here is largely adapted from the autoanchor approach used in YOLOv5, which was also used with YOLOv7.

Evaluating current anchor boxes

The simplest approach would be to use the same anchor boxes as used for COCO, which are already bundled with the defined architectures.
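
For reference, these are the anchor sizes defined for the base yolov7 architecture in the official configuration; it is worth double checking these values against the version of the model that you are using:

import torch

# (width, height) anchor sizes for the base yolov7 architecture, one group per FPN head
current_anchors = torch.tensor(
    [
        [[12, 16], [19, 36], [40, 28]],       # P3/8
        [[36, 75], [76, 55], [72, 146]],      # P4/16
        [[142, 110], [192, 243], [459, 401]], # P5/32
    ],
    dtype=torch.float32,
)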

Here we can see that we have 3 groups, one for each layer of the feature pyramid network. The numbers correspond to our anchor sizes, the width and height of the anchor boxes that will be generated.

Recall that the Feature Pyramid Network (FPN) has three outputs, and each output’s role is to detect objects according to their scale.

For example:

  • P3/8 is for detecting smaller objects.
  • P4/16 is for detecting medium objects.
  • P5/32 is for detecting bigger objects.

With this in mind, we need to set our anchor sizes accordingly for each layer.

To evaluate our current anchor boxes, we can calculate the best possible recall; this is the recall that would be achieved if the model were able to match an appropriate anchor box to every ground truth target.

Find and Resize ground truth bounding boxes

To evaluate our anchor boxes, we first need some knowledge of the shapes and sizes of the objects in our dataset. Before we can do this evaluation, we need to resize the width and height of our ground truth boxes based on the size of the images that we will train on — for this architecture, the recommended size is 640.

Let’s start by finding the width and height of all ground truth boxes in the training set. We can calculate these as demonstrated below:
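
As a sketch, using the bounding box columns from our annotations DataFrame (the w and h column names here are our own):

import numpy as np

annotated_train_df = train_df.query("has_annotation == True").copy()
annotated_train_df["w"] = annotated_train_df.xmax - annotated_train_df.xmin
annotated_train_df["h"] = annotated_train_df.ymax - annotated_train_df.ymin

gt_wh = annotated_train_df[["w", "h"]].values  # array of shape [N, 2]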

Next, we will need the height and width of our images. Sometimes, we have this information ahead of time, in which case we can use this knowledge directly. Otherwise, we can do this as follows:
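
For example, we could read the sizes directly from the image headers with PIL; the image_width and image_height column names here are our own:

from PIL import Image
import pandas as pd

image_sizes_df = pd.DataFrame(
    [
        {"image": p.parts[-1], **dict(zip(("image_width", "image_height"), Image.open(p).size))}
        for p in images_path.iterdir()
    ]
)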

We can now merge this with our existing DataFrame:
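
Continuing the sketch above, merging on the image name gives us the image size corresponding to each box:

annotated_train_df = annotated_train_df.merge(image_sizes_df, on="image")

gt_wh = annotated_train_df[["w", "h"]].values
image_sizes = annotated_train_df[["image_width", "image_height"]].values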

Now, we can use this information to get the resized widths and heights of our ground truth targets, with respect to our target image size. To preserve the aspect ratios of the objects in our images, the recommended approach to resizing is to scale the image so that the longest side is equal to our target size. We can do this using the function below:


# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

def calculate_resized_gt_wh(gt_wh, image_sizes, target_image_size=640):
    """
    Given an array of bounding box widths and heights, and their corresponding image sizes,
    resize these relative to the specified target image size.

    This function assumes that resizing will be performed by scaling the image such that the longest
    side is equal to the given target image size.

    :param gt_wh: an array of shape [N, 2] containing the raw width and height of each box.
    :param image_sizes: an array of shape [N, 2] or [1, 2] containing the width and height of the image corresponding to each box.
    :param target_image_size: the size of the images that will be used during training.

    """
    normalized_gt_wh = gt_wh / image_sizes
    target_image_sizes = (
        target_image_size * image_sizes / image_sizes.max(1, keepdims=True)
    )

    resized_gt_wh = target_image_sizes * normalized_gt_wh

    tiny_boxes_exist = (resized_gt_wh < 3).any(1).sum()
    if tiny_boxes_exist:
        print(
            f"""WARNING: Extremely small objects found.
            {tiny_boxes_exist} of {len(resized_gt_wh)} labels are < 3 pixels in size. These will be removed
            """
        )
        resized_gt_wh = resized_gt_wh[(resized_gt_wh >= 2.0).any(1)]

    return resized_gt_wh

Alternatively, as all of our images are the same size in this case, we could simply specify a single image size.

Note that we have also filtered out any boxes that will be incredibly small (less than 3 pixels in either height or width), with respect to the new image size, as these boxes are usually too small to be considered useful!

Calculating Best Possible Recall

Now that we have the width and height of all ground truth boxes in our training set, we can evaluate our current anchor boxes as follows:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

LOSS_ANCHOR_MULTIPLE_THRESHOLD = 4  # as discussed below, the loss only matches anchors within a 4x multiple of a target's size


def calculate_best_possible_recall(anchors, gt_wh):
    """
    Given a tensor of anchors and an array of widths and heights for each bounding box in the dataset,
    calculate the best possible recall that can be obtained if every box was matched to an appropriate anchor.

    :param anchors: a tensor of shape [N, 2] representing the width and height of each anchor
    :param gt_wh: a tensor of shape [N, 2] representing the width and height of each ground truth bounding box

    """
    best_anchor_ratio = calculate_best_anchor_ratio(anchors=anchors, wh=gt_wh)
    best_possible_recall = (
        (best_anchor_ratio > 1.0 / LOSS_ANCHOR_MULTIPLE_THRESHOLD).float().mean()
    )

    return best_possible_recall

From this, we can see that the current anchor boxes are a good fit for this dataset; which makes sense, as the images are quite similar to those in COCO.

How does this work?

At this point, you may be wondering: how exactly do we calculate the best possible recall? To answer this, let’s go through the process manually.

Intuitively, we would like to ensure that at least one anchor can be matched to each ground truth box. Whilst we could do this by framing it as an optimization problem — how do we match each ground truth box with its optimal anchor — this would introduce a lot of complexity for what we are trying to do.

Given an anchor box, we need a simpler way of measuring how well it can be made to fit a ground truth box. Let’s examine one approach that can be taken to do this, starting with the width and height of a single ground truth box.

For each anchor box, we can inspect the ratios of its height and width when compared to the height and width of our ground truth target and use this to understand where the biggest differences are.

As the scale of these ratios will depend on whether the anchor box sides are greater or smaller than the sides of our ground truth box, we can ensure that our magnitudes are in the range [0, 1] by also calculating the reciprocal and taking the minimum ratios for each anchor.

From this, we now have an indication of how well, independently, the width and height of each anchor box ‘fits’ to our ground truth target.

Now, our challenge is how to evaluate the matching of the width and height together!

One way we can approach this is to take the minimum ratio for each anchor, representing the side that worst matches our ground truth.

The reason we have selected the worst fitting side here is that we know the other side matches our target at least as well as the one selected; we can think of this as the worst-case scenario!

Now, let’s select the anchor box which matches best out of these options; this is simply the largest value.

Out of the worst fitting options, this is our selected match!

Recalling that the loss function only looks to match anchor boxes that are up to 4 times greater or smaller than the size of the ground truth target, we can now verify whether this anchor is within this range and would be considered a successful match.

We can do that as demonstrated below, taking the reciprocal of our loss multiple, to ensure that it is in the same range as our value:

From this, we can see that at least one of our anchors could be successfully matched to our selected ground truth target!

Now that we understand the sequence of steps, we can now apply the same logic to all of our ground truth boxes to see how many matches we can obtain with our current set of anchors:
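
As a vectorised sketch of these steps — mirroring the calculate_best_anchor_ratio function used earlier, although the exact signature in the repository may differ — using the anchor tensor shown above and the resized ground truth sizes from earlier:

import torch

def calculate_best_anchor_ratio(anchors, gt_wh):
    # ratio of each ground truth side to each anchor side: shape [num_boxes, num_anchors, 2]
    ratios = gt_wh[:, None] / anchors[None]
    # take the min of each ratio and its reciprocal so that all values lie in (0, 1],
    # then take the worst (smallest) of the width/height ratios for each anchor
    worst_side_ratios = torch.min(ratios, 1 / ratios).min(2).values
    # for each ground truth box, keep the best anchor out of these worst-case ratios
    return worst_side_ratios.max(1).values

gt_wh_tensor = torch.as_tensor(resized_gt_wh, dtype=torch.float32)
anchors_tensor = current_anchors.view(-1, 2)  # flatten the per-head groups into an [A, 2] tensor

best_anchor_ratio = calculate_best_anchor_ratio(anchors_tensor, gt_wh_tensor)
matched = best_anchor_ratio > 1 / LOSS_ANCHOR_MULTIPLE_THRESHOLD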

Now that we have calculated, for each ground truth box, whether it has a match, we can take the mean number of matches to find our best possible recall; in our case, this is 1, as we saw earlier!

Selecting new anchor boxes

Whilst using the pre-defined anchors may be a good choice for similar datasets, this may not be appropriate for all datasets, for example, those that contain lots of small objects. In these cases, a better approach may be to select entirely new anchors.

Let’s explore how we can do this!

First, let’s define the number of anchors that we need for our architecture.
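
With three anchor sizes per FPN head and three heads, we need nine in total:

num_anchor_sizes_per_fpn_head = 3
num_fpn_heads = 3

num_anchors = num_anchor_sizes_per_fpn_head * num_fpn_heads  # 9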

Now, based on our bounding boxes, we need to define a sensible set of widths and heights for our anchor templates. One way that we can estimate this is by using K-means to cluster our ground truth widths and heights, based on the number of anchor sizes that we need. We can then use these centroids as our starting estimates. We can do this using the following function:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

from scipy.cluster.vq import kmeans  # the same kmeans implementation used by the YOLOv5 autoanchor approach


def estimate_anchors(num_anchors, gt_wh):
    """
    Given a target number of anchors and an array of widths and heights for each bounding box in the dataset,
    estimate a set of anchors using the centroids from Kmeans clustering.

    :param num_anchors: the number of anchors to return
    :param gt_wh: an array of shape [N, 2] representing the width and height of each ground truth bounding box

    """
    print(f"Running kmeans for {num_anchors} anchors on {len(gt_wh)} points...")
    std_dev = gt_wh.std(0)
    proposed_anchors, _ = kmeans(
        gt_wh / std_dev, num_anchors, iter=30
    )  # divide by std so they are in approx same range
    proposed_anchors *= std_dev

    return proposed_anchors
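
For example, assuming resized_gt_wh holds the output of calculate_resized_gt_wh from earlier:

proposed_anchors = estimate_anchors(num_anchors=num_anchors, gt_wh=resized_gt_wh)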

Here, we can see that we now have a set of anchor templates that we can use as a starting point. As before, let’s calculate our best possible recall using these anchor boxes:

Once again, we see that our best possible recall is 1, which means that these anchor sizes are also a good fit for our problem!

Whilst it is perhaps unnecessary in this case, we may be able to improve these anchors further using a genetic algorithm. Following this methodology, we can define a fitness (or reward) function to measure how well our anchor boxes match our data and make small, random changes to our anchor sizes to try and maximise this function.

In this case we can define our fitness function as follows:

def anchor_fitness(anchors, wh):
    """
    A fitness function that can be used to evolve a set of anchors. This function calculates the mean best anchor ratio
    for all matches that are within the multiple range considered during the loss calculation.
    """
    best_anchor_ratio = calculate_best_anchor_ratio(anchors=anchors, gt_wh=wh)
    return (
        best_anchor_ratio
        * (best_anchor_ratio > 1 / LOSS_ANCHOR_MULTIPLE_THRESHOLD).float()
    ).mean()

Here, we are taking the best anchor ratio for each match that will be considered during the loss calculation. If an anchor box is more than four times greater or smaller than its matched bounding box, it will not contribute to our score. Let’s use this to calculate a fitness score for our proposed anchor sizes:
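
For example, converting our proposed anchors to a float tensor first and reusing the ground truth tensor from earlier:

proposed_anchor_fitness = anchor_fitness(
    torch.as_tensor(proposed_anchors, dtype=torch.float32), gt_wh_tensor
)
print(f"Proposed anchor fitness: {proposed_anchor_fitness:.4f}")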

Now, let’s use this as the fitness function when optimizing our anchors, as demonstrated below:

Inspecting the definition of this function, we can see that, for a specified number of iterations, we are simply sampling random noise from a normal distribution and using this to mutate our anchor sizes. If this change leads to an increased score, we keep these as our anchor sizes!

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

import numpy as np
from tqdm import tqdm


def evolve_anchors(
    proposed_anchors,
    gt_wh,
    num_iterations=1000,
    mutation_probability=0.9,
    mutation_noise_mean=1,
    mutation_noise_std=0.1,
    anchor_fitness_fn=anchor_fitness,
    verbose=False,
):
    """
    Use a genetic algorithm to mutate the given anchors to try and optimise them based on the given widths and heights of the
    ground truth boxes based on the provided fitness function. Anchor dimensions are mutated by adding random noise sampled
    from a normal distribution with the mean and standard deviation provided.

    :param proposed_anchors: a tensor containing the aspect ratios of the anchor boxes to evolve
    :param gt_wh: a tensor of shape [N, 2] representing the width and height of each ground truth bounding box
    :param num_iterations: the number of iterations for which to run the algorithm
    :param mutation_probability: the probability that each anchor dimension is mutated during each iteration
    :param mutation_noise_mean: the mean of the normal distribution from which the mutation noise will be sampled
    :param mutation_noise_std: the standard deviation of the normal distribution from which the mutation noise will be sampled
    :param anchor_fitness_fn: the reward function that will be used during the optimization process. This should accept proposed_anchors and gt_wh as arguments
    :param verbose: if True, the value of the fitness function will be printed at the end of each iteration

    """
    best_fitness = anchor_fitness_fn(proposed_anchors, gt_wh)
    anchor_shape = proposed_anchors.shape

    pbar = tqdm(range(num_iterations), desc="Evolving anchors with Genetic Algorithm:")
    for i, _ in enumerate(pbar):
        # Define mutation by sampling noise from a normal distribution
        anchor_mutation = np.ones(anchor_shape)
        anchor_mutation = (
            (np.random.random(anchor_shape) < mutation_probability)
            * np.random.randn(*anchor_shape)
            * mutation_noise_std
            + mutation_noise_mean
        ).clip(0.3, 3.0)

        mutated_anchors = (proposed_anchors.copy() * anchor_mutation).clip(min=2.0)
        mutated_anchor_fitness = anchor_fitness_fn(mutated_anchors, gt_wh)

        if mutated_anchor_fitness > best_fitness:
            best_fitness, proposed_anchors = (
                mutated_anchor_fitness,
                mutated_anchors.copy(),
            )
            pbar.desc = (
                f"Evolving anchors with Genetic Algorithm: fitness = {best_fitness:.4f}"
            )
            if verbose:
                print(f"Iteration: {i}, Fitness: {best_fitness}")

    return proposed_anchors

Let’s see whether this has improved our score at all:

We can see that our evolved anchors have a better fitness score than our original proposed anchors, as we would expect!

Now, all that is left to do is to sort the anchors into a rough ascending order, considering the smallest dimension for each anchor.
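
For example, if evolved_anchors holds the output of evolve_anchors as a numpy array:

final_anchors = evolved_anchors[np.argsort(evolved_anchors.min(axis=-1))]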

Putting it all together

Now that we understand the process, we could calculate our anchors for our dataset in a single step using the following function.

In this case, as our best possible recall is already greater than the threshold, we can keep our original anchor sizes!

However, in cases where our anchor sizes change, we can update them as demonstrated below:

Run training

Now that we have explored some of the techniques used in the original training recipe, let’s update our training script to include some of these features. An updated script is presented below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/train_cars.py

import random
from functools import partial
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from func_to_script import script
from PIL import Image
from pytorch_accelerated.callbacks import (
ModelEmaCallback,
ProgressBarCallback,
SaveBestModelCallback,
get_default_callbacks,
)
from pytorch_accelerated.schedulers import CosineLrScheduler
from torch.utils.data import Dataset

from yolov7 import create_yolov7_model
from yolov7.dataset import (
Yolov7Dataset,
create_base_transforms,
create_yolov7_transforms,
yolov7_collate_fn,
)
from yolov7.evaluation import CalculateMeanAveragePrecisionCallback
from yolov7.loss_factory import create_yolov7_loss
from yolov7.mosaic import MosaicMixupDataset, create_post_mosaic_transform
from yolov7.trainer import Yolov7Trainer, filter_eval_predictions
from yolov7.utils import SaveBatchesCallback, Yolov7ModelEma

def load_cars_df(annotations_file_path, images_path):
all_images = sorted(set([p.parts[-1] for p in images_path.iterdir()]))
image_id_to_image = {i: im for i, im in enumerate(all_images)}
image_to_image_id = {v: k for k, v, in image_id_to_image.items()}

annotations_df = pd.read_csv(annotations_file_path)
annotations_df.loc[:, "class_name"] = "car"
annotations_df.loc[:, "has_annotation"] = True

# add 100 empty images to the dataset
empty_images = sorted(set(all_images) - set(annotations_df.image.unique()))
non_annotated_df = pd.DataFrame(list(empty_images)[:100], columns=["image"])
non_annotated_df.loc[:, "has_annotation"] = False
non_annotated_df.loc[:, "class_name"] = "background"

df = pd.concat((annotations_df, non_annotated_df))

class_id_to_label = dict(
enumerate(df.query("has_annotation == True").class_name.unique())
)
class_label_to_id = {v: k for k, v in class_id_to_label.items()}

df["image_id"] = df.image.map(image_to_image_id)
df["class_id"] = df.class_name.map(class_label_to_id)

file_names = tuple(df.image.unique())
random.seed(42)
validation_files = set(random.sample(file_names, int(len(df) * 0.2)))
train_df = df[~df.image.isin(validation_files)]
valid_df = df[df.image.isin(validation_files)]

lookups = {
"image_id_to_image": image_id_to_image,
"image_to_image_id": image_to_image_id,
"class_id_to_label": class_id_to_label,
"class_label_to_id": class_label_to_id,
}
return train_df, valid_df, lookups

class CarsDatasetAdaptor(Dataset):
def __init__(
self,
images_dir_path,
annotations_dataframe,
transforms=None,
):
self.images_dir_path = Path(images_dir_path)
self.annotations_df = annotations_dataframe
self.transforms = transforms

self.image_idx_to_image_id = {
idx: image_id
for idx, image_id in enumerate(self.annotations_df.image_id.unique())
}
self.image_id_to_image_idx = {
v: k for k, v, in self.image_idx_to_image_id.items()
}

def __len__(self) -> int:
return len(self.image_idx_to_image_id)

def __getitem__(self, index):
image_id = self.image_idx_to_image_id[index]
image_info = self.annotations_df[self.annotations_df.image_id == image_id]
file_name = image_info.image.values[0]
assert image_id == image_info.image_id.values[0]

image = Image.open(self.images_dir_path / file_name).convert("RGB")
image = np.array(image)

image_hw = image.shape[:2]

if image_info.has_annotation.any():
xyxy_bboxes = image_info[["xmin", "ymin", "xmax", "ymax"]].values
class_ids = image_info["class_id"].values
else:
xyxy_bboxes = np.array([])
class_ids = np.array([])

if self.transforms is not None:
transformed = self.transforms(
image=image, bboxes=xyxy_bboxes, labels=class_ids
)
image = transformed["image"]
xyxy_bboxes = np.array(transformed["bboxes"])
class_ids = np.array(transformed["labels"])

return image, xyxy_bboxes, class_ids, image_id, image_hw

DATA_PATH = Path("/".join(Path(__file__).absolute().parts[:-2])) / "data/cars"

@script
def main(
data_path: str = DATA_PATH,
image_size: int = 640,
pretrained: bool = False,
num_epochs: int = 300,
batch_size: int = 8,
):

# load data
data_path = Path(data_path)
images_path = data_path / "training_images"
annotations_file_path = data_path / "annotations.csv"
train_df, valid_df, lookups = load_cars_df(annotations_file_path, images_path)
num_classes = 1

# create datasets
train_ds = CarsDatasetAdaptor(
images_path, train_df, transforms=create_base_transforms(image_size)
)
eval_ds = CarsDatasetAdaptor(images_path, valid_df)

mds = MosaicMixupDataset(
train_ds,
apply_mixup_probability=0.15,
post_mosaic_transforms=create_post_mosaic_transform(
output_height=image_size, output_width=image_size
),
)
if pretrained:
# disable mosaic if finetuning
mds.disable()

train_yds = Yolov7Dataset(
mds,
create_yolov7_transforms(training=True, image_size=(image_size, image_size)),
)
eval_yds = Yolov7Dataset(
eval_ds,
create_yolov7_transforms(training=False, image_size=(image_size, image_size)),
)

# create model, loss function and optimizer
model = create_yolov7_model(
architecture="yolov7", num_classes=num_classes, pretrained=pretrained
)
param_groups = model.get_parameter_groups()

loss_func = create_yolov7_loss(model, image_size=image_size)

optimizer = torch.optim.SGD(
param_groups["other_params"], lr=0.01, momentum=0.937, nesterov=True
)

# create evaluation callback and trainer
calculate_map_callback = (
CalculateMeanAveragePrecisionCallback.create_from_targets_df(
targets_df=valid_df.query("has_annotation == True")[
["image_id", "xmin", "ymin", "xmax", "ymax", "class_id"]
],
image_ids=set(valid_df.image_id.unique()),
iou_threshold=0.2,
)
)

trainer = Yolov7Trainer(
model=model,
optimizer=optimizer,
loss_func=loss_func,
filter_eval_predictions_fn=partial(
filter_eval_predictions, confidence_threshold=0.01, nms_threshold=0.3
),
callbacks=[
calculate_map_callback,
ModelEmaCallback(
decay=0.9999,
model_ema=Yolov7ModelEma,
callbacks=[ProgressBarCallback, calculate_map_callback],
),
SaveBestModelCallback(watch_metric="map", greater_is_better=True),
SaveBatchesCallback("./batches", num_images_per_batch=3),
*get_default_callbacks(progress_bar=True),
],
)

# calculate scaled weight decay and gradient accumulation steps
total_batch_size = (
batch_size * trainer._accelerator.num_processes
) # batch size across all processes

nominal_batch_size = 64
num_accumulate_steps = max(round(nominal_batch_size / total_batch_size), 1)
base_weight_decay = 0.0005
scaled_weight_decay = (
base_weight_decay * total_batch_size * num_accumulate_steps / nominal_batch_size
)

optimizer.add_param_group(
{"params": param_groups["conv_weights"], "weight_decay": scaled_weight_decay}
)

# run training
trainer.train(
num_epochs=num_epochs,
train_dataset=train_yds,
eval_dataset=eval_yds,
per_device_batch_size=batch_size,
create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(
num_warmup_epochs=5,
num_cooldown_epochs=5,
k_decay=2,
),
collate_fn=yolov7_collate_fn,
gradient_accumulation_steps=num_accumulate_steps,
)

if __name__ == "__main__":
main()

Launching training once again, as described here, using a single V100 GPU with fp16 enabled, after 300 epochs we obtained a mAP of 0.997, for both the model and the EMA model; a marginal increase over our transfer learning run, and probably the maximum performance that can be achieved on this dataset!

Hopefully that has provided a somewhat comprehensive overview of some of the most interesting ideas from the YOLOv7 training process, and how these can be applied in custom training scripts.

All of the code required to replicate this post is available as a notebook here. Whilst code snippets are used throughout the article, this is primarily for aesthetic purposes, please defer to the notebook, and the repo for working code.

Chris Hughes and Bernat Puig Camps are on LinkedIn

Here, we used the car object detection dataset from Kaggle which was made publicly available as part of Competition Six (tjmachinelearning.com). This dataset is frequently used for learning purposes.

Whilst there is no clear license attached to this dataset, we received explicit permission from the authors to use this as part of this article. Unless otherwise stated, all images used in this article are taken from this dataset.



To illustrate these concepts, we shall be using our own implementation of YOLOv7, which utilises the official pretrained weights, but has been written with modularity and readability in mind. This project initially started as an exercise for us to improve our understanding of how YOLOv7 works under the hood — in order to better understand how to apply it — but after successfully using it on a few different tasks, we have decided to make it publicly available. Whilst we would recommend using the official implementation if you wish to exactly reproduce the published results on COCO, we find that this implementation is more flexible to apply, and extend, to custom domains. Hopefully, this implementation will provide a clean and clear starting point for anyone wishing to experiment with YOLOv7 in their own custom training scripts, as well as providing more transparency around the techniques that were used during training in the original implementation.

In this article, we shall cover:

Exploring all of the details along the way, such as:

Tl;dr: If you just want to see some working code that you can use directly, all of the code required to replicate this post is available as a notebook here. Whilst code snippets are used throughout the article, this is primarily for aesthetic purposes, please defer to the notebook, and the repo for working code.

We would like to thank British Airways, for without their consistently delayed flights, this post would probably not have happened.

First, let’s take a look at how to load our dataset in the format that YOLOv7 expects.

Selecting a dataset

Throughout this article, we shall use the Kaggle cars object detection dataset; however, as our aim is to demonstrate how YOLOv7 can be applied to any problem, this is really the least important part of this work. Additionally, as the images are quite similar to COCO, it will enable us to experiment with a pretrained model before we do any training.

The annotations for this dataset are in the form of a .csv file, which associates the image name with the corresponding annotations; where each row represents one bounding box. Whilst there are around 1000 images in the training set, only those with annotations are included in this file.

We can view the format of this by loading it into a pandas DataFrame.

As it is not usually the case that all images in our dataset contain instances of the objects that we are trying to detect, we would also like to include some images that do not contain cars. To do this, we can define a function to load the annotations which also includes 100 ‘negative’ images. Additionally, as the designated test set is unlabelled, let’s randomly take 20% of these images to use as our validation set.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/train_cars.py

import pandas as pd
import random

def load_cars_df(annotations_file_path, images_path):
all_images = sorted(set([p.parts[-1] for p in images_path.iterdir()]))
image_id_to_image = {i: im for i, im in enumerate(all_images)}
image_to_image_id = {v: k for k, v, in image_id_to_image.items()}

annotations_df = pd.read_csv(annotations_file_path)
annotations_df.loc[:, "class_name"] = "car"
annotations_df.loc[:, "has_annotation"] = True

# add 100 empty images to the dataset
empty_images = sorted(set(all_images) - set(annotations_df.image.unique()))
non_annotated_df = pd.DataFrame(list(empty_images)[:100], columns=["image"])
non_annotated_df.loc[:, "has_annotation"] = False
non_annotated_df.loc[:, "class_name"] = "background"

df = pd.concat((annotations_df, non_annotated_df))

class_id_to_label = dict(
enumerate(df.query("has_annotation == True").class_name.unique())
)
class_label_to_id = {v: k for k, v in class_id_to_label.items()}

df["image_id"] = df.image.map(image_to_image_id)
df["class_id"] = df.class_name.map(class_label_to_id)

file_names = tuple(df.image.unique())
random.seed(42)
validation_files = set(random.sample(file_names, int(len(df) * 0.2)))
train_df = df[~df.image.isin(validation_files)]
valid_df = df[df.image.isin(validation_files)]

lookups = {
"image_id_to_image": image_id_to_image,
"image_to_image_id": image_to_image_id,
"class_id_to_label": class_id_to_label,
"class_label_to_id": class_label_to_id,
}
return train_df, valid_df, lookups

We can now use this function to load our data:

To make it easier to associate predictions with an image, we have assigned each image a unique id; in this case it is just an incrementing integer count. Additionally, we have added an integer value to represent the classes that we want to detect, which is a single class — ‘car’ — in this case.

Generally, object detection models reserve 0 as the background class, so class labels should start from 1. This is not the case for YOLOv7, so we start our class encoding from 0. For images that do not contain a car, we do not require a class id. We can confirm that this is the case by inspecting the lookups returned by our function.

Finally, let’s see the number of images in each class for our training and validation sets. As an image can have multiple annotations, we need to make sure that we account for this when calculating our counts:

Create a Dataset Adaptor

Usually, at this point, we would create a PyTorch dataset specific to the model that we shall be training.

However, we often use the pattern of first creating a dataset ‘adaptor’ class, with the sole responsibility of wrapping the underlying data sources and loading this appropriately. This way, we can easily switch out adaptors when using different datasets, without changing any pre-processing logic which is specific to the model that we are training.

Therefore, let’s focus for now on creating a CarsDatasetAdaptor class, which converts the specific raw dataset format into an image and corresponding annotations. Additionally, let’s load the image id that we assigned, as well as the height and width of our image, as they may be useful to us later on.

An implementation of this is presented below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/train_cars.py

from torch.utils.data import Dataset

class CarsDatasetAdaptor(Dataset):
def __init__(
self,
images_dir_path,
annotations_dataframe,
transforms=None,
):
self.images_dir_path = Path(images_dir_path)
self.annotations_df = annotations_dataframe
self.transforms = transforms

self.image_idx_to_image_id = {
idx: image_id
for idx, image_id in enumerate(self.annotations_df.image_id.unique())
}
self.image_id_to_image_idx = {
v: k for k, v, in self.image_idx_to_image_id.items()
}

def __len__(self) -> int:
return len(self.image_idx_to_image_id)

def __getitem__(self, index):
image_id = self.image_idx_to_image_id[index]
image_info = self.annotations_df[self.annotations_df.image_id == image_id]
file_name = image_info.image.values[0]
assert image_id == image_info.image_id.values[0]

image = Image.open(self.images_dir_path / file_name).convert("RGB")
image = np.array(image)

image_hw = image.shape[:2]

if image_info.has_annotation.any():
xyxy_bboxes = image_info[["xmin", "ymin", "xmax", "ymax"]].values
class_ids = image_info["class_id"].values
else:
xyxy_bboxes = np.array([])
class_ids = np.array([])

if self.transforms is not None:
transformed = self.transforms(
image=image, bboxes=xyxy_bboxes, labels=class_ids
)
image = transformed["image"]
xyxy_bboxes = np.array(transformed["bboxes"])
class_ids = np.array(transformed["labels"])

return image, xyxy_bboxes, class_ids, image_id, image_hw

Notice that, for our background images, we are just returning an empty array for our bounding boxes and class ids.

Using this, we can confirm that the length of our dataset is the same as the total number of training images that we calculated earlier.

Now, we can use this to visualise some of our images, as demonstrated below:

Create a YOLOv7 dataset

Now that we have created our dataset adaptor, let’s create a dataset which preprocesses our inputs into the format required by YOLOv7; these steps should remain the same regardless of the adaptor that we are using.

An implementation of this is presented below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/dataset.py

class Yolov7Dataset(Dataset):
"""
A dataset which takes an object detection dataset returning (image, boxes, classes, image_id, image_hw)
and applies the necessary preprocessing steps as required by Yolov7 models.

By default, this class expects the image, boxes (N, 4) and classes (N,) to be numpy arrays,
with the boxes in (x1,y1,x2,y2) format, but this behaviour can be modified by
overriding the `load_from_dataset` method.
"""

def __init__(self, dataset, transforms=None):
self.ds = dataset
self.transforms = transforms

def __len__(self):
return len(self.ds)

def load_from_dataset(self, index):
image, boxes, classes, image_id, shape = self.ds[index]
return image, boxes, classes, image_id, shape

def __getitem__(self, index):
image, boxes, classes, image_id, original_image_size = self.load_from_dataset(
index
)

if self.transforms is not None:
transformed = self.transforms(image=image, bboxes=boxes, labels=classes)
image = transformed["image"]
boxes = np.array(transformed["bboxes"])
classes = np.array(transformed["labels"])

image = image / 255 # 0 - 1 range

if len(boxes) != 0:
# filter boxes with 0 area in any dimension
valid_boxes = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
boxes = boxes[valid_boxes]
classes = classes[valid_boxes]

boxes = torchvision.ops.box_convert(
torch.as_tensor(boxes, dtype=torch.float32), "xyxy", "cxcywh"
)
boxes[:, [1, 3]] /= image.shape[0] # normalized height 0-1
boxes[:, [0, 2]] /= image.shape[1] # normalized width 0-1
classes = np.expand_dims(classes, 1)

labels_out = torch.hstack(
(
torch.zeros((len(boxes), 1)),
torch.as_tensor(classes, dtype=torch.float32),
boxes,
)
)
else:
labels_out = torch.zeros((0, 6))

try:
if len(image_id) > 0:
image_id_tensor = torch.as_tensor([])

except TypeError:
image_id_tensor = torch.as_tensor(image_id)

return (
torch.as_tensor(image.transpose(2, 0, 1), dtype=torch.float32),
labels_out,
image_id_tensor,
torch.as_tensor(original_image_size),
)

Let’s wrap our data adaptor using this dataset and inspect some of the outputs:

As we haven’t defined any transforms, the output is largely the same, with the main exception being that the boxes are now in normalized cxcywh format and all of our outputs have been converted into tensors. Note that cx, cy stands for center x and y and it means that the coordinates correspond to the centre of the box.

One thing to note is that our labels take the form [0, class_id, ncx, ncy, nw, nh]. The zero space at the start of the tensor will be utilised by the collate function later on.

Transforms

Now, let’s define some transforms! For this, we shall use the excellent Albumentations library, which provides many options for transforming both images and bounding boxes.

Whilst the transforms that we select will largely be domain specific, here, we shall ‘s define similar transforms to those used in the original implementation.

These are:

  • Resize the image to the given input (multiple of 640) whilst maintaining the aspect ratio
  • If the image is not square, apply padding. For this, we shall follow the paper in using a grey padding, this is an arbitrary choice.

During training:

We can use the following function to create these transforms as demonstrated below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/dataset.py

def create_yolov7_transforms(
image_size=(640, 640),
training=False,
training_transforms=(A.HorizontalFlip(p=0.5),),
):
transforms = [
A.LongestMaxSize(max(image_size)),
A.PadIfNeeded(
image_size[0],
image_size[1],
border_mode=0,
value=(114, 114, 114),
),
]

if training:
transforms.extend(training_transforms)

return A.Compose(
transforms,
bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

Now, let’s re-create our dataset, this time passing the default transforms that will be used during evaluation. For our target image size, we shall use 640 which is the value that the smaller YOLOv7 models were trained on. In general, we can select any multiple of 8 for this.

Using these transforms, we can see that our image has been resized to our target size and padding has been applied. The reason that padding is used is so that we can maintain the aspect ratio of the objects in the images, but have a common size for images in our dataset; enabling us to batch them efficiently!

Now that we have explored how to load and prepare our data, let’s move on to take a look at how we can leverage a pretrained model to make some predictions!

Loading the model

So that we can understand how to interface with the model, let’s load a pretrained checkpoint and use this for inference on some images in our dataset. As this checkpoint was trained on COCO, which contains images of cars, we can assume that the model should perform moderately well on this task out of the box. To see the models that are available, we can import the AVAILABLE_MODELS variable.

Here, we can see that the available models are the architectures defined in the original paper. Let’s create the standard yolov7 model, using the create_yolov7_model function.

Now, let’s take a look at the model’s predictions. The forward pass through the model will return the raw feature maps given by the FPN heads, to convert these into meaningful predictions, we can use the postprocess method.

Inspecting the shape, we can see that the model has made 25,200 predictions! Each prediction has an associated tensor of length 6 — the entries correspond to the bounding box coordinates in xyxy format, a confidence score, and a class index.

Often, object detection models tend to make a lot of similar, overlapping predictions. Whilst there are many ways of dealing with this, in the original paper, the authors used non-maximum-suppression (NMS) to solve this problem. We can apply NMS, as well as a secondary round of confidence thresholding, using the function below. In addition, during postprocessing, we often want to filter our any predictions with a confidence level below a predefined threshold, let’s increase our confidence threshold here.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/trainer.py

def filter_eval_predictions(
predictions: List[Tensor],
confidence_threshold: float = 0.2,
nms_threshold: float = 0.65,
) -> List[Tensor]:
nms_preds = []
for pred in predictions:
pred = pred[pred[:, 4] > confidence_threshold]

nms_idx = torchvision.ops.batched_nms(
boxes=pred[:, :4],
scores=pred[:, 4],
idxs=pred[:, 5],
iou_threshold=nms_threshold,
)
nms_preds.append(pred[nms_idx])

return nms_preds

After applying NMS, we can see that now we only have a single prediction for this image. Let’s visualise how this looks:

We can see that this looks pretty good! The prediction from the model is actually tighter around the car than the ground truth!

Now that we have our prediction, the only thing to note is that the bounding box is relative to the resized image size. To scale our predictions back to the original image size, we can use the following function:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/trainer.py

def scale_bboxes_to_original_image_size(
xyxy_boxes, resized_hw, original_hw, is_padded=True
):
scaled_boxes = xyxy_boxes.clone()
scale_ratio = resized_hw[0] / original_hw[0], resized_hw[1] / original_hw[1]

if is_padded:
# remove padding
pad_scale = min(scale_ratio)
padding = (resized_hw[1] - original_hw[1] * pad_scale) / 2, (
resized_hw[0] - original_hw[0] * pad_scale
) / 2
scaled_boxes[:, [0, 2]] -= padding[0] # x padding
scaled_boxes[:, [1, 3]] -= padding[1] # y padding
scale_ratio = (pad_scale, pad_scale)

scaled_boxes[:, [0, 2]] /= scale_ratio[1]
scaled_boxes[:, [1, 3]] /= scale_ratio[0]

# Clip bounding xyxy bounding boxes to image shape (height, width)
scaled_boxes[:, 0].clamp_(0, original_hw[1]) # x1
scaled_boxes[:, 1].clamp_(0, original_hw[0]) # y1
scaled_boxes[:, 2].clamp_(0, original_hw[1]) # x2
scaled_boxes[:, 3].clamp_(0, original_hw[0]) # y2

return scaled_boxes

Before we can start training, in addition to a model architecture, we need a loss function which will enable us to measure how well our model is performing; in order to be able to update our parameters. Since Object Detection is a difficult problem to teach a model, the loss functions of such models are usually quite complex and YOLOv7 is not an exception. Here, we shall do our best to illustrate the intuitions behind it to facilitate its understanding.

Before we can delve deeper into the actual loss function, let’s cover a few background concepts that we need to understand.

Anchor boxes

One of the main difficulties of object detection is outputting detection boxes. That is, how do we train a model to create a bounding box and localize it correctly in an image?

There are a few different approaches, but the YOLOv7 family is what we call an anchor-based model. In these models, the general philosophy is to first create lots of potential bounding boxes, then select the most promising options to match to our target objects; slightly moving and resizing them as necessary to obtain the best possible fit.

The basic idea is that we draw a grid on top of each image and, at each grid intersection (anchor point), generate candidate boxes (anchor boxes) based on a number of anchor sizes. That is, the same set of boxes is repeated at each anchor point. This way, the task that model has to learn, slightly relocating and resizing these boxes, is simpler than generating boxes from scratch.

An example of anchor boxes generated at a sample of anchor points.

However, one issue with this approach is that our target, ground truth, boxes can range in size — from tiny to huge! Therefore, it is usually not possible to define a single set of anchor sizes that can be matched to all targets. For this reason, anchor-based model architectures usually employ a Feature-Pyramid-Network (FPN) to assist with this; which is the case with YOLOv7.

Feature Pyramid Networks (FPN)

The main idea behind FPNs (introduced in Feature Pyramid Networks for Object Detection) is to leverage the nature of convolutional layers — which reduce the size of the feature space and increase the coverage of each feature in the initial image — to output predictions at different scales¹. FPNs are usually implemented as a stack of convolutional layers, as we can see by inspecting the detection head of our YOLOv7 model.

Whilst we could simply take the outputs of the final layer as predictions, as the deeper convolutional layers implicitly utilise the information from previous layers to learn more high-level features, they do not have access to the information of how to detect the lower-level features contained in earlier layers; this can result in poor performance when detecting smaller objects.

For this reason, a top-down pathway and lateral connections are added to the regular bottom-up pathway (normal flow of a convolution layer). The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. Then, these features are enhanced with features from the bottom-up pathway through the lateral connections. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times¹.

In summary, FPNs provide semantically strong features at multiple scales which make them extremely well suited for object detection. The connections that YOLOv7 implements in its FPN are illustrated in the figure below:

Representation of the YOLOv7 family Feature Proposal Network architecture. Source: YOLOv7 paper.

Here, we can see that we have a “Normal model” and a “Model with auxiliary head”. This is because some of the larger models in the YOLOv7 family use deep supervision when training; that is, they leverage the outputs of deeper layers in the loss in order to try and better learn the task. We shall explore this further later on.

From the image, we can see that each layer in the FPN (also known as each FPN head), has a feature scale that is half the size of the previous one (the scale is the same for each Lead head and its corresponding Aux head). This can be understood as each subsequent FPN head “seeing” object scales twice as big as the previous one. We can leverage that by assigning grids with different strides (grid cell side size), and proportional anchor sizes, to each FPN head.

For instance, the anchor configuration for the basic yolov7 model looks like this:

Illustration of the anchor grid and the different (default) anchor box sizes for each fpn head in the main model in the YOLOv7 family

As we can see, we have anchor box sizes and grids that cover completely different scales: from tiny objects to objects that can occupy the whole image.

Now, that we understand these ideas conceptually, let’s take a look at the FPN outputs that come out of our model, which is what will be used to calculate our loss.

¹ These sections were directly taken from the original FPN paper as we felt that no further explanation was needed.

Breaking down the FPN outputs

Recall that, when we made our predictions earlier, we used the model’s postprocess method to convert the raw FPN outputs into usable bounding boxes. Now that we understand the intuition behind what the FPN is trying to do, let’s inspect these raw outputs.

The outputs of our model are always a List[Tensor], where each component corresponds to a head of the FPN. For models that use deep supervision, the Aux Head outputs come after the Lead Head outputs; there are always the same number of each, ordered so that each Lead Head is paired with its corresponding Aux Head. For the rest, including the one we are using here, only the Lead Head outputs are present.

Inspecting the shape of each FPN output, we can see that each one has the following dimensions:

[n_images, n_anchor_sizes, n_grid_rows, n_grid_cols, n_features]

where:

  • n_images — The number of images in the batch (batch size).
  • n_anchor_sizes – The anchor sizes associated with the head (usually 3).
  • n_grid_rows – The number of anchors vertically, img_height / stride.
  • n_grid_cols – The number of anchors horizontally, img_width / stride.
  • n_features – 5 + num_classes, made up of:
    cx – Horizontal correction for the anchor box centre.
    cy – Vertical correction for the anchor box centre.
    w – Width correction for the anchor box.
    h – Height correction for the anchor box.
    obj_score – Score proportional to the probability of an object being contained inside the anchor box.
    cls_score – One per class, score proportional to the probability of that being the class of the object.

When these outputs are mapped into useful predictions during post-processing, we apply the following operations; a small sketch of these mappings in code is shown after the list:

  • cx, cy : final = 2 * sigmoid(initial) - 0.5
    [(−∞, ∞), (−∞, ∞)] → [(−0.5, 1.5), (−0.5, 1.5)]
    – The model can only move the anchor centre from 0.5 cell behind to 1.5 cells forward. Note that for the loss (i.e., when we train) we use grid coordinates.
  • w, h : final = (2 * sigmoid(initial))**2
    [(−∞, ∞), (−∞, ∞)] → [(0, 4), (0, 4)]
    – The model can make the anchor box arbitrarily smaller but at most 4 times bigger. Larger objects, outside of this range, must be predicted by the next FPN head.
  • obj_score : final = sigmoid(initial)
    (−∞, ∞) → (0, 1)
    – Makes sure the score is mapped to a probability.
  • cls_score : final = sigmoid(initial)
    (−∞, ∞) → (0, 1)
    – Makes sure the score is mapped to a probability.
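To make these mappings concrete, below is a minimal sketch of how the raw values could be transformed; it uses randomly generated values in place of real model outputs, and the tensor shape simply follows the description above.

import torch

# dummy raw output for a single FPN head:
# [n_images, n_anchor_sizes, n_grid_rows, n_grid_cols, n_features]
num_classes = 2
raw_fpn_output = torch.randn(1, 3, 80, 80, 5 + num_classes)

xy_corrections = 2 * torch.sigmoid(raw_fpn_output[..., 0:2]) - 0.5   # each value in (-0.5, 1.5)
wh_multipliers = (2 * torch.sigmoid(raw_fpn_output[..., 2:4])) ** 2  # each value in (0, 4)
obj_scores = torch.sigmoid(raw_fpn_output[..., 4])                   # each value in (0, 1)
cls_scores = torch.sigmoid(raw_fpn_output[..., 5:])                  # each value in (0, 1)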

Center Priors

Now, it is easy to see that if we put 3 anchor boxes in each anchor point of each of the grids, we end up with a lot of boxes: 3*80*80 + 3*40*40 + 3*20*20=25200 for each 640x640px image to be exact! The issue is that most of these predictions are not going to contain an object, which we classify as ‘background’. Depending on the sequence of operations that we need to apply to each prediction, computations can easily stack up and slow down the training!

To make the problem computationally cheaper, the YOLOv7 loss first finds the anchor boxes that are likely to match each target box and treats them differently — these are known as the center prior anchor boxes. This process is applied at each FPN head, for each target box, across all images in the batch at once.

Each anchor — which corresponds to a coordinate in our grid — defines a grid cell, where we consider the anchor to be at the top left of its corresponding cell. Subsequently, each cell (except cells on the border) has 4 adjacent cells (top, bottom, left, right). Each target box, for each FPN head, lies somewhere inside a grid cell. Imagine that we have the following grid, and the centre of a target box is represented by a *:

Based on the way the model is designed and trained, the x and y corrections that it can output are in the range of [-0.5, 1.5] grid cells. Thus, only a subset of the closest anchor boxes will be able to match the target centre. We select some of these anchor boxes to represent the center prior for the target box.

  • For the Lead Heads, we use a fine Center Prior, which is a more targeted selection. This is comprised of 3 anchors per head: the anchor associated with the cell containing the target box centre, alongside the anchors for the 2 closest grid cells to the target box centre. In the diagram, the Center Prior anchors are marked with an X.
Selected centre priors for lead detection heads
  • For the Auxiliary Heads (for models that use deep supervision), we use a coarse Center Prior, which is a less targeted selection. This is comprised of 5 anchors per head: the anchor of the cell containing the target box centre, alongside all 4 adjacent grid cells.
Selected centre priors for auxiliary detection heads

The reasoning behind this fine and coarse distinction is that the learning ability of the Auxiliary Heads is lower than that of the Lead Heads, because the Lead Heads are deeper in the network. Thus, we avoid restricting too much where the Auxiliary Heads can learn from, to make sure we do not lose valuable information.

Similarly to the coordinate corrections, the model can only apply a multiplicative modifier to the width and height of each anchor box in the interval [0, 4]. This means that, at most, it can make the sides of the anchor boxes 4 times bigger. Therefore, from the anchor boxes selected as Center Prior, we filter those that are either 4 times bigger or smaller than the target box.

In summary, the Center Prior is comprised of the anchor boxes whose anchor is close enough to the target box centre and whose sides are not too far off from the target box sides.

Optimal Transport Assignment

One of the difficulties when evaluating object detection models is being able to match predicted boxes to target boxes in order to quantify if the model is doing a good job or not.

The simplest approach is to define an Intersection over Union (IoU) threshold and decide based on that. While this generally works, it becomes problematic when there are occlusions, ambiguity or when multiple objects are very close together. Optimal Transport Assignment (OTA) aims to solve some of these problems by considering label assignment as a global optimization problem for each image.

The main intuition is to consider each target box a supplier of k positive label assignments and each predicted box a demander of either one positive label assignment or one background assignment; k is dynamic and depends on each target box. Transporting one positive label assignment from a target box to a predicted box has a cost, based on classification and regression. The goal is then to find a transportation plan (label assignment) that minimizes the total cost over the image.

This can be done using an off-the-shelf solver, but YOLOv7 implements simOTA (introduced in the YOLOX paper), a simplified version of the OTA problem. With the goal of reducing the computational cost of label assignment, it assigns, for each target, the k predicted boxes with the lowest transportation cost, instead of solving the global problem. The Center Prior boxes are used as candidates for this process.

This helps us to further filter the amount of model outputs that can potentially be matched to a ground truth target.

YOLOv7 Loss algorithm

Now that we have introduced the most complicated pieces used in the YOLOv7 loss calculation, we can break down the algorithm used into the following steps:

  1. For each FPN head (or each FPN head and Aux FPN head pair if Aux heads used):
  • Find the Center Prior anchor boxes.
  • Refine the candidate selection through the simOTA algorithm; the lead FPN heads are always used for this step, even when auxiliary heads are present.
  • Obtain the objectness loss score using Binary Cross Entropy Loss between the predicted objectness probability and the Complete Intersection over Union (CIoU) with the matched target as ground truth. If there are no matches, this is 0.
  • If there are any selected anchor box candidates, also calculate (otherwise they are just 0):
    – The box (or regression) loss, defined as the mean(1 - CIoU) between all candidate anchor boxes and their matched target.
    – The classification loss, using Binary Cross Entropy Loss between the predicted class probabilities for each anchor box and a one-hot encoded vector of the true class of the matched target.
  • If model uses auxiliary heads, add each component obtained from the aux head to the corresponding main loss component (i.e., x = x + aux_wt*aux_x). The contribution weight (aux_wt) is defined by a predefined hyperparameter.
  • Multiply the objectness loss by the corresponding FPN head weight (predefined hyperparameter).

2. Multiply each loss component (objectness, classification, regression) by their contribution weight (predefined hyperparameter).

3. Sum the already weighted loss components.

4. Multiply the final loss value by the batch size.

As a technical detail, the loss reported during evaluation is made computationally cheaper by skipping simOTA and never using the auxiliary heads, even for the models that feature deep supervision.

Whilst this process contains a lot of complexity, in practice, this is all encapsulated in a single class, which can be created as demonstrated below:
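As a rough sketch, based on the loss factory used in the training scripts later in this article, creating the loss looks like the following; the argument values shown here are illustrative.

from yolov7 import create_yolov7_model
from yolov7.loss_factory import create_yolov7_loss

model = create_yolov7_model(architecture="yolov7", num_classes=1, pretrained=True)

# the loss is created from the model so that it can use the anchor sizes and strides
# associated with each FPN head, along with the image size used during training
loss_func = create_yolov7_loss(model, image_size=640)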

Now that we understand how to use a pretrained model to make predictions, and how our loss function measures the quality of these predictions, let’s look at how we can finetune a model to a custom task. To obtain the level of performance reported in the paper, YOLOv7 was trained using a variety of techniques. However, for our purposes, let’s start with the minimal possible training loop required, before gradually introducing different techniques.

To handle the boilerplate aspects of the training loop, let’s use PyTorch-accelerated. This will enable us to define only the parts of the training loop which are relevant to our use case, without having to manage all of the boilerplate. To do this, we can override parts of the default PyTorch-accelerated Trainer and create a trainer specific to our YOLOv7 model, as demonstrated below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/trainer.py

from pytorch_accelerated import Trainer

class Yolov7Trainer(Trainer):
YOLO7_PADDING_VALUE = -2.0

def __init__(
self,
model,
loss_func,
optimizer,
callbacks,
filter_eval_predictions_fn=None,
):
super().__init__(
model=model, loss_func=loss_func, optimizer=optimizer, callbacks=callbacks
)
self.filter_eval_predictions = filter_eval_predictions_fn

def training_run_start(self):
self.loss_func.to(self.device)

def evaluation_run_start(self):
self.loss_func.to(self.device)

def train_epoch_start(self):
super().train_epoch_start()
self.loss_func.train()

def eval_epoch_start(self):
super().eval_epoch_start()
self.loss_func.eval()

def calculate_train_batch_loss(self, batch) -> dict:
images, labels = batch[0], batch[1]

fpn_heads_outputs = self.model(images)
loss, _ = self.loss_func(
fpn_heads_outputs=fpn_heads_outputs, targets=labels, images=images
)

return {
"loss": loss,
"model_outputs": fpn_heads_outputs,
"batch_size": images.size(0),
}

def calculate_eval_batch_loss(self, batch) -> dict:
with torch.no_grad():
images, labels, image_ids, original_image_sizes = (
batch[0],
batch[1],
batch[2],
batch[3].cpu(),
)
fpn_heads_outputs = self.model(images)
val_loss, _ = self.loss_func(
fpn_heads_outputs=fpn_heads_outputs, targets=labels
)

preds = self.model.postprocess(fpn_heads_outputs, conf_thres=0.001)

if self.filter_eval_predictions is not None:
preds = self.filter_eval_predictions(preds)

resized_image_sizes = torch.as_tensor(
images.shape[2:], device=original_image_sizes.device
)[None].repeat(len(preds), 1)

formatted_predictions = self.get_formatted_preds(
image_ids, preds, original_image_sizes, resized_image_sizes
)

gathered_predictions = (
self.gather(formatted_predictions, padding_value=self.YOLO7_PADDING_VALUE)
.detach()
.cpu()
)

return {
"loss": val_loss,
"model_outputs": fpn_heads_outputs,
"predictions": gathered_predictions,
"batch_size": images.size(0),
}

def get_formatted_preds(
self, image_ids, preds, original_image_sizes, resized_image_sizes
):
"""
scale bboxes to original image dimensions, and associate image id with predictions
"""
formatted_preds = []
for i, (image_id, image_preds) in enumerate(zip(image_ids, preds)):
# image_id, x1, y1, x2, y2, score, class_id
formatted_preds.append(
torch.cat(
(
scale_bboxes_to_original_image_size(
image_preds[:, :4],
resized_hw=resized_image_sizes[i],
original_hw=original_image_sizes[i],
is_padded=True,
),
image_preds[:, 4:],
image_id.repeat(image_preds.shape[0])[None].T,
),
1,
)
)

if not formatted_preds:
# if no predictions, create placeholder so that it can be gathered across processes
stacked_preds = torch.tensor(
[self.YOLO7_PADDING_VALUE] * 7, device=self.device
)[None]
else:
stacked_preds = torch.vstack(formatted_preds)

return stacked_preds

Our training step is quite straightforward, with the only modification being that we need to extract the total loss from the dictionary that is returned. For the evaluation step, we first calculate the losses, and then retrieve the detections.

Evaluation logic

To evaluate our model’s performance on this task, we can use Mean Average Precision (mAP), a standard metric for object detection tasks. Perhaps the most widely used (and trusted) implementation of mAP is the one included in the PyCOCOTools package, which is used to evaluate official COCO leaderboard submissions.

However, as this does not have the most intuitive interface, we have created a simple wrapper around it, to make it a little more user-friendly. Additionally, as for many cases outside of the COCO competition leaderboard it can be advantageous to evaluate predictions using a fixed IoU threshold — as opposed to the range of IoUs that is used by default — we have added an option to do this to our evaluator.

To encapsulate our evaluation logic to use during training, let’s create a callback for this; which will be updated at the end of each evaluation step and then calculated at the end of each evaluation epoch.


# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/evaluation/calculate_map_callback.py

from pytorch_accelerated.callbacks import TrainerCallback

class CalculateMeanAveragePrecisionCallback(TrainerCallback):
"""
A callback which accumulates predictions made during an epoch and uses these to calculate the Mean Average Precision
from the given targets.

.. Note:: If using distributed training or evaluation, this callback assumes that predictions have been gathered
from all processes during the evaluation step of the main training loop.
"""

def __init__(
self,
targets_json,
iou_threshold=None,
save_predictions_output_dir_path=None,
verbose=False,
):
"""
:param targets_json: a COCO-formatted dictionary with the keys "images", "categories" and "annotations"
:param iou_threshold: If set, the IoU threshold at which mAP will be calculated. Otherwise, the COCO default range of IoU thresholds will be used.
:param save_predictions_output_dir_path: If provided, the path to which the accumulated predictions will be saved, in coco json format.
:param verbose: If True, display the output provided by pycocotools, containing the average precision and recall across a range of box sizes.
"""
self.evaluator = COCOMeanAveragePrecision(iou_threshold)
self.targets_json = targets_json
self.verbose = verbose
self.save_predictions_path = (
Path(save_predictions_output_dir_path)
if save_predictions_output_dir_path is not None
else None
)

self.eval_predictions = []
self.image_ids = set()

def on_eval_step_end(self, trainer, batch, batch_output, **kwargs):
predictions = batch_output["predictions"]
if len(predictions) > 0:
self._update(predictions)

def on_eval_epoch_end(self, trainer, **kwargs):
preds_df = pd.DataFrame(
self.eval_predictions,
columns=[
XMIN_COL,
YMIN_COL,
XMAX_COL,
YMAX_COL,
SCORE_COL,
CLASS_ID_COL,
IMAGE_ID_COL,
],
)

predictions_json = self.evaluator.create_predictions_coco_json_from_df(preds_df)
self._save_predictions(trainer, predictions_json)

if self.verbose and trainer.run_config.is_local_process_zero:
self.evaluator.verbose = True

map_ = self.evaluator.compute(self.targets_json, predictions_json)
trainer.run_history.update_metric(f"map", map_)

self._reset()

@classmethod
def create_from_targets_df(
cls,
targets_df,
image_ids,
iou_threshold=None,
save_predictions_output_dir_path=None,
verbose=False,
):
"""
Create an instance of :class:`CalculateMeanAveragePrecisionCallback` from a dataframe containing the ground
truth targets and a collections of all image ids in the dataset.

:param targets_df: DF w/ cols: ["image_id", "xmin", "ymin", "xmax", "ymax", "class_id"]
:param image_ids: A collection of all image ids in the dataset, including those without annotations.
:param iou_threshold: If set, the IoU threshold at which mAP will be calculated. Otherwise, the COCO default range of IoU thresholds will be used.
:param save_predictions_output_dir_path: If provided, the path to which the accumulated predictions will be saved, in coco json format.
:param verbose: If True, display the output provided by pycocotools, containing the average precision and recall across a range of box sizes.
:return: An instance of :class:`CalculateMeanAveragePrecisionCallback`
"""

targets_json = COCOMeanAveragePrecision.create_targets_coco_json_from_df(
targets_df, image_ids
)

return cls(
targets_json=targets_json,
iou_threshold=iou_threshold,
save_predictions_output_dir_path=save_predictions_output_dir_path,
verbose=verbose,
)

def _remove_seen(self, labels):
"""
Remove any image id that has already been seen during the evaluation epoch. This can arise when performing
distributed evaluation on a dataset where the batch size does not evenly divide the number of samples.

"""
image_ids = labels[:, -1].tolist()

# remove any image_idx that has already been seen
# this can arise from distributed training where batch size does not evenly divide dataset
seen_id_mask = torch.as_tensor(
[False if idx not in self.image_ids else True for idx in image_ids]
)

if seen_id_mask.all():
# no update required as all ids already seen this pass
return []
elif seen_id_mask.any(): # at least one True
# remove predictions for images already seen this pass
labels = labels[~seen_id_mask]

return labels

def _update(self, predictions):
filtered_predictions = self._remove_seen(predictions)

if len(filtered_predictions) > 0:
self.eval_predictions.extend(filtered_predictions.tolist())
updated_ids = filtered_predictions[:, -1].unique().tolist()
self.image_ids.update(updated_ids)

def _reset(self):
self.image_ids = set()
self.eval_predictions = []

def _save_predictions(self, trainer, predictions_json):
if (
self.save_predictions_path is not None
and trainer.run_config.is_world_process_zero
):
with open(self.save_predictions_path / "predictions.json", "w") as f:
json.dump(predictions_json, f)

Now, all that we have to do is plug our callback into our Trainer, and our mAP will be recorded at each epoch!

Run training

Now, let’s put everything we have seen so far into a simple training script. Here, we have used a simple training recipe that works well for a variety of tasks and have carried out minimal hyperparameter tuning.

As we noticed that the ground truth boxes for this dataset can contain quite a bit of space around the object, we decided to set the IoU threshold used for evaluation quite low; as it is likely that the boxes produced by the model will be tighter around the object.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/minimal_finetune_cars.py

import os
import random
from functools import partial
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from func_to_script import script
from PIL import Image
from pytorch_accelerated.callbacks import (
EarlyStoppingCallback,
SaveBestModelCallback,
get_default_callbacks,
)
from pytorch_accelerated.schedulers import CosineLrScheduler
from torch.utils.data import Dataset

from yolov7 import create_yolov7_model
from yolov7.dataset import Yolov7Dataset, create_yolov7_transforms, yolov7_collate_fn
from yolov7.evaluation import CalculateMeanAveragePrecisionCallback
from yolov7.loss_factory import create_yolov7_loss
from yolov7.trainer import Yolov7Trainer, filter_eval_predictions

def load_cars_df(annotations_file_path, images_path):
all_images = sorted(set([p.parts[-1] for p in images_path.iterdir()]))
image_id_to_image = {i: im for i, im in enumerate(all_images)}
image_to_image_id = {v: k for k, v, in image_id_to_image.items()}

annotations_df = pd.read_csv(annotations_file_path)
annotations_df.loc[:, "class_name"] = "car"
annotations_df.loc[:, "has_annotation"] = True

# add 100 empty images to the dataset
empty_images = sorted(set(all_images) - set(annotations_df.image.unique()))
non_annotated_df = pd.DataFrame(list(empty_images)[:100], columns=["image"])
non_annotated_df.loc[:, "has_annotation"] = False
non_annotated_df.loc[:, "class_name"] = "background"

df = pd.concat((annotations_df, non_annotated_df))

class_id_to_label = dict(
enumerate(df.query("has_annotation == True").class_name.unique())
)
class_label_to_id = {v: k for k, v in class_id_to_label.items()}

df["image_id"] = df.image.map(image_to_image_id)
df["class_id"] = df.class_name.map(class_label_to_id)

file_names = tuple(df.image.unique())
random.seed(42)
validation_files = set(random.sample(file_names, int(len(df) * 0.2)))
train_df = df[~df.image.isin(validation_files)]
valid_df = df[df.image.isin(validation_files)]

lookups = {
"image_id_to_image": image_id_to_image,
"image_to_image_id": image_to_image_id,
"class_id_to_label": class_id_to_label,
"class_label_to_id": class_label_to_id,
}
return train_df, valid_df, lookups

class CarsDatasetAdaptor(Dataset):
def __init__(
self,
images_dir_path,
annotations_dataframe,
transforms=None,
):
self.images_dir_path = Path(images_dir_path)
self.annotations_df = annotations_dataframe
self.transforms = transforms

self.image_idx_to_image_id = {
idx: image_id
for idx, image_id in enumerate(self.annotations_df.image_id.unique())
}
self.image_id_to_image_idx = {
v: k for k, v, in self.image_idx_to_image_id.items()
}

def __len__(self) -> int:
return len(self.image_idx_to_image_id)

def __getitem__(self, index):
image_id = self.image_idx_to_image_id[index]
image_info = self.annotations_df[self.annotations_df.image_id == image_id]
file_name = image_info.image.values[0]
assert image_id == image_info.image_id.values[0]

image = Image.open(self.images_dir_path / file_name).convert("RGB")
image = np.array(image)

image_hw = image.shape[:2]

if image_info.has_annotation.any():
xyxy_bboxes = image_info[["xmin", "ymin", "xmax", "ymax"]].values
class_ids = image_info["class_id"].values
else:
xyxy_bboxes = np.array([])
class_ids = np.array([])

if self.transforms is not None:
transformed = self.transforms(
image=image, bboxes=xyxy_bboxes, labels=class_ids
)
image = transformed["image"]
xyxy_bboxes = np.array(transformed["bboxes"])
class_ids = np.array(transformed["labels"])

return image, xyxy_bboxes, class_ids, image_id, image_hw

DATA_PATH = Path("/".join(Path(__file__).absolute().parts[:-2])) / "data/cars"

@script
def main(
data_path: str = DATA_PATH,
image_size: int = 640,
pretrained: bool = True,
num_epochs: int = 30,
batch_size: int = 8,
):

# Load data
data_path = Path(data_path)
images_path = data_path / "training_images"
annotations_file_path = data_path / "annotations.csv"

train_df, valid_df, lookups = load_cars_df(annotations_file_path, images_path)
num_classes = 1

# Create datasets
train_ds = CarsDatasetAdaptor(
images_path,
train_df,
)
eval_ds = CarsDatasetAdaptor(images_path, valid_df)

train_yds = Yolov7Dataset(
train_ds,
create_yolov7_transforms(training=True, image_size=(image_size, image_size)),
)
eval_yds = Yolov7Dataset(
eval_ds,
create_yolov7_transforms(training=False, image_size=(image_size, image_size)),
)

# Create model, loss function and optimizer
model = create_yolov7_model(
architecture="yolov7", num_classes=num_classes, pretrained=pretrained
)

loss_func = create_yolov7_loss(model, image_size=image_size)

optimizer = torch.optim.SGD(
model.parameters(), lr=0.01, momentum=0.9, nesterov=True
)
# Create trainer and train
trainer = Yolov7Trainer(
model=model,
optimizer=optimizer,
loss_func=loss_func,
filter_eval_predictions_fn=partial(
filter_eval_predictions, confidence_threshold=0.01, nms_threshold=0.3
),
callbacks=[
CalculateMeanAveragePrecisionCallback.create_from_targets_df(
targets_df=valid_df.query("has_annotation == True")[
["image_id", "xmin", "ymin", "xmax", "ymax", "class_id"]
],
image_ids=set(valid_df.image_id.unique()),
iou_threshold=0.2,
),
SaveBestModelCallback(watch_metric="map", greater_is_better=True),
EarlyStoppingCallback(
early_stopping_patience=3,
watch_metric="map",
greater_is_better=True,
early_stopping_threshold=0.001,
),
*get_default_callbacks(progress_bar=True),
],
)

trainer.train(
num_epochs=num_epochs,
train_dataset=train_yds,
eval_dataset=eval_yds,
per_device_batch_size=batch_size,
create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(
num_warmup_epochs=5,
num_cooldown_epochs=5,
k_decay=2,
),
collate_fn=yolov7_collate_fn,
)

if __name__ == "__main__":
main()

Launching training as described here, using a single V100 GPU with fp16 enabled, after 3 epochs we obtained a mAP of 0.995, which suggests that the model has learned the task almost perfectly!

However, whilst this is a great result, it is largely expected, as COCO contains images of cars.

Now that we have successfully finetuned a pretrained YOLOv7 model, let’s explore how we can train the model from scratch. Whilst this could be done using numerous different training recipes, let’s take a look at some of the key techniques that were used by the authors when training on COCO.

Mosaic Augmentation

Data augmentation is an important technique in deep learning where we synthetically expand our dataset by applying a series of augmentations to our data during training. Whilst common transforms in object detection tend to be augmentations such as flips and rotations, the YOLO authors take a slightly different approach by applying Mosaic augmentation; which was previously used by YOLOv4, YOLOv5 and YOLOX models.

The objective of mosaic augmentation is to overcome the observation that object detection models tend to focus on detecting items towards the centre of the image. The key idea is that, if we stitch multiple images together, the objects are likely to be in positions and contexts that are not normally observed in images seen in the dataset; which should force the features learned by the model to be more position invariant.

Whilst there are a couple of different implementations of mosaic, each with minor differences, here we shall present an implementation that combines four different images. This implementation has worked well for us in the past, with a variety of object detection models.

Although there is no requirement to resize images prior to creating a mosaic, it does result in the created mosaics being similar sizes. Therefore, we shall take that approach here. We can do this by creating a simple resizing transform and adding it to our dataset adaptor.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/dataset.py

import albumentations as A

def create_base_transforms(target_image_size):
    return A.Compose(
        [
            A.LongestMaxSize(target_image_size),
        ],
        bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
    )

To apply our augmentations, once again, we are using Albumentations, which supports many object detection transforms.

Data augmentations are usually implemented as functions, which are passed to a PyTorch dataset and applied shortly after loading an image; however, as mosaic requires loading multiple images from the dataset, this approach will not work here. We decided to implement mosaic as a dataset wrapper class, to cleanly encapsulate this logic. We can import and use this as demonstrated below:
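As a sketch, assuming the CarsDatasetAdaptor, images_path, train_df and image_size variables defined in the training scripts in this article, wrapping the dataset could look like this; no post-mosaic transforms are passed at this point, so the raw mosaics are returned, and the unpacking of the returned sample assumes the wrapper mirrors the adaptor's output format.

from yolov7.dataset import create_base_transforms
from yolov7.mosaic import MosaicMixupDataset

# resize each image before it is placed into a mosaic
train_ds = CarsDatasetAdaptor(
    images_path, train_df, transforms=create_base_transforms(image_size)
)

# wrap the dataset adaptor so that mosaics are created on the fly
mds = MosaicMixupDataset(train_ds)

# indexing the wrapper returns samples in the same format as the underlying adaptor,
# but with the image replaced by a mosaic
mosaic_image, xyxy_bboxes, class_ids, image_id, image_hw = mds[0]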

Let’s take a look at some examples of the types of images that are produced. As we haven’t (yet) passed any resizing transforms to our mosaic dataset, these images are quite large.

Notice that, whilst the mosaic images appear quite different, they were all created using the same index, and are therefore based on the same image! When a mosaic is created, it randomly selects 3 other images from the dataset and places them in random positions; this results in different-looking images being produced each time. Therefore, applying this augmentation does break down our concept of a training epoch — where each image in the dataset is seen exactly once — as images can be seen multiple times!

As a result, when training with mosaic, our strategy is not to think too much about the number of epochs and train the model for as long as possible until it stops converging. After all, the notion of an epoch is only really useful to help us track our training — the model just sees a continuous stream of images either way!

Mixup Augmentation

Mosaic augmentation is often applied alongside another transform — Mixup. To visualise what this does, let’s disable mosaic for the moment and enable mixup on its own; we can do this as demonstrated below:
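As a purely illustrative sketch, assuming that the wrapper exposes an apply_mosaic_probability argument alongside the apply_mixup_probability argument used in the full training script later in this article (the repository should be consulted for the wrapper's actual interface):

# both argument names are assumptions based on how the wrapper is used elsewhere in this article
mixup_only_ds = MosaicMixupDataset(
    train_ds,
    apply_mosaic_probability=0.0,
    apply_mixup_probability=1.0,
)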

Interesting! We can see that it has combined two images together, which results in some ‘ghostly’ looking cars and backgrounds! Now, let’s enable both transforms and inspect our outputs.

Wow! There are quite a lot of cars to detect in our resulting image, in many different positions — which will definitely be a challenge for the model! Notice that when we apply mosaic and mixup together, a single image is mixed with a mosaic.

Post-mosaic affine transforms

As we noted earlier, the mosaics that we are creating are significantly bigger than the image sizes we will use to train our model, so we will need to do some sort of resizing here. The simplest way would be to simply apply a resize transform after creating the mosaic.

Whilst this would work, it is likely to result in some very small objects, as we are essentially resizing four images to the size of one – which is likely to become a problem in domains that already contain very small bounding boxes! Additionally, each of our mosaics is structurally quite similar, with an image in each quadrant. Recalling that our aim was to make the model more robust to position changes, this may not actually help that much; as the model is likely just to start looking in the middle of each quadrant.

To overcome this, one approach that we can take is to simply take a random crop from our mosaic. This will still provide the variability in positioning whilst preserving the size and aspect ratio of the target objects. At this point, it may also be a good opportunity to add in some other transforms such as scaling and rotation to add even more variability.

The exact transforms, and magnitudes, used will be heavily dependent on the images that you are using, so we would recommend experimenting with these settings first — to ensure that all objects are still visible and recognisable — prior to training a model!

We can define the transforms to apply to our mosaic images as demonstrated below. Here, we have chosen a selection of affine transforms — in sensible ranges for our target data — followed by a random crop. Following the original implementation, we are also applying mixup less frequently than mosaic.

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/mosaic.py

def create_post_mosaic_transform(
    output_height,
    output_width,
    pad_colour=(0, 0, 0),
    rotation_range=(-10, 10),
    shear_range=(-10, 10),
    translation_percent_range=(-0.2, 0.2),
    scale_range=(0.08, 1.0),
    apply_prob=0.8,
):
    return A.Compose(
        [
            A.Affine(
                cval=pad_colour,
                rotate=rotation_range,
                shear=shear_range,
                translate_percent=translation_percent_range,
                scale=None,
                keep_ratio=True,
                p=apply_prob,
            ),
            A.HorizontalFlip(),
            A.RandomResizedCrop(height=output_height, width=output_width, scale=scale_range),
        ],
        bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"], min_visibility=0.25),
    )

Looking at these images, we can see a huge amount of variation and the images are now the correct size for training. As we have selected a random scale, we can also see that not every image looks like a mosaic, so these outputs should not be too dissimilar to the images that the model will see during inference. If more extreme augmentations are used — such that there is a notable difference between the training and inference images — it can be advantageous to disable these shortly before the end of training.

In the official implementation, the authors use mosaics of both 4 and 9 images during training. However, inspecting the outputs of these augmentations when combined with scaling and cropping, in many cases the outputs looked very similar, so we have chosen to omit this here.

Applying weight decay to parameter groups

In our simple example earlier, we created our optimizer so that it would optimize all of the parameters of our model. However, if we would like to follow the authors in introducing weight decay regularization, this may not be optimal; following the guidance given in Bag of Tricks for Image Classification with Convolutional Neural Networks, weight decay should be applied only to convolutional and fully connected layers.

To implement this in PyTorch, we will need to create two distinct parameter groups to be optimized; one containing our convolutional weights and the other with the remaining parameters. The model provides a method to do this for us and, inspecting its definition, we can see that it is a simple filter operation:
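The usage, taken from the full training script later in this article, is simply param_groups = model.get_parameter_groups(). As a rough illustration of what such a filter could look like (the actual method on the YOLOv7 model may differ in its details), consider the following sketch:

import torch.nn as nn

def get_parameter_groups(model):
    # split the trainable parameters into convolutional / fully connected weights,
    # which will receive weight decay, and everything else, which will not
    conv_weights, other_params = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            if isinstance(module, (nn.Conv2d, nn.Linear)) and name == "weight":
                conv_weights.append(param)
            else:
                other_params.append(param)
    return {"conv_weights": conv_weights, "other_params": other_params}

param_groups = get_parameter_groups(model)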

Now we can simply pass these to the optimizer:

optimizer = torch.optim.SGD(
    param_groups["other_params"], lr=0.01, momentum=0.937, nesterov=True
)

optimizer.add_param_group(
    {"params": param_groups["conv_weights"], "weight_decay": weight_decay}
)

Learning rate scheduling

When training neural networks, we often wish to adjust the value of our learning rate during training; this is done using a learning rate scheduler. Whilst there are many popular schedules, the authors opt for a cosine learning rate schedule — with a linear warmup at the start of training. This has the following shape:

Cosine learning rate schedule (with warmup)

In practice, we find that a period of warmup, and cooldown — where the learning rate is held at its minimum value — is often a good strategy for this scheduler. Additionally, the scheduler implementation included in PyTorch-accelerated supports a k-decay argument, which can be used to adjust how aggressive the annealing is.

For this problem, we found that using k-decay to hold the learning rate at a higher value for longer worked quite well. This schedule, along with warmup and cooldown epochs, can be seen below:

Cosine learning rate schedule (with warmup), set with k_decay = 2
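For reference, a schedule like the one pictured can be created with PyTorch-accelerated as follows, mirroring the values used in the training scripts in this article; the resulting factory function is passed to the trainer's train method rather than being instantiated directly.

from pytorch_accelerated.schedulers import CosineLrScheduler

create_scheduler_fn = CosineLrScheduler.create_scheduler_fn(
    num_warmup_epochs=5,
    num_cooldown_epochs=5,
    k_decay=2,
)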

Gradient accumulation, scaling weight decay

When training a model, the batch size we use is often determined by our hardware; as we want to try to maximise the amount of data that we can put on the GPU. However, some considerations must be made:

  • For very small batch sizes, we are unable to approximate the gradients of the whole dataset. This can result in unstable training.
  • Modifying the batch size can result in different settings being needed for hyperparameters such as the learning rate and weight decay. This can make it difficult to find a consistent set of hyperparameters.

To overcome this, the authors use a technique called gradient accumulation, in which the gradients from multiple steps are accumulated to simulate a bigger batch size. For example, suppose that the maximum batch size that we can fit on our GPU is 8. Instead of updating the parameters of the model at the end of each batch, we can save the gradient values, proceed to the next batch and add the new gradients. After a designated number of steps, we then perform the update; if we set our number of steps to 4, this is roughly equivalent to using a batch size of 32!

In PyTorch, this could be performed manually as follows:

num_accumulation_steps = 4

# loop through enumerated batches
for step, (inputs, labels) in enumerate(data_loader):

    model_outputs = model(inputs)
    loss = loss_fn(model_outputs, labels)

    # normalize loss to account for batch accumulation
    loss = loss / num_accumulation_steps

    # calculate gradients, these are summed automatically
    loss.backward()

    if ((step + 1) % num_accumulation_steps == 0) or (step + 1 == len(data_loader)):
        # perform weight update
        optimizer.step()
        optimizer.zero_grad()

In the original YOLOv7 implementation, the number of gradient accumulation steps is selected so that the total batch size (across all processes) is at least 64; which mitigates both of the issues discussed earlier. Additionally, the authors scale the weight decay used based on the batch size in the following way:

nominal_batch_size = 64
num_accumulate_steps = max(round(nominal_batch_size / total_batch_size), 1)

base_weight_decay = 0.0005
scaled_weight_decay = (
    base_weight_decay * total_batch_size * num_accumulate_steps / nominal_batch_size
)

We can visualise these relationships below:

Looking first at the number of accumulation steps, we can see that it decreases as the batch size grows, until we hit our nominal batch size; after that, gradient accumulation is no longer needed.

Now looking at the amount of weight decay used, we can see that it is held at the base value until the nominal batch size is reached, and then is scaled linearly with the batch size; with more weight decay applied as the batch size gets bigger.

Model EMA

When training a model, it can be beneficial to set the values for the model weights by taking a moving average of the parameters that were observed across the entire training run, as opposed to using the parameters obtained after the last incremental update. This is often done by maintaining an exponential moving average (EMA) of the model parameters; in practice, this usually means maintaining another copy of the model to store these averaged weights. However, rather than updating all of the parameters of this model after every update step, we set these parameters using a linear combination of the existing parameter values and the updated values.

This is done using the following formula:

updated_EMA_model_weights = decay * EMA_model_weights + (1. - decay) * updated_model_weights

where the decay is a parameter that we set. For example, if we set decay=0.99, we have:

updated_EMA_model_weights = 0.99 * EMA_model_weights + 0.01 * updated_model_weights

which we can see is keeping 99% of the existing state and only 1% of the new state!

To understand why this may be beneficial, let’s consider the case that our model, in an early stage of training, performs exceptionally poorly on a batch of data. This may result in a large update to our parameters, overcompensating for the high loss obtained, which will be detrimental for the upcoming batches. By incorporating only a small percentage of the latest parameters, large updates will be ‘smoothed’, and have less of an overall impact on the model’s weights. These averaged parameters can sometimes produce significantly better results during evaluation, and this technique has been employed in several training schemes for popular models such as MNASNet, MobileNet-V3 and EfficientNet; using the implementation included in TensorFlow.

The approach to EMA taken by the YOLOv7 authors is slightly different to other implementations as, instead of using a fixed decay, the amount of decay changes based on the number of updates that have been made. We can extend the ModelEMA class included with PyTorch-accelerated to implement this behaviour as defined below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/utils.py

import math

from pytorch_accelerated.utils import ModelEma

class Yolov7ModelEma(ModelEma):
    def __init__(self, model, decay=0.9999):
        super().__init__(model, decay)
        self.num_updates = 0
        self.decay_fn = lambda x: decay * (
            1 - math.exp(-x / 2000)
        )  # decay exponential ramp (to help early epochs)
        self.decay = self.decay_fn(self.num_updates)

    def update(self, model):
        super().update(model)
        self.num_updates += 1
        self.decay = self.decay_fn(self.num_updates)

Here, we can see that the decay is set by calling a function after each update. Let’s visualise how this looks:
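As a quick sketch, the decay ramp can be reproduced outside of the class by evaluating the same function over a range of update counts and plotting the results:

import math
import matplotlib.pyplot as plt

decay = 0.9999
decay_fn = lambda num_updates: decay * (1 - math.exp(-num_updates / 2000))

updates = list(range(300))
plt.plot(updates, [decay_fn(u) for u in updates])
plt.xlabel("number of EMA updates")
plt.ylabel("decay")
plt.show()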

From this, we can see that the amount of decay increases with the number of updates, which is once per epoch.

Recalling the formulas above, this means that, initially, we favour using the updated model weights rather than a historical average. However, as training progresses, we start to incorporate more of the averaged weights from previous epochs. This is an interesting departure from the usual usage of this technique, and the ramp is designed to help the EMA model converge more quickly during the earlier epochs.

Selecting appropriate anchor box sizes

Recalling the earlier discussion on anchor boxes, and how these play an important part on how YOLOv7 is able to detect objects, let’s look at how we can evaluate whether our chosen anchor boxes are suitable for our problem and, if not, find some sensible choices for our dataset.

The approach here is largely adapted from the autoanchor approach used in YOLOv5, which was also used with YOLOv7.

Evaluating current anchor boxes

The simplest approach would be to simply use the same anchor boxes as used for COCO, which are already bundled with the defined architectures.

Here we can see that we have 3 groups, one for each layer of the feature pyramid network. The numbers correspond to our anchor sizes, the width and height of the anchor boxes that will be generated.

Recall that the Feature Pyramid Network (FPN) has three outputs, and each output’s role is to detect objects according to their scale.

For example:

  • P3/8 is for detecting smaller objects.
  • P4/16 is for detecting medium objects.
  • P5/32 is for detecting bigger objects.

With this in mind, we need to set our anchor sizes accordingly for each layer.

To evaluate our current anchor boxes, we can calculate the best possible recall, which would occur if the model was able to successfully match an appropriate anchor box with a ground truth.

Find and Resize ground truth bounding boxes

To evaluate our anchor boxes, we first need some knowledge of the shapes and sizes of the objects in our dataset. However, before we can evaluate, we need to resize the width and height of our ground truth boxes based on the size of the images that we will train on — for this architecture, this is recommended to be 640.

Let’s start by finding the width and height of all ground truth boxes in the training set. We can calculate these as demonstrated below:
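As a minimal sketch, assuming the train_df DataFrame created by load_cars_df earlier, which contains the has_annotation flag and the xmin, ymin, xmax and ymax columns for every annotated box:

import numpy as np

annotated_df = train_df.query("has_annotation == True")

# raw width and height of every ground truth box, shape [N, 2]
gt_wh = np.stack(
    [
        (annotated_df.xmax - annotated_df.xmin).to_numpy(),
        (annotated_df.ymax - annotated_df.ymin).to_numpy(),
    ],
    axis=1,
)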

Next, we will need the height and width of our images. Sometimes, we have this information ahead of time, in which case we can use it directly. Otherwise, we can read it from the image files and merge it into our existing DataFrame, as sketched below:
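In this sketch, the image_width and image_height column names are our own; the only requirement is that, after the merge, the image size array lines up row-for-row with the ground truth width and height array.

from PIL import Image
import pandas as pd

# look up the size of each image on disk; PIL reads this from the file header
size_records = []
for file_name in annotated_df.image.unique():
    with Image.open(images_path / file_name) as im:
        width, height = im.size
    size_records.append({"image": file_name, "image_width": width, "image_height": height})

image_sizes_df = pd.DataFrame(size_records)
annotated_df = annotated_df.merge(image_sizes_df, on="image", how="left")

# taking both arrays from the same merged DataFrame keeps the rows aligned
gt_wh = annotated_df[["xmax", "ymax"]].to_numpy() - annotated_df[["xmin", "ymin"]].to_numpy()
image_sizes = annotated_df[["image_width", "image_height"]].to_numpy()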

Now, we can use this information to get the resized widths and heights of our ground truth targets, with respect to our target image size. To preserve the aspect ratios of the objects in our images, the recommended approach to resizing is to scale the image so that the longest size is equal to our target size. We can do this using the function below:


# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

def calculate_resized_gt_wh(gt_wh, image_sizes, target_image_size=640):
    """
    Given an array of bounding box widths and heights, and their corresponding image sizes,
    resize these relative to the specified target image size.

    This function assumes that resizing will be performed by scaling the image such that the longest
    side is equal to the given target image size.

    :param gt_wh: an array of shape [N, 2] containing the raw width and height of each box.
    :param image_sizes: an array of shape [N, 2] or [1, 2] containing the width and height of the image corresponding to each box.
    :param target_image_size: the size of the images that will be used during training.

    """
    normalized_gt_wh = gt_wh / image_sizes
    target_image_sizes = (
        target_image_size * image_sizes / image_sizes.max(1, keepdims=True)
    )

    resized_gt_wh = target_image_sizes * normalized_gt_wh

    tiny_boxes_exist = (resized_gt_wh < 3).any(1).sum()
    if tiny_boxes_exist:
        print(
            f"""WARNING: Extremely small objects found.
            {tiny_boxes_exist} of {len(resized_gt_wh)} labels are < 3 pixels in size. These will be removed
            """
        )
        resized_gt_wh = resized_gt_wh[(resized_gt_wh >= 2.0).any(1)]

    return resized_gt_wh

Alternatively, as all of our images are the same size in this case, we could simply specify a single image size.

Note that we have also filtered out any boxes that will be incredibly small (less than 3 pixels in either height or width), with respect to the new image size, as these boxes are usually too small to be considered useful!

Calculating Best Possible Recall

Now that we have the width and height of all ground truth boxes in our training set, we can evaluate our current anchor boxes as follows:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

def calculate_best_possible_recall(anchors, gt_wh):
    """
    Given a tensor of anchors and an array of widths and heights for each bounding box in the dataset,
    calculate the best possible recall that can be obtained if every box was matched to an appropriate anchor.

    :param anchors: a tensor of shape [N, 2] representing the width and height of each anchor
    :param gt_wh: a tensor of shape [N, 2] representing the width and height of each ground truth bounding box

    """
    best_anchor_ratio = calculate_best_anchor_ratio(anchors=anchors, gt_wh=gt_wh)
    best_possible_recall = (
        (best_anchor_ratio > 1.0 / LOSS_ANCHOR_MULTIPLE_THRESHOLD).float().mean()
    )

    return best_possible_recall
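As a usage sketch: the anchor widths and heights listed here are the defaults bundled with the yolov7 architecture at the time of writing, shown purely for illustration (in practice they should be read from the model rather than hard-coded), and gt_wh and image_sizes are the arrays computed earlier.

import torch

# default anchor sizes for the 'yolov7' architecture, one row per anchor,
# grouped by FPN head (P3/8, P4/16, P5/32)
current_anchors = torch.tensor(
    [
        [12, 16], [19, 36], [40, 28],
        [36, 75], [76, 55], [72, 146],
        [142, 110], [192, 243], [459, 401],
    ],
    dtype=torch.float32,
)

resized_gt_wh = calculate_resized_gt_wh(gt_wh, image_sizes, target_image_size=640)

best_possible_recall = calculate_best_possible_recall(
    current_anchors, torch.as_tensor(resized_gt_wh, dtype=torch.float32)
)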

From this, we can see that the current anchor boxes are a good fit for this dataset; which makes sense, as the images are quite similar to those in COCO.

How does this work?

At this point, you may be wondering: how exactly do we calculate the best possible recall? To answer this, let’s go through the process manually.

Intuitively, we would like to ensure that at least one anchor can be matched to each ground truth box. Whilst we could do this by framing it as an optimization problem — how do we match each ground truth box with its optimal anchor — this would introduce a lot of complexity for what we are trying to do.

Given an anchor box, we need a simpler way of measuring how well it can be made to fit a ground truth box. Let’s examine one approach that can be taken to do this, starting with the width and height of a single ground truth box.

For each anchor box, we can inspect the ratios of its height and width when compared to the height and width of our ground truth target and use this to understand where the biggest differences are.

As the scale of these ratios will depend on whether the anchor box sides are greater or smaller than the sides of our ground truth box, we can ensure that our magnitudes are in the range [0, 1] by also calculating the reciprocal and taking the minimum ratios for each anchor.

From this, we now have an indication of how well, independently, the width and height of each anchor box ‘fits’ to our ground truth target.

Now, our challenge is how to evaluate the matching of the width and height together!

One way we can approach this is to take the minimum ratio for each anchor, representing the side that worst matches our ground truth.

The reason why we have selected the worst fitting side here is that we know the other side matches our target at least as well as the one selected; we can think of this as the worst case scenario!

Now, let’s select the anchor box which matches the best out of these options, this is simply the largest value.

Out of the worst fitting options, this is our selected match!

Recalling that the loss function only looks to match anchor boxes that are up to 4 times greater or smaller than the size of the ground truth target, we can now verify whether this anchor is within this range and would be considered a successful match.

We can do that as demonstrated below, taking the reciprocal of our loss multiple, to ensure that it is in the same range as our value:

From this, we can see that at least one of our anchors could be successfully matched to our selected ground truth target!

Now that we understand the sequence of steps, we can now apply the same logic to all of our ground truth boxes to see how many matches we can obtain with our current set of anchors:

Now that we have calculated, for each ground truth box, whether it has a match, we can take the mean number of matches to find our best possible recall; in our case, this is 1, as we saw earlier!
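Condensing this walkthrough into code, a sketch of the logic (mirroring what calculate_best_anchor_ratio does, although the library function may differ in its details) could look like this, reusing the anchors and resized box sizes from above:

import torch

LOSS_ANCHOR_MULTIPLE_THRESHOLD = 4.0  # anchors may be at most 4x bigger or smaller than a target

def best_anchor_ratio_sketch(anchors, gt_wh):
    # ratio of each target side to each anchor side: shape [num_targets, num_anchors, 2]
    side_ratios = gt_wh[:, None] / anchors[None]
    # also consider the reciprocal, so that every ratio lies in (0, 1], then take the
    # worst-fitting side for each (target, anchor) pair
    worst_side_fit = torch.min(side_ratios, 1 / side_ratios).min(2).values
    # for each target, keep the anchor whose worst-fitting side fits best
    return worst_side_fit.max(1).values

gt_wh_t = torch.as_tensor(resized_gt_wh, dtype=torch.float32)
matched = best_anchor_ratio_sketch(current_anchors, gt_wh_t) > 1 / LOSS_ANCHOR_MULTIPLE_THRESHOLD
best_possible_recall = matched.float().mean()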

Selecting new anchor boxes

Whilst using the pre-defined anchors may be a good choice for similar datasets, this may not be appropriate for all datasets, for example, those that contain lots of small objects. In these cases, a better approach may be to select entirely new anchors.

Let’s explore how we can do this!

First, let’s define the number of anchors that we need for our architecture.

Now, based on our bounding boxes, we need to define a sensible set of widths and heights for our anchor templates. One way that we can estimate this is by using K-means to cluster the widths and heights of our ground truth boxes, based on the number of anchor sizes that we need. We can then use these centroids as our starting estimates. We can do this using the following function:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

from scipy.cluster.vq import kmeans

def estimate_anchors(num_anchors, gt_wh):
    """
    Given a target number of anchors and an array of widths and heights for each bounding box in the dataset,
    estimate a set of anchors using the centroids from Kmeans clustering.

    :param num_anchors: the number of anchors to return
    :param gt_wh: an array of shape [N, 2] representing the width and height of each ground truth bounding box

    """
    print(f"Running kmeans for {num_anchors} anchors on {len(gt_wh)} points...")
    std_dev = gt_wh.std(0)
    proposed_anchors, _ = kmeans(
        gt_wh / std_dev, num_anchors, iter=30
    )  # divide by std so they are in approx same range
    proposed_anchors *= std_dev

    return proposed_anchors

Here, we can see that we now have a set of anchor templates that we can use as a starting point. As before, let’s calculate our best possible recall using these anchor boxes:
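As a sketch of how this chains together, assuming the resized_gt_wh array computed earlier: we estimate 3 anchor sizes for each of the 3 FPN heads and then score them with the same recall function as before.

import torch

proposed_anchors = estimate_anchors(num_anchors=9, gt_wh=resized_gt_wh)

proposed_recall = calculate_best_possible_recall(
    torch.as_tensor(proposed_anchors, dtype=torch.float32),
    torch.as_tensor(resized_gt_wh, dtype=torch.float32),
)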

Once again, we see that our best possible recall is 1, which means that these anchor sizes are also a good fit for our problem!

Whilst it is perhaps unnecessary in this case, we may be able to improve these anchors further using a genetic algorithm. Following this methodology, we can define a fitness (or reward) function to measure how well our anchor boxes match our data and make small, random changes to our anchor sizes to try and maximise this function.

In this case we can define our fitness function as follows:

def anchor_fitness(anchors, wh):
    """
    A fitness function that can be used to evolve a set of anchors. This function calculates the mean best anchor ratio
    for all matches that are within the multiple range considered during the loss calculation.
    """
    best_anchor_ratio = calculate_best_anchor_ratio(anchors=anchors, gt_wh=wh)
    return (
        best_anchor_ratio
        * (best_anchor_ratio > 1 / LOSS_ANCHOR_MULTIPLE_THRESHOLD).float()
    ).mean()

Here, we are taking the best anchor ratio for each match that will be considered during the loss calculation. If an anchor box is more than four times greater or smaller than its matched bounding box, it will not contribute to our score. Let’s use this to calculate a fitness score for our proposed anchor sizes:

Now, let’s use this as the fitness function when optimizing our anchors, as demonstrated below:

Inspecting the definition of this function, we can see that, for a specified number of iterations, we are simply sampling random noise from a normal distribution and using this to mutate our anchor sizes. If this change leads to an increased score, we keep these as our anchor sizes!

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/yolov7/anchors.py

import numpy as np
from tqdm import tqdm

def evolve_anchors(
    proposed_anchors,
    gt_wh,
    num_iterations=1000,
    mutation_probability=0.9,
    mutation_noise_mean=1,
    mutation_noise_std=0.1,
    anchor_fitness_fn=anchor_fitness,
    verbose=False,
):
    """
    Use a genetic algorithm to mutate the given anchors to try and optimise them based on the given widths and heights of the
    ground truth boxes based on the provided fitness function. Anchor dimensions are mutated by adding random noise sampled
    from a normal distribution with the mean and standard deviation provided.

    :param proposed_anchors: a tensor containing the widths and heights of the anchor boxes to evolve
    :param gt_wh: a tensor of shape [N, 2] representing the width and height of each ground truth bounding box
    :param num_iterations: the number of iterations for which to run the algorithm
    :param mutation_probability: the probability that each anchor dimension is mutated during each iteration
    :param mutation_noise_mean: the mean of the normal distribution from which the mutation noise will be sampled
    :param mutation_noise_std: the standard deviation of the normal distribution from which the mutation noise will be sampled
    :param anchor_fitness_fn: the reward function that will be used during the optimization process. This should accept proposed_anchors and gt_wh as arguments
    :param verbose: if True, the value of the fitness function will be printed at the end of each iteration

    """
    best_fitness = anchor_fitness_fn(proposed_anchors, gt_wh)
    anchor_shape = proposed_anchors.shape

    pbar = tqdm(range(num_iterations), desc=f"Evolving anchors with Genetic Algorithm:")
    for i, _ in enumerate(pbar):
        # Define mutation by sampling noise from a normal distribution
        anchor_mutation = np.ones(anchor_shape)
        anchor_mutation = (
            (np.random.random(anchor_shape) < mutation_probability)
            * np.random.randn(*anchor_shape)
            * mutation_noise_std
            + mutation_noise_mean
        ).clip(0.3, 3.0)

        mutated_anchors = (proposed_anchors.copy() * anchor_mutation).clip(min=2.0)
        mutated_anchor_fitness = anchor_fitness_fn(mutated_anchors, gt_wh)

        if mutated_anchor_fitness > best_fitness:
            best_fitness, proposed_anchors = (
                mutated_anchor_fitness,
                mutated_anchors.copy(),
            )
            pbar.desc = (
                f"Evolving anchors with Genetic Algorithm: fitness = {best_fitness:.4f}"
            )
            if verbose:
                print(f"Iteration: {i}, Fitness: {best_fitness}")

    return proposed_anchors

Let’s see whether this has improved our score at all:

We can see that our evolved anchors have a better fitness score than our original proposed anchors, as we would expect!

Now, all that is left to do is to sort the anchors into a rough ascending order, considering the smallest dimension for each anchor.
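As a small sketch, assuming evolved_anchors is the array returned by evolve_anchors above:

import numpy as np

# order the anchors by their smallest side, so that the smallest anchors end up
# assigned to the FPN head with the finest grid
final_anchors = evolved_anchors[np.argsort(evolved_anchors.min(axis=1))]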

Putting it all together

Now that we understand the process, we could calculate our anchors for our dataset in a single step using the following function.

In this case, as our best possible recall is already greater than the threshold, we can keep our original anchor sizes!

However, in cases where our anchor sizes change, we can update them as demonstrated below:

Run training

Now that we have explored some of the techniques used in the original training recipe, let's update our training script to include some of these features. An updated script is presented below:

# https://github.com/Chris-hughes10/Yolov7-training/blob/main/examples/train_cars.py

import random
from functools import partial
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from func_to_script import script
from PIL import Image
from pytorch_accelerated.callbacks import (
ModelEmaCallback,
ProgressBarCallback,
SaveBestModelCallback,
get_default_callbacks,
)
from pytorch_accelerated.schedulers import CosineLrScheduler
from torch.utils.data import Dataset

from yolov7 import create_yolov7_model
from yolov7.dataset import (
Yolov7Dataset,
create_base_transforms,
create_yolov7_transforms,
yolov7_collate_fn,
)
from yolov7.evaluation import CalculateMeanAveragePrecisionCallback
from yolov7.loss_factory import create_yolov7_loss
from yolov7.mosaic import MosaicMixupDataset, create_post_mosaic_transform
from yolov7.trainer import Yolov7Trainer, filter_eval_predictions
from yolov7.utils import SaveBatchesCallback, Yolov7ModelEma

def load_cars_df(annotations_file_path, images_path):
    all_images = sorted(set([p.parts[-1] for p in images_path.iterdir()]))
    image_id_to_image = {i: im for i, im in enumerate(all_images)}
    image_to_image_id = {v: k for k, v in image_id_to_image.items()}

    annotations_df = pd.read_csv(annotations_file_path)
    annotations_df.loc[:, "class_name"] = "car"
    annotations_df.loc[:, "has_annotation"] = True

    # add 100 empty images to the dataset
    empty_images = sorted(set(all_images) - set(annotations_df.image.unique()))
    non_annotated_df = pd.DataFrame(list(empty_images)[:100], columns=["image"])
    non_annotated_df.loc[:, "has_annotation"] = False
    non_annotated_df.loc[:, "class_name"] = "background"

    df = pd.concat((annotations_df, non_annotated_df))

    class_id_to_label = dict(
        enumerate(df.query("has_annotation == True").class_name.unique())
    )
    class_label_to_id = {v: k for k, v in class_id_to_label.items()}

    df["image_id"] = df.image.map(image_to_image_id)
    df["class_id"] = df.class_name.map(class_label_to_id)

    file_names = tuple(df.image.unique())
    random.seed(42)
    validation_files = set(random.sample(file_names, int(len(df) * 0.2)))
    train_df = df[~df.image.isin(validation_files)]
    valid_df = df[df.image.isin(validation_files)]

    lookups = {
        "image_id_to_image": image_id_to_image,
        "image_to_image_id": image_to_image_id,
        "class_id_to_label": class_id_to_label,
        "class_label_to_id": class_label_to_id,
    }
    return train_df, valid_df, lookups

class CarsDatasetAdaptor(Dataset):
    def __init__(
        self,
        images_dir_path,
        annotations_dataframe,
        transforms=None,
    ):
        self.images_dir_path = Path(images_dir_path)
        self.annotations_df = annotations_dataframe
        self.transforms = transforms

        self.image_idx_to_image_id = {
            idx: image_id
            for idx, image_id in enumerate(self.annotations_df.image_id.unique())
        }
        self.image_id_to_image_idx = {
            v: k for k, v in self.image_idx_to_image_id.items()
        }

    def __len__(self) -> int:
        return len(self.image_idx_to_image_id)

    def __getitem__(self, index):
        image_id = self.image_idx_to_image_id[index]
        image_info = self.annotations_df[self.annotations_df.image_id == image_id]
        file_name = image_info.image.values[0]
        assert image_id == image_info.image_id.values[0]

        image = Image.open(self.images_dir_path / file_name).convert("RGB")
        image = np.array(image)

        image_hw = image.shape[:2]

        if image_info.has_annotation.any():
            xyxy_bboxes = image_info[["xmin", "ymin", "xmax", "ymax"]].values
            class_ids = image_info["class_id"].values
        else:
            xyxy_bboxes = np.array([])
            class_ids = np.array([])

        if self.transforms is not None:
            transformed = self.transforms(
                image=image, bboxes=xyxy_bboxes, labels=class_ids
            )
            image = transformed["image"]
            xyxy_bboxes = np.array(transformed["bboxes"])
            class_ids = np.array(transformed["labels"])

        return image, xyxy_bboxes, class_ids, image_id, image_hw

DATA_PATH = Path("/".join(Path(__file__).absolute().parts[:-2])) / "data/cars"


@script
def main(
    data_path: str = DATA_PATH,
    image_size: int = 640,
    pretrained: bool = False,
    num_epochs: int = 300,
    batch_size: int = 8,
):

    # load data
    data_path = Path(data_path)
    images_path = data_path / "training_images"
    annotations_file_path = data_path / "annotations.csv"
    train_df, valid_df, lookups = load_cars_df(annotations_file_path, images_path)
    num_classes = 1

    # create datasets
    train_ds = CarsDatasetAdaptor(
        images_path, train_df, transforms=create_base_transforms(image_size)
    )
    eval_ds = CarsDatasetAdaptor(images_path, valid_df)

    mds = MosaicMixupDataset(
        train_ds,
        apply_mixup_probability=0.15,
        post_mosaic_transforms=create_post_mosaic_transform(
            output_height=image_size, output_width=image_size
        ),
    )
    if pretrained:
        # disable mosaic if finetuning
        mds.disable()

    train_yds = Yolov7Dataset(
        mds,
        create_yolov7_transforms(training=True, image_size=(image_size, image_size)),
    )
    eval_yds = Yolov7Dataset(
        eval_ds,
        create_yolov7_transforms(training=False, image_size=(image_size, image_size)),
    )

    # create model, loss function and optimizer
    model = create_yolov7_model(
        architecture="yolov7", num_classes=num_classes, pretrained=pretrained
    )
    param_groups = model.get_parameter_groups()

    loss_func = create_yolov7_loss(model, image_size=image_size)

    optimizer = torch.optim.SGD(
        param_groups["other_params"], lr=0.01, momentum=0.937, nesterov=True
    )

    # create evaluation callback and trainer
    calculate_map_callback = (
        CalculateMeanAveragePrecisionCallback.create_from_targets_df(
            targets_df=valid_df.query("has_annotation == True")[
                ["image_id", "xmin", "ymin", "xmax", "ymax", "class_id"]
            ],
            image_ids=set(valid_df.image_id.unique()),
            iou_threshold=0.2,
        )
    )

    trainer = Yolov7Trainer(
        model=model,
        optimizer=optimizer,
        loss_func=loss_func,
        filter_eval_predictions_fn=partial(
            filter_eval_predictions, confidence_threshold=0.01, nms_threshold=0.3
        ),
        callbacks=[
            calculate_map_callback,
            ModelEmaCallback(
                decay=0.9999,
                model_ema=Yolov7ModelEma,
                callbacks=[ProgressBarCallback, calculate_map_callback],
            ),
            SaveBestModelCallback(watch_metric="map", greater_is_better=True),
            SaveBatchesCallback("./batches", num_images_per_batch=3),
            *get_default_callbacks(progress_bar=True),
        ],
    )

    # calculate scaled weight decay and gradient accumulation steps
    total_batch_size = (
        batch_size * trainer._accelerator.num_processes
    )  # batch size across all processes

    nominal_batch_size = 64
    num_accumulate_steps = max(round(nominal_batch_size / total_batch_size), 1)
    base_weight_decay = 0.0005
    scaled_weight_decay = (
        base_weight_decay * total_batch_size * num_accumulate_steps / nominal_batch_size
    )

    optimizer.add_param_group(
        {"params": param_groups["conv_weights"], "weight_decay": scaled_weight_decay}
    )

    # run training
    trainer.train(
        num_epochs=num_epochs,
        train_dataset=train_yds,
        eval_dataset=eval_yds,
        per_device_batch_size=batch_size,
        create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(
            num_warmup_epochs=5,
            num_cooldown_epochs=5,
            k_decay=2,
        ),
        collate_fn=yolov7_collate_fn,
        gradient_accumulation_steps=num_accumulate_steps,
    )


if __name__ == "__main__":
    main()

Launching training once again, as described here, using a single V100 GPU with fp16 enabled, we obtained a mAP of 0.997 after 300 epochs for both the model and the EMA model; a marginal increase over our transfer learning run, and probably the maximum performance that can be achieved on this dataset!

Hopefully that has provided a somewhat comprehensive overview of some of the most interesting ideas from the YOLOv7 training process, and how these can be applied in custom training scripts.

All of the code required to replicate this post is available as a notebook here. Whilst code snippets are used throughout the article, these are included primarily for aesthetic purposes; please defer to the notebook and the repo for working code.

Chris Hughes and Bernat Puig Camps are on LinkedIn

Here, we used the car object detection dataset from Kaggle, which was made publicly available as part of Competition Six (tjmachinelearning.com). This dataset is frequently used for learning purposes.

Whilst there is no clear license attached to this dataset, we received explicit permission from the authors to use this as part of this article. Unless otherwise stated, all images used in this article are taken from this dataset.
