Master Data Integrity to Clean Your Computer Vision Datasets | by Paul Iusztin | Dec, 2022


Tutorial

Handle Data Leakage. Reduce Labeling Costs. Decrease Computation Time and Expenses.

Photo by JESHOOTS.COM on Unsplash

Data integrity has become one of the biggest concerns for companies and engineers in recent years.

The amount of data we have to process and understand keeps growing, and manually looking at millions of samples is not sustainable. Thus, we need tools that can help us navigate our datasets.

This tutorial presents how to clean, visualize, and understand Computer Vision datasets such as images and videos.

We will be working on a video of the most precious thing in my house: my cat. Our end goal is to extract the essential frames from the video, which can later be sent for labeling and used to train your model. Extracting crucial information from a video is not straightforward because the video's properties change constantly. For example, the beginning of the video is highly static, while from the middle onward there is a lot of action. Thus, we need an intelligent way to understand the properties of the video so we can eliminate duplicate images, find outliers, and cluster similar photos.

GIF from the video of my sassy cat [GIF by the Author].

Leveraging FastDup, a tool for understanding and cleaning CV datasets, we will show you how to solve the above-mentioned problems.

The End Goal

We will present a tutorial on decoding a video of my sassy cat and extracting all of its frames as images. We will use FastDup to visualize different statistics over the resulting image dataset.

The main goal is to remove similar/duplicate images from the dataset. In the next section, we will detail why cleaning your dataset from duplicates is crucial. Ultimately, we will look at the outliers before and after removing all the similar photos.

When the data gets bigger, it is extremely difficult to parse and understand all your samples manually. Therefore, you need smarter ways to query your dataset and find duplicate images.

When using tabular data, you can bypass this issue with tools such as Pandas, SQL, etc. But doing the same with images is a challenge. Therefore, we need to address this problem from a different angle. Using FastDup, we can quickly compute a set of embeddings over the dataset, with which we can perform data retrieval, clustering, outlier detection, etc. Ultimately, we can easily query and visualize our dataset to understand it better.
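To make the embedding idea concrete, here is a minimal, self-contained sketch of how near-duplicate detection over embeddings can work. This is a toy illustration and not FastDup's actual implementation: images are represented by one embedding vector each, and pairs whose cosine similarity exceeds a threshold are flagged as near-duplicates.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def find_near_duplicates(embeddings, threshold=0.98):
    """Return index pairs of images whose embeddings are almost identical."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Two nearly identical frames and one distinct frame:
frames = [[1.0, 0.0, 0.2], [0.99, 0.01, 0.21], [0.0, 1.0, 0.5]]
print(find_near_duplicates(frames))  # the first two frames form a pair
```

Real tools use approximate nearest-neighbor search instead of this quadratic loop, but the principle is the same.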

We can reduce the importance of eliminating similar images to three main ideas:

#1 Data leakage

Many samples will be similar when working with images, especially with videos. A robust strategy is to use embeddings to detect similar photos and remove duplicates accordingly. Keeping copies is prone to data leakage. For example, when splitting the data, it is easy to accidentally add image A to the train split and image B to the test split, where A and B are similar. Such data leakage problems are often encountered when working with large datasets and, unfortunately, lead to erroneous results and poor performance when deploying the model in production.
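One way to guard against this kind of leakage is a group-aware split: every cluster of near-duplicates goes entirely into one split, so image A and its near-duplicate B can never straddle train and test. The sketch below assumes we already have a mapping from image name to component (cluster) id; all names here are illustrative, not a FastDup API.

```python
import random

def split_by_component(image_to_component, test_ratio=0.2, seed=42):
    """Split images so that every component lands entirely in one split."""
    components = sorted(set(image_to_component.values()))
    rng = random.Random(seed)
    rng.shuffle(components)
    n_test = max(1, int(len(components) * test_ratio))
    test_components = set(components[:n_test])
    train, test = [], []
    for image, component in image_to_component.items():
        (test if component in test_components else train).append(image)
    return train, test

# f0 and f1 are near-duplicates (component 0), as are f3 and f4 (component 2).
images = {"f0.jpg": 0, "f1.jpg": 0, "f2.jpg": 1, "f3.jpg": 2, "f4.jpg": 2}
train, test = split_by_component(images)
# Each component now sits entirely in train or entirely in test.
```

Libraries such as scikit-learn offer the same idea off the shelf (e.g., GroupShuffleSplit).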

#2 Reduce labeling costs

Most well-performing ML models use supervised methods to learn. Thus, we need labels to train them.

Keeping similar images makes your dataset larger but without any positive effect. More data won't add any value in this case because it won't add any new information. Instead, you just spend money and effort labeling similar images.

#3 Reduce computation time & costs

As we discussed in point #2, new data needs to add information to help the model. By keeping similar samples in your dataset, your training will be longer and more costly. Ultimately, it will result in fewer and longer experiments, and thus in poorer models due to a lack of feedback.

FastDup is a Python tool for Computer Vision with many fantastic features for gaining insights into an extensive image/video collection.

Some of its main features:

  • Find anomalies
  • Find duplicates
  • Clustering
  • Interactions between images/videos
  • Compute embeddings

It is unsupervised, scalable, and works fast on your CPU. This is awesome because GPUs are expensive and one of the biggest impediments for smaller teams.

Now, because we saw the importance of understanding and cleaning our datasets, let’s start the tutorial and use FastDup to see how we can solve those issues on a video of my sassy cat.

Let’s go 👇🏻

The following are a few constants that reflect the location of our resources. We have the path to our input video, the directory where we will extract the frames, and the directory where FastDup will compute and cache information about the dataset, such as the embeddings, outliers, and various statistics. The information extracted by FastDup is computed only once and can be accessed at any time.
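The constants could look like the following sketch. The exact names and paths are assumptions for illustration, not taken from the original repository; adapt them to your setup.

```python
from pathlib import Path

# Illustrative resource locations (hypothetical names):
INPUT_VIDEO_PATH = Path("data/cat_video.mp4")          # the input video
VIDEO_EXTRACTION_1_SKIP_DIR = Path("data/frames_1_skip")  # extracted frames
FASTDUP_ANALYTICS_1_SKIP_DIR = Path("analytics/fastdup_1_skip")  # FastDup cache
```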

We are extracting the video frame by frame to a given output directory. The video has an FPS of 30 and a length of 47 seconds. skip_rate is the rate with which we extract frames from the video. In this case, skip_rate = 1. Thus, we extract the video frame by frame. If skip_rate = 5, we would extract frame 0, frame 5, frame 10, and so on.
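A minimal sketch of such an `extract` helper, using OpenCV, might look like this. The original helper's implementation may differ; the pure `frame_indices` function makes the skip logic explicit, and `cv2` is imported lazily inside `extract` so the rest of the snippet works without OpenCV installed.

```python
def frame_indices(total_frames, skip_rate=1):
    """Indices of the frames kept for a given skip rate: 0, skip_rate, 2*skip_rate, ..."""
    return list(range(0, total_frames, skip_rate))

def extract(input_video_path, output_dir, skip_rate=1):
    """Decode the video and write every `skip_rate`-th frame as a JPEG.

    A hedged sketch, not the article's exact implementation.
    """
    import os
    import cv2  # imported lazily so the pure helper above stays standalone

    os.makedirs(output_dir, exist_ok=True)
    capture = cv2.VideoCapture(str(input_video_path))
    extracted = frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of video (or read error)
            break
        if frame_index % skip_rate == 0:
            cv2.imwrite(os.path.join(output_dir, f"frame_{frame_index:06d}.jpg"), frame)
            extracted += 1
        frame_index += 1
    capture.release()
    return extracted
```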

frames_count = extract(input_video_path=INPUT_VIDEO_PATH, output_dir=VIDEO_EXTRACTION_1_SKIP_DIR)

FPS = 30
Frame Count = 1409
Video Length = 47 seconds
Extract Frame Count = 1409

First, let’s define some of the terminology we will keep using throughout this article:

  • component: A cluster of similar images. A component always has more than one image.
  • similar images: Images that are not identical but contain the same information. Most of the images won’t be perfect duplicates of each other. Thus, it is essential to have a mechanism that finds closely related images.

Now we are calling FastDup to analyze our dataset and cache all the analytics in the given directory.

compute_fastdup_analytics(images_dir=VIDEO_EXTRACTION_1_SKIP_DIR, analytics_dir=FASTDUP_ANALYTICS_1_SKIP_DIR)

Statistics

Using FastDup, you can efficiently compute a histogram of your dataset over different statistics, such as mean, standard deviation, min, max, and size.

fastdup.create_stats_gallery(FASTDUP_ANALYTICS_1_SKIP_DIR, save_path="stats", metric="mean")
Histogram of the mean value of the images within the video [Image by the Author].
fastdup.create_stats_gallery(FASTDUP_ANALYTICS_1_SKIP_DIR, save_path="stats", metric="stdv")
Histogram of the standard deviation of the images within the video [Image by the Author].

The most interesting metric is blur, which quickly highlights blurred images and their statistics:

Statistics about the blurred images within the video [Image by the Author].

Outliers

FastDup leverages embeddings and K-Means to cluster the images in the embedding space. Therefore, we can quickly find clusters of similar images (in its lingo, “components”) and outliers.

Let’s compute the outliers:
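The call follows the same pattern as the `create_stats_gallery` calls above. This is a hedged sketch: the exact signature may vary across fastdup versions, and the analytics directory name is an assumption; the snippet is guarded so it degrades gracefully when fastdup or the cached analytics are unavailable.

```python
# Assumed analytics directory from the earlier steps (hypothetical name):
FASTDUP_ANALYTICS_1_SKIP_DIR = "analytics/fastdup_1_skip"

try:
    import fastdup
    # Build an HTML gallery of the top outlier images from the cached analytics.
    fastdup.create_outliers_gallery(FASTDUP_ANALYTICS_1_SKIP_DIR,
                                    save_path="outliers", num_images=5)
    gallery_created = True
except Exception:  # fastdup missing, or no analytics computed in this environment
    gallery_created = False
```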

The outliers visualization helps us spot interesting scenarios. In our case, the outlier images are blurry. As long as the information within a sample is valid, it is good practice to keep noisy samples to increase the robustness of the model.

Top 5 outliers within the video, quickly computed with FastDup [Image by the Author].

Duplicates

Finally, let’s compute and find out how many similar images we have extracted from our video:

FastDup creates a histogram based on the size of the component, where a component is a cluster of similar images.

Top 3 components within the video, extracted frame by frame. The most significant component is the one that contains the most similar images [Image by the Author].

As expected, if we export all the frames of the video, we end up with many duplicates. We can nicely visualize the top 3 components, which contain the most copies. Also, we can see the distribution of the component sizes: 6 components have more than 40 images, which signals that in ~6 scenes the video was very static.

Also, by computing num_images / FPS, we can estimate the duration in seconds of a static scene.
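That estimate in code (the function name is ours, for illustration):

```python
def static_scene_seconds(component_size, fps=30):
    """Rough duration of the static scene implied by a component's size."""
    return component_size / fps

# A component of 45 near-identical frames at 30 FPS ≈ 1.5 s of static video:
print(static_scene_seconds(45))  # 1.5
```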

How can we remove all those duplicates?

A naive approach would be to increase the skip_rate to eliminate redundancy between similar adjacent frames. Let’s see how that performs.

As we said earlier, let’s use skip_rate = 10 to see if we can solve the similarity issue. The video has an FPS of 30, and after a few experiments, we found that skip_rate = 10 is a balanced number.

frames_count_10_skip = extract(input_video_path=INPUT_VIDEO_PATH, output_dir=VIDEO_EXTRACTION_10_SKIP_DIR, skip_rate=SKIP_RATE)

FPS = 30
Frame Count = 1409
Video Length = 47 seconds
Extract Frame Count = 140

We have to recompute the FastDup analytics to visualize the new components:

Now let’s visualize the components:

Top 3 components within the video, extracted with a skip rate of 10. The most significant component is the one that contains the most similar images [Image by the Author].

After experimenting with multiple skip rates, we solved most of our similarity issues. Duplicate images were found in only 2 components, with 9 and 8 samples. We did a decent job with such a straightforward method.

Also, a huge benefit is that the extraction time was 10x faster. This is a critical aspect to consider when having lots of videos.

But unfortunately, this method is not robust at all. Here is why:

#1. Loss of information

The activity in a video moves at different rates. We can tune the best skip_rate for 2-3 videos to find the most appropriate balance between keeping enough necessary samples and removing redundant data. Still, because the real world is chaotic, that number will probably break at some point for new videos. Without knowing it, you will skip over essential frames.

#2. You cannot control the similarity issue

Because the video is dynamic, a fixed skip rate will only be a good fit across some of your videos. Therefore, finding a different skip rate for every video within your dataset is not sustainable. Also, using a single skip_rate with a higher value across all your videos won’t work.

Even for our video, which is quite simple, we could only remove some of the duplicates.

As stated above, we saw that using a higher skip rate won’t do the job. Thus, we need a more intelligent and dynamic method to find similar images and remove duplicates.

We are in luck, because FastDup does just that. Let’s see how.

Using FastDup, we can take a more ingenious approach. We already know the components within the clustered embedding space of the images. Thus, we know which images are within the same cluster, which translates to similar photos. FastDup gives us the functionality to keep only one image per component.

This method is 10x better because it is independent of the dynamics within the video. It is robust and flexible relative to the irregularities of the video’s semantics.
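The keep-one-image-per-component idea can be sketched as follows. This is a hedged illustration: FastDup's actual output format may differ, so we assume we already have a mapping from image path to component id and derive the deletion list from it.

```python
import os

def duplicates_to_delete(image_to_component):
    """Keep the first image of every component; mark the rest for deletion."""
    seen = set()
    to_delete = []
    for image, component in sorted(image_to_component.items()):
        if component in seen:
            to_delete.append(image)  # this component already has a keeper
        else:
            seen.add(component)
    return to_delete

def delete_duplicates(image_to_component):
    """Remove every duplicate frame from disk and report the count."""
    doomed = duplicates_to_delete(image_to_component)
    for path in doomed:
        os.remove(path)
    print(f"Deleted frames count: {len(doomed)}")
    return len(doomed)
```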

We have to extract all the frames again and recompute the FastDup analytics.


FPS = 30
Frame Count = 1409
Video Length = 47 seconds
Extract Frame Count = 1409

-------------

Deleted frames count: 1202
Deleted 85.308730% of the frames.

Wow. 85% of the images represented redundant data, which is a lot. Imagine that instead of labeling 1000 images, you only have to label 150. That is a lot of saved time.

After we clean the image directory, let’s recompute the components to see how they look.

compute_fastdup_analytics(images_dir=VIDEO_EXTRACTION_1_SKIP_DIR, analytics_dir=FASTDUP_ANALYTICS_1_SKIP_DIR)
AssertionError: No components found with more than one image/video

What is that? Did we get an error?

Yes, we did, and that is perfect. It means that FastDup no longer found any components. Thus, the dataset no longer contains similar images.

That was all. Easy right? Isn’t that cool?

I argued that using embeddings to remove similar images is the best method. But unfortunately, exporting a video frame by frame takes a lot of time. Therefore, the best approach is to combine the two strategies. This time, we will use a low skip rate of 2, because the goal of skipping is no longer to eliminate duplicates but to speed up the extraction. We will still use FastDup to handle the cleaning process.

frames_count_2_skip = extract(input_video_path=INPUT_VIDEO_PATH, output_dir=VIDEO_EXTRACTION_2_SKIP_DIR, skip_rate=SKIP_RATE)

FPS = 30
Frame Count = 1409
Video Length = 47 seconds
Extract Frame Count = 704
compute_fastdup_analytics(images_dir=VIDEO_EXTRACTION_2_SKIP_DIR, analytics_dir=FASTDUP_ANALYTICS_2_SKIP_DIR)
Top 3 components within the video, extracted with a skip rate of 2. The most significant component is the one that contains the most similar images [Image by the Author].

As you can see, using skip_rate = 2, we still have a lot of similar images. But the exporting time was 2x faster.

delete_duplicates(FASTDUP_ANALYTICS_2_SKIP_DIR, total_frames_count=frames_count_2_skip)

Deleted frames count: 346
Deleted 49.147727% of the frames.

Nice. This time, only ~50% of the total number of frames were duplicates.

Outliers Before and After Removing Similar Images

It is interesting to look at the IoU of the outliers before and after cleaning. In theory, the outliers should remain mostly the same after removing duplicates, because the dataset’s general semantic information should stay the same. Unfortunately, the IoU between the before and after outliers is only ~30%, which shows that removing all the duplicate images changed the embedding space more than expected, which is not ideal.

0.31
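The ~0.31 figure is plain set IoU over the two outlier lists. A minimal sketch, with hypothetical file names:

```python
def outlier_iou(before, after):
    """Intersection over union of two sets of outlier image names."""
    before, after = set(before), set(after)
    if not before | after:
        return 0.0
    return len(before & after) / len(before | after)

before = {"f001.jpg", "f002.jpg", "f003.jpg", "f004.jpg"}
after = {"f003.jpg", "f004.jpg", "f005.jpg", "f006.jpg"}
print(outlier_iou(before, after))  # 2 shared / 6 total ≈ 0.33
```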

By showing you the outliers example, I want to highlight that even though this method is innovative and extremely useful, it still needs improvement. We are just at the beginning of the data-centric movement; thus, we must stay tuned to see what it will bring in the future.

Thank you for reading my tutorial. By now, you have seen how easy it is to get control over your dataset.

We presented how to use FastDup to process a real-world video (one of my sassy cats) to quickly find similar images and outliers within a Computer Vision dataset.

We highlighted why it is essential to understand your dataset and correctly treat similar images according to your business problem. Otherwise, you may encounter data leakage, higher labeling costs, and higher computation time and costs: issues that translate to erroneous metrics and fewer experiments, which, at the end of the day, result in poorer models.

Here is the link to the GitHub repository.

💡 My goal is to make machine learning easy and intuitive. If you enjoyed my article, we could connect on LinkedIn for daily insights about #data, #ml, and #mlops.

