Techno Blender
Digitally Yours.

How I Automated a Tedious Task with Python | by Soner Yıldırım | Oct, 2022



Using only the built-in modules

(image by author)

Python makes your life easier. For a wide variety of tasks, from building a highly complex machine learning pipeline to organizing files on your computer, there is a practical Python solution.

So I would like to call Python “everyone’s language”.

Back to how Python helped me this time: I have a list of folders that contain images of different objects.

(image by author)

There are over 100 folders and each has a different number of images. What I need to do is separate the images for each object into train (75%) and validation (25%) folders.

The new structure will be as follows:

(image by author)
(image by author)

As a first step, I created train and validation folders. Then, I copied all the subfolders (e.g. accordion, airplanes, etc.) into both of these folders.

The task is to remove 25% of the images from the train folders and 75% of the images from the validation folders. The important part is not to remove the same ones so that the images in the train and validation folders are different.

The not-so-friendly and tedious way is to manually delete images from these folders. This is definitely out of the question. I’m sure there are several different practical ways of doing this task but we will use Python.

We can divide the task into these 3 steps:

  • Get the file path to the images
  • Determine the ones to be deleted
  • Delete them

Let’s start with the first step.

We will be using the os module of Python to handle path-related tasks. The listdir function returns a list of the names of all the entries (files and subfolders) in the folder at the given path. We provide the path to the train folder and get a list of all the subfolders in it.

import os
train_base_path = "Data/train/"
train_object_list = os.listdir(train_base_path)
print(train_object_list[:5]) # check the first 5
# output
['gerenuk', 'hawksbill', 'headphone', 'ant', 'butterfly']

To access the images inside these subfolders, we need to create the path to each subfolder, which can be done using the os.path.join function. Let’s do it for the first folder in the train object list.

folder_path = os.path.join(train_base_path, train_object_list[0])
print(folder_path)
# output
Data/train/gerenuk

The next step is to access the images in this folder, which can be done using the listdir function.

image_list = os.listdir(folder_path)

The image list created in the previous step contains the names of all the images in the specified folder.

image_list[:5]
# output
['image_0019.jpg',
'image_0025.jpg',
'image_0024.jpg',
'image_0018.jpg',
'image_0020.jpg']

The next step is to determine which images will be deleted from the train and validation folders. There are different ways of handling this step depending on the file names.

All the image files in our case are named with a 4-digit number starting from 0001 but they are not sorted. We can tackle this task with the following steps:

  • Find the number of images using the len function of Python.
# find the number of images
number_of_images = len(image_list)
  • Calculate the number of images to be used in the train folder by multiplying the total number of images by 0.75 and converting it to an integer. We can use the int function, but it truncates floats (e.g. 24.8 to 24). If you want to round floats up (e.g. 24.8 to 25), you can use the ceil function in the math module.
# number of images in train
number_of_images_train = int(number_of_images * 0.75)
# to round up
import math
number_of_images_train = math.ceil(number_of_images * 0.75)
  • Sort the image names in the image list using the list’s built-in sort method.
image_list.sort()
  • Determine the images to be deleted from the train folder using the number of images in train to slice the image list.
remove_from_train = image_list[number_of_images_train:]

If the number of images in train is 20, the remove from train list contains the items after the first 20 items in the image list. Thus, the first 20 items will stay in the train folder. We can also use this value to remove images from the validation folder: there, the first 20 items are the ones to delete.
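The slicing logic above can be tried on a small list before touching any files. The following is a minimal sketch, not the script used in this article: the split_names helper, its parameters, and the sample file names are made up for illustration. It returns the names that stay in each folder, so the names to delete from train are exactly the validation names and vice versa.

```python
import math


def split_names(names, train_fraction=0.75, round_up=False):
    """Return (train_names, validation_names) for a list of file names.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(names)  # sort first so the split is reproducible
    raw = len(ordered) * train_fraction
    # int() truncates, math.ceil() rounds up
    n_train = math.ceil(raw) if round_up else int(raw)
    return ordered[:n_train], ordered[n_train:]


# made-up file names, deliberately out of order like a listdir result
names = [f"image_{i:04d}.jpg" for i in (3, 1, 7, 5, 2, 6, 4)]

train, validation = split_names(names)
print(len(train), len(validation))  # 5 2 (7 * 0.75 = 5.25, truncated to 5)

train_up, validation_up = split_names(names, round_up=True)
print(len(train_up), len(validation_up))  # 6 1 (5.25 rounded up to 6)
```

Because both lists come from the same sorted list and the same cut point, they cannot share a file name.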

We have determined the images to be removed from both the train and validation folders. In order to remove them, we need to construct the path to the image first. Then, the image can be removed using the remove function of the os module.

For each image name in the remove from train list, the file path to the image can be created as follows:

for image in remove_from_train:
    file_path_to_remove = os.path.join(
        train_base_path,
        train_object_list[0],
        image
    )

Here is an example image file path:

'Data/train/metronome/image_0024.jpg'

The final step is to remove the image at this path:

os.remove(file_path_to_remove)
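Since os.remove deletes files permanently, it can be worth printing the paths first and deleting only once the list looks right. Here is a minimal sketch of that idea; the remove_files helper and its dry_run flag are assumptions added for illustration and are not part of the original script.

```python
import os
import tempfile


def remove_files(paths, dry_run=True):
    """Delete each path, or only print it when dry_run is True.

    Hypothetical helper for illustration only.
    """
    handled = []
    for path in paths:
        if dry_run:
            print("would remove:", path)
        else:
            os.remove(path)
        handled.append(path)
    return handled


# try it on a throwaway file in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image_0001.jpg")
    open(path, "w").close()

    remove_files([path])                 # dry run: nothing is deleted
    print(os.path.exists(path))          # True

    remove_files([path], dry_run=False)  # actually delete
    print(os.path.exists(path))          # False
```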

All these steps will be done for each subfolder in both train and validation folders. Thus, we need to put them inside a loop.

train_base_path = "Data/train/"
validation_base_path = "Data/validation/"
train_object_list = os.listdir(train_base_path)

for subfolder in train_object_list:
    if subfolder != ".DS_Store":
        print(subfolder)
        subfolder_path = os.path.join(train_base_path, subfolder)
        image_list = os.listdir(subfolder_path)
        image_list.sort()
        number_of_images = len(image_list)
        number_of_images_train = int(number_of_images * 0.75)
        remove_from_train = image_list[number_of_images_train:]
        remove_from_validation = image_list[:number_of_images_train]

        # remove from train
        for image in remove_from_train:
            file_path_to_remove = os.path.join(
                train_base_path,
                subfolder,
                image
            )
            os.remove(file_path_to_remove)

        # remove from validation
        for image in remove_from_validation:
            file_path_to_remove = os.path.join(
                validation_base_path,
                subfolder,
                image
            )
            os.remove(file_path_to_remove)

I added the if condition to skip an unexpected entry named “.DS_Store”. I could not find it even when I checked the hidden files. Let me know if you have any idea how this situation can be solved.
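One likely explanation: .DS_Store is a hidden file that macOS Finder creates to store view settings for a folder, so it is a file rather than a subfolder. A more general way to skip it, and any other stray file, is to keep only the entries that are directories. The sketch below assumes that approach; the list_subfolders helper is a hypothetical name, and the demo runs on a temporary folder instead of the real data.

```python
import os
import tempfile


def list_subfolders(base_path):
    """Return the sorted names of the directories directly under base_path.

    Hypothetical helper for illustration only.
    """
    return sorted(
        name
        for name in os.listdir(base_path)
        if os.path.isdir(os.path.join(base_path, name))
    )


# demo: two object folders plus a stray .DS_Store file
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "gerenuk"))
    os.mkdir(os.path.join(base, "ant"))
    open(os.path.join(base, ".DS_Store"), "w").close()
    subfolders = list_subfolders(base)

print(subfolders)  # ['ant', 'gerenuk']
```

The same file can also appear inside the object subfolders, where it would be counted as an image, so a similar filter on the file extension may be worth adding there.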

Let’s check the number of images in the train and validation folders for the first 5 objects:

for subfolder in train_object_list[:5]:
    subfolder_path_train = os.path.join(train_base_path, subfolder)
    subfolder_path_validation = os.path.join(validation_base_path, subfolder)

    train_count = len(os.listdir(subfolder_path_train))
    validation_count = len(os.listdir(subfolder_path_validation))

    print(subfolder, train_count, validation_count)

# output
gerenuk 25 9
hawksbill 75 25
headphone 31 11
ant 31 11
butterfly 68 23

Looks like our script has done the job accurately.
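The counts confirm the 75/25 proportions. A further check, sketched here as an assumption about what “accurately” should include, is that no file name appears in both the train and validation folder of an object. The split_overlap helper is a hypothetical name, and the demo uses a temporary folder with a deliberate overlap so that the check has something to report.

```python
import os
import tempfile


def split_overlap(train_dir, validation_dir):
    """Return the sorted file names that appear in both folders.

    Hypothetical helper for illustration only.
    """
    train_names = set(os.listdir(train_dir))
    validation_names = set(os.listdir(validation_dir))
    return sorted(train_names & validation_names)


# demo: image_0002.jpg is placed in both folders on purpose
with tempfile.TemporaryDirectory() as base:
    train_dir = os.path.join(base, "train")
    validation_dir = os.path.join(base, "validation")
    os.mkdir(train_dir)
    os.mkdir(validation_dir)
    for name in ("image_0001.jpg", "image_0002.jpg"):
        open(os.path.join(train_dir, name), "w").close()
    for name in ("image_0002.jpg", "image_0003.jpg"):
        open(os.path.join(validation_dir, name), "w").close()
    overlap = split_overlap(train_dir, validation_dir)

print(overlap)  # ['image_0002.jpg']
```

An empty list for every object subfolder means the two sets are disjoint.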

Python is very efficient at automating everyday tasks such as organizing files and folders, sending emails, and so on. If you do such tasks frequently, I strongly recommend giving Python a chance to do them for you.


Thank you for reading. Please let me know if you have any feedback.


