
Bash Processing Speed Matters. I sped up the execution of a bash…
by Mattia Di Gangi, Mar 2023



Photo by Chris Liverani on Unsplash

When you have to process data in textual or tabular form, one of the long-time favorites is surely GNU Bash, the Linux flagship shell with “batteries included”. If you have never used it, you are missing out on a lot and should definitely give it a try.

The tools that accompany Bash follow the Unix philosophy of “do one thing and do it well”, and are highly optimized for many different tasks. find, grep, sed, and awk are only some of the powerful tools that can interoperate thanks to Bash’s pipes-and-filters architecture for text-file processing.
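As a toy illustration of that pipes-and-filters style (the log file name and the pattern here are made up), a few of these tools can be chained into a single pipeline:

$ grep "ERROR" server.log | cut -d' ' -f1 | sort | uniq -c

Each tool reads the lines produced by the previous one: grep keeps only the lines containing “ERROR”, cut extracts the first space-delimited field, and sort | uniq -c counts how often each value occurs.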

Recently, I had to perform a simple text-processing task for which bash makes perfect sense. I have an input file containing one absolute file path per line, and I have to generate an output file where each line is the basename of the corresponding path in the input file, with a different extension. In practice, the two files are needed as inputs to another program that converts the files listed in the input file (in wav format) into the files listed in the output file (in mp4 format, by adding some video).
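For concreteness, a single input line and the corresponding output line would look roughly like this (the paths are made up for illustration):

/data/recordings/clip_0001.wav
clip_0001.mp4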

I could have done it in Python, but bash just looks much more practical for this task. (At the end of this story I will show a Python implementation for comparison.) So I rushed to my keyboard and produced the following:

$ cat input.txt | while read line; do echo $(echo $(basename $line) | sed "s/.wav/.mp4/") >> output.txt; done

The code is correct, but extremely slow. My file was 3 million lines long and this command would have taken about an hour to complete. Just to get an idea of the lines per second, let me run it on a file with 10,000 lines and measure its runtime.
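One way to create such a sample, assuming the full list sits in a file called full_input.txt (a made-up name), is simply to take its first 10,000 lines:

$ head -n 10000 full_input.txt > input.txt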

$ time cat input.txt | while read line; do echo $(echo $(basename $line) | sed "s/.wav/.mp4/") >> output.txt; done

real 0m13.297s
user 0m19.688s
sys 0m1.881s

Some clear inefficiencies are the use of cat to start the pipeline and the double use of echo with nested command substitutions: each $( ) forks a new process for every single line, which is very slow. They can easily be replaced, and since we know that all paths in the input file have the same extension, we can also drop sed and remove the extension with basename itself. Then, we run the new command on the same file of 10,000 lines:

$ time while read line; do name=$(basename $line .wav); echo ${name}.mp4 >> output.txt; done < input.txt

real 0m6.626s
user 0m5.723s
sys 0m1.131s

We have some serious improvement here. Removing cat alone brings the real time just below 13 s (a relative improvement of about 2%); the rest comes from replacing sed and the second echo with a single basename call. Unfortunately, it is still quite slow: at approximately 1,500 lines/s, it would take about 2,000 seconds, or roughly half an hour, to process 3,000,000 lines. Fortunately, we can get a serious boost by replacing read. read takes a line from standard input and assigns its content to one or more variables (which makes it handy for tabular data), but we do not need that here, since the standard text-processing tools already work on their input line by line.

Unfortunately, we have to give up the handy basename for extracting the file name, but we can replace it with cut, which splits text according to a delimiter, and rev, which simply reverses the characters of each line. This is a common trick to extract the last field with cut, which cannot address fields from the end by default.
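On a single (made-up) path, the trick looks like this:

$ echo "/data/recordings/clip_0001.wav" | rev | cut -d/ -f1 | rev
clip_0001.wav

rev flips the line so that the basename becomes the first slash-delimited field, cut takes that field, and the second rev restores the original character order.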

The number of operations looks higher than before, but we finally get a huge speed-up, as we can see on our toy example file:

$ time rev input.txt | cut -d/ -f1 | rev | sed "s/.wav/.mp4/" >> output.txt

real 0m0.011s
user 0m0.010s
sys 0m0.013s

With this new speed of roughly 910 K lines/second we can process 3,000,000 lines in about 3.3 seconds, which corresponds to a speed-up of about 606x over the previous version.

Most importantly, while the absolute numbers depend on the hardware where the commands run, the relative improvement should be roughly the same across different machines.
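For completeness, the same transformation could also be done in a single awk process; this is only a sketch that I have not benchmarked against the commands above:

$ awk -F/ '{ name = $NF; sub(/\.wav$/, ".mp4", name); print name }' input.txt > output.txt

Here -F/ splits each line on “/”, $NF is the last path component, and sub replaces the trailing .wav with .mp4 before printing.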

Here is an equivalent Python implementation for comparison’s sake:

# convert.py
import os
import sys

def convert(tgt_ext: str):
    # Read one path per line from stdin and print its basename with the target extension.
    for line in sys.stdin:
        path = line.rstrip("\n")
        base, _ = os.path.splitext(os.path.basename(path))
        print(base + tgt_ext)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} TARGET_EXT")
        sys.exit(1)

    convert(sys.argv[1])
    sys.exit(0)

and now we can measure its time:

$ time python3 convert.py .mp4 < input.txt > output.txt

real 0m0.022s
user 0m0.021s
sys 0m0.000s

The time is about double the best time we got with bash. It requires writing more code, but that code is still very fast and probably easier to modify for many readers.

Bash comes in very handy for many data-processing tasks involving text files. It ships with many highly optimized tools, but some of them are faster than others at achieving the same result. The difference may not matter with short files, but as this article shows, it starts to matter with hundreds of thousands, or millions, of lines.

Knowing the performance implications of our favorite programs can save us hours of waiting for our jobs to finish, with huge gains in productivity. We also saw that a Python implementation is very fast for this use case, despite Python’s reputation for being slow. It requires more coding, but also offers more flexibility, and I would certainly reach for Python in cases that are too complicated to solve with bash.

Thanks for reading so far, and happy scripting!

Do you like my writing and are you considering a Medium Membership for unlimited access to articles?

If you subscribe through this link you will support me through your subscription at no additional cost to you: https://medium.com/@mattiadigangi/membership

