Combining Multiprocessing and Asyncio in Python for Performance Boosts | by Peng Qian | May, 2023

Thanks to the GIL, using multiple threads to perform CPU-bound tasks has never been a viable option in Python. With the popularity of multicore CPUs, Python offers the multiprocessing module to perform CPU-bound tasks in parallel. But using the multiprocessing-related APIs directly still comes with some problems.

Before we start, here is a small piece of code to aid in the demonstration:

The method takes one argument, accumulates the integers from 0 up to that argument, prints the method's execution time, and returns the result.
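A minimal sketch of such a helper is shown below; the name sum_to_num and the exact print format are assumptions, since the original listing is not reproduced here:

import time


def sum_to_num(final_num: int) -> int:
    start = time.monotonic()

    result = 0
    for i in range(0, final_num + 1):
        result += i

    print(f"The method with {final_num} completed in {time.monotonic() - start:.2f} second(s).")
    return result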

Problems with multiprocessing

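The direct multiprocessing version discussed below might look roughly like this (a sketch reusing the sum_to_num helper above; the workload sizes are illustrative):

import multiprocessing


def main():
    # Create two processes running the CPU-bound task with different workloads.
    process_a = multiprocessing.Process(target=sum_to_num, args=(200_000_000,))
    process_b = multiprocessing.Process(target=sum_to_num, args=(50_000_000,))

    process_a.start()
    process_b.start()

    # join blocks the main process and cannot return each task's result.
    process_a.join()
    process_b.join()


if __name__ == "__main__":
    main()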
As the code shows, we directly create and start multiple processes, and call the start and join methods of each process. However, there are some problems here:

  1. The join method cannot return the result of task execution.
  2. The join method blocks the main process, so the tasks are effectively waited on one after another.

This is true even if a later task finishes faster than an earlier one, as shown in the following figure:

The screenshot shows the execution sequence of join. Image by Author
Although process_b finishes executing first, it still has to wait for process_a. Image by Author

Problems of using Pool

If we use multiprocessing.Pool, there are also some problems:

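A sketch of the Pool-based version referred to below, again with illustrative workloads:

import multiprocessing


def main():
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        # apply is synchronous: each call blocks until its task has finished.
        result_a = pool.apply(sum_to_num, args=(200_000_000,))
        result_b = pool.apply(sum_to_num, args=(50_000_000,))
        print(result_a, result_b)


if __name__ == "__main__":
    main()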
As the code shows, Pool's apply method is synchronous, which means you have to wait for the previously applied task to finish before the next one can start executing.

The multiprocessing.Pool.apply method is synchronous. Image by Author

Of course, we can use the apply_async method to submit the tasks asynchronously. But again, you need to call the get method, which blocks, to retrieve the results. That brings us back to the problem with the join method:

Although apply_async is asynchronous, get will still block and execute sequentially. Image by Author
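A sketch of that pattern (not the original listing): apply_async submits both tasks up front, but get still blocks in submission order.

import multiprocessing


def main():
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        # The tasks are submitted without blocking...
        async_a = pool.apply_async(sum_to_num, args=(200_000_000,))
        async_b = pool.apply_async(sum_to_num, args=(50_000_000,))

        # ...but get blocks, so the results are still collected in submission order.
        print(async_a.get(), async_b.get())


if __name__ == "__main__":
    main()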

The problem with using ProcessPoolExecutor directly

So, what if we use concurrent.futures.ProcessPoolExecutor to execute our CPU-bound tasks?

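One way to reproduce the behaviour described below is executor.map, which is iterated much like asyncio.as_completed but yields results in submission order (a sketch under that assumption; the original listing is not shown here):

import concurrent.futures


def main():
    numbers = [200_000_000, 50_000_000]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # The iteration looks like as_completed, but results come back
        # in the order the tasks were submitted, not the order they finish.
        for result in executor.map(sum_to_num, numbers):
            print(result)


if __name__ == "__main__":
    main()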
As the code shows, everything looks great, and it is called just like asyncio.as_completed. But look at the results; they are still fetched in startup order. This is not at all the same as asyncio.as_completed, which yields results in the order in which the tasks complete:

Results are fetched in startup order. Image by Author
The result of the iteration still maintains the call order and blocks. Image by Author

Use asyncio’s run_in_executor to fix it

Fortunately, we can use asyncio to handle IO-bound tasks, and its run_in_executor method lets us invoke multiprocessing tasks in the same way as other asyncio code. This not only unifies the concurrent and parallel APIs but also solves the various problems we encountered above:
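A minimal sketch of this combination, using run_in_executor with a ProcessPoolExecutor and collecting results as they complete:

import asyncio
import concurrent.futures


async def main():
    loop = asyncio.get_running_loop()

    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Wrap each CPU-bound call in an awaitable backed by the process pool.
        tasks = [loop.run_in_executor(executor, sum_to_num, num)
                 for num in (200_000_000, 50_000_000)]

        # as_completed yields results in completion order, not submission order.
        for done in asyncio.as_completed(tasks):
            print(await done)


if __name__ == "__main__":
    asyncio.run(main())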

Combining asyncio and ProcessPoolExecutor. Image by Author

Since the sample code in the previous article only simulated how the concurrent methods should be called, many readers still found it hard to apply in actual coding. So, now that we understand why we need to run CPU-bound parallel tasks with asyncio, today we will use a real-world example to explain how to handle IO-bound and CPU-bound tasks with asyncio at the same time, and appreciate the efficiency asyncio brings to our code. Let's go.

In today's case, we will deal with two problems:

  1. How to read multiple datasets concurrently, especially when the datasets are large or numerous, and how to use asyncio to improve efficiency.
  2. How to use asyncio’s run_in_executor method to implement a MapReduce program and process datasets efficiently.

Before we start, I will use a diagram to explain how our code is going to be executed:

The diagram shows how the entire code works. Image by Author

The yellow part represents our concurrent tasks. Since the CPU can process data from memory faster than IO can read data from disk, we first read all datasets into memory concurrently.

After the initial data merging and slicing, we come to the green part, which represents the CPU-bound parallel tasks. In this part, we will start several processes to map the data.

Finally, we get the intermediate results of all the processes in the main process and then use a reduce program to get the final results.

Data preparation

In this case, we will use the Google Books Ngram Dataset, which counts the frequency of each string combination in various books by year from 1500 to 2012.

The Google Books Ngram dataset is free to use for any purpose, and today we will use the datasets below:

Our goal is to count the cumulative number of occurrences of each word across these datasets.

Dependency installation

To read the files concurrently, we will use the aiofiles library, which supports asynchronous file IO with asyncio.

If you are using pip, you can install it as follows:

$ pip install aiofiles

If you are using Anaconda, you can install it as follows:

$ conda install -c anaconda aiofiles

Since this case is relatively simple, for the sake of demonstration we will do the whole thing in a single .py script.

As an architect, before you start, you should plan your methods according to the flowchart design and try to follow the "single responsibility principle" for each method, so that each method does only one thing:
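A possible skeleton of those methods, based on the descriptions that follow (the reduce method's name, merge_resource, is an assumption; the other names appear in the text):

async def read_file(filename: str) -> list[str]:
    """Read a single dataset file into a list of lines."""


async def get_all_file_content(file_names: list[str]) -> list[str]:
    """Read all dataset files concurrently and merge their lines into one list."""


def partition(contents: list[str], partition_size: int):
    """Split the merged lines into chunks of partition_size lines."""


def map_resource(lines: list[str]) -> dict[str, int]:
    """Map one chunk of lines to a word-frequency dict."""


async def map_with_process(chunks) -> list[dict[str, int]]:
    """Run map_resource on each chunk in a process pool and gather the results."""


def merge_resource(first: dict[str, int], second: dict[str, int]) -> dict[str, int]:
    """Reduce step: merge two word-frequency dicts into one."""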

Next, we will implement each method step by step and finally integrate them to run together in the main method.

File reading

Method read_file will implement reading a single file with aiofiles:
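A sketch of read_file, assuming the dataset files are plain text:

import aiofiles


async def read_file(filename: str) -> list[str]:
    # aiofiles lets us await the read without blocking the event loop.
    async with aiofiles.open(filename, "r", encoding="utf-8") as f:
        return await f.readlines()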

Method get_all_file_content will start the file-reading tasks and, after all the files have been read, merge each line of text into a single list and return it.
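A sketch of get_all_file_content, based on that description:

import asyncio


async def get_all_file_content(file_names: list[str]) -> list[str]:
    print("Begin reading files...")
    tasks = [asyncio.create_task(read_file(name)) for name in file_names]
    contents = await asyncio.gather(*tasks)

    # Flatten the per-file lists of lines into a single list.
    all_lines = [line for content in contents for line in content]
    print(f"Read {len(all_lines)} lines in total.")
    return all_lines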

Data grouping

Method partition will split the list into multiple smaller lists of partition_size length according to the passed partition_size; it is implemented as a generator to make the subsequent iteration easier:
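A sketch of the partition generator:

def partition(contents: list[str], partition_size: int):
    # Yield successive slices of partition_size lines.
    for i in range(0, len(contents), partition_size):
        yield contents[i:i + partition_size]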

Map processing data

Method map_resource is the actual map method. It reads each line of data from the list, uses the word as the key and the sum of its frequencies as the value, and finally returns a dict as the result.
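A sketch of map_resource, assuming the tab-separated layout of the Google Books Ngram files (ngram, year, match_count, volume_count):

def map_resource(lines: list[str]) -> dict[str, int]:
    result: dict[str, int] = {}
    for line in lines:
        # Each line: ngram \t year \t match_count \t volume_count
        word, _year, count, _volumes = line.split("\t")
        result[word] = result.get(word, 0) + int(count)
    return result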

Integrating asyncio with multiprocessing

Method map_with_process calls asyncio's run_in_executor method, which starts a pool of processes according to the number of CPU cores and executes the map method in parallel. The final results are merged into a list by the asyncio.gather method.
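A sketch of map_with_process:

import asyncio
import concurrent.futures
import os


async def map_with_process(chunks: list[list[str]]) -> list[dict[str, int]]:
    loop = asyncio.get_running_loop()
    # One worker per CPU core; each chunk is mapped in a separate process.
    with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        tasks = [loop.run_in_executor(executor, map_resource, chunk) for chunk in chunks]
        return await asyncio.gather(*tasks)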

Reducing the merged data

Since the map process ends up with a list of word-frequency dicts produced by multiple processes, we also need a reduce method to merge these dicts into a single final result recording the total frequency of each word. Here we first write the method that implements one step of the reduce process.

Then we call the functools.reduce method directly to merge the data.
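A sketch of both steps, with the merge method named merge_resource as an assumption:

import functools


def merge_resource(first: dict[str, int], second: dict[str, int]) -> dict[str, int]:
    merged = dict(first)
    for word, count in second.items():
        merged[word] = merged.get(word, 0) + count
    return merged


# Merging all the intermediate dicts into one final word-frequency dict:
# final_result = functools.reduce(merge_resource, intermediate_results)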

Finally, implement the main method

Finally, we integrate all the methods into the main method and call it.
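Putting the pieces together, the main method might look like this (the file names and partition size are placeholders, not the article's actual values):

import asyncio
import functools


async def main():
    # Placeholder file names; substitute the actual ngram dataset files.
    file_names = ["dataset_a.txt", "dataset_b.txt"]

    contents = await get_all_file_content(file_names)
    chunks = list(partition(contents, 1_000_000))

    intermediate_results = await map_with_process(chunks)
    final_result = functools.reduce(merge_resource, intermediate_results)

    print(f"Aardvark appears {final_result.get('Aardvark', 0)} times.")


if __name__ == "__main__":
    asyncio.run(main())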

Great! We get the sum of the frequencies of the word Aardvark in all the datasets. Task complete.

Using tqdm to indicate progress

In the previous article, we explained how to use tqdm to indicate the progress of asyncio tasks.

Since, in the real world, processing large datasets often takes a long time, and we need to track the progress of code execution during that time, we also need to add tqdm progress bars in the right places.
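For example, one natural place for a progress bar is the concurrent file-reading step; a sketch (the original placement may differ):

import asyncio
from tqdm import tqdm


async def get_all_file_content(file_names: list[str]) -> list[str]:
    tasks = [asyncio.create_task(read_file(name)) for name in file_names]

    all_lines = []
    # Wrapping as_completed with tqdm updates the bar as each file finishes.
    for task in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Reading files"):
        all_lines.extend(await task)
    return all_lines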

It looks much more professional now.

The resulting screenshot after adding the tqdm APIs. Image by Author

In today's article, we explored some of the problems with multiprocessing code, such as the hassle of getting each process's result and the inability to obtain results in the order in which the tasks complete.

We also explored the feasibility of integrating asyncio with ProcessPoolExecutor and the advantages that such integration brings to us. For example, it unifies the API for concurrent and parallel programming, simplifies our programming process, and allows us to obtain execution results in order of completion.

Finally, we explained how to alternate between concurrent and parallel programming techniques to execute our code efficiently in data science tasks, through a real-world case study.

Due to my limited ability, there are inevitably some shortcomings in this case study, so I welcome your comments and corrections so that we can learn and progress together.

