Learn how to speed up compute-intensive applications with the power of modern GPUs
The most common deep learning frameworks, such as TensorFlow and PyTorch, often rely on kernel calls to use the GPU for parallel computation and to accelerate neural networks. The best-known interface that lets developers program the GPU is CUDA, created by NVIDIA.
Parallel computing requires a completely different point of view from ordinary programming, but before you get your hands dirty, there are terms and concepts to learn.
Background
In the following figure, you see a classic set-up in which we have an input that is processed by the CPU one instruction at a time to generate an output. But how do we process multiple instructions at the same time? This is what we will try to understand in this article.
Terminology
- Process: an instance of a computer program that is being executed. A process runs on the CPU and is allocated its own space in RAM.
- Context: the collection of data describing a process (memory addresses, program state). It allows the processor to suspend the execution of a process and resume it later.
- Thread: a component of a process. Every process has at least one thread, called the main thread, which is the entry point of the program. A thread executes instructions. Within a process, multiple threads can coexist and share the memory allocated to that process; between processes there is no memory sharing.
- Round Robin: on a single-core processor, processes are executed under a Round Robin schedule, in which each process in turn gets a slice of CPU time. Switching from the execution of one process to another is called context switching.
Parallelism
In modern computing we can reduce the need for context switching by running different threads on different cores (for now, think of a core as a small processor). That's why we have multi-core devices!
But remember that in almost every process there are some instructions which should be performed sequentially and some others that can be computed simultaneously in parallel.
When you talk about parallelism, remember that there are 2 types of parallelism:
- Task Level: different tasks on the same or different data
- Data Level: same task on different data
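Data-level parallelism is exactly what GPU kernels exploit. As a preview (kernel syntax is covered later in this series), a minimal CUDA kernel that applies the same task to different elements of an array might look like this:

```cuda
// Each GPU thread performs the same task (add 1.0f) on a
// different element of the array: data-level parallelism.
__global__ void addOne(float *data, int n)
{
    // Compute a unique global index for this thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard threads that fall past the array end
        data[i] += 1.0f;
}
```

Thousands of these threads run at once, each one responsible for a single element of the data.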
At this point, be careful not to confuse parallelism with concurrency.
- Concurrency: a single processor interleaves the execution of several processes, and we only have the illusion of parallelism because the processor switches between them very quickly.
- Parallelism: true simultaneous execution on multiple processors or cores.
CPU, GPU and GPGPU
Graphics processing units (GPUs) can perform enormous numbers of operations in a short time, as long as those operations remain simple and largely similar to one another. The games industry was the launching market for GPUs; NVIDIA later broadened their reach with its CUDA platform. The popularity of GPUs has grown even more among developers, who can now run massively parallel computations with just a few lines of code.
CUDA allows us to use parallel computing for so-called general-purpose computing on graphics processing units (GPGPU), i.e. using GPUs for more general purposes besides 3D graphics.
Let’s summarize some basic differences between CPUs and GPUs.
GPU:
- low clock speed
- thousands of cores
- context switching is done by hardware (really fast)
- can switch between threads if one thread stalls
CPU:
- high clock speed
- few cores
- context switching is done by software (slow)
- switching between threads is comparatively expensive
Basic Steps of CUDA Programming
In the next articles, we are going to write parallel code. First, however, we must know the structure of a CUDA-based program; there are a few simple steps to follow.
- Initialization of data on CPU
- Transfer data from CPU to GPU
- Kernel launch (instructions on GPU)
- Transfer results back to CPU from GPU
- Reclaim the memory from both CPU and GPU
In such an environment we will call Host Code the code that is going to run on the CPU and Device Code the code that is going to run on the GPU.
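As a preview, the five steps above can be sketched in a minimal CUDA program (kernel syntax and error checking are covered later in this series; the `square` kernel here is just an illustrative example):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Device Code: runs on the GPU, one thread per element.
__global__ void square(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

// Host Code: runs on the CPU and orchestrates the five steps.
int main(void)
{
    const int n = 8;
    float h[n];                                       // 1. initialize data on CPU
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float),
               cudaMemcpyHostToDevice);               // 2. transfer CPU -> GPU

    square<<<1, n>>>(d, n);                           // 3. kernel launch

    cudaMemcpy(h, d, n * sizeof(float),
               cudaMemcpyDeviceToHost);               // 4. transfer GPU -> CPU

    cudaFree(d);                                      // 5. reclaim GPU memory
                                                      //    (h lives on the CPU stack)
    for (int i = 0; i < n; ++i) printf("%g ", h[i]);
    printf("\n");
    return 0;
}
```

Notice how the host and device each have their own memory: nothing on the GPU is visible to the CPU until it is explicitly copied back.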
There is much more to see, but I prefer not to write it all down in one article because I think it would be more confusing than anything else.
Parallel programming underpins modern computing: it is how we cut down very long computation times, simply by putting more processors or more cores to work together. Unity is strength!
If you’re interested in understanding how GPUs work and you like programming close to the hardware, keep reading this series of articles I’m going to publish. Personally, I have found the study of CUDA extremely interesting: every line of code feels like a puzzle to be solved.
Marcello Politi