GPU Computing

When we talk about parallel computing on GPUs, it is worth remembering the time we live in: everything has accelerated so much that we lose track of how fast it rushes by. Almost everything we do depends on fast and accurate information processing, so we need tools that can take all the information we have and turn it into useful data. And these tasks matter not only to large organizations and mega-corporations, but also to ordinary users who solve their own technology-related problems at home on personal computers. The emergence of NVIDIA CUDA was therefore not surprising but justified: PCs will soon have to handle far more demanding tasks than before. Work that used to take a long time will now take a matter of minutes, and that will change the overall picture of the whole world.

What is GPU computing?

GPU computing is the use of the GPU to perform technical, scientific, and everyday computations. GPU computing involves using the CPU and GPU together, dividing the work heterogeneously between them: the sequential part of the program runs on the CPU, while the time-consuming computational work is handed to the GPU. The task is thus parallelized, which speeds up information processing and reduces execution time; the system becomes more productive and can process more tasks simultaneously than before. However, hardware support alone is not enough to achieve this; software support is also required so that an application can transfer its most time-consuming calculations to the GPU.

What is CUDA

CUDA is a technology for programming, in a simplified dialect of C, algorithms that execute on the graphics processors of eighth-generation and later GeForce accelerators, as well as the corresponding Quadro and Tesla cards from NVIDIA. CUDA lets you include special functions in the text of a C program. These functions are written in the simplified C dialect and executed on the GPU. The initial version of the CUDA SDK was released on February 15, 2007. To translate code in this language, the CUDA SDK includes NVIDIA's own command-line C compiler, nvcc. The nvcc compiler is based on the open Open64 compiler and translates host code (the main, controlling code) and device code (the hardware code) from files with the .cu extension into object files suitable for building the final program or library in any development environment, such as Microsoft Visual Studio.
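As a minimal illustration (not from the original article; the names and sizes are arbitrary), a .cu file of the kind nvcc translates mixes host code and device code like this:

// Hypothetical example: a device function (kernel) plus the host code that
// copies data to the GPU, launches the kernel and copies the result back.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float* data, int n)           // device code: runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()                                            // host code: runs on the CPU
{
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 0.0f;

    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<(n + 255) / 256, 256>>>(dev, n);         // launch blocks of 256 threads

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    std::printf("host[0] = %f\n", host[0]);           // prints 1.000000
    return 0;
}

Saved as, say, example.cu, such a file would be compiled with a command along the lines of nvcc example.cu -o example.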

Technology capabilities

  1. A standard C-based language for developing parallel applications on GPUs.
  2. Ready-made numerical analysis libraries for the fast Fourier transform and a basic set of linear algebra routines.
  3. A dedicated CUDA driver for computing, with fast data transfer between the GPU and CPU.
  4. The ability of the CUDA driver to interoperate with OpenGL and DirectX graphics drivers.
  5. Support for 32/64-bit Linux, 32/64-bit Windows XP, and MacOS operating systems.

Benefits of technology

  1. The CUDA application programming interface (CUDA API) is based on the standard C language with some restrictions, which simplifies and smooths the process of learning the CUDA architecture.
  2. The 16 KB of shared memory available to a block of threads can be used as a user-organized cache with wider bandwidth than fetches from regular textures (a minimal sketch of this follows the list).
  3. More efficient transactions between CPU memory and video memory.
  4. Full hardware support for integer and bitwise operations.
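As a minimal sketch of point 2 (not from the original article; it assumes blocks of 256 threads and arbitrary names), shared memory can act as a user-managed cache for a simple 1D smoothing filter: each block stages its part of the input once, and all of its threads then reuse the staged values instead of re-reading global memory.

// Hypothetical CUDA kernel: the 258 floats of "tile" (about 1 KB, well within
// the 16 KB of shared memory) are filled once per block and reused by all threads.
__global__ void smooth3(const float* in, float* out, int n)
{
    __shared__ float tile[256 + 2];                  // block size 256 plus two halo cells
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                       // +1 leaves room for the left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;          // each thread loads one element
    if (threadIdx.x == 0)                            // first thread loads the left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)               // last thread loads the right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                 // wait until the whole tile is filled

    if (gid < n)                                     // three reads served from shared memory
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}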

Example of technology application

cRark

The most laborious part of working with this program is setting it up. The program has a console interface, but thanks to the instructions supplied with it, it is easy to use. Below are brief instructions for setting it up. We will test the program and compare it with a similar program that does not use NVIDIA CUDA, in this case the well-known Advanced Archive Password Recovery.

From the downloaded cRark archive we need only three files: crark.exe, crark-hp.exe and password.def. crark.exe is a console utility for cracking passwords of RAR 3.0 archives without encrypted file names (i.e., when opening the archive we can see the file names, but cannot unpack the archive without the password).

crark-hp.exe is a console utility for cracking passwords of RAR 3.0 archives with full archive encryption (i.e., when opening the archive we see neither the names nor the files themselves and cannot unpack the archive without the password).

password.def is any renamed text file with very little content (for example, line 1: ## and line 2: ?* — in this case the password will be cracked using all characters). password.def directs the cRark program: the file contains the rules for cracking the password (the set of characters that crark.exe will use in its work). More details about choosing these characters are given in the text file russian.def, obtained when unpacking the archive downloaded from the website of the cRark author.

Preparation

I'll say right away that the program only works if your video card is based on a GPU that supports CUDA compute level 1.1. This rules out video cards based on the G80 chip, such as the GeForce 8800 GTX, since they only have hardware support for CUDA 1.0. Using CUDA, the program cracks passwords only for RAR archives of version 3.0 and later. All CUDA-related software must also be installed.

We create a folder anywhere (for example on drive C:) and give it any name, for example "3.2". We place the files crark.exe, crark-hp.exe and password.def there, along with the password-protected/encrypted RAR archive.

Next, launch the Windows command-line console and change to the created folder. In Windows Vista and 7, open the Start menu and type "cmd.exe" in the search field; in Windows XP, open the Start menu, choose "Run" and enter "cmd.exe" there. After the console opens, enter a command like cd C:\folder\ — in this case, cd C:\3.2.

Type the following two lines in a text editor (you can also save the text as a .bat file in the folder with cRark) to guess the password of a password-protected RAR archive with unencrypted files:

echo off
cmd /K crark (archive name).rar

to guess the password of a password-protected and encrypted RAR archive:

echo off
cmd /K crark-hp (archive name).rar

Copy the two lines from the text file into the console and press Enter (or run the .bat file).

Results

The decryption process is shown in the figure:

The guessing speed of cRark using CUDA was 1625 passwords per second. In one minute and thirty-six seconds, a 3-character password, "q)$", was found. For comparison: the search speed of Advanced Archive Password Recovery on my dual-core Athlon 3000+ processor is at most 50 passwords per second, and the search would have lasted 5 hours. In other words, brute-forcing a RAR archive in cRark using a GeForce 9800 GTX+ video card is about 30 times faster than on the CPU.

For those with an Intel processor and a good motherboard with a high-frequency system bus (FSB 1600 MHz), the CPU rate and the search speed will be higher. And with a quad-core processor and a pair of GeForce 280 GTX-class video cards, brute-forcing passwords speeds up considerably. To sum up the example: this problem was solved with CUDA technology in just 2 minutes instead of 5 hours, which shows the high potential of the technology.

Conclusions

Having examined CUDA parallel computing technology using the example of a RAR password-recovery program, we have clearly seen the power and enormous development potential of this technology. It will certainly find a place in the life of everyone who decides to use it, whether for scientific tasks, video processing, or economic tasks that require fast, accurate calculations; all of this leads to an inevitable increase in productivity that cannot be ignored. Today the phrase "home supercomputer" is already entering the lexicon, and it is clear that to make it a reality every home already has a tool called CUDA. Since the release of cards based on the G80 chip (2006), a huge number of NVIDIA accelerators supporting CUDA have appeared, which can make the dream of a supercomputer in every home come true. By promoting CUDA, NVIDIA raises its standing with customers by providing additional capabilities for the hardware that many of them have already purchased. We can only expect CUDA to develop quickly and let users take full advantage of parallel computing on GPUs.

A developer should learn to use the device's graphics processing unit (GPU) effectively so that the application does not slow down or perform unnecessary work.

Profile GPU Rendering

If your application is sluggish, it means that some or all of the screen refresh frames are taking longer than 16 milliseconds to render. To see frame updates visually, you can enable a special option on the device (Profile GPU Rendering).

You will be able to see quickly how long it takes to render each frame. Remember, you need to stay within 16 milliseconds.

The option is available on devices running Android 4.1 and higher. Developer mode must be activated on the device. On devices with version 4.2 and higher, developer mode is hidden by default; to activate it, go to Settings | About phone and tap the Build number line seven times.

After activation, go to Developer Options and find the item Profile GPU rendering, which should be enabled. In the pop-up window, select the option On screen as bars. The graph will then be displayed on top of the running application.

You can test not only your own application but others as well. Launch any application and start working with it. As you work, you will see an updated graph at the bottom of the screen. The horizontal axis represents elapsed time, and the vertical axis shows the time for each frame in milliseconds. As you interact with the application, vertical bars are drawn on the screen from left to right, showing frame performance over time. Each bar represents one frame drawn on the screen; the taller the bar, the longer it took to draw. The thin green line is a guide and corresponds to 16 milliseconds per frame. Your goal is to keep the graph from crossing this line while exercising your application.

Let's look at a larger version of the graph.

The green line marks 16 milliseconds. To stay within 60 frames per second, each bar of the graph must be drawn below this line. At some point a bar may become too tall and rise far above the green line; this means the program is slowing down. Each bar has blue, purple (Lollipop and above), red, and orange segments.

The blue segment represents the time spent creating and updating the Views.

The purple segment represents the time spent transferring resources to the render thread.

The red segment represents the time spent drawing.

The orange segment shows how long the CPU waited for the GPU to finish its work. Large values here are the source of problems.

There are special techniques to reduce the load on the GPU.

Debug GPU Overdraw

Another setting lets you find out how often the same portion of the screen is redrawn (i.e., how much extra work is done). Go to Developer Options again and find the item Debug GPU Overdraw, which should be enabled. In the pop-up window, select the option Show overdraw areas. Don't be alarmed: some elements on the screen will change color.

Go back to any application and watch it work. The color will indicate problem areas in your application.

If the colors in the application have not changed, everything is fine: there is no layering of one color on top of another.

Blue indicates that one layer is drawn on top of the layer below it. That is fine.

Green means the area is redrawn twice. You should think about optimization.

Pink means the area is redrawn three times. Things are going badly.

Red means the area is redrawn many times. Something has gone wrong.

You can check your application yourself to find problem areas. Create an activity and place a TextView component on it. Give the root element and the text label a background via the android:background attribute. You get the following: first, the bottommost layer of the activity is painted one color; then a new layer from the TextView is drawn on top of it; and, by the way, the TextView's text itself is also drawn on top of that.

In some cases, overlapping colors cannot be avoided. But imagine that you set a background in the same way on a ListView that occupies the entire activity area. The system will do double work, even though the user will never see the bottom layer of the activity. And if, in addition, you create your own layout for each list item with its own background, you will get real overkill.

A little advice: after the setContentView() call, add a call that removes the painting of the window background with the theme color. This removes one extra color overlay:

getWindow().setBackgroundDrawable(null);

Today, news about the use of GPUs for general-purpose computing can be heard on every corner. Words such as CUDA, Stream and OpenCL have become some of the most cited on the IT Internet in just two years. However, not everyone knows what these words mean and what technologies stand behind them. And for Linux users, who are used to being left out, all of this looks like a dark forest.

Birth of GPGPU

We are all used to thinking that the only component of a computer capable of executing any code it is told to run is the central processor. For a long time, almost all mass-market PCs were equipped with a single processor that handled all conceivable calculations, including the operating system code, all our software, and the viruses too.

Later, multi-core processors and multiprocessor systems appeared, containing several such components. This allowed machines to perform multiple tasks simultaneously, and the overall (theoretical) system performance increased exactly as much as the number of cores installed in the machine. However, it turned out that producing and designing multi-core processors was too difficult and expensive.

Each core had to house a full-fledged processor of the complex and intricate x86 architecture, with its own (rather large) cache, instruction pipeline, SSE blocks, many optimization units, and so on. The process of increasing the number of cores therefore slowed down significantly, and the white university lab coats, for whom two or four cores were clearly not enough, found a way to use other computing power for their scientific calculations: the power abundant on the video card (as a result, the BrookGPU tool even appeared, emulating an additional processor using DirectX and OpenGL function calls).

Graphics processors, free of many of the disadvantages of the central processor, turned out to be excellent and very fast calculating machines, and very soon the GPU manufacturers themselves began to take a closer look at the work of these scientific minds (nVidia actually hired most of the researchers). The result was nVidia's CUDA technology, which defines an interface that makes it possible to transfer the computation of complex algorithms to the shoulders of the GPU without any crutches. It was later followed by ATi (AMD) with its own version of the technology, called Close to Metal (now Stream), and soon afterwards a standard version from Apple appeared, called OpenCL.

Is the GPU everything?

Despite all the advantages, the GPGPU technique has several problems. The first is its very narrow scope of application. GPUs have gone far ahead of the central processor in terms of raw computing power and total number of cores (video cards carry computing units with more than a hundred cores), but such density is achieved by simplifying the design of the chip itself as much as possible.

In essence, the main job of the GPU comes down to mathematical calculations using simple algorithms that take fairly small amounts of predictable data as input. For this reason, GPU cores have a very simple design, tiny caches and a modest instruction set, which ultimately makes them cheap to produce and possible to place very densely on the chip. GPUs are like a Chinese factory with thousands of workers: they do simple things quite well (and, most importantly, quickly and cheaply), but if you entrust them with assembling an airplane, the result will be at most a hang glider.

So the first limitation of GPUs is their focus on fast mathematical calculations, which restricts their use to helping multimedia applications and any programs doing heavy data processing (for example, archivers or encryption systems, as well as software for fluorescence microscopy, molecular dynamics, electrostatics and other things of little interest to Linux users).

The second problem with GPGPU is that not every algorithm can be adapted for execution on the GPU. Individual GPU cores are quite slow, and their power only shows when they work together. This means the algorithm will only be as effective as the programmer's ability to parallelize it, and in most cases only a good mathematician can handle such work, of whom there are very few among software developers.

And third: GPUs work with memory installed on the video card itself, so every time the GPU is used there are two additional copy operations: the input data is copied from the RAM of the application itself, and the output is copied from GRAM back to application memory. As you can imagine, this can negate any benefit in application run time (as is the case with the FlacCL tool, which we will look at later).

But that's not all. Despite the existence of a generally accepted standard in the form of OpenCL, many programmers still prefer vendor-specific implementations of the GPGPU technique. CUDA has proved especially popular: although it provides a more flexible programming interface (incidentally, OpenCL in nVidia drivers is implemented on top of CUDA), it tightly ties the application to video cards from a single manufacturer.

KGPU or Linux kernel accelerated by GPU

Researchers at the University of Utah have developed the KGPU system, which allows some Linux kernel functions to be executed on a GPU using the CUDA framework. To do this, a modified Linux kernel and a special user-space daemon are used; the daemon listens for kernel requests and passes them to the video card driver through the CUDA library. Interestingly, despite the significant overhead that such an architecture creates, the authors of KGPU managed to create an implementation of the AES algorithm that increases the encryption speed of the eCryptfs file system by a factor of 6.

What is there now?

Due to its youth, as well as the problems described above, GPGPU has never become a truly widespread technology, but useful software that uses its capabilities does exist (albeit in tiny quantities). Crackers of various hashes, whose algorithms are very easy to parallelize, were among the first to appear.

Multimedia applications also appeared, such as the FlacCL encoder, which transcodes audio tracks into the FLAC format. Some pre-existing applications have also gained GPGPU support, the most notable being ImageMagick, which can now offload part of its work to the GPU using OpenCL. There are also projects to port data archivers and other compression systems to CUDA/OpenCL (Unix folks are not fond of ATi). We will look at the most interesting of these projects in the following sections of the article, but for now let's figure out what we need to get all of this up and running stably.

GPUs have long surpassed x86 processors in performance

· First, you need a video card whose GPU supports GPGPU (for nVidia cards this means CUDA, for AMD cards Stream).

· Secondly, the latest proprietary drivers for the video card must be installed in the system; they provide support both for the card's native GPGPU technology and for the open OpenCL standard.

· And thirdly, since distribution developers have not yet begun to ship application packages with GPGPU support, we will have to build the applications ourselves, and for that we need the official SDKs from the manufacturers: the CUDA Toolkit or the ATI Stream SDK. They contain the header files and libraries needed to build applications.

Install CUDA Toolkit

Follow the link above and download the CUDA Toolkit for Linux (there are several versions to choose from, for the Fedora, RHEL, Ubuntu and SUSE distributions, in both x86 and x86_64 builds). In addition, you also need to download the developer driver kit from the same page (Developer Drivers for Linux, first on the list).

Launch the SDK installer:

$ sudo sh cudatoolkit_4.0.17_linux_64_ubuntu10.10.run

When the installation is complete, we proceed to installing the drivers. To do this, shut down the X server:

# sudo /etc/init.d/gdm stop

Open the console and run the driver installer:

$ sudo sh devdriver_4.0_linux_64_270.41.19.run

After the installation is complete, start X again:

$ sudo /etc/init.d/gdm start

To allow applications to work with CUDA/OpenCL, we set the path to the directory with the CUDA libraries in the LD_LIBRARY_PATH variable:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Or, if you installed the 32-bit version:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib32

You also need to specify the path to the CUDA header files so that the compiler finds them at the application build stage:

$ export C_INCLUDE_PATH=/usr/local/cuda/include

That's it, now you can start building CUDA/OpenCL software.

Install ATI Stream SDK

The Stream SDK does not require installation, so the AMD archive downloaded from the website can simply be unpacked into any directory (the best choice is /opt), and the path to it written into the same LD_LIBRARY_PATH variable:

$ wget http://goo.gl/CNCNo

$ sudo tar -xzf ~/AMD-APP-SDK-v2.4-lnx64.tgz -C /opt

$ export LD_LIBRARY_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/lib/x86_64/

$ export C_INCLUDE_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/include/

As with the CUDA Toolkit, x86_64 must be replaced with x86 on 32-bit systems. Now go to the root directory and unpack the icd-registration.tgz archive (this is a kind of free license key):

$ sudo tar -xzf /opt/AMD-APP-SDK-v2.4-lnx64/icd-registration.tgz -C /

We check that the package is installed and working correctly using the clinfo tool:

$ /opt/AMD-APP-SDK-v2.4-lnx64/bin/x86_64/clinfo

ImageMagick and OpenCL

OpenCL support has been available in ImageMagick for quite some time, but it is not enabled by default in any distribution. Therefore we will have to compile IM ourselves from source. There is nothing complicated about this; everything you need is already in the SDK, so the build does not require installing any additional libraries from nVidia or AMD. So, download and unpack the archive with the sources:

$ wget http://goo.gl/F6VYV

$ tar -xjf ImageMagick-6.7.0-0.tar.bz2

$ cd ImageMagick-6.7.0-0

$ sudo apt-get install build-essential

We run the configure script and grep its output for OpenCL support:

$ LDFLAGS=-L$LD_LIBRARY_PATH ./configure | grep -e cl.h -e OpenCL

The correct output from the command should look something like this:

checking CL/cl.h usability... yes

checking CL/cl.h presence... yes

checking for CL/cl.h... yes

checking OpenCL/cl.h usability... no

checking OpenCL/cl.h presence... no

checking for OpenCL/cl.h... no

checking for OpenCL library... -lOpenCL

The word "yes" must be marked either in the first three lines or in the second (or both options at once). If this is not the case, then most likely the C_INCLUDE_PATH variable was not initialized correctly. If the last line is marked with the word "no", then the problem is in the LD_LIBRARY_PATH variable. If everything is ok, start the build/installation process:

$ sudo make install clean

Let's check that ImageMagick was indeed compiled with OpenCL support:

$ /usr/local/bin/convert -version | grep Features

Features: OpenMP OpenCL

Now let's measure the resulting speed gain. The ImageMagick developers recommend using the convolve filter for this:

$ time /usr/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

$ time /usr/local/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

Some other operations, such as resizing, should now also work much faster, but you shouldn't expect ImageMagick to start processing graphics at breakneck speed. So far, a very small part of the package has been optimized using OpenCL.

FlacCL (Flacuda)

FlacCL is an encoder of audio files into the FLAC format that uses the capabilities of OpenCL in its work. It is included in the CUETools package for Windows, but thanks to Mono it can also be used on Linux. To obtain the archive with the encoder, run the following commands:

$ mkdir flaccl && cd flaccl

$ wget www.cuetools.net/install/flaccl03.rar

$ sudo apt-get install unrar mono

$ unrar x flaccl03.rar

So that the program can find the OpenCL library, we make a symbolic link:

$ ln -s $LD_LIBRARY_PATH/libOpenCL.so libopencl.so

Now let's run the encoder:

$ mono CUETools.FLACCL.cmd.exe music.wav

If the error message "Error: Requested compile size is bigger than the required workgroup size of 32" appears on the screen, the video card in our system is too weak, and the number of cores used should be reduced to the indicated number with the '--group-size XX' flag, where XX is the required number of cores.

I'll say right away that because of OpenCL's long initialization time, a noticeable gain can only be obtained on sufficiently long tracks. FlacCL processes short sound files at almost the same speed as the traditional version.

oclHashcat or quick brute force

As I already said, developers of various crackers and brute-force password systems were among the first to add GPGPU support to their products. For them the new technology became a real holy grail, making it easy to transfer naturally parallelizable code to the shoulders of fast GPU processors. Therefore it is not surprising that dozens of different implementations of such programs now exist. In this article, however, I will talk about only one of them: oclHashcat.

oclHashcat is a password cracker that can guess passwords from their hashes at extremely high speed while harnessing the power of the GPU through OpenCL. According to the measurements published on the project website, the speed of MD5 password cracking on an nVidia GTX580 reaches 15,800 million combinations per second, which lets oclHashcat find an average-complexity eight-character password in just 9 minutes.

The program supports OpenCL and CUDA, as well as MD5, md5($pass.$salt), md5(md5($pass)), vBulletin < v3.8.5, SHA1, sha1($pass.$salt), MySQL hashes, MD4, NTLM, Domain Cached Credentials and SHA256, and it supports distributed password cracking that harnesses the power of several machines. Download the archive from the project website and unpack it:

$ 7z x oclHashcat-0.25.7z

$ cd oclHashcat-0.25

And run the program (we’ll use a sample list of hashes and a sample dictionary):

$ ./oclHashcat64.bin example.hash ?l?l?l?l example.dict

oclHashcat will display the text of the license agreement, which you must accept by typing "YES". After that the cracking process begins; its progress can be displayed, and the process paused and resumed, with the interactive keys described in the documentation. You can also use a direct brute-force search (for example, from aaaaaaaa to zzzzzzzz):

$ ./oclHashcat64.bin hash.txt ?l?l?l?l ?l?l?l?l

You can also use various modifications of the dictionary and of the direct search method, as well as combinations of them (see the file docs/examples.txt). In my case, searching through the entire dictionary took 11 minutes, while the direct search (from aaaaaaaa to zzzzzzzz) lasted about 40 minutes. The average GPU speed (RV710 chip) was 88.3 million passwords per second.

Conclusions

Despite the many limitations and the complexity of software development, GPGPU is the future of high-performance desktop computing. And most importantly, you can use the capabilities of this technology right now, not only on Windows machines but also on Linux.


Using GPU Computing with C++ AMP

So far, in discussing parallel programming techniques, we have considered only processor cores. We have gained some skills in parallelizing programs across multiple processors, synchronizing access to shared resources, and using high-speed synchronization primitives without using locks.

However, there is another way to parallelize programs: graphics processing units (GPUs), which have more cores than even high-end processors. GPU cores are excellent for implementing data-parallel algorithms, and their large number more than pays for the inconvenience of running programs on them. In this article we will get acquainted with one way of running programs on a GPU, using a set of C++ language extensions called C++ AMP.

The C++ AMP extensions are based on the C++ language, which is why this article demonstrates examples in C++. However, with moderate use of the .NET interop mechanisms, you can use C++ AMP algorithms in your .NET programs. We will talk about this at the end of the article.

Introduction to C++ AMP

Essentially, a GPU is a processor like any other, but with a special instruction set, a large number of cores, and its own memory access protocol. However, there are big differences between modern GPUs and conventional processors, and understanding them is key to creating programs that use the processing power of the GPU effectively.

    Modern GPUs have a very small instruction set. This implies some limitations: no ability to call functions, a limited set of supported data types, no library functions, and others. Some operations, such as conditional branches, can cost significantly more than similar operations on conventional processors. Obviously, moving large amounts of code from the CPU to the GPU under such conditions requires significant effort.

    The number of cores in the average GPU is significantly higher than in the average conventional processor. However, some tasks are too small or cannot be broken down into large enough parts to benefit from the GPU.

    Synchronization support between GPU cores performing the same task is very limited, and between GPU cores performing different tasks it is completely absent. This requires synchronizing the graphics processor with the conventional processor.

The question immediately arises: which tasks are suitable for solving on a GPU? Keep in mind that not every algorithm is suitable for execution on a GPU. For example, GPUs don't have access to I/O devices, so you won't be able to improve the performance of a program that retrieves RSS feeds from the Internet by using a graphics processor. However, many computational algorithms can be transferred to the GPU and massively parallelized. Below are a few examples of such algorithms (the list is by no means complete):

    increasing and decreasing sharpness of images, and other transformations;

    fast Fourier transform;

    matrix transposition and multiplication;

    number sorting;

    direct hash inversion.

An excellent source for additional examples is the Microsoft Native Concurrency blog, which provides code snippets and explanations for various algorithms implemented in C++ AMP.

C++ AMP is a framework included with Visual Studio 2012 that gives C++ developers an easy way to perform computations on the GPU, requiring only a DirectX 11 driver. Microsoft has released C++ AMP as an open specification that can be implemented by any compiler vendor.

The C++ AMP framework lets you run code on graphics accelerators, which it treats as computing devices. Using the DirectX 11 driver, the C++ AMP framework dynamically discovers all accelerators. C++ AMP also includes a software accelerator emulator and a conventional processor-based emulator, WARP, which serves as a fallback on systems without a GPU or with a GPU that lacks a DirectX 11 driver; WARP uses multiple cores and SIMD instructions.
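As a small illustrative sketch (not from the article), the accelerators that C++ AMP discovers, including the WARP and reference emulators, can be listed like this:

#include <amp.h>
#include <iostream>

int main()
{
    using namespace concurrency;
    // accelerator::get_all() returns every accelerator visible through DirectX 11
    for (const accelerator& acc : accelerator::get_all())
    {
        std::wcout << acc.description
                   << L" (dedicated memory: " << acc.dedicated_memory << L" KB)"
                   << (acc.is_emulated ? L", emulated" : L"")
                   << std::endl;
    }
    return 0;
}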

Now let's start exploring an algorithm that can easily be parallelized for execution on a GPU. The implementation below takes two vectors of equal length and calculates the pointwise result. It's hard to imagine anything more straightforward:

void VectorAddExpPointwise(float* first, float* second, float* result, int length)
{
    for (int i = 0; i < length; ++i)
    {
        result[i] = first[i] + exp(second[i]);
    }
}

To parallelize this algorithm on a regular processor, you need to split the iteration range into several subranges and run one thread of execution for each of them. We've spent a lot of time in previous articles on exactly this way of parallelizing our first prime number search example - we've seen how it can be done by creating threads manually, passing jobs to a thread pool, and using Parallel.For and PLINQ to automatically parallelize. Remember also that when parallelizing similar algorithms on a conventional processor, we took special care not to split the problem into too small tasks.

For the GPU, these warnings are unnecessary. GPUs have many cores that execute threads very quickly, and the cost of context switching is significantly lower than on conventional processors. Below is a snippet that uses the parallel_for_each function from the C++ AMP framework:

#include <amp.h>
#include <amp_math.h>

using namespace concurrency;

void VectorAddExpPointwise(float* first, float* second, float* result, int length)
{
    array_view<const float, 1> avFirst(length, first);
    array_view<const float, 1> avSecond(length, second);
    array_view<float, 1> avResult(length, result);
    avResult.discard_data();
    parallel_for_each(avResult.extent, [=](index<1> i) restrict(amp)
    {
        avResult[i] = avFirst[i] + fast_math::exp(avSecond[i]);
    });
    avResult.synchronize();
}

Now let's examine each part of the code separately. Let's immediately note that the general form of the main loop has been preserved, but the originally used for loop has been replaced by a call to the parallel_for_each function. In fact, the principle of converting a loop into a function or method call is not new to us - such a technique has previously been demonstrated using the Parallel.For() and Parallel.ForEach() methods from the TPL library.

Next, the input data (the parameters first, second and result) are wrapped in array_view instances. The array_view class wraps data passed to the GPU (the accelerator). Its template parameters specify the data type and its dimensionality. To execute instructions on a GPU that access data originally processed on a conventional CPU, someone or something must take care of copying the data to the GPU, because most modern graphics cards are separate devices with their own memory. array_view instances solve this problem: they provide on-demand copying, and only when it is really needed.

When the GPU completes the task, the data is copied back. By instantiating array_view with a const type argument, we ensure that first and second are copied into GPU memory but not copied back. Likewise, by calling discard_data(), we avoid copying result from regular processor memory to accelerator memory, while this data will still be copied in the opposite direction.

The parallel_for_each function takes an extent object, which specifies the shape of the data to be processed, and a function to apply to each element of the extent. In the example above we used a lambda function, support for which appeared in the ISO C++ 2011 (C++11) standard. The restrict(amp) keyword instructs the compiler to check whether the function body can be executed on the GPU, and it disables most C++ syntax that cannot be compiled into GPU instructions.

The lambda function's parameter, an index<1> object, represents a one-dimensional index. It must match the extent object being used: if we declared the extent to be two-dimensional (for example, by defining the shape of the source data as a two-dimensional matrix), the index would also need to be two-dimensional. An example of such a situation is given below.

Finally, the call to synchronize() at the end of the VectorAddExpPointwise method ensures that the calculation results in the array_view avResult, produced by the GPU, are copied back into the result array.
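A hypothetical call site (not part of the original example) could look like this; it assumes the function above is declared in the same file, and the vector length is arbitrary:

#include <cstdio>
#include <vector>

int main()
{
    const int length = 4096;
    std::vector<float> first(length, 1.0f), second(length, 0.0f), result(length);

    // Copies the inputs to the accelerator, runs parallel_for_each there
    // and synchronizes the output back into the "result" vector.
    VectorAddExpPointwise(first.data(), second.data(), result.data(), length);

    std::printf("result[0] = %f\n", result[0]);   // expected: 1 + exp(0) = 2
    return 0;
}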

This concludes our first introduction to the world of C++ AMP, and now we are ready for more detailed research and more interesting examples demonstrating the benefits of parallel computing on a GPU. Vector addition is not a good algorithm and not the best candidate for demonstrating GPU use because of the large overhead of copying data. The next subsection shows two more interesting examples.

Matrix multiplication

The first "real" example we'll look at is matrix multiplication. For implementation, we will take a simple cubic matrix multiplication algorithm, and not the Strassen algorithm, which has a execution time close to cubic ~O(n 2.807). Given two matrices, an m x w matrix A and a w x n matrix B, the following program will multiply them and return the result, an m x n matrix C:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C)
{
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < n; ++j)
        {
            int sum = 0;
            for (int k = 0; k < w; ++k)
            {
                sum += A[i*w + k] * B[k*n + j];
            }
            C[i*n + j] = sum;
        }
    }
}

There are several ways to parallelize this implementation, and if you want to parallelize this code to run on a regular processor, the right choice would be to parallelize the outer loop. However, the GPU has a fairly large number of cores, and by parallelizing only the outer loop, we will not be able to create a sufficient number of jobs to load all the cores with work. Therefore, it makes sense to parallelize the two outer loops, leaving the inner loop untouched:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C)
{
    array_view<const int, 2> avA(m, w, A);
    array_view<const int, 2> avB(w, n, B);
    array_view<int, 2> avC(m, n, C);
    avC.discard_data();
    parallel_for_each(avC.extent, [=](index<2> idx) restrict(amp)
    {
        int sum = 0;
        for (int k = 0; k < w; ++k)
        {
            sum += avA(idx[0], k) * avB(k, idx[1]);
        }
        avC[idx] = sum;
    });
}

This implementation still closely resembles the sequential implementation of matrix multiplication and the vector addition example given above, with the exception of the index, which is now two-dimensional and accessed in the inner loop using the () operator. How much faster is this version than the sequential alternative running on a regular processor? Multiplying two 1024 x 1024 integer matrices, the sequential version on a regular CPU takes an average of 7350 milliseconds, while the GPU version, hold on tight, takes 50 milliseconds: 147 times faster!

Particle motion simulation

The examples of solving problems on the GPU presented above have a very simple inner-loop implementation. Clearly this will not always be the case. The Native Concurrency blog, linked above, demonstrates an example of modeling gravitational interactions between particles. The simulation consists of an endless series of steps; at each step, new values of the acceleration vector are calculated for each particle and then their new coordinates are determined. Here the vector of particles is what gets parallelized: with a sufficiently large number of particles (several thousand or more), you can create enough tasks to load all the GPU cores with work.

The basis of the algorithm is the implementation of determining the result of interactions between two particles, as shown below, which can easily be transferred to the GPU:

// here float4 are vectors with four elements,
// representing the particles involved in the operations
void bodybody_interaction(float4& acceleration,
                          const float4 p1, const float4 p2) restrict(amp)
{
    float4 dist = p2 - p1;
    // the w component is not used here
    float absDist = dist.x*dist.x + dist.y*dist.y + dist.z*dist.z;
    float invDist = 1.0f / sqrt(absDist);
    float invDistCube = invDist*invDist*invDist;
    acceleration += dist*PARTICLE_MASS*invDistCube;
}

The initial data at each modeling step is an array with the coordinates and velocities of particles, and as a result of calculations, a new array with the coordinates and velocities of particles is created:

struct particle
{
    float4 position, velocity;
    // implementations of the constructor, copy constructor and
    // operator= with restrict(amp) are omitted to save space
};

void simulation_step(array<particle, 1>& previous, array<particle, 1>& next, int bodies)
{
    extent<1> ext(bodies);
    parallel_for_each(ext, [&](index<1> idx) restrict(amp)
    {
        particle p = previous[idx];
        float4 acceleration(0, 0, 0, 0);
        for (int body = 0; body < bodies; ++body)
        {
            bodybody_interaction(acceleration, p.position, previous[body].position);
        }
        p.velocity += acceleration*DELTA_TIME;
        p.position += p.velocity*DELTA_TIME;
        next[idx] = p;
    });
}

With an appropriate graphical interface, the modeling can be very interesting. The full example provided by the C++ AMP team can be found on the Native Concurrency blog. On my system with an Intel Core i7 processor and a GeForce GT 740M graphics card, a simulation of 10,000 particles runs at ~2.5 fps (steps per second) using the sequential version on the regular processor, and at 160 fps using the optimized version running on the GPU: a huge increase in performance.

Before we wrap up this section, there is one more important feature of the C++ AMP framework that can further improve the performance of code running on the GPU. GPUs support a programmable data cache (often called shared memory). The values stored in this cache are shared by all threads of execution within a single tile. Thanks to tiled memory organization, programs based on the C++ AMP framework can read data from graphics card memory into the tile's shared memory and then access it from multiple threads of execution without re-fetching the data from graphics card memory. Accessing a tile's shared memory is approximately 10 times faster than accessing graphics card memory. In other words, you have reasons to keep reading.

To obtain a tiled version of the parallel loop, the parallel_for_each method is passed a tiled_extent domain, which divides the multidimensional extent object into multidimensional tiles, and a tiled_index lambda parameter, which specifies the global and local ID of the thread within the tile. For example, a 16x16 matrix can be divided into 2x2 tiles (as shown in the image below) and then passed to the parallel_for_each function:

extent<2> matrix(16, 16);
tiled_extent<2, 2> tiledMatrix = matrix.tile<2, 2>();

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp)
{
    // ...
});

Each of the four threads of execution belonging to the same tile can share the data stored in the block.

When performing operations with matrices in the GPU kernel, instead of the standard index<2> used in the examples above, you can use idx.global. Proper use of local tiled memory and local indexes can provide significant performance gains. To declare tiled memory shared by all threads of execution within a single tile, local variables can be declared with the tile_static specifier.

In practice, the technique of declaring shared memory and initializing its individual blocks in different threads of execution is often used:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp)
{
    // 32 bytes are shared by all threads in the block
    tile_static int local[2][2];
    // assign a value to the element for this thread of execution
    local[idx.local[0]][idx.local[1]] = 42;
});

Obviously, any benefits from using shared memory can only be obtained if access to this memory is synchronized; that is, threads must not access the memory until it has been initialized by one of them. Synchronization of threads within a tile is performed using tile_barrier objects (reminiscent of the Barrier class from the TPL library): they can continue execution only after calling the tile_barrier.wait() method, which returns control only when all threads have called tile_barrier.wait(). For example:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp)
{
    // 32 bytes are shared by all threads in the block
    tile_static int local[2][2];
    // assign a value to the element for this thread of execution
    local[idx.local[0]][idx.local[1]] = 42;
    // idx.barrier is an instance of tile_barrier
    idx.barrier.wait();
    // Now this thread can access the "local" array
    // using the indexes of other threads of execution!
});

Now is the time to apply what we have learned to a concrete example. Let's return to the implementation of matrix multiplication performed without tiled memory organization and add the described optimization to it. Let's assume that the matrix size is a multiple of 256, which will allow us to work with 16 x 16 blocks. The nature of matrices allows block-by-block multiplication, and we can take advantage of this feature (in fact, dividing matrices into blocks is a typical optimization of the matrix multiplication algorithm, providing more efficient use of the CPU cache).

The essence of this technique is as follows. To find C(i,j) (the element in row i and column j of the result matrix), you need to calculate the dot product of A(i,*) (the i-th row of the first matrix) and B(*,j) (the j-th column of the second matrix). However, this is equivalent to computing partial dot products of parts of the row and column and then summing the results. We can use this fact to convert the matrix multiplication algorithm into a tiled version:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C)
{
    array_view<const int, 2> avA(m, w, A);
    array_view<const int, 2> avB(w, n, B);
    array_view<int, 2> avC(m, n, C);
    avC.discard_data();
    parallel_for_each(avC.extent.tile<16, 16>(),
        [=](tiled_index<16, 16> idx) restrict(amp)
    {
        int sum = 0;
        int localRow = idx.local[0], localCol = idx.local[1];
        for (int k = 0; k < w; k += 16)
        {
            // local copies of 16 x 16 fragments of A and B, shared by the tile
            tile_static int localA[16][16], localB[16][16];
            localA[localRow][localCol] = avA(idx.global[0], localCol + k);
            localB[localRow][localCol] = avB(localRow + k, idx.global[1]);
            idx.barrier.wait();              // wait until both fragments are loaded

            for (int t = 0; t < 16; ++t)     // partial dot product over this fragment
            {
                sum += localA[localRow][t] * localB[t][localCol];
            }
            idx.barrier.wait();              // do not overwrite the fragments too early
        }
        avC[idx.global] = sum;
    });
}

The essence of the described optimization is that each thread in the tile (256 threads are created for a 16 x 16 block) initializes its own element in the 16 x 16 local copies of the fragments of the original matrices A and B. Each thread in the tile needs only one row and one column of these blocks, but all the threads together access each row and each column 16 times. This approach significantly reduces the number of accesses to main memory.

To calculate element (i,j) of the result matrix, the algorithm requires the complete i-th row of the first matrix and the j-th column of the second matrix. When the threads of the 16x16 tile shown in the diagram run with k=0, the shaded regions of the first and second matrices are read into shared memory. The thread computing element (i,j) of the result matrix then calculates the partial dot product over the first 16 elements of the i-th row and the j-th column of the original matrices.

In this example, the tiled organization provides a huge performance boost. The tiled version of matrix multiplication is much faster than the simple version, taking approximately 17 milliseconds (for the same 1024 x 1024 input matrices), which is 430 times faster than the version running on a conventional processor!

Before we end our discussion of the C++ AMP framework, we would like to mention the tools available to developers in Visual Studio. Visual Studio 2012 offers a GPU debugger that lets you set breakpoints, examine the call stack, and read and change the values of local variables (some accelerators support GPU debugging directly; for others Visual Studio uses a software simulator), and a profiler that lets you evaluate the benefits an application gains from parallelizing operations on the GPU. For more information about the debugging capabilities in Visual Studio, see the article "Walkthrough: Debugging a C++ AMP Application" on MSDN.

GPU Computing Alternatives in .NET

So far this article has shown examples only in C++; however, there are several ways to harness the power of the GPU in managed applications. One is to use interop tools that let you offload work with GPU kernels to low-level C++ components. This solution is great for those who want to use the C++ AMP framework or who can use prebuilt C++ AMP components in managed applications.

Another way is to use a library that works directly with the GPU from managed code. There are currently several such libraries. For example, GPU.NET and CUDAfy.NET (both commercial offerings). Below is an example from the GPU.NET GitHub repository demonstrating the implementation of the dot product of two vectors:

public static void MultiplyAddGpu(double[] a, double[] b, double[] c)
{
    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;
    for (int ElementIdx = ThreadId; ElementIdx < a.Length; ElementIdx += TotalThreads)
    {
        a[ElementIdx] = a[ElementIdx] + b[ElementIdx] * c[ElementIdx];
    }
}

I am of the opinion that it is much easier and more effective to learn a language extension (such as C++ AMP) than to try to orchestrate interactions at the library level or to make significant changes to the IL code.

So, having looked at the possibilities of parallel programming in .NET and on the GPU, no one doubts that organizing parallel computing is an important way to increase performance. In many servers and workstations around the world, the invaluable processing power of CPUs and GPUs goes unused because applications simply don't use it.

The Task Parallel Library gives us a unique opportunity to include all available CPU cores, although this will require solving some interesting problems of synchronization, excessive task fragmentation, and unequal distribution of work between execution threads.

The C++ AMP framework and other multi-purpose GPU parallel computing libraries can be successfully used to parallelize calculations across hundreds of GPU cores. Finally, there is a previously unexplored opportunity to gain productivity gains from the use of cloud distributed computing technologies, which have recently become one of the main directions in the development of information technology.

One of the more hidden features in the recent Windows 10 update is the ability to check which apps are using your graphics processing unit (GPU). If you've ever opened Task Manager, you've probably looked at CPU usage to see which apps are the most CPU-hungry. The latest updates added a similar feature, but for GPUs. This helps you understand how demanding your software and games are on your GPU without having to download third-party software. There is also another interesting feature that helps offload work from your CPU to the GPU.

Why don't I have a GPU section in Task Manager?

Unfortunately, not all video cards can provide Windows with the statistics needed to read GPU usage. To be sure, you can quickly check with the DirectX Diagnostic Tool.

  1. Click " Start" and write in the search dxdiag to run the DirectX Diagnostic Tool.
  2. Go to the "tab" Screen", on the right in the column " drivers"you must have WDDM model more than 2.0 version for using GPU graphs in the task manager.

Enable the GPU graph in Task Manager

To see the GPU usage for each application, you need to open the task manager.

  • Press Ctrl + Shift + Esc to open Task Manager.
  • In Task Manager, right-click the "Name" column header and check GPU in the drop-down menu; you can also check GPU engine to see which engine each program is using.
  • The GPU and GPU engine columns are now visible on the right side of Task Manager.


View overall GPU performance

You can monitor overall GPU usage to keep an eye on it under heavy loads and analyze it. In this case, you can see everything you need on the "Performance" tab by selecting the graphics processor.


Each GPU element is broken down into individual graphs to give you even more insight into how your GPU is being used. If you want to change which graphs are displayed, you can click the small arrow next to the name of each graph. This screen also shows your driver version and date, which is a good alternative to using dxdiag or Device Manager.