Distributed GPU computing: using the capabilities of the GPU to accelerate software

One of the less well-known features added in recent Windows 10 updates is the ability to check which applications are using your graphics processing unit (GPU). If you have ever opened Task Manager, you have probably looked at CPU usage to see which apps load the processor the most. Recent updates add a similar feature for GPUs. It helps you see how heavily your software and games use the GPU without installing third-party tools. There is one more interesting capability that helps offload work from the CPU to the GPU.

Why don't I see a GPU in Task Manager?

Unfortunately, not all video cards can provide Windows with the statistics it needs to report GPU usage. To be sure, you can quickly check with the DirectX Diagnostic Tool.

  1. Click "Start" and type dxdiag in the search box to run the DirectX Diagnostic Tool.
  2. Go to the "Display" tab; in the "Drivers" column on the right, the driver model must be WDDM 2.0 or later for GPU graphs to appear in Task Manager.

Enable the GPU graph in Task Manager

To see GPU usage for each application, you need to open Task Manager.

  • Press Ctrl + Shift + Esc to open Task Manager.
  • Right-click an empty column header next to "Name" and check "GPU" in the dropdown menu. You can also check "GPU engine" to see which engine each program is using.
  • The GPU and GPU engine columns are now visible on the right side of Task Manager.


View overall GPU performance

You can also monitor overall GPU usage to keep an eye on it under heavy load and analyze it. Everything you need is on the "Performance" tab: select the graphics processor there.


Each GPU element is broken down into individual graphs to give you even more insight into how your GPU is being used. If you want to change the graphs displayed, you can click the small arrow next to the name of each task. This screen also shows your driver version and date, which is a good alternative to using DXDiag or Device Manager.


Speaking about parallel computing on GPUs, we should remember the times we live in: today everything is accelerating so much that we lose track of time. Everything we do involves processing information with high speed and accuracy, and under such conditions we certainly need tools to process all the information we have and turn it into data. These tasks matter not only to large organizations and mega-corporations, but also to ordinary users who solve technology-related problems at home on personal computers. The emergence of NVIDIA CUDA was not surprising but rather justified, because PCs will increasingly need to handle far more demanding tasks than before. Work that previously took a lot of time will now take a matter of minutes, and that will affect the overall picture of the whole world.

What is GPU computing?

GPU computing is the use of the GPU to solve technical, scientific, and everyday computational tasks. It involves using the CPU and GPU together in a heterogeneous configuration: the sequential part of the program runs on the CPU, while compute-intensive tasks are left to the GPU. Thanks to this, the work is parallelized, which speeds up information processing and reduces execution time; the system becomes more productive and can process more tasks simultaneously than before. However, hardware support alone is not enough to achieve this; software support is also required so that the application can transfer the most time-consuming calculations to the GPU.

What is CUDA

CUDA is a technology for programming algorithms in a simplified dialect of C that execute on the graphics processors of eighth-series and later GeForce accelerators, as well as the corresponding Quadro and Tesla cards from NVIDIA. CUDA allows you to include special functions in the text of a C program; these functions are written in this simplified C dialect and executed on the GPU. The initial version of the CUDA SDK was introduced on February 15, 2007. To translate code in this language, the CUDA SDK includes NVIDIA's own command-line C compiler, nvcc. The nvcc compiler is based on the open Open64 compiler and translates host code (the main, controlling code) and device code (the GPU code) from files with the .cu extension into object files suitable for building the final program or library in any programming environment, such as Microsoft Visual Studio.
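To make the host/device split concrete, here is a minimal CUDA C++ sketch of my own (not code from the original article) of the kind of .cu file nvcc compiles; the kernel name and sizes are invented for the example:

// vector_add.cu - compile with: nvcc vector_add.cu -o vector_add
#include <cstdio>
#include <cuda_runtime.h>

// device code: each GPU thread adds one pair of elements
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // host code: allocate device memory and copy the input data to the GPU
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // launch enough 256-thread blocks to cover all n elements
    addKernel<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}

The host code runs on the CPU as usual; only the __global__ function is compiled by nvcc into GPU instructions.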

Technology capabilities

  1. A standard C language for parallel application development on GPUs.
  2. Ready-made numerical analysis libraries for the fast Fourier transform and a basic linear algebra subprogram package (a usage sketch follows this list).
  3. A special CUDA driver for computing, with fast data transfer between the GPU and CPU.
  4. The ability of the CUDA driver to interoperate with the OpenGL and DirectX graphics drivers.
  5. Support for 32/64-bit Linux, 32/64-bit Windows XP and macOS.
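As an illustration of point 2, here is a hedged sketch of my own (not from the original article) showing how the CUDA FFT library, cuFFT, is typically called from host code; the signal contents and sizes are invented:

// fft_demo.cu - compile with: nvcc fft_demo.cu -lcufft
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int n = 1024;
    // prepare a complex input signal on the host
    cufftComplex* h = new cufftComplex[n];
    for (int i = 0; i < n; ++i) { h[i].x = (float)i; h[i].y = 0.0f; }

    // copy the signal to GPU memory
    cufftComplex* d;
    cudaMalloc(&d, n * sizeof(cufftComplex));
    cudaMemcpy(d, h, n * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // plan and execute an in-place forward FFT on the GPU
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);

    cudaMemcpy(h, d, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    printf("first bin: (%f, %f)\n", h[0].x, h[0].y);

    cufftDestroy(plan);
    cudaFree(d);
    delete[] h;
    return 0;
}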

Benefits of technology

  1. The CUDA application programming interface (CUDA API) is based on the standard C programming language with some restrictions. This simplifies and smooths the process of learning the CUDA architecture.
  2. The 16 KB of shared memory available to a block of threads can be used as a user-managed cache with wider bandwidth than fetches from regular textures (see the sketch after this list).
  3. More efficient transactions between CPU memory and video memory.
  4. Full hardware support for integer and bitwise operations.
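To illustrate point 2, here is a hedged CUDA sketch of my own (not from the original article) that stages data in shared memory and reuses it as a manually managed cache; the kernel computes a simple 3-point moving average and its name is invented:

// launch with 256-thread blocks, e.g. smooth<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
__global__ void smooth(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];                    // one block of data plus a halo element on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lid = threadIdx.x + 1;                         // local index inside the shared tile

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    // the edge threads of the block also load the halo elements
    if (threadIdx.x == 0)              tile[0]       = (gid > 0)     ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                   // make the tile visible to every thread in the block

    // each tile element is now read by three threads but was fetched from global memory only once
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}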

Example of technology application

cRark

The most labour-intensive part of using this program is setting it up. The program has a console interface, but thanks to the instructions that come with it, it is usable. Below are brief instructions for setting up the program. We will test the program for functionality and compare it with another similar program that does not use NVIDIA CUDA, in this case the well-known Advanced Archive Password Recovery.

From the downloaded cRark archive we need only three files: crark.exe, crark-hp.exe and password.def. crark.exe is a console utility for recovering passwords of RAR 3.0 archives without encrypted file names (i.e., when opening the archive we can see the file names but cannot extract the files without a password).

crark-hp.exe is a console utility for recovering passwords of RAR 3.0 archives with full archive encryption (i.e., when opening the archive we see neither the file names nor the files themselves and cannot extract anything without a password).

password.def is any renamed text file with very little content (for example, line 1: ##, line 2: ?*; in this case the password will be searched using all characters). password.def acts as the director of the cRark program: the file contains the rules for cracking the password (the character set that crark.exe will use in its work). More details about choosing these character sets are given in the text file russian.def, obtained when opening the file downloaded from the website of the author of cRark.

Preparation

I will say right away that the program only works if your video card is based on a GPU that supports CUDA compute capability 1.1. So video cards based on the G80 chip, such as the GeForce 8800 GTX, are not suitable, since they only have hardware support for CUDA 1.0. Using CUDA, the program brute-forces passwords only for RAR archives of version 3.0 and later. It is also necessary to install all CUDA-related software.

Create a folder anywhere (for example, on drive C:) and give it any name, for example "3.2". Place the files crark.exe, crark-hp.exe and password.def there, along with the password-protected/encrypted RAR archive.

Next, launch the Windows command line and go to the created folder. In Windows Vista and 7, open the "Start" menu and type "cmd.exe" in the search field; in Windows XP, open the "Run" dialog from the "Start" menu and enter "cmd.exe" there. After opening the console, enter a command like cd C:\folder\, in this case cd C:\3.2.

Type the following two lines in a text editor (you can also save the text as a .bat file in the folder with cRark) to guess the password of a password-protected RAR archive with unencrypted file names:

echo off
cmd /K crark (archive name).rar

to guess the password of a password-protected and encrypted RAR archive:

echo off
cmd /K crark-hp (archive name).rar

Copy the two lines from the text file into the console and press Enter (or run the .bat file).

Results

The decryption process is shown in the figure:

The guessing speed in cRark using CUDA was 1625 passwords per second. In one minute and thirty-six seconds, a 3-character password, "q)$", was found. For comparison: the search speed in Advanced Archive Password Recovery on my dual-core Athlon 3000+ processor is at most 50 passwords per second, so the search would have taken about 5 hours. In other words, brute-forcing a RAR archive password in cRark on a GeForce 9800 GTX+ video card is about 30 times faster than on the CPU.

With an Intel processor and a good motherboard with a high-frequency system bus (FSB 1600 MHz), CPU brute-force speed will be higher. And with a quad-core processor and a pair of GeForce 280 GTX-class video cards, password brute-forcing speeds up significantly. To summarize the example: this problem was solved using CUDA technology in about 2 minutes instead of 5 hours, which shows the high potential of the technology.

Conclusions

Having examined CUDA parallel computing technology using a RAR password recovery program as an example, we clearly saw its power and enormous potential. This technology will certainly find a place in the life of anyone who decides to use it, whether for scientific tasks, video processing, or economic tasks that require fast and accurate calculations; all of this leads to an inevitable increase in productivity that cannot be ignored. Today the phrase "home supercomputer" is already entering the lexicon, and it is obvious that to make it a reality many homes already have a tool called CUDA. Since the release of cards based on the G80 chip (2006), a huge number of CUDA-capable NVIDIA accelerators have been released, which can make the dream of a supercomputer in every home come true. By promoting CUDA technology, NVIDIA raises its standing in the eyes of customers by providing additional capabilities on hardware that many have already purchased. One can only expect that CUDA will continue to develop quickly and allow users to take full advantage of parallel computing on GPUs.

I once happened to talk at a computer market with the technical director of one of the many companies selling laptops. This "specialist", foaming at the mouth, tried to explain exactly what laptop configuration I needed. The main message of his monologue was that the time of central processing units (CPUs) is over, all applications now actively use computation on the graphics processor (GPU), and therefore the performance of a laptop depends entirely on the GPU, so the CPU can be ignored. Realizing that arguing with this technical director was pointless, I did not waste time and bought the laptop I needed in another pavilion. However, the sheer incompetence of the seller struck me. It would be understandable if he were trying to deceive me as a buyer; not at all, he sincerely believed what he said. Apparently, the marketers at NVIDIA and AMD earn their bread for a reason: they have managed to instill in some users the idea of the dominant role of the graphics processor in a modern computer.

The fact that graphics processing unit (GPU) computing is becoming increasingly popular today is beyond doubt. However, this does not at all diminish the role of the central processor. Moreover, if we talk about the vast majority of user applications, today their performance depends entirely on CPU performance. That is, the vast majority of user applications do not use GPU computing.

In general, GPU computing is mainly performed on specialized HPC systems for scientific computing. But user applications that use GPU computing can be counted on one hand. It should be noted right away that the term “GPU computing” in this case is not entirely correct and can be misleading. The fact is that if an application uses GPU computing, this does not mean that the central processor is idle. GPU computing does not involve transferring the load from the central processor to the graphics processor. As a rule, the central processor remains busy, and the use of a graphics processor, along with the central processor, can improve performance, that is, reduce the time it takes to complete a task. Moreover, the GPU itself here acts as a kind of coprocessor for the CPU, but in no case replaces it completely.

To understand why GPU computing is not a panacea and why it is incorrect to say that its computing capabilities are superior to those of the CPU, it is necessary to understand the difference between a central processor and a graphics processor.

Differences in GPU and CPU architectures

CPU cores are designed to execute a single stream of sequential instructions at maximum performance, while the GPU is designed for fast execution of a very large number of parallel instruction streams. This is the fundamental difference between GPUs and central processors. The CPU is a universal, general-purpose processor optimized for high performance on a single instruction stream that processes both integers and floating-point numbers. In this case, access to memory for data and instructions occurs predominantly at random addresses.

To improve CPU performance, the cores are designed to execute as many instructions as possible in parallel. For this purpose, processor cores include an out-of-order execution unit that reorders instructions relative to the order in which they arrive, which increases the level of instruction parallelism within a single thread. However, this still does not allow a large number of instructions to execute in parallel, and the overhead of parallelizing instructions inside the processor core turns out to be very significant. This is why general-purpose processors do not have a very large number of execution units.

The graphics processor is designed fundamentally differently. It was originally designed to run a huge number of parallel command streams. Moreover, these command streams are parallelized from the start, and there are simply no overhead costs for parallelizing instructions in the GPU. The GPU is designed to render images. To put it simply, it takes a group of polygons as input, carries out all the necessary operations, and produces pixels as output. Processing of polygons and pixels is independent; they can be processed in parallel, separately from each other. Therefore, due to the inherently parallel organization of work, the GPU uses a large number of execution units, which are easy to load, in contrast to the sequential stream of instructions for the CPU.

Graphics and central processors also differ in their memory access patterns. In a GPU, memory access is easily predictable: if a texture texel is read from memory, the neighboring texels will be needed shortly after. Writing works the same way: if a pixel is written to the framebuffer, the pixel next to it will be written a few clock cycles later. Therefore, unlike the CPU, the GPU simply does not need a large cache, and textures require only a few kilobytes. The principle of working with memory also differs: all modern GPUs have several memory controllers, and graphics memory itself is faster, so GPUs have significantly greater memory bandwidth than general-purpose processors, which is also very important for parallel computations operating on huge data streams.

In general-purpose processors, most of the chip area is occupied by various command and data buffers, decode units, hardware branch prediction units, instruction reordering units, and first-, second- and third-level cache memory. All of this hardware is needed to speed up the execution of a few instruction threads by parallelizing them at the level of the processor core.

The execution units themselves take up relatively little space in a universal processor.

In a graphics processor, on the contrary, the main area is occupied by numerous execution units, which allows it to simultaneously process several thousand command threads.

We can say that, unlike modern CPUs, GPUs are designed for parallel computations with a large number of arithmetic operations.

It is possible to use the computing power of GPUs for non-graphical tasks, but only if the problem being solved can be parallelized across the hundreds of execution units available in the GPU. In particular, GPU computation shows excellent results when the same sequence of mathematical operations is applied to a large volume of data. The best results are achieved when the ratio of arithmetic instructions to memory accesses is sufficiently large. Such workloads place fewer demands on execution control and do not require large caches.

There are many examples of scientific calculations where the advantage of the GPU over the CPU in terms of computational efficiency is undeniable. Thus, many scientific applications in molecular modeling, gas dynamics, fluid dynamics, and others are perfectly suited for calculations on the GPU.

So, if the algorithm for solving a problem can be parallelized into thousands of individual threads, then solving it on a GPU can be more efficient than solving it on a general-purpose processor alone. However, you cannot simply transfer the solution of a problem from the CPU to the GPU, if only because they use different instruction sets. A program written for the CPU uses the x86 instruction set (or a set compatible with a specific processor architecture), while the GPU uses completely different instruction sets that reflect its architecture and capabilities. Modern 3D games are developed with the DirectX and OpenGL APIs, which allow programmers to work with shaders and textures; however, using the DirectX and OpenGL APIs for non-graphical computing on the GPU is not the best option.

NVIDIA CUDA and AMD APP

That is why, when the first attempts to implement non-graphical computing on the GPU (General Purpose GPU, GPGPU) began, the BrookGPU compiler arose. Before its creation, developers had to access video card resources through the OpenGL or Direct3D graphics API, which significantly complicated the programming process, as it required specific knowledge - they had to learn the principles of working with 3D objects (shaders, textures, etc.). This was the reason for the very limited use of GPGPU in software products. BrookGPU has become a kind of “translator”. These streaming extensions to the C language hid the 3D API from programmers, and when using it, the need for knowledge of 3D programming practically disappeared. The computing power of video cards has become available to programmers in the form of an additional coprocessor for parallel calculations. The BrookGPU compiler processed the file with C code and extensions, building code tied to a library with DirectX or OpenGL support.

Thanks in large part to BrookGPU, NVIDIA and ATI (now AMD) took notice of the emerging technology of general-purpose computing on GPUs and began developing their own implementations that provide direct and more transparent access to the compute units of 3D accelerators.

As a result, NVIDIA has developed a hardware and software architecture for parallel computing, CUDA (Compute Unified Device Architecture). The CUDA architecture enables non-graphics computing on NVIDIA GPUs.

The public beta of the CUDA SDK was released in February 2007. The CUDA API is based on a simplified dialect of the C language. The CUDA SDK enables programmers to implement algorithms that run on NVIDIA GPUs and to embed special functions in the text of a C program. To translate code in this language, the CUDA SDK includes NVIDIA's own nvcc command-line compiler.

CUDA is cross-platform software for operating systems such as Linux, Mac OS X and Windows.

AMD (ATI) has also developed its own version of GPGPU technology, previously called ATI Stream and now AMD Accelerated Parallel Processing (APP). AMD APP is based on the open industry standard OpenCL (Open Computing Language). The OpenCL standard provides instruction-level and data-level parallelism and is an implementation of the GPGPU technique. It is a completely open standard and is royalty-free. Note that AMD APP and NVIDIA CUDA are incompatible with each other; however, recent versions of NVIDIA CUDA also support OpenCL.

Testing GPGPU in video converters

So, we have established that CUDA technology is intended to implement GPGPU on NVIDIA GPUs, while AMD GPUs use the APP API. As already noted, using non-graphical computing on the GPU makes sense only if the problem being solved can be parallelized into many threads. Most user applications do not meet this criterion, but there are exceptions. For example, most modern video converters support computing on NVIDIA and AMD GPUs.

In order to find out how efficiently GPU computing is used in custom video converters, we selected three popular solutions: Xilisoft Video Converter Ultimate 7.7.2, Wondershare Video Converter Ultimate 6.0.3.2 and Movavi Video Converter 10.2.1. These converters support the ability to use NVIDIA and AMD GPUs, and you can disable this feature in the video converter settings, which allows you to evaluate the effectiveness of using the GPU.

For video conversion, we used three different videos.

The first video was 3 minutes 35 seconds long and 1.05 GB in size. It was recorded in the mkv data storage format (container) and had the following characteristics:

  • video:
    • format - MPEG4 Video (H264),
    • resolution - 1920×1080,
    • bitrate mode - Variable,
    • average video bitrate - 42.1 Mbit/s,
    • maximum video bitrate - 59.1 Mbit/s,
    • frame rate - 25 fps;
  • audio:
    • format - MPEG-1 Audio,
    • audio bitrate - 128 Kbps,
    • number of channels - 2,

The second video had a duration of 4 minutes 25 seconds and a size of 1.98 GB. It was recorded in the MPG data storage format (container) and had the following characteristics:

  • video:
    • format - MPEG-PS (MPEG2 Video),
    • resolution - 1920×1080,
    • bitrate mode - Variable,
    • average video bitrate - 62.5 Mbit/s,
    • maximum video bitrate - 100 Mbit/s,
    • frame rate - 25 fps;
  • audio:
    • format - MPEG-1 Audio,
    • audio bitrate - 384 Kbps,
    • number of channels - 2,

The third video had a duration of 3 minutes 47 seconds and a size of 197 MB. It was written in the MOV data storage format (container) and had the following characteristics:

  • video:
    • format - MPEG4 Video (H264),
    • resolution - 1920×1080,
    • bitrate mode - Variable,
    • video bitrate - 7024 Kbps,
    • frame rate - 25 fps;
  • audio:
    • format - AAC,
    • audio bitrate - 256 Kbps,
    • number of channels - 2,
    • sampling frequency - 48 kHz.

All three test videos were converted using the video converters into the MP4 format (H.264 codec) for viewing on an iPad 2 tablet. The resolution of the output video file was 1280×720.

Please note that we did not use exactly the same conversion settings in all three converters. That is why it is incorrect to compare the efficiency of video converters themselves based on conversion time. Thus, in the video converter Xilisoft Video Converter Ultimate 7.7.2, the iPad 2 preset - H.264 HD Video was used for conversion. This preset uses the following encoding settings:

  • codec - MPEG4 (H.264);
  • resolution - 1280×720;
  • frame rate - 29.97 fps;
  • video bitrate - 5210 Kbps;
  • audio codec - AAC;
  • audio bitrate - 128 Kbps;
  • number of channels - 2;
  • sampling frequency - 48 kHz.

Wondershare Video Converter Ultimate 6.0.3.2 used the iPad 2 preset with the following additional settings:

  • codec - MPEG4 (H.264);
  • resolution - 1280×720;
  • frame rate - 30 fps;
  • video bitrate - 5000 Kbps;
  • audio codec - AAC;
  • audio bitrate - 128 Kbps;
  • number of channels - 2;
  • sampling frequency - 48 kHz.

Movavi Video Converter 10.2.1 used the iPad preset (1280×720, H.264) (*.mp4) with the following additional settings:

  • video format - H.264;
  • resolution - 1280×720;
  • frame rate - 30 fps;
  • video bitrate - 2500 Kbps;
  • audio codec - AAC;
  • audio bitrate - 128 Kbps;
  • number of channels - 2;
  • sampling frequency - 44.1 kHz.

Each source video was converted five times on each of the video converters, using both the GPU and only the CPU. After each conversion, the computer rebooted.

As a result, each video was converted ten times in each video converter. To automate this routine work, a special utility with a graphical interface was written, which allows you to fully automate the testing process.

Test stand configuration

The testing stand had the following configuration:

  • processor - Intel Core i7-3770K;
  • motherboard - Gigabyte GA-Z77X-UD5H;
  • motherboard chipset - Intel Z77 Express;
  • memory - DDR3-1600;
  • memory capacity - 8 GB (two GEIL modules of 4 GB each);
  • memory operating mode - dual-channel;
  • video card - NVIDIA GeForce GTX 660Ti (video driver 314.07);
  • drive - Intel SSD 520 (240 GB).

The operating system installed on the test stand was Windows 7 Ultimate (64-bit).

Initially, we tested the processor and all other system components in normal mode. The Intel Core i7-3770K processor operated at its standard frequency of 3.5 GHz with Turbo Boost enabled (the maximum processor frequency in Turbo Boost mode is 3.9 GHz).

We then repeated the testing process, but with the processor overclocked to a fixed frequency of 4.5 GHz (without using Turbo Boost mode). This made it possible to identify the dependence of the conversion speed on the processor frequency (CPU).

At the next stage of testing, we returned to the standard processor settings and repeated the testing with other video cards:

  • NVIDIA GeForce GTX 280 (driver 314.07);
  • NVIDIA GeForce GTX 460 (driver 314.07);
  • AMD Radeon HD6850 (driver 13.1).

Thus, video conversion was carried out on four video cards of different architectures.

The top card, the NVIDIA GeForce GTX 660 Ti, is based on a GPU codenamed GK104 (Kepler architecture), manufactured on a 28 nm process. This GPU contains 3.54 billion transistors and has a die area of 294 mm².

Recall that the GK104 graphics processor includes four graphics processing clusters (Graphics Processing Clusters, GPC). GPC clusters are independent devices within the processor and are capable of operating as separate devices, since they have all the necessary resources: rasterizers, geometry engines and texture modules.

Each such cluster has two SMX (Streaming Multiprocessor) units, but in the GK104 used here one multiprocessor in one of the clusters is disabled, so there are seven SMX multiprocessors in total.

Each SMX streaming multiprocessor contains 192 streaming compute cores (CUDA cores), so the GK104 processor has a total of 1344 CUDA cores. In addition, each SMX multiprocessor contains 16 texture units (TMU), 32 special function units (SFU), 32 load-store units (LSU), a PolyMorph engine and much more.

The GeForce GTX 460 is based on a GPU coded GF104 based on the Fermi architecture. This processor is manufactured using a 40nm process technology and contains about 1.95 billion transistors.

The GF104 GPU includes two GPC graphics processing clusters. Each of them has four SM streaming multiprocessors, but in the GF104 used here one multiprocessor in one of the clusters is disabled, so there are only seven SM multiprocessors.

Each SM streaming multiprocessor contains 48 streaming compute cores (CUDA cores), so the GF104 processor has a total of 336 CUDA cores. In addition, each SM multiprocessor contains eight texture units (TMU), eight special function units (SFU), 16 load-store units (LSU), a PolyMorph engine and much more.

The GeForce GTX 280 GPU is part of the second generation of NVIDIA's Unified GPU Architecture and is very different in architecture from Fermi and Kepler.

The GeForce GTX 280 GPU consists of Texture Processing Clusters (TPCs), which, although similar, are very different from the GPC graphics processing clusters in the Fermi and Kepler architectures. There are a total of ten such clusters in the GeForce GTX 280 processor. Each TPC cluster includes three SM streaming multiprocessors and eight texture sampling and filtering units (TMU). Each multiprocessor consists of eight stream processors (SP). Multiprocessors also contain blocks for sampling and filtering texture data, used in both graphics and some computational tasks.

Thus, in one TPC cluster there are 24 stream processors, and in the GeForce GTX 280 GPU there are already 240 of them.

Summary characteristics of the video cards on NVIDIA GPUs used in testing are presented in the table.

The table does not include the AMD Radeon HD 6850 video card, which is natural, since its technical specifications are difficult to compare directly with those of the NVIDIA cards. Therefore, we will consider it separately.

The AMD Radeon HD6850 GPU, codenamed Barts, is manufactured using a 40nm process technology and contains 1.7 billion transistors.

The AMD Radeon HD6850 processor architecture is a unified architecture with an array of common processors for streaming processing of multiple types of data.

The AMD Radeon HD6850 processor consists of 12 SIMD cores, each of which contains 16 superscalar stream processor units and four texture units. Each superscalar stream processor contains five general-purpose stream processors. Thus, in total, the AMD Radeon HD6850 GPU has 12 × 16 × 5 = 960 universal stream processors.

The GPU frequency of the AMD Radeon HD6850 video card is 775 MHz, and the effective GDDR5 memory frequency is 4000 MHz. The memory capacity is 1024 MB.

Test results

So let's look at the test results. Let's start with the first test, when we use the NVIDIA GeForce GTX 660Ti video card and the standard operating mode of the Intel Core i7-3770K processor.

Figures 1-3 show the results of converting the three test videos using the three converters, with and without the GPU.

As can be seen from the testing results, the effect of using the GPU is obvious. For the video converter Xilisoft Video Converter Ultimate 7.7.2, when using a GPU, the conversion time is reduced by 14, 9 and 19% for the first, second and third video, respectively.

For Wondershare Video Converter Ultimate 6.0.32, using a GPU reduces conversion time by 10%, 13%, and 23% for the first, second, and third video, respectively.

But the converter that benefits most from the use of a graphics processor is Movavi Video Converter 10.2.1. For the first, second and third video, the reduction in conversion time is 64, 81 and 41%, respectively.

It is clear that the benefit from using a GPU depends on both the source video and the video conversion settings, which, in fact, is what our results demonstrate.

Now let's see what the conversion time gain will be when overclocking the Intel Core i7-3770K processor to 4.5 GHz. If we assume that in normal mode all processor cores are loaded during conversion and in Turbo Boost mode they operate at a frequency of 3.7 GHz, then increasing the frequency to 4.5 GHz corresponds to a frequency overclock of 22%.

Figures 4-6 show the results of converting the three test videos with the processor overclocked, both with and without the GPU. In this case, too, using the graphics processor reduces conversion time.

For the video converter Xilisoft Video Converter Ultimate 7.7.2, when using a GPU, the conversion time is reduced by 15, 9 and 20% for the first, second and third video, respectively.

For Wondershare Video Converter Ultimate 6.0.32, using a GPU can reduce conversion time by 10, 10, and 20% for the first, second, and third video, respectively.

For Movavi Video Converter 10.2.1, the use of a graphics processor can reduce conversion time by 59, 81 and 40%, respectively.

Naturally, it's interesting to see how CPU overclocking can reduce conversion times with and without a GPU.

Figures 7-9 compare video conversion times without the graphics processor in normal processor mode and in overclocked mode. Since in this case conversion is carried out only by the CPU, without GPU computing, it is obvious that increasing the processor clock frequency reduces conversion time (increases conversion speed). It is equally obvious that the reduction in conversion time should be approximately the same for all test videos. Thus, for Xilisoft Video Converter Ultimate 7.7.2, overclocking the processor reduces conversion time by 9, 11 and 9% for the first, second and third video, respectively. For Wondershare Video Converter Ultimate 6.0.32, conversion time is reduced by 9, 9 and 10%, and for Movavi Video Converter 10.2.1 by 13, 12 and 12%, respectively.

Thus, when overclocking the processor frequency by 20%, the conversion time is reduced by approximately 10%.

Let's compare the time for converting videos using a graphics processor in normal processor mode and in overclocking mode (Fig. 10-12).

For the video converter Xilisoft Video Converter Ultimate 7.7.2, when overclocking the processor, the conversion time is reduced by 10, 10 and 9% for the first, second and third video, respectively. For Wondershare Video Converter Ultimate 6.0.32, conversion time is reduced by 9, 6 and 5% for the first, second and third video, respectively. Well, for the video converter Movavi Video Converter 10.2.1, the conversion time is reduced by 0.2, 10 and 10%, respectively.

As you can see, for Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.32, the reduction in conversion time when overclocking the processor is approximately the same with and without the graphics processor, which is logical, since these converters do not use GPU computing very efficiently. But for Movavi Video Converter 10.2.1, which uses GPU computing effectively, overclocking the processor has little effect on conversion time in GPU mode, which is also understandable, since in this case the main load falls on the graphics processor.

Now let's look at the test results with various video cards.

It would seem that the more powerful the video card and the more CUDA cores (or universal stream processors for AMD video cards) in the graphics processor, the more effective video conversion should be when using a graphics processor. But in practice it doesn’t work out quite like that.

As for video cards based on NVIDIA GPUs, the situation is as follows. When using Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.32 converters, the conversion time practically does not depend in any way on the type of video card used. That is, for the NVIDIA GeForce GTX 660Ti, NVIDIA GeForce GTX 460 and NVIDIA GeForce GTX 280 video cards in the GPU computing mode, the conversion time is the same (Fig. 13-15).

Fig. 1. Results of converting the first test video in normal processor operation mode
Fig. 13. Comparison of conversion times for the first video on different video cards in GPU mode
Fig. 14. Comparison of conversion times for the second video on different video cards in GPU mode
Fig. 15. Comparison of conversion times for the third video on different video cards in GPU mode

This can only be explained by the fact that the GPU calculation algorithm implemented in the Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.32 converters is simply ineffective and does not allow active use of all graphics cores. By the way, this is precisely what explains the fact that for these converters the difference in conversion time in modes of using the GPU and without using it is small.

In Movavi Video Converter 10.2.1 the situation is slightly different. As we remember, this converter is capable of very efficient use of GPU calculations, and therefore, in GPU mode, the conversion time depends on the type of video card used.

But with the AMD Radeon HD 6850 video card, everything went as usual: either the video card driver is "crooked", or the algorithms implemented in the converters need serious improvement, but with GPU computing enabled the results either do not improve or get worse.

More specifically, the situation is as follows. For Xilisoft Video Converter Ultimate 7.7.2, when using a GPU to convert the first test video, the conversion time increases by 43%, and when converting the second video, by 66%.

Moreover, Xilisoft Video Converter Ultimate 7.7.2 is also characterized by unstable results. The variation in conversion time can reach 40%! That is why we repeated all tests ten times and calculated the average result.

But for Wondershare Video Converter Ultimate 6.0.32 and Movavi Video Converter 10.2.1, when using a GPU to convert all three videos, the conversion time does not change at all! It is likely that Wondershare Video Converter Ultimate 6.0.32 and Movavi Video Converter 10.2.1 either do not use AMD APP technology when converting, or the AMD video driver is simply “crooked”, as a result of which AMD APP technology does not work.

Conclusions

Based on the testing, the following important conclusions can be drawn. Modern video converters really can use GPU computing technology, which increases conversion speed. However, this does not mean that all calculations are transferred entirely to the GPU and the CPU sits idle. As testing shows, when GPGPU technology is used the central processor remains busy, which means that powerful multi-core central processors remain relevant in systems used for video conversion. The exception to this rule is AMD APP technology on AMD GPUs. For example, when using Xilisoft Video Converter Ultimate 7.7.2 with AMD APP technology activated, the load on the CPU is indeed reduced, but the conversion time does not decrease; on the contrary, it increases.

In general, if we talk about converting video with the help of a graphics processor, it is advisable to use video cards with NVIDIA GPUs. As practice shows, only in this case can you achieve an increase in conversion speed. Moreover, remember that the real increase in conversion speed depends on many factors: the input and output video formats and, of course, the video converter itself. Xilisoft Video Converter Ultimate 7.7.2 and Wondershare Video Converter Ultimate 6.0.32 are poorly suited for this task, but Movavi Video Converter 10.2.1 is able to use the capabilities of an NVIDIA GPU very effectively.

As for video cards on AMD GPUs, they should not be used for video conversion tasks at all. At best, this will not give any increase in conversion speed, and at worst, you can get a decrease in it.

Using GPU Computing with C++ AMP

So far, in discussing parallel programming techniques, we have considered only processor cores. We have gained some skills in parallelizing programs across multiple processors, synchronizing access to shared resources, and using high-speed synchronization primitives without using locks.

However, there is another way to parallelize programs: graphics processing units (GPUs), which have more cores than even high-performance conventional processors. GPU cores are excellent for implementing parallel data processing algorithms, and their large number more than compensates for the inconvenience of running programs on them. In this article we will get acquainted with one way to run programs on a GPU, using a set of C++ language extensions called C++ AMP.

The C++ AMP extensions are based on the C++ language, which is why the examples in this article are in C++. However, with moderate use of the .NET interop mechanisms, you can use C++ AMP algorithms in your .NET programs. We will talk about this at the end of the article.

Introduction to C++ AMP

Essentially, a GPU is the same processor as any other, but with a special set of instructions, a large number of cores and its own memory access protocol. However, there are big differences between modern GPUs and conventional processors, and understanding them is key to creating programs that effectively use the processing power of the GPU.

    Modern GPUs have a very small instruction set. This implies some limitations: lack of ability to call functions, limited set of supported data types, lack of library functions, and others. Some operations, such as conditional branches, can cost significantly more than similar operations performed on conventional processors. Obviously, moving large amounts of code from the CPU to the GPU under such conditions requires significant effort.

    The number of cores in the average GPU is significantly higher than in the average conventional processor. However, some tasks are too small or cannot be broken down into large enough parts to benefit from the GPU.

    Synchronization support between GPU cores performing the same task is very poor, and completely absent between GPU cores performing different tasks. This circumstance requires synchronization of the graphics processor with a conventional processor.

The question immediately arises: what tasks are suitable for solving on a GPU? Keep in mind that not every algorithm is suitable for execution on a GPU. For example, GPUs don't have access to I/O devices, so you won't be able to improve the performance of a program that retrieves RSS feeds from the Internet, through the use of a graphics processor. However, many computational algorithms can be transferred to the GPU and can be massively parallelized. Below are a few examples of such algorithms (this list is by no means complete):

    increasing and decreasing sharpness of images, and other transformations;

    fast Fourier transform;

    matrix transposition and multiplication;

    number sorting;

    direct hash inversion.

An excellent source for additional examples is the Microsoft Native Concurrency blog, which provides code snippets and explanations for various algorithms implemented in C++ AMP.

C++ AMP is a framework included with Visual Studio 2012 that gives C++ developers an easy way to perform computations on the GPU, requiring only a DirectX 11 driver. Microsoft has released C++ AMP as an open specification that can be implemented by any compiler vendor.

The C++ AMP framework allows you to run code on graphics accelerators, which act as computing devices. Using the DirectX 11 driver, the C++ AMP framework dynamically detects all available accelerators. C++ AMP also includes a software accelerator, WARP, an emulator based on the conventional processor that uses multiple cores and SIMD instructions; it serves as a fallback on systems without a GPU or whose GPU lacks a DirectX 11 driver.
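As an illustration (a sketch of my own, not code from the original article), the accelerators that C++ AMP has detected, including the WARP fallback, can be enumerated like this:

#include <amp.h>
#include <iostream>
using namespace concurrency;

int main() {
    // list every accelerator that C++ AMP detected through DirectX 11
    for (const accelerator& acc : accelerator::get_all()) {
        std::wcout << acc.description
                   << L" (emulated: " << (acc.is_emulated ? L"yes" : L"no") << L")\n";
    }
    // explicitly fall back to the WARP software accelerator if desired
    accelerator::set_default(accelerator::direct3d_warp);
    return 0;
}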

Now let's look at an algorithm that can easily be parallelized for execution on a GPU. The implementation below takes two vectors of equal length and computes, element by element, the sum of the first vector and the exponential of the second. It is hard to imagine anything more straightforward:

void VectorAddExpPointwise(float* first, float* second, float* result, int length) {
    for (int i = 0; i < length; ++i) {
        result[i] = first[i] + exp(second[i]);
    }
}

To parallelize this algorithm on a regular processor, you need to split the iteration range into several subranges and run one thread of execution for each of them. We've spent a lot of time in previous articles on exactly this way of parallelizing our first prime number search example - we've seen how it can be done by creating threads manually, passing jobs to a thread pool, and using Parallel.For and PLINQ to automatically parallelize. Remember also that when parallelizing similar algorithms on a conventional processor, we took special care not to split the problem into too small tasks.
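For comparison, here is a minimal sketch of my own (not from the article) of that range-splitting approach using C++ std::thread rather than the .NET primitives mentioned above; the worker count and chunking strategy are arbitrary choices:

#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

void VectorAddExpPointwiseParallel(float* first, float* second, float* result, int length) {
    // split the iteration range into one contiguous chunk per hardware thread
    int workers = std::max(1u, std::thread::hardware_concurrency());
    int chunk = (length + workers - 1) / workers;
    std::vector<std::thread> threads;
    for (int t = 0; t < workers; ++t) {
        int begin = t * chunk;
        int end = std::min(length, begin + chunk);
        threads.emplace_back([=] {
            for (int i = begin; i < end; ++i)
                result[i] = first[i] + std::exp(second[i]);
        });
    }
    for (auto& th : threads) th.join();   // wait for every chunk to finish
}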

For the GPU, these cautions are not needed. GPUs have many cores that execute threads very quickly, and the cost of context switching is significantly lower than on conventional processors. Below is a snippet that uses the parallel_for_each function from the C++ AMP framework:

#include <amp.h>
#include <amp_math.h>
using namespace concurrency;

void VectorAddExpPointwise(float* first, float* second, float* result, int length) {
    array_view<const float, 1> avFirst(length, first);
    array_view<const float, 1> avSecond(length, second);
    array_view<float, 1> avResult(length, result);
    avResult.discard_data();
    parallel_for_each(avResult.extent, [=](index<1> i) restrict(amp) {
        avResult[i] = avFirst[i] + fast_math::exp(avSecond[i]);
    });
    avResult.synchronize();
}

Now let's examine each part of the code separately. Let's immediately note that the general form of the main loop has been preserved, but the originally used for loop has been replaced by a call to the parallel_for_each function. In fact, the principle of converting a loop into a function or method call is not new to us - such a technique has previously been demonstrated using the Parallel.For() and Parallel.ForEach() methods from the TPL library.

Next, the input data (the parameters first, second and result) are wrapped in array_view instances. The array_view class is used to wrap data passed to the GPU (the accelerator). Its template parameters specify the data type and its dimensionality. To execute instructions on a GPU that access data originally processed on a conventional CPU, someone or something must take care of copying the data to the GPU, because most modern graphics cards are separate devices with their own memory. array_view instances solve this problem: they copy data on demand and only when it is really needed.

When the GPU completes the task, the data is copied back. By instantiating array_view with a const element type, we ensure that first and second are copied into GPU memory but not copied back. Likewise, by calling discard_data() we avoid copying result from host memory to accelerator memory; that data will only be copied in the opposite direction.

The parallel_for_each function takes an extent object that specifies the form of the data to be processed and a function to apply to each element in the extent object. In the example above, we used a lambda function, support for which appeared in the ISO C++2011 (C++11) standard. The restrict (amp) keyword instructs the compiler to check whether the function body can be executed on the GPU and disables most C++ syntax that cannot be compiled into GPU instructions.

The lambda function's parameter, an index<1> object, represents a one-dimensional index. It must match the extent object being used: if we declared the extent object as two-dimensional (for example, by defining the shape of the source data as a two-dimensional matrix), the index would also need to be two-dimensional. An example of such a situation is given below.

Finally, the call to synchronize() at the end of the VectorAddExpPointwise method ensures that the results computed by the GPU in the array_view avResult are copied back into the result array.

This concludes our first introduction to the world of C++ AMP, and now we are ready for a more detailed look and for more interesting examples demonstrating the benefits of parallel computing on a GPU. Vector addition is not the best candidate for demonstrating GPU usage because of the large overhead of copying data. The next subsections show two more interesting examples.

Matrix multiplication

The first "real" example we will look at is matrix multiplication. For the implementation we take the simple cubic-time matrix multiplication algorithm rather than the Strassen algorithm, whose asymptotic running time, roughly O(n^2.807), is better than cubic. Given two matrices, an m x w matrix A and a w x n matrix B, the following program multiplies them and returns the result, an m x n matrix C:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int sum = 0;
            for (int k = 0; k < w; ++k) {
                sum += A[i * w + k] * B[k * n + j];   // matrices stored row-major in flat arrays
            }
            C[i * n + j] = sum;
        }
    }
}

There are several ways to parallelize this implementation, and if you want to parallelize this code to run on a regular processor, the right choice would be to parallelize the outer loop. However, the GPU has a fairly large number of cores, and by parallelizing only the outer loop, we will not be able to create a sufficient number of jobs to load all the cores with work. Therefore, it makes sense to parallelize the two outer loops, leaving the inner loop untouched:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
    array_view<const int, 2> avA(m, w, A);
    array_view<const int, 2> avB(w, n, B);
    array_view<int, 2> avC(m, n, C);
    avC.discard_data();
    parallel_for_each(avC.extent, [=](index<2> idx) restrict(amp) {
        int sum = 0;
        for (int k = 0; k < w; ++k) {
            sum += avA(idx[0], k) * avB(k, idx[1]);
        }
        avC[idx] = sum;
    });
    avC.synchronize();
}

This implementation still closely resembles the sequential implementation of matrix multiplication and the vector addition example given above, with the exception of the index, which is now two-dimensional and is accessed in the inner loop using the () operator. How much faster is this version than the sequential alternative running on a regular processor? Multiplying two 1024 x 1024 integer matrices, the sequential version on a regular CPU takes 7350 milliseconds on average, while the GPU version - hold on tight - takes 50 milliseconds: 147 times faster!

Particle motion simulation

The examples of solving problems on the GPU presented above have a very simple implementation of the internal loop. It is clear that this will not always be the case. The Native Concurrency blog, linked above, demonstrates an example of modeling gravitational interactions between particles. The simulation involves an infinite number of steps; at each step, new values ​​of the elements of the acceleration vector are calculated for each particle and then their new coordinates are determined. Here, the particle vector is parallelized - with a sufficiently large number of particles (from several thousand and above), you can create a sufficiently large number of tasks to load all the GPU cores with work.

The basis of the algorithm is the implementation of determining the result of interactions between two particles, as shown below, which can easily be transferred to the GPU:

// here float4 is a vector of four elements
// representing a particle involved in the operations
void bodybody_interaction(float4& acceleration, const float4 p1, const float4 p2) restrict(amp) {
    float4 dist = p2 - p1;        // the w component is not used here
    float absDist = dist.x*dist.x + dist.y*dist.y + dist.z*dist.z;
    float invDist = 1.0f / sqrt(absDist);
    float invDistCube = invDist * invDist * invDist;
    acceleration += dist * PARTICLE_MASS * invDistCube;
}

The initial data at each modeling step is an array with the coordinates and velocities of particles, and as a result of calculations, a new array with the coordinates and velocities of particles is created:

struct particle {
    float4 position, velocity;
    // implementations of the constructor, copy constructor and
    // operator= with restrict(amp) omitted to save space
};

void simulation_step(array<particle, 1>& previous, array<particle, 1>& next, int bodies) {
    extent<1> ext(bodies);
    parallel_for_each(ext, [&](index<1> idx) restrict(amp) {
        particle p = previous[idx];
        float4 acceleration(0, 0, 0, 0);
        for (int body = 0; body < bodies; ++body) {
            bodybody_interaction(acceleration, p.position, previous[body].position);
        }
        p.velocity += acceleration * DELTA_TIME;
        p.position += p.velocity * DELTA_TIME;
        next[idx] = p;
    });
}

With an appropriate graphical interface, the simulation can be very interesting to watch. The full example provided by the C++ AMP team can be found on the Native Concurrency blog. On my system with an Intel Core i7 processor and a GeForce GT 740M graphics card, simulating 10,000 particles runs at about 2.5 steps per second using the sequential version on the regular processor and at 160 steps per second using the optimized version running on the GPU: a huge increase in performance.

Before we wrap up this section, there is one more important feature of the C++ AMP framework that can further improve the performance of code running on the GPU. GPUs provide a programmable data cache (often called shared memory). The values stored in this cache are shared by all threads of execution within a single tile. Thanks to tiled memory organization, programs based on the C++ AMP framework can read data from graphics card memory into the tile's shared memory and then access it from multiple threads of execution without re-fetching it from graphics card memory. Accessing tile shared memory is approximately 10 times faster than accessing graphics card memory. In other words, you have a reason to keep reading.

To get a tiled version of a parallel loop, the parallel_for_each function is passed a tiled_extent domain, which divides the multidimensional extent object into multidimensional tiles, and a tiled_index lambda parameter, which specifies the global and local IDs of the thread within the tile. For example, a 16x16 matrix can be divided into 2x2 tiles (as shown in the image below) and then passed to the parallel_for_each function:

extent<2> matrix(16, 16);
tiled_extent<2, 2> tiledMatrix = matrix.tile<2, 2>();
parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
    // ...
});

Each of the four threads of execution belonging to the same tile can share the data stored in that tile.

When performing matrix operations in the GPU kernel, instead of the standard index<2> used in the examples above, you can use idx.global. Proper use of local tiled memory and local indexes can provide significant performance gains. To declare tiled memory shared by all threads of execution within a single tile, local variables can be declared with the tile_static specifier.

In practice, the technique of declaring shared memory and initializing its individual blocks in different threads of execution is often used:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
    // 32 bytes are shared by all threads in the tile
    tile_static int local[8];   // reconstructed declaration, sized to match the 32 bytes mentioned above
    // assign a value to the element for this thread of execution
    local[idx.local[0] * 2 + idx.local[1]] = 42;
});

Obviously, any benefit from using shared memory can only be obtained if access to this memory is synchronized; that is, threads must not access the memory until it has been initialized by one of them. Synchronization of threads within a tile is performed using tile_barrier objects (reminiscent of the Barrier class from the TPL library): threads can continue execution only after calling the tile_barrier.wait() method, which returns control only when all threads have called tile_barrier.wait(). For example:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
    // 32 bytes are shared by all threads in the tile
    tile_static int local[8];   // reconstructed declaration, sized to match the 32 bytes mentioned above
    // assign a value to the element for this thread of execution
    local[idx.local[0] * 2 + idx.local[1]] = 42;
    // idx.barrier is an instance of tile_barrier
    idx.barrier.wait();
    // now this thread can access the "local" array
    // using the indexes of other threads of execution!
});

Now is the time to translate what you have learned into a concrete example. Let's return to the implementation of matrix multiplication written without a tiled memory organization and add the described optimization to it. Let's assume that the matrix dimensions are multiples of 16, which lets us work with 16 x 16 tiles. Matrices lend themselves to block-by-block multiplication, and we can take advantage of this (in fact, dividing matrices into blocks is a typical optimization of the matrix multiplication algorithm, providing more efficient use of the CPU cache).

The essence of this technique comes down to the following. To find C(i,j) (the element in row i and column j of the result matrix), you need to compute the dot product of A(i,*) (the i-th row of the first matrix) and B(*,j) (the j-th column of the second matrix). However, this is equivalent to computing partial dot products over consecutive segments of the row and column and then summing the partial results. We can use this fact to convert the matrix multiplication algorithm into a tiled version (the underlying identity is written out after the code listing below):

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
    array_view<const int, 2> avA(m, w, A);
    array_view<const int, 2> avB(w, n, B);
    array_view<int, 2> avC(m, n, C);
    avC.discard_data();
    parallel_for_each(avC.extent.tile<16, 16>(),
        [=](tiled_index<16, 16> idx) restrict(amp) {
            int sum = 0;
            int localRow = idx.local[0], localCol = idx.local[1];
            for (int k = 0; k < w; k += 16) {
                // each thread loads one element of each 16 x 16 block into shared memory
                tile_static int localA[16][16], localB[16][16];
                localA[localRow][localCol] = avA(idx.global[0], k + localCol);
                localB[localRow][localCol] = avB(k + localRow, idx.global[1]);
                idx.barrier.wait();
                // partial dot product over this pair of blocks
                for (int t = 0; t < 16; ++t) {
                    sum += localA[localRow][t] * localB[t][localCol];
                }
                // make sure no thread overwrites the blocks before everyone is done with them
                idx.barrier.wait();
            }
            avC[idx.global] = sum;
        });
    avC.synchronize();
}
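For reference, here is the identity the tiled code relies on, written out as a sketch for a tile width of 16 and a shared dimension w that is a multiple of 16 (the indices C(i,j), A(i,k), B(k,j) match the notation used above); each inner sum is the partial dot product accumulated by one iteration of the k-loop:

$$
C_{i,j} \;=\; \sum_{k=0}^{w-1} A_{i,k}\,B_{k,j}
\;=\; \sum_{t=0}^{w/16-1} \left( \sum_{r=0}^{15} A_{i,\,16t+r}\, B_{16t+r,\,j} \right)
$$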

The essence of the described optimization is that each thread in the tile (256 threads are created for a 16 x 16 tile) initializes its own element in the 16 x 16 local copies of fragments of the original matrices A and B. Each thread individually needs only one row and one column of these blocks, but all threads together access each row and each column 16 times. This approach significantly reduces the number of accesses to main memory.

To calculate element (i,j) of the result matrix, the algorithm requires the complete i-th row of the first matrix and the j-th column of the second matrix. With the 16 x 16 tiling of threads, at k = 0 the first 16 x 16 blocks of the first and second matrices are read into shared memory. The thread of execution computing element (i,j) of the result matrix then calculates the partial dot product of the first 16 elements of the i-th row and j-th column of the original matrices.

In this example, using a tiled organization provides a huge performance boost. The tiled version of matrix multiplication is much faster than the simple version, taking approximately 17 milliseconds (for the same 1024 x 1024 input matrices), which is 430 times faster than the version running on a conventional processor!

Before we end our discussion of the C++ AMP framework, we would like to mention the tools available to developers in Visual Studio. Visual Studio 2012 offers a GPU debugger that lets you set breakpoints, examine the call stack, and read and change local variable values (some accelerators support GPU debugging directly; for others, Visual Studio uses a software simulator), and a profiler that lets you evaluate the benefits an application receives from parallelizing operations on the GPU. For more information about the debugging capabilities in Visual Studio, see the "Walkthrough: Debugging a C++ AMP Application" article on MSDN.

GPU Computing Alternatives in .NET

So far this article has only shown examples in C++; however, there are several ways to harness the power of the GPU in managed applications. One way is to use interop tools that let you offload work with GPU cores to low-level C++ components. This solution is a good fit for those who want to use the C++ AMP framework or who can reuse pre-built C++ AMP components in managed applications.

Another way is to use a library that works with the GPU directly from managed code. Several such libraries currently exist, for example GPU.NET and CUDAfy.NET (both commercial offerings). Below is an example from the GPU.NET GitHub repository demonstrating the dot product of two vectors:

public static void MultiplyAddGpu(double[] a, double[] b, double[] c)
{
    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;
    // grid-stride loop: each thread of execution processes every TotalThreads-th element
    for (int ElementIdx = ThreadId; ElementIdx < a.Length; ElementIdx += TotalThreads)
    {
        a[ElementIdx] += b[ElementIdx] * c[ElementIdx];
    }
}

I am of the opinion that it is much easier and more effective to learn a language extension (such as C++ AMP) than to try to orchestrate interactions at the library level or to make significant changes to the IL code.

So, having looked at the parallel programming capabilities of .NET and at using the GPU, no one should doubt that organizing parallel computing is an important way to increase performance. In many servers and workstations around the world, the invaluable processing power of CPUs and GPUs goes unused because applications simply do not use it.

The Task Parallel Library gives us a unique opportunity to engage all available CPU cores, although this requires solving some interesting problems of synchronization, excessive task fragmentation, and uneven distribution of work between threads of execution.

The C++ AMP framework and other general-purpose libraries for parallel computing on the GPU can be successfully used to parallelize calculations across hundreds of GPU cores. Finally, there is a previously unexplored opportunity to gain performance from cloud distributed computing technologies, which have recently become one of the main directions in the development of information technology.

GPU Computing

CUDA (Compute Unified Device Architecture) is a software and hardware architecture that enables computing on NVIDIA graphics processors that support GPGPU (general-purpose computing on graphics processing units). The CUDA architecture first appeared on the market with the release of the eighth-generation NVIDIA chip, the G80, and is present in all subsequent series of graphics chips used in the GeForce, ION, Quadro, and Tesla accelerator families.

The CUDA SDK allows programmers to implement algorithms that run on NVIDIA GPUs in a simplified dialect of the C programming language and to include special functions in the text of a C program. CUDA gives developers the opportunity, at their own discretion, to organize access to the instruction set of the graphics accelerator, manage its memory, and organize complex parallel calculations on it.
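To make this concrete, here is a minimal, hedged sketch of what such a program looks like in CUDA C: a kernel marked with __global__ plus the <<<blocks, threads>>> launch extension added to the language. The saxpy kernel name, the vector size, and the use of unified memory (cudaMallocManaged, which requires a much newer GPU than the G80 mentioned above) are choices made for this illustration, not taken from any particular SDK sample.

#include <cstdio>
#include <cuda_runtime.h>

// a GPU kernel: each thread of execution handles one element of the vectors
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    // unified memory keeps the host-side code short
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // the <<<blocks, threads>>> syntax is the CUDA extension to C for launching kernels
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expected: 5.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}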

History

In 2003, Intel and AMD were racing each other to deliver the most powerful processor. Over several years, as a result of this race, clock speeds increased significantly, especially after the release of the Intel Pentium 4.

After that rapid growth in clock frequency (between 2001 and 2003, the clock frequency of the Pentium 4 doubled from 1.5 to 3 GHz), users had to be content with the tenths of a gigahertz that manufacturers managed to bring to market (from 2003 to 2005, clock frequencies rose only from 3 to 3.8 GHz).

Architectures optimized for high clock frequencies, such as Prescott, also began to experience difficulties, and not only on the production side. Chip makers ran into the laws of physics. Some analysts even predicted that Moore's Law would cease to apply. That did not happen. The original meaning of the law is often distorted: it concerns the number of transistors on the surface of the silicon die. For a long time, an increase in the number of transistors in a CPU was accompanied by a corresponding increase in performance, which led to the distortion of its meaning. But then the situation became more complicated. CPU architects ran into the law of diminishing returns: the number of transistors that had to be added for a given increase in performance grew larger and larger, leading to a dead end.

The reason why GPU manufacturers have not encountered this problem is very simple: CPUs are designed to get maximum performance from a single stream of instructions that processes different data (both integers and floating-point numbers), performs random memory accesses, and so on. To this day, developers try to provide greater instruction-level parallelism, that is, to execute as many instructions as possible in parallel. For example, superscalar execution appeared with the Pentium, when under certain conditions two instructions could be executed per clock cycle. The Pentium Pro received out-of-order execution of instructions, which made it possible to optimize the utilization of the computing units. The problem is that there are obvious limits to executing a sequential stream of instructions in parallel, so blindly increasing the number of computational units provides no benefit, since they will still be idle most of the time.

The operation of the GPU is relatively simple. It consists of taking a group of polygons on one side and generating a group of pixels on the other. Polygons and pixels are independent of each other, so they can be processed in parallel. Thus, in a GPU it is possible to devote a large part of the die to computational units which, unlike those in the CPU, will actually be used.

The GPU differs from the CPU not only in this way. Memory access in the GPU is highly coherent: if a texel is read, then a few clock cycles later the neighboring texel will be read; when a pixel is written, a few clock cycles later the neighboring one will be written. By organizing memory intelligently, you can achieve performance close to the theoretical bandwidth. This means that the GPU, unlike the CPU, does not need a huge cache, since its role is only to speed up texturing operations. All that is needed is a few kilobytes holding the few texels used in bilinear and trilinear filters.

First computations on the GPU

The earliest attempts to apply GPUs to general-purpose computation were limited to the use of certain hardware functions, such as rasterization and Z-buffering. But in the current century, with the advent of shaders, matrix calculations began to be accelerated. In 2003, SIGGRAPH devoted a separate section to GPU computing, and it was called GPGPU (General-Purpose computation on GPU).

The best known example is BrookGPU, a compiler for the Brook stream programming language, designed to perform non-graphical computations on the GPU. Before its appearance, developers using the capabilities of video chips for computation had to choose one of two common APIs: Direct3D or OpenGL. This seriously limited the use of GPUs, because 3D graphics use shaders and textures that parallel programming specialists are not required to know about; they work with threads and cores. Brook helped make their task easier. These streaming extensions to the C language, developed at Stanford University, hid the 3D API from programmers and presented the video chip as a parallel coprocessor. The compiler processed a .br file containing C++ code and extensions, producing code linked to a library supporting DirectX, OpenGL, or x86.

The appearance of Brook aroused interest at NVIDIA and ATI and subsequently opened up a whole new sector of the market: parallel computing based on video chips.

Subsequently, some researchers from the Brook project joined the NVIDIA development team to introduce a hardware-software parallel computing strategy and open up a new market segment. The main advantage of this NVIDIA initiative is that its developers know every capability of their GPUs down to the last detail, there is no need to use a graphics API, and the hardware can be driven directly through the driver. The result of this team's efforts was NVIDIA CUDA.

Areas of application of parallel computing on the GPU

When calculations are moved to the GPU, many tasks achieve speedups of 5-30x compared to fast general-purpose processors. The largest speedups (on the order of 100x or even more!) are achieved on code that is not well suited to calculation with SSE units but is quite convenient for the GPU.

These are just some examples of speedups for synthetic code on the GPU versus SSE-vectorized code on the CPU (according to NVIDIA):

Fluorescence microscopy: 12x;

Molecular dynamics (non-bonded force calculation): 8-16x;

Electrostatics (direct and multilevel Coulomb summation): 40-120x and 7x.

A table that NVIDIA displays in all presentations shows the speed of GPUs relative to CPUs.

List of the main applications in which GPU computing is used: analysis and processing of images and signals, physics simulation, computational mathematics, computational biology, financial calculations, databases, dynamics of gases and liquids, cryptography, adaptive radiation therapy, astronomy, audio processing, bioinformatics, biological simulations, computer vision, data mining, digital cinema and television, electromagnetic simulations, geographic information systems, military applications, mine planning, molecular dynamics, magnetic resonance imaging (MRI), neural networks, oceanographic research, particle physics, protein folding simulation, quantum chemistry, ray tracing, visualization, radar, reservoir simulation, artificial intelligence, satellite data analysis, seismic exploration, surgery, ultrasound, video conferencing.

Advantages and Limitations of CUDA

From a programmer's perspective, a graphics pipeline is a collection of processing stages. The geometry block generates triangles, and the rasterization block generates the pixels displayed on the monitor. The traditional GPGPU programming model is built on top of this pipeline.

To transfer calculations to the GPU within this model, a special approach is needed. Even element-wise addition of two vectors requires drawing a figure to the screen or to an off-screen buffer. The figure is rasterized, and the color of each pixel is calculated by a given program (a pixel shader). The program reads the input data from textures for each pixel, adds them, and writes the result to the output buffer. And all of these numerous operations are needed for something that is written as a single operator in a regular programming language!

Therefore, the use of GPGPU for general-purpose computing suffers from too steep a learning curve for developers. And there are enough other restrictions, because a pixel shader is just a formula for the dependence of the final color of a pixel on its coordinates, and the language of pixel shaders is a language for writing these formulas with a C-like syntax. Early GPGPU methods are a neat trick that lets you use the power of the GPU, but without any of the convenience. The data is represented by images (textures), and the algorithm by the rasterization process. Of particular note are the very specific models of memory and execution.

NVIDIA's software and hardware architecture for GPU computing differs from previous GPGPU models in that it allows you to write programs for the GPU in real C with standard syntax and pointers, needing only a minimum of extensions to access the computing resources of video chips. CUDA is independent of graphics APIs and has a number of features designed specifically for general-purpose computing.

Advantages of CUDA over the traditional approach to GPGPU computing

CUDA provides access to 16 KB of thread-shared memory per multiprocessor, which can be used to organize a cache with higher bandwidth than texture fetches;

More efficient data transfer between system and video memory;

No need for graphics APIs with their redundancy and overhead;

Linear memory addressing, gather and scatter, ability to write to arbitrary addresses;

Hardware support for integer and bit operations.

Main limitations of CUDA:

Lack of recursion support for executable functions;

Minimum block width is 32 threads;

Closed CUDA architecture owned by NVIDIA.

The weaknesses of programming with earlier GPGPU methods are that these methods do not use the vertex shader execution units of older, non-unified architectures, data is stored in textures and output to an off-screen buffer, and multi-pass algorithms use pixel shader units. GPGPU limitations include: insufficient use of hardware capabilities, memory bandwidth constraints, the lack of a scatter operation (gather only), and the mandatory use of the graphics API.

The main advantages of CUDA over previous GPGPU methods stem from the fact that the architecture is designed for the efficient use of the GPU for non-graphics computing and uses the C programming language without requiring algorithms to be ported into a form convenient for the graphics pipeline. CUDA offers a new path to GPU computing that does not use graphics APIs and provides random memory access (scatter as well as gather). The architecture is free of the disadvantages of GPGPU, uses all of the execution units, and also expands capabilities through integer math and bit-shift operations.

CUDA opens up some hardware capabilities not available from graphics APIs, such as shared memory. This is a small memory block (16 kilobytes per multiprocessor) to which thread blocks have access. It lets you cache the most frequently accessed data and can provide higher speeds than texture fetches for the same task, which in turn reduces the bandwidth sensitivity of parallel algorithms in many applications. For example, it is useful for linear algebra, fast Fourier transforms, and image processing filters.
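As a hedged sketch of how shared memory is typically used as such a cache, the kernel below averages each element with its two neighbors, staging a block of the input in __shared__ memory so that each value is fetched from global memory only once per block; the kernel name smooth and the BLOCK size are arbitrary choices for this illustration.

// each block stages BLOCK + 2 input values in shared memory, then every
// thread reads its neighbors from the fast on-chip memory
#define BLOCK 256

__global__ void smooth(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK + 2];              // one halo element on each side
    int g = blockIdx.x * blockDim.x + threadIdx.x; // global index
    int l = threadIdx.x + 1;                       // local index inside the tile

    tile[l] = (g < n) ? in[g] : 0.0f;              // stage this thread's element
    if (threadIdx.x == 0)                          // first thread loads the left halo
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)             // last thread loads the right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;

    __syncthreads();                               // wait until the whole tile is loaded

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

// typical launch (assumes blockDim.x == BLOCK):
//   smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);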

Memory access is also more convenient in CUDA. Graphics API code writes its output as 32 single-precision floating-point values (RGBA values into eight render targets at once) to predefined locations, while CUDA supports scatter writes: an unlimited number of writes to any address. These advantages make it possible to execute on the GPU some algorithms that cannot be efficiently implemented using GPGPU methods based on graphics APIs.
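As a hedged sketch of what scatter makes possible, here is a simple histogram kernel: each thread writes, via an atomic addition, to a bin address computed from its data, something a pixel shader (which can only write to its own output position) cannot express directly. The histogram256 name and the 256-bin layout are arbitrary choices for the example.

// each thread scatters one increment into a bin chosen by its input value;
// atomicAdd resolves collisions when several threads hit the same bin
__global__ void histogram256(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // write to an arbitrary, data-dependent address
}

// typical launch (bins must hold 256 counters and be zeroed beforehand, e.g. with cudaMemset):
//   histogram256<<<(n + 255) / 256, 256>>>(data, n, bins);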

Graphics APIs also necessarily store data in textures, which requires large arrays to be packed into textures beforehand, complicating the algorithm and forcing the use of special addressing. CUDA, by contrast, lets you read data at any address. Another advantage of CUDA is the optimized data exchange between the CPU and GPU. And for developers who want low-level access (for example, when writing another programming language), CUDA offers low-level assembly language programming capabilities.

Disadvantages of CUDA

One of the few disadvantages of CUDA is its poor portability. This architecture works only on NVIDIA video chips, and not on all of them, but starting with the GeForce 8 and 9 series and the corresponding Quadro, ION and Tesla products. NVIDIA cites a figure of 90 million CUDA-compatible video chips.

CUDA Alternatives

OpenCL (Open Computing Language) is a framework for writing computer programs that perform parallel computing on various graphics and central processors. The OpenCL framework includes a programming language based on the C99 standard and an application programming interface (API). OpenCL provides instruction-level and data-level parallelism and is an implementation of the GPGPU technique. OpenCL is a completely open standard and is royalty-free.

The goal of OpenCL is to complement OpenGL and OpenAL, which are open industry standards for 3D computer graphics and audio, by taking advantage of the power of the GPU. OpenCL is developed and maintained by the non-profit consortium Khronos Group, which includes many large companies, including Apple, AMD, Intel, nVidia, Sun Microsystems, Sony Computer Entertainment and others.

CAL/IL (Compute Abstraction Layer / Intermediate Language)

ATI Stream Technology is a set of hardware and software technologies that enable AMD GPUs to be used in conjunction with a CPU to accelerate many applications (not just graphics).

Target applications for ATI Stream include computationally intensive tasks such as financial analysis or seismic data processing. The use of a stream processor made it possible to increase the speed of some financial calculations by 55 times compared to solving the same problem using only the central processor.

NVIDIA does not consider ATI Stream technology a very strong competitor. CUDA and Stream are two different technologies at different levels of development. Programming for ATI products is much more complex: their language is closer to assembly language. CUDA C, in turn, is a much higher-level language; writing in it is easier and more convenient. This is very important for large development companies. If we talk about performance, we can see that its peak value in ATI products is higher than in NVIDIA solutions. But again, it all comes down to how you harness this power.

DirectX 11 (DirectCompute)

An application programming interface that is part of DirectX, the set of APIs from Microsoft designed to run on IBM PC-compatible computers under Microsoft Windows operating systems. DirectCompute is designed to perform general-purpose computing on GPUs and is an implementation of the GPGPU concept. DirectCompute was originally published as part of DirectX 11, but later became available for DirectX 10 and DirectX 10.1.

NVIDIA CUDA in the Russian scientific community

As of December 2009, the CUDA software model was taught in 269 universities around the world. In Russia, training courses on CUDA are given at Moscow, St. Petersburg, Kazan, Novosibirsk and Perm State Universities, the International University of Nature, Society and Man "Dubna", the Joint Institute for Nuclear Research, the Moscow Institute of Electronic Technology, Ivanovo State Energy University, BSTU named after V. G. Shukhov, Bauman MSTU, the Mendeleev Russian Chemical Technical University, the Russian Scientific Center "Kurchatov Institute", the Interregional Supercomputer Center of the Russian Academy of Sciences, and the Taganrog Technological Institute (TTI SFU).