Welcome to the second tutorial on writing high-performance CUDA-based applications. This tutorial covers the basics of how to write a kernel and how to organize threads, blocks, and grids. We will complete the previous tutorial by writing a kernel function.

The goal of this application is very simple: take two arrays of floating point numbers, perform an operation on them, and store the result in a third floating point array. We will then study how fast the code executes on a CUDA device and compare it to a traditional CPU. The data analysis takes place toward the end of the article. If you haven’t read the first CUDA tutorial, it may be a good idea to go back and read it first.

Organizing threads

A critical part of designing CUDA applications is organizing threads, thread blocks, and grids appropriately. For this application, the simplest choice is to have each thread calculate one, and only one, element of the final result array. A general guideline is that a block should consist of at least 192 threads in order to hide memory access latency, so 256 and 512 threads are common and practical choices. For the purposes of this tutorial, 256 threads per block is chosen.

It’s best to think of a thread block as a 3D block of threads. You may shape the block essentially any way you would like. For some applications, it may make sense to shape a block as 16x16x1. For our application, we are dealing with linear data, so it’s probably simplest to keep the thread structure linear. Therefore, the blocks will all be shaped with dimensions 256x1x1. The variable that holds this shape is of type dim3, and it is passed when calling the CUDA kernel.

Now it’s time to think about how we’re going to structure the blocks. Exactly like the thread block, you may think of each ‘grid’ as a 3D brick filled with blocks. Naturally, since our problem is linear, we would like the grid to have a linear structure as well. For this tutorial, I wanted to support having each array have about 32 million elements.

Size of each array = 1024 x 1024 x 32 = 33,554,432 total number of elements
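To make the thread organization described in this tutorial concrete, here is a minimal sketch of how a 256x1x1 block and a linear grid could fit together for 33,554,432 elements. Several things in it are assumptions rather than part of the tutorial: the kernel name `addArrays`, the helper names `blocksNeeded` and `check`, and the choice of element-wise addition as the "operation" (the text does not say which operation is used). It also assumes a device whose grid x-dimension can hold 131,072 blocks, which is true for compute capability 3.0 and later; older devices cap each grid dimension at 65,535 and would need a two-dimensional grid instead.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the tutorial only says "perform an operation",
// so element-wise addition is assumed here for illustration.
__global__ void addArrays(const float* a, const float* b, float* result,
                          unsigned int n)
{
    // Linear blocks in a linear grid: each thread gets one global index.
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard in case n is not a multiple of the block size
        result[i] = a[i] + b[i];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements
// (ceiling division).
unsigned int blocksNeeded(unsigned int n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main()
{
    const unsigned int n = 1024u * 1024u * 32u;  // 33,554,432 elements
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Host arrays with placeholder values (not from the tutorial).
    std::vector<float> hostA(n, 1.0f), hostB(n, 2.0f), hostResult(n, 0.0f);

    float *devA = nullptr, *devB = nullptr, *devResult = nullptr;
    check(cudaMalloc(&devA, bytes), "cudaMalloc devA");
    check(cudaMalloc(&devB, bytes), "cudaMalloc devB");
    check(cudaMalloc(&devResult, bytes), "cudaMalloc devResult");

    check(cudaMemcpy(devA, hostA.data(), bytes, cudaMemcpyHostToDevice), "copy A");
    check(cudaMemcpy(devB, hostB.data(), bytes, cudaMemcpyHostToDevice), "copy B");

    dim3 threads(256, 1, 1);                        // one block: 256x1x1 threads
    dim3 blocks(blocksNeeded(n, threads.x), 1, 1);  // linear grid of blocks

    addArrays<<<blocks, threads>>>(devA, devB, devResult, n);
    check(cudaGetLastError(), "kernel launch");

    // cudaMemcpy waits for the kernel to finish before copying back.
    check(cudaMemcpy(hostResult.data(), devResult, bytes, cudaMemcpyDeviceToHost),
          "copy result");

    std::printf("first = %f, last = %f\n", hostResult[0], hostResult[n - 1]);

    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devResult);
    return 0;
}
```

With 256 threads per block, 33,554,432 / 256 works out to exactly 131,072 blocks, so every thread maps to one element. The `i < n` guard is still worth keeping, since it costs very little and lets the same kernel handle array sizes that are not a multiple of the block size.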