Sunday, July 19, 2009

Optimizing memory access in CUDA

Now that we understand how to use textures in CUDA, I think it is time to tackle a simple problem and see how we can optimize some code.

Here is a description of the diffraction problem: there is an NxN input image and an NxN output image (N is typically 256). Each pixel on the output image receives NxN rays, one from each of the input pixels. Each of these rays deposits the color from the source pixel, modulated by a function (we will call this the transformer function). The transformer is a function of source position (x,y) and direction (theta, phi) of propagation.

The naive implementation is as follows:
1. Put the input image into global memory.
2. Create NxN threads, such that each of these threads operates on one of the output pixels.
3. Each of these threads runs a for loop over all the input pixels. For each ray hitting this output pixel, it computes the intensity and increments the output pixel intensity (a rough kernel sketch follows the list).
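
Something along these lines (the names and the trivial transformer are placeholders of mine, not the actual code):

// Placeholder for the modulation term: in the real code it depends on the
// source position and the direction (theta, phi) towards the output pixel.
__device__ float transformer(int ix, int iy, int ox, int oy)
{
 return 1.0f;
}

// Naive version: one thread per output pixel, looping over every input pixel.
// Both the input read and the output increment go to global memory each iteration.
__global__ void diffract_naive(const float* input, float* output, int N)
{
 int ox = blockIdx.x * blockDim.x + threadIdx.x;
 int oy = blockIdx.y * blockDim.y + threadIdx.y;
 if (ox >= N || oy >= N) return;

 for (int iy = 0; iy < N; iy++)
  for (int ix = 0; ix < N; ix++)
   output[oy * N + ox] += input[iy * N + ix] * transformer(ix, iy, ox, oy);
}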

This approach does not scale with image size. For a 512x512 image, having such a large for loop inside the kernel is not a good idea... so we will have to think of a more scalable solution:

1. Break the input image into buckets of bxb size, where b is around 8 or 16.
2. Now we have multiple kernel invocations, each one corresponding to a bucket. During each invocation, the threads consider rays only from this small bucket of the input data (a rough host-side sketch follows the list).
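
Host side, the bucketed launch might look roughly like this (the kernel name and arguments are placeholders; d_input and d_output are assumed to be device pointers that are already set up):

int b = 16;                            // bucket edge length
dim3 block(16, 16);
dim3 grid(N / block.x, N / block.y);   // still one thread per output pixel

for (int by = 0; by < N; by += b)
 for (int bx = 0; bx < N; bx += b)
 {
  // each invocation accumulates the contribution of one b x b input bucket
  diffract_bucket<<<grid, block>>>(d_input, d_output, N, bx, by, b);
 }
cudaThreadSynchronize();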

Another small complication that we introduce: to prevent aliasing, we evaluate the intensity at input pixels with small random offsets. To facilitate this, we read in an NxN random buffer. This is also in global memory in the naive implementation.

Naive Implementation, with all global memory reads and writes
Time Taken: 2m11.074s

We try to improve things by using textures for the random array:
Time Taken: 1m47.158s

Then, we use textures even for the input image:
Time Taken: 1m34.969s

Next, we notice that each thread accumulates bxb contributions per invocation, and thus writes to global memory bxb times. By maintaining this value in a temporary local register and writing it to global memory only once, we can save a lot of time:
Time Taken: 0m16.530s
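
Concretely, the accumulate-in-a-register change looks roughly like this (a sketch reusing the placeholder transformer from the naive sketch; inputTex stands in for a file-scope texture reference bound to the input image, so the input pointer from the earlier host sketch is no longer needed):

// Placeholder file-scope texture reference, bound to the input image elsewhere.
texture<float, 2, cudaReadModeElementType> inputTex;

__global__ void diffract_bucket(float* output, int N, int bx, int by, int b)
{
 int ox = blockIdx.x * blockDim.x + threadIdx.x;
 int oy = blockIdx.y * blockDim.y + threadIdx.y;
 if (ox >= N || oy >= N) return;

 float sum = 0.0f;                                    // accumulate in a local register
 for (int iy = by; iy < by + b; iy++)
  for (int ix = bx; ix < bx + b; ix++)
  {
   float src = tex2D(inputTex, ix + 0.5f, iy + 0.5f); // input read through the texture cache
   sum += src * transformer(ix, iy, ox, oy);
  }

 output[oy * N + ox] += sum;                          // one global memory write per thread per bucket
}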

The time shown is for 16x16 buckets. For other bucket sizes:

8x8 : 0m17.117s
32x32: 0m16.592s

Note that the mechanism of using square buckets also ensures locality of reference and therefore makes better use of the texture caches. To really get a sense of the improvement from the original code: a 512x512 evaluation used to take around 20 minutes and now takes 4m15s.

If you remember from my earlier post about diffraction, I wrote that a 256x256 image is evaluated in around 5 seconds by a GLSL shader implementation. We are now about 3x slower than GLSL. The reason for this is that the GLSL code uses vectorization for the RGB components. Using vectorization in CUDA should take better advantage of the hardware and hopefully give us the remaining 3x performance boost.

Tuesday, July 14, 2009

Texture Memory: Any abstraction is bad abstraction

My long battle with CUDA textures has finally come to an end. Who won? Well, it was mostly a compromise. You see, things would go very smoothly if I was not trying to make C++ classes out of everything.

To use textures in CUDA, we first define a channel format descriptor (cudaChannelFormatDesc) and tell it how many bits we want in each texture channel (rgba), and what data format is stored in the texture (signed/unsigned ints, or floats).

cudaChannelFormatDesc channelDesc;
//bits per channel (x, y, z, w) and a cudaChannelFormatKind (Signed / Unsigned / Float)
channelDesc = cudaCreateChannelDesc(rBits, gBits, bBits, aBits, datatype);


Then we create the CUDA array that stores the data. This is where we decide how big our texture is.

cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);


Then we copy the texture data from CPU to GPU:

cudaMemcpyToArray(cuArray, 0, 0, cpuPtr, width*height*sizeof(T), cudaMemcpyHostToDevice);


The final step is to create a texture reference, specify the filtering and addressing modes of this texture (clamped / linear filter / normalized etc...) and bind this reference to the memory that we allocated:

texture<float, 2, cudaReadModeElementType> texRef1;
texRef1.addressMode[0] = cudaAddressModeClamp;   // u
texRef1.addressMode[1] = cudaAddressModeClamp;   // v (a 2D texture only uses these two)

//Modes: cudaFilterModePoint, cudaFilterModeLinear
texRef1.filterMode = cudaFilterModePoint;

//Normalized addressing means 0 to 1 address, else 0 to width and 0 to height address
texRef1.normalized = false;
cudaBindTextureToArray(texRef1, cuArray, channelDesc);


This texture is read inside the CUDA kernel as follows:

float something = tex2D(texRef1, u, v);


This info can be found anywhere. But now is when things get interesting... The CUDA documentation specifies: "A texture reference is declared at file scope as a variable of type texture".

What this really means is "Thou shalt not make local texture references. Thou shalt also not attempt to pass references around as function parameters. Any attempt to make a texture reference at anything other than translation unit scope shalt be dealt with harshly (by a weird ptx compiler error or by a nvcc crash)." More specifically, you will either see an Open64 compiler crash dump, or a slightly less discouraging PTX error: "State space mismatch between instruction and address in instruction 'tex'"

So it is effectively impossible to make a "texReference" class and put everything into it. The main part, the texture reference itself, should be declared outside the class, at global scope in the same file as your kernel. CRAZY!!

The final abstraction I settled for is this:
template <class T>
class texArrayHandle{
 cudaArray* cuArray;
 cudaChannelFormatDesc channelDesc;
 int width, height;
public: ...
};


The texture reference is still declared outside the class, and the class only hides the creation and destruction of the cudaArray. That is around 50 percent abstraction and 50 percent C. At least my "create array, save to texture, readback texture" example works. Now, to test this in a 'real world' scenario.
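
For what it's worth, usage ends up looking something like this (just a sketch; the accessor names on texArrayHandle are hypothetical):

// kernel.cu -- the texture reference itself still lives at file scope
texture<float, 2, cudaReadModeElementType> texRef1;

void uploadAndBind(texArrayHandle<float>& handle, float* cpuData, int w, int h)
{
 // hypothetical accessors: the class only wraps the cudaArray and its channel descriptor
 cudaMemcpyToArray(handle.getArray(), 0, 0, cpuData,
                   w * h * sizeof(float), cudaMemcpyHostToDevice);
 cudaBindTextureToArray(texRef1, handle.getArray(), handle.getChannelDesc());
}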

Sunday, July 12, 2009

Compiling CUDA projects

I find that a lot of new CUDA developers have this tendency to use the CUDA SDK makefiles (include common.mk) and to depend heavily on the CUDA utility functions that the SDK provides. In the long run this does not seem like a good idea, as it means depending on whatever the SDK makefile does, and hence on the (heavy) SDK itself. Here is a small tutorial on how to compile a standalone CUDA program with multiple h, cpp and cu files, and a few external headers/libs:

Let's say we have 4 files: main.cpp, test.cpp, test.h, kernel.cu

The idea is that we have to compile the cpp files with g++, and the cu files with nvcc.
Also, we specify that only compilation should take place (no linking), so that if we have any dependencies between files, we won't get zillions of errors.

To do this, compile the cpp files this way:

g++ -c *.cpp

Then compile the cu files:

nvcc -c *.cu

Note that nvcc may not be in your default path. When you install the CUDA toolkit, nvcc is placed in the toolkit's bin directory. For example, if you installed the CUDA toolkit in /usr/local/cuda, then nvcc is at /usr/local/cuda/bin/nvcc. Adding this directory to your $PATH variable will fix things.
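
For example, for a bash shell (adjust the prefix to match your install):

export PATH=$PATH:/usr/local/cuda/bin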

The above steps should produce 3 object files: main.o, test.o, and kernel.o

Finally, link these together and make the final executable:

g++ -o runme *.o

This will link all the object files into a single executable called "runme", which you can run as usual.

To include any specific headers / libraries, simply pass them as arguments during the compile / link phases. For example:

g++ -c -I/usr/local/cuda/include *.cpp
g++ -o runme -L/usr/local/cuda/lib -lcuda -lcudart *.o

While running CUDA programs, you may get a library not found error. To fix this, add the CUDA library directory to your $LD_LIBRARY_PATH environment variable. The libraries (libcuda.so and libcudart.so) live under your CUDA toolkit install path.
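
For example (again assuming a /usr/local/cuda install):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib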

All this can be condensed into a nice makefile for convenience.
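
Something like the following works as a starting point (the paths and flags are assumptions, adjust them for your setup; recipe lines must be indented with tabs):

CUDA_PATH ?= /usr/local/cuda
NVCC      := $(CUDA_PATH)/bin/nvcc
CXX       := g++
CXXFLAGS  := -I$(CUDA_PATH)/include
LDFLAGS   := -L$(CUDA_PATH)/lib -lcuda -lcudart

OBJS := main.o test.o kernel.o

runme: $(OBJS)
	$(CXX) -o $@ $(OBJS) $(LDFLAGS)

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $<

%.o: %.cu
	$(NVCC) -c $<

clean:
	rm -f runme *.o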

Finally, the part about actually writing CUDA programs that are independent of the SDK. Note that the SDK uses macros like CUT_DEVICE_INIT() which are in practice not needed at all. Simply include the CUDA header files and start making CUDA calls.
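
A minimal SDK-free program needs nothing more than the runtime header. A sketch:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void fillKernel(int* out)
{
 out[threadIdx.x] = threadIdx.x;   // trivial kernel, just to have something to run
}

int main()
{
 int h_out[32];
 int* d_out;
 cudaMalloc((void**)&d_out, 32 * sizeof(int));   // no CUT_DEVICE_INIT() required
 fillKernel<<<1, 32>>>(d_out);
 cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
 printf("last element: %d\n", h_out[31]);
 cudaFree(d_out);
 return 0;
}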

Monday, July 6, 2009

CUDA Warps and Branching

Hi, it's been a long time since I wrote something here. Over the past week I haven't really been doing much: read a few papers and tutorials on raytracing / GPU raytracing... and started on the CPU raytracer framework.

That apart, I have also been experimenting with GPU code, and here is a small test I did. It is well documented that current nVidia GPUs have warp sizes of 32 threads and that each thread in a warp should ideally take the same branch. The idea here is that instructions are issued not to each thread, but to each warp. So, if there are 31 threads in a warp that do not have anything to do, and 1 thread that has to do heavy computation, then the entire heavy computation code is issued to the whole warp, and those instructions are simply ignored by the 31 idle threads. This delays the execution of the entire warp.

But how bad is this really? Can we verify this as a fact? Let's find out!

Here is the sample kernel:

__global__ void simulate_gpu()
{
 int idx=blockIdx.x*blockDim.x+threadIdx.x;

 //heavy branch: 320 iterations (the results are unused; this is purely a timing experiment)
 if ( idx % 64 < 32)
  for (int i=idx; i < idx+320; i++){
   double theta1 = sin((double)i / 15);
   double theta2 = sin((double)(i + 1) / 15);
   double theta3 = fmax(theta1, theta2);
   double theta4 = cos( sqrt (10.0 * theta3) );
   double theta5 = pow ( theta3, theta4 );
  }
 //light branch: the same work, but only 4 iterations
 else
  for (int i=idx; i < idx+4; i++){
   double theta1 = sin((double)i / 15);
   double theta2 = sin((double)(i + 1) / 15);
   double theta3 = fmax(theta1, theta2);
   double theta4 = cos( sqrt (10.0 * theta3) );
   double theta5 = pow ( theta3, theta4 );
  }
}


The above conditional generates a square-wave response: the first 32 thread indices take the true branch, the next 32 take the false branch, and so on. Since the warp size is 32, every warp takes exactly one branch. Note that one of the branches has 320 iterations, while the other branch has only 4 iterations.

The running time for this code for 1994096 kernel executions (including all overheads, as measured using the unix 'time' command) is 0.872 seconds, averaged over 4 runs.
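
For reference, the kernel is driven from a tiny host program along these lines (the block and grid sizes here are my guesses, not the exact configuration used for the timings):

// Hypothetical launcher -- the block/grid configuration is a guess.
int main()
{
 int numThreads = 1 << 20;                 // total number of threads to launch
 int blockSize = 256;
 int gridSize = (numThreads + blockSize - 1) / blockSize;

 simulate_gpu<<<gridSize, blockSize>>>();
 cudaThreadSynchronize();                  // make sure 'time' sees the kernel finish
 return 0;
}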

Let's modify the condition so it looks like this:


 if ( idx % 32 < 16)


This doubles the frequency of the square wave: the first 16 thread indices take the true branch, the next 16 the false branch, and so on. Now every warp contains threads from both branches.

The running time for this is: 1.370s (again avg over 4 runs). That is around 57% longer than the ideal case.

If this is not convincing enough, let's try something else. Let's remove the branch entirely:


__global__ void simulate_gpu()
{
 int idx=blockIdx.x*blockDim.x+threadIdx.x;

  for (int i=idx; i < idx+320; i++){
   double theta1 = sin((double)i / 15);
   double theta2 = sin((double)(i + 1) / 15);
   double theta3 = fmax(theta1, theta2);
   double theta4 = cos( sqrt (10.0 * theta3) );
   double theta5 = pow ( theta3, theta4 );
  }
 }


This is the part of the earlier branching code that has 320 iterations. Time for execution is 1.394s. That means that even though half the threads in the divergent version only need 4 iterations, it costs almost as much as making every thread run 320 iterations, unless your warps are well behaved. Just to put things into perspective, running only the 4-iteration half of the branch on its own takes 0.140s.

Granted, this way of measuring execution time is not all that accurate... but it goes to show that optimising code to minimise warp divergence is worth it.

Also, what happens if we just miss completely perfect warp control flow? What if one or two threads go bad? Let's find out. By setting the period of the square wave slightly off from the warp-aligned 64 (say 62), we can see how bad it really is. Here are a few such results:


 if ( idx % 62 < 31)


Time reported: 1.063s


 if ( idx % 66 < 33)


Time reported: 1.063s

Hmm, so we don't incur the full penalty. I did a few more tests for smaller numbers (56, 48 etc) and they also report similar execution times...

The next article will hopefully be on memory access optimisations... once I figure it all out and implement CUDA memory access as C++ convenience classes.