CUDA

For the past week (or maybe two) I'd been learning and using CUDA, NVidia's GPGPU programming language. This is used to run arbitrary code on the graphics processor, which is excellent for large independent data-parallel problems.

While CUDA is for the large part C with some extra keywords there have been a few pitfalls and unexpected things. typedefs do not seem to work at all, in device code the compiler fails with errors such as "syntax error, expected ')'", referring to a line of code that is syntacticly correct, and which GCC has no trouble compiling. Similarly complex data structures such as arrays of struct containing arrays of structs containing integers seem to get mis-compiled when you dereference them, again GCC has no problem. The symptom in this case is that compiling with --device-emulation works fine but without it the generated program computes the wrong value leading to a crash.

CUDA otherwise appears to work well, but there are some big differences to other parallel programming environments.

There is a large amount of hardware parallelism and concurrency available. Each video card (device) has a number of multiprocessors (the 280 GTX has 30), each multiprocessor has eight cores. Threads are organised into blocks, which are organised into a grid. All the threads from a single block (at most 768 threads) execute on a single multiprocessor where they can share local memory. Gangs (aka Warps) of 32 threads execute together in a lockstep manner. When conditional code is executed gangs break into smaller gangs to reform where the conditional branches join. Each multiprocessor can have a number of gangs, threads and blocks active and switch between them to mask memory latency in a hyper-threading manner.

It is difficult to write efficient code in CUDA but here's my advice. Keep in mind that I have only been using code for about three weeks.

  • Minimise your memory access, especially access between the host and device.
  • Where you must access memory, ensure that all the processors in the warp access the memory so that there is one transaction per half-warp rather than one transaction per thread. You may need different data structures to accomplish this. (Coalesced memory access)
  • Make use of the shared memory, it is fast.
  • Avoid branching where possible, sometimes a switch statement can be replaced by a table lookup (where the table is in cached memory - this is a trade off, work out what's best for your program
  • Where branching can't be avoided, try to organise your threads so that all the threads in the warp either take one branch or the other.
  • Be mindful of your blocksize, shared memory usage and register usage, use the occupancy calculator to ensure you get good concurrency.
  • The CUDA documentation contains comprehensive information about how long it takes to execute each instruction, how long particular memory accesses take and so on. It is my opinion that a compiler will be better at producing fast CUDA code than most programmers. This will be especially useful when the fastest data structure for coalesced memory updates is not the most expressive data structure. The DpH project performs these data structure manipulations for other reasons. There is also existing work on a CUDA backend for DpH.