CUDA
For the past week (or maybe two) I'd been learning and using CUDA, NVidia's GPGPU programming language. This is used to run arbitrary code on the graphics processor, which is excellent for large independent data-parallel problems.
While CUDA is for the large part C with some extra keywords there have been a few pitfalls and unexpected things. typedefs do not seem to work at all, in device code the compiler fails with errors such as "syntax error, expected ')'", referring to a line of code that is syntacticly correct, and which GCC has no trouble compiling. Similarly complex data structures such as arrays of struct containing arrays of structs containing integers seem to get mis-compiled when you dereference them, again GCC has no problem. The symptom in this case is that compiling with --device-emulation works fine but without it the generated program computes the wrong value leading to a crash.
CUDA otherwise appears to work well, but there are some big differences to other parallel programming environments.
There is a large amount of hardware parallelism and concurrency available. Each video card (device) has a number of multiprocessors (the 280 GTX has 30), each multiprocessor has eight cores. Threads are organised into blocks, which are organised into a grid. All the threads from a single block (at most 768 threads) execute on a single multiprocessor where they can share local memory. Gangs (aka Warps) of 32 threads execute together in a lockstep manner. When conditional code is executed gangs break into smaller gangs to reform where the conditional branches join. Each multiprocessor can have a number of gangs, threads and blocks active and switch between them to mask memory latency in a hyper-threading manner.
It is difficult to write efficient code in CUDA but here's my advice. Keep in mind that I have only been using code for about three weeks.
The CUDA documentation contains comprehensive information about how long it takes to execute each instruction, how long particular memory accesses take and so on. It is my opinion that a compiler will be better at producing fast CUDA code than most programmers. This will be especially useful when the fastest data structure for coalesced memory updates is not the most expressive data structure. The DpH project performs these data structure manipulations for other reasons. There is also existing work on a CUDA backend for DpH.
- paul's blog
- Login or register to post comments