This Gist contains notes and useful links about programming in Cuda, a language based on C/C++ for general purpose programming on Nvidia GPUs. Much of these notes are based on the book "CUDA by Example" by Jason Sanders and Edward Kandrot. There is also an example Cuda program, notes on Thrust ("the CUDA C++ template library"), and miscellaneous topics.
- Notes on Cuda
- Contents
- General tips about CUDA
- Example Cuda progrm
- Useful links
- Notes on "CUDA by Example" by Jason Sanders and Edward Kandrot
- Introduction
- General tips about CUDA (specifically from "CUDA by Example")
- Functions and kernels (chapter 3)
- Allocating, copying and releasing memory (chapter 3)
- Blocks and threads (chapters 4 and 5)
- Reductions (chapter 5)
- Profiling (chapter 6)
- Constant memory (chapter 6)
- Atomics (chapter 9)
- Streams (chapter 10)
- Multiple GPUs (chapter 11)
- ...
- Error checking
- Thrust
- To write a CUDA source file, use the filename suffix
.cu - Use
nvccto compile a CUDA source file, EGnvcc main.cu -o main - A program compiled using
nvccis run exactly the same as a normal program, EG./main - Cuda kernels are prefixed with the keyword
__global__and must always returnvoid - Any pointer which is passed to a Cuda kernel should point to device (GPU) memory; otherwise the program may fail silently, and parts of the program may simply not execute
Below is an example Cuda program, which calculates the square of each element in a vector of doubles. This example demonstrates some major, commonly used components of Cuda, such as defining a Cuda kernel, launching a Cuda kernel, device (GPU) memory allocation, copying memory between the device and the host (CPU), freeing GPU memory, and profiling. If the code is saved in a file called square_vector.cu, it can be compiled and run with the command nvcc square_vector.cu -o square_vector && square_vector. For an input vector size of N = 10000 on a Jetson Nano dev board this took 0.168 ms vs 0.192 ms for an equivalent program using the CPU, whereas for an input vector size of N = 100000 this took 0.597 ms on the GPU vs 1.72 ms on the CPU.
#include "stdio.h"
#define N (10000)
#define THREADS_PER_BLOCK (256)
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
/* Define profiling macros */
#define PROFILING_INIT \
cudaEvent_t start, stop; \
float elapsedTime;
#define PROFILING_START \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, 0);
#define PROFILING_STOP \
cudaEventRecord(stop, 0); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
printf("Time elapsed: %.3g ms\n", elapsedTime);
/* Define the Cuda kernel */
__global__ void square_vector(double* data_in, double* data_out, int n) {
int thread_start_idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (
int i = thread_start_idx;
i < n;
i += stride
) {
data_out[i] = data_in[i] * data_in[i];
}
}
/* Allocate host (CPU) inputs and outputs */
double data_in[N];
double data_out[N];
int main() {
/* Initialise inputs */
for (int i = 0; i < N; i++) {
data_in[i] = (double) i;
}
/* Allocate device memory for inputs and outputs, and copy input data from
host to device memory */
double *dev_data_in;
cudaMalloc((void**) &dev_data_in, N * sizeof(double));
cudaMemcpy(dev_data_in, data_in, N * sizeof(double), cudaMemcpyHostToDevice);
double *dev_data_out;
cudaMalloc((void**) &dev_data_out, N * sizeof(double));
/* Initialise profiling */
PROFILING_INIT;
PROFILING_START;
/* Call the Cuda kernel */
square_vector<<<CEIL_DIV(N, THREADS_PER_BLOCK), THREADS_PER_BLOCK>>>(
dev_data_in,
dev_data_out,
N
);
/* Stop profiling and print the running time */
PROFILING_STOP;
/* Copy output data from device back to host memory */
cudaMemcpy(data_out, dev_data_out, N * sizeof(double), cudaMemcpyDeviceToHost);
/* Compare actual output with expected output */
for (int i = 0; i < N; i++) {
if (data_out[i] != (data_in[i] * data_in[i])) {
printf(
"Fail: element %i was %.1f, expected %.1f\n",
i,
data_out[i],
data_in[i] * data_in[i]
);
}
}
/* Free allocated device memory */
cudaFree(dev_data_in);
cudaFree(dev_data_out);
}- Unified Memory for CUDA Beginners
- An Even Easier Introduction to CUDA (including profiling with
nvprof) - NVidia official CUDA Samples
- Homepage for "CUDA by Example"
- Thrust
"CUDA by Example" is a book by Jason Sanders and Edward Kandrot which explains the fundamentals of programming NVidia GPUs using CUDA. The homepage for the book is here, from which all of the source code from the book can be downloaded (I found that for some reason I couldn't download the source code using the Google Chrome web browser, but I was able to download the source code using the Microsoft Edge web broswer). All of the code from the book can be compiled and run on a GPU such as a Jetson Nano Developer Kit (although note that some examples require additional compiler flags, EG -lGL -lGLU -lglut for the Julia Set examples from chapter 4 of the book).
Before starting, make sure that nvcc is present on your system path. On a freshly set up Jetson Nano Developer Kit, this involves appending PATH=$PATH:/usr/local/cuda-10.2/bin/ to the end of ~/.bashrc, but apart from this, a freshly set up Jetson Nano Developer Kit supports everything required for nvcc, ssh, rsync, and X11 forwarding to view graphical interfaces over ssh (although it may be necessary to run the command export DISPLAY=localhost:0.0 on a local WSL machine connecting to a Jetson Nano over ssh with X11 forwarding).
The CUDA language features (keywords, function names, etc) specified below are all built-in to CUDA (IE don't require including a specific header file), unless otherwise specified. Filenames are assumed to refer to the source code for the book, which can be downloaded from the above link to the book's website.
- To get information about all GPU devices currently available on a given machine, compile and run
cuda_by_example/chapter03/enum_gpu.cu, which prints information tostdoutsuch as device name, compute capability, clock rate, total global and constant memory, the maximum number of threads per block, the maximum thread dimensions, and the maximum grid size - Error checking is important for calls to built-in Cuda functions
- Many built-in Cuda functions (EG for GPU memory allocation) return a variable of type
cudaError_t, indicating if they were successful or not - If the return code is not equal to
cudaSuccess, then this indicates that an error occured - If an error occured, an error string can be acquired by passing the return code to
cudaGetErrorString(code) - For example, within a Cuda error-checking function, this error code might be printed to
stderr, along with the filename and line of the function call, after which the error-checking function could callexit( EXIT_FAILURE )orexit(code) - The filename and line number of the error can be passed to the error-checking function by wrapping it in a macro function, accepts the error code and calls the error-checking function with the error code as well as
__FILE__and__LINE__
- Many built-in Cuda functions (EG for GPU memory allocation) return a variable of type
- A normal C-style function in a CUDA source file will run on the CPU unless otherwise specified
- The
mainfunction should run on the CPU - A CUDA kernel is a function that can run on GPU hardware
- To define a kernel that can be called from both the CPU and GPU, prepend the function definition with the keyword
__global__, EG__global__ void add( int *a, int *b, int *c ) {...} - To define a kernel that can only be called from functions running on the GPU, prepend the function definition with the keyword
__device__(this keyword can also be used for functions defined within astruct, as demonstrated in section 4.2.2 of the book) - To call a global kernel from a CPU function, use triple angle brackets, EG
add<<<N,1>>>( dev_a, dev_b, dev_c );
- To allocate memory on the GPU, use the
cudaMallocfunction, which accepts the address of a pointer (a pointer to a pointer) for the memory, and the number of bytes to by allocated, EGint *dev_a; cudaMalloc( (void**)&dev_a, N * sizeof(int) )- After calling
cudaMalloc,dev_awill now point to a correctly sized buffer on the GPU
- After calling
- To copy memory from an array on the CPU to an array on the GPU, use the
cudaMemcpyfunction with thecudaMemcpyHostToDevicekeyword, EGint a[N]; cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );(assuming thatdev_ahas been declared and allocated as described in the previous example) - To copy memory from an array on the GPU to an array on the CPU, use the
cudaMemcpyfunction with thecudaMemcpyDeviceToHostkeyword, EGcudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) - To free allocated GPU memory, use the
cudaFreefunction, EGcudaFree( dev_a )
- When calling a CUDA kernel using angle bracket notation, the first number in the angle brackets refers to the number of blocks, and the second number refers to the number of threads
- Often it is useful at the start of a kernel to calculate an index which is specific to each thread, and is used to decide which element of the inputs/outputs to process in the thread which is currently executing
- This is achieved using:
- The index of the thread within the block, EG
threadIdx.x - The index of the block within the grid, EG
blockIdx.x - The dimensions of each block of threads, EG
blockDim.x - The dimensions of the grid of blocks, EG
gridDim.x
- The index of the thread within the block, EG
- With 1D arrays of blocks and threads, it is possible to get a unique index for each thread in the program using the expression
int tid = threadIdx.x + blockIdx.x * blockDim.x;- In this example, if there are more elements to be processed than there are threads and blocks, the thread index can be incremented in a loop (EG while
tidis less than the total number of elements to be processed) using the expressiontid += blockDim.x * gridDim.x;
- In this example, if there are more elements to be processed than there are threads and blocks, the thread index can be incremented in a loop (EG while
- If there is only 1 thread per block and a 1D array of blocks, we can simply use
int tid = blockIdx.x;, and incrementtidin a loop while it is less than the number of elements to be processed using the expressiontid += gridDim.x;- Note that using only 1 thread per block is likely to perform more slowly than using many threads per block (EG 256)
- This is achieved using:
- Threads can use memory which is shared between threads, but not between blocks, by declaring the memory using the
__shared__keyword, EG__shared__ float cache[threadsPerBlock]; - Threads can wait for other threads within the same block to finish processing using the
__syncthreads()function, which is useful for example for performing reductions (see section Reductions below) - It is possible to define higher dimensional arrays of blocks and threads, by declaring a
dim3variable (which represents a 3D tuple) and passing it to the kernel call, for exampledim3 blocks(DIM/16,DIM/16); dim3 threads(16,16); kernel<<<blocks,threads>>>( d->dev_bitmap, ticks );- This might be useful for example if processing all pixels of an image in parallel
- In this case, the offset for each thread can be calculated as follows:
int x = threadIdx.x + blockIdx.x * blockDim.x; int y = threadIdx.y + blockIdx.y * blockDim.y; int offset = x + y * blockDim.x * gridDim.x;
- There is a limit to the number of blocks and threads that can be launched
- The maximum number of blocks is 65,535 (according to section 4.2.1 of the book, this is a hardware-imposed limit)
- The maximum number of threads per block is platform-specific (the maximum number of threads on the Jetson Nano is 1024, whereas section 5.2.1 of the book states that "For many of the graphics processors currently available, this limit is 512 threads per block")
- When the number of outputs is proportional to the number of inputs (EG when adding two vectors together element-wise, or evaluating a function at every point in an array), a GPU can hypothetically calculate the answer in constant time, independent of the number of inputs, if it has enough internal parallel processing units, by assigning one thread/processing unit to each element of the input/output data
- Summing over all the values in an array to produce a scalar output is a fundamentally different type of operation that cannot be performed in constant time, even with an infinite number of internal parallel processing units
- This type of computation which takes an input array and operates on all the elements to produce a smaller array is called a reduction
- Naively, a reduction such as summing an array could be performed by a single thread and a single processing unit by adding up each element one by one, in time which is proportional to the length of the input array
- Such a reduction can be performed more efficiently in time that is proportional to the logarithm of the length of the input array, using shared memory and thread synchronisation (using the
__shared__keyword and the__syncthreads()function mentioned above) - An example of a kernel which calculates the dot-product of 2 arrays can be found in
cuda_by_example/chapter05/dot.cu, which demonstrates how to implement a reduction efficiently
- To initialise GPU profiling (which can be used to profile kernel calls, copying memory between the CPU and GPU, etc):
| Instruction | Code |
|---|---|
| Declare the following variables | cudaEvent_t start, stop; float elapsedTime; |
| Initialise the Cuda event variables | cudaEventCreate( &start ); cudaEventCreate( &stop ); |
| Record the start of the profiling | cudaEventRecord( start, 0 ); |
| Perform code which is to be profiled | |
| Record the end of the profiling | cudaEventRecord( stop, 0 ); |
| Synchronise the CPU with the GPU | cudaEventSynchronize( stop ); |
| Calculate the elapsed time | cudaEventElapsedTime( &elapsedTime, start, stop ) |
Free the memory created by cudaEventCreate |
cudaEventDestroy( start ); cudaEventDestroy( stop ) |
| Print the elapsed time | printf( "Time elapsed: %3.1f ms\n", elapsedTime ); |
This set of commands can be followed up into 3 macros as follows:
/* *** PROFILING MACROS *** */
/* Initialise profiling: use this macro once per function before calling
PROFILING_START and PROFILING_STOP */
#define PROFILING_INIT \
cudaEvent_t start, stop; \
float elapsedTime;
/* Record the start time for profiling: use this macro immediately before any
call(s) that are to be profiled */
#define PROFILING_START \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, 0);
/* Record the stop time for profiling: use this macro immediately after any
call(s) that are to be profiled */
#define PROFILING_STOP \
cudaEventRecord(stop, 0); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
printf("Time elapsed: %.3g ms\n", elapsedTime);These macros can be used as follows:
int main() {
PROFILING_INIT;
PROFILING_START;
my_kernel<<<4096, 256>>>();
PROFILING_STOP;
PROFILING_START;
my_kernel<<<2048, 512>>>();
PROFILING_STOP;
}In addition to the notes on error checking mentioned above in the context of the "CUDA By Example" book, this Stack Overflow answer provides a useful method for checking if an error occured during a kernel call:
kernel<<<blocks, threads>>>(params);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
printf("Error: %s\n", cudaGetErrorString(err));Cuda memory errors can be investigated using CUDA-MEMCHECK, which in the simplest case can be used by just calling cuda-memcheck along with the name of the executable:
nvcc app_name.cu -o app_name
cuda-memcheck app_name
As stated in the documentation, "Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C. Thrust provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be composed together to implement complex algorithms with concise, readable source code."
The API reference guide for Thrust can be found here, and the full API documentation can be found here.
- Thrust functions can't be called from inside a Cuda kernel, they can only be called from host (CPU) functions (source)
- Thrust is based on the Standard Template Library (STL), and many Thrust functions correspond to an analogous STL function, EG
thrust::transformandstd::transform - Many Thrust functions accept instances of functors, which are essentially classes (or
structs) which defineoperator()(IE are callable, see example below) - Generalised reductions, transformed reductions, and inner products can be performed using
thrust::reduce,thrust::transform_reduce, andthrust::inner_product, which might be useful EG if generalising existing Thrust functions to complex numbers - Stream Compaction is an established solved problem which refers to removing all elements from an array which don't satisfy a certain condition, and can be achieved using
thrust::copy_if, which is found in the Thrust API documentation at API/Algorithms/Reordering/Stream Compaction - If a Thrust vector is initialised (EG with
thrust::device_vector<double> dev_data_in(N);), sometimes it can be useful to cast this vector to a raw pointer (EG for compatibility with other inputs to a template function), which can be performed with an expression such as(double*) thrust::raw_pointer_cast(&dev_data_in[0]), or the vector can be cast to athrust::device_ptr<double>with the expressionthrust::device_pointer_cast(&dev_data_in[0])or simply&dev_data_in[0] - Many Thrust functions are template functions, and can accept many combinations of different input types, including iterators of Thrust vectors and/or raw pointers (as long as they point to the correct location in host or device memory)
- Often it can be be easier to simply use raw pointers instead of Thrust vectors, as this can alleviate the need to appropriately cast Thrust vectors (EG using
raw_pointer_castordevice_pointer_castas described above), and also avoids having to debug complicated compiler errors (EG that are >500 lines long and overflow the terminal buffer) due to incompatibility between different template types and overloaded functions - Raw pointers to device memory (EG allocated using
cudaMallocand populated usingcudaMemcpy) can be used in Thrust functions equally as well as iterators to device memory which has been allocated usingthrust::device_vector
- Often it can be be easier to simply use raw pointers instead of Thrust vectors, as this can alleviate the need to appropriately cast Thrust vectors (EG using
- Thrust defines a set of Predefined Function Objects (including Arithmetic Operations, Comparison Operations, and Logical Operations), which can be useful for implementing generalised transformations, reductions, etc
- Many Thrust functions (other than those for memory allocation and copying) are overloaded with an optional first argument
const thrust::detail::execution_policy_base< DerivedPolicy > & exec, which can be specified as eitherthrust::hostorthrust::deviceto specify whether the function should run on the CPU or the GPU, or it can be omitted, in which case the compiler will decide- In general it is best to specify this argument as
thrust::deviceif the function is being provided with GPU memory and expected to run on the GPU, otherwise it is possible (especially EG if the function is being provided with raw pointers which have been allocated in GPU memory in a different source file) that the function will be compiled to run on the host (CPU), meaning that when this function is called with a pointer to device (GPU) memory, the program may fail silently and crash without displaying any useful error message
- In general it is best to specify this argument as
Below is an example Cuda program using Thrust, which calculates the mean and variance of a vector of doubles, and then calculates the indices of all elements in the vector which are more than one standard deviation above the mean of the vector, as well as the number of such elements.
#include "stdio.h"
#include <thrust/device_vector.h>
#define N (10000)
/* Calculate the mean and variance of a vector */
void mean_and_var(
thrust::device_vector<double> data_in,
int n,
double* p_mean,
double* p_var
) {
double sum = thrust::reduce(
thrust::device,
data_in.begin(),
data_in.end(),
0.0,
thrust::plus<double>()
);
double sum_square = thrust::transform_reduce(
thrust::device,
data_in.begin(),
data_in.end(),
thrust::square<double>(),
0.0,
thrust::plus<double>()
);
double mean = sum / n;
*p_mean = mean;
*p_var = (sum_square / n) - mean*mean;
}
/* Functor to calculate if the input is above a threshold */
struct above_threshold {
const double _threshold;
above_threshold(double threshold) : _threshold(threshold) {}
__device__ int operator()(const double& x) const {
return x > _threshold;
}
};
int main() {
/* Allocate host (CPU) inputs and outputs, and initialise inputs */
thrust::host_vector<double> data_in(N);
thrust::host_vector<int> data_out(N);
for (int i = 0; i < N; i++) {
data_in[i] = (double) i;
}
/* Allocate device memory for inputs and outputs, and copy input data from
host to device memory */
thrust::device_vector<double> dev_data_in(N);
thrust::device_vector<int> dev_data_out(N);
thrust::copy(data_in.begin(), data_in.end(), dev_data_in.begin());
/* Calculate the mean and variance of the vector */
double mean, var;
mean_and_var(dev_data_in, N, &mean, &var);
/* Calculate the indices of samples which are 1 standard deviation above
the mean */
thrust::counting_iterator<int> count_start(0);
thrust::device_ptr<int> output_end = thrust::copy_if(
thrust::device,
count_start,
count_start + N,
&dev_data_in[0],
&dev_data_out[0],
above_threshold(mean + sqrt(var))
);
int num_above_threshold = (int) (output_end - &dev_data_out[0]);
/* Copy outputs back to host */
thrust::copy(
&dev_data_out[0],
&dev_data_out[num_above_threshold],
&data_out[0]
);
/* Print outputs */
printf("Mean = %.1f, std = %.1f\n", mean, sqrt(var));
printf("data_out[%i] = {", num_above_threshold);
for (int i = 0; i < num_above_threshold; i++) {
printf("%i, ", data_out[i]);
if (i > 20) {
printf("..., %i", data_out[num_above_threshold - 1]);
break;
}
}
printf("}\nFinished program\n");
}TODO: replace this section with a program which procides different implementations of a transformation on a vector (EG squaring each element), and profile the running time for inputs of different sizes, and compare between different implementations such as a custom Cuda kernel, a Thrust function (using both host and device memory), and a regular CPU function.
Below is an example of calculating mean and variance using Thrust:
struct square_double {
__host__ __device__ double operator()(const double& x) const {
return x * x;
}
};
void mean_and_var(double* a, int n, double* p_mean, double* p_var) {
double sum = thrust::reduce(a, &a[n], 0.0, thrust::plus<double>());
double sum_square = thrust::transform_reduce(
a,
&a[n],
square_double(),
0.0,
thrust::plus<double>()
);
double mean = sum / n;
*p_mean = mean;
*p_var = (sum_square / n) - mean*mean;
}Here is that solution in a self-contained source file along with some profiling calculations, comparing CPU versus GPU, and device memory versus host memory in Thrust function calls:
#include "stdio.h"
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#define PROFILING_INIT \
cudaEvent_t start, stop; \
float elapsedTime;
#define PROFILING_START \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, 0);
#define PROFILING_STOP \
cudaEventRecord(stop, 0); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
printf("Time elapsed: %.3g ms\n", elapsedTime);
#define N (6*8000)
// #define N (6*8000*10)
// #define N (6*8000*100)
double a[N];
template <typename T> struct square {
__host__ __device__ T operator()(const T& x) const {
return x * x;
}
};
void mean_and_var_cpu(double* a, int n, double* p_mean, double* p_var) {
double sum = 0, sum_square = 0, mean;
for (int i = 0; i < n; i++) {
sum += a[i];
sum_square += (a[i] * a[i]);
}
mean = sum / n;
*p_mean = mean;
*p_var = (sum_square / n) - mean*mean;
}
template <typename T> void mean_and_var(T a, int n, double* p_mean, double* p_var) {
double sum = thrust::reduce(a, &a[n], 0.0, thrust::plus<double>());
double sum_square = thrust::transform_reduce(a, &a[n], square<double>(), 0.0, thrust::plus<double>());
double mean = sum / n;
*p_mean = mean;
*p_var = (sum_square / n) - mean*mean;
}
int main() {
for (int i = 0; i < N; i++) {
a[i] = i;
}
double mean, var;
PROFILING_INIT;
printf("With thrust:\n");
PROFILING_START;
mean_and_var<double*>(a, N, &mean, &var);
PROFILING_STOP;
printf("Mean = %f, var = %f\n", mean, var);
printf("With thrust, using device memory:\n");
thrust::device_vector<double> a_dev(N);
thrust::copy(a, &a[N], a_dev.begin());
PROFILING_START;
mean_and_var<thrust::device_ptr<double>>(&a_dev[0], N, &mean, &var);
PROFILING_STOP;
printf("Mean = %f, var = %f\n", mean, var);
printf("On CPU:\n");
PROFILING_START;
mean_and_var_cpu(a, N, &mean, &var);
PROFILING_STOP;
printf("Mean = %f, var = %f\n", mean, var);
}In all cases, the answers agree with the mean and population variance calculated using the Python statistics module:
import statistics
x = 8000*6
print(statistics.mean(list(range(x))))
# print(statistics.variance(list(range(x))))
print(statistics.pvariance(list(range(x))))Here is some profiling information, all of which was performed on a Jetson Nano development board:
| N | Time taken for Thrust (ms) | Time taken for Thrust using device memory (ms) | Time taken for naive CPU implementation (ms) |
|---|---|---|---|
| 6*8000 | 11.2 | 2.38 | 1.11 |
| 6*8000*10 | 108 | 7.62 | 11.7 |
| 6*8000*100 | 1.12e+03 | 41.7 | 109 |
Conclusions from profiling:
- If you're going to use Thrust, make sure you use device memory!
- For a large enough input size, the GPU is faster than the CPU
- For smaller input sizes, depending on the platform, it might be faster using the CPU than the GPU