Last active
March 3, 2025 11:51
-
-
Save mre/1392067 to your computer and use it in GitHub Desktop.
Bitonic Sort on CUDA. On a quick benchmark it was 10x faster than the CPU version.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* | |
* Parallel bitonic sort using CUDA. | |
* Compile with | |
* nvcc -arch=sm_11 bitonic_sort.cu | |
* Based on http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm | |
* License: BSD 3 | |
*/ | |
#include <stdlib.h> | |
#include <stdio.h> | |
#include <time.h> | |
/* Every thread gets exactly one value in the unsorted array. */ | |
#define THREADS 512 // 2^9 | |
#define BLOCKS 32768 // 2^15 | |
#define NUM_VALS THREADS*BLOCKS | |
void print_elapsed(clock_t start, clock_t stop) | |
{ | |
double elapsed = ((double) (stop - start)) / CLOCKS_PER_SEC; | |
printf("Elapsed time: %.3fs\n", elapsed); | |
} | |
float random_float() | |
{ | |
return (float)rand()/(float)RAND_MAX; | |
} | |
void array_print(float *arr, int length) | |
{ | |
int i; | |
for (i = 0; i < length; ++i) { | |
printf("%1.3f ", arr[i]); | |
} | |
printf("\n"); | |
} | |
void array_fill(float *arr, int length) | |
{ | |
srand(time(NULL)); | |
int i; | |
for (i = 0; i < length; ++i) { | |
arr[i] = random_float(); | |
} | |
} | |
__global__ void bitonic_sort_step(float *dev_values, int j, int k) | |
{ | |
unsigned int i, ixj; /* Sorting partners: i and ixj */ | |
i = threadIdx.x + blockDim.x * blockIdx.x; | |
ixj = i^j; | |
/* The threads with the lowest ids sort the array. */ | |
if ((ixj)>i) { | |
if ((i&k)==0) { | |
/* Sort ascending */ | |
if (dev_values[i]>dev_values[ixj]) { | |
/* exchange(i,ixj); */ | |
float temp = dev_values[i]; | |
dev_values[i] = dev_values[ixj]; | |
dev_values[ixj] = temp; | |
} | |
} | |
if ((i&k)!=0) { | |
/* Sort descending */ | |
if (dev_values[i]<dev_values[ixj]) { | |
/* exchange(i,ixj); */ | |
float temp = dev_values[i]; | |
dev_values[i] = dev_values[ixj]; | |
dev_values[ixj] = temp; | |
} | |
} | |
} | |
} | |
/** | |
* Inplace bitonic sort using CUDA. | |
*/ | |
void bitonic_sort(float *values) | |
{ | |
float *dev_values; | |
size_t size = NUM_VALS * sizeof(float); | |
cudaMalloc((void**) &dev_values, size); | |
cudaMemcpy(dev_values, values, size, cudaMemcpyHostToDevice); | |
dim3 blocks(BLOCKS,1); /* Number of blocks */ | |
dim3 threads(THREADS,1); /* Number of threads */ | |
int j, k; | |
/* Major step */ | |
for (k = 2; k <= NUM_VALS; k <<= 1) { | |
/* Minor step */ | |
for (j=k>>1; j>0; j=j>>1) { | |
bitonic_sort_step<<<blocks, threads>>>(dev_values, j, k); | |
} | |
} | |
cudaMemcpy(values, dev_values, size, cudaMemcpyDeviceToHost); | |
cudaFree(dev_values); | |
} | |
int main(void) | |
{ | |
clock_t start, stop; | |
float *values = (float*) malloc( NUM_VALS * sizeof(float)); | |
array_fill(values, NUM_VALS); | |
start = clock(); | |
bitonic_sort(values); /* Inplace */ | |
stop = clock(); | |
print_elapsed(start, stop); | |
} |
I think this is a basic implementation. Many papers said the shared memory (like cache ) should be used for performance. This implement did not use shared memory but only global memory in/ out
@mre
At any stage in the above algorithm, only n/2 threads are being used. Rest n/2 are not being used. Is there a way to utilize all the threads?
I don't know how, given that each thread compares two numbers. Each step needs to complete until the next step can start, which means that the idle threads can't do any work in the meantime. This is how the algorithm operates. For more information, see the description here
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
1. What is
j
and where it is changing its value?j
is the index for a minor step.The minor steps are each column in the following picture:
Source
The naming is a little different in each description of the algorithm.
j
is changing its value here:https://gist.github.com/mre/1392067#file-bitonic_sort-cu-L94-L96
2. Same question for
k
k is the index for the major step.
In the above example picture, the major steps are connected by the arch below the minor steps (so the three major steps in the example are (1), (2,1), (3,2,1)).
It might help to replace the code here with some debug output so you can see the major and minor steps:
3. What is
ixj
and why it equals to1^j
?There is a line in the code here which reads
The
^
is theXOR
operation in C. The first letter ofXOR
isX
, so I chose to call the variableixj
(short fori XOR j
).It's a rather arbitrary, silly name.
What's more interesting is the use of it.
In the bitonic sort implementation, not all threads are active all the time. Instead, they take turns to sort parts of the array. Each thread just sorts two values per step. The
ixj
variable is a nifty trick to find out if the current thread is responsible for doing any sorting at the moment.How does it find out?
For each minor step, we do the following:
Hope that helps a little.