Skip to content

Instantly share code, notes, and snippets.

@mre
Last active February 10, 2026 16:42
Show Gist options
  • Select an option

  • Save mre/1392067 to your computer and use it in GitHub Desktop.

Select an option

Save mre/1392067 to your computer and use it in GitHub Desktop.
Bitonic Sort on CUDA. On a quick benchmark it was 10x faster than the CPU version.
/*
* Parallel bitonic sort using CUDA.
* Compile with
* nvcc -arch=sm_11 bitonic_sort.cu
* Based on http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
* License: BSD 3
*/
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
/* Every thread gets exactly one value in the unsorted array. */
#define THREADS 512 // 2^9
#define BLOCKS 32768 // 2^15
#define NUM_VALS THREADS*BLOCKS
void print_elapsed(clock_t start, clock_t stop)
{
double elapsed = ((double) (stop - start)) / CLOCKS_PER_SEC;
printf("Elapsed time: %.3fs\n", elapsed);
}
float random_float()
{
return (float)rand()/(float)RAND_MAX;
}
void array_print(float *arr, int length)
{
int i;
for (i = 0; i < length; ++i) {
printf("%1.3f ", arr[i]);
}
printf("\n");
}
void array_fill(float *arr, int length)
{
srand(time(NULL));
int i;
for (i = 0; i < length; ++i) {
arr[i] = random_float();
}
}
__global__ void bitonic_sort_step(float *dev_values, int j, int k)
{
unsigned int i, ixj; /* Sorting partners: i and ixj */
i = threadIdx.x + blockDim.x * blockIdx.x;
ixj = i^j;
/* The threads with the lowest ids sort the array. */
if ((ixj)>i) {
if ((i&k)==0) {
/* Sort ascending */
if (dev_values[i]>dev_values[ixj]) {
/* exchange(i,ixj); */
float temp = dev_values[i];
dev_values[i] = dev_values[ixj];
dev_values[ixj] = temp;
}
}
if ((i&k)!=0) {
/* Sort descending */
if (dev_values[i]<dev_values[ixj]) {
/* exchange(i,ixj); */
float temp = dev_values[i];
dev_values[i] = dev_values[ixj];
dev_values[ixj] = temp;
}
}
}
}
/**
* Inplace bitonic sort using CUDA.
*/
void bitonic_sort(float *values)
{
float *dev_values;
size_t size = NUM_VALS * sizeof(float);
cudaMalloc((void**) &dev_values, size);
cudaMemcpy(dev_values, values, size, cudaMemcpyHostToDevice);
dim3 blocks(BLOCKS,1); /* Number of blocks */
dim3 threads(THREADS,1); /* Number of threads */
int j, k;
/* Major step */
for (k = 2; k <= NUM_VALS; k <<= 1) {
/* Minor step */
for (j=k>>1; j>0; j=j>>1) {
bitonic_sort_step<<<blocks, threads>>>(dev_values, j, k);
}
}
cudaMemcpy(values, dev_values, size, cudaMemcpyDeviceToHost);
cudaFree(dev_values);
}
int main(void)
{
clock_t start, stop;
float *values = (float*) malloc( NUM_VALS * sizeof(float));
array_fill(values, NUM_VALS);
start = clock();
bitonic_sort(values); /* Inplace */
stop = clock();
print_elapsed(start, stop);
}
@DanyStinson

Copy link
Copy Markdown

Hi!
Do you have the CPU version code?

@mre

mre commented Oct 8, 2016

Copy link
Copy Markdown
Author

@DanyStinson sorry, just saw your message.
Probably you already found an implementation yourself, but just in case somebody else is looking for BitonicSort in C, here are two projects:
https://github.com/steremma/Bitonic
https://github.com/orestisf1993/mpi_bitonic_sort

@ADITYA-TALAVIYA

Copy link
Copy Markdown

how to give input to my own integers and can u tell me whats is the length

@mre

mre commented Apr 27, 2017

Copy link
Copy Markdown
Author

There's a function called bitonic_sort and it has one parameter: the list of values.

Call it like this:

bitonic_sort(your_values);

The values need to be floating point (not integer).
The length is NUM_VALS.
If you know the number of values, you can adjust the settings at the top:

#define THREADS 512 // 2^9
#define BLOCKS 32768 // 2^15
#define NUM_VALS THREADS*BLOCKS

Modify THREADS and BLOCKS so that THREADS*BLOCKS is the total number of your values

@ADITYA-TALAVIYA

Copy link
Copy Markdown

OK THANKS
But i have 29468 but so its in 2^n form so i have manage by 29472 floating point so can u tell me whats is the proper size of THREADS AND BLOCKS its divide

@byteunit

byteunit commented Jun 5, 2017

Copy link
Copy Markdown

I'm sorry, but this benchmark measures only how fast the launch of the kernel is.

as stated here (https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/) you need an additional
cudaDeviceSynchronize();

@gggprojects

Copy link
Copy Markdown

@byteunit That's not true. cudaMemcpy is a memory blocking operations causing the syncronization of the device. Thus, the measured time includes the kernel execution.

@kosalape

Copy link
Copy Markdown

Hi,
This ONLY makes a bitonic sequence but NOT a bitonic sort right?

@twmht

twmht commented Jun 28, 2018

Copy link
Copy Markdown

@mre

I have tried that but getting slow in GTX-1080.

In the original settings,

#define THREADS 512 // 2^9
#define BLOCKS 32768 // 2^15

I got 0.3s.

with the new settings

#define THREADS 4
#define BLOCKS 250

this gives me 0.178s

I have no idea why it's slow.

@richursa

richursa commented Jul 8, 2018

Copy link
Copy Markdown

@etale-cohomology

Copy link
Copy Markdown

Does it work on NON-power-of-two arrays?

@mre

mre commented Sep 26, 2019

Copy link
Copy Markdown
Author

No. That's a limitation of the algorithm. You could fill up the rest of the array with "tombstone data" that you would later discard again.

@etale-cohomology

Copy link
Copy Markdown

Great to know! I was worried I was missing something...

@penglz

penglz commented Dec 27, 2019

Copy link
Copy Markdown

half of the blocks are idle all the time? wouldn't it be more efficient if numvals = 2threadsblocks?

@ai275709

Copy link
Copy Markdown

sir in this code how to print a sorted list.

@mre

mre commented Apr 13, 2021

Copy link
Copy Markdown
Author

@theoseidel

Copy link
Copy Markdown

When I print before and after sort the output is the same.

@mre

mre commented Jun 9, 2021

Copy link
Copy Markdown
Author

I don't know what could cause this. Might depend on your GPU/OS. You could try to add an explicit call to
cudaDeviceSynchronize, even though cudaMemcpy should ensure a sync internally.

@theoseidel

Copy link
Copy Markdown

I am running it in a google colab. Could that be the reason? Also, is there any downside to using arrays of integers rather than floats?

@mre

mre commented Jun 11, 2021

Copy link
Copy Markdown
Author

I have no clue about the CUDA support in Google Colab. It should work I guess?
The algorithm takes an array of floats, so you should give that a try yes.

@akmalshahbaz

Copy link
Copy Markdown

Hi why it not show random numbers and numbers after sort can you please tell me what code i add in this?

@mre

mre commented May 16, 2022

Copy link
Copy Markdown
Author

@akmalshahbaz, try this:

int main(void)
{
  clock_t start, stop;

  float *values = (float*) malloc( NUM_VALS * sizeof(float));
  array_fill(values, NUM_VALS);

  printf("Elements before sort: \n");    
  for (int i = 0; i < NUM_VALS; i++) {     
    printf("%d ", values[i]);     
  }
  printf("\n");    

  start = clock();
  bitonic_sort(values); /* Inplace */
  stop = clock();

  printf("Elements after sort: \n");    
  for (int i = 0; i < NUM_VALS; i++) {     
    printf("%d ", values[i]);     
  } 
  printf("\n");     

  print_elapsed(start, stop);
}  

@mre

mre commented Jul 6, 2022

Copy link
Copy Markdown
Author

1. What is j and where it is changing its value?

j is the index for a minor step.
The minor steps are each column in the following picture:
image
Source

The naming is a little different in each description of the algorithm.

j is changing its value here:
https://gist.github.com/mre/1392067#file-bitonic_sort-cu-L94-L96

2. Same question for k

k is the index for the major step.
In the above example picture, the major steps are connected by the arch below the minor steps (so the three major steps in the example are (1), (2,1), (3,2,1)).

It might help to replace the code here with some debug output so you can see the major and minor steps:

  int j, k;

  for (k = 2; k <= NUM_VALS; k <<= 1) {
   printf("Major step. k=%d\n", k);
    for (j=k>>1; j>0; j=j>>1) {
      printf("Minor step. j=%d\n", j);
    }
  }

3. What is ixj and why it equals to 1^j?

There is a line in the code here which reads

ixj = i^j;

The ^ is the XOR operation in C. The first letter of XOR is X, so I chose to call the variable ixj (short for i XOR j).
It's a rather arbitrary, silly name.
What's more interesting is the use of it.

In the bitonic sort implementation, not all threads are active all the time. Instead, they take turns to sort parts of the array. Each thread just sorts two values per step. The ixj variable is a nifty trick to find out if the current thread is responsible for doing any sorting at the moment.
How does it find out?
For each minor step, we do the following:

  // Get the index of the number we want to sort in this thread 
  i = threadIdx.x + blockDim.x * blockIdx.x;

  // Calculate the XOR value between the number we want to sort in our thread (i)
  // and the current minor step j.
  // This is a nifty trick to find out if the current thread has to do work in this step
  // because for exactly half of the values `ixj` will be bigger than `i`.
  // What's cool about it is that it's not always the same threads which do work because `i` will always
  // stay the same, but `j` changes on every call/iteration, so the xor will "filter out" different threads on every iteration.
  ixj = i^j;

  /* The threads with the lowest ids sort the array. */
  if ((ixj)>i) {
    // ...
  }

Hope that helps a little.

@tingxingdong

Copy link
Copy Markdown

I think this is a basic implementation. Many papers said the shared memory (like cache ) should be used for performance. This implement did not use shared memory but only global memory in/ out

@niksa27

niksa27 commented Mar 17, 2024

Copy link
Copy Markdown

@mre
At any stage in the above algorithm, only n/2 threads are being used. Rest n/2 are not being used. Is there a way to utilize all the threads?

@mre

mre commented Mar 17, 2024

Copy link
Copy Markdown
Author

I don't know how, given that each thread compares two numbers. Each step needs to complete until the next step can start, which means that the idle threads can't do any work in the meantime. This is how the algorithm operates. For more information, see the description here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment