@dmikushin
Last active January 27, 2025 12:01
OpenMP target offload for multiple NVIDIA architectures

This sample demonstrates how to build a single OpenMP target offload C program with Clang so that one executable supports multiple NVIDIA architectures (Compute Capabilities, CC).

In this example the executable is compiled to support sm_80 (e.g. NVIDIA A100) and sm_90 (e.g. NVIDIA H100):

> make
LIBRARY_PATH=/usr/lib/llvm-19/lib clang-19 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
        -Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_90 \
        -Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_80 \
        -fuse-ld=lld multi_sm_test.c -o multi_sm_test
> strings ./multi_sm_test | grep "sm_80"
sm_80
> strings ./multi_sm_test | grep "sm_90"
sm_90

Clang 19 is used above; the same approach also works with Clang 17.

This feature is discussed in D128090, which notes that -Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_80 works, while -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_80 does not.
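The strings check above only shows that both device images are embedded in the executable; it does not show whether a target region actually runs on a GPU at run time. A minimal sketch of such a check follows, assuming the standard OpenMP routines omp_get_num_devices and omp_is_initial_device. The helper names visible_device_count and target_region_ran_on_device are hypothetical and not part of the original sample.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of offload devices the OpenMP runtime can see.
   Returns 0 when the file is built without OpenMP support. */
int visible_device_count(void) {
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Hypothetical helper: returns 1 if a target region executed on an offload
   device, 0 if it fell back to the host (or OpenMP is disabled). */
int target_region_ran_on_device(void) {
    int on_host = 1;
#ifdef _OPENMP
    #pragma omp target map(tofrom: on_host)
    {
        /* Evaluates to nonzero only when executing on the host device. */
        on_host = omp_is_initial_device();
    }
#endif
    return !on_host;
}

/* Prints a one-line summary of the two checks above. */
void report_offload_status(void) {
    printf("devices visible: %d, target region on device: %s\n",
           visible_device_count(),
           target_region_ran_on_device() ? "yes" : "no");
}
```

Calling report_offload_status() at the start of main, in a program built with the same flags as above, should make a silent host fallback visible on a machine whose GPU matches neither embedded architecture.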

# Makefile
LLVM_VERSION=19

# The LLVM 19 build needs LIBRARY_PATH pointing at its own lib directory.
ifeq ($(LLVM_VERSION),19)
CC=LIBRARY_PATH=/usr/lib/llvm-$(LLVM_VERSION)/lib clang-$(LLVM_VERSION)
else
CC=clang-$(LLVM_VERSION)
endif

.PHONY: all clean test

all: multi_sm_test

# Embed device code for both sm_90 and sm_80 in one executable.
multi_sm_test: multi_sm_test.c
	$(CC) -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
		-Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_90 \
		-Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_80 \
		-fuse-ld=lld $< -o $@

clean:
	rm -rf multi_sm_test

# Check that both architecture names appear in the binary.
test: multi_sm_test
	strings ./$< | grep "sm_80" && strings ./$< | grep "sm_90"

// multi_sm_test.c
#include <stdio.h>
#include <omp.h>

#define N 1000

// Compute result = a * x + y on the offload device.
void saxpy(float a, float *x, float *y, float *result, int n) {
    #pragma omp target teams distribute parallel for map(to: x[0:n], y[0:n]) map(from: result[0:n])
    for (int i = 0; i < n; i++) {
        result[i] = a * x[i] + y[i];
    }
}

int main(void) {
    float a = 2.0f;
    float x[N], y[N], result[N];

    // Initialize arrays
    for (int i = 0; i < N; i++) {
        x[i] = i * 1.0f;
        y[i] = i * 2.0f;
    }

    // Perform SAXPY
    saxpy(a, x, y, result, N);

    // Print some results for verification
    for (int i = 0; i < 10; i++) {
        printf("result[%d] = %f\n", i, result[i]);
    }

    return 0;
}
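
The program above prints the first ten results for visual inspection. As a possible complement, the sketch below recomputes SAXPY on the host and counts elements that differ from the offloaded result. The function name count_saxpy_mismatches and the relative tolerance parameter tol are assumptions introduced for illustration; they are not part of the original sample.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: recompute a * x[i] + y[i] on the host and count the
   elements of result that differ by more than a relative tolerance.
   A NaN in result is counted as a mismatch. */
int count_saxpy_mismatches(float a, const float *x, const float *y,
                           const float *result, int n, float tol) {
    assert(x != NULL && y != NULL && result != NULL && n >= 0);

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        float expected = a * x[i] + y[i];

        /* Absolute difference, computed without libm. */
        float diff = result[i] - expected;
        if (diff < 0.0f) diff = -diff;

        /* Scale the tolerance by |expected|, with a floor of 1. */
        float scale = expected < 0.0f ? -expected : expected;
        if (scale < 1.0f) scale = 1.0f;

        /* Negated comparison so that NaN is treated as a mismatch. */
        if (!(diff <= tol * scale)) mismatches++;
    }
    return mismatches;
}
```

In main, a call such as count_saxpy_mismatches(a, x, y, result, N, 1e-5f) after saxpy(...) would return 0 when the device and host results agree within the chosen tolerance.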