- RAPIDS/CUDA Python Overview (40 mins)
- Explore some RAPIDS notebooks from the RAPIDS Community Tutorial
- cuDF (Pandas/Dataframes)
- cuML (scikit-learn/ML)
- and/or explore some CUDA Python notebooks from the Accelerated Computing Hub Tutorial
- Numba (CUDA development in Python)
- CuPy (NumPy/Arrays)
- Break (10 mins)
- Deployment (40 mins)
- NVIDIA Brev
- GPU Software Environment Fundamentals
- Python packages that use CUDA
- Monitoring/debugging tools
- Other platforms
The first section of this tutorial will be run on Google Colab. To follow along you'll need a Google account in order to access NVIDIA T4 GPUs in the Colab Free Tier.
RAPIDS Community Tutorial | Accelerated Computing Hub Tutorial
The second half of the tutorial will discuss how to get your own GPUs in the cloud in more general terms. To dig into some of the things we will be learning, we will launch a VM through the NVIDIA Brev portal.
- Sign in to or register an account at https://brev.nvidia.com
- Ensure you are a member of an organisation
- One should be created for you when you register, but if not it will say "undefined" in the top right
- If you don't have one you can create a new one and give it a name
- Apply credits to your organisation
- Navigate to Billing
- Select "Redeem Code" and enter the code euroscipy-2025
- Under "GPUs" select "New Instance"
- Choose a GPU type that costs <$1/hour (e.g. an L4)
- Choose any provider
- Give your VM a name
- Hit Deploy
- Install the brev CLI
brew install brevdev/homebrew-brev/brev && brev login
- Connect via SSH
brev shell <your vm name>
Let's start by exploring our VM to see what software we got out of the box.
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
We can check our GPU information by running nvidia-smi.
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:00:03.0 Off | 0 |
| N/A 47C P8 13W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Next we need the core CUDA libraries in order to run any CUDA code. Often these will be installed at the system level in /usr/local/cuda.
ls -ld /usr/local/cuda*
If these are missing we need to decide how to get those dependencies. The way we do this is different depending on whether we want to use pip/uv or conda/pixi for our Python package manager.
If we want to install our packages with pip we need to install the core CUDA libraries at the system level. On Ubuntu we can do this with apt.
# Add the NVIDIA repos
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install all of CUDA Toolkit (you need to specify the CUDA version that matches your driver)
sudo apt-get -y install cuda-toolkit-12-8
Now that we have our CUDA libraries we can install Python libraries with corresponding versions.
Important
We need to include the CUDA version in the package name due to limitations in the Python packaging spec; see the wheelnext project for plans to solve this in the long term. There is an experimental build of uv that supports wheel variants today.
Note
We also need to use a custom index because the RAPIDS packages tend to be too large to upload to PyPI. While we could work with PyPI to increase those limits, instead we run our own index and handle the cost of serving those packages ourselves. The reason CUDA packages are so large is that GPU machine code varies between GPU models in a way that doesn't happen with CPUs. To work around this, each library is built for all common GPUs and those builds are bundled together. Further improvements in packaging could help with this in the future.
pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12
python # Start Python interpreter
Then we can import cudf and allocate some GPU memory.
import cudf
s = cudf.Series([1, 2, 3, None, 4])
When installing libraries with conda, each individual CUDA library can be installed as a conda package. So we don't need to ensure any of the CUDA libraries already exist in /usr/local/cuda.
If you prefer to use conda then we need to install it first.
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh # Follow the prompts
Then we can create a new conda environment with python and cudf.
conda create -n rapids -c rapidsai -c conda-forge -c nvidia cudf python=3.13 'cuda-version>=12.0,<=12.9'
conda activate rapids
Note
You may notice this is much simpler than the pip installation. This is for two reasons:
- We don't need CUDA toolkit because each individual CUDA library is available as a conda package. So cudf can depend on them directly and install the ones it needs.
- Conda supports virtual packages which allow the solver to discover additional information about the system such as the CUDA version and then pull in the correct package build for your system.
Then we can import cudf and allocate some GPU memory
import cudf
s = cudf.Series([1, 2, 3, None, 4])
When working with GPUs you need to get visibility into what the device is doing. We can get a whole range of information with nvidia-smi.
# Show high level GPU information
nvidia-smi
# List GPUs
nvidia-smi -L
# Dump detailed information
nvidia-smi -q
Below nvidia-smi sits NVML, a library for querying low-level information from the GPU. There are Python bindings if you want to access this data yourself.
pip install nvidia-ml-py # Package name doesn't match library name. You import it with `import pynvml`
You can learn more about using the pynvml library in this notebook on the Accelerated Computing Hub.
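As a quick illustration, here is a minimal sketch (assuming a single GPU at index 0) that queries the device name, memory usage and utilisation via pynvml, similar to what nvidia-smi reports:
import pynvml

# Initialise NVML before making any queries
pynvml.nvmlInit()

# Get a handle to the first GPU (index 0)
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Query some of the same information that nvidia-smi shows
print(pynvml.nvmlDeviceGetName(handle))  # e.g. "NVIDIA L4"
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Memory: {mem.used / 1024**2:.0f} MiB / {mem.total / 1024**2:.0f} MiB")
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU utilisation: {util.gpu}%")

# Shut down NVML when we are done
pynvml.nvmlShutdown()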
If you are a fan of Jupyter Lab you can view metrics directly in the interface with jupyterlab-nvdashboard.
Note
Our Brev VM has Jupyter running for us in the system python environment via systemd. We can install NVDashboard in here but we need to ensure we are installing it into the right Python.
# Ensure we are using the base Python
conda deactivate # If you installed conda deactivate it
which python3 # Should be /usr/bin/python3
# Install NVDashboard
pip install jupyterlab_nvdashboard
# alternatively /usr/bin/python3 -m pip install jupyterlab_nvdashboard
# Restart jupyter
sudo systemctl restart jupyter
There are also many great third-party tools out there for inspecting your GPU. One such project is nvtop, a CLI tool for viewing GPU stats.
# Install with apt
sudo apt install nvtop
# Start nvtop
nvtop
Some tools and libraries have built-in profiling tools. For example the cudf.pandas plugin allows you to profile your code from within Jupyter (assuming you have already run %load_ext cudf.pandas and import pandas as pd in an earlier cell).
%%cudf.pandas.profile
small_df = pd.DataFrame({"a": ["0", "1", "2"], "b": ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])
axis = 0
for i in range(0, 2):
small_df.min(axis=axis)
axis = i
counts = small_df.groupby("a").b.count()
Further reading:
NVIDIA produces profiling and debugging tools which allow you to view low-level traces of GPU kernel execution to find performance bottlenecks.
Typically Python users will run their code with nsys to produce a report, and then open that report locally in the Nsight viewer.
Like many debugging tools, nsys needs to launch Python for us. It will run your code and then output a trace file which you can download and explore locally.
nsys profile \
--trace cuda,osrt,nvtx \
--gpu-metrics-device=all \
--cuda-memory-usage true \
--force-overwrite true \
--output profile_run_v1 \
python your_script.py
# Will create profile_run_v1.nsys-rep
If you are running Jupyter and Nsight on the same machine you can also use the Jupyter Lab Nsight extension.
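Since the nsys command above also traces NVTX ranges (--trace cuda,osrt,nvtx), you can annotate your own code so named regions show up in the report. A minimal sketch, assuming the nvtx Python package is installed (pip install nvtx):
import time

import nvtx

# Functions decorated with nvtx.annotate appear as named ranges in the nsys trace
@nvtx.annotate("load_data", color="green")
def load_data():
    time.sleep(1)  # stand-in for real work

# annotate also works as a context manager for arbitrary blocks of code
with nvtx.annotate("main", color="blue"):
    load_data()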
Further reading:
Now that we've experimented with all of these tools, libraries and debuggers on an Ubuntu VM, the next thing most folks need to figure out is how to apply this to their own world. It's likely that you have some opinionated set of hardware/software/platform that you need to use. Perhaps your employer provides you with access to Databricks, Coiled or Snowflake. Or maybe you have cloud access and you use services such as AWS SageMaker, Azure Machine Learning or Google Cloud Vertex AI. Or maybe you have an existing machine or cluster somewhere.
However you get access to GPUs it inevitably falls to you to close the gap between the software provided and the software you need. In our Brev example we got Ubuntu with the NVIDIA driver, but nothing else. On platforms like Snowflake you will get some version of CUDA Toolkit and a few libraries out of the box, but you'll need to figure out how to add the additional things you need.
In RAPIDS we endeavour to document the most commonly used platforms and how to get from their out of the box offering to a fully functional RAPIDS environment.
If you're using something that we haven't documented then you can walk through the various levels we've covered, figure out what you have and what you need, and hopefully you now have the ability to get started anywhere. If you think you're using a platform that we should document then open an issue.