Not just NVIDIA: GPU programming that runs everywhere

by Itamar Turner-Trauring
Last updated 13 Feb 2024, originally created 13 Feb 2024

If you’re doing computations on a GPU, NVIDIA is the default, alongside its CUDA libraries. Some libraries like PyTorch support do support AMD GPUs and Macs. But from the re-implementations of NumPy, SciPy, and Pandas in the RAPIDS project, to Numba’s GPU support, NVIDIA has best software support in the Python world.

Sticking to NVIDIA-specific software has some downsides, however:

It won’t run on modern Mac laptops.
Testing in CI is more difficult: you need custom runners that have NVIDIA GPUs.
You can’t use any other GPUs you might have access to, like AMD GPUs.

What can you do if you want to use GPUs in a portable manner? In this article we’ll cover one option, the wgpu-py library.

WebGPU, `wgpu`, and `wgpu-py`

WebGPU is a standard (in-progress) for writing portable GPU-based programs (or “shaders”) in web browsers. It’s still experimental, but Chrome and Firefox are building prototypes. Graphics programming is the major use case, but WebGPU also supports compute-only shaders.

Of course, not everything runs in a browser! Which brings us to wgpu, a Rust library that re-purposes the WebGPU API for environments outside the browser. Notably, it supports a variety of backends (Vulkan, GL, Metal, DirectX 12), with the effective result that it can run on:

Modern GPUs on Linux or Windows.
Modern Macs.
CPUs, using the lavapipe CPU-based implementation of Vulkan.

And wgpu-py is a Python wrapper for wgpu.

Add this all up, and you can use wgpu-py to run GPU programs pretty much anywhere, from normal CI runners to your Mac laptop to, yes, a NVIDIA GPU.

An example: adding two NumPy arrays

To show wgpu-py at work, we’ll implement an example that adds two NumPy arrays together. The point here is not to write good GPU code, or interesting code, it’s just to demonstrate the library at work.

import numpy as np
from time import time

INPUT1 = np.arange(0, 1_000_000, dtype=np.int32)
INPUT2 = np.arange(2, 1_000_002, dtype=np.int32)
assert INPUT1.shape == INPUT2.shape

Here’s the program we’ll be running on the GPU, implemented in WebGPU Shader Language (WGSL), a new custom language developed for WebGPU:

SHADER = """
@group(0) @binding(0)
var<storage,read> input1: array<i32>;

@group(0) @binding(1)
var<storage,read> input2: array<i32>;

@group(0) @binding(2)
var<storage,read_write> output: array<i32>;

@compute
@workgroup_size(50)
fn main(@builtin(global_invocation_id) index: vec3<u32>) {
    let i: u32 = index.x;
    output[i] = input1[i] + input2[i];
}
"""

We can run this on a GPU using wgpu-py, which you can install with pip install wgpu:

from wgpu.utils.compute import compute_with_buffers

def run_on_gpu():
    # Map from input binding number to the relevant NumPy
    # array:
    inputs = {0: INPUT1, 1: INPUT2}
    # Map from output binding numbers to the length and
    # type of the output. "i" means signed 32-bit integer.
    outputs = {2: (len(INPUT1), "i")}
    # The workgroup size is 50...
    result = compute_with_buffers(
        inputs, outputs, SHADER, n=len(INPUT1) // 50
    )
    # The result maps binding numbers to a memory view.
    return np.frombuffer(result[2], np.int32)

For comparison, here’s how we’d add the two arrays on the CPU using NumPy:

def run_on_cpu():
    return INPUT1 + INPUT2

Now we can run both versions:

from wgpu.utils import get_default_device
from time import time

device = get_default_device()
print("GPU:", device._adapter.request_adapter_info()["device"])

# Run once to prep everything without distorting timing:
run_on_gpu()

start = time()
for _ in range(10):
    gpu_result = run_on_gpu()
print("GPU mean elapsed:", (time() - start) / 10)

start = time()
for _ in range(10):
    cpu_result = run_on_cpu()
print("CPU mean elapsed:", (time() - start) / 10)

assert np.array_equal(gpu_result, cpu_result)

Running on non-NVIDIA GPUs

This program can be run on the integrated GPU of an i7-12700K:

GPU: Intel(R) UHD Graphics 770 (ADL-S GT1)
GPU elapsed: 0.008639335632324219
CPU elapsed: 0.0013203620910644531

You can also run it on the integrated GPU of an ARM-based Mac:

GPU: Apple M1
GPU mean elapsed: 0.0035706758499145508
CPU mean elapsed: 0.00032808780670166013

In theory it should also be possible run on a CPU by installing lavapipe, allowing you to run the code in normal CI runners.

While the CPU version is faster in both cases, that is pretty meaningless. This example was chosen for simplicity, not for usefulness, nor for relevance to performance.

Considering the tradeoffs

Portability isn’t everything, and using wgpu-py has its downsides. Even from my perspective as someone who knows very little about GPU programming, some obvious issues include:

The WGSL language is still a draft standard, as is WebGPU in general. wgpu and therefore wgpu-py do support other shader languages, though.
As mentioned above, sticking to NVIDIA hardware gives you a large number of pre-written libraries that won’t work elsewhere.
If you’re not already familiar with this field, WGSL—or the other supported shader languages—are a new language you need to learn. Compare that to Numba, which lets you write CUDA programs with Python syntax.
A lowest-common-denominator language won’t let you take full advantage of your hardware.

Nonetheless, the underlying wgpu Rust library is finding users in the real world. Visit the wgpu site’s Showcase section (scroll down) to see examples, albeit with a graphics-oriented slant.

Find performance and memory bottlenecks in your data processing code with the Sciagraph profiler

Slow-running jobs waste your time during development, impede your users, and increase your compute costs. Speed up your code and you’ll iterate faster, have happier users, and stick to your budget—but first you need to identify the cause of the problem.

Find performance bottlenecks and memory hogs in your data science Python jobs with the Sciagraph profiler. Profile in development and production, with multiprocessing support, on macOS and Linux, with built-in support for Jupyter notebooks.

Speed up your Python code and learn skills you can use at your job

Join over 8000 Python developers and data scientists learning practical tools and techniques every week, from Python performance to Docker packaging, by signing up for my newsletter.

Not just NVIDIA: GPU programming that runs everywhere

WebGPU, wgpu, and wgpu-py

An example: adding two NumPy arrays

Running on non-NVIDIA GPUs

Considering the tradeoffs

Find performance and memory bottlenecks in your data processing code with the Sciagraph profiler

Speed up your Python code and learn skills you can use at your job

WebGPU, `wgpu`, and `wgpu-py`