@itzmeanjan
Last active March 19, 2025 11:29
😎 Parallel Matrix Multiplication on GPU, using Rust Vulkan Compute API of `vulkano` 🚴🏼
README.md

Using device: Apple M1 Max (type: IntegratedGpu). API version: 1.2.296
Matrix A = (1774 X 2048)
Matrix B = (2048 X 1024)
Matrix C = A X B = (1774 X 1024)
(a) GPU matrix multiply: 47.795959ms
(b) CPU parallel matrix multiply: 1.081648416s
(c) CPU single-threaded matrix multiply: 7.010535s
Speed Up (b/a): 22x
Speed Up (c/a): 146x
Speed Up (c/b): 6x

Using device: Intel(R) Graphics (ADL GT2) (type: IntegratedGpu). API version: 1.3.281
Matrix A = (1774 X 2048)
Matrix B = (2048 X 1024)
Matrix C = A X B = (1774 X 1024)
(a) GPU matrix multiply: 280.689909ms
(b) CPU parallel matrix multiply: 1.792873747s
(c) CPU single-threaded matrix multiply: 7.692478067s
Speed Up (b/a): 6x
Speed Up (c/a): 27x
Speed Up (c/b): 4x

Cargo.toml

[package]
name = "gpu-matrix-multiplication"
version = "0.1.0"
edition = "2024"
rust-version = "1.85.0"

[dependencies]
vulkano = "=0.35.1"
vulkano-shaders = "=0.35.0"
rand = "=0.8.4"
rayon = "=1.10.0"

main.rs

use rand::{Rng, SeedableRng, rngs::StdRng};
use rayon::prelude::*;
use std::{ops::Mul, sync::Arc, time::Instant};
use vulkano::{
    VulkanLibrary,
    buffer::{Buffer, BufferCreateInfo, BufferUsage},
    command_buffer::{
        AutoCommandBufferBuilder, CommandBufferUsage, CopyBufferInfo, PrimaryCommandBufferAbstract,
        allocator::StandardCommandBufferAllocator,
    },
    descriptor_set::{
        DescriptorSet, WriteDescriptorSet, allocator::StandardDescriptorSetAllocator,
    },
    device::{
        Device, DeviceCreateInfo, DeviceExtensions, QueueCreateInfo, QueueFlags,
        physical::PhysicalDeviceType,
    },
    instance::{Instance, InstanceCreateFlags, InstanceCreateInfo},
    memory::allocator::{AllocationCreateInfo, MemoryTypeFilter, StandardMemoryAllocator},
    pipeline::{
        ComputePipeline, Pipeline, PipelineBindPoint, PipelineLayout,
        PipelineShaderStageCreateInfo, compute::ComputePipelineCreateInfo,
        layout::PipelineDescriptorSetLayoutCreateInfo,
    },
    sync::{self, GpuFuture},
};

const MATRIX_A_NUM_ROWS: u32 = 1774;
const MATRIX_A_NUM_COLS: u32 = 1u32 << 11;
const MATRIX_B_NUM_ROWS: u32 = 1u32 << 11;
const MATRIX_B_NUM_COLS: u32 = 1u32 << 10;

#[derive(PartialEq, Debug)]
struct Matrix {
    rows: u32,
    cols: u32,
    elems: Vec<u32>,
}
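
// Single-threaded reference implementation of C = A x B. Arithmetic wraps
// mod 2^32 (wrapping_add / wrapping_mul) so results match the GPU shader,
// where GLSL `uint` addition and multiplication wrap by definition.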
impl Mul for Matrix {
    type Output = Matrix;

    fn mul(self, rhs: Self) -> Self::Output {
        assert_eq!(self.cols, rhs.rows);

        let mut elems = vec![0u32; (self.rows * rhs.cols) as usize];
        for i in 0..self.rows {
            for j in 0..rhs.cols {
                let mut sum = 0u32;
                for k in 0..self.cols {
                    sum = sum.wrapping_add(
                        self.elems[(i * self.cols + k) as usize]
                            .wrapping_mul(rhs.elems[(k * rhs.cols + j) as usize]),
                    );
                }
                elems[(i * rhs.cols + j) as usize] = sum;
            }
        }

        Matrix {
            rows: self.rows,
            cols: rhs.cols,
            elems,
        }
    }
}

impl Matrix {
    fn random(seed: u64, rows: u32, cols: u32) -> Matrix {
        let mut rng = StdRng::seed_from_u64(seed);
        Matrix {
            rows,
            cols,
            elems: (0..rows * cols).map(|_| rng.r#gen::<u32>()).collect(),
        }
    }
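
    // Parallel CPU multiply: every element of C is independent, so the flat
    // output vector is split across Rayon worker threads via `par_iter_mut`.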
    fn par_mul(&self, rhs: &Matrix) -> Matrix {
        assert_eq!(self.cols, rhs.rows);

        let mut elems = vec![0u32; (self.rows * rhs.cols) as usize];
        elems.par_iter_mut().enumerate().for_each(|(lin_idx, v)| {
            let r_idx = lin_idx / rhs.cols as usize;
            let c_idx = lin_idx - r_idx * rhs.cols as usize;

            *v = (0..self.cols as usize).fold(0u32, |acc, k| {
                acc.wrapping_add(
                    self.elems[r_idx * self.cols as usize + k]
                        .wrapping_mul(rhs.elems[k * rhs.cols as usize + c_idx]),
                )
            });
        });

        Matrix {
            rows: self.rows,
            cols: rhs.cols,
            elems,
        }
    }
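
    // Serializes as: rows (u32, little-endian) || cols (u32, little-endian) ||
    // raw element words. This mirrors the storage-buffer layout the compute
    // shader declares (two uint header fields, then a uint[] runtime array).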
    fn to_bytes(&self) -> Vec<u8> {
        let offset0 = 0;
        let offset1 = offset0 + std::mem::size_of_val(&self.rows);
        let offset2 = offset1 + std::mem::size_of_val(&self.cols);

        let encoded_elems_byte_len = std::mem::size_of::<u32>() * (self.rows * self.cols) as usize;
        let encoded_mat_byte_len = offset2 + encoded_elems_byte_len;

        let elems_as_bytes = unsafe {
            let ptr_elems = self.elems.as_ptr();
            let ptr_elem_bytes: *const u8 = ptr_elems.cast();

            core::slice::from_raw_parts(ptr_elem_bytes, encoded_elems_byte_len)
        };

        let mut bytes = vec![0u8; encoded_mat_byte_len];
        bytes[offset0..offset1].copy_from_slice(&self.rows.to_le_bytes());
        bytes[offset1..offset2].copy_from_slice(&self.cols.to_le_bytes());
        bytes[offset2..].copy_from_slice(elems_as_bytes);

        bytes
    }
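
    // Inverse of `to_bytes`: reads the rows/cols header, then reinterprets
    // the remaining bytes as u32 elements.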
    fn from_bytes(bytes: &[u8]) -> Matrix {
        let offset0 = 0;
        let offset1 = offset0 + std::mem::size_of::<u32>();
        let offset2 = offset1 + std::mem::size_of::<u32>();

        let rows = u32::from_le_bytes(bytes[offset0..offset1].try_into().unwrap());
        let cols = u32::from_le_bytes(bytes[offset1..offset2].try_into().unwrap());

        let num_elems = (rows * cols) as usize;
        let encoded_elems_byte_len = std::mem::size_of::<u32>() * num_elems;
        let remaining_num_bytes = bytes.len() - offset2;
        assert_eq!(encoded_elems_byte_len, remaining_num_bytes);

        let elems = unsafe {
            let ptr_elems_bytes = bytes[offset2..].as_ptr();
            let ptr_elems: *const u32 = ptr_elems_bytes.cast();

            core::slice::from_raw_parts(ptr_elems, num_elems)
        }
        .to_vec();

        Matrix { rows, cols, elems }
    }
}
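
// A quick round-trip sanity check for the serialization helpers above. This
// test module is an added sketch, not part of the original gist; run it with
// `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matrix_bytes_roundtrip() {
        let m = Matrix::random(42, 3, 5);
        let decoded = Matrix::from_bytes(&m.to_bytes());
        assert_eq!(m, decoded);
    }
}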

fn main() {
    let library = VulkanLibrary::new().unwrap();
    let instance = Instance::new(
        library,
        InstanceCreateInfo {
            flags: InstanceCreateFlags::ENUMERATE_PORTABILITY,
            ..Default::default()
        },
    )
    .unwrap();

    let device_extensions = DeviceExtensions {
        khr_storage_buffer_storage_class: true,
        ..DeviceExtensions::empty()
    };
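
    // Pick a physical device exposing a queue family whose flags intersect
    // COMPUTE | TRANSFER, preferring discrete GPUs, then integrated, virtual,
    // CPU, and other device types, in that order.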
    let (physical_device, queue_family_index) = instance
        .enumerate_physical_devices()
        .unwrap()
        .filter(|p| p.supported_extensions().contains(&device_extensions))
        .filter_map(|p| {
            p.queue_family_properties()
                .iter()
                .position(|q| {
                    q.queue_flags
                        .intersects(QueueFlags::COMPUTE | QueueFlags::TRANSFER)
                })
                .map(|i| (p, i as u32))
        })
        .min_by_key(|(p, _)| match p.properties().device_type {
            PhysicalDeviceType::DiscreteGpu => 0,
            PhysicalDeviceType::IntegratedGpu => 1,
            PhysicalDeviceType::VirtualGpu => 2,
            PhysicalDeviceType::Cpu => 3,
            PhysicalDeviceType::Other => 4,
            _ => 5,
        })
        .unwrap();

    println!(
        "Using device: {} (type: {:?}). API version: {}",
        physical_device.properties().device_name,
        physical_device.properties().device_type,
        physical_device.api_version(),
    );

    let (device, queue) = {
        let (device, mut queues) = Device::new(
            physical_device,
            DeviceCreateInfo {
                enabled_extensions: device_extensions,
                queue_create_infos: vec![QueueCreateInfo {
                    queue_family_index,
                    ..Default::default()
                }],
                ..Default::default()
            },
        )
        .unwrap();

        (device, queues.next().unwrap())
    };

    let memory_allocator = Arc::new(StandardMemoryAllocator::new_default(device.clone()));

    let matrix_a = Matrix::random(13, MATRIX_A_NUM_ROWS, MATRIX_A_NUM_COLS);
    let matrix_b = Matrix::random(17, MATRIX_B_NUM_ROWS, MATRIX_B_NUM_COLS);

    println!("Matrix A = ({} X {})", matrix_a.rows, matrix_a.cols);
    println!("Matrix B = ({} X {})", matrix_b.rows, matrix_b.cols);
    println!("Matrix C = A X B = ({} X {})", matrix_a.rows, matrix_b.cols);

    let matrix_a_as_bytes = matrix_a.to_bytes();
    let matrix_b_as_bytes = matrix_b.to_bytes();

    let matrix_a_byte_len = matrix_a_as_bytes.len() as u64;
    let matrix_b_byte_len = matrix_b_as_bytes.len() as u64;
    let matrix_c_byte_len = (std::mem::size_of_val(&matrix_a.rows)
        + std::mem::size_of_val(&matrix_b.cols)
        + std::mem::size_of::<u32>() * (matrix_a.rows * matrix_b.cols) as usize)
        as u64;
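
    // A and B are uploaded via host-visible staging buffers (TRANSFER_SRC)
    // into device-local storage buffers; C is a storage buffer kept
    // host-readable so the result can be read back after the dispatch.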
    let matrix_a_host_local = Buffer::from_iter(
        memory_allocator.clone(),
        BufferCreateInfo {
            usage: BufferUsage::TRANSFER_SRC,
            ..Default::default()
        },
        AllocationCreateInfo {
            memory_type_filter: MemoryTypeFilter::HOST_SEQUENTIAL_WRITE
                | MemoryTypeFilter::PREFER_DEVICE,
            ..Default::default()
        },
        matrix_a_as_bytes,
    )
    .unwrap();
    let matrix_b_host_local = Buffer::from_iter(
        memory_allocator.clone(),
        BufferCreateInfo {
            usage: BufferUsage::TRANSFER_SRC,
            ..Default::default()
        },
        AllocationCreateInfo {
            memory_type_filter: MemoryTypeFilter::HOST_SEQUENTIAL_WRITE
                | MemoryTypeFilter::PREFER_DEVICE,
            ..Default::default()
        },
        matrix_b_as_bytes,
    )
    .unwrap();

    let matrix_a_device_local = Buffer::new_slice::<u8>(
        memory_allocator.clone(),
        BufferCreateInfo {
            usage: BufferUsage::STORAGE_BUFFER | BufferUsage::TRANSFER_DST,
            ..Default::default()
        },
        AllocationCreateInfo {
            memory_type_filter: MemoryTypeFilter::PREFER_DEVICE,
            ..Default::default()
        },
        matrix_a_byte_len,
    )
    .unwrap();
    let matrix_b_device_local = Buffer::new_slice::<u8>(
        memory_allocator.clone(),
        BufferCreateInfo {
            usage: BufferUsage::STORAGE_BUFFER | BufferUsage::TRANSFER_DST,
            ..Default::default()
        },
        AllocationCreateInfo {
            memory_type_filter: MemoryTypeFilter::PREFER_DEVICE,
            ..Default::default()
        },
        matrix_b_byte_len,
    )
    .unwrap();

    let matrix_c_buf = Buffer::new_slice::<u8>(
        memory_allocator.clone(),
        BufferCreateInfo {
            usage: BufferUsage::STORAGE_BUFFER,
            ..Default::default()
        },
        AllocationCreateInfo {
            memory_type_filter: MemoryTypeFilter::HOST_RANDOM_ACCESS
                | MemoryTypeFilter::PREFER_DEVICE,
            ..Default::default()
        },
        matrix_c_byte_len,
    )
    .unwrap();

    let command_buffer_allocator = Arc::new(StandardCommandBufferAllocator::new(
        device.clone(),
        Default::default(),
    ));
    // Transfer data to the GPU
    let transfer_command_buffer = {
        let mut builder = AutoCommandBufferBuilder::primary(
            command_buffer_allocator.clone(),
            queue.queue_family_index(),
            CommandBufferUsage::OneTimeSubmit,
        )
        .unwrap();

        builder
            .copy_buffer(CopyBufferInfo::buffers(
                matrix_a_host_local,
                matrix_a_device_local.clone(),
            ))
            .unwrap()
            .copy_buffer(CopyBufferInfo::buffers(
                matrix_b_host_local,
                matrix_b_device_local.clone(),
            ))
            .unwrap();

        builder.build().unwrap()
    };
    transfer_command_buffer
        .execute(queue.clone())
        .unwrap()
        .then_signal_fence_and_flush()
        .unwrap()
        .wait(None)
        .unwrap();
    let pipeline = {
        let compute_shader = cs::load(device.clone()).unwrap();
        let compute_shader_entry_point = compute_shader.entry_point("main").unwrap();
        let compute_stage = PipelineShaderStageCreateInfo::new(compute_shader_entry_point);

        let layout = PipelineLayout::new(
            device.clone(),
            PipelineDescriptorSetLayoutCreateInfo::from_stages([&compute_stage])
                .into_pipeline_layout_create_info(device.clone())
                .unwrap(),
        )
        .unwrap();

        ComputePipeline::new(
            device.clone(),
            None,
            ComputePipelineCreateInfo::stage_layout(compute_stage, layout.clone()),
        )
        .unwrap()
    };

    let descriptor_set_allocator = Arc::new(StandardDescriptorSetAllocator::new(
        device.clone(),
        Default::default(),
    ));
    let descriptor_set_layout = pipeline.layout().set_layouts()[0].clone();
    let descriptor_set = DescriptorSet::new(
        descriptor_set_allocator,
        descriptor_set_layout,
        [
            WriteDescriptorSet::buffer(0, matrix_a_device_local.clone()),
            WriteDescriptorSet::buffer(1, matrix_b_device_local.clone()),
            WriteDescriptorSet::buffer(2, matrix_c_buf.clone()),
        ],
        [],
    )
    .unwrap();
    let command_buffer = {
        let mut command_buffer_builder = AutoCommandBufferBuilder::primary(
            command_buffer_allocator,
            queue.queue_family_index(),
            CommandBufferUsage::MultipleSubmit,
        )
        .unwrap();

        command_buffer_builder
            .bind_pipeline_compute(pipeline.clone())
            .unwrap()
            .bind_descriptor_sets(
                PipelineBindPoint::Compute,
                pipeline.layout().clone(),
                0,
                descriptor_set,
            )
            .unwrap();
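
        // One invocation per element of C: the shader declares an 8x8 local
        // size, so dispatch ceil(rows / 8) x ceil(cols / 8) workgroups; the
        // shader bounds-checks dimensions that aren't multiples of 8.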
        unsafe {
            command_buffer_builder
                .dispatch([
                    MATRIX_A_NUM_ROWS.div_ceil(8),
                    MATRIX_B_NUM_COLS.div_ceil(8),
                    1,
                ])
                .unwrap();
        }

        command_buffer_builder.build().unwrap()
    };
    // Computing Matrix Multiplication on GPU
    let start = Instant::now();
    sync::now(device.clone())
        .then_execute(queue.clone(), command_buffer.clone())
        .unwrap()
        .then_signal_fence_and_flush()
        .unwrap()
        .wait(None)
        .unwrap();
    let gpu_tm = start.elapsed();
    println!("(a) GPU matrix multiply: {:?}", gpu_tm);

    // reading GPU-computed matrix multiplication result
    let gpu_result_as_bytes = matrix_c_buf.read().unwrap();
    let gpu_result = Matrix::from_bytes(&gpu_result_as_bytes);

    // Computing Parallel Matrix Multiplication on CPU, using Rayon
    let start = Instant::now();
    let cpu_result_rayon = matrix_a.par_mul(&matrix_b);
    let cpu_tm_rayon = start.elapsed();
    println!("(b) CPU parallel matrix multiply: {:?}", cpu_tm_rayon);

    let start = Instant::now();
    let cpu_result_single_threaded = matrix_a * matrix_b;
    let cpu_tm_single_threaded = start.elapsed();
    println!(
        "(c) CPU single-threaded matrix multiply: {:?}",
        cpu_tm_single_threaded
    );

    assert_eq!(gpu_result, cpu_result_single_threaded);
    assert_eq!(cpu_result_single_threaded, cpu_result_rayon);

    println!(
        "Speed Up (b/a): {}x\nSpeed Up (c/a): {}x\nSpeed Up (c/b): {}x",
        cpu_tm_rayon.as_nanos() / gpu_tm.as_nanos(),
        cpu_tm_single_threaded.as_nanos() / gpu_tm.as_nanos(),
        cpu_tm_single_threaded.as_nanos() / cpu_tm_rayon.as_nanos(),
    );
}

mod cs {
    // Compiles shader
    vulkano_shaders::shader! {
        ty: "compute",
        path: "./matrix_multiply.glsl",
        vulkan_version: "1.2",
    }
}

matrix_multiply.glsl

#version 460
#pragma shader_stage(compute)

layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

layout(set = 0, binding = 0) buffer readonly MatrixA
{
    uint rows;
    uint cols;
    uint[] elems;
} matrix_a;

layout(set = 0, binding = 1) buffer readonly MatrixB
{
    uint rows;
    uint cols;
    uint[] elems;
} matrix_b;

layout(set = 0, binding = 2) buffer writeonly MatrixC
{
    uint rows;
    uint cols;
    uint[] elems;
} matrix_c;
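
// Each invocation computes one element of C. Invocation (0, 0) also writes
// the rows/cols header, mirroring the byte layout `Matrix::to_bytes` produces
// on the host. `uint` arithmetic wraps mod 2^32, matching the CPU's
// wrapping_add / wrapping_mul.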
void
main()
{
    const uint row_idx = gl_GlobalInvocationID.x;
    const uint col_idx = gl_GlobalInvocationID.y;

    if (row_idx >= matrix_a.rows || col_idx >= matrix_b.cols) {
        return;
    }
    if ((row_idx == 0) && (col_idx == 0)) {
        matrix_c.rows = matrix_a.rows;
        matrix_c.cols = matrix_b.cols;
    }

    uint sum = 0;
    for (uint i = 0; i < matrix_a.cols; i++) {
        sum += matrix_a.elems[row_idx * matrix_a.cols + i] * matrix_b.elems[i * matrix_b.cols + col_idx];
    }
    matrix_c.elems[row_idx * matrix_b.cols + col_idx] = sum;
}
itzmeanjan commented Sep 6, 2021

Background

If you haven't read the post yet, this code snippet accompanies it.

Usage

  • Download this gist or clone it.
  • Set up a Rust project by executing the following commands.
# Run them inside the unzipped or cloned gist directory.
mkdir src
mv main.rs src/main.rs
  • Build & run the executable.
cargo run --release
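
After these steps the crate should look as follows (a sketch; the shader path ./matrix_multiply.glsl referenced by the vulkano_shaders::shader! macro is resolved relative to the crate root):

  Cargo.toml
  matrix_multiply.glsl
  src/main.rs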

See the README.md file for the results I got.

@seddonm1

Thank you for this code.

Updated code for the latest versions:

vulkano = "0.32.1"
vulkano-shaders = "0.32.0"
rand = "0.8.4"

matrix_multiply.rs

use rand::{rngs::StdRng, Rng, SeedableRng};
use std::time::Instant;
use vulkano::{
    buffer::{BufferUsage, CpuAccessibleBuffer, DeviceLocalBuffer},
    command_buffer::{
        allocator::StandardCommandBufferAllocator, AutoCommandBufferBuilder, CommandBufferUsage,
        PrimaryCommandBufferAbstract,
    },
    descriptor_set::{
        allocator::StandardDescriptorSetAllocator, PersistentDescriptorSet, WriteDescriptorSet,
    },
    device::{
        physical::PhysicalDeviceType, Device, DeviceCreateInfo, DeviceExtensions, QueueCreateInfo,
        QueueFlags,
    },
    instance::{Instance, InstanceCreateInfo},
    memory::allocator::StandardMemoryAllocator,
    pipeline::{Pipeline, PipelineBindPoint},
    sync::GpuFuture,
    VulkanLibrary,
};

const N: u32 = 1 << 20;

fn main() {
    let library = VulkanLibrary::new().unwrap();
    let instance = Instance::new(
        library,
        InstanceCreateInfo {
            // Enable enumerating devices that use non-conformant vulkan implementations. (ex. MoltenVK)
            enumerate_portability: true,
            ..Default::default()
        },
    )
    .unwrap();

    let device_extensions = DeviceExtensions {
        khr_storage_buffer_storage_class: true,
        ..DeviceExtensions::empty()
    };
    let (physical_device, queue_family_index) = instance
        .enumerate_physical_devices()
        .unwrap()
        .filter(|p| p.supported_extensions().contains(&device_extensions))
        .filter_map(|p| {
            p.queue_family_properties()
                .iter()
                .position(|q| {
                    q.queue_flags.intersects(&QueueFlags {
                        compute: true,
                        ..QueueFlags::empty()
                    })
                })
                .map(|i| (p, i as u32))
        })
        .min_by_key(|(p, _)| match p.properties().device_type {
            PhysicalDeviceType::DiscreteGpu => 0,
            PhysicalDeviceType::IntegratedGpu => 1,
            PhysicalDeviceType::VirtualGpu => 2,
            PhysicalDeviceType::Cpu => 3,
            PhysicalDeviceType::Other => 4,
            _ => 5,
        })
        .unwrap();

    println!(
        "Using device: {} (type: {:?}). API version: {}",
        physical_device.properties().device_name,
        physical_device.properties().device_type,
        physical_device.api_version()
    );

    let (device, mut queues) = Device::new(
        physical_device,
        DeviceCreateInfo {
            enabled_extensions: device_extensions,
            queue_create_infos: vec![QueueCreateInfo {
                queue_family_index,
                ..Default::default()
            }],
            ..Default::default()
        },
    )
    .unwrap();

    let queue = queues.next().unwrap();

    let memory_allocator = StandardMemoryAllocator::new_default(device.clone());
    let descriptor_set_allocator = StandardDescriptorSetAllocator::new(device.clone());
    let command_buffer_allocator =
        StandardCommandBufferAllocator::new(device.clone(), Default::default());

    let mut builder = AutoCommandBufferBuilder::primary(
        &command_buffer_allocator,
        queue.queue_family_index(),
        CommandBufferUsage::OneTimeSubmit,
    )
    .unwrap();

    // Deterministically produce the GPU matrices
    let matrix_a = generate_square_matrix(Some(13));
    let matrix_b = generate_square_matrix(Some(17));
    let matrix_c = generate_square_matrix(None);

    // Matrix A --- stored in GPU accessible memory, CPU can't access it
    let matrix_a_buf = DeviceLocalBuffer::from_iter(
        &memory_allocator,
        matrix_a,
        BufferUsage {
            storage_buffer: true,
            ..BufferUsage::empty()
        },
        &mut builder,
    )
    .expect("failed to create uniform buffer");

    // Matrix B --- stored in GPU accessible memory, CPU can't access it
    let matrix_b_buf = DeviceLocalBuffer::from_iter(
        &memory_allocator,
        matrix_b,
        BufferUsage {
            storage_buffer: true,
            ..BufferUsage::empty()
        },
        &mut builder,
    )
    .expect("failed to create uniform buffer");

    // Matrix C --- resulting matrix can be accessed by both CPU, GPU
    let matrix_c_buf = CpuAccessibleBuffer::from_iter(
        &memory_allocator,
        BufferUsage {
            storage_buffer: true,
            ..BufferUsage::empty()
        },
        false,
        matrix_c,
    )
    .expect("failed to create storage buffer");

    // loading compute shader, including shader compilation
    // abstracted with macro!
    let cs = cs::load(device.clone()).unwrap();

    // Create compute-pipeline for applying compute shader to vertices.
    let compute_pipeline = vulkano::pipeline::ComputePipeline::new(
        device.clone(),
        cs.entry_point("main").unwrap(),
        &(),
        None,
        |_| {},
    )
    .expect("Failed to create compute shader");

    // adding descriptors as per layout, into compute pipeline
    let layout = compute_pipeline.layout().set_layouts().get(0).unwrap();
    let set = PersistentDescriptorSet::new(
        &descriptor_set_allocator,
        layout.clone(),
        [
            WriteDescriptorSet::buffer(0, matrix_a_buf.clone()),
            WriteDescriptorSet::buffer(1, matrix_b_buf.clone()),
            WriteDescriptorSet::buffer(2, matrix_c_buf.clone()),
        ],
    )
    .unwrap();

    // only single command recorded in command buffer
    builder
        .bind_pipeline_compute(compute_pipeline.clone())
        .bind_descriptor_sets(
            PipelineBindPoint::Compute,
            compute_pipeline.layout().clone(),
            0,
            set,
        )
        .dispatch([1024 / 8, 1024 / 4, 1])
        .unwrap();
    let command_buffer = builder.build().unwrap();

    // Computing Matrix Multiplication on GPU
    let start = Instant::now();
    let finished = command_buffer.execute(queue.clone()).unwrap();
    finished
        .then_signal_fence_and_flush()
        .unwrap()
        .wait(None)
        .unwrap();
    let gpu_tm = start.elapsed();

    // reading GPU-computed matrix multiplication result
    let gpu_result = matrix_c_buf.read().unwrap();

    // Deterministically produce the CPU matrices
    let r_matrix_a = generate_square_matrix(Some(13)).collect::<Vec<i32>>();
    let r_matrix_b = generate_square_matrix(Some(17)).collect::<Vec<i32>>();

    // Computing Matrix Multiplication on CPU, and asserting !
    let start = Instant::now();
    for i in 0..1024 {
        for j in 0..1024 {
            let mut sum = 0i32;
            for k in 0..1024 {
                sum += r_matrix_a[i * 1024 + k] * r_matrix_b[k * 1024 + j];
            }
            assert_eq!(sum, gpu_result[i * 1024 + j]);
        }
    }
    let cpu_tm = start.elapsed();

    println!(
        "GPU matrix multiply: {:?}\nCPU matrix multiply: {:?}\nSpeed Up: {}x",
        gpu_tm,
        cpu_tm,
        cpu_tm.as_nanos() / gpu_tm.as_nanos()
    );
}

// reproducible random matrix generator, as single dimensional iterator
fn generate_square_matrix(seed: Option<u64>) -> Box<dyn std::iter::ExactSizeIterator<Item = i32>> {
    match seed {
        Some(seed) => {
            let mut rng = StdRng::seed_from_u64(seed);
            Box::new((0..N).map(move |_| rng.gen::<i32>()))
        }
        None => Box::new((0..N).map(|_| 0)),
    }
}

mod cs {
    // does shader compilation
    vulkano_shaders::shader! {
        ty: "compute",
        path: "./matrix_multiply.glsl",
        vulkan_version: "1.2",
    }
}

Result:

Using device: AMD Radeon Pro 5500M (type: DiscreteGpu). API version: 1.2.231
GPU matrix multiply: 41.683247ms
CPU matrix multiply: 1.453689134s
Speed Up: 34x

@itzmeanjan

Thanks @seddonm1


mjaric commented Jan 14, 2023

The comparison is not fair: the CPU is doing the calculation in a single thread. Computing the dot product of those two arrays (which is more than the shader does) using ndarray takes 275 microseconds on a 16-vCore CPU.

Using device: Radeon RX 590 Series (type: DiscreteGpu). API version: 1.3.209
GPU matrix multiply: 28.4929ms
CPU matrix multiply: 548.6µs <-- dot product using ndarray
Speed Up: 0x

Anyhow, the example is nice :) TNX!
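
For reference, here is a minimal sketch of an ndarray-based comparison along the lines described above (an illustration, not @mjaric's actual code; it assumes ndarray 0.15+ as a dependency and i32 elements, and multi-threading would additionally require enabling a BLAS backend):

use ndarray::Array2;
use std::time::Instant;

fn main() {
    // 1024 x 1024 operands, matching the benchmark dimensions discussed here.
    let a: Array2<i32> = Array2::from_elem((1024, 1024), 3);
    let b: Array2<i32> = Array2::from_elem((1024, 1024), 5);

    let start = Instant::now();
    let c = a.dot(&b); // matrix product; single-threaded without a BLAS backend
    println!("ndarray matrix multiply: {:?}", start.elapsed());

    // Every element of C is the sum of 1024 products of 3 * 5.
    assert_eq!(c[(0, 0)], 3 * 5 * 1024);
}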


macchky commented Jul 13, 2023

Tested with an RX 6700 XT and a Ryzen 7 5800X3D

Device: AMD Radeon RX 6700 XT
Vulkan API: 1.2.0
Queue Count: 1  Compute: true   Graphics: true
Queue Count: 2  Compute: true   Graphics: false
Queue Count: 2  Compute: false  Graphics: false
GPU matrix multiply: 10.1086ms
CPU matrix multiply: 2.7680681s
Speed Up: 273

@itzmeanjan

Thanks for reporting this @macchky.

@itzmeanjan

Totally agreed with you @mjaric.

@minghuaw

I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect the performance?

@itzmeanjan

I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect the performance?

There could be many reasons why it ran slowly on a given platform. As I've never tried running it on WSL, I can't really say.


minghuaw commented Jul 24, 2023

I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect the performance?

There could be many reasons why it ran slowly on a given platform. As I've never tried running it on WSL, I can't really say.

Seems like WSL is indeed the cause of the low performance. The same code takes 10 ms to complete on Windows but over 600 ms in WSL.

@itzmeanjan

I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect the performance?

There could be many reasons why it ran slowly on a given platform. As I've never tried running it on WSL, I can't really say.

Seems like WSL is indeed the cause of the low performance. The same code takes 10 ms to complete on Windows but over 600 ms in WSL.

Totally possible.


airlied commented Mar 15, 2024

Random comment: you should rework this to use VK_KHR_cooperative_matrix, as a good example to compare against the native shader.

@itzmeanjan

Random comment: you should rework this to use VK_KHR_cooperative_matrix, as a good example to compare against the native shader.

Thanks for the suggestion, though I'm not actively maintaining this. You may send me a patch and I'll update the gist.

@itzmeanjan

I just updated this to work with the latest version of vulkano; the numbers look better. See https://gist.github.com/itzmeanjan/84613bc7595372c5e6b6c22481d42f9a?permalink_comment_id=3883582#gistcomment-3883582.

Addressed @mjaric's concern that the comparison was not fair, which I totally agree with, by adding a rayon-based parallel CPU matrix multiplication implementation.

@itzmeanjan

Updated the gist again; it now supports parallel matrix multiplication on the GPU with any two matrices of compatible dimensions.


macchky commented Mar 17, 2025

Tested with an RX 6700 XT and a Ryzen 7 5800X3D

Device: AMD Radeon RX 6700 XT
Vulkan API: 1.2.0
Queue Count: 1  Compute: true   Graphics: true
Queue Count: 2  Compute: true   Graphics: false
Queue Count: 2  Compute: false  Graphics: false
GPU matrix multiply: 10.1086ms
CPU matrix multiply: 2.7680681s
Speed Up: 273

Tested with an RX 6700 XT on a Ryzen 7 9700X

Using device: AMD Radeon RX 6700 XT (type: DiscreteGpu). API version: 1.3.281
Matrix A = (1774 X 2048)
Matrix B = (2048 X 1024)
Matrix C = A X B = (1774 X 1024)
(a) GPU matrix multiply: 46.6836ms
(b) CPU parallel matrix multiply: 786.9838ms
(c) CPU single-threaded matrix multiply: 13.3806369s
Speed Up (b/a): 16x
Speed Up (c/a): 286x
Speed Up (c/b): 17x

The GPU time is slower than before.

@itzmeanjan

I don't think so. Before, $A_{1024 \times 1024} \times B_{1024 \times 1024}$ was being benchmarked. Now the matrix dimensions have changed, to ensure that the operand matrices don't need to be square.
