@Ristovski
Last active March 5, 2026 12:50
vkperf (0.99.5) tests various performance characteristics of Vulkan devices.
Devices in the system:
AMD Radeon Graphics (RADV RENOIR)
NVIDIA GeForce RTX 4070 Ti SUPER
llvmpipe (LLVM 19.1.7, 256 bits)
Selected device:
NVIDIA GeForce RTX 4070 Ti SUPER
VendorID: 0x10de (Nvidia)
DeviceID: 0x2705
Vulkan version: 1.4.303
Driver version: 570.133.7.0 (2392932800, 0x8ea141c0)
Driver name: NVIDIA
Driver info: 570.133.07
DriverID: NvidiaProprietary
Driver conformance version: 1.4.1.0
GPU memory: 16GiB (16376MiB)
Max memory allocations: 4294967295
Standard (non-sparse) buffer alignment: 16
Number of triangles for tests: 1000000
Sparse mode for tests: None
Timestamp number of bits: 64
Timestamp period: 1ns
Vulkan Instance version: 1.4.328
Operating system: < unknown, non-Windows >
Processor: AMD Ryzen 7 5700G with Radeon Graphics
Triangle throughput:
Triangle list (triangle list primitive type,
single per-scene vkCmdDraw() call, attributeless,
constant VS output): 10.38 giga-triangles/s
Indexed triangle list (triangle list primitive type, single
per-scene vkCmdDrawIndexed() call, no vertices shared between triangles,
attributeless, constant VS output): 10.38 giga-triangles/s
Indexed triangle list that reuses two indices of the previous triangle
(triangle list primitive type, single per-scene vkCmdDrawIndexed() call,
attributeless, constant VS output): 20.34 giga-triangles/s
Triangle strips of various lengths
(per-strip vkCmdDraw() call, 1 to 1000 triangles per strip,
attributeless, constant VS output):
strip length 1: 302.1 mega-triangles/s
strip length 2: 606.9 mega-triangles/s
strip length 5: 1.521 giga-triangles/s
strip length 8: 2.435 giga-triangles/s
strip length 10: 3.042 giga-triangles/s
strip length 20: 6.103 giga-triangles/s
strip length 25: 7.629 giga-triangles/s
strip length 40: 12.36 giga-triangles/s
strip length 50: 15.50 giga-triangles/s
strip length 100: 30.51 giga-triangles/s
strip length 125: 26.39 giga-triangles/s
strip length 1000: 28.72 giga-triangles/s
Indexed triangle strips of various lengths
(per-strip vkCmdDrawIndexed() call, 1-1000 triangles per strip,
no vertices shared between strips, each index used just once,
attributeless, constant VS output):
strip length 1: 277.1 mega-triangles/s
strip length 2: 555.4 mega-triangles/s
strip length 5: 1.391 giga-triangles/s
strip length 8: 2.229 giga-triangles/s
strip length 10: 2.790 giga-triangles/s
strip length 20: 5.580 giga-triangles/s
strip length 25: 7.025 giga-triangles/s
strip length 40: 11.22 giga-triangles/s
strip length 50: 14.15 giga-triangles/s
strip length 100: 28.72 giga-triangles/s
strip length 125: 31.50 giga-triangles/s
strip length 1000: 28.72 giga-triangles/s
Primitive restart indexed triangle strips of various lengths
(single per-scene vkCmdDrawIndexed() call, 1-1000 triangles per strip,
no vertices shared between strips, each index used just once,
attributeless, constant VS output):
strip length 1: 1.903 giga-triangles/s
strip length 2: 3.685 giga-triangles/s
strip length 5: 8.346 giga-triangles/s
strip length 8: 12.20 giga-triangles/s
strip length 1000: 27.90 giga-triangles/s
Primitive restart, each triangle is replaced by one -1
(single per-scene vkCmdDrawIndexed() call,
no fragments produced): 2.077 giga-triangles/s
Primitive restart, only zeros in the index buffer
(single per-scene vkCmdDrawIndexed() call,
no fragments produced): 30.51 giga-triangles/s
Instancing throughput of vkCmdDraw()
(one triangle per instance, constant VS output, one draw call,
attributeless): 2.146 giga-triangles/s
Instancing throughput of vkCmdDrawIndexed()
(one triangle per instance, constant VS output, one draw call,
attributeless): 2.077 giga-triangles/s
Instancing throughput of vkCmdDrawIndirect()
(one triangle per instance, one indirect draw call,
one indirect record, attributeless): 2.141 giga-triangles/s
Instancing throughput of vkCmdDrawIndexedIndirect()
(one triangle per instance, one indirect draw call,
one indirect record, attributeless): 2.077 giga-triangles/s
vkCmdDraw() throughput
(per-triangle vkCmdDraw() in command buffer,
attributeless, constant VS output): 302.8 mega-triangles/s
vkCmdDrawIndexed() throughput
(per-triangle vkCmdDrawIndexed() in command buffer,
attributeless, constant VS output): 276.9 mega-triangles/s
VkDrawIndirectCommand processing throughput
(per-triangle VkDrawIndirectCommand, one vkCmdDrawIndirect() call,
attributeless): 187.8 mega-indirectRecords/s
VkDrawIndirectCommand processing throughput with stride 32
(per-triangle VkDrawIndirectCommand, one vkCmdDrawIndirect() call,
attributeless): 120.5 mega-indirectRecords/s
VkDrawIndexedIndirectCommand processing throughput
(per-triangle VkDrawIndexedIndirectCommand,
1x vkCmdDrawIndexedIndirect() call,
attributeless): 143.6 mega-indirectRecords/s
VkDrawIndexedIndirectCommand processing throughput with stride 32
(per-triangle VkDrawIndexedIndirectCommand,
1x vkCmdDrawIndexedIndirect() call,
attributeless): 117.6 mega-indirectRecords/s
Vertex and geometry shader throughput:
VS throughput using vkCmdDraw() - minimal VS that just writes
constant output position (per-scene vkCmdDraw() call,
no attributes, no fragments produced): 31.16 giga-vertices/s
VS throughput using vkCmdDrawIndexed() - minimal VS that just writes
constant output position (per-scene vkCmdDrawIndexed() call,
no attributes, no fragments produced): 31.16 giga-vertices/s
VS producing output position from VertexIndex and InstanceIndex
using vkCmdDraw() (single per-scene vkCmdDraw() call,
attributeless, no fragments produced): 31.16 giga-vertices/s
VS producing output position from VertexIndex and InstanceIndex
using vkCmdDrawIndexed() (single per-scene vkCmdDrawIndexed() call,
attributeless, no fragments produced): 31.16 giga-vertices/s
GS one triangle in and no triangle out
(empty VS, attributeless): 3.577 giga-invocations/s
GS one triangle in and single constant triangle out
(empty VS, attributeless): 3.577 giga-invocations/s
GS one triangle in and two constant triangles out
(empty VS, attributeless): 3.577 giga-invocations/s
Attributes and buffers:
One attribute performance - 1x vec4 attribute
(attribute used, per-scene draw call): 30.83 giga-vertices/s
One buffer performance - 1x vec4 buffer
(1x read in VS, per-scene draw call): 30.83 giga-vertices/s
One buffer performance - 1x vec3 buffer
(1x read in VS, one draw call): 31.16 giga-vertices/s
Two attributes performance - 2x vec4 attribute
(both attributes used): 19.92 giga-vertices/s
Two buffers performance - 2x vec4 buffer
(both buffers read in VS): 19.66 giga-vertices/s
Two buffers performance - 2x vec3 buffer
(both buffers read in VS): 25.92 giga-vertices/s
Two interleaved attributes performance - 2x vec4
(2x vec4 attribute fetched from the single buffer in VS
from consecutive buffer locations): 19.79 giga-vertices/s
Two interleaved buffers performance - 2x vec4
(2x vec4 fetched from the single buffer in VS
from consecutive buffer locations): 20.92 giga-vertices/s
Packed buffer performance - 1x buffer using 32-byte struct unpacked
into position+normal+color+texCoord: 20.06 giga-vertices/s
Packed attribute performance - 2x uvec4 attribute unpacked
into position+normal+color+texCoord: 19.92 giga-vertices/s
Packed buffer performance - 2x uvec4 buffers unpacked
into position+normal+color+texCoord: 19.66 giga-vertices/s
Packed buffer performance - 2x buffer using 16-byte struct unpacked
into position+normal+color+texCoord: 19.66 giga-vertices/s
Packed buffer performance - 2x buffer using 16-byte struct
read multiple times and unpacked
into position+normal+color+texCoord: 19.66 giga-vertices/s
Four attributes performance - 4x vec4 attribute
(all attributes used): 10.10 giga-vertices/s
Four buffers performance - 4x vec4 buffer
(all buffers read in VS): 10.53 giga-vertices/s
Four buffers performance - 4x vec3 buffer
(all buffers read in VS): 13.81 giga-vertices/s
Four interleaved attributes performance - 4x vec4
(4x vec4 fetched from the single buffer
on consecutive locations): 10.10 giga-vertices/s
Four interleaved buffers performance - 4x vec4
(4x vec4 fetched from the single buffer
on consecutive locations): 10.61 giga-vertices/s
Four attributes performance - 2x vec4 and 2x uint attribute
(2x vec4f32 + 2x vec4u8, 2x conversion from vec4u8
to vec4): 15.58 giga-vertices/s
Transformations:
Matrix performance - one matrix as uniform for all triangles
(matrix read in VS,
coordinates in vec4 attribute): 30.83 giga-vertices/s
Matrix performance - per-triangle matrix in buffer
(different matrix read for each triangle in VS,
coordinates in vec4 attribute): 17.13 giga-vertices/s
Matrix performance - per-triangle matrix in attribute
(triangles are instanced and each triangle receives a different matrix,
coordinates in vec4 attribute): 5.847 giga-vertices/s
Matrix performance - one matrix in buffer for all triangles and 2x uvec4
packed attributes (each triangle reads matrix from the same place in
the buffer, attributes unpacked): 19.92 giga-vertices/s
Matrix performance - per-triangle matrix in the buffer and 2x uvec4 packed
attributes (each triangle reads a different matrix from a buffer,
attributes unpacked): 12.15 giga-vertices/s
Matrix performance - per-triangle matrix in buffer and 2x uvec4 packed
buffers (each triangle reads a different matrix from a buffer,
packed buffers unpacked): 12.68 giga-vertices/s
Matrix performance - GS reads per-triangle matrix from buffer and 2x uvec4
packed buffers (each triangle reads a different matrix from a buffer,
packed buffers unpacked in GS): 9.212 giga-vertices/s
Matrix performance - per-triangle matrix in buffer and four attributes
(each triangle reads a different matrix from a buffer,
4x vec4 attribute): 7.609 giga-vertices/s
Matrix performance - 1x per-triangle matrix in buffer, 2x uniform matrix
and 2x uvec4 packed attributes (uniform view and projection matrices
multiplied with per-triangle model matrix and with unpacked attributes of
position, normal, color and texCoord): 12.15 giga-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer,
3x uniform matrix (mat4+mat4+mat3) and 2x uvec4 packed attributes
(full position and normal computation with MVP and normal matrices,
all matrices and attributes multiplied): 9.668 giga-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer,
2x non-changing matrix (mat4+mat4) in push constants,
1x constant matrix (mat3) and 2x uvec4 packed attributes (all
matrices and attributes multiplied): 9.668 giga-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer, 2x
non-changing matrix (mat4+mat4) in specialization constants, 1x constant
matrix (mat3) defined by VS code and 2x uvec4 packed attributes (all
matrices and attributes multiplied): 9.511 giga-vertices/s
Matrix performance - 2x per-triangle matrix (mat4+mat3) in buffer,
3x constant matrix (mat4+mat4+mat3) defined by VS code and
2x uvec4 packed attributes (all matrices and attributes
multiplied): 9.574 giga-vertices/s
Matrix performance - GS five matrices processing, 2x per-triangle matrix
(mat4+mat3) in buffer, 3x uniform matrix (mat4+mat4+mat3) and
2x uvec4 packed attributes passed through VS (all matrices and
attributes multiplied): 8.394 giga-vertices/s
Matrix performance - GS five matrices processing, 2x per-triangle matrix
(mat4+mat3) in buffer, 3x uniform matrix (mat4+mat4+mat3) and
2x uvec4 packed data read from buffer in GS (all matrices and attributes
multiplied): 7.550 giga-vertices/s
Textured Phong and Matrix performance - 2x per-triangle matrix
in buffer (mat4+mat3), 3x uniform matrix (mat4+mat4+mat3) and
four attributes (vec4f32+vec3f32+vec4u8+vec2f32),
no fragments produced: 8.394 giga-vertices/s
Textured Phong and Matrix performance - 1x per-triangle matrix
in buffer (mat4), 2x uniform matrix (mat4+mat4) and
four attributes (vec4f32+vec3f32+vec4u8+vec2f32),
no fragments produced: 10.50 giga-vertices/s
Textured Phong and Matrix performance - 1x per-triangle matrix
in buffer (mat4), 2x uniform matrix (mat4+mat4) and 2x uvec4 packed
attribute, no fragments produced: 12.15 giga-vertices/s
Textured Phong and Matrix performance - 1x per-triangle row-major matrix
in buffer (mat4), 2x uniform not-row-major matrix (mat4+mat4),
2x uvec4 packed attributes,
no fragments produced: 12.15 giga-vertices/s
Textured Phong and Matrix performance - 1x per-triangle mat4x3 matrix
in buffer, 2x uniform matrix (mat4+mat4) and 2x uvec4 packed attributes,
no fragments produced: 13.37 giga-vertices/s
Textured Phong and Matrix performance - 1x per-triangle row-major mat4x3
matrix in buffer, 2x uniform matrix (mat4+mat4), 2x uvec4 packed
attribute, no fragments produced: 13.43 giga-vertices/s
Textured Phong and PAT performance - PAT v1 (Position-Attitude-Transform,
performing translation (vec3) and rotation (quaternion as vec4) using
implementation 1), PAT is per-triangle 2x vec4 in buffer,
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 15.02 giga-vertices/s
Textured Phong and PAT performance - PAT v2 (Position-Attitude-Transform,
performing translation (vec3) and rotation (quaternion as vec4) using
implementation 2), PAT is per-triangle 2x vec4 in buffer,
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 14.94 giga-vertices/s
Textured Phong and PAT performance - PAT v3 (Position-Attitude-Transform,
performing translation (vec3) and rotation (quaternion as vec4) using
implementation 3), PAT is per-triangle 2x vec4 in buffer,
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 15.02 giga-vertices/s
Textured Phong and PAT performance - constant single PAT v2 sourced from
the same index in buffer (vec3+vec4), 2x uniform matrix (mat4+mat4),
2x uvec4 packed attributes,
no fragments produced: 19.92 giga-vertices/s
Textured Phong and PAT performance - indexed draw call, per-triangle PAT v2
in buffer (vec3+vec4), 2x uniform matrix (mat4+mat4), 2x uvec4 packed
attribute, no fragments produced: 13.75 giga-vertices/s
Textured Phong and PAT performance - indexed draw call, constant single
PAT v2 sourced from the same index in buffer (vec3+vec4),
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 17.75 giga-vertices/s
Textured Phong and PAT performance - primitive restart, indexed draw call,
per-triangle PAT v2 in buffer (vec3+vec4), 2x uniform matrix (mat4+mat4),
2x uvec4 packed attributes,
no fragments produced: 5.710 giga-vertices/s
Textured Phong and PAT performance - primitive restart, indexed draw call,
constant single PAT v2 sourced from the same index in buffer (vec3+vec4),
2x uniform matrix (mat4+mat4), 2x uvec4 packed attributes,
no fragments produced: 5.710 giga-vertices/s
Textured Phong and double precision matrix performance - double precision
per-triangle matrix in buffer (dmat4), double precision per-scene view
matrix in uniform (dmat4), both matrices converted to single precision
before computations, single precision per-scene perspective matrix in
uniform (mat4), single precision vertex positions, packed attributes
(2x uvec4), no fragments produced: 8.719 giga-vertices/s
Textured Phong and double precision matrix performance - double precision
per-triangle matrix in buffer (dmat4), double precision per-scene view
matrix in uniform (dmat4), both matrices multiplied in double precision,
single precision vertex positions, single precision per-scene
perspective matrix in uniform (mat4), packed attributes (2x uvec4),
no fragments produced: 5.203 giga-vertices/s
Textured Phong and double precision matrix performance - double precision
per-triangle matrix in buffer (dmat4), double precision per-scene view
matrix in uniform (dmat4), both matrices multiplied in double precision,
double precision vertex positions (dvec3), single precision per-scene
perspective matrix in uniform (mat4), packed attributes (3x uvec4),
no fragments produced: 5.415 giga-vertices/s
Textured Phong and double precision matrix performance using GS - double
precision per-triangle matrix in buffer (dmat4), double precision
per-scene view matrix in uniform (dmat4), both matrices multiplied in
double precision, double precision vertex positions (dvec3), single
precision per-scene perspective matrix in uniform (mat4), packed
attributes (3x uvec4),
no fragments produced: 2.013 giga-vertices/s
Fragment throughput:
Single full-framebuffer quad,
constant color FS: 135.0 giga-fragments/s
10x full-framebuffer quad,
constant color FS: 202.5 giga-fragments/s
Four smooth interpolators (4x vec4),
10x fullscreen quad: 164.6 giga-fragments/s
Four flat interpolators (4x vec4),
10x fullscreen quad: 174.5 giga-fragments/s
Four textured phong interpolators (vec3+vec3+vec4+vec2),
10x fullscreen quad: 200.4 giga-fragments/s
Textured Phong, packed uniforms (four smooth interpolators
(vec3+vec3+vec4+vec2), 4x uniform (material (56 byte) +
globalAmbientLight (12 byte) + light (64 byte) + sampler2D),
10x fullscreen quad): 120.5 giga-fragments/s
Textured Phong, not packed uniforms (four smooth interpolators
(vec3+vec3+vec4+vec2), 4x uniform (material (72 byte) +
globalAmbientLight (12 byte) + light (80 byte) + sampler2D),
10x fullscreen quad): 120.5 giga-fragments/s
Simplified Phong, no texture, no specular (2x smooth interpolator
(vec3+vec3), 3x uniform (material (vec4+vec4) + globalAmbientLight
(vec3) + light (48 bytes: position+attenuation+ambient+diffuse)),
10x fullscreen quad): 198.5 giga-fragments/s
Simplified Phong, no texture, no specular, single uniform
(2x smooth interpolator (vec3+vec3), 1x uniform
(material+globalAmbientLight+light (vec4+vec4+vec4 + 3x vec4)),
10x fullscreen quad): 196.6 giga-fragments/s
Constant color from uniform, 1x uniform (vec4) in FS,
10x fullscreen quad: 202.5 giga-fragments/s
Constant color from uniform, 1x uniform (uint) in FS,
10x fullscreen quad: 202.5 giga-fragments/s
Transfer throughput:
Transfer of consecutive blocks:
4 bytes: 12.4224ns per transfer (0.299885 GiB/s)
4 bytes: 8.9664ns per transfer (0.415472 GiB/s)
8 bytes: 11.0016ns per transfer (0.677227 GiB/s)
16 bytes: 11.1648ns per transfer (1.33466 GiB/s)
32 bytes: 11.4688ns per transfer (2.59856 GiB/s)
64 bytes: 11.5456ns per transfer (5.16254 GiB/s)
128 bytes: 12.4512ns per transfer (9.57412 GiB/s)
256 bytes: 17.4805ns per transfer (13.6391 GiB/s)
512 bytes: 28.9609ns per transfer (16.4648 GiB/s)
1024 bytes: 53.5938ns per transfer (17.7945 GiB/s)
2048 bytes: 104.031ns per transfer (18.3344 GiB/s)
4096 bytes: 204.25ns per transfer (18.6766 GiB/s)
8192 bytes: 405.75ns per transfer (18.8032 GiB/s)
16384 bytes: 812ns per transfer (18.7916 GiB/s)
32768 bytes: 1623ns per transfer (18.8032 GiB/s)
65536 bytes: 3246ns per transfer (18.8032 GiB/s)
131072 bytes: 6494ns per transfer (18.7974 GiB/s)
262144 bytes: 12672ns per transfer (19.2661 GiB/s)
524288 bytes: 2048ns per transfer (238.419 GiB/s)
1048576 bytes: 3584ns per transfer (272.478 GiB/s)
2097152 bytes: 6144ns per transfer (317.891 GiB/s)
Transfer of spaced blocks:
4 bytes: 8.9632ns per transfer (0.415621 GiB/s)
4 bytes: 8.9632ns per transfer (0.415621 GiB/s)
8 bytes: 8.96ns per transfer (0.831538 GiB/s)
16 bytes: 8.9632ns per transfer (1.66248 GiB/s)
32 bytes: 8.9696ns per transfer (3.32259 GiB/s)
64 bytes: 8.9792ns per transfer (6.63808 GiB/s)
128 bytes: 9.9168ns per transfer (12.0209 GiB/s)
256 bytes: 16.4648ns per transfer (14.4805 GiB/s)
512 bytes: 25.875ns per transfer (18.4285 GiB/s)
1024 bytes: 46.5156ns per transfer (20.5022 GiB/s)
2048 bytes: 90.0938ns per transfer (21.1707 GiB/s)
4096 bytes: 180.375ns per transfer (21.1487 GiB/s)
8192 bytes: 346.375ns per transfer (22.0264 GiB/s)
16384 bytes: 671.25ns per transfer (22.7319 GiB/s)
32768 bytes: 1381.5ns per transfer (22.0902 GiB/s)
65536 bytes: 2910ns per transfer (20.9743 GiB/s)
131072 bytes: 5838ns per transfer (20.9096 GiB/s)
262144 bytes: 10752ns per transfer (22.7065 GiB/s)
524288 bytes: 13312ns per transfer (36.6798 GiB/s)
1048576 bytes: 17440ns per transfer (55.9956 GiB/s)
2097152 bytes: 20480ns per transfer (95.3674 GiB/s)
Measurement statistics:
Triangle throughput measurement time: 10.5 seconds using 413 test rounds.
Vertex throughput measurement time: 0.505 seconds using 413 test rounds.
Attribute and Buffer measurement time: 1.37 seconds using 413 test rounds.
Transformation measurement time: 4.6 seconds using 413 test rounds.
Fragment throughput measurement time: 0.504 seconds using 413 test rounds.
Transfer throughput measurement time: 1.58 seconds using 413 test rounds.
Total device time: 18.5 seconds.
Total real time: 20 seconds.
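
The GiB/s figures in the transfer throughput section appear to follow directly from the block size and the reported time per transfer. The sketch below reproduces that arithmetic; it is not taken from vkperf itself, and the function name `gib_per_second` is made up for illustration. It assumes 1 GiB = 1024^3 bytes, which matches the values in the log.

```python
GIB = 1024 ** 3  # bytes per GiB (assumption: the log uses binary gigabytes)


def gib_per_second(block_bytes: int, ns_per_transfer: float) -> float:
    """Convert one transfer's size and duration into GiB/s."""
    bytes_per_second = block_bytes / (ns_per_transfer * 1e-9)
    return bytes_per_second / GIB


if __name__ == "__main__":
    # (block size in bytes, ns per transfer) taken from the consecutive-block rows
    samples = [(4, 12.4224), (4096, 204.25), (2097152, 6144.0)]
    for size, ns in samples:
        print(f"{size} bytes: {gib_per_second(size, ns):.4f} GiB/s")
```

Feeding in any row's byte count and nanosecond figure should give a value close to the one printed in parentheses on that row, up to rounding.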