levelzeronotes.md

Intel GPU on Win11 22H2 WSL2 Ubuntu 20.04, gcc 9.4.0

Notes of support of Intel GPU in WSL can be found here https://github.com/intel/compute-runtime/blob/master/WSL.md

My test is in a laptop Lenovo Carbon X1 Gen 10 with processor i7-1280p. I installed the latest Intel GPU driver (as in 9/30/2022) gfx_win_101.3430_101.2111.

I am using the follow configuration for WSL in c:\User\myuser\.wslconfig:

[wsl2]
guiApplications=true
memory=3GB
processors=2

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver https://github.com/intel/compute-runtime

The intructions for installing the latest igc driver can be found here

Compiling Level Zero backend

git clone [email protected]:oneapi-src/level-zero.git
cd level-zero
mkdir build
cmake ..
cmake --build . --config Release

Testing device

Then ./bin/zello_world, prints out:

zelLoaderGetVersions number of components found: 1
Version 0
Name: loader
Major: 1
Minor: 8
Patch: 8
Found ZE_DEVICE_TYPE_GPU device...
Driver version: 16998171
API version: 1.3
Device::properties_t::stype : DEVICE_PROPERTIES
Device::properties_t::pNext : 0x0
Device::properties_t::type : ZE_DEVICE_TYPE_GPU
Device::properties_t::vendorId : 32902
Device::properties_t::deviceId : 18086
Device::properties_t::flags : Device::{ PROPERTY_FLAG_INTEGRATED }
Device::properties_t::subdeviceId : 0
Device::properties_t::coreClockRate : 1450
Device::properties_t::maxMemAllocSize : 1073741824
Device::properties_t::maxHardwareContexts : 65536
Device::properties_t::maxCommandQueuePriority : 0
Device::properties_t::numThreadsPerEU : 7
Device::properties_t::physicalEUSimdWidth : 8
Device::properties_t::numEUsPerSubslice : 8
Device::properties_t::numSubslicesPerSlice : 12
Device::properties_t::numSlices : 1
Device::properties_t::timerResolution : 52
Device::properties_t::timestampValidBits : 36
Device::properties_t::kernelTimestampValidBits : 32
Device::properties_t::uuid : device_uuid_t::id : [ 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ]

Device::properties_t::name : Intel(R) Graphics [0x46a6]

stype : DEVICE_COMPUTE_PROPERTIES
pNext : 0x0
maxTotalGroupSize : 256
maxGroupSizeX : 256
maxGroupSizeY : 256
maxGroupSizeZ : 256
maxGroupCountX : 4294967295
maxGroupCountY : 4294967295
maxGroupCountZ : 4294967295
maxSharedLocalMemory : 65536
numSubGroupSizes : 3
subGroupSizes : [ 8, 16, 32, 0, 0, 0, 0, 0 ]

stype : DEVICE_MEMORY_PROPERTIES
pNext : 0x0
flags : Device::{ 0 }
maxClockRate : 0
maxBusWidth : 64
totalSize : 13604245504
name : DDR

ze_device_memory_access_properties_t.stype : DEVICE_MEMORY_ACCESS_PROPERTIES
ze_device_memory_access_properties_t.pNext : 0x0
ze_device_memory_access_properties_t.hostAllocCapabilities : Device::{ MEMORY_ACCESS_CAP_FLAG_RW | MEMORY_ACCESS_CAP_FLAG_ATOMIC }
ze_device_memory_access_properties_t.deviceAllocCapabilities : Device::{ MEMORY_ACCESS_CAP_FLAG_RW | MEMORY_ACCESS_CAP_FLAG_ATOMIC }
ze_device_memory_access_properties_t.sharedSingleDeviceAllocCapabilities : Device::{ MEMORY_ACCESS_CAP_FLAG_RW | MEMORY_ACCESS_CAP_FLAG_ATOMIC }
ze_device_memory_access_properties_t.sharedCrossDeviceAllocCapabilities : Device::{ 0 }
ze_device_memory_access_properties_t.sharedSystemAllocCapabilities : Device::{ 0 }

ze_device_cache_properties_t.stype : DEVICE_CACHE_PROPERTIES
ze_device_cache_properties_t.pNext : 0x0
ze_device_cache_properties_t.flags : Device::{ 0 }
ze_device_cache_properties_t.cacheSize : 1048576

ze_device_image_properties_t.stype : DEVICE_IMAGE_PROPERTIES
ze_device_image_properties_t.pNext : 0x0
ze_device_image_properties_t.maxImageDims1D : 16384
ze_device_image_properties_t.maxImageDims2D : 16384
ze_device_image_properties_t.maxImageDims3D : 2048
ze_device_image_properties_t.maxImageBufferSize : 67108864
ze_device_image_properties_t.maxImageArraySlices : 2048
ze_device_image_properties_t.maxSamplers : 16
ze_device_image_properties_t.maxReadImageArgs : 128
ze_device_image_properties_t.maxWriteImageArgs : 128

Congratulations, the device completed execution!

Getting and installing DPC++ compiler from LLVM/Intel fork

The DPCPP compiler could be found here https://github.com/intel/llvm/releases, be aware that the installation file is >500MB.

Then just load the compiler as:

tar -zxvf dpcpp-compiler.tar.gz
source ./dpcpp_compiler/startup.sh

Testing Level Zero (optional)

I followed this post from @jjfumero https://jjfumero.github.io/posts/2021/09/introduction-to-level-zero/ .

Then I cloned @jjfumer 's repo and tested his MxM code following this prescription:

git clone [email protected]:jjfumero/codeBlogArticles.git
cd codeBlogArticles/june2021/levelZero_MxM
export LEVEL_ZERO_ROOT=/path/to/level-zero-code
export ZE_SHARED_LOADER=$LEVEL_ZERO_ROOT/build/lib/libze_loader.so
. sources.sh
make
./gen-spirv-sh   ## Generate the SPIR-V code from the OpenCL kernel using CLANG and LLVM
./mxm

NOTE: If you get the follow error:

clang: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such file or directory

Just install the libtinfo.so with sudo apt update && sudo apt install -y libtinfo5.

The output of running ./mxm with my settings was:

Device   : Intel(R) Graphics [0x46a6]
Type     : GPU
Vendor ID: 8086
#Queue Groups: 1
Group X: 256
Group Y: 1
GPU Kernel = 143703775 [ns]
SEQ Kernel = 241019594105 [ns]
Speedup = 1677x

Matrix Multiply validation PASSED

SYCL testing on Level Zero

For testing SYCL support I followd this guide: https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md .

The first very simple test program in this guide is this:

#include <sycl/sycl.hpp>

int main() {
  // Creating buffer of 4 ints to be used inside the kernel code
  sycl::buffer<sycl::cl_int, 1> Buffer(4);

  // Creating SYCL queue
  sycl::queue Queue;

  // Size of index space for kernel
  sycl::range<1> NumOfWorkItems{Buffer.size()};

  // Submitting command group(work) to queue
  Queue.submit([&](sycl::handler &cgh) {
    // Getting write only access to the buffer on a device
    auto Accessor = Buffer.get_access<sycl::access::mode::write>(cgh);
    // Executing kernel
    cgh.parallel_for<class FillBuffer>(
        NumOfWorkItems, [=](sycl::id<1> WIid) {
          // Fill buffer with indexes
          Accessor[WIid] = (sycl::cl_int)WIid.get(0);
        });
  });

  // Getting read only access to the buffer on the host.
  // Implicit barrier waiting for queue to complete the work.
  const auto HostAccessor = Buffer.get_access<sycl::access::mode::read>();

  // Check the results
  bool MismatchFound = false;
  for (size_t I = 0; I < Buffer.size(); ++I) {
    if (HostAccessor[I] != I) {
      std::cout << "The result is incorrect for element: " << I
                << " , expected: " << I << " , got: " << HostAccessor[I]
                << std::endl;
      MismatchFound = true;
    }
  }

  if (!MismatchFound) {
    std::cout << "The results are correct!" << std::endl;
  }

  return MismatchFound;
}

And this can be compiled as:

clang++ -fsycl simple-sycl-app.cpp -o simple-sycl-app.exe
./simple-sycl-app.exe

Then, I got:

The results are correct!

Troubleshooting

Missing Double Precision Support

When running the tests on OpenCL devices which do not support double precision floats, there will be multiple tests that will error out.

It might be possible to enable software emulation of double precision floats for Intel iGPUs by setting two environment variables to make kernels using doubles work but with the major overhead of software emulation:

export IGC_EnableDPEmulation=1
export OverrideDefaultFP64Settings=1

alvarovm/levelzeronotes.md