@tkarna
Last active November 27, 2024 18:07
Compile ollama with SYCL support

Compile ollama on Ubuntu 22.04:

# Install and activate oneapi
sudo apt install intel-basekit
source /opt/intel/oneapi/setvars.sh
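# Optional sanity check (assumes setvars.sh was sourced above): list the
# devices visible to SYCL; the discrete GPU should appear as a Level Zero device
sycl-ls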

# You may need to install other build dependencies ...
# sudo apt install apt-utils

# Install go lang
sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt update
sudo apt install -y golang-1.23-go
export PATH=/usr/lib/go-1.23/bin:$PATH
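# Optional: confirm the intended toolchain is picked up (should report go1.23.x)
go version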

# Clone ollama
git clone --depth 1 --branch v0.3.13 https://github.com/ollama/ollama.git

# Compile
cd ollama
CGO_ENABLED="1" OLLAMA_SKIP_CPU_GENERATE="1" OLLAMA_INTEL_GPU="1" go generate ./...
go build
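# Optional sanity check that the binary was built; it may warn that no
# server is running, which is fine at this point
./ollama --version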

The ollama binary will appear in the repository root directory.

When you start the server, you need to set the OLLAMA_INTEL_GPU environment variable. For example:

export OLLAMA_INTEL_GPU=1
export OLLAMA_NUM_GPU=999
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

ollama serve

If successful, you should see the discrete GPU(s) listed when the server starts:

time=2024-11-27T08:33:32.243Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-27T08:33:32.315Z level=INFO source=types.go:123 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="Intel(R) Data Center GPU Max 1100" total="48.0 GiB" available="45.6 GiB"

NOTE: At the moment, only discrete GPUs are supported, not integrated GPUs.
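For convenience, the activation step and the environment variables above can be collected into a small launch script. This is only a sketch; the file name is arbitrary, and it assumes the default oneAPI prefix and that it is run from the repository root:

#!/usr/bin/env bash
# run_ollama_sycl.sh -- sketch of a launcher for the SYCL build
source /opt/intel/oneapi/setvars.sh
export OLLAMA_INTEL_GPU=1
export OLLAMA_NUM_GPU=999
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
exec ./ollama serve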

@Schlaefer

I gave it a try on Arch, but it only builds the CPU runners and runs on the CPU here.

time=2024-11-27T10:17:26.146+01:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cpu]"
time=2024-11-27T10:17:26.146+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-27T10:17:26.195+01:00 level=INFO source=types.go:123 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="Intel(R) Arc(TM) A750 Graphics" total="7.9 GiB" available="7.5 GiB"

@tkarna (Author) commented Nov 27, 2024

> I gave it a try on Arch, but it only builds the CPU runners and runs on the CPU here.

@Schlaefer Hmm, I do see similar output. It seems the output is not very informative in this case. Try running a model; you should see it being offloaded to the GPU:

time=2024-11-27T11:00:02.639+01:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/localdisk/model_cache/ollama/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 gpu=1 parallel=4 available=48917722726 required="43.2 GiB"
time=2024-11-27T11:00:02.640+01:00 level=INFO source=server.go:108 msg="system memory" total="251.5 GiB" free="243.3 GiB" free_swap="0 B"
time=2024-11-27T11:00:02.640+01:00 level=INFO source=memory.go:326 msg="offload to oneapi" layers.requested=-1 layers.model=81 layers.offload=81 layers.split="" memory.available="[45.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.2 GiB" memory.required.partial="43.2 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[43.2 GiB]" memory.weights.total="40.7 GiB" memory.weights.repeating="39.9 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
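A simple way to trigger this is to run any model from the CLI; the model name below is only an example:

./ollama run llama3.1 "hello"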

I can also verify that GPU is being utilized with intel_gpu_top:

intel_gpu_top -l -d <pci: filter of your card as listed by 'intel_gpu_top -L'>
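For example (the device filter below is only illustrative; use the one reported by intel_gpu_top -L for your card, and note the tool may need root):

intel_gpu_top -L
sudo intel_gpu_top -l -d drm:/dev/dri/card0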

@Schlaefer

It definitely runs on the CPU. Are you sure this is supposed to work with version 0.4+?

@tkarna (Author) commented Nov 27, 2024

> It definitely runs on the CPU. Are you sure this is supposed to work with version 0.4+?

You are right. I was testing with an older binary built from tag v0.3.13. I've updated the gist.

@Schlaefer

I tinkered a bit with different 0.3 versions. That produces a working ollama, but now the Intel runtime throws its hands up with the same issue: ollama/ollama#1590 (comment)

Luckily I have a working 0.3 Vulkan build, but eventually we will need something for 0.4+ anyway.

@tkarna (Author) commented Nov 27, 2024

It looks like ollama 0.3.13 works when compiled against oneAPI 2024.2.1 but not 2025.0.0 (it segfaults at runtime). I tested the install script above with the oneapi-basekit/2024.2.1-0-devel-ubuntu22.04 Docker image.
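For reference, something along these lines should reproduce that environment (the image is published on Docker Hub as intel/oneapi-basekit; the GPU device is passed through so the runtime can see it):

docker run --rm -it --device /dev/dri \
    intel/oneapi-basekit:2024.2.1-0-devel-ubuntu22.04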

@Schlaefer

Alas Arch is stuck at 2024.1, so that might be it ... 😔
