@tkarna
Last active November 27, 2024 18:07
Compile ollama with SYCL support

Compile ollama on Ubuntu 22.04:

# Install and activate oneapi
sudo apt install intel-basekit
source /opt/intel/oneapi/setvars.sh
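# Optional sanity check (assumes setvars.sh was sourced above): list the
# devices visible to SYCL; the discrete GPU should appear as a Level Zero device
sycl-ls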

# You may need to install other build dependencies ...
# sudo apt install apt-utils

# Install go lang
sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt update
sudo apt install -y golang-1.23-go
export PATH=/usr/lib/go-1.23/bin:$PATH
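# Optional: confirm the intended toolchain is picked up (should report go1.23.x)
go version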

# Clone ollama
git clone --depth 1 --branch v0.3.13 https://github.com/ollama/ollama.git

# Compile
cd ollama
CGO_ENABLED="1" OLLAMA_SKIP_CPU_GENERATE="1" OLLAMA_INTEL_GPU="1" go generate ./...
go build
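# Optional sanity check that the binary was built; it may warn that no
# server is running, which is fine at this point
./ollama --version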

The ollama binary will appear in the repository root directory.

When you start the server, you need to set the OLLAMA_INTEL_GPU environment variable. For example:

export OLLAMA_INTEL_GPU=1
export OLLAMA_NUM_GPU=999
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

ollama serve

If successful, you should see the discrete GPU(s) listed when the server starts:

time=2024-11-27T08:33:32.243Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-27T08:33:32.315Z level=INFO source=types.go:123 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="Intel(R) Data Center GPU Max 1100" total="48.0 GiB" available="45.6 GiB"

NOTE: At the moment, only discrete GPUs are supported, not integrated GPUs.
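For convenience, the activation step and the environment variables above can be collected into a small launch script. This is only a sketch; the file name is arbitrary, and it assumes the default oneAPI prefix and that it is run from the repository root:

#!/usr/bin/env bash
# run_ollama_sycl.sh -- sketch of a launcher for the SYCL build
source /opt/intel/oneapi/setvars.sh
export OLLAMA_INTEL_GPU=1
export OLLAMA_NUM_GPU=999
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
exec ./ollama serve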

@Schlaefer

I gave it a try on Arch, but it only builds the CPU runners and runs on the CPU here.

time=2024-11-27T10:17:26.146+01:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cpu]"
time=2024-11-27T10:17:26.146+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-27T10:17:26.195+01:00 level=INFO source=types.go:123 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="Intel(R) Arc(TM) A750 Graphics" total="7.9 GiB" available="7.5 GiB"

@tkarna (Author) commented Nov 27, 2024

> I gave it a try on Arch, but it only builds the CPU runners and runs on the CPU here.

@Schlaefer Hmm, I do see similar output. It seems the output is not very informative in this case. Try running a model; you should see it being offloaded to the GPU:

time=2024-11-27T11:00:02.639+01:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/localdisk/model_cache/ollama/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 gpu=1 parallel=4 available=48917722726 required="43.2 GiB"
time=2024-11-27T11:00:02.640+01:00 level=INFO source=server.go:108 msg="system memory" total="251.5 GiB" free="243.3 GiB" free_swap="0 B"
time=2024-11-27T11:00:02.640+01:00 level=INFO source=memory.go:326 msg="offload to oneapi" layers.requested=-1 layers.model=81 layers.offload=81 layers.split="" memory.available="[45.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.2 GiB" memory.required.partial="43.2 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[43.2 GiB]" memory.weights.total="40.7 GiB" memory.weights.repeating="39.9 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
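A simple way to trigger this is to run any model from the CLI; the model name below is only an example:

./ollama run llama3.1 "hello"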

I can also verify that GPU is being utilized with intel_gpu_top:

intel_gpu_top -l -d <pci: filter of your card as listed by 'intel_gpu_top -L'>
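For example (the device filter below is only illustrative; use the one reported by intel_gpu_top -L for your card, and note the tool may need root):

intel_gpu_top -L
sudo intel_gpu_top -l -d drm:/dev/dri/card0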

@Schlaefer

It definitely runs on the CPU. Are you sure this is supposed to work with version 0.4+?

@tkarna (Author) commented Nov 27, 2024

> It definitely runs on the CPU. Are you sure this is supposed to work with version 0.4+?

You are right. I was testing with an older binary built from tag v0.3.13. I've updated the gist.

@Schlaefer

I tinkered a bit with different 0.3 versions. That produces a working ollama, but now the Intel runtime throws its hands up with the same issue: ollama/ollama#1590 (comment)

Luckily I have a working 0.3 Vulkan build, but eventually we will need something for 0.4+ anyway.

@tkarna (Author) commented Nov 27, 2024

It looks like ollama 0.3.13 works when compiled against oneAPI 2024.2.1 but not 2025.0.0 (it segfaults at runtime). I tested the install script above with the oneapi-basekit/2024.2.1-0-devel-ubuntu22.04 Docker image.
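For reference, something along these lines should reproduce that environment (the image is published on Docker Hub as intel/oneapi-basekit; the GPU device is passed through so the runtime can see it):

docker run --rm -it --device /dev/dri \
    intel/oneapi-basekit:2024.2.1-0-devel-ubuntu22.04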

@Schlaefer

Alas Arch is stuck at 2024.1, so that might be it ... 😔
