This is a full account of the steps I ran to get llama.cpp running on the Nvidia Jetson Nano 2GB. It accumulates fixes from multiple different tutorials, whose contributions are referenced at the bottom of this README.
At a high level, the procedure to install llama.cpp on a Jetson Nano consists of 4 steps:
1. Compile the gcc 8.5 compiler from source.
2. Compile llama.cpp from source using the gcc 8.5 compiler.
3. Download a model.
4. Perform inference.
As steps 1 and 2 take a long time, I have uploaded the resulting binaries for download in the repository. Simply download and unzip them, then follow steps 3 and 4 to perform inference.
- Compile the GCC 8.5 compiler from source on the Jetson Nano.
NOTE: The make -j6 command takes a long time. I recommend running it overnight in a tmux session. Additionally, it requires quite a bit of disk space, so make sure to leave at least 8 GB of free space on the device before starting (see the commands just below).
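For reference, checking free disk space and starting a detachable tmux session looks like this (plain df/tmux usage; the session name gcc-build is just an example):
df -h /
tmux new -s gcc-build
# run the build steps below inside tmux, detach with Ctrl-b d,
# and re-attach later with:
tmux attach -t gcc-build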
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0
./contrib/download_prerequisites
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
make install
- Once the make install command has run successfully, you can clean up disk space by removing the build directory.
cd /usr/local/gcc-8.5.0
rm -rf build
- Set the newly installed GCC and G++ as the default compilers via the CC and CXX environment variables.
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
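These exports only apply to the current shell session. To make them persistent across sessions (assuming bash is the login shell), the same lines can be appended to ~/.bashrc:
echo 'export CC=/usr/local/bin/gcc' >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc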
- Double check whether the install was indeed successful (both commands should say 8.5.0).
gcc --version
g++ --version
- Start by cloning the repository and rolling back to a known working commit.
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
- Edit the Makefile and apply the following changes (save the diff to file.patch and apply it with git apply file.patch).
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
 ifdef LLAMA_FAST
 MK_CFLAGS += -Ofast
 HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 else
 MK_CFLAGS += -O3
 MK_CXXFLAGS += -O3
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 endif

 ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
     # Raspberry Pi 3, 4, Zero 2 (64-bit)
     # Nvidia Jetson
     MK_CFLAGS += -mcpu=native
-    MK_CXXFLAGS += -mcpu=native
     JETSON_RELEASE_INFO = $(shell jetson_release)
     ifdef JETSON_RELEASE_INFO
         ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
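Before applying, a dry run with git apply --check confirms the patch applies cleanly against the checked-out commit; --ignore-whitespace is a useful fallback if copying the diff mangled the whitespace:
git apply --check file.patch
git apply file.patch
# fallback in case the context whitespace does not match exactly:
git apply --ignore-whitespace file.patch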
- NOTE: If you'd rather make the changes manually, do the following:
  - Change MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80 on lines 109 and 113.
  - Remove MK_CXXFLAGS += -mcpu=native on line 302.
- Build the llama.cpp source code.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
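To confirm the resulting binary is actually linked against the CUDA libraries, a quick sanity check is to inspect the shared library dependencies of the main binary that the Makefile build places in the repository root:
ldd ./main | grep -iE 'cuda|cublas'
# a CUDA-enabled build should list libraries such as libcudart and libcublas here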
- Download a model to the device
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
- NOTE: Due to the limited memory of the Nvidia Jetson Nano 2GB, I have only been able to successfully run the second-state/TinyLlama-1.1B-Chat-v1.0-GGUF on the device.
Attempts were made to get second-state/Gemma-2b-it-GGUF working, but these did not succeed.
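To keep an eye on memory and GPU load while experimenting with other models, the Jetson's built-in tegrastats utility can be left running in a second terminal (the interval is in milliseconds):
sudo tegrastats --interval 1000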
- Test the main inference script
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --keep 48
- Run the live server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128
- Test the web server functionality using curl
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
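If jq is available, the generated text can be extracted straight from the JSON response; in this server version the /completion endpoint returns it in a content field (worth double-checking on your build):
curl --silent --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' \
| jq -r '.content'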
You can now run a large language model on this tiny and cheap edge device. Have fun!
I followed the instructions from @anuragdogra2192 on medium.com and successfully compiled a GPU-accelerated version of llama.cpp. Some predictions regarding the prompt processing speed pp512 came true, and a few new questions arose. But first, the results.
This old version b1618 (81bc921 from December 7, 2023) does not have a llama-cli yet, so we call the main program with a task regarding the solar system:
mk@jetson:~/llama.cpp3$ ./build/bin/main -m ../.cache/llama.cpp/TheBloke_TinyLlama-1.1B-Chat-v1.0-GGUF_tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Solar System" --n-gpu-layers 5 --ctx-size 512 --threads 4 --temp 0.7 --top-k 40 --top-p 0.9 --batch-size 16
After the answer, the speed summary reports slightly better than 1.75 token/s. That could also be a result of the context window not being filled yet, and it is still slower than pure CPU use with newer llama.cpp builds. Now the CPU is only partly used at 650 mW, but the GPU is at 100% and 3.2 W:
Let's move to the integrated benchmark tool, starting with the same 5 layers:
./build/bin/llama-bench -m ../.cache/llama.cpp/TheBloke_TinyLlama-1.1B-Chat-v1.0-GGUF_tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --n-gpu-layers 5
And now with a limit of 10 layers:
Now with 22 layers:
And now with the maximum working number of layers, 24. With 25 or no limit it crashes:
Now both the CPU (at 1.9 W) and the GPU (at 2.4 W) are at 100%:
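As a side note, the individual runs above could also be expressed as a single sweep, since llama-bench accepts comma-separated value lists for most of its parameters (assuming this older build already supports that syntax):
./build/bin/llama-bench -m ../.cache/llama.cpp/TheBloke_TinyLlama-1.1B-Chat-v1.0-GGUF_tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --n-gpu-layers 5,10,22,24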
Questions
The build used by @anuragdogra2192 is from December 2023, and the build recommended by the author of this gist, @FlorSanders, is from February 2024; both are rather old. Both used the nvcc 10.2.300 that ships with the Ubuntu 18.04 LTS image provided by Nvidia, and both needed gcc 8.5.0 to be built from scratch (takes 3 hours). Anurag's version needed 5 extra lines in the file ggml-cuda.cu in the llama.cpp folder; he then builds with cmake .. -DLLAMA_CUBLAS=ON followed by make -j 2 in the build folder.
Flor compiled with make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6 after changing some lines in the Makefile, and he could run llama.cpp with 33 layers offloaded via --n-gpu-layers 33, while I get crashes for values larger than 24.
The currently recommended method consists of two steps with CMake:
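Roughly the following, going by the current llama.cpp build documentation (the flag names have changed over time, so treat this as a sketch rather than a tested recipe for the Jetson):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 4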
Would it be possible to tweak the current build (something in the b4984 range) so that it compiles with the older nvcc 10.2 and gcc 8.5? Trying without any changes I got errors like nvcc fatal : Unsupported gpu architecture 'compute_80'. Flor had already explicitly required sm_62, which is higher than the compute capability 5.3 the Jetson actually has in hardware. I could not find a specific date or build of llama.cpp that marks the dropped support for nvcc 10.2. And the CC 5.3 of the Jetson is still supported by nvcc 12.8; it is just not provided for the Jetson by Nvidia. And the CPU version of the current llama.cpp can be compiled with gcc 8.5.
Observations
With increased use of the GPU the prompt processing speed pp512 is indeed increasing! Pure CPU on the current llama.cpp build gave 6.71; with 5 GPU layers it was more than 3x faster at 20.99, then 24.29 with 10 layers, 42.24 with 22 layers and finally 54.18 with 24 layers (before crashing at 25 layers). That is roughly 8x faster when using the GPU!
The token generation speed tg128 is still significantly slower than newer builds on the CPU: 3.55 for the GPU versus 4.98 for the CPU, about 29% slower. I think this gap would close with more recent versions of llama.cpp.
And GPU utilization is constantly fluctuating between 0 and 100%. I haven't observed this behaviour with discrete graphics cards and llama.cpp: they usually go to a very high percentage for prompt processing, then settle at a rather constant 20% - 40% usage (the exact value depends on the model and graphics card type, or on how the work is distributed across several GPUs in the system, but it stays constant while processing a given task in a given setup). The fluctuation on the Jetson GPU could be an effect of the unified memory that has to be shared with the CPU, and so far I could not fully utilize the GPU.
Actually, only llama-bench crashes after 10 seconds; main runs continuously with all 22 layers offloaded: