This is a full account of the steps I ran to get llama.cpp running on the Nvidia Jetson Nano 2GB. It consolidates multiple fixes and tutorials, whose contributions are referenced at the bottom of this README.
At a high level, the procedure to install llama.cpp on a Jetson Nano consists of 4 steps.
- Compile the gcc 8.5 compiler from source.
- Compile llama.cpp from source using the gcc 8.5 compiler.
- Download a model.
- Perform inference.
As steps 1 and 2 take a long time, I have uploaded the resulting binaries to the repository for download. Simply download, unzip, and follow steps 3 and 4 to perform inference.
- Compile the GCC 8.5 compiler from source on the Jetson Nano.
NOTE: The make -j6 command takes a long time. I recommend running it overnight in a tmux session. Additionally, the build requires quite a bit of disk space, so make sure to leave at least 8GB of free space on the device before starting.
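Before kicking off the build, it can help to verify the free disk space and start a detachable session. This is just a convenience sketch (the session name gcc-build is my own choice):
# Make sure at least ~8GB are free on the root filesystem before starting
df -h /
# Install tmux if it is not already present, then start a named session for the build
sudo apt-get install -y tmux
tmux new -s gcc-build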
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0/
./contrib/download_prerequisites
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
sudo make install
- Once the make install command has run successfully, you can free up disk space by removing the build directory.
cd /usr/local/gcc-8.5.0/
sudo rm -rf build
- Point the CC and CXX environment variables at the newly installed GCC and G++.
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
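These exports only apply to the current shell. If you want them to survive a reboot, one option (my addition, not required by the original guide) is to append them to your ~/.bashrc:
# Persist the compiler overrides for future shells
echo 'export CC=/usr/local/bin/gcc' >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc
source ~/.bashrc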
- Double check whether the install was indeed successful (both commands should report version 8.5.0).
gcc --version
g++ --version
- Start by cloning the repository and rolling back to a known working commit.
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
- Edit the Makefile and apply the following changes (save the diff to file.patch, preview it with git apply --stat file.patch, and apply it with git apply file.patch).
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
ifdef LLAMA_FAST
MK_CFLAGS += -Ofast
HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
else
MK_CFLAGS += -O3
MK_CXXFLAGS += -O3
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
endif
ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
# Raspberry Pi 3, 4, Zero 2 (64-bit)
# Nvidia Jetson
MK_CFLAGS += -mcpu=native
- MK_CXXFLAGS += -mcpu=native
JETSON_RELEASE_INFO = $(shell jetson_release)
ifdef JETSON_RELEASE_INFO
ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
- NOTE: If you'd rather make the changes manually, do the following:
  - Change MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80 on line 109 and line 113.
  - Remove MK_CXXFLAGS += -mcpu=native on line 302.
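If you prefer to script these edits, a rough equivalent using sed looks like this. It is a sketch based on the lines shown above and assumes the Makefile still matches commit a33e6a0, so double-check the result with git diff before building:
# Replace the NVCC optimization flag with a register-count cap (both occurrences)
sed -i 's/\(MK_NVCCFLAGS *+= *\)-O3/\1-maxrregcount=80/' Makefile
# Remove the C++ -mcpu=native flag that breaks the build on the Jetson
sed -i '/MK_CXXFLAGS *+= *-mcpu=native/d' Makefile
# Review the changes before building
git diff Makefile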
- Build the llama.cpp source code.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
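After the build finishes, a quick sanity check (my suggestion, not part of the original guide) is to confirm that the binaries were produced and are linked against the CUDA libraries:
# The main and server binaries should now be in the repository root
ls -lh main server
# If cuBLAS was picked up, CUDA libraries should appear in the dependency list
ldd ./main | grep -i -E 'cuda|cublas'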
- Download a model to the device
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
- NOTE: Due to the limited memory of the Nvidia Jetson Nano 2GB, I have only been able to successfully run the second-state/TinyLlama-1.1B-Chat-v1.0-GGUF on the device.
Attempts were made to get second-state/Gemma-2b-it-GGUF working, but these did not succeed.
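With only 2GB of unified memory, it is worth keeping an eye on memory pressure while the model loads. These are generic tools rather than anything from the original guide (tegrastats ships with JetPack):
# Show overall RAM and swap usage
free -h
# Live memory and GPU utilization on the Jetson (stop with Ctrl+C)
sudo tegrastats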
- Test the main inference script
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --keep 48
- Run the live server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128
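By default the server listens on localhost port 8080. If you want to reach it from another machine on your network, binding to all interfaces should work; the --host and --port flags are what I'd expect here, so verify them with ./server --help for your build:
# Bind to all interfaces so other devices on the LAN can reach the web server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --host 0.0.0.0 --port 8080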
- Test the web server functionality using curl
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
You can now run a large language model on this tiny and cheap edge device. Have fun!
I successfully compiled one of the latest versions (b4970) of llama.cpp on the Jetson Nano with gcc 9.4 for CPU inference, using cmake 3.31.6 (installed with snap; apt only provides 3.10.2, but at least 3.14 is required). All of the following tests were done with the model TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:Q4_K_M.
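For reference, a CPU-only CMake build of a recent llama.cpp release looks roughly like this; treat it as a sketch of what I describe above rather than the authoritative build instructions (those live in the project's docs):
# Get the sources and switch to the release tag mentioned above
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b4970
# Configure and build for CPU inference (requires cmake >= 3.14)
cmake -B build
cmake --build build --config Release -j 4
# The resulting binaries, e.g. llama-cli and llama-bench, end up in build/bin/
ls build/bin/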
Now I can compare the average token speed of 5.15 tokens/s in ollama with the speed in llama.cpp. First I used the CLI with a question about cafés in Dresden, using the command
./build/bin/llama-cli -m models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "I want to know some Cafes in Dresden city in Germany"
The result is 5.02 tokens/s. Adding the parameters
--n-gpu-layers 5 --ctx-size 512 --threads 4 --temp 0.7 --top-k 40 --top-p 0.9 --batch-size 16
does not change the result significantly. The gpu-layers option is ignored anyway since it's running on the CPU. It also made me wonder why you chose a value of 5 layers: TinyLlama-1.1B-Chat has 22 layers, and they all fit into the unified RAM. Yet somehow the GPU was still utilized at 100%? Can you try different values?
For consistency I ran the benchmark on this model with
./build/bin/llama-bench -m ../.cache/llama.cpp/TheBloke_TinyLlama-1.1B-Chat-v1.0-GGUF_tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
The result is the same for PP and TG (within the margin of error): the Jetson Nano produces tokens at about 5 tokens/s. This indicates the limits of this edge computing device. I measured a realistic memory bandwidth of 6 GB/s for the Jetson Nano; on an i7-13700T with dual-channel DDR4 I get around 57 GB/s.
And finally, a 3070 Ti has 575 GB/s. Running the same benchmark on all three indicates: 10x the memory bandwidth gives roughly 10x the token generation, and 96x the memory bandwidth gives about 65x the token generation. The CUDA core comparison is 128 vs. 6144, yet with the GPU build the Jetson is currently even slower 😲.
You can see where raw compute power is really needed: in the initial prompt processing. Here we see a jump from 6.71 on the Jetson to 12830 on the RTX 3070, a factor of 1912x. Comparing @anuragdogra2192's GPU version to my CPU version, it is only about 2x slower in pp (3.08 vs. 6.71) versus almost 3x slower in tg (1.75 vs. 4.98), so the GPU might have an impact here.
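For anyone who wants to reproduce the memory bandwidth figures above, one simple way to approximate them is with sysbench; this is just my suggestion, not necessarily the tool used for the numbers quoted here:
# Install sysbench and run a simple sequential memory throughput test
sudo apt-get install -y sysbench
sysbench memory --memory-block-size=1M --memory-total-size=10G run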