This is a full account of the steps I ran to get llama.cpp
running on the Nvidia Jetson Nano 2GB. It accumulates multiple different fixes and tutorials, whose contributions are referenced at the bottom of this README.
At a high level, the procedure to install llama.cpp on a Jetson Nano consists of 4 steps:

- Compile the `gcc 8.5` compiler from source.
- Compile `llama.cpp` from source using the `gcc 8.5` compiler.
- Download a model.
- Perform inference.
As steps 1 and 2 take a long time, I have uploaded the resulting binaries for download in the repository. Simply download, unzip and follow steps 3 and 4 to perform inference.
- Compile the GCC 8.5 compiler from source on the Jetson Nano.

NOTE: The `make -j6` command takes a long time. I recommend running it overnight in a `tmux` session. Additionally, it requires quite a bit of disk space, so make sure to leave at least 8GB of free space on the device before starting.
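Before starting, it can help to confirm there is enough free disk space and to open a `tmux` session so the build survives an SSH disconnect. A minimal sketch (plain `df`/`free`/`tmux` usage, nothing specific to this gist); the actual build commands follow below.

```bash
df -h /                 # at least ~8GB should be free on the root filesystem
free -h                 # check available RAM/swap before the long build
tmux new -s gcc-build   # run the build in this session; reattach later with: tmux attach -t gcc-build
```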
```bash
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0/
./contrib/download_prerequisites
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
make install
```
- Once the `make install` command has run successfully, you can clean up disk space by removing the `build` directory.
```bash
cd /usr/local/gcc-8.5.0/
rm -rf build
```
- Set the newly installed GCC and G++ in the environment variables.

```bash
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
```
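Optionally, these variables can be persisted so new shells pick up the freshly built compiler as well. A small sketch, assuming the default `bash` shell on the Jetson:

```bash
# Append the compiler overrides to ~/.bashrc so they survive new shells and reboots
echo 'export CC=/usr/local/bin/gcc' >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc
source ~/.bashrc
```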
- Double check whether the install was indeed successful (both commands should report `8.5.0`).

```bash
gcc --version
g++ --version
```
- Start by cloning the repository and rolling back to a known working commit.

```bash
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
```
- Edit the Makefile and apply the following changes (save the diff below to `file.patch` and apply it with `git apply file.patch`; see the sketch after the manual-edit note below).
```diff
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
 ifdef LLAMA_FAST
 MK_CFLAGS += -Ofast
 HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 else
 MK_CFLAGS += -O3
 MK_CXXFLAGS += -O3
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 endif
 
 ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
     # Raspberry Pi 3, 4, Zero 2 (64-bit)
     # Nvidia Jetson
     MK_CFLAGS += -mcpu=native
-    MK_CXXFLAGS += -mcpu=native
     JETSON_RELEASE_INFO = $(shell jetson_release)
     ifdef JETSON_RELEASE_INFO
         ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
```
- NOTE: If you would rather make the changes manually, do the following:
  - Change `MK_NVCCFLAGS += -O3` to `MK_NVCCFLAGS += -maxrregcount=80` on line 109 and line 113.
  - Remove `MK_CXXFLAGS += -mcpu=native` on line 302.
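Assuming the diff above was saved as `file.patch` inside the `llama.cpp` checkout, one way to verify and apply it with standard git commands:

```bash
git apply --check file.patch   # dry run: fails if the patch does not apply cleanly
git apply file.patch           # actually apply the changes to the Makefile
git diff Makefile              # inspect the applied changes
```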
- Build the `llama.cpp` source code.

```bash
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
```
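After the build finishes, a quick sanity check can confirm that the binaries were produced in the repository root and linked against CUDA. A sketch, assuming `ldd` is available on the device:

```bash
ls -lh main server llama-bench         # binaries land in the repo root, not in build/bin/
ldd ./main | grep -iE 'cublas|cuda'    # should list CUDA/cuBLAS libraries if GPU support was built in
```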
- Download a model to the device.

```bash
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
```
- NOTE: Due to the limited memory of the Nvidia Jetson Nano 2GB, I have only been able to successfully run second-state/TinyLlama-1.1B-Chat-v1.0-GGUF on the device. Attempts to get second-state/Gemma-2b-it-GGUF working did not succeed.
- Test the main inference script.

```bash
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --keep 48
```
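For a non-interactive test, the same invocation can be given an explicit prompt via the `-p` flag of `main`; a small sketch using the prompt from the curl example further down:

```bash
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 \
  -p "Building a website can be done in 10 simple steps:"
```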
- Run the live server.

```bash
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128
```
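By default the server binds to localhost only. If it should be reachable from another machine on the network, a hedged variant using the server's `--host`/`--port` flags:

```bash
# Bind to all interfaces so other machines on the LAN can reach the server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 \
  --host 0.0.0.0 --port 8080
```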
- Test the web server functionality using curl.

```bash
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```
You can now run a large language model on this tiny and cheap edge device. Have fun!
I was trying to replicate this gist from @FlorSanders to compare the performance with some benchmarks. As with other solutions, you have to compile gcc 8.5, and that takes time (~3 hours). After that it is short and fast: just one file to edit (change 3 lines in the Makefile), and then instead of two cmake invocations a single line: `make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6`. Wait just 7 minutes and you're done!

The `main` and `llama-bench` binaries are not in a `/build/bin/` subfolder, and `llama-bench` only reports token generation (tg128). Being over a year old, the build only reaches about 2.65 t/s with TinyLlama-1.1B-Chat Q4_K_M. That performance is expected for a CUDA-compiled llama.cpp that only uses the CPU. That changes when I start offloading even just one layer to the GPU with `--n-gpu-layers 1`: the GPU is used, and the GPU RAM is filled. And then it immediately crashes, for any number of GPU layers other than zero. The error message is the same for both `main` and `llama-bench`.

Can anyone confirm this behaviour? It was already reported by @VVilliams123 in October 2024 and confirmed by @zurvan23 in February 2025. In this case this gist would only describe another way to create a CPU build. And that could be done without any changes to the current llama.cpp source code, with gcc 8.5, in 24 minutes, and it would be much faster (b5058: pp=7.47 t/s and tg=4.15 t/s while using only 1.1 GB RAM total). Or use ollama: no need for gcc 8.5, even less time needed to install, and it should probably work with a 2GB Jetson Nano. `ollama run --verbose gemma3:1b` consumes only 1.6 to 1.8 GB RAM (checked with `jtop` over `ssh` in a headless system). Just checked with another "Explain quantum entanglement": pp=8.01 t/s and tg=4.66 t/s, while supposedly running 100% on GPU and using 1.9 GB VRAM according to `ollama ps`. Well, `jtop` disagrees. Re-checking with my b5050 CUDA build, llama.cpp has 1.5 GB GPU shared RAM, 2.3 GB total (not good for the 2GB model). Now 3 video recommendations, and pp=17.33 t/s and tg=5.35 t/s. Only +15% over ollama this time, but +29% over the CPU llama.cpp.
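For anyone wanting to reproduce the pp/tg numbers discussed above, a sketch of a `llama-bench` invocation (CPU-only here, since GPU offload crashes as described; the `-p`/`-n` values are illustrative):

```bash
# Measure prompt processing (pp512) and token generation (tg128) without GPU offload
./llama-bench -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -p 512 -n 128 -ngl 0
```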