@FlorSanders · Created April 11, 2024 15:17

Setup Guide for llama.cpp on Nvidia Jetson Nano 2GB

This is a full account of the steps I ran to get llama.cpp running on the Nvidia Jetson Nano 2GB. It consolidates fixes from multiple different tutorials, whose contributions are referenced at the bottom of this README.

Procedure

At a high level, the procedure to install llama.cpp on a Jetson Nano consists of four steps.

  1. Compile the gcc 8.5 compiler from source.

  2. Compile llama.cpp from source using the gcc 8.5 compiler.

  3. Download a model.

  4. Perform inference.

As steps 1 and 2 take a long time, I have uploaded the resulting binaries for download in the repository. Simply download, unzip, and follow steps 3 and 4 to perform inference.

GCC Compilation

  1. Compile the GCC 8.5 compiler from source on the Jetson Nano.
    NOTE: The make -j6 command takes a long time. I recommend running it overnight in a tmux session (see the sketch after the commands below). Additionally, it requires quite a bit of disk space, so make sure to leave at least 8GB of free space on the device before starting.
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0/
sudo ./contrib/download_prerequisites
sudo mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
sudo make -j6
sudo make install
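If you follow the note above and let the build run overnight, a minimal tmux workflow looks like this (assuming tmux is installed, or installable via apt, on the stock Ubuntu 18.04 image):

sudo apt-get install -y tmux
tmux new -s gcc-build        # run the configure/make commands above inside this session
# detach with Ctrl-b d; reattach later with:
tmux attach -t gcc-build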
  2. Once the make install command has run successfully, you can reclaim disk space by removing the build directory.
cd /usr/local/gcc-8.5.0/
sudo rm -rf build
  3. Point the CC and CXX environment variables at the newly installed GCC and G++.
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
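These exports only apply to the current shell. To make them persist across sessions, one option is to append them to your shell profile (a small sketch; adjust the paths if you installed GCC to a different prefix):

echo 'export CC=/usr/local/bin/gcc' >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc
source ~/.bashrc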
  4. Double-check that the install was successful (both commands should report version 8.5.0).
gcc --version
g++ --version
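Both should print a version line along these lines (the exact wording can vary slightly depending on how configure was invoked):

gcc (GCC) 8.5.0
g++ (GCC) 8.5.0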

llama.cpp Compilation

  1. Start by cloning the repository and rolling back to a known working commit.
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
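If you don't have SSH keys set up on the device, cloning over HTTPS works just as well; the checkout step is unchanged:

git clone https://github.com/ggerganov/llama.cpp.git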
  2. Edit the Makefile and apply the following changes
    (save the diff below to file.patch and apply it with git apply file.patch).
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
 ifdef LLAMA_FAST
 MK_CFLAGS     += -Ofast
 HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 else
 MK_CFLAGS     += -O3
 MK_CXXFLAGS   += -O3
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 endif

 ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
     # Raspberry Pi 3, 4, Zero 2 (64-bit)
     # Nvidia Jetson
     MK_CFLAGS   += -mcpu=native
-    MK_CXXFLAGS += -mcpu=native
     JETSON_RELEASE_INFO = $(shell jetson_release)
     ifdef JETSON_RELEASE_INFO
         ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
  • NOTE: If you'd rather make the changes manually, do the following:

    • Change MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80 on line 109 and line 113.

    • Remove MK_CXXFLAGS += -mcpu=native on line 302.
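Whichever route you take, it is worth confirming that only the intended lines changed before building:

git diff Makefile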

  3. Build the llama.cpp source code.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
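If make aborts because nvcc cannot be found, the CUDA toolkit bundled with JetPack typically lives under /usr/local/cuda (an assumption about the stock JetPack layout); putting it on the PATH and re-running the build usually fixes it. Afterwards you can verify that the binary was actually linked against CUDA:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
ldd ./main | grep -i cuda    # should list libcudart / libcublas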

Download a model

  1. Download a model to the device
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
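The Q5_K_M file works, but on the 2GB board a smaller quantization leaves more headroom for the context. The same Hugging Face repository also hosts other quantizations; for example, a Q4_K_M variant would be fetched like this (the exact filename is an assumption; check the repository's file list):

wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q4_K_M.gguf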

Perform inference

  1. Test the main inference script
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33  -c 2048 -b 512 -n 128 --keep 48
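Without a prompt, main simply free-generates from the start token. To ask it something concrete, pass a prompt with -p, for example:

./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --keep 48 -p "Building a website can be done in 10 simple steps:"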
  2. Run the live server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33  -c 2048 -b 512 -n 128
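The server listens on 127.0.0.1:8080 by default. To reach it from another machine on your network, bind it to all interfaces (and adjust the port if needed):

./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --host 0.0.0.0 --port 8080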
  3. Test the web server functionality using curl
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
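The reply is a single JSON object with the generated text in its content field. Piping it through python3 -m json.tool makes it easier to read (Python 3 is available on the stock JetPack image):

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' \
    | python3 -m json.tool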

You can now run a large language model on this tiny and cheap edge device. Have fun!

References

@kreier commented Apr 6, 2025

I was trying to replicate this gist from @FlorSanders to compare the performance with some benchmarks. As with similar solutions, you have to compile gcc 8.5, and that takes time (~3 hours). After that it is short and fast: just one file to edit (change 3 lines in the Makefile), and instead of two cmake invocations you run a single line, make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6. Wait just 7 minutes and you're done!

The main and llama-bench binaries are not in a /build/bin/ subfolder, and llama-bench only covers token generation (tg128). Being over a year old, this build reaches only about 2.65 t/s with TinyLlama-1.1B-Chat Q4_K_M. That performance is what you would expect from a CUDA-compiled llama.cpp that only uses the CPU. That changes when I start offloading even just one layer to the GPU with --n-gpu-layers 1: the GPU is used, and the GPU RAM is filled.

And then it immediately crashes, for any number of GPU layers other than zero. The error message is the same for both main and llama-bench:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA Tegra X1, compute capability 5.3, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:9906
  cudaGetLastError()
GGML_ASSERT: ggml-cuda.cu:255: !"CUDA error"
[New LWP 17972]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000007f8cb90d5c in __waitpid (pid=<optimized out>, stat_loc=0x0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30      ../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
#0  0x0000007f8cb90d5c in __waitpid (pid=<optimized out>, stat_loc=0x0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30      in ../sysdeps/unix/sysv/linux/waitpid.c
#1  0x00000000004117fc in ggml_print_backtrace ()
#2  0x00000000004d9c00 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) [clone .constprop.453] ()
#3  0x00000000004f2fe8 in ggml_cuda_op_flatten(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, CUstream_st*)) ()
#4  0x00000000004f1198 in ggml_cuda_compute_forward ()
#5  0x00000000004f17d8 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#6  0x00000000004fd974 in ggml_backend_sched_graph_compute ()
#7  0x000000000045e540 in llama_decode_internal(llama_context&, llama_batch) ()
#8  0x000000000045f350 in llama_decode ()
#9  0x0000000000410190 in main ()
[Inferior 1 (process 17971) detached]
Aborted (core dumped)

Can anyone confirm this behaviour? It was already reported by @VVilliams123 in October 2024 and confirmed by @zurvan23 in February 2025. If so, this gist would only describe another way to create a CPU build. That could be done without any changes with the current llama.cpp source code and gcc 8.5 in 24 minutes, and it would be much faster (b5058: pp = 7.47 t/s and tg = 4.15 t/s while using only 1.1 GB RAM in total). Or use ollama: no gcc 8.5 needed, so even less time to install, and it should work on a 2GB Jetson Nano. ollama run --verbose gemma3:1b consumes only 1.6 to 1.8 GB RAM (checked with jtop over ssh on a headless system). Just checked with another "Explain quantum entanglement": pp = 8.01 t/s and tg = 4.66 t/s, while supposedly running 100% on the GPU and using 1.9 GB VRAM according to ollama ps. Well, jtop disagrees.

Re-checking with my b5050 CUDA build: llama.cpp uses 1.5 GB of GPU shared RAM, 2.3 GB in total (not good for the 2GB model). Another test prompt (3 video recommendations) gives pp = 17.33 t/s and tg = 5.35 t/s. Only +15% over ollama this time, but +29% over the CPU llama.cpp.

@kreier commented Apr 20, 2025

This gist actually works! I can't replicate the compilation (as mentioned above), but the provided binaries DO use the GPU and accept the given values for --n-gpu-layers. With an increasing number of layers it gets faster. Since it is based on an older version of llama.cpp (b2275), it is slower than a current CPU build or ollama. I did some benchmarking:

benchmark image

More recent builds are faster than pure CPU compilations or ollama, and they support newer models like Gemma 3. I exported my gist with some updates to a repository to include more images and benchmarks, and created a second repository with compiled versions of build 5050 (April 2025) and an installer. It is tested with the latest Ubuntu 18.04.6 LTS image provided by Nvidia with JetPack 4.6.1 (L4T 32.7.1). It can be installed with:

curl -fsSL https://kreier.github.io/llama.cpp-jetson.nano/install.sh | bash && source ~/.bashrc

The installation should take less than a minute. You can try your first LLM with llama-cli -hf ggml-org/gemma-3-1b-it-GGUF --n-gpu-layers 99. For unknown reasons the first start hangs for 6:30 minutes at main: load the model and apply lora adapter, if any. Any subsequent start takes only 12 seconds.

@acerbetti

For anyone looking for a ready-to-use setup to run llama.cpp on the Jetson Nano (JetPack 4.6), I’ve put together a Docker image that includes:

  • llama.cpp compiled and optimized for the Nano
  • Python 3.10 bindings
  • Compatibility with L4T 32.7.1 (JetPack 4.6)

This makes it easy to run local LLMs and use them directly from Python without rebuilding anything.

You can find the full write-up here:
https://www.caplaz.com/jetson-nano-running-llama-cpp/

And the Docker image is available here:
https://hub.docker.com/r/acerbetti/l4t-jetpack-llama-cpp-python
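A typical invocation on the Nano would look something like the sketch below; the volume mount and the lack of an explicit command are assumptions, so check the write-up and the Docker Hub page for the intended usage:

docker pull acerbetti/l4t-jetpack-llama-cpp-python
docker run --rm -it --runtime nvidia -v $(pwd)/models:/models acerbetti/l4t-jetpack-llama-cpp-python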

Hope this helps others in the community — happy to hear feedback or suggestions.
