This is a full account of the steps I ran to get llama.cpp
running on the Nvidia Jetson Nano 2GB. It accumulates multiple different fixes and tutorials, whose contributions are referenced at the bottom of this README.
At a high level, the procedure to install llama.cpp
on a Jetson Nano consists of 3 steps.
-
Compile the
gcc 8.5
compiler from source. -
Compile
llama.cpp
from source using thegcc 8.5
compiler. -
Download a model.
-
Perform inference.
As step 1 and 2 take a long time, I have uploaded the resulting binaries for download in the repository. Simply download, unzip and follow step 3 and 4 to perform inference.
- Compile the GCC 8.5 compiler from source on the Jetson nano.
NOTE: Themake -j6
command takes a long time. I recommend running it overnight in atmux
session. Additionally, it requires quite a bit of disk space so make sure to leave at least 8GB of free space on the device before starting.
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/
./contrib/download_prerequisites
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
make install
- Once the
make install
command ran successfully, you can clean up disk space by removing thebuild
directory.
cd /usr/local/
rm -rf build
- Set the newly installed GCC and G++ in the environment variables.
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
- Double check whether the install was indeed successful (both commands should say
8.5.0
).
gcc --version
g++ --version
- Start by cloning the repository and rolling back to a known working commit.
git clone [email protected]:ggerganov/llama.cpp.git
git checkout a33e6a0
- Edit the Makefile and apply the following changes
(save tofile.patch
and apply withgit apply --stat file.patch
)
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
ifdef LLAMA_FAST
MK_CFLAGS += -Ofast
HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
else
MK_CFLAGS += -O3
MK_CXXFLAGS += -O3
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
endif
ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
# Raspberry Pi 3, 4, Zero 2 (64-bit)
# Nvidia Jetson
MK_CFLAGS += -mcpu=native
- MK_CXXFLAGS += -mcpu=native
JETSON_RELEASE_INFO = $(shell jetson_release)
ifdef JETSON_RELEASE_INFO
ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
-
NOTE: If you rather make the changes manually, do the following:
-
Change
MK_NVCCFLAGS += -O3
toMK_NVCCFLAGS += -maxrregcount=80
on line 109 and line 113. -
Remove
MK_CXXFLAGS += -mcpu=native
on line 302.
-
- Build the
llama.cpp
source code.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
- Download a model to the device
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
- NOTE: Due to the limited memory of the Nvidia Jetson Nano 2GB, I have only been able to successfully run the second-state/TinyLlama-1.1B-Chat-v1.0-GGUF on the device.
Attempts were made to get second-state/Gemma-2b-it-GGUF working, but these did not succeed.
- Test the main inference script
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --keep 48
- Run the live server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128
- Test the web server functionality using curl
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
You can now run a large language model on this tiny and cheap edge device. Have fun!
Not sure if it would actually improve the inference speed, since the token generation is generally limited by memory speed. With the 4GB unified memory for CPU and GPU the theoretical 25.6 GB/s sets the maximum speed. I actually only get 6.8 GB/s in real world measurement with
sysbench memory --memory-block-size=1M run
, not sure why that is.The Jetsons Nano Orin with LPDDR5 has more and 3x faster memory with theoretical 68.3 GB/s. And ongoing software support by Nvidia ...