This is a full account of the steps I ran to get `llama.cpp` running on the Nvidia Jetson Nano 2GB. It consolidates fixes from multiple tutorials, all of which are credited at the bottom of this README.
At a high level, the procedure to install `llama.cpp` on a Jetson Nano consists of 4 steps:

1. Compile the `gcc 8.5` compiler from source.
2. Compile `llama.cpp` from source using the `gcc 8.5` compiler.
3. Download a model.
4. Perform inference.
As steps 1 and 2 take a long time, I have uploaded the resulting binaries for download in the repository. Simply download, unzip, and follow steps 3 and 4 to perform inference.
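If you take that shortcut, the flow looks roughly like the sketch below; the archive name `gcc-llama-binaries.zip` is a placeholder, so substitute whatever file is actually provided in this repository.

```bash
# Hypothetical archive name: use the file actually uploaded to this repository
unzip gcc-llama-binaries.zip -d llama.cpp
cd llama.cpp
```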
- Compile the GCC 8.5 compiler from source on the Jetson Nano.

NOTE: The `make -j6` command takes a long time. I recommend running it overnight in a `tmux` session. Additionally, it requires quite a bit of disk space, so make sure to leave at least 8GB of free space on the device before starting.
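Before kicking off the build, a quick way to verify free space and open a detachable session (the session name is arbitrary):

```bash
df -h /                # confirm at least 8GB of free space
tmux new -s gcc-build  # detach with Ctrl-b d, reattach with: tmux attach -t gcc-build
```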
```bash
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
# The tarball extracts to /usr/local/gcc-8.5.0; the prerequisites
# script must be run from the source root
cd /usr/local/gcc-8.5.0
./contrib/download_prerequisites
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
sudo make install  # installing into /usr/local requires root
```
- Once the `make install` command has run successfully, you can clean up disk space by removing the `build` directory.

```bash
cd /usr/local/gcc-8.5.0
sudo rm -rf build
```
- Set the newly installed GCC and G++ in the environment variables.

```bash
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
```
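These exports only affect the current shell. To make them persist across sessions, you can append them to your shell profile (assuming `bash`):

```bash
echo 'export CC=/usr/local/bin/gcc' >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc
source ~/.bashrc
```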
- Double-check whether the install was indeed successful (both commands should report version `8.5.0`).

```bash
gcc --version
g++ --version
```
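Both commands should also resolve to the new build in `/usr/local/bin` rather than the system compiler; `which` confirms this:

```bash
which gcc  # should print /usr/local/bin/gcc
which g++  # should print /usr/local/bin/g++
```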
- Start by cloning the repository and rolling back to a known working commit.

```bash
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
```
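You can confirm the rollback took effect:

```bash
git log -1 --oneline  # should show commit a33e6a0
```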
- Edit the Makefile and apply the following changes (save the diff to `file.patch`, preview it with `git apply --stat file.patch`, and apply it with `git apply file.patch`):
```diff
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
 ifdef LLAMA_FAST
 MK_CFLAGS     += -Ofast
 HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS  += -maxrregcount=80
 else
 MK_CFLAGS     += -O3
 MK_CXXFLAGS   += -O3
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS  += -maxrregcount=80
 endif
 
 ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
     # Raspberry Pi 3, 4, Zero 2 (64-bit)
     # Nvidia Jetson
     MK_CFLAGS   += -mcpu=native
-    MK_CXXFLAGS += -mcpu=native
     JETSON_RELEASE_INFO = $(shell jetson_release)
     ifdef JETSON_RELEASE_INFO
         ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
```
NOTE: If you'd rather make the changes manually, do the following (or use the `sed` sketch below this list):

- Change `MK_NVCCFLAGS += -O3` to `MK_NVCCFLAGS += -maxrregcount=80` on line 109 and line 113.
- Remove `MK_CXXFLAGS += -mcpu=native` on line 302.
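Both manual edits can also be scripted; here is a minimal `sed` sketch, assuming the Makefile matches the diff above (verify the result with `git diff` before building):

```bash
# Swap -O3 for -maxrregcount=80 on the two MK_NVCCFLAGS lines (109 and 113)
sed -i 's/^\(MK_NVCCFLAGS *+= *\)-O3/\1-maxrregcount=80/' Makefile
# Delete the aarch64 MK_CXXFLAGS += -mcpu=native line (302)
sed -i '/MK_CXXFLAGS *+= *-mcpu=native/d' Makefile
```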
- Build the `llama.cpp` source code.

```bash
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
```
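If the build succeeds, the `main` and `server` binaries land in the repository root. A quick sanity check that the binary runs:

```bash
./main --help | head -n 5  # prints the usage header
```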
- Download a model to the device.

```bash
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
```
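The Q5_K_M quantization is roughly 0.8GB, which is large relative to this device, so it is worth confirming the download completed:

```bash
ls -lh TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
```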
- NOTE: Due to the limited memory of the Nvidia Jetson Nano 2GB, I have only been able to successfully run `second-state/TinyLlama-1.1B-Chat-v1.0-GGUF` on the device. Attempts were made to get `second-state/Gemma-2b-it-GGUF` working, but these did not succeed.
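On a 2GB board it also helps to watch memory while a model loads. `free` works on any Linux system; `tegrastats` is the Jetson-specific readout that ships with JetPack:

```bash
free -h          # RAM and swap usage
sudo tegrastats  # live Jetson utilization; stop with Ctrl-C
```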
- Test the main inference script.

```bash
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --keep 48
```
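Here `-ngl 33` asks llama.cpp to offload up to 33 layers to the GPU (more than TinyLlama has, so everything offloadable ends up on the GPU), `-c 2048` sets the context size, `-b 512` the batch size, `-n 128` the number of tokens to generate, and `--keep 48` the prompt tokens retained when the context fills up. If you hit out-of-memory errors, lowering the offload count and context size is a reasonable first adjustment; the values below are illustrative, not tested:

```bash
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 20 -c 1024 -b 256 -n 128 --keep 48
```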
- Run the live server.

```bash
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128
```
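By default the server listens on `127.0.0.1:8080`. To reach it from another machine on your network, bind it to all interfaces with the server's `--host` and `--port` flags:

```bash
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --host 0.0.0.0 --port 8080
```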
- Test the web server functionality using curl.

```bash
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```
You can now run a large language model on this tiny and cheap edge device. Have fun!
After compiling GCC and before building llama.cpp with CMake, I had to export the location of the CC and CXX compilers:
```
user@user-desktop$ export CC=/usr/local/bin/gcc
user@user-desktop$ export CXX=/usr/local/bin/g++
```
This solves the following errors, which otherwise arise when building llama.cpp:
```
user@user-desktop:~/Downloads/llama.cpp/build$ make -j 2
[ 1%] Generating build details from Git
[ 2%] Building C object CMakeFiles/ggml.dir/ggml.c.o
-- Found Git: /usr/bin/git (found version "2.17.1")
[ 3%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 3%] Built target build_info
[ 4%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 5%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[ 6%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3680:41: warning: missing braces around initializer [-Wmissing-braces]
const ggml_int16x8x2_t mins16 = {vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(mins))), vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(mins)))};
^
{ }
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:406:27: error: invalid initializer
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3723:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3725:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3727:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4353:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4371:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4373:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5273:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:5291:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5300:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5918:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5926:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5927:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6627:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:6629:43: warning: missing braces around initializer [-Wmissing-braces]
const ggml_int16x8x2_t q6scales = {vmovl_s8(vget_low_s8(scales)), vmovl_s8(vget_high_s8(scales))};
^
{ }
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6641:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:405:27: error: invalid initializer
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6643:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:6686:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^
cc1: some warnings being treated as errors
CMakeFiles/ggml.dir/build.make:120: recipe for target 'CMakeFiles/ggml.dir/ggml-quants.c.o' failed
make[2]: *** [CMakeFiles/ggml.dir/ggml-quants.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:823: recipe for target 'CMakeFiles/ggml.dir/all' failed
make[1]: *** [CMakeFiles/ggml.dir/all] Error 2
Makefile:145: recipe for target 'all' failed
make: *** [all] Error 2
```