Run Llama-2-13B-chat RESTful server locally on your M1/M2/Intel Mac with GPU inference.
Sample requests for testing the server (HTTP request file format):
### Llama 2 Chat
POST http://127.0.0.1:8080/completion
Content-Type: application/json

{
  "prompt": "What is Java Language?",
  "temperature": 0.7
}

### Llama 2 tokenize
POST http://127.0.0.1:8080/tokenize
Content-Type: application/json

{
  "content": "What is Java Language?"
}
The setup script: clone and build llama.cpp, download a model, start the server, and test it with curl:
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it. On an Intel Mac, drop `LLAMA_METAL=1` to build without GPU (Metal) support.
LLAMA_METAL=1 make
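# The build drops its binaries in the repo root; ./server, the RESTful server
# used below, should be among them once make finishes.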
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
# export MODEL=llama-2-70b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"
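# wget is not preinstalled on macOS; curl (which is) can fetch the model instead:
# curl -L -O "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"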
# Run the server: -t 8 uses eight CPU threads, -ngl 1 enables GPU offloading
# via Metal, and -m points at the model file. It listens on 127.0.0.1:8080.
./server -t 8 -ngl 1 -m ${MODEL}
# curl to test API
curl -X POST --location "http://127.0.0.1:8080/completion" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is Java Language?",
    "temperature": 0.7
  }'