Run multiple llama.cpp model servers in parallel behind a single endpoint.
Prerequisites:
npm i -g pm2
pip install "fastapi[all]" httpx uvicorn
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j LLAMA_CURL=1 server
# Or if you're a lazy Mac user:
# brew install llama.cpp

Then download multi.py below & start one server per model with the following command:
pm2 start --name llama.cpp multi.py -- --models='{
"completions": {
"phi-3-medium-128k-instruct": [
"--hf-repo", "bartowski/Phi-3-medium-128k-instruct-GGUF",
"--hf-file", "Phi-3-medium-128k-instruct-Q8_0.gguf",
"-np", "4"
],
"default": "phi-3-medium-128k-instruct"
},
"infill": {
"codestral-22B-v0.1": {
"--hf-repo": "bartowski/Codestral-22B-v0.1-GGUF",
"--hf-file": "Codestral-22B-v0.1-Q8_0.gguf"
},
"default": "codestral-22B-v0.1"
},
"embeddings": {
"nomic-embed-text-v1.5": [
"--hf-repo", "nomic-ai/nomic-embed-text-v1.5-GGUF",
"--hf-file", "nomic-embed-text-v1.5.Q4_K_M.gguf",
"--rope-freq-scale", "0.75",
"--embeddings",
"-np", "16"
],
"default": "nomic-embed-text-v1.5"
}
}'

Some useful commands:
pm2 ls to check memory usage & status of all servers
pm2 logs to see what's going on
pm2 stop all to stop everything
pm2 startup (+ pm2 save) to have servers automatically restart on reboot
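multi.py itself isn't reproduced here, but the dispatch it needs to perform is simple: start one llama.cpp server per named model, then forward each incoming request to the right port based on the endpoint and the "model" field of the request body. The sketch below is a hypothetical illustration of that routing logic only (the names build_routes and resolve_port are illustrative, not actual multi.py functions, and the base port 8081 is an assumption):

```python
import itertools


def build_routes(models, base_port=8081):
    """Assign one upstream port per model, keyed by (endpoint, model name).

    `models` has the same shape as the --models JSON above: a mapping of
    endpoint name -> {model name: args, ..., "default": model name}.
    """
    ports = itertools.count(base_port)
    routes = {}
    for endpoint, entries in models.items():
        default = entries.get("default")
        for name in entries:
            if name == "default":
                continue
            port = next(ports)
            routes[(endpoint, name)] = port
            if name == default:
                # Fallback route used when the request names no known model.
                routes[(endpoint, None)] = port
    return routes


def resolve_port(routes, endpoint, model):
    """Pick the upstream port for a request, falling back to the default model."""
    key = (endpoint, model)
    if key not in routes:
        key = (endpoint, None)
    return routes[key]
```

With the config from above, "completions" requests naming "phi-3-medium-128k-instruct" (or naming no registered model at all) would land on the first port, and "embeddings" requests on the next one.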
And query the server under the same umbrella endpoint:
curl -X POST http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "phi-3-medium-128k-instruct",
  "prompt": "Hello, world!",
  "stream": true
}'
curl -X POST http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
  "model": "nomic-embed-text-v1.5",
  "input": "Hello, world!"
}'

Or in Python:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")
print(client.embeddings.create(
model="text-embedding-3-small",
input=["Hello, World"]
).data[0].embedding)
print(client.chat.completions.create(
model="gpt-4o",
response_format={ "type": "json_object" },
messages=[
{"role": "system", "content": "You are a helpful assistant designed to output JSON."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
).choices[0].message.content)