Here I describe how to quickly set up a local, OpenAI-compatible LLM inference server.
Docker is a prerequisite. Run the following commands to install and enable Docker (these use the Oracle Linux 7 repositories; adapt the repo steps for your distribution):
dc-user@devcloud$ sudo yum update
dc-user@devcloud$ sudo yum install -y yum-utils
dc-user@devcloud$ sudo yum-config-manager --add-repo http://yum.oracle.com/public-yum-ol7.repo
dc-user@devcloud$ sudo yum-config-manager --enable *addons
dc-user@devcloud$ sudo yum update
dc-user@devcloud$ sudo yum install docker-engine
dc-user@devcloud$ sudo systemctl enable --now docker
dc-user@devcloud$ sudo groupadd docker
dc-user@devcloud$ sudo usermod -aG docker ${USER}
dc-user@devcloud$ sudo systemctl reboot
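The reboot is there so the new `docker` group membership takes effect for your session. Before moving on, it's worth confirming it did; a minimal sketch (the group check works anywhere, while the suggested `hello-world` run additionally needs the daemon to be up):

```shell
# Check that the current user picked up the docker group; membership
# only applies to sessions started after the reboot or a re-login.
if id -nG | grep -qw docker; then
  echo "docker group active; try: docker run --rm hello-world"
else
  echo "not in docker group yet; log out and back in, or reboot"
fi
```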
Select a model from HuggingFace. Here we download a 2-bit quantized (Q2_K) GGUF build of Llama 2 7B Chat:
dc-user@devcloud$ mkdir models
dc-user@devcloud$ curl -L https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q2_K.gguf --output ./models/llama-2-7b-chat.Q2_K.gguf
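Before starting the server, you can sanity-check the download: valid GGUF files begin with the 4-byte ASCII magic "GGUF", while a truncated transfer or an HTML error page will not. A quick sketch (the path matches the curl command above):

```shell
# A valid GGUF file starts with the ASCII magic "GGUF".
MODEL=./models/llama-2-7b-chat.Q2_K.gguf
if [ -f "$MODEL" ] && [ "$(head -c 4 "$MODEL")" = "GGUF" ]; then
  echo "looks like a valid GGUF file"
else
  echo "missing or corrupt download: $MODEL"
fi
```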
The following command launches both the llama.cpp server (on port 8080) and the OpenAI-compatible wrapper script api_like_OAI.py (on port 8081):
dc-user@devcloud$ docker run -p 8080:8080 -p 8081:8081 -v $PWD/models:/models --entrypoint "/bin/bash" ghcr.io/ggerganov/llama.cpp:full-c5b49360d0d9e49f32e05a9116e90bd0b39a282d -c "python3 -m pip install Flask==3.0.0 requests==2.31.0 urllib3==2.1.0; /app/.devops/tools.sh --server -m /models/llama-2-7b-chat.Q2_K.gguf -c 2048 -ngl 43 -mg 1 --host 0.0.0.0 --port 8080 & /app/examples/server/api_like_OAI.py --host 0.0.0.0"
dc-user@devcloud$ curl --request POST --url http://localhost:8081/v1/completions --header "Content-Type: application/json" --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
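The wrapper answers with OpenAI-style JSON, so the generated text lives at choices[0].text. A sketch for pulling it out with python3 (the response body below is an illustrative stand-in, not real model output; in practice, save or pipe the curl output from above instead):

```shell
# Illustrative response in the OpenAI completions shape.
cat > /tmp/response.json <<'EOF'
{"choices": [{"text": "Step 1: pick a domain name."}]}
EOF
# Extract just the generated text from the JSON body.
python3 -c 'import json; print(json.load(open("/tmp/response.json"))["choices"][0]["text"])'
```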