Here I describe how to quickly set up the Llama-2-7B chat model using Docker. The instructions below launch an Oobabooga (text-generation-webui) container, which bundles an inference server and a front-end web application for interactive chat. Everything runs on the CPU, so no Nvidia GPU is required.
Docker is a prerequisite for running Oobabooga. The following steps install it on Oracle Linux 7 (note the public-yum-ol7 repository below); adapt the package-manager commands for other distributions.
dc-user@devcloud$ sudo yum update
dc-user@devcloud$ sudo yum install -y yum-utils
dc-user@devcloud$ sudo yum-config-manager --add-repo http://yum.oracle.com/public-yum-ol7.repo
dc-user@devcloud$ sudo yum-config-manager --enable *addons
dc-user@devcloud$ sudo yum update
dc-user@devcloud$ sudo yum install docker-engine
dc-user@devcloud$ sudo systemctl enable --now docker
dc-user@devcloud$ sudo groupadd docker
dc-user@devcloud$ sudo usermod -aG docker ${USER}
dc-user@devcloud$ sudo systemctl reboot
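After the reboot, the docker group membership takes effect and Docker commands work without sudo. As a quick sanity check, run Docker's standard test image:
dc-user@devcloud$ docker run --rm hello-world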
For CPU inference, Oobabooga supports the GGUF file format (via llama.cpp). Download a suitable quantized model from Hugging Face.
dc-user@devcloud$ mkdir workspace; cd workspace; mkdir models
dc-user@devcloud$ curl -L --output ./models/llama-2-7b-chat.Q2_K.gguf https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q2_K.gguf &
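The curl command runs in the background (note the trailing &); the Q2_K file is approximately 2.8 GB, so the download can take a while. Run wait to block until the background curl completes, then confirm the file size looks right:
dc-user@devcloud$ wait
dc-user@devcloud$ ls -lh ./models/llama-2-7b-chat.Q2_K.gguf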
dc-user@devcloud$ docker run --name text-generation-webui \
    --ulimit memlock=-1 --memory=7g \
    -p 7860:7860 -p 5005:5005 -p 5000:5000 \
    -v $(pwd)/models/llama-2-7b-chat.Q2_K.gguf:/app/models/llama-2-7b-chat.Q2_K.gguf \
    -e EXTRA_LAUNCH_ARGS="--listen --verbose --model llama-2-7b-chat.Q2_K.gguf --cpu --mlock" \
    atinoda/text-generation-webui:llama-cpu-snapshot-2023-12-03
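A few notes on the flags: --memory=7g caps the container's RAM, --ulimit memlock=-1 together with --mlock allows llama.cpp to lock the model into memory, -p 7860:7860 exposes the web UI, and ports 5000/5005 are forwarded for API access. To watch the model load, follow the container logs (the name matches the --name flag above):
dc-user@devcloud$ docker logs -f text-generation-webui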
Open the Oobabooga front end by browsing to http://<server-address>:7860 (the --listen flag makes it reachable from other machines).
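Beyond the web UI, recent text-generation-webui builds can expose an OpenAI-compatible API on port 5000, which is already forwarded above; this typically requires adding --api to EXTRA_LAUNCH_ARGS. A sketch of a chat request, assuming such a build (the exact flag and route depend on the webui version inside the image):
dc-user@devcloud$ curl http://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'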