Ollama is an open-source tool for self-hosting Large Language Models (LLMs). Many of these LLMs would normally be impractical to run on consumer hardware, but Ollama serves optimized builds of the models - most importantly quantized versions, where weights are stored at reduced numerical precision to shrink model size and memory usage.
Note: Some details about the Ollama service below are Linux-specific, but most things are the same on all platforms.
On Linux, Ollama can be installed using the command curl -fsSL https://ollama.com/install.sh | sh. Verify a successful install by typing ollama --version.
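For reference, the two commands together:
$ curl -fsSL https://ollama.com/install.sh | sh   # install Ollama
$ ollama --version                                # confirm the install worked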
The above Linux install command also starts the Ollama service in the background using systemd, which will automatically restart ollama if it crashes or the system reboots. Note that the installer sets this up as a system-wide service (run under a dedicated ollama system user rather than your own account), so the service and the files it creates are not owned by you. If you don't want this, you can stop the Ollama service using sudo systemctl disable ollama --now and instead start the Ollama server in a terminal, in user-space, using ollama serve.
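For example, switching from the background service to a user-space server could look like this (a minimal sketch; the systemctl status call is only there to confirm the service is stopped):
$ sudo systemctl disable ollama --now   # stop the service and prevent it from starting at boot
$ systemctl status ollama               # should now report the service as inactive (dead)
$ ollama serve                          # run the Ollama server in this terminal instead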
By default, the Ollama server listens on port 11434. Various settings of the Ollama server (eg. the host & port it binds to, via the OLLAMA_HOST environment variable) can be modified by setting environment variables. For example, when Ollama is deployed on a multi-GPU server, the NCCL_P2P_LEVEL=NV environment variable may boost performance (it speeds up inter-GPU communication by bypassing the CPU).
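For instance, a sketch of setting OLLAMA_HOST (which controls the bind address & port) in both modes - the systemd variant uses the standard systemctl edit override mechanism:
$ # user-space server: export the variable in the shell before starting it
$ export OLLAMA_HOST=0.0.0.0:11434
$ ollama serve
$
$ # systemd service: add the variable via an override file, then restart
$ sudo systemctl edit ollama     # add:  [Service]  Environment="OLLAMA_HOST=0.0.0.0:11434"
$ sudo systemctl restart ollama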
When Ollama is running as a systemd service (the default), its logs can be viewed with the journalctl command:
$ sudo journalctl -u ollama --boot # Logs of Ollama service since boot, interactive (like less and man commands)
$ sudo journalctl -u ollama --boot > ollama_log.txt # save logs of Ollama service to text file
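To follow the logs live (eg. while sending test requests), journalctl's standard -f/--follow flag works here too:
$ sudo journalctl -u ollama -f   # stream new Ollama log lines as they arrive (Ctrl+C to stop)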
Note: The Ollama server must be running for all further Ollama commands (eg. pull, run, the chat API, etc.) to work.
See all available models & their info (eg. no. of model parameters) here - eg. models like mistral, llama3, etc. can be run like this (put the actual model name in place of MODEL):
$ ollama run MODEL
>>>
This first downloads the model, and then opens a chat REPL where you can chat with the model.
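Inside the REPL, a couple of built-in commands are handy (just a sketch; /? lists the rest):
>>> /?     # list the REPL's built-in commands
>>> /bye   # exit the chat REPL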
You can also download models without immediately opening the chat REPL by running ollama pull MODEL. You can list all downloaded models using ollama ls, and delete a downloaded model using ollama rm MODEL.
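For example (using llama3 as a stand-in model name):
$ ollama pull llama3   # download the model without opening the chat REPL
$ ollama ls            # list all downloaded models
$ ollama rm llama3     # delete the downloaded model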
GPU usage of recently used models can be seen using ollama ps:
$ ollama ps
NAME             ID              SIZE      PROCESSOR    UNTIL
llama3:latest    365c0bd3c000    5.4 GB    100% GPU     2 minutes from now
Run ollama --help to see all Ollama CLI commands.
Note: When running as a systemd service (the default), downloaded models are stored at /usr/share/ollama/.ollama/models/. On the other hand, models are stored at ~/.ollama/models/ when running ollama serve in user-space.
By default, the context size in Ollama is 2048 tokens - see this for how to change the context size (a sketch of one approach is shown after the note below). Each model also has its own maximum context size - for example, Llama 3 models support at most 8192 tokens. Model accuracy generally degrades with longer prompts / contexts, so make sure to test and find the maximum number of tokens for which the model you're using still responds accurately to your prompt.
Note: No. of Llama 3 tokens in prompt can be checked at this website.
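One way to change the context size per request is the num_ctx option of the chat API (shown further below); another is baking it into a custom model via a Modelfile. A minimal sketch of the Modelfile route, assuming llama3 is already pulled and llama3-8k is just a name chosen here:
$ cat Modelfile
FROM llama3
PARAMETER num_ctx 8192
$ ollama create llama3-8k -f Modelfile   # build a variant with an 8192-token context window
$ ollama run llama3-8k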
A max time of 20 seconds has been specified in each of the following requests, to prevent them from hanging indefinitely if the Ollama server doesn't respond for some reason (this usually happens when inference on a model is run for the first time, since the model takes a few minutes to load).
In the options dict, temperature 0 has been set for reproducible model output. Note that temperature 0 reduces non-determinism but doesn't eliminate it entirely - so it's still possible to sometimes get 2 different outputs from an LLM for the same prompt. See all options here - eg. num_ctx (context size), etc.
- Direct network request to /api/chat (streaming is true by default):
$ curl http://localhost:11434/api/chat -H "Content-Type: application/json" --max-time 20 -d '{
  "model": "MODEL",
  "messages": [
    { "role": "user", "content": "PROMPT" }
  ]
}'
{"model":"MODEL","created_at":"2024-05-15T07:26:52.58290265Z","message":{"role":"assistant","content":" reason"},"done":false}
{"model":"MODEL","created_at":"2024-05-15T07:26:52.621422361Z","message":{"role":"assistant","content":" the"},"done":false}
...
- Direct network request to /api/chat (with streaming false):
$ curl http://localhost:11434/api/chat -H "Content-Type: application/json" --max-time 20 -d '{
  "model": "MODEL",
  "messages": [
    { "role": "user", "content": "PROMPT" }
  ],
  "options": {
    "seed": 123,
    "temperature": 0
  },
  "stream": false
}'
{"model":"MODEL","created_at":"2024-05-27T10:58:54.341293172Z","message":{"role":"assistant","content":"MODEL_PROMPT_RESPONSE"},"done_reason":"stop","done":true,"total_duration":41088402065,"load_duration":31246191223,"prompt_eval_count":15,"prompt_eval_duration":239390000,"eval_count":405,"eval_duration":9580648000}
- Using the ollama Python library (installed with pip install ollama):
import ollama

# Point the client at the Ollama server itself (host only, not the /api/chat path)
client = ollama.Client(host='http://localhost:11434', timeout=20)  # timeout in seconds

response = client.chat(
    model='MODEL',
    messages=[
        {'role': 'user', 'content': 'PROMPT'}
    ],
    options={
        'seed': 123,
        'temperature': 0
    },
    # stream=True
)
print(response['message']['content'])
NOTE: Here, streaming is False by default (unlike when calling the Ollama chat API directly).
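Passing stream=True makes the library return an iterator of chunks instead of a single response - a minimal sketch, reusing the client created above:
# Streaming with the ollama library: iterate over the chunks as they arrive
for chunk in client.chat(
    model='MODEL',
    messages=[{'role': 'user', 'content': 'PROMPT'}],
    stream=True,
):
    print(chunk['message']['content'], end='', flush=True)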
- Streaming using Python requests:
import json
import requests

api_url = 'http://localhost:11434/api/chat'
payload = {
    "model": "MODEL",
    "messages": [
        { "role": "user", "content": "PROMPT" }
    ],
    "options": {
        "seed": 123,
        "temperature": 0
    },
    "stream": True  # default
}

# stream=True makes requests expose the response incrementally;
# each non-empty line is one JSON chunk from the Ollama chat API
with requests.post(api_url, json=payload, timeout=20, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if 'error' in chunk:
            raise Exception(f'Got error from Ollama chat api: {chunk["error"]}')
        if chunk['done']:
            break
        print(chunk['message']['content'], end='', flush=True)
A "keep_alive": -1 field can also be added to the request, alongside "model", "messages" and "options" (in the Python library it is the keep_alive argument). keep_alive controls how long the model is kept in GPU memory after a request - the default is "5m" (5 minutes), while setting it to -1 keeps the model in GPU memory indefinitely (never unloaded). This is useful to avoid a long model response time on the first call (because loading the model takes time).
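For example, a request like the following can be used to pre-load a model and keep it resident (a sketch; per the API docs, an empty messages array just loads the model without generating a response):
$ curl http://localhost:11434/api/chat --max-time 20 -d '{
  "model": "MODEL",
  "messages": [],
  "keep_alive": -1
}'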
TODO: Try out these Ollama interesting options: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion