- chat interface - https://github.com/open-webui/open-webui
- this allows chat history to be recorded
- and will consume API keys for commercial inference providers
- for commercial inference I like openrouter, it is cheap to test >70B models I cannot usably run at home, for $0.01-$2/day (2025)
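- a sketch of starting open-webui next to the ollama container configured below, based on the quick-start in its README (image tag, ports and the OLLAMA_BASE_URL variable as I remember them, check the README):
$ docker run -d -p 3000:8080 -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://$ip:11434 --name open-webui ghcr.io/open-webui/open-webui:main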
- CLI interface - nothing beats llm (https://github.com/simonw/llm), a CLI tool in the best unix tradition; it is modular and just pleasant to use
- this will produce a description of a photo
$ llm -m moondream:latest -a /space/phonepics/iphone8/YARU7264.JPG
- for remote access set
OLLAMA_HOST=$ip
to point llm at the API; it can be any openai-compatible API (hosted locally via ollama or through openrouter), see the example below
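- a sketch of the remote invocation (assumes the llm-ollama plugin is installed so llm can see ollama models; the IP is just an example):
$ llm install llm-ollama
$ OLLAMA_HOST=192.168.1.20:11434 llm -m moondream:latest -a /space/phonepics/iphone8/YARU7264.JPG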
- there are better tools than ollama to host models as actual services, with tight control over parallelism, batching, and which tensors are hosted where, but I have not played with those yet.
ollama
# docker needs the nvidia container toolkit (https://github.com/NVIDIA/nvidia-container-toolkit) to be able to make nvidia drivers available in the containers, and probably more.
# - you will need nvidia drivers too.
# - the model directory will need some IOPS to load models; a dedicated NVME is both fast and naturally limits the sprawl
# - in GPU stats you will see both (G)raphics and (C)ompute jobs. LLM-related tooling only controls the C jobs.
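# a sketch of wiring docker up to the GPU, assuming the toolkit's nvidia-ctk helper (see the toolkit docs for the current steps):
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
# sanity check - nvidia-smi should list the GPU from inside a throwaway container
$ docker run --rm --gpus=all ubuntu nvidia-smi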
# -- once the Ollama container is running
#
# this should produce help output
$ docker exec -it ollama ollama
# ollama.com hosts some of the models, so this nicely works
# ex: https://ollama.com/dengcao/ERNIE-4.5-21B-A3B-PT
$ docker exec -it ollama ollama pull dengcao/ERNIE-4.5-21B-A3B-PT:latest
# will show which models are loaded into memory and the balance of layers loaded into GPU vs CPU
# also check out nvtop
$ docker exec -it ollama ollama ps
# you can create custom configs for the models, setting parameters such as the number of layers in GPU, by editing the default one
# to set the number of layers in GPU, you either `/set parameter num_gpu 16` in the interactive interface or set it in the
# modelfile as `PARAMETER num_gpu 16`. Note - this should really be called `count_layers_in_gpu`, the name is too generic.
# `num_gpu 0` disables gpu for the model
$ docker exec -it ollama ollama show --modelfile dengcao/ERNIE-4.5-21B-A3B-PT > ERNIE.modelfile
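# a sketch of the edited modelfile - keep whatever FROM/TEMPLATE lines the dump produced and just append the parameter:
FROM dengcao/ERNIE-4.5-21B-A3B-PT:latest
PARAMETER num_gpu 16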
# copy the file into the container and create the new entry (same model but new config)
$ docker exec -it ollama ollama create dengcao/ERNIE-4.5-21B-A3B-PT -f /app/ollama/modelfiles/ERNIE-16
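# the copy step could be a plain docker cp, assuming the modelfile was edited on the host and /app/ollama/modelfiles exists in the container (run it before the create above)
$ docker cp ERNIE.modelfile ollama:/app/ollama/modelfiles/ERNIE-16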
[Unit]
Description=Ollama Docker Container
Requires=docker.service
After=docker.service

[Service]
Restart=always
User=user
ExecStart=docker run --rm --name ollama --gpus=all -v /space/ollama:/root/.ollama -p 0.0.0.0:11434:11434 -e OLLAMA_DEBUG=1 ollama/ollama
ExecStop=/usr/bin/docker stop ollama

[Install]
WantedBy=multi-user.target
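# assuming the unit above is saved as /etc/systemd/system/ollama.service:
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now ollama.service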
Hugging Face
- set your Local Apps in https://huggingface.co/settings/local-apps#local-apps
- find the model repo, hit the `Use This Model` button, select your local app from the dropdown and the quantization.
- the different values signify the loss from the decreased precision of the weights, [good overview](https://github.com/ggml-org/llama.cpp/pull/1684#issuecomment-1579252501). For the tl;dr, and if GPU-poor, start with Q4_K.
- at first stick to the official sources, `GGUF` or `safetensors`. Pytorch (.pt/.pth) files are serialized python datastructures, and the deserialization process is fragile if the contents are not 100% trustworthy.
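- ollama can also pull GGUF repos straight from the Hub with the hf.co/ prefix, if I remember the syntax right (repo below is a placeholder, the quant tag is optional):
$ docker exec -it ollama ollama pull hf.co/<user>/<repo>:Q4_K_M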