@virattt
virattt / hotpotqa-vector_db-metrics.ipynb
Created February 24, 2024 19:55
hotpotqa-vector_db-metrics.ipynb
@jrknox1977
jrknox1977 / multi_ollama_containers.md
Last active April 8, 2025 03:51
Running multiple Ollama containers on a single host.

Multiple Ollama Containers on a single host (with multiple GPUs)

I don't want model RELOAD

  • I have a large machine with 2 GPUs and a considerable amount of RAM.
  • I was trying to use Ollama to serve llava and mistral, but it would reload the models every time I switched between them.
  • The solution that appears to be working: multiple containers, each serving a different model on its own port (a Python sketch of querying both follows below).

Ollama model working dir:

  • I have many models already downloaded on my machine, so I mount the host Ollama working dir into the containers.
  • Linux (at least on my machine): /usr/share/ollama/.ollama
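
Not from the gist itself, but as a hedged sketch of the resulting setup: assuming two containers were started with one GPU each and host ports 11434 and 11435 published, requests go straight to the right model with no reload:

import requests

# Assumed mapping: mistral container on host port 11434, llava on 11435.
def generate(port: int, model: str, prompt: str) -> str:
    resp = requests.post(
        f"http://localhost:{port}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate(11434, "mistral", "Say hi."))        # container 1 / GPU 0
print(generate(11435, "llava", "Describe a cat."))  # container 2 / GPU 1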
"""
git clone https://github.com/argilla-io/distilabel.git
pip install -e ".[hf-inference-endpoints]"
"""
import asyncio
import os
import pandas as pd
from llm_swarm import LLMSwarm, LLMSwarmConfig
from huggingface_hub import AsyncInferenceClient
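
The preview ends at the imports. A hedged sketch of how they typically fit together; `instances`, `inference_engine`, and the `endpoint` attribute are assumptions based on the llm-swarm README, and a Slurm cluster is required to actually run it:

async def main():
    prompts = ["What is 2 + 2?", "Name a prime number."]
    # Spin up 2 TGI instances behind a load balancer; tear them down on exit.
    with LLMSwarm(LLMSwarmConfig(instances=2, inference_engine="tgi")) as swarm:
        client = AsyncInferenceClient(model=swarm.endpoint)
        completions = await asyncio.gather(
            *(client.text_generation(p, max_new_tokens=32) for p in prompts)
        )
        print(pd.DataFrame({"prompt": prompts, "completion": completions}))

# asyncio.run(main())  # requires a Slurm cluster llm-swarm can schedule on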
@kalomaze
kalomaze / llm_samplers_explained.md
Last active June 10, 2025 12:13
LLM Samplers Explained

LLM Samplers Explained

Every time a large language model makes a prediction, each of the thousands of tokens in its vocabulary is assigned some probability, from almost 0% to almost 100%. There are different ways to choose a token from those predictions. This process is known as "sampling", and there are various strategies you can use, which I will cover here.
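
As a concrete illustration (mine, not the gist's): given a vector of logits over a toy vocabulary, the sampling step converts them to probabilities and draws one token id:

import numpy as np

def sample_token(logits: np.ndarray) -> int:
    """Softmax the logits, then draw one token id from the distribution."""
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy 4-token vocabulary
print(sample_token(logits))  # usually 0, sometimes 1 or 2, rarely 3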

OpenAI Samplers

Temperature

  • Temperature is a way to control the overall confidence of the model's scores (the logits). With a value lower than 1.0, the gaps between token scores widen, so the distribution sharpens (more deterministic); with a value higher than 1.0, the gaps shrink, so the distribution flattens (less deterministic).
  • A temperature of 1.0 leaves the scores unchanged, so it reproduces the original distribution the model was trained to optimize for.
  • Graph demonstration with voiceover: https://files.catbox.moe/6ht56x.mp4
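
A quick numerical sketch of that effect (mine, not the gist's): dividing the same toy logits by T before the softmax sharpens the distribution for T < 1 and flattens it for T > 1:

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
for T in (0.5, 1.0, 2.0):
    print(T, softmax(logits / T).round(3))
# T=0.5 concentrates mass on the top token; T=1.0 is the trained
# distribution; T=2.0 spreads mass more evenly across tokens.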
@virattt
virattt / query_expansion-cohere.ipynb
Created February 17, 2024 19:49
query_expansion-cohere.ipynb
@thomwolf
thomwolf / fast_speech_text_speech.py
Last active January 14, 2025 12:13
speech to text to speech
""" To use: install LLM studio (or Ollama), clone OpenVoice, run this script in the OpenVoice directory
git clone https://github.com/myshell-ai/OpenVoice
cd OpenVoice
git clone https://huggingface.co/myshell-ai/OpenVoice
cp -r OpenVoice/* .
pip install whisper pynput pyaudio
"""
from openai import OpenAI
import time
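
The preview cuts off after the imports. A minimal assumed continuation for pointing the OpenAI client at a local server (LM Studio's default base URL; the model name is a placeholder):

# Assumed setup, not from the gist: LM Studio's local server speaks the OpenAI API.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio routes to whatever model is loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)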
@jrknox1977
jrknox1977 / ollama_dspy.py
Created February 9, 2024 18:06
ollama+DSPy using OpenAI APIs.
# Install DSPy: pip install dspy
import dspy

# Ollama is now compatible with OpenAI APIs.
#
# To get this to work you must include `model_type='chat'` in the `dspy.OpenAI` call;
# if you do not include it you will get an error.
#
# I have also found that `stop='\n\n'` is required to get the model to stop
# generating text after the answer is complete, at least with mistral.
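
The preview ends before the configuration itself; a sketch of what the comments describe could look like this (the endpoint, key, and model name are assumptions):

ollama_mistral = dspy.OpenAI(
    model="mistral",
    api_base="http://localhost:11434/v1/",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",      # Ollama ignores the key, but the client requires one
    model_type="chat",     # required, per the note above
    stop="\n\n",           # keeps mistral from generating past the answer
)
dspy.settings.configure(lm=ollama_mistral)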
from collections import defaultdict
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from datasets import load_dataset
from rich.console import Console
from rich.table import Table
from transformers import (
    AutoConfig,
@dan-palmer
dan-palmer / prompt.json
Created February 2, 2024 13:49
Arc Search Browse for Me Prompt
{
"messages": [
{
"content": "You are an advanced, reliable, candid AI system that takes user search queries, converts them into questions, and answers them, using specific facts and details sourced from webpages to prove your answer. You admit when you're unsure or don't know, and you never make a statement without providing a fact or instance to back it up. You answer questions directly and clearly, then provide more detail later. You follow the JSON schema exactly.",
"role": "system"
},
{
"content": "# CONTEXT\nCurrent date: #{DATE_TIME}.\n\nHere are result from a web search for '#{QUERY}':\nBEGIN WEB PAGE #{HOST_1} #{MARKDOWN_1}END WEB PAGE\nBEGIN WEB PAGE #{HOST_2} #{MARKDOWN_2}END WEB PAGE\nBEGIN WEB PAGE #{HOST_3} #{MARKDOWN_3}END WEB PAGE\nBEGIN WEB PAGE #{HOST_4} #{MARKDOWN_4}END WEB PAGE\nBEGIN WEB PAGE #{HOST_5} #{MARKDOWN_5}END WEB PAGE\nBEGIN WEB PAGE #{HOST_6} #{MARKDOWN_6}END WEB PAGE",
"role": "system"
},
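
The JSON preview is truncated here. As an illustration only (the #{...} placeholder names come from the prompt above; the substitution code itself is assumed, not part of the gist), filling the template before sending it could look like:

from datetime import datetime, timezone

def fill(template: str, query: str, pages: list[tuple[str, str]]) -> str:
    """Substitute the #{...} placeholders used in the Arc prompt above."""
    out = template.replace("#{DATE_TIME}", datetime.now(timezone.utc).isoformat())
    out = out.replace("#{QUERY}", query)
    for i, (host, markdown) in enumerate(pages, start=1):
        out = out.replace(f"#{{HOST_{i}}}", host)
        out = out.replace(f"#{{MARKDOWN_{i}}}", markdown)
    return out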
@Mistobaan
Mistobaan / langchain_llamacpp_natural_functions.ipynb
Last active March 19, 2024 22:08
langchain_llamacpp_natural_functions.ipynb