If you’re using Open WebUI with LiteLLM and forwarding the user’s email, here’s how to track requests by individual users via Open WebUI using Docker:
⸻
✅ Summary of Requirements
• Use Open WebUI configured to forward per-user HTTP headers such as X-OpenWebUI-User-Email or X-OpenWebUI-User-Id.
• Configure LiteLLM via its config.yaml to parse that header and attribute usage/spend to the user.
• Use virtual API keys per user or team for authentication and spend tracking.
• Ensure the LiteLLM container has the DATABASE_URL, LITELLM_MASTER_KEY, and LITELLM_SALT_KEY env vars set to enable the dashboard, logging, and key management.
⸻
🔧 1. Configure Open WebUI
In your Open WebUI environment, enable user header forwarding:
export ENABLE_FORWARD_USER_INFO_HEADERS=True
This makes the Open WebUI client add headers like X-OpenWebUI-User-Email (and/or X-OpenWebUI-User-Id, X-OpenWebUI-User-Name) to each conversation request sent to the LiteLLM proxy, identifying the user making the request.
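If you run Open WebUI itself in Docker, the same flag is passed as a container environment variable. A minimal sketch, assuming the official ghcr.io/open-webui/open-webui image and that Open WebUI talks to LiteLLM through its OpenAI-compatible endpoint (the litellm hostname and the key are placeholders):

docker run -d \
  -p 3000:8080 \
  -e ENABLE_FORWARD_USER_INFO_HEADERS=True \
  -e OPENAI_API_BASE_URL="http://litellm:4000/v1" \
  -e OPENAI_API_KEY="sk-your-virtual-key" \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

With this in place, every chat request Open WebUI sends to LiteLLM should carry the X-OpenWebUI-User-* headers for the logged-in user.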
⸻
🔧 2. LiteLLM config.yaml Setup (Docker)
Here’s how to configure LiteLLM to use that header:
general_settings:
  master_key: sk-ADMIN-KEY               # your admin key
  user_header_name: X-OpenWebUI-User-Email
litellm_settings:
  disable_end_user_cost_tracking: false
  disable_end_user_cost_tracking_prometheus_only: false
  extra_spend_tag_headers:
    - "X-OpenWebUI-User-Email"
• user_header_name tells LiteLLM which HTTP header to treat as the user identifier.
• With this set, each API request with Authorization: Bearer <virtual-key> and the forwarded email header lets LiteLLM group usage and billing by that email.
• extra_spend_tag_headers makes the header appear as a tag in the UI logs and metrics.
💡 Ensure disable_end_user_cost_tracking is not set to true, otherwise per-user spend tracking is disabled.
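To see what LiteLLM receives, you can reproduce the request Open WebUI makes by hand. A small sketch against the proxy’s OpenAI-compatible endpoint; the key, email, and model name are placeholders:

curl http://localhost:4000/v1/chat/completions \
  -H 'Authorization: Bearer sk-your-virtual-key' \
  -H 'Content-Type: application/json' \
  -H 'X-OpenWebUI-User-Email: user@example.com' \
  -d '{
    "model": "llama2-local",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

If the config above is in place, this call should show up in the Logs view attributed to that email rather than to default_user_id.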
⸻
🐳 Example Docker Run Command
Assuming you have this config.yaml and environment variables prepared, run the LiteLLM proxy:
docker run -d \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e DATABASE_URL="postgresql://user:pass@host:5432/db" \
  -e LITELLM_MASTER_KEY="sk-adminkey" \
  -e LITELLM_SALT_KEY="sk-somerandomsalt" \
  -p 4000:4000 \
  my-litellm-proxy --config /app/config.yaml
• DATABASE_URL, LITELLM_MASTER_KEY, and LITELLM_SALT_KEY enable usage tracking, key management, and the UI.
• The Open WebUI client sends the headers; LiteLLM attributes usage to the forwarded user ID/email.
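If you prefer to run everything together, a Docker Compose sketch wires the pieces into one network. The service names, Postgres credentials, and Open WebUI image tag are illustrative assumptions, not required values:

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
      POSTGRES_DB: litellm
  litellm:
    image: my-litellm-proxy
    command: ["--config", "/app/config.yaml"]
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      LITELLM_MASTER_KEY: sk-adminkey
      LITELLM_SALT_KEY: sk-somerandomsalt
    ports:
      - "4000:4000"
    depends_on:
      - postgres
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      ENABLE_FORWARD_USER_INFO_HEADERS: "True"
      OPENAI_API_BASE_URL: http://litellm:4000/v1
      OPENAI_API_KEY: sk-your-virtual-key    # a virtual key generated below
    ports:
      - "3000:8080"
    depends_on:
      - litellm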
⸻
🧾 3. Key Management & Virtual API Keys
To track usage by team or user, ideally create a per-user virtual key:
curl -X POST 'http://localhost:4000/key/generate' \
  -H 'Authorization: Bearer sk-adminkey' \
  -H 'Content-Type: application/json' \
  -d '{
    "models": ["llama2-local", "opt-125m"],
    "metadata": {"user_email": "[email protected]"}
  }'
• In the UI (/ui), you can manage keys and associate them with users or email metadata.
• Each user uses their own key, and LiteLLM logs usage and cost per key.
⸻
🧠 4. Viewing Tracking in the UI
• Open the LiteLLM dashboard at http://localhost:4000/ui.
• Navigate to the Logs or Spend sections. You will see entries labeled by the user_header_name value (e.g. the forwarded email).
• Calls from the same user/email are grouped together.
• You can also filter logs by the metadata attached to the virtual key.
⸻
⚠️ Notes & Gotchas
• The GitHub discussion on this feature indicates header-based tracking is available but still being improved; confirm the LiteLLM version you run supports it.
• In version 1.73.0+, LiteLLM automatically shows User-Agent as a tag, but header-based email/user tracking must be explicitly configured.
• If the header isn’t forwarded properly by Open WebUI, LiteLLM will record requests under default_user_id.
• If you see all usage grouped under default_user_id, confirm that:
1. Open WebUI is sending the header (ENABLE_FORWARD_USER_INFO_HEADERS=True)
2. LiteLLM’s config.yaml has user_header_name correctly set
3. Cost tracking isn’t disabled globally
⸻
✅ In Summary
1. Enable user-info header forwarding in Open WebUI.
2. In config.yaml, set general_settings.user_header_name to that header (e.g. X-OpenWebUI-User-Email).
3. Ensure litellm_settings.disable_end_user_cost_tracking is false.
4. Create virtual API keys per user for better tracking and control.
5. Run via Docker with the database, master key, and salt key set. Access the dashboard at /ui to monitor usage per user.
With this setup, LiteLLM will track and report usage and spend per email/user, visible in the dashboard, as long as Open WebUI forwards those headers correctly.
Let me know if you’d like an example config for Helm/K8s, or help enabling email or team invitation notifications.
Deploying LiteLLM with Docker and Kubernetes (Amazon EKS + Helm)
This guide covers two methods of deploying the open-source LiteLLM proxy server: (1) building a custom Docker image for local model serving (with Ollama and vLLM backends), and (2) deploying on Kubernetes (EKS) using Helm. We will also cover enabling the web UI dashboard, using virtual API keys, explaining the config file settings, and noting important prerequisites and gotchas for EKS, Ollama, and vLLM.
1. Building a Custom LiteLLM Docker Image (for Local Models)
Overview: LiteLLM’s proxy server can be containerized without using the public pre-built image. We will build it from the Python package for security and customization. This custom image will be configured to forward requests to local LLM backends (like an Ollama server or a vLLM server) via the OpenAI-compatible API.
1.1 Prepare a LiteLLM Config File
Before building the image, create a config.yaml defining the models and providers LiteLLM will serve. In this file’s model_list, each entry maps a friendly model name to the actual backend provider and model parameters. For example, to use a local Ollama model and a vLLM server:
model_list:
  - model_name: "llama2-local"                  # Friendly name used in requests
    litellm_params:
      model: "ollama/llama2"                    # Use the Ollama provider for Llama2
      api_base: "http://localhost:11434"        # Ollama server URL (default port 11434)
  - model_name: "opt-125m-vllm"
    litellm_params:
      model: "hosted_vllm/facebook/opt-125m"    # Use vLLM OpenAI-compatible endpoint
      api_base: "http://localhost:8000"         # vLLM server URL (example port 8000)
In this example, the first model llama2-local is served via an Ollama instance, and the second opt-125m-vllm via a vLLM API. (We’ll run the Ollama and vLLM servers separately, and LiteLLM will route calls to them.) Notice the use of provider prefixes: "ollama/..." for Ollama and "hosted_vllm/..." for an OpenAI-compatible vLLM endpoint. The api_base is the base URL of the local service. In Kubernetes, this might be a service DNS name instead of localhost (we’ll address that later).
Tip: You can reference environment variables in the config for sensitive values. For example, instead of embedding API keys in the file, use api_key: os.environ/MY_KEY_NAME in litellm_params, then provide MY_KEY_NAME via an env var or Kubernetes secret. This keeps secrets out of the file.
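A small sketch of what that looks like in practice; the model entry and the variable name MY_OPENAI_KEY are illustrative:

model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/MY_OPENAI_KEY    # resolved from the environment at runtime

Then supply the variable when starting the container, e.g. docker run -e MY_OPENAI_KEY="sk-..." (or via a Kubernetes secret, as shown later).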
1.2 Dockerfile for LiteLLM
Create a Dockerfile to build the LiteLLM proxy server image. We’ll use a minimal Python base image and install the LiteLLM package (with the proxy extras) via pip. Below is a simplified Dockerfile (based on the official instructions):
# Use a minimal Python base (Chainguard Linux)
FROM cgr.dev/chainguard/python:latest-dev
# Set up environment and working directory
USER root
WORKDIR /app
ENV HOME=/home/litellm
ENV PATH="${HOME}/venv/bin:$PATH"
# Install system dependencies (for building Python packages, SSL, etc.)
RUN apk update && apk add --no-cache gcc python3-dev openssl openssl-dev
# Create virtual environment and upgrade pip
RUN python -m venv ${HOME}/venv && ${HOME}/venv/bin/pip install --no-cache-dir --upgrade pip
# Install LiteLLM and required extras in the venv
COPY requirements.txt .
RUN ${HOME}/venv/bin/pip install -r requirements.txt
# Expose LiteLLM’s default port
EXPOSE 4000/tcp
# Entry point: run the LiteLLM proxy server
ENTRYPOINT ["litellm"]
CMD ["--port", "4000"]
requirements.txt (example):
litellm[proxy]==1.57.3    # LiteLLM version (proxy server)
# Optionally include extras:
prometheus_client         # For metrics, if needed
langfuse                  # For observability callbacks
prisma                    # Database client (if using Postgres for keys)
# vllm (optional): If you want to run vLLM in-process, include the vLLM library.
This Dockerfile follows the official guidance for building from the pip package. Adjust the LiteLLM version as needed (pin a specific version for stability). If you plan to use the vLLM Python integration (deprecated in favor of an external server), you could pip install vllm here as well. In most cases it’s simpler to run vLLM as a separate service and use the OpenAI API interface (as shown in the config above).
Build the image:
docker build -t my-litellm-proxy .
This creates an image named my-litellm-proxy with LiteLLM installed.
1.3 Running the LiteLLM Container
Now run a container from this image, providing the config file and any required environment variables for keys or the database:
docker run -d \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY="sk-..." \
  -e DATABASE_URL="postgresql://user:pass@host:5432/dbname" \
  -e LITELLM_MASTER_KEY="sk-1234" \
  -p 4000:4000 \
  my-litellm-proxy --config /app/config.yaml --detailed_debug
In this example, we mount our config.yaml into the container and expose port 4000. We also set some env vars:
• OPENAI_API_KEY is just an example (only needed if your config references it for a provider like OpenAI).
• DATABASE_URL points to a Postgres instance (only needed if enabling the dashboard and key tracking).
• LITELLM_MASTER_KEY is the admin key for the proxy (more on this below). It must start with “sk-”.
The --detailed_debug flag enables verbose logging (useful for initial setup; for production, it’s recommended to omit it for performance).
Once the container is running, the LiteLLM proxy listens on http://0.0.0.0:4000 by default (we mapped this to the host as port 4000).
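Since the config above points at local Ollama and vLLM servers, one way to run all three on a single machine is Docker Compose. A sketch under the assumption that you use the official ollama/ollama and vllm/vllm-openai images; the model name and ports are illustrative, and GPU settings are omitted:

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama          # persist downloaded models
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "facebook/opt-125m"]  # add GPU/resource settings for real models
    ports:
      - "8000:8000"
  litellm:
    image: my-litellm-proxy
    command: ["--config", "/app/config.yaml"]
    volumes:
      - ./config.yaml:/app/config.yaml
    ports:
      - "4000:4000"
    depends_on:
      - ollama
      - vllm
volumes:
  ollama-models:

Inside this Compose network, the api_base values in config.yaml would be http://ollama:11434 and http://vllm:8000 instead of localhost.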
1.4 Enabling the UI Dashboard and Virtual API Keys
To use LiteLLM’s web UI and virtual API key features, you need to connect a database and set up an admin/master key. The UI is a dashboard accessible at the /ui path of the server, which provides observability (e.g. usage charts, budgets) and key management.
Database requirement: LiteLLM uses Postgres to store API key metadata and usage logs. Without a database, the proxy can still forward requests, but you won’t have multi-user key management or spend tracking. Ensure you have a Postgres database running (local or cloud) and provide its connection string via the DATABASE_URL env var when starting LiteLLM.
Master Key: Set LITELLM_MASTER_KEY in the environment – this acts as the admin API key for the proxy. It should be a string starting with “sk-”. Clients use this key to authorize admin actions (like generating sub-keys or accessing the UI). For example, you might set LITELLM_MASTER_KEY="sk-1234" (in practice, use a more secure value).
Salt Key: Additionally, set LITELLM_SALT_KEY to a secret random string (also starting with “sk-”). This salt is used to encrypt/decrypt any provider API keys stored in the database and should not be changed after you start issuing keys. It’s recommended to generate a random value for this (e.g. using a secure password generator).
Example (.env file):
LITELLM_MASTER_KEY="sk-1234"            # master admin key
LITELLM_SALT_KEY="sk-abcdef...random"   # salt for encryption
With the database URL, master key, and salt key provided, start the container. LiteLLM will automatically initialize the database (running migrations) and enable key management.
Once running, you can access the LiteLLM dashboard UI by navigating to http://<host>:4000/ui in your browser. For example, if running locally: http://localhost:4000/ui. The UI will prompt for login or show usage stats. (By default, it may accept the master key as a form of authentication for admin access.)
Using Virtual API Keys: With the master key set, you can create project-specific API keys (virtual keys) that clients use to call your LiteLLM proxy. These are generated via the REST endpoint POST /key/generate. For example, using curl with the master key:
curl -X POST 'http://0.0.0.0:4000/key/generate' \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{
    "models": ["gpt-3.5-turbo", "gpt-4", "claude-2"],
    "duration": "20m",
    "metadata": { "user": "[email protected]", "team": "dev" }
  }'
This request (authorized with the master key sk-1234) asks the proxy to create a new key limited to the specified models and a 20-minute lifespan. The response contains a new API key (e.g. sk-xxxx...) and its expiration timestamp. That key can then be given to a user or application – they use it as the Bearer token when calling the proxy’s OpenAI-compatible endpoints (e.g. /v1/chat/completions). LiteLLM routes the request to the appropriate underlying model (as per your config) and tracks the usage under that key.
You can create multiple keys (with different model access, time limits, or metadata) to segregate usage by project or user. The dashboard UI also allows monitoring these keys’ usage and enforcing budgets or rate limits per key/project (you can set spend limits which the proxy will enforce, though configuring budgets may involve additional API calls or UI actions beyond the scope of this guide).
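Once a client has a virtual key, it calls the proxy exactly like the OpenAI API. A sketch, assuming a key returned by the call above and one of the models it was granted (values are placeholders):

curl http://0.0.0.0:4000/v1/chat/completions \
  -H 'Authorization: Bearer sk-xxxx-generated-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'

Usage and spend for this call should be logged against that key, which is what the dashboard aggregates per user or project.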
Gotcha – Key Prefix: If you encounter authentication errors, ensure all keys (master or virtual) start with the "sk-" prefix. The proxy expects this format and will reject tokens that don’t comply.
Gotcha – UI Access: The UI is served by the LiteLLM backend at the /ui path, so if you are using a reverse proxy or load balancer, make sure it forwards that path. The UI may also rely on the API being reachable at the same host (it’s often a single-page app). If you see a blank dashboard, double-check that the proxy’s API endpoints (like /user/info or /stats) are accessible and that you included the required env vars.
2. Deploying LiteLLM on Amazon EKS with Helm
Now we will deploy LiteLLM to a Kubernetes cluster, specifically Amazon EKS, using the provided Helm chart. This allows scaling and integration into a cloud environment. We’ll cover installing via Helm, setting up the config and secrets (DB, keys), enabling the UI, and notes on integrating with Ollama/vLLM on EKS.
2.1 Prerequisites for EKS Deployment
• EKS Cluster: Ensure you have an EKS cluster up and running (Kubernetes 1.23 or later). You can create one using eksctl or the AWS console. For example, a simple cluster creation command is: eksctl create cluster --name=litellm-cluster --region=us-west-2 --node-type=t2.small. Make sure the node group has sufficient resources for your LLM workloads (for heavy models you may want larger instances, or GPU-enabled instances if using vLLM/Ollama with GPU).
• Helm: Install Helm on your machine (and ensure you have kubectl access to the cluster).
• Postgres Database: For virtual keys and the UI, have a Postgres instance accessible from the cluster (e.g., an AWS RDS Postgres or a self-managed Postgres in-cluster). Note the connection string for later.
2.2 Installing LiteLLM via Helm Chart
LiteLLM provides a Helm chart (currently in beta) which simplifies deployment. There are two ways to use it:
Option A: Helm Repository (if available). In some cases, the LiteLLM chart may be available via a Helm repo. For example, a community Helm repo exists (as in one tutorial) which you can add and install directly:
helm repo add litellm https://chetankapoor.github.io/litellm-helm-charts
helm repo update
helm install my-llm litellm/litellm --namespace litellm --create-namespace
(This assumes a maintained repository of the chart. Replace my-llm and the namespace as desired.)
Option B: OCI Registry (official). The official chart is published as an OCI package on GitHub Container Registry. You can pull and install it manually:
helm pull oci://ghcr.io/berriai/litellm-helm --version 0.1.2   # download the chart tarball
tar -zxvf litellm-helm-0.1.2.tgz                               # unpack it
helm install my-llm ./litellm-helm                             # install the chart
This deploys LiteLLM with default settings into your cluster. By default the service is ClusterIP on port 4000. You can override values to suit your needs (see the next section).
Once deployed, verify the pod is running:
kubectl get pods -l app=litellm
You should see a pod (or multiple, if you scaled) named litellm-... in a Running state.
2.3 Configuration and Customization via Helm Values
You will likely need to customize the Helm deployment to include your configuration and secrets. This can be done by supplying a values.yaml file or --set flags during installation. Key things to configure:
• Image version: It’s wise to pin a specific image tag for consistency. In values, you can set:
image:
  repository: ghcr.io/berriai/litellm
  tag: main-stable   # or a specific version like main-v1.72.3
(By default the chart may use main or latest; pinning to a stable tag or SHA is recommended to avoid unpredictability.)
• Replicas: If you want multiple instances for high availability or load, set replicaCount (e.g., 3). The chart’s default is likely 1.
• Service type & port: To access LiteLLM from outside the cluster, you can use a LoadBalancer service. For example:
service:
  type: LoadBalancer
  port: 4000
This will create an AWS ELB and expose port 4000. (You can also use an Ingress if preferred; in that case the service can remain ClusterIP and you’d configure an ALB Ingress for paths like /ui and /v1/*.)
• LiteLLM Config: You have two options:
1. Embed via values: Small config entries can be placed under the config section in values (if supported by the chart). Some charts allow specifying the entire config.yaml content in values or as a separate file.
2. ConfigMap: Alternatively, create a ConfigMap manually and mount it. The LiteLLM docs show creating a ConfigMap named litellm-config from your proxy_config.yaml and then referencing it in the deployment. The provided Helm chart likely has a mechanism to include your config. Check the chart’s README – for example, you might set a value like:
configMap:
  "config.yaml": |-
    model_list:
      - model_name: "llama2-local"
        litellm_params:
          model: "ollama/llama2"
          api_base: "http://ollama.default:11434"
    ... (your full config here)
(The above is an illustrative approach; the actual key name could differ.)
If not directly supported via values, you can always create the ConfigMap as in the docs (see the sketch below) and modify the deployment to mount it. Helm’s flexibility lets you supply extra manifests or patches if needed.
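A minimal sketch of the manual ConfigMap route; the ConfigMap name and mount path mirror the docs example, while the volume wiring shown here is an assumption about how your deployment is patched:

# Create the ConfigMap from your local config file
kubectl create configmap litellm-config --from-file=config.yaml

# Then mount it in the LiteLLM pod spec and point the proxy at it:
volumes:
  - name: litellm-config
    configMap:
      name: litellm-config
containers:
  - name: litellm
    args: ["--config", "/app/config.yaml"]
    volumeMounts:
      - name: litellm-config
        mountPath: /app/config.yaml
        subPath: config.yaml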
• Environment Variables (Secrets): You need to supply sensitive env vars like LITELLM_MASTER_KEY, LITELLM_SALT_KEY, and any provider API keys or the DATABASE_URL. It’s best to use Kubernetes Secrets for these. For example, create a secret manifest:
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
type: Opaque
data:
  LITELLM_MASTER_KEY: <base64-encoded-value>
  LITELLM_SALT_KEY: <base64-encoded-value>
  DATABASE_URL: <base64-encoded-value>
  # ... any other keys (OpenAI API keys, etc.) ...
Then ensure the deployment picks them up. The LiteLLM Helm chart likely has a way to specify extra env. If not, you can modify values to include:
extraEnvFrom:
  - secretRef:
      name: litellm-secrets
(Or similar, depending on the chart.) In the raw Kubernetes example, the docs mount the secret this way, which suggests the Helm chart may do the same if you provide the secret name.
• Database configuration: Ensure the DATABASE_URL env var points to your Postgres. Also verify network access (if using RDS, your cluster’s security group must allow outbound to RDS, and RDS must allow inbound from the cluster or have a publicly accessible endpoint).
• Master/Salt Keys: Provide your chosen master key (remember the sk- prefix) and a salt key as above via the secret.
Once your values are ready, install/upgrade the release:
helm install my-llm ./litellm-helm -f values.yaml --namespace litellm
# (or `helm upgrade --install` if reusing the release)
After deployment, verify that the pod has the env vars:
kubectl exec -it <pod-name> -- env | grep LITELLM
You should see LITELLM_MASTER_KEY and the others listed. Note that env prints the actual values even though they are sourced from the secret, so only run this where that is acceptable.
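Pulling the pieces above together, a values.yaml for this setup might look roughly like the following. The exact key names (extraEnvFrom, configMap, etc.) depend on the chart version, so treat this as a template to check against the chart’s README rather than a drop-in file:

image:
  repository: ghcr.io/berriai/litellm
  tag: main-stable        # pin to a specific version in production
replicaCount: 2
service:
  type: LoadBalancer
  port: 4000
extraEnvFrom:
  - secretRef:
      name: litellm-secrets   # the secret created above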
2.4 Accessing the UI on EKS
If you used a LoadBalancer service, get the external URL:
kubectl get svc litellm-service -o wide
Suppose it outputs an address like ae2038********.us-west-2.elb.amazonaws.com for port 4000. Then navigate to http://<that-elb-url>:4000/ui in your browser. You should see the LiteLLM dashboard login or home screen. Use your master key when prompted (if the UI login uses API key authentication).
If using port-forward (for testing), you can do:
kubectl port-forward svc/litellm-service 4000:4000
Then visit http://127.0.0.1:4000/ui.
In the UI, you can monitor requests, create keys, and view spend tracking across projects. The UI communicates with the proxy’s REST endpoints, so ensure your service or ingress allows those endpoints (especially the ones under / or /v1).
2.5 Integrating Ollama and vLLM on EKS
To serve local models on EKS, deploy the model backends in the cluster alongside LiteLLM:
• Ollama on EKS: Ollama can be run in a container (there is an official Docker image). You might deploy it as a separate Deployment (possibly with a PVC to store downloaded models at /root/.ollama) and expose it as a ClusterIP Service on port 11434 (the Ollama default); see the manifest sketch after this list. For example, use service name “ollama” in namespace “default” (the same as LiteLLM, or adjust the config accordingly). In your LiteLLM config, set api_base: "http://ollama.default:11434" (if in the same namespace, you can just use http://ollama:11434). LiteLLM will then route requests to the Ollama service. Gotcha: Ensure the Ollama container has access to a model – you may need an init step to download it (e.g., running ollama pull <model> or using a pre-loaded volume). Also, if using GPUs, run the Ollama container on a GPU node (Ollama supports GPU acceleration on Linux Docker).
• vLLM on EKS: vLLM can be deployed as a server with an OpenAI-compatible API (the python -m vllm.entrypoints.openai.api_server command). The vLLM docs provide a Helm chart as well. Deploy vLLM on a GPU node for best performance. Expose it on a Service (e.g., name “vllm”, port 8000). Then in LiteLLM’s config, set model: "hosted_vllm/your-model-name" and api_base: "http://vllm.default:8000". LiteLLM will forward requests to vLLM’s endpoint. If the vLLM server itself requires an API key, include api_key: ... in the litellm_params (and supply that via env as well). Gotcha: vLLM is memory-intensive; ensure the k8s resource limits match the model’s requirements (and if using GPU, add nvidia.com/gpu: 1 to the vLLM pod spec and run the NVIDIA device plugin DaemonSet on your nodes).
• Connectivity: By default, LiteLLM in one pod calls the Ollama/vLLM service via cluster networking. No additional config is needed if they share a namespace (Kubernetes resolves the service DNS). Just be sure to use the correct service name and port in the api_base URL. If they are in different namespaces, include the namespace (e.g., ollama.namespace.svc.cluster.local).
• Security: If you restrict network policies, allow the LiteLLM pod to communicate with the model service pods.
• Scaling: You might run multiple replicas of LiteLLM. Each LiteLLM instance independently forwards to the backend services. If those backend services themselves need scaling (for throughput or HA), you can scale them as well (Ollama may not easily scale horizontally for a single model, but vLLM can handle concurrent requests – check the vLLM docs for its concurrency capabilities). Alternatively, run multiple LiteLLM deployments pointing to separate model pods and use LiteLLM’s own load-balancing and fallback features across models (beyond this scope, but LiteLLM can route between multiple providers or instances if configured in model_list with groups).
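A minimal sketch of the Ollama backend referenced above (Deployment plus ClusterIP Service). The PVC name and namespace are illustrative, and GPU scheduling is omitted:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama     # downloaded models live here
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models       # assumed pre-created PVC
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434

With this in place, api_base: "http://ollama:11434" (same namespace) or "http://ollama.default:11434" in the LiteLLM config resolves to this Service.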
2.6 Understanding LiteLLM Configuration Settings
For completeness, here is a rundown of the main sections and settings in the LiteLLM config file (config.yaml):
• model_list: This is the core of the config. It’s a list of models that the proxy knows about. Each model has:
  • model_name (string): The name clients use when requesting this model.
  • litellm_params (dict): The parameters telling LiteLLM how to call the underlying LLM API. Common fields include:
    • model: The provider and model identifier (e.g. "openai/gpt-4", "azure/gpt-35-turbo", "ollama/llama2", "hosted_vllm/opt-125m", etc.).
    • api_base: Base URL for the API if using a non-default endpoint (e.g., a custom base for Azure, or the URL of your local server as shown in the examples).
    • api_key: API key or a reference to one (often os.environ/KEY_NAME to pull from the environment).
    • Other provider-specific settings, such as organization (for OpenAI) or the AWS region (for Bedrock), can go here as needed by that integration. Essentially, anything you’d pass to litellm.completion in code can be specified here.
    • keep_alive: (for Ollama) You can set a keep-alive interval for persistent model sessions (e.g., "8m" as in the Ollama Chat example).
  • model_info (dict, optional): Metadata about the model. For example:
    • id: An optional identifier.
    • mode: If this model is an embedding model, you might set mode: embedding here.
    • max_tokens: The context length of the model (if not auto-known).
    • input_cost_per_token, output_cost_per_token: Cost-tracking info (if you want to override or define a custom model’s token cost for spend tracking).
    • supports_function_calling: Boolean, whether the model can do function/tool calling natively (used in advanced features).
    • Any other custom info for your use (the config is flexible about extra fields, but those above are recognized by LiteLLM).
• environment_variables: A mapping if you want to explicitly define environment variables that the proxy should consider. Usually you don’t need to list them here if you’ve already put os.environ/VAR in the model params; setting them in the actual environment is enough. This section can often be left empty or omitted.
• litellm_settings: Global settings for the proxy’s behavior. Some notable settings include:
  • Logging & Callbacks: You can integrate with logging/monitoring systems. For example, success_callback and failure_callback lists (services to notify/log on request success/failure), callbacks (called on both), and service_callbacks (for internal service health logging). LiteLLM supports integrations like Datadog, Prometheus, Sentry, Langfuse, etc., which can be enabled here by listing their names. For instance, setting service_callbacks: ["prometheus"] will expose Prometheus metrics (the image includes a Prometheus client by default).
  • Logging Privacy: turn_off_message_logging (if true, the proxy will not include full prompt/response content in logs, to avoid sensitive info leakage). redact_user_api_key_info (if true, strips user API key details from logs for privacy).
  • Debugging: set_verbose (very verbose logging of internal calls), json_logs (output logs in JSON format).
  • Networking: request_timeout (global timeout for LLM API calls in seconds). force_ipv4 (use IPv4 only, useful if IPv6 causes issues with some APIs).
  • Reliability & Fallbacks: default_fallbacks (list of model names to fall back to if the primary model fails; e.g., you could specify an older model as backup). content_policy_fallbacks and context_window_fallbacks allow specifying alternate models to use if a request fails due to content filtering or context length limits.
  • Caching: cache (enable/disable caching) and cache_params to configure a Redis or other cache store. You can set cache_params.type to "redis" and provide host/port, or even configure a semantic cache with Qdrant, etc. (These are advanced; to enable caching on EKS, you’d also deploy a Redis or use AWS ElastiCache and supply the address here.)
  • Misc: There are many other toggles, such as disable_hf_tokenizer_download (to avoid downloading HuggingFace tokenizers if you prefer to use OpenAI’s) or enable_json_schema_validation (to strictly validate request payloads against the OpenAI schema). Most can be left at their defaults unless needed.
• callback_settings: This section can contain configuration specific to the logging callbacks mentioned above. For example, under callback_settings.otel you could set message_logging: false to avoid sending full message content to OpenTelemetry logs. Each integration (Langfuse, etc.) may have its own sub-options.
• general_settings: Miscellaneous global settings not covered elsewhere:
  • completion_model: A default model name to use if a client request doesn’t specify one (though the OpenAI API normally requires one).
  • max_parallel_requests and global_max_parallel_requests: Cap concurrency per instance or cluster-wide.
  • database_url: You can put the Postgres URL here instead of in the environment (but using an env var/secret is safer).
  • allow_requests_on_db_unavailable: If true, the proxy will still serve LLM requests even if it cannot reach the database to verify a virtual API key. (By default, if the DB is down, auth will fail – enabling this setting trades off some security to keep the service up.)
  • enforce_user_param: If true, requires that all requests include a user field (the OpenAI API allows a user parameter for tracking). This can be used to enforce user attribution for requests.
  • alerting and alerting_threshold: Configure basic alerting (e.g., Slack/email if set up; may require additional setup or the enterprise version).
  • master_key: You could hardcode the master key here instead of using the env var (not recommended for production – better to use env).
  • disable_spend_logs and related flags: Turn off writing detailed transaction logs to the DB for each request if you want to reduce overhead.
  • (There are more, but these are some common ones. Refer to the LiteLLM docs for the exhaustive list.)
In summary, the config file lets you define which models (and which providers/backends) the proxy will serve, plus various operational settings (logging, caching, auth requirements). Most settings have sensible defaults, so you can start with a minimal config (just model_list and maybe general_settings.database_url), then gradually enable features like caching or monitoring by updating the config and reloading the deployment.
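As a reference point, here is a small config.yaml sketch that combines the sections described above. The Redis host, fallback choice, and timeout are illustrative assumptions, not required values:

model_list:
  - model_name: "llama2-local"
    litellm_params:
      model: "ollama/llama2"
      api_base: "http://ollama.default:11434"
  - model_name: "opt-125m-vllm"
    litellm_params:
      model: "hosted_vllm/facebook/opt-125m"
      api_base: "http://vllm.default:8000"
    model_info:
      max_tokens: 2048                      # context length used for limits/tracking

litellm_settings:
  request_timeout: 600                      # seconds before an upstream call is abandoned
  default_fallbacks: ["llama2-local"]       # try this model if the primary fails
  cache: true
  cache_params:
    type: "redis"
    host: "redis.default"                   # assumed in-cluster Redis service
    port: 6379

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL     # or provide DATABASE_URL directly as an env var
  allow_requests_on_db_unavailable: true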
2.7 Additional Gotchas and Notes
• CPU vs GPU for local models: If you plan to serve large local models (like Llama 2 70B) via Ollama or vLLM, GPU acceleration is highly recommended. On EKS, that means using AWS GPU instance types for your nodes and installing NVIDIA drivers (or the AWS Neuron SDK for Inferentia instances). Ensure your Kubernetes scheduling is set up (taints/tolerations, node selectors, and resource requests for GPUs). The Ollama and vLLM containers should be configured to use the GPU. LiteLLM itself doesn’t need a GPU (it’s just a proxy), but vLLM uses GPU memory for model inference.
• Scaling and Load Balancing: LiteLLM supports running multiple replicas. If you do, you might also distribute load among multiple backend instances. LiteLLM has built-in routing strategies (simple round-robin, least-busy, etc.) if you list multiple entries for the same model name with different api_base values. For example, you could configure two vLLM endpoints under one model name and set a routing strategy. This is advanced usage – ensure consistency of the underlying models if you do this.
• EKS Ingress: Instead of a LoadBalancer service, you might use the ALB Ingress Controller. In that case, you’d expose port 4000 via ingress rules. Streaming should still work: OpenAI-style streaming uses HTTP chunked responses rather than WebSockets, and passes through an ALB by default. Also, map the paths /ui and / properly (the UI makes calls to /v1 and /key endpoints on the same host).
• AWS Credentials: If any of your configured models use AWS services (e.g., Bedrock or SageMaker via LiteLLM), you’ll need AWS credentials available. On EKS, the best practice is to use IAM Roles for Service Accounts (IRSA). Alternatively, supply AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as env vars via secrets. Similarly, for Azure OpenAI or others, supply those keys via env and reference them in the config (as shown with os.environ/ above).
• Postgres on EKS vs RDS: For quick testing, you could deploy a Postgres container in the cluster (or use the LiteLLM “database” Docker image, which bundles Postgres). For production, a managed RDS database is safer and more scalable. If using RDS, consider setting DISABLE_SCHEMA_UPDATE=false to allow LiteLLM to run migrations on startup (the Helm chart’s pre-hook does this automatically). Also, open the RDS security group to the EKS cluster, or run LiteLLM in the same VPC.
• Monitoring: LiteLLM can output Prometheus metrics (number of requests, latency, cost, etc.). If you installed the prometheus_client Python package (as in our Dockerfile), metrics are usually exposed on the /metrics endpoint. You can scrape this with Prometheus. Consider deploying Prometheus in the cluster or using Amazon Managed Prometheus, and configure an annotation on the service to enable scraping (see the sketch after this list). The UI itself gives some basic monitoring, but Prometheus + Grafana provide more detail if needed.
• Upgrades: When updating the LiteLLM version, note that the main tag changes frequently. Use a stable release tag or a specific version. Always test config compatibility when upgrading, as new features may change how config fields work. Use the Helm chart values to specify the image tag (don’t just pull latest in production).
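A sketch of the annotation-based scraping mentioned in the Monitoring note. It assumes your Prometheus is configured with the conventional kubernetes-services scrape job that honors prometheus.io/* annotations; those annotations are a widespread convention, not a Prometheus built-in, so adapt to your monitoring stack:

apiVersion: v1
kind: Service
metadata:
  name: litellm-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "4000"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: litellm
  ports:
    - port: 4000
      targetPort: 4000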
By following this guide, you should have a robust deployment of LiteLLM, either locally via Docker or on EKS with Kubernetes. You can now serve requests to a unified endpoint (your LiteLLM URL) that proxies to various LLM providers – including local models hosted via Ollama and vLLM – all while tracking usage and managing access through virtual API keys.
Refer to the official LiteLLM documentation for advanced configurations or troubleshooting. Happy deploying!
Sources:
• LiteLLM Official Docs – Deployment (Docker & Kubernetes)
• LiteLLM Official Docs – Config File and Settings
• LiteLLM Official Docs – Ollama Integration
• LiteLLM Official Docs – vLLM Integration
• LiteLLM GitHub README – Key management, UI usage
• “How to install LiteLLM using helm on Kubernetes” (C. Kapoor)
• LiteLLM GitHub Issues – Helm chart and best practices