
On-premise LLM model deployments

Requirements

LLM models consumed as an external resource come with multiple disadvantages:

  • Privacy - confidential information usage by the models cannot be controlled, and intellectual property (IP) safety cannot be guaranteed
  • Flexibility - specialization and fine-tuning of more general models offers efficiency gains in resources used and outputs achieved, which externally hosted models do not allow
  • Stability - reliance on an external vendor undermines internal system stability, since availability cannot be guaranteed

An approach whereby LLM models are deployed, fine-tuned, and utilized locally (on-premise) allows:

  • Knowledge - building and retaining the domain knowledge of AI system utilization
  • Deployments - creating and scaling the infrastructure is possible using localized, resource-targeted deployments
  • IP - protection and containment of intellectual property in the form of code, design, specifications, documentation and resources

Goal specification

This analysis aims to answer multiple questions related to on-premise usage of LLM models:

  1. What are the state-of-the-art deployment methodologies for on-premise LLMs?
  2. What are the limitations of locally deployed LLM workloads?
  3. Which LLMs are most suitable for resource usage, workload compatibility, and tunability?
  4. What are the limitations of open-source LLM utilization in terms of release licenses?
  5. What mechanisms exist for fine-tuning models to gain additional insights into IP-based resources (e.g., code and specifications)?
  6. What types of training data should be utilized to fine-tune models for specific use cases?
  7. How should the on-premise system be designed to ensure sustainability in terms of maintenance effort and future extensibility?
  8. How should the system be designed to optimize existing resources and minimize waste?
  9. What are the trade-offs associated with using on-premise workloads to achieve desired outcomes?

Analysis

Utilizability

Benefits of on-premise LLM:

  1. Privacy & IP protection
    • The input and output of an external model cannot be reliably controlled, and no assurances for data privacy can be guaranteed without the implementation of fully homomorphic encryption (FHE) 1.
    • The utilization of FHE necessitates the reimplementation of the LLM in terms of integer operations over table-like lookup operations.
    • Computationally realistic in the future.
  • The only viable alternative given current technologies is on-premise execution.
  1. Flexibility

    • Instead of general LLMs, specialized tasks can benefit from fine-tuned and optimized alternatives that are more capable of handling targeted tasks.
    • Knowledge storage of domain-specific topics, such as information from the code base and documentation, can enhance the performance of assisted tasks.
  2. Stability

  • Local services are self-sustainable in terms of external system reliance and can be operated indefinitely.
    • Defective implementations can be rolled back.
    • Small context windows are possible due to the retention of the most relevant information directly in the model.
  1. Deployments
  • Achievable scaling deployments with the possibility of on-demand specialized model spawning

Deployments

Hardware requirements

The inference capabilities of LLMs can be run over:

  1. CPU
    • manageable by infrastructure orchestrators (K8s, OpenShift, ...) as a resource
  2. GPU
    • manageable by infrastructure orchestrators (K8s, OpenShift, ...) as a resource
  3. NPU
    • manageable on the host level

CPU workloads

Sufficient for small-scale single-tenant applications, but not scalable to multi-tenant systems.

Ideal for highly optimized models capable of running in:

  • the browser
  • directly on the host

CPU-based inference does not scale beyond a single user and is bound by the resource limits of the host running the model (e.g., Meta-Llama-3-70B-Instruct-Q4_K_M achieves 1.44-1.86 tokens per second on an Intel Ultra CPU 10).

  • Most CPU workloads prioritize optimizing token/dollar/watt ratios rather than throughput.
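
For host-local CPU inference, a minimal sketch using the llama-cpp-python bindings is shown below; the GGUF model path, thread count, and generation parameters are placeholders.

```python
# Minimal CPU-only inference sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF model path and the generation parameters are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",  # quantized GGUF weights
    n_ctx=4096,       # context window size
    n_threads=16,     # CPU threads used for inference
)

output = llm(
    "Summarize the benefits of on-premise LLM deployments.",
    max_tokens=256,
    temperature=0.2,
)
print(output["choices"][0]["text"])
```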
GPU workloads

The viable alternative for server-based infrastructure. Depending on the LLM, there are four major types of resource utilization:

  1. single tenant over single GPU
  2. single tenant over network spread GPUs (e.g. 2)
  3. multiple tenants over multiple orchestrated GPUs (e.g. using 3 on K8s)
  4. GPU clusters with shareable GPUs (multi-instance GPU (MIG), AMD memory pool remapping (MPR),...) capability for multiple tenants and variable workloads

From the perspective of practical applications, reasonably capable LLMs benefit from setups specified by at least 2. and 3., with 4. being practically usable only for organizations with variable workloads.

Kubernetes orchestration

K8s orchestration over the existing infrastructure based on microk8s would require the utilization of a GPU addon.

Pros:
  • the only real scalable solution for GPU workloads
  • possible reuse of existing tooling (LLMOps, GPU extensions)
  • high availability and fault tolerance
  • community support for LLMs built over K8s

Cons:
  • problematic setup
  • changing maintenance mechanisms

Required components of orchestration 4:
  • Kubernetes Cluster
    • horizontal and vertical scaling
    • GPU scheduling and sharing (e.g. multiple LLMs + a fine-tuning learning process)
  • GPU Support
  • Container Registry
    • internal registry for internally built models
  • LLM Model Files
    • weights, configuration, and tokenizer
    • model parallelism & sharding
    • fine-tuning and continuous learning workloads 5
  • Containerization
    • packaging pipelines for fine-tuned models
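
As a sketch of how such a component could be wired together, the Kubernetes Python client can request a GPU for an LLM-serving Deployment; the image, model name, namespace, and resource figures below are assumptions.

```python
# Sketch: create a Deployment that requests one NVIDIA GPU for an LLM server.
# Assumes the GPU operator / device plugin exposes the "nvidia.com/gpu" resource.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

container = client.V1Container(
    name="llm-server",
    image="vllm/vllm-openai:latest",                      # placeholder serving image
    args=["--model", "meta-llama/Llama-3.1-8B-Instruct"], # placeholder model
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1", "memory": "64Gi"},
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="llm-server"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "llm-server"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "llm-server"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="llm", body=deployment)
```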

LLMs

Challenges associated with implementing LLMs include:

  • VRAM (GPU memory) consumption,
  • inference speed,
  • throughput,
  • disk space utilization.

Comprehensive model comparisons can be accessed here. Details regarding open-source LLM licensing are available here.

Llama-3.1-70B

  • context: 128k tokens
  • Total Inference Memory: 236.69 GB
  • Model Weights: 130.39 GB
  • KV Cache: 19.77 GB
  • Activation Memory: 86.54 GB

Llama-3.1-405B

  • context: 128k tokens
  • powerhouse for synthetic data generation
  • ideal for knowledge distillation and minimization
  • not usable in a practical setup due to its resource requirements

Transformer Design: Capable of capturing long-range dependencies in textual data.

Multi-Head Attention: Allows the model to simultaneously focus on multiple input elements, enhancing its comprehension of intricate queries and enabling the generation of more nuanced outputs.

Layer Normalization: Enhances convergence rates and stabilizes training, leading to faster and more effective learning.

Reward Modeling:

  1. Bradley-Terry model
  2. Regression-style scoring

Both refine the model's responses, resulting in more coherent, contextually appropriate outputs.

Ongoing research and development efforts are directed towards addressing these limitations through:

Prompt Engineering: Refining the manner in which queries are presented to the model to optimize its performance in specific scenarios.

Continuous Learning: Implementing mechanisms for the model to acquire and enhance its knowledge base over time.

Task-Specific Fine-Tuning: Adapting the model for specialized applications while preserving its general capabilities.

Llama-3.1-Nemotron-51B

The Nemotron variant achieves approximately 2.2x faster inference speed compared to the reference model while preserving nearly the same level of accuracy. (https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/)

Optimized with TensorRT-LLM engines, this model enhances inference performance and is packaged as an NVIDIA NIM microservice. This facilitates the seamless and accelerated deployment of generative AI models across NVIDIA accelerated infrastructure.

Meta-Llama-3.1-7B

  • Total Inference Memory: 62.12 GB
  • Model Weights: 14.90 GB
  • KV Cache: 3.95 GB
  • Activation Memory: 43.27 GB

Efficient LLM for most specialization tasks.

Mistral 7B

  • Total Inference Memory: 60.26 GB
  • Model Weights: 13.04 GB
  • KV Cache: 3.95 GB
  • Activation Memory: 43.27 GB

User-facing interfaces 11

Serving frameworks

These ensure that models are delivered with optimal performance, handling challenges from latency optimization to resource management.

vLLM 6

  • license: Apache-2.0
  • high-performance inference engine designed to assist with the deployment of computationally intensive LLMs through efficient memory management techniques and optimized algorithms
    • uses the PagedAttention technique, outperforming Text Generation Inference (TGI) and HuggingFace Transformers
    • employs continuous batching, which groups incoming requests, reducing waiting time and optimizing resource usage
  • limited model support
  • exposes user-friendly APIs that are compatible with the OpenAI API, so it can proxy the API interactions
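
A minimal offline-inference sketch with vLLM; the model identifier and sampling settings are placeholders.

```python
# Sketch: offline batched inference with vLLM (pip install vllm).
# Model identifier and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(
    ["Explain continuous batching in one paragraph."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

For serving, vLLM also ships an OpenAI-compatible HTTP server (vllm.entrypoints.openai.api_server) that exposes the same model behind the proxied API mentioned above.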
Ollama

  • license: MIT
  • supports a variety of models
  • runs models mostly locally
  • focus on customization and performance optimization
  • offers OpenAI-like API integration, allowing seamless embedding of the locally configured model into an application

Aphrodite Engine

  • license: AGPL-3.0
  • fork of vLLM

Orchestration frameworks

BentoML/OpenLLM

  • license: Apache-2.0
  • A traditional AI platform that leverages Kubernetes helpers
  • Optimizes model serving through advanced inference techniques from vLLM and BentoML, ensuring low latency and high throughput
  • Excels in handling multiple concurrent users
  • With OpenAI-compatible APIs, OpenLLM facilitates seamless integration of various open-source models

AutoGen

  • license: CC-BY-4.0, MIT
  • Introduces a versatile multi-agent framework
    • Agents collaborate with each other to execute diverse tasks
    • Customizable and enhanced with prompt engineering and supplementary tools (e.g., Google Search API) to execute code, retrieve information, and collaborate on complex tasks
    • Supports various conversation patterns, including fully autonomous dialogues and human-in-the-loop problem-solving, making it suitable for developing next-generation LLM applications
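
A minimal two-agent sketch with AutoGen pointed at a locally served, OpenAI-compatible endpoint; the base URL, model name, and reply limits are assumptions.

```python
# Sketch: two-agent conversation with AutoGen (pip install pyautogen),
# using a local OpenAI-compatible endpoint, e.g. one exposed by vLLM.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {
    "config_list": [{
        "model": "meta-llama/Llama-3.1-8B-Instruct",   # placeholder model name
        "base_url": "http://localhost:8000/v1",        # placeholder local endpoint
        "api_key": "not-needed",                       # local server does not check keys
    }]
}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",         # fully autonomous for this sketch
    max_consecutive_auto_reply=0,     # stop after the assistant's first reply
    code_execution_config=False,
)

user_proxy.initiate_chat(
    assistant,
    message="Draft a checklist for rolling out an on-premise LLM service.",
)
```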

API Gateways

These tools facilitate seamless communication between your LLMs and external applications.

  • Simplify integration, thereby enhancing the usability and adaptability of your models to existing systems.

LiteLLM Proxy Server

  • license: MIT
  • A solution for managing AI model access across various applications; it acts as an intermediary between client requests and numerous LLM providers.
  • Not necessary for on-premise solutions.
  • Enables intelligent routing, allowing organizations to handle varying levels of demand and prevent service disruptions.
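
A sketch of routing a single request through LiteLLM to a locally hosted, OpenAI-compatible endpoint; the endpoint URL and model identifier are placeholders.

```python
# Sketch: call a locally served model through LiteLLM (pip install litellm).
# The endpoint and model identifier below are placeholders.
from litellm import completion

response = completion(
    model="openai/meta-llama/Llama-3.1-8B-Instruct",  # "openai/" prefix = any OpenAI-compatible server
    api_base="http://localhost:8000/v1",              # e.g. a local vLLM endpoint
    api_key="not-needed",
    messages=[{"role": "user", "content": "Ping the on-premise model."}],
)
print(response.choices[0].message.content)
```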

LLM Parallelization & Sharding

Advanced sharding can be achieved through orchestration tooling, e.g. as in 2.

Fine-tuning

Model resource requirements 13

Model resource consumption estimates can be calculated here.

General rule of thumb:

  • Inference: Number of parameters * Precision
  • Training: 4–6 times the inference resources
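
The rule of thumb translates directly into a quick estimator; the 4-6x training multiplier below is the heuristic from this section, not a measured figure.

```python
# Quick resource estimate from the rule of thumb above:
# inference ≈ parameters * bytes-per-parameter, training ≈ 4-6x inference.
def estimate_memory_gb(num_params_billion: float, precision_bytes: int = 2):
    inference = num_params_billion * 1e9 * precision_bytes / 1e9  # GB
    return {
        "inference_gb": inference,
        "training_gb_low": 4 * inference,
        "training_gb_high": 6 * inference,
    }

# e.g. a 70B model in FP16 (2 bytes per parameter) needs roughly 140 GB for weights alone
print(estimate_memory_gb(70, precision_bytes=2))
```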

Inference optimizations

Quantization

Model Size = Number of Parameters * Precision

Quantization consists of loading the model weights at a lower precision.

  • decreases the operating cost of the model
  • can harm the accuracy of the model, but a quantized larger model is generally preferable to a smaller model at higher precision

Additional memory optimizations: Double quantization
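
A sketch of loading a model with 4-bit quantization and double quantization enabled via the Hugging Face transformers + bitsandbytes integration; the model identifier is a placeholder.

```python
# Sketch: 4-bit quantized loading with double quantization enabled
# (pip install transformers accelerate bitsandbytes). Model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```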

KV Cache

KV Cache = 2 * Batch Size * Sequence Length * Number of Layers * Hidden Size * Precision

In the Transformers architecture, the decoding phase generates a single token at each time step, contingent upon the previous token tensors. To optimize computational efficiency, these tensors are cached in the GPU's memory to prevent their recalculation.

Additional memory optimizations: PagedAttention
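
The KV cache formula above as a small calculator; the shape values in the example call are placeholders roughly matching a Llama-3.1-70B-class model.

```python
# KV cache size = 2 * batch * seq_len * layers * hidden_size * precision_bytes
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_size, precision_bytes=2):
    return 2 * batch_size * seq_len * num_layers * hidden_size * precision_bytes

# example shape: batch 1, 8k tokens, 80 layers, hidden size 8192, FP16
print(kv_cache_bytes(1, 8192, 80, 8192, 2) / 1e9, "GB")
```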

Activations

Activation Memory = Batch Size * Sequence Length * Hidden Size * (34 + (5 * Sequence Length * Number of attention heads) / (Hidden Size))

During the forward pass of the model, intermediate activation values must be stored. These activations represent the outputs of each layer in the neural network as data propagates forward through the model. They must be stored in FP32 to avoid numerical instability and ensure convergence.

Additional memory optimizations: PagedAttention, Sequence-Parallelism, Activation Recomputation
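
The activation-memory formula in the same style; the example shape is a placeholder.

```python
# Activation memory = batch * seq_len * hidden * (34 + (5 * seq_len * heads) / hidden)
# (per the formula above; the result is a byte estimate for the forward pass)
def activation_memory_bytes(batch_size, seq_len, hidden_size, num_heads):
    return batch_size * seq_len * hidden_size * (
        34 + (5 * seq_len * num_heads) / hidden_size
    )

# example shape: batch 1, 4k context, hidden size 8192, 64 attention heads
print(activation_memory_bytes(1, 4096, 8192, 64) / 1e9, "GB")
```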

Model improvement techniques

Part of a longer process 15:

  1. Select a pre-trained model: a general-purpose model that has been trained on a large corpus of unlabeled data.
  2. Gather a relevant dataset: collect a dataset that is pertinent to the task at hand.
  3. Preprocess the dataset: clean it, split it into training, validation, and test sets, and ensure it is compatible with the model we intend to fine-tune.
  4. Fine-tuning: select a relevant dataset that is more specific to the task at hand. The dataset we choose may be related to a particular domain or application, enabling the model to adapt and specialize for that context.
  5. Task-specific adaptation: during fine-tuning, the model's parameters are adjusted based on the new dataset, enabling it to better comprehend and generate content relevant to the specific task. This process retains the general language knowledge acquired during pre-training while tailoring the model to the nuances of the target domain.
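
A sketch of steps 2-3 (dataset gathering and preprocessing) using the Hugging Face datasets library; the file path, column name, split size, and tokenizer are assumptions.

```python
# Sketch: load, clean, split, and tokenize a fine-tuning dataset
# (pip install datasets transformers). Paths and model names are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("json", data_files="internal_docs.jsonl")["train"]
raw = raw.filter(lambda ex: ex["text"] and len(ex["text"]) > 50)  # drop empty/short records

splits = raw.train_test_split(test_size=0.1, seed=42)             # train/validation split

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenized = splits.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)
```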

Full Fine Tuning (Instruction fine-tuning)

  • enhance a model’s performance across various tasks by training it on examples that guide its responses to queries
  • demands sufficient memory and computational resources, similar to pre-training, to handle the storage and processing of gradients, optimizers, and other components during training

Transfer learning

TBD - more complex approach than PEFT

Parameter-efficient fine-tuning (PEFT)

  • Cost-effective and efficient
  • PEFT preserves the original LLM weights

Techniques:

  • Low-Rank Adaptation (LoRA)
    • Instead of fine-tuning all the weights that constitute the weight matrix of the pre-trained large language model, two smaller matrices that approximate this larger matrix are fine-tuned.
    • Results in the original LLM plus a "LoRA adapter", which must be combined with its original LLM at inference time.
  • QLoRA-based fine-tuning 15,16
    • A more memory-efficient iteration of LoRA
    • Further enhances memory efficiency by also quantizing the weights of the LoRA adapters (smaller matrices) to lower precision.
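
A minimal LoRA adapter setup sketch with the peft library; the rank, alpha, dropout, and target modules are illustrative assumptions.

```python
# Sketch: attach LoRA adapters to a base model (pip install peft transformers).
# Hyperparameters and the base model identifier below are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                    # rank of the low-rank update matrices
    lora_alpha=32,                           # scaling factor
    target_modules=["q_proj", "v_proj"],     # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()           # only the adapter weights are trainable
```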

Tools

Guides

Order of actions to take

Implementation actions resulting from the analysis:

  1. Expand the orchestrated infrastructure to utilize all available GPU resources within the organization.
    • Develop deployments for GPU drivers (e.g., NVIDIA GPU Operator, MLX Operator).
    • Establish the minimal infrastructure setup to facilitate the assignment of commandeered resources to LLM workloads.
  2. Create a deployment for a selected LLM.
    • Consider the anticipated workload resource constraints.
  3. Create a deployment of a user-facing interface utilizing the sharded MIG infrastructure.
    • Integrate the deployed LLM and alternatives behind a combined OpenAI-like interface that encompasses a user-facing frontend and API functionality for agents.
  4. Prepare a training dataset for tasks that can be specialized, allowing the training of a minimized agent-like model.
  5. Specialize a new agentic model for a specific activity within the organization.
    • Employ fine-tuning techniques on the most suitable model.
  6. Introduce the specialized agent/model into the environment.
  7. Implement MLOps life-cycle management and monitoring of the existing infrastructure.

More or less useful resources

For code-base analysis

Based on the application, this would be a reasonable development alternative with the most advanced codebase insights available on the market: https://www.anthropic.com/news/github-copilot

The model possesses the capability to command host device resources, including interfacing with external functionality during the development process within an Integrated Development Environment (IDE).

References
