@davehague
Created April 7, 2025 13:59
Use Llama models in Ollama and LibreChat

To download Llama models and use them in LibreChat on your Mac via Ollama, follow these steps:


Step 1: Download the Llama Models

1. Request Access

  • Visit the Meta Llama Downloads page.
  • Complete the access request form and accept Meta's Community License Agreement.
  • Once approved, you’ll receive a signed URL for downloading the model weights.

2. Install the Llama CLI

  • Open your terminal and install the Llama CLI tool:
    pip install llama-stack
  • Verify installation by running:
    llama --help

3. List Available Models

  • Use the CLI to list available models:
    llama model list
  • Identify the model ID for the version you want (e.g., Llama 3.1-8B or Llama 4).

4. Download the Model

  • Run the following command to download your chosen model:
    llama download --source meta --model-id CHOSEN_MODEL_ID
  • When prompted, provide the signed URL you received via email.
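
For example, to download the 8B instruct variant (the exact ID string comes from llama model list; the one shown here is illustrative):

    llama download --source meta --model-id Llama3.1-8B-Instruct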

Step 2: Prepare for LibreChat

LibreChat supports running local models like Llama as long as they are served by a compatible backend. You can integrate them using Ollama, or through another local serving layer (for example, one built on Hugging Face Transformers) that exposes an OpenAI-compatible API.

Use Ollama

  1. Install Ollama:

    • Download and install Ollama from its official website.
    • Verify installation by running:
      ollama --version
  2. Import the Model into Ollama:

    • Move your downloaded model files into a directory accessible by Ollama. Ollama expects GGUF weights, so Meta's original checkpoints may first need converting (for example, with llama.cpp's conversion scripts).
    • Ollama has no import subcommand; instead, reference the weights from a Modelfile and build the model with ollama create (see the sketch after this list):
      ollama create my-llama -f Modelfile
  3. Run LibreChat with Ollama Backend:

    • Configure LibreChat to use Ollama by adding it as a custom (OpenAI-compatible) endpoint in its librechat.yaml file. Follow LibreChat’s documentation for the full schema; a minimal sketch of that configuration follows below.
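
A minimal sketch of the import from step 2, assuming the weights are already converted to GGUF at /path/to/your/model.gguf and that you want to call the model my-llama (both names are placeholders):

    # Write a one-line Modelfile pointing at the local GGUF weights
    echo 'FROM /path/to/your/model.gguf' > Modelfile

    # Build the model under a local name, then give it a quick test
    ollama create my-llama -f Modelfile
    ollama run my-llama "Say hello"

If you would rather skip the manual download and conversion entirely, ollama pull llama3.1:8b fetches a prebuilt, pre-quantized Llama build straight from the Ollama library.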

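For step 3, one common shape of that configuration, based on LibreChat's custom-endpoint documentation (check the current docs for the exact schema, and set the model name to whatever you created in Ollama):

    # librechat.yaml: register Ollama as an OpenAI-compatible custom endpoint
    endpoints:
      custom:
        - name: "Ollama"
          apiKey: "ollama"
          baseURL: "http://localhost:11434/v1/"
          models:
            default: ["my-llama"]
            fetch: true

Make sure ollama serve is running before you start LibreChat. If LibreChat runs in Docker, replace localhost with host.docker.internal so the container can reach Ollama on the host.
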
Step 3: Optimize for Mac Usage

If you have limited hardware resources, consider quantization to reduce memory usage:

  • Use FP8 or Int4 quantization modes during inference:
    torchrun --nproc_per_node=1 \
      -m models.llama4.scripts.chat_completion \
      --quantization-mode int4_mixed \
      --checkpoint-dir /path/to/checkpoints

This reduces GPU memory requirements while maintaining reasonable performance.
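
If you are serving the model through Ollama rather than Meta's reference scripts, quantization is largely handled for you: the Ollama library publishes pre-quantized GGUF builds, and you can pick a specific level by tag (the tag below is an example; the library page lists what is actually available):

    # Pull a 4-bit quantized build of Llama 3.1 8B Instruct from the Ollama library
    ollama pull llama3.1:8b-instruct-q4_K_M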

