llama-cpp-python to host a local LLM inside a no-code visual programming chat flow

Part 1/2: Set up an OpenAI-compatible, CPU-based local LLM

  1. What is llama-cpp-python

    • llama-cpp-python offers a web server that aims to act as a drop-in replacement for the OpenAI API. This lets you use llama.cpp-compatible models with any OpenAI-compatible client (language libraries, services, etc.).
    • https://github.com/abetlen/llama-cpp-python
  2. Installation

pip install 'llama-cpp-python[server]'

  3. Download the model from Hugging Face: https://huggingface.co/TheBloke/LLaMa-13B-GGML/blob/main/llama-13b.ggmlv3.q4_0.bin and save it in ./models

  4. Start the server:

python3 -m llama_cpp.server --model models/llama-13b.ggmlv3.q4_0.bin

  5. Navigate to http://localhost:8000/docs to see the OpenAPI documentation.

  6. Try it! (see the snippet below)
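Since the server emulates the OpenAI API, the pre-1.0 `openai` Python package (current when this gist was written) can be pointed at it by overriding the base URL. A minimal sketch; the model label and prompt are placeholders, and the local server does not check API keys:

```python
import openai

openai.api_base = "http://localhost:8000/v1"  # local llama-cpp-python server
openai.api_key = "sk-local"  # placeholder; the local server ignores the key

response = openai.ChatCompletion.create(
    model="llama-13b",  # label only; the server answers with whatever model it loaded
    messages=[{"role": "user", "content": "Name three uses of a local LLM."}],
)
print(response["choices"][0]["message"]["content"])
```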

  7. Observations:

    • Check http://localhost:8000/docs to see the server functionally emulating an OpenAI model (same request and response shapes)
    • Measure tokens/sec on your machine, e.g. with the timing sketch below
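To get a rough tokens/sec number, time a request and divide by the `completion_tokens` count that the OpenAI response schema (and thus the emulated server) reports. A sketch under the same assumptions as above:

```python
import time
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "sk-local"

start = time.time()
response = openai.ChatCompletion.create(
    model="llama-13b",
    messages=[{"role": "user", "content": "Write a short paragraph about llamas."}],
)
elapsed = time.time() - start

tokens = response["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/sec")
```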
  8. Next:

    • Try with other models
    • Try with GPU support (see the rebuild note below)
    • Dockerize
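For the GPU bullet, the llama-cpp-python README (at the time of writing) documented rebuilding the package with cuBLAS enabled; treat the exact flags below as an assumption and check the current README:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall 'llama-cpp-python[server]'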

Part 2/2: Use it in a LangChain pipeline with Flowise

  1. What is Flowise

    • Open-source, no-code visual programming tool for building LLM apps
  2. Setting up Flowise (see the commands below)
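The original leaves this step blank; one common route, per the Flowise README, is to clone and build from source, which matches the `yarn start` below (treat the exact commands as an assumption):

git clone https://github.com/FlowiseAI/Flowise.git
cd Flowise
yarn install
yarn build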

  3. Start the app:

yarn start

  4. Access the app at http://localhost:3000

  5. Design a simple conversational application

(screenshot: flowise_port_8000_python_llama)
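The screenshot itself is not recoverable here; judging from its name, the flow points a ChatOpenAI-style node at the local llama server from Part 1 (base path http://localhost:8000/v1). Treat that reading as an assumption.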

  6. Call the flow from Python

(screenshot: flowise_api)
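The screenshot showed a Python call; Flowise exposes each chatflow over a REST prediction endpoint, so a minimal sketch looks like this (the chatflow ID is a placeholder copied from the Flowise UI, and the response shape can vary with the flow and Flowise version):

```python
import requests

CHATFLOW_ID = "<your-chatflow-id>"  # copy from the Flowise UI
API_URL = f"http://localhost:3000/api/v1/prediction/{CHATFLOW_ID}"

def query(question: str):
    # POST the question to the chatflow's prediction endpoint
    response = requests.post(API_URL, json={"question": question})
    response.raise_for_status()
    return response.json()

print(query("What can you tell me about llamas?"))
```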

  7. Call the flow with curl

(screenshot: flowise_curl)
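Equivalently from the command line (same placeholder chatflow ID):

curl http://localhost:3000/api/v1/prediction/<your-chatflow-id> -X POST -H "Content-Type: application/json" -d '{"question": "What can you tell me about llamas?"}'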
