Goals: Add links that give reasonable, good explanations of how stuff works. No hype, and no vendor content if possible. Practical first-hand accounts of running models in prod are eagerly sought.
```python
from transformers import (
    AutoConfig,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    AutoModelForCausalLM,
    LlamaTokenizerFast,
    PreTrainedModel,
    TextIteratorStreamer,
    StoppingCriteria,
)
```
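A minimal sketch of how these pieces typically fit together: load a causal LM in 4-bit via `BitsAndBytesConfig` and stream tokens with `TextIteratorStreamer`. The model id, prompt, and generation settings are my own placeholder assumptions, not taken from the original snippet.

```python
# Hedged sketch: load a causal LM in 4-bit and stream generated tokens.
# The model id, prompt, and max_new_tokens are placeholder assumptions.
from threading import Thread

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextIteratorStreamer,
)

model_id = "meta-llama/Llama-2-13b-chat-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

prompt = "Explain attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# TextIteratorStreamer yields decoded text chunks as generate() produces them,
# so generation runs in a background thread while we print the stream.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=200))
thread.start()
for chunk in streamer:
    print(chunk, end="", flush=True)
thread.join()
```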
https://github.com/jondurbin/airoboros
```bash
pip install --upgrade airoboros==2.0.13
```

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
make clean
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
```
This worked on 14/May/23. The instructions will probably require updating in the future.
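Once built, trying the model out looks roughly like this. The prompt and flag values are my own illustrative choices, not from the original notes; check `./main --help` in your build, since the CLI changes over time.

```bash
# Hedged sketch: run the downloaded model with the Metal build.
# -n caps the number of generated tokens; the prompt is just an example.
./main -m "$MODEL" -n 256 -p "Explain what a transformer layer does, briefly."
```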
LLaMA is a text-prediction model, similar to GPT-2 or to the base version of GPT-3 before any fine-tuning. It should also be possible to run fine-tuned versions (like Alpaca or Vicuna) with this; those versions are more focused on answering questions.
Note: I have been told that this does not support multiple GPUs. It can only use a single GPU.
It is now possible to run LLaMA 13B with a 6 GB graphics card (e.g. an RTX 2060), thanks to the amazing work on llama.cpp. The latest change adds CUDA/cuBLAS support, which lets you pick an arbitrary number of transformer layers to run on the GPU. This is perfect for low VRAM.
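For example, with a cuBLAS build you can offload only part of the model using the `-ngl` / `--n-gpu-layers` flag. A minimal sketch, assuming the same quantized 13B file as above; the layer count is my own guess for a 6 GB card and should be tuned per GPU:

```bash
# Hedged sketch: build with cuBLAS, then offload some transformer layers to the GPU.
# 20 layers is an illustrative guess for ~6 GB of VRAM.
make clean
LLAMA_CUBLAS=1 make
./main -m "$MODEL" -ngl 20 -n 128 -p "Hello"
```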
08737ef720f0510c7ec2aa84d7f70c691073c35d.

```python
import requests
import time
import os
import sys
import openai
import tiktoken
from termcolor import colored

openai.api_key = open(os.path.expanduser('~/.openai')).read().strip()
```
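A minimal sketch of how these imports are commonly used together: count prompt tokens with tiktoken, then call the (old-style, pre-1.0) openai chat API. The model name and message are illustrative assumptions.

```python
# Hedged sketch: count tokens with tiktoken, then call the (pre-1.0) openai chat API.
# The model name and message are illustrative assumptions.
import os

import openai
import tiktoken
from termcolor import colored

openai.api_key = open(os.path.expanduser('~/.openai')).read().strip()

model = "gpt-3.5-turbo"
prompt = "Summarise the transformer architecture in two sentences."

enc = tiktoken.encoding_for_model(model)
print(colored(f"prompt tokens: {len(enc.encode(prompt))}", "yellow"))

resp = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
)
print(resp["choices"][0]["message"]["content"])
```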
```python
from io import StringIO
import sys
from typing import Dict, Optional

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents.tools import Tool
from langchain.llms import OpenAI
```
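These imports match the classic early-2023 LangChain agent pattern. A hedged sketch of the usual wiring; the tool list and question are my own placeholders:

```python
# Hedged sketch of the early-2023 LangChain agent pattern.
# The tool list and question are placeholder assumptions.
from langchain.agents import initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
tools = load_tools(["llm-math"], llm=llm)  # calculator tool backed by the LLM

agent = initialize_agent(
    tools,
    llm,
    agent="zero-shot-react-description",  # ReAct-style tool use
    verbose=True,
)
agent.run("What is 13 raised to the 0.5 power?")
```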
| """ | |
| stable diffusion dreaming | |
| creates hypnotic moving videos by smoothly walking randomly through the sample space | |
| example way to run this script: | |
| $ python stablediffusionwalk.py --prompt "blueberry spaghetti" --name blueberry | |
| to stitch together the images, e.g.: | |
| $ ffmpeg -r 10 -f image2 -s 512x512 -i blueberry/frame%06d.jpg -vcodec libx264 -crf 10 -pix_fmt yuv420p blueberry.mp4 |
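The core idea behind that script is to fix a prompt, interpolate between random latents, and decode each intermediate latent into a frame. A rough sketch with diffusers; the model id, frame count, and the plain linear interpolation are my own simplifications for illustration, not the original script:

```python
# Hedged sketch: walk between two random latents and decode each step to a frame.
# Model id, frame count, and linear interpolation are illustrative simplifications.
import os

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "blueberry spaghetti"
shape = (1, pipe.unet.config.in_channels, 64, 64)  # latent shape for 512x512 output
z0 = torch.randn(shape, device="cuda", dtype=torch.float16)
z1 = torch.randn(shape, device="cuda", dtype=torch.float16)

os.makedirs("blueberry", exist_ok=True)
num_frames = 30
for i in range(num_frames):
    t = i / (num_frames - 1)
    latents = (1 - t) * z0 + t * z1  # crude linear walk between the two latents
    image = pipe(prompt, latents=latents).images[0]
    image.save(f"blueberry/frame{i:06d}.jpg")
```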
```python
# This code doesn't work, and isn't intended to.
# The goal of this code is to explain how attention mechanisms work, in code.
# It is deliberately not vectorized to make it clearer.
def attention(self, X_in: List[Tensor]):
    # For every token, transform the previous layer's output
    # into a query, key, and value vector.
    for i in range(self.sequence_length):
        query[i] = self.Q * X_in[i]
        key[i] = self.K * X_in[i]
        value[i] = self.V * X_in[i]
```
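For a version that actually runs, here is a small self-contained sketch of single-head scaled dot-product attention in NumPy. The dimensions and random inputs are arbitrary illustrative choices; this is the standard formulation, not a completion of the snippet above.

```python
# Hedged sketch: runnable single-head scaled dot-product attention in NumPy.
# Dimensions and inputs are arbitrary illustrative choices.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head)
    Q = X @ W_q                                # one query per token
    K = X @ W_k                                # one key per token
    V = X @ W_v                                # one value per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted mix of values per token

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
print(attention(X, W_q, W_k, W_v).shape)  # (5, 8)
```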