A prompt to boost your lazy "do this" prompts.
You are Gemini CLI, an expert AI assistant operating in a special 'Plan Mode'. Your sole purpose is to research, analyze, and create detailed implementation plans. You must operate in a strict read-only capacity.
Your primary goal is to act like a senior engineer: understand the request, investigate the codebase and relevant resources, formulate a robust strategy, and then present a clear, step-by-step plan for approval. You are forbidden from making any modifications and from implementing the plan yourself.
On every machine in the cluster, install openmpi and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm

Next, download the pipeline parallel run script. Download it to the same path on every machine:
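A hedged sketch of this step, assuming the script is mlx-lm's pipeline_generate.py example (the script name, URL, and its --prompt flag are assumptions; check the mlx-lm repository for the current location and options):

# Fetch the script to the same path on every machine (URL assumed)
wget https://raw.githubusercontent.com/ml-explore/mlx-examples/main/llms/mlx_lm/examples/pipeline_generate.py

# hosts.txt lists one machine per line; -np matches the machine count
mpirun --hostfile hosts.txt -np 2 python pipeline_generate.py --prompt "Hello"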
How to quantize a 70B model so it will fit on 2x 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened). HQQ worked.

I rented a 4x GPU, 1TB RAM instance ($19/hr) on RunPod with a 1024GB container and 1024GB workspace disk. I think you only need 2x GPUs with 80GB VRAM and 512GB+ system RAM, so I probably overpaid.

Note: you need to fill in the form to get access to the 70B Meta weights.
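Since the weights are gated, you request access on the model page and then pull them down once approved. A minimal sketch assuming the Hugging Face hub CLI and the meta-llama repo id (both assumptions; adjust to wherever you sourced the weights):

# Log in with a token that has access to the gated repo
pip install -U "huggingface_hub[cli]"
huggingface-cli login

# Repo id assumed; pick the variant you were approved for
huggingface-cli download meta-llama/Llama-2-70b-hf --local-dir llama-2-70b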
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
make clean
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
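The snippet above only names the GGML file; fetch it from wherever you keep your models first. A hedged usage example, with main flags as they existed in llama.cpp around mid-2023:

# Interactive chat session with the Metal build
./main -m ./$MODEL -n 256 --color -i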
// You can read the article I wrote for this setup on medium.com at the link below
// https://medium.com/@rishi_singh/how-to-implement-end-to-end-encryption-using-pbkdf-in-flutter-a5508e7ad93e
import 'dart:math'; // secure random source for salts and IVs
import 'dart:typed_data'; // Uint8List byte buffers
import 'package:crypton/crypton.dart'; // RSA key pairs for key exchange
import 'package:pointycastle/block/aes.dart'; // AES block cipher
import 'package:pointycastle/digests/sha256.dart'; // SHA-256 digest for the PBKDF2 HMAC
import 'package:pointycastle/key_derivators/pbkdf2.dart'; // PBKDF2 key derivator
Latest versions of these scripts are available in the git repository https://github.com/jcmvbkbc/esp32-linux-build
This worked on 14/May/23. The instructions will probably require updating in the future.
LLaMA is a text-prediction model similar to GPT-2 and to the version of GPT-3 that has not been fine-tuned yet. It is also possible to run fine-tuned versions with this (like Alpaca or Vicuna, I think; those versions are more focused on answering questions).
Note: I have been told that this does not support multiple GPUs. It can only use a single GPU.
It is possible to run LLaMA 13B with a 6GB graphics card now (e.g. an RTX 2060), thanks to the amazing work on llama.cpp. The latest change is CUDA/cuBLAS support, which lets you pick an arbitrary number of transformer layers to run on the GPU. This is perfect for low VRAM.
llama.cpp commit used: 08737ef720f0510c7ec2aa84d7f70c691073c35d.
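A hedged illustration of the layer-offload flag; the model path is a placeholder, and 32 layers is just a starting point to tune against your VRAM:

# Offload 32 transformer layers to the GPU via cuBLAS; the rest run on the CPU
./main -m ./models/13B/ggml-model-q4_0.bin -p "Hello" -n 128 --n-gpu-layers 32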