A prompt to boost your lazy "do this" prompts.
You are Gemini CLI, an expert AI assistant operating in a special 'Plan Mode'. Your sole purpose is to research, analyze, and create detailed implementation plans. You must operate in a strict read-only capacity.
Your primary goal is to act like a senior engineer: understand the request, investigate the codebase and relevant resources, formulate a robust strategy, and then present a clear, step-by-step plan for approval. You are forbidden from making any modifications and from implementing the plan yourself.
On every machine in the cluster, install openmpi and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm

Next, download the pipeline parallel run script. Download it to the same path on every machine:
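A hedged sketch of this step, assuming the script is mlx-lm's pipeline_generate.py example (the script name, URL, and its --prompt flag are assumptions; check the mlx-lm repository for the current location and options):

# Fetch the script to the same path on every machine (URL assumed)
wget https://raw.githubusercontent.com/ml-explore/mlx-examples/main/llms/mlx_lm/examples/pipeline_generate.py

# hosts.txt lists one machine per line; -np matches the machine count
mpirun --hostfile hosts.txt -np 2 python pipeline_generate.py --prompt "Hello"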
How to quantize a 70B model so it will fit on 2x 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened). HQQ worked.

I rented a 4x GPU, 1TB RAM instance ($19/hr) on RunPod with a 1024GB container and 1024GB workspace disk. I think you only need 2x GPUs with 80GB VRAM and 512GB+ system RAM, so I probably overpaid.

Note: you need to fill in the form to get access to the 70B Meta weights.
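Since the weights are gated, you request access on the model page and then pull them down once approved. A minimal sketch assuming the Hugging Face hub CLI and the meta-llama repo id (both assumptions; adjust to wherever you sourced the weights):

# Log in with a token that has access to the gated repo
pip install -U "huggingface_hub[cli]"
huggingface-cli login

# Repo id assumed; pick the variant you were approved for
huggingface-cli download meta-llama/Llama-2-70b-hf --local-dir llama-2-70b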
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
make clean
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
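The snippet above only names the GGML file; fetch it from wherever you keep your models first. A hedged usage example, with main flags as they existed in llama.cpp around mid-2023:

# Interactive chat session with the Metal build
./main -m ./$MODEL -n 256 --color -i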
// You can read the article I wrote for this setup on medium.com at the link below
// https://medium.com/@rishi_singh/how-to-implement-end-to-end-encryption-using-pbkdf-in-flutter-a5508e7ad93e
import 'dart:math'; // secure random source for salts and IVs
import 'dart:typed_data'; // Uint8List byte buffers
import 'package:crypton/crypton.dart'; // RSA key pairs for key exchange
import 'package:pointycastle/block/aes.dart'; // AES block cipher
import 'package:pointycastle/digests/sha256.dart'; // SHA-256 digest for the PBKDF2 HMAC
import 'package:pointycastle/key_derivators/pbkdf2.dart'; // PBKDF2 key derivator
Latest versions of these scripts are available in the git repository https://github.com/jcmvbkbc/esp32-linux-build
This worked on 14/May/23. The instructions will probably require updating in the future.
LLaMA is a text-prediction model similar to GPT-2 and to the version of GPT-3 that has not been fine-tuned yet. It is also possible to run fine-tuned versions with this (like Alpaca or Vicuna, I think; those versions are more focused on answering questions).
Note: I have been told that this does not support multiple GPUs. It can only use a single GPU.
It is possible to run LLaMA 13B with a 6GB graphics card now (e.g. an RTX 2060), thanks to the amazing work on llama.cpp. The latest change is CUDA/cuBLAS support, which lets you pick an arbitrary number of transformer layers to run on the GPU. This is perfect for low VRAM.
llama.cpp commit used: 08737ef720f0510c7ec2aa84d7f70c691073c35d.
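A hedged illustration of the layer-offload flag; the model path is a placeholder, and 32 layers is just a starting point to tune against your VRAM:

# Offload 32 transformer layers to the GPU via cuBLAS; the rest run on the CPU
./main -m ./models/13B/ggml-model-q4_0.bin -p "Hello" -n 128 --n-gpu-layers 32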