Onur Keskin, Ph.D. (keskinonur)
@rishi-singh26
rishi-singh26 / main.dart
Last active February 12, 2024 21:27
How to implement end-to-end encryption using PBKDF in Flutter
// You can read the article I wrote for this setup on medium.com at the link below
// https://medium.com/@rishi_singh/how-to-implement-end-to-end-encryption-using-pbkdf-in-flutter-a5508e7ad93e
import 'dart:math';
import 'dart:typed_data';
import 'package:crypton/crypton.dart';
import 'package:pointycastle/block/aes.dart';
import 'package:pointycastle/digests/sha256.dart';
import 'package:pointycastle/key_derivators/pbkdf2.dart';
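
The preview stops at the imports. The idea the title describes is to stretch the user's password into a symmetric key with a PBKDF before any encryption happens; a minimal sketch of that derivation step, in Python for illustration (the function name and parameter choices are assumptions, not the gist's code):

import hashlib
import os

def derive_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    # PBKDF2-HMAC-SHA256 stretches the password into a 32-byte AES-256 key.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations, dklen=32)

salt = os.urandom(16)  # random per user; stored alongside the ciphertext
key = derive_key("s3cret-passphrase", salt)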
@adrienbrault
adrienbrault / llama2-mac-gpu.sh
Last active April 8, 2025 13:49
Run Llama-2-13B-chat locally on your M1/M2 Mac with GPU inference. Uses 10GB RAM. UPDATE: see https://twitter.com/simonw/status/1691495807319674880?s=20
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with Metal (Apple GPU) support
make clean
LLAMA_METAL=1 make

# Download the quantized weights (the preview cuts off after the export;
# the Hugging Face source below is assumed from the model name)
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"

# Run interactively with GPU offload (flags assumed; adjust to taste)
./main -m ./${MODEL} -ngl 1 -c 2048 --color -i
@catid
catid / gist:533dd0c7d4f3ee8d34a6a905155b72ae
Last active April 22, 2024 04:53
How to quantize a 70B model so it will fit on 2x4090 GPUs
I tried EXL2, AutoAWQ, and SqueezeLLM; they all failed for different reasons (issues opened).
HQQ worked:
I rented a 4x GPU, 1 TB RAM instance ($19/hr) on RunPod with 1024 GB of container and 1024 GB of workspace disk space. I think you only need 2x GPUs with 80 GB VRAM and 512 GB+ of system RAM, so I probably overpaid.
Note that you need to fill in Meta's access form to get the 70B weights.
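
For reference, the HQQ route looks roughly like this in Python: a sketch assuming the hqq package's Hugging Face wrapper, with an illustrative model id and settings rather than the gist's exact script.

from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Llama-2-70b-hf"  # gated: needs Meta's access approval

# 4-bit weights in small groups cut memory roughly 4x with modest quality loss.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)
model.save_quantized("llama2-70b-hqq-4bit")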
@awni
awni / mlx_distributed_deepseek.md
Last active November 5, 2025 17:41
Run DeepSeek R1 or V3 with MLX Distributed

Setup

On every machine in the cluster, install OpenMPI and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm

Next, download the pipeline-parallel run script to the same path on every machine:

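The preview cuts off before the script link. Before launching across machines, it helps to sanity-check the install on a single node; a minimal sketch using mlx-lm's Python API (the model id is illustrative; the full R1/V3 checkpoints are exactly what the distributed setup exists to fit):

from mlx_lm import load, generate

# Load a small quantized model to confirm mlx-lm and Metal work on this node.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
print(generate(model, tokenizer, prompt="Hello from this node", max_tokens=32))
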
@philschmid
philschmid / GEMINI.md
Last active November 5, 2025 12:11
Gemini CLI Plan Mode prompt

Gemini CLI Plan Mode

You are Gemini CLI, an expert AI assistant operating in a special 'Plan Mode'. Your sole purpose is to research, analyze, and create detailed implementation plans. You must operate in a strict read-only capacity.

Gemini CLI's primary goal is to act like a senior engineer: understand the request, investigate the codebase and relevant resources, formulate a robust strategy, and then present a clear, step-by-step plan for approval. You are forbidden from making any modifications. You are also forbidden from implementing the plan.

Core Principles of Plan Mode

  • Strictly Read-Only: You can inspect files, navigate code repositories, evaluate project structure, search the web, and examine documentation.
  • Absolutely No Modifications: You are prohibited from performing any action that alters the state of the system. This includes:

Boost Prompt

A prompt to boost your lazy "do this" prompts. Install with one of the buttons below.

Install in VS Code · Install in VS Code Insiders

Use