4090 on CUDA 12.1
seed=64
Evaluated using evaluate.py:
python -m evaluate --model_name_or_path huggyllama/llama-7b --tokenizer_model_name_or_path huggyllama/llama-7b --bf16 --overrun_countermeasures False --prompt_style bare
[14.614647  14.526281  14.438574  14.351521  14.265114  14.179349
 14.094221  14.009725  13.925854  13.842604  13.759968  13.677942
 13.596522  13.515701  13.435474  13.355838  13.276786  13.198313
 13.120416  13.043088  12.966325  12.890123  12.814478  12.739382
 12.664833  12.590827  12.517358  12.444422  12.372013  12.300129
 12.2287655 12.157917  12.087579  12.017748  11.94842   11.879589
 11.8112545 11.743409  11.67605   11.609174  11.542775  11.4768505
 11.411397  11.346409  11.281884  11.217818  11.154207  11.091047
 11.028336  10.966067  10.90424   10.842849  10.781891  10.721362
 10.661261  10.601581  10.542321  10.483477  10.425045  10.367022]
These instructions are a work in progress (I haven't managed it yet; I'm just writing down what I do as I go along).
Install HF Code Autocomplete VSCode plugin.
We are not going to set an API token. We are going to specify an API endpoint.
We will try to deploy that API ourselves, to use our own GPU to provide the code assistance.
We will use bigcode/starcoder, a 15.5B-param model.
We will use NF4 4-bit quantization to fit this into 10787MiB of VRAM.
Unquantized it would require 23767MiB of VRAM, which still fits on a 4090's 24564MiB!
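A minimal sketch of loading it that way (assuming transformers, accelerate and bitsandbytes are installed; the bfloat16 compute dtype is my assumption, not part of the plan above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization, so the 15.5B params fit in consumer VRAM
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained('bigcode/starcoder')
model = AutoModelForCausalLM.from_pretrained(
    'bigcode/starcoder',
    quantization_config=quant_config,
    device_map='auto',  # let accelerate place the quantized weights on the GPU
)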
from dataclasses import dataclass
from typing import TypedDict

import transformers


class ExtractedCriticSample(TypedDict):
    prompt: str
    continuation: str
    rating: int


@dataclass
class DataCollatorForCriticLM(object):
    tokenizer: transformers.PreTrainedTokenizer
    prompt_max_len: int
    continuation_max_len: int
{
    "labeler": "e90a38f3-3135-4465-87af-3e6322e3d772",
    "timestamp": "2022-07-17T16:56:51.323252",
    "generation": null,
    "is_quality_control_question": false,
    "is_initial_screening_question": false,
    "question": {
        "problem": "How many positive two-digit integers leave a remainder of 2 when divided by 8?",
        "ground_truth_answer": "12"
Let's say we're trying to load a LLaMA model via AutoModelForCausalLM.from_pretrained
with 4-bit quantization, in order to run inference with it:
python -m generate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, LlamaTokenizerFast, LlamaForCausalLM
import transformers
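Continuing that, a minimal sketch of the 4-bit load and a quick test generation (the NF4 quant type and bfloat16 compute dtype are my assumptions here, not something the imports above fix):

model_name = 'huggyllama/llama-7b'

# 4-bit quantization config (bitsandbytes)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer: LlamaTokenizerFast = AutoTokenizer.from_pretrained(model_name)
model: LlamaForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map='auto',
)

# quick sanity-check generation
inputs = tokenizer('The quick brown fox', return_tensors='pt').to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))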
nvidia-smi said this required 11181MiB, at least when training on the prompt sequence lengths that occur early in the alpaca dataset (~337-token prompts).
You can get this down to about 10.9GB if you modify qlora.py to run torch.cuda.empty_cache()
after PEFT has been applied to your loaded model and before training begins.
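A sketch of where that call would sit (the surrounding names are illustrative; qlora.py's actual structure may differ):

import torch
from peft import LoraConfig, get_peft_model

# ... base model loaded in 4-bit, lora_config built as usual ...
model = get_peft_model(model, lora_config)  # PEFT adapters applied

# release allocator caches left over from loading/wrapping the model,
# before the trainer starts claiming VRAM for activations and optimizer state
torch.cuda.empty_cache()

# trainer.train() follows from here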
All instructions are written assuming your command-line shell is bash.
Clone repository:
# for i, x in model.named_parameters():
#     print(i)
transformer.word_embeddings.weight
transformer.h.0.ln_attn.weight
transformer.h.0.ln_attn.bias
transformer.h.0.ln_mlp.weight
transformer.h.0.ln_mlp.bias
transformer.h.0.self_attention.query_key_value.weight
transformer.h.0.self_attention.dense.weight
transformer.h.0.mlp.dense_h_to_4h.weight
For CUDA 12, see Installing CUDA 12.1.1 + PyTorch nightly + Python 3.10 on Ubuntu 22.10 for how to install Nvidia driver 530, gcc 12 and CUDA 12.1.1 libraries.
If you want CUDA 11.8, you can use the latest Nvidia driver from the Production branch, 525, with gcc 11.
Activate your conda environment, if you haven't done so already.
CUDA 11:
Make sure gcc 11 is the default gcc for your OS, or select gcc 11 explicitly.
CUDA 12:
Make sure gcc 12 is the default gcc for your OS, or select gcc 12 explicitly.
Check that CUDA_DIR below points to the CUDA installation you wish to use.
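For example (the paths here are assumptions; adjust them to wherever your compilers and toolkit actually live):

export CC=/usr/bin/gcc-12      # or gcc-11 if you went the CUDA 11 route
export CXX=/usr/bin/g++-12
CUDA_DIR=/usr/local/cuda-12.1  # the toolkit you want to build against
export PATH="$CUDA_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_DIR/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"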