@brucevanhorn2
Created February 9, 2025 01:38
LLM on your computer
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # spread layers across available GPU(s) and CPU
    load_in_8bit=True,        # 8-bit quantization to save VRAM; requires the bitsandbytes package
    torch_dtype=torch.float16,
)

# Lighter-weight alternative (Falcon-7B-Instruct, no 8-bit quantization):
# model_name = "tiiuae/falcon-7b-instruct"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     device_map="auto",
#     torch_dtype=torch.float16,
# )

inputs = tokenizer("Explain what a data warehouse is.", return_tensors="pt").to("cuda")
outputs = model.generate(inputs.input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
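
Llama-2 chat checkpoints are tuned on an [INST]-style prompt format, so answers are usually cleaner if the question goes through the tokenizer's chat template instead of being passed as raw text. A minimal sketch, reusing the model and tokenizer above and assuming the checkpoint ships a chat template (the official Llama-2 repos do):

messages = [{"role": "user", "content": "Explain what a data warehouse is."}]
# apply_chat_template wraps the message in the [INST] format the model expects
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))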
requirements.txt:
accelerate==1.2.1
bitsandbytes  # required by load_in_8bit=True above
certifi==2024.12.14
charset-normalizer==3.4.1
colorama==0.4.6
filelock==3.16.1
fsspec==2024.12.0
huggingface-hub==0.27.1
idna==3.10
Jinja2==3.1.5
MarkupSafe==3.0.2
mpmath==1.3.0
networkx==3.4.2
numpy==2.2.1
packaging==24.2
psutil==6.1.1
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.5.2
setuptools==75.8.0
sympy==1.13.1
tokenizers==0.21.0
torch==2.5.1
tqdm==4.67.1
transformers==4.48.0
typing_extensions==4.12.2
urllib3==2.3.0
@brucevanhorn2 (Author)
The commented-out code for the Falcon model requires a decent GPU to run. The Llama-2-13B model that's in there requires a very good one. The script loads it with 8-bit quantization (load_in_8bit=True) to save VRAM, which cuts the weights to roughly 13 GB; in plain float16 they would take about 26 GB. A card like an RTX 4090 (24 GB) or a 4060 Ti with 16 GB runs the quantized model; if your GPU has the full ~26 GB to spare, you can take the quantization out.
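
If even the 8-bit model is too large for your card, a 4-bit load is worth trying before giving up. A minimal sketch (not from the original gist) using transformers' BitsAndBytesConfig, the newer way to request quantization; it still needs bitsandbytes installed:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "meta-llama/Llama-2-13b-chat-hf"
# NF4 4-bit weights with float16 compute: roughly half the 8-bit footprint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)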
