Peter pszemraj

@pszemraj
pszemraj / load_and_ensure_tokens.py
Last active January 17, 2024 02:36
loads a Hugging Face Transformers tokenizer, checks for essential special tokens, adds them if necessary
from transformers import AutoTokenizer
def load_and_ensure_tokens(model_name):
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Essential special tokens with their default values
essential_tokens = {
"pad_token": "<pad>",
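The preview is cut off inside the dictionary, so the rest of the gist is not shown here. As a rough sketch of the "check, then add if necessary" step the description mentions, something like the following might work. The helper name `ensure_special_tokens` is hypothetical, only the `pad_token` default is taken from the preview, and the sketch relies on the tokenizer's `add_special_tokens` method accepting a dict of attribute names to token strings.

```python
def ensure_special_tokens(tokenizer, essential_tokens=None):
    """Add any essential special tokens the tokenizer lacks.

    Hypothetical helper, not the gist's own code. Returns a dict of the
    tokens that were actually added (empty if nothing was missing).
    """
    if essential_tokens is None:
        # Only this default is visible in the truncated preview.
        essential_tokens = {"pad_token": "<pad>"}

    # A special token counts as missing when the attribute is unset (None).
    missing = {
        name: value
        for name, value in essential_tokens.items()
        if getattr(tokenizer, name, None) is None
    }

    if missing:
        # add_special_tokens takes a dict such as {"pad_token": "<pad>"}.
        tokenizer.add_special_tokens(missing)
    return missing
```

If this adds tokens to a tokenizer that is paired with a model, the model's embedding matrix usually needs resizing as well, for example with `model.resize_token_embeddings(len(tokenizer))`.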
@pszemraj
pszemraj / hf_repofolder_watchdog.py
Created January 16, 2024 02:53
upload a folder to Hugging Face Hub and other utils
import argparse
import logging
import time
from datetime import datetime
from pathlib import Path
from typing import Optional
from huggingface_hub import upload_folder
from watchdog.events import PatternMatchingEventHandler
from watchdog.observers import Observer
@pszemraj
pszemraj / textgen_inference_code.py
Created January 6, 2024 23:38
example inference script for beecoder-220M-python
import logging
import random
import time
from pathlib import Path
import fire
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
logging.basicConfig(format="%(levelname)s - %(message)s", level=logging.INFO)
@pszemraj
pszemraj / hf_repofolder_watchdog.py
Created December 12, 2023 01:43
The script is designed to monitor a specified directory for any file system changes (like additions, deletions, or modifications of files and subdirectories) and automatically upload the changes to a specified repository on the Hugging Face Hub.
"""
The script is designed to monitor a specified directory for any file system changes (like additions, deletions, or modifications of files and subdirectories) and automatically upload the changes to a specified repository on the Hugging Face Hub.
pip install huggingface-hub watchdog
"""
import argparse
import logging
import time
from pathlib import Path
@pszemraj
pszemraj / format2alpaca.py
Created December 8, 2023 23:18
quick formatting function given instruction/input/response cols -> make 'text' col
import os
import random
from datasets import load_dataset
def format_dataset(example):
"""Formats the dataset example into a single 'text' field."""
# Add input only if it is longer than 2 characters
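The preview stops at the comment above. A minimal sketch of how such a formatter could continue, assuming the columns are named `instruction`, `input`, and `response` as in the gist description; the `### Instruction:` / `### Input:` / `### Response:` headers are an assumed Alpaca-style layout, not taken from the gist.

```python
def format_dataset(example):
    """Formats the dataset example into a single 'text' field."""
    instruction = example["instruction"].strip()
    # Treat a missing or None input the same as an empty string.
    input_text = (example.get("input") or "").strip()
    response = example["response"].strip()

    parts = [f"### Instruction:\n{instruction}"]
    # Add input only if it is longer than 2 characters
    if len(input_text) > 2:
        parts.append(f"### Input:\n{input_text}")
    parts.append(f"### Response:\n{response}")

    # Returning a dict lets datasets' map() add 'text' as a new column.
    return {"text": "\n\n".join(parts)}
```

With a loaded dataset this would typically be applied as `dataset = dataset.map(format_dataset)`.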
@pszemraj
pszemraj / tf32_activate.py
Created December 6, 2023 04:47
sort of manual - Check if the GPU supports NVIDIA Ampere or later and enable TF32 in PyTorch if it does.
import logging
import subprocess
import torch
def check_ampere_gpu():
"""Check if the GPU supports NVIDIA Ampere or later and enable TF32 in PyTorch if it does."""
# Check if CUDA is available
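The rest of the function is not shown in the preview. The gist imports `subprocess`, so it presumably inspects the GPU another way; the sketch below swaps in `torch.cuda.get_device_capability()` instead, on the assumption that a compute capability major version of 8 or higher means Ampere or newer. The function names `supports_tf32` and `enable_tf32_if_supported` are hypothetical.

```python
def supports_tf32(major: int) -> bool:
    """Ampere and newer GPUs report compute capability major >= 8."""
    return major >= 8


def enable_tf32_if_supported() -> bool:
    """Enable TF32 matmul/cuDNN kernels when the current GPU supports them.

    Returns True if TF32 was enabled, False otherwise.
    """
    import torch  # imported lazily so supports_tf32 works without PyTorch

    if not torch.cuda.is_available():
        return False

    major, _minor = torch.cuda.get_device_capability()
    if not supports_tf32(major):
        return False

    # These two flags control TF32 use in matmuls and cuDNN convolutions.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True
```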
@pszemraj
pszemraj / test_synthsumm.py
Created December 6, 2023 03:16
test out synthsumm summarization models via the free inference api
import os
import time
import requests
class Timer:
"""Basic timer utility."""
def __enter__(self):
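Only the start of the class is visible. A context-manager timer of this kind is commonly written along the following lines; the attribute names `start` and `elapsed` are assumptions rather than the gist's own.

```python
import time


class Timer:
    """Basic timer utility."""

    def __enter__(self):
        # perf_counter is monotonic, which suits measuring short intervals.
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        # Returning False lets any exception from the block propagate.
        return False
```

Wrapping a request in `with Timer() as t:` and reading `t.elapsed` afterwards gives the wall-clock duration in seconds.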
@pszemraj
pszemraj / ubuntu_util_pkgs.md
Created November 29, 2023 22:53
some ubuntu packages helpful for CPU things related to ML

Useful misc installs


Kernel and Low-Level Tools

  1. Microcode Update: Keeping your CPU microcode up to date can improve performance and security. On AMD systems, install the microcode package by running:

sudo apt install amd64-microcode

@pszemraj
pszemraj / query_wellformedness_score.py
Created November 29, 2023 18:50
inference with a model trained on query well-formedness
"""
inference with a model trained on query well-formedness
https://huggingface.co/Ashishkr/query_wellformedness_score
pip install transformers accelerate optimum -q
"""
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
@pszemraj
pszemraj / run_summarization_langchain.py
Created November 24, 2023 17:17
summarization with langchain + openai
"""
run_summarization_langchain.py - Generate summaries using langchain + LLMs
For usage details, run `python run_summarization_langchain.py --help`; fire will print them.
Notes:
- you need to have OPENAI_API_KEY set as an environment variable (easiest way is export OPENAI_API_KEY=memes123)
- install the dependencies using the requirements.txt file or below
pip install fire langchain clean-text tqdm tiktoken