wassname (Michael J Clark)
wassname / hf_datasets_cheatsheet.md
Last active March 21, 2025 06:57
huggingface datasets cheatsheet for text functional operations

What functional operations are there?

  • Dataset methods:
    • columns: remove_columns, rename_columns, select_columns
    • map
      • reduce: in batched mode it can change the batch size or have side effects, letting us use it as a reduce
    • filter
    • select: this is how you slice
      • take: like head
  • shuffle
"""
This is a simple way to evaluate if a model prefers the accepted or rejected completions of a prompt.
We look at the perplexity of the chosen and rejected completions of a prompt.
Example dataset: https://huggingface.co/datasets/wassname/genies_preferences/viewer/illegal_dont_help?views[]=illegal_dont_help_train&views[]=illegal_dont_help_test
@url: https://gist.github.com/wassname/04f0c50a68054f0323f62b0da418daec
"""
import copy
wassname / docustore_aten_asteroid_property_rights.md
Created March 18, 2025 10:36
docustore_aten_asteroid_property_rights.md

asteroid_claim

I, wassname, also known as Michael J Clark of Perth hereby establish a formal claim to the following celestial bodies. This claim is established with explicit intent toward future resource utilization. This declaration constitutes the formal establishment of property interest

This claim encompasses the entirety of the bodies, including all constituent materials, the spatial volume within 50 km of each body's center of mass, and any natural satellites that may be discovered in the future. The claim extends to all mineral, volatile, and material resources contained within this boundary.

Legal Framework Anticipation

While acknowledging current limitations in international space law regarding private property claims on celestial bodies, this declaration is established in anticipation of evolving legal frameworks that will eventually recognize early, persistent, and well-documented claims as humanity expands into the solar system. This claim explicitly respects scientific research access and non-interf

wassname / loguru_cheatsheet.py
Created March 10, 2025 01:30
loguru cheat sheet
# how to format it
# all variables are listed in the "record dict": https://loguru.readthedocs.io/en/stable/api/logger.html
fmt = "<green>{time:YYYY-MM-DD HH:mm:ss.SSS Z}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>"
fmt = "<green>{time:HH:mm:ss}</green> | <level>{level: <4}</level> | {process.id} | <cyan>{name: <4}</cyan>:<cyan>{function: <4}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>"
# how to make it work in jupyter
import sys
from loguru import logger
logger.remove()  # drop the default stderr handler
logger.add(sys.stdout, level="INFO", colorize=True)
# how to make it work with tqdm
wassname / load_md_fm_j2_prompt.py
Created March 4, 2025 03:03
IMO the nicest prompt format is prompt.md.j2. Here we make the messages explicit, and the markdown and jinja syntax obvious
def split_frontmatter(fm_md_split: str):
    """Load prompt in md.jinja2 format
    In this format we have multiple frontmatters and content sections, each defining a message. The idea here is to use jinja formatting in a prompt.md.jinja file to make the markdown and jinja formatting obvious
    e.g.
    ---
    role: system
    ---
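
The preview above is cut off before the function body. As a rough illustration of the format it describes (repeated frontmatter and content sections, each defining one message), here is a minimal sketch using only the standard library. It is not the author's implementation: `split_messages_sketch` is a hypothetical name, the frontmatter is assumed to contain only simple `key: value` lines (no real YAML parsing), and jinja rendering is left out.

```python
import re


def split_messages_sketch(text: str) -> list[dict]:
    """Hypothetical helper: split repeated frontmatter/content sections into messages."""
    # Splitting on a pattern with one capture group yields:
    # [preamble, frontmatter_1, body_1, frontmatter_2, body_2, ...]
    parts = re.split(
        r"^---[ \t]*\n(.*?)\n---[ \t]*$",
        text,
        flags=re.MULTILINE | re.DOTALL,
    )
    messages = []
    for frontmatter, body in zip(parts[1::2], parts[2::2]):
        meta = {}
        for line in frontmatter.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                meta[key.strip()] = value.strip()
        messages.append({**meta, "content": body.strip()})
    return messages


example = """---
role: system
---
You are helpful.
---
role: user
---
Hello {{ name }}
"""
messages = split_messages_sketch(example)
```

A body that itself contains a pair of `---` lines (for example markdown horizontal rules) would confuse this sketch, so a real loader would need a stricter delimiter rule.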
wassname / fix_peft_local_invalid_base_model.py
Last active March 1, 2025 00:31
When a peft adapter has an invalid base model, how do I fix it?
peft_model_id = 'v2ray/GPT4chan-8B-QLoRA'
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig, LlamaConfig
from peft import PeftConfig, PeftModelForCausalLM
peft_config = PeftConfig.from_pretrained(peft_model_id)
# This adapter points to a local base model, which we don't have, so let's redirect it to a public version
peft_config.base_model_name_or_path="unsloth/Llama-3.1-8B"
# now load the modified config in 8bit

Design a physics-based model specifically for an LNG (Liquefied Natural Gas) compressor, suitable for implementation in a Physics-Informed Neural Network (PINN).

Given inputs:

  • Inlet temperature (t_in)
  • Inlet pressure (p_in)
  • Inlet flow rate (flow_in)
  • Inlet Guide Vane position (IGV%)

Required outputs to predict:

  • Outlet temperature (t_out)
wassname / transformers_bf16.py
Last active January 27, 2025 11:26
Using 16 bit base weights in huggingface transformers
"""
You can save memory by converting your model to bf16, but ONLY if you use a special optimiser. Otherwise small weight updates are rounded away and you get worse results.
This is how to use a brainfloat16 base model in huggingface transformers
@author:wassname
@url: https://gist.github.com/wassname/183153f9245b37ae6d08b3c3c4033bda
Usage:
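
To see why small updates get rounded away in bf16, here is a minimal standard-library sketch. It does not use torch or the gist's optimiser; `to_bf16` is a hypothetical helper that simulates bfloat16 by rounding a float32 bit pattern to its top 16 bits (round to nearest, ties to even), and the weight and update values are made up for illustration.

```python
import struct


def to_bf16(x: float) -> float:
    """Hypothetical helper: round x to the nearest bfloat16 value, returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that bfloat16 discards
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


update = 1e-3           # a small per-step weight change (illustrative)
w_bf16 = to_bf16(1.0)   # weight stored in simulated bf16
w_full = 1.0            # the same weight kept in higher precision

for _ in range(100):
    # Near 1.0 the spacing between bf16 values is 2**-7 (about 0.0078),
    # so adding 0.001 rounds straight back to the old value.
    w_bf16 = to_bf16(w_bf16 + update)
    w_full = w_full + update
```

After 100 steps the simulated bf16 weight has not moved, while the higher-precision copy has drifted to roughly 1.1. That lost signal is the problem a bf16-aware optimiser has to work around.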
wassname / hf_tf_no_print_progress.py
Last active January 27, 2025 02:08
Stop transformers from printing progress, while still logging to wandb, tensorboard, csv, etc.
from transformers.trainer_callback import ProgressCallback
class ProgressCallbackNoPrint(ProgressCallback):
    """ProgressCallback that doesn't print anything
    Usage:
    # at top of file, after the transformers import and before setting up the trainer, monkey patch the default progress callback
    import transformers
    transformers.DEFAULT_PROGRESS_CALLBACK = ProgressCallbackNoPrint
wassname / reddit_thread2md.py
Created December 26, 2024 00:46
Format a Reddit thread into markdown suitable for an LLM
# from https://github.dev/JosefAlbers/rd2md
import textwrap
from datetime import datetime
def format_flair(obj):
    if obj.author_flair_text is not None:
        return f" *{obj.author_flair_text}*"
    return ""