Stas Bekman (stas00) - GitHub Gists
stas00 / calc_transformer_params.py
Created November 22, 2023 01:15 — forked from Quentin-Anthony/calc_transformer_params.py
Transformer Parameter Count
import argparse
import math
# Helper function to pretty-print parameter counts
def convert_params(params):
    if params == 0:
        return "0"
    size_name = ("", "K", "M", "B", "T", "P", "E", "Z", "Y")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    # (assumed completion of the truncated preview: round and format the value)
    s = round(params / p, 2)
    return "%s %s" % (s, size_name[i])
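
# (sketch added for illustration, not part of the original gist): the script's goal is to
# count transformer parameters; for a GPT-style decoder the usual back-of-the-envelope
# approximation is vocab_size*h for the embeddings plus 12*h^2 + 13*h per layer
def approx_transformer_params(num_layers, hidden_size, vocab_size):
    embeddings = vocab_size * hidden_size
    per_layer = 12 * hidden_size**2 + 13 * hidden_size  # attention + MLP weights, biases, layernorms
    return embeddings + num_layers * per_layer

# e.g. a 40-layer, 5120-wide, 50k-vocab model comes out to roughly 12.8B parameters:
# print(convert_params(approx_transformer_params(40, 5120, 50257)))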
stas00 / benchmark_dist_init.py
Last active November 20, 2023 20:33
Profiling `init_process_group('nccl')`
# run as:
# python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d benchmark_dist_init.py
import torch
import os
import cProfile
import torch.distributed as dist
import timeit
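
# (sketch added for illustration, not part of the original gist preview): the most direct
# way to profile the init is to time dist.init_process_group() on every rank
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

start = timeit.default_timer()
dist.init_process_group("nccl")
print(f"rank {dist.get_rank()}: init_process_group('nccl') took {timeit.default_timer() - start:.2f}s")

# cProfile.run('dist.init_process_group("nccl")') would give a per-call breakdown instead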

Connect via SSH to a Slurm compute job that runs as an Enroot container

Being able to SSH directly into a compute job lets you use all of your remote development tools, such as your IDE's debugger (VSCode, PyCharm, ...), for GPU jobs as well.

  • Slurm: the scheduling system that many HPC clusters use
  • Enroot: a container system similar to Docker, from NVIDIA, commonly used for GPU jobs

General problem:

stas00 / sft_trainer.py
Created October 13, 2023 17:53 — forked from lewtun/sft_trainer.py
Fine-tuning Mistral 7B with TRL & DeepSpeed ZeRO-3
# This is a modified version of TRL's `SFTTrainer` example (https://github.com/huggingface/trl/blob/main/examples/scripts/sft_trainer.py),
# adapted to run with DeepSpeed ZeRO-3 and Mistral-7B-v0.1. The settings below were run on 1 node of 8 x A100 (80GB) GPUs.
#
# Usage:
# - Install the latest transformers & accelerate versions: `pip install -U transformers accelerate`
# - Install deepspeed: `pip install deepspeed==0.9.5`
# - Install TRL from main: `pip install git+https://github.com/huggingface/trl.git`
# - Clone the repo: `git clone https://github.com/huggingface/trl.git`
# - Copy this Gist into trl/examples/scripts
# - Run from the root of the trl repo with: `accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml --gradient_accumulation_steps 8 examples/scripts/sft_trainer.py`

MMAP IO appears to be leaking memory when in fact it is not

This write-up demonstrates that while mmap'ed file IO looks like it is leaking memory, it actually is not.

Emulating a computer with just 1GB of memory

Since we don't want to kill our computer while debugging memory issues, we are going to emulate a computer with just 1GB of memory and no swap. Unless such a computer has protection against programs using more memory than it has, most of the time it will start thrashing and eventually crash.
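
A minimal sketch of the kind of demonstration the write-up describes (my own illustration, not the write-up's actual code; the file path is hypothetical): mmap a large file, touch its pages, and watch RSS grow. The growth is reclaimable page cache, not a leak.

import mmap

def rss_mb():
    # current resident set size of this process in MB (Linux-specific)
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024

path = "large_file.bin"  # hypothetical multi-GB test file
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
    print(f"RSS before read: {rss_mb():.0f} MB")
    checksum = sum(mm[i] for i in range(0, len(mm), 4096))  # touch one byte per page
    print(f"RSS after read:  {rss_mb():.0f} MB")
# RSS jumps because the mapped pages sit in the page cache, but the kernel can drop
# them at any moment under memory pressure; nothing is actually leaked.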

def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.
    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.
    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization is identical, and we do not need to keep track of running averages
    of any sort.
    """
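
The preview cuts off before the function body; a minimal numpy sketch of what the forward pass might look like (the exact cache contents are my assumption, not the original's):

import numpy as np

def layernorm_forward_sketch(x, gamma, beta, ln_param):
    # normalize each data point (row) of x to zero mean and unit variance,
    # then scale and shift with the learned gamma and beta
    eps = ln_param.get("eps", 1e-5)
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta
    cache = (x_hat, gamma, var, eps)  # whatever the backward pass will need
    return out, cache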
stas00 / mp4_sharp_bug.py
Last active February 24, 2022 21:19 — forked from jeffra/mp4_sharp_bug.py
MP4 SHARP bug (edited to support the modern launcher, with some status printing added to make it easier to see what's going on)
import torch
import torch.distributed as dist
import os
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend='nccl')
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
#!/usr/bin/env python
# this program takes command-line args as its input and reformats them into a nice 80-char width
# that can easily be added to docs or forums.
#
# Example:
#
# cmd-wrap CUDA_VISIBLE_DEVICES=0 python ./examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 --max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" --source_prefix "translate English to Romanian: " --warmup_steps 50 --max_train_samples 2001 --dataloader_num_workers 2
#
# CUDA_VISIBLE_DEVICES=0 python \
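
The script body isn't shown in the preview; a minimal sketch of how such a wrapper could work (the function names and the exact wrapping policy are my assumptions):

import shlex
import sys

def wrap_cmd(args, width=80):
    # quote each argument, then greedily pack the arguments into lines of at most
    # `width` characters, ending every line but the last with a backslash continuation
    words = [shlex.quote(a) for a in args]
    lines, current = [], ""
    for w in words:
        candidate = f"{current} {w}".strip()
        if current and len(candidate) + 2 > width:  # +2 accounts for the trailing " \"
            lines.append(current)
            current = w
        else:
            current = candidate
    lines.append(current)
    return " \\\n".join(lines)

if __name__ == "__main__":
    print(wrap_cmd(sys.argv[1:]))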
stas00 / tb-rename-events.py
Created October 15, 2021 03:16
tensorboard rename event tags (based on https://stackoverflow.com/a/60080531/9201239)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# this script renames event names in tensorboard log files
# it does the rename in place (so make backups!)
#
# example:
#
# find . -name "*.tfevents*" -exec tb-rename-events.py {} "iteration-time" "iteration-time/iteration-time" \;
#
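
The preview stops at the header comments; here is a rough sketch of what the rename loop could look like (an assumed illustration based on the linked StackOverflow approach; note that the real script renames in place, while this sketch writes a new file):

import sys
import tensorflow as tf

def rename_tag(in_path, out_path, old_tag, new_tag):
    # walk every event record, rewrite matching summary tags, and write the result
    # out as a new event file
    with tf.io.TFRecordWriter(out_path) as writer:
        for event in tf.compat.v1.train.summary_iterator(in_path):
            for value in event.summary.value:
                if value.tag == old_tag:
                    value.tag = new_tag
            writer.write(event.SerializeToString())

if __name__ == "__main__":
    rename_tag(sys.argv[1], sys.argv[1] + ".renamed", sys.argv[2], sys.argv[3])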
# benchmark datasets' to_json:
# - normal
# - multiproc version
# - sharded multiproc version
import time
from datasets import load_dataset
import pathlib
import os
from pathlib import Path
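
The benchmark body isn't in the preview; a minimal sketch of how the three variants might be timed (the dataset, output paths, process counts, and the use of to_json's num_proc argument are all my assumptions):

import time
from concurrent.futures import ProcessPoolExecutor
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")  # hypothetical dataset

def timed(name, fn):
    # run fn once and report wall-clock time
    t0 = time.time()
    fn()
    print(f"{name}: {time.time() - t0:.1f}s")

def export_shard(idx, num_shards=8):
    # each worker re-shards the dataset and writes its own file
    ds.shard(num_shards=num_shards, index=idx).to_json(f"out/shard-{idx}.jsonl")

def sharded_multiproc(num_shards=8):
    with ProcessPoolExecutor(num_shards) as pool:
        list(pool.map(export_shard, range(num_shards)))

timed("normal", lambda: ds.to_json("out/normal.jsonl"))                    # single process
timed("multiproc", lambda: ds.to_json("out/multiproc.jsonl", num_proc=8))  # library-level parallelism
timed("sharded multiproc", sharded_multiproc)                              # shard + one process per shard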