Sam Shleifer (sshleifer)
@sshleifer
sshleifer / optim_cmds.md
Last active July 22, 2021 23:39
gshard optimizer experiment cmds

Setup

  • git clone [email protected]:fairinternal/fairseq-py.git && cd fairseq-py && git checkout stable-emb
  • if you don't have the fairseq conda env, follow these instructions
  • pip install numpy==1.20 (optional, but some people needed this)
  • pip install fairscale (should be > 0.3.7, as of writing)
  • on FAIR cluster: pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda110 -U
  • OR on AWS: pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda111 -U

Common Logic for all commands

Edit this as needed

@sshleifer
sshleifer / sharded_data_doc.md
Last active April 15, 2021 09:11
Construct+Use sharded dataset in fairseq

Constructing a sharded dataset

  • cat all your raw text into one huge file in /scratch/
  • run your favorite bpe on that file (20mins for 160GB with 20 workers), writing the result to /scratch.

Then filter out extra blank lines (the second grep strips the -- group separators that grep -A1 inserts):

grep -A1 . /scratch/rc_train_big.bpe | grep -v "^--$" > /scratch/rc.filtered.train.bpe
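The pipeline above keeps every non-empty line plus at most one blank line after it, collapsing longer runs of blank lines. A rough pure-Python equivalent, as a sketch (the function name is an assumption, not part of the gist):

```python
def collapse_blank_lines(lines):
    """Keep non-empty lines, and at most one blank line after each one.

    Mirrors: grep -A1 . in.bpe | grep -v "^--$" > out.bpe
    """
    prev_nonempty = False
    out = []
    for line in lines:
        if line.strip():
            out.append(line)
            prev_nonempty = True
        elif prev_nonempty:
            # first blank line after a non-empty line survives
            out.append(line)
            prev_nonempty = False
        # further consecutive blank lines are dropped
    return out
```

In practice grep is much faster on a 160GB file; this is only to document the behavior.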
@sshleifer
sshleifer / anki_setup.md
Created March 6, 2021 19:27
Anki Setup
@sshleifer
sshleifer / time_dbart_generate.py
Created October 26, 2020 17:29
Timing Generate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import time
from tqdm import tqdm
from pathlib import Path
import pandas as pd
models = ['sshleifer/distilbart-cnn-12-3',
'sshleifer/distilbart-cnn-12-6',
'sshleifer/distilbart-cnn-6-6',
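The preview above cuts off mid-list, but the timing pattern itself is simple: call the function several times and keep the best wall-clock time. A minimal, model-agnostic sketch (this helper is an assumption; `sorted` stands in for the real `model.generate` call):

```python
import time

def time_fn(fn, *args, n_trials=3, **kwargs):
    """Run fn n_trials times; return (last result, best seconds per call)."""
    timings = []
    for _ in range(n_trials):
        start = time.time()
        result = fn(*args, **kwargs)
        timings.append(time.time() - start)
    return result, min(timings)

# Stand-in workload; with transformers this would be
# time_fn(model.generate, **tokenized_batch)
out, secs = time_fn(sorted, range(100, 0, -1))
```

Taking the minimum over trials reduces noise from warmup and other processes.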
@sshleifer
sshleifer / latex_style.md
Created October 19, 2020 15:19
Sasha's latex style rules

Avoid:

  • [!h] for figures/tables.
  • two datasets in one plot
  • NameError: introducing terms that haven't been defined.
@sshleifer
sshleifer / download_summ_data.py
Created October 7, 2020 19:19
Fetching summarization datasets
from pathlib import Path
import fire
from tqdm import tqdm
DS_TO_KEY = {
'gigaword': ('document', 'summary'),
'xsum': ('document', 'summary'),
'aeslc': ('email_body', 'subject_line'),
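Given `DS_TO_KEY`, the download script presumably writes one `source`/`target` line pair per example. A hedged sketch of that writing step (the `write_split` helper and its file layout are assumptions, not the gist's actual code):

```python
from pathlib import Path

DS_TO_KEY = {
    'gigaword': ('document', 'summary'),
    'xsum': ('document', 'summary'),
    'aeslc': ('email_body', 'subject_line'),
}

def write_split(records, src_key, tgt_key, save_dir, split='train'):
    """Write one line per example to {split}.source / {split}.target."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    with open(save_dir / f'{split}.source', 'w') as src_f, \
         open(save_dir / f'{split}.target', 'w') as tgt_f:
        for rec in records:
            # newlines inside a document would break the line-per-example format
            src_f.write(rec[src_key].replace('\n', ' ') + '\n')
            tgt_f.write(rec[tgt_key].replace('\n', ' ') + '\n')
```

The real gist feeds this from `datasets`-style examples keyed by the tuples in `DS_TO_KEY`.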

How BartConfig controls when LayerNorm is applied

6 groups of models inherit from BartForConditionalGeneration. The major differences between them are:

  • pretraining objective & data
  • finetuning objective & data
  • number of layers and dimension of each layer
  • when layernorm is applied

This document focuses on layernorm timing.
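As a rough illustration of the two placements such a config flag toggles in a residual block: `normalize_before` is fairseq's name for the flag, and this pure-Python sketch is a stand-in, not the transformers implementation.

```python
def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def residual_block(x, sublayer, normalize_before):
    if normalize_before:
        # pre-norm: normalize, apply sublayer, then add the residual
        return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
    # post-norm (original Transformer): sublayer, residual, then normalize
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])
```

Pre-norm leaves the residual path un-normalized, which is why the two variants produce different activations from identical weights.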

export b="s3://models.huggingface.co/bert"
stas_to_fb () {
src=$1
shift
aws s3 sync $b/stas/$src $b/facebook/$src $@
}
stas_to_allenai () {
src=$1
shift
aws s3 sync $b/stas/$src $b/allenai/$src $@
}
@sshleifer
sshleifer / dynb.md
Last active September 9, 2020 19:25

Problem:

  • In WMT datasets, there is wide variation in the length of examples. Some are one sentence. Some are 10 sentences.
  • The max batch size that can fit on a v100 is roughly (4, 512)
  • You end up with lots of batches of shape (4, 12) or (4, small_int) that don't fully utilize the GPU.

Dynamic batch size: organize batches to hold roughly 4*512 = 2048 tokens, so one batch might be shaped (4, 512) and another (32, 64).
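A minimal sketch of this idea (not fairseq's actual implementation): sort examples by length, then greedily pack them so the padded batch never exceeds `max_tokens`.

```python
def batch_by_tokens(lengths, max_tokens=2048):
    """Group example indices so padded size (n_examples * max_len) <= max_tokens."""
    # sorting by length keeps similar lengths together, minimizing padding
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, cur = [], []
    for i in order:
        max_len = max(lengths[j] for j in cur + [i])
        if cur and (len(cur) + 1) * max_len > max_tokens:
            batches.append(cur)
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    return batches
```

With 4 examples of length 512 and 32 of length 64, this yields exactly the (32, 64) and (4, 512) batches described above.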

Details of Fairseq Solution: