Sam Shleifer (sshleifer)
@sshleifer
sshleifer / finetune_pegasus_xsum.sh
Last active September 8, 2020 21:19
took 25hr
python finetune.py \
--task summarization \
--learning_rate=3e-4 \
--do_train \
--do_predict \
--val_check_interval 0.25 --n_val 1000 \
--data_dir xsum \
--max_source_length 512 --max_target_length=56 \
--freeze_embeds \
--model_name_or_path google/pegasus-large \
@sshleifer
sshleifer / mbart_lb.md
Last active September 2, 2020 15:13
Experiment results distilling mbart-large-en-ro and finetuning mbart-large-cc-25
for file in */*bleu.json
do
    echo "$file:"
    sed -n '/^\s*$/!{p;q}' "$file"
    echo "------"
done
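For reference, a rough Python equivalent of the loop above (assuming each */*bleu.json is a small JSON file; the exact schema is not shown in the preview):

# Rough Python equivalent of the shell loop above.
# Assumes each */*bleu.json is a small JSON file (exact schema not shown here).
import glob
import json

for path in sorted(glob.glob("*/*bleu.json")):
    with open(path) as f:
        scores = json.load(f)
    print(f"{path}:")
    print(scores)
    print("------")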

en-ro test BLEU (distil-mbart unless otherwise specified, before post-processing).

@sshleifer
sshleifer / pegasus.png
Last active August 21, 2020 15:10
Pegasus Thumbnail
@sshleifer
sshleifer / lang_tag_logic.md
Created August 19, 2020 01:00
Language Tagging Process
@sshleifer
sshleifer / marian_constituents.py
Created August 18, 2020 15:37
Marian Multilingual Groups
# three letter code -> (group/language name, {constituents...})
# If this language is on the target side, the constituents can be used as target language codes.
# If the language is on the source side, they are supported natively without special codes.
{'aav': ('Austro-Asiatic languages',
         {'hoc', 'hoc_Latn', 'kha', 'khm', 'khm_Latn', 'mnw', 'vie', 'vie_Hani'}),
 'afa': ('Afro-Asiatic languages',
         {'acm', 'afb', 'amh', 'apc', 'ara', 'arq', 'ary', 'arz', 'hau_Latn', 'heb', 'kab', 'mlt', 'rif_Latn', 'shy_Latn', 'som', 'thv', 'tir'}),
 'afr': ('Afrikaans', {'afr'}),
 'alv': ('Atlantic-Congo languages',
         {'ewe', 'fuc', 'fuv', 'ibo', 'kin', 'lin', 'lug', 'nya', 'run', 'sag', 'sna', 'swh', 'toi_Latn', 'tso', 'umb', 'wol', 'xho', 'yor', 'zul'}),
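A minimal usage sketch for the mapping above (the gist shows it as a bare dict literal; the name GROUPS and the helper target_codes below are illustrative assumptions, not part of the original):

# Hypothetical usage of the group mapping shown above.
# GROUPS holds only two entries here for brevity.
GROUPS = {
    'aav': ('Austro-Asiatic languages',
            {'hoc', 'hoc_Latn', 'kha', 'khm', 'khm_Latn', 'mnw', 'vie', 'vie_Hani'}),
    'afr': ('Afrikaans', {'afr'}),
}

def target_codes(group_code):
    """Codes usable as target language codes when this group is on the target side."""
    name, constituents = GROUPS[group_code]
    return sorted(constituents)

print(target_codes('aav'))  # ['hoc', 'hoc_Latn', 'kha', ...]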
@sshleifer
sshleifer / fairseq_model_inputs.md
Last active August 19, 2020 18:10
breakpoint at /home/shleifer/fairseq/fairseq/tasks/fairseq_task.py(385)train_step()

(first, wget fairseq_wmt_enro.tgz from s3)

During training, fairseq passes mbart dynamically sized batches (up to 128 tokens) in a dict called sample, with the following relevant keys:

  • target (our labels): no bos, ends with [2, tgt_lang_code]
  • net_input.src_tokens (our input_ids): ends with [2, 250004]
  • net_input.prev_output_tokens (our decoder_input_ids): starts with 250020 and ends with 2. This is the "shift_tokens_right" version of target (see the sketch below).
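
A minimal sketch of that shift, using plain Python lists and ignoring padding (fairseq does this on tensors inside its collater):

# Illustrative only: rotate the final token (the target language code) to the front,
# so a target ending in [..., 2, tgt_lang_code] becomes a decoder input that
# starts with tgt_lang_code and ends with 2.
def shift_tokens_right(target):
    return target[-1:] + target[:-1]

target = [9, 7, 4, 2, 250020]           # ends with [2, tgt_lang_code]
print(shift_tokens_right(target))        # [250020, 9, 7, 4, 2]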

Here are the logs from my breakpoint:

@sshleifer
sshleifer / s3_wmt.sh
Created August 11, 2020 15:31
s3 translation dataset upload workflow
tar -czvf wmt16_en_ru.tgz wmt16_en_ru
# wmt16_en_ru/
# wmt16_en_ru/train.source
# wmt16_en_ru/train.target
# wmt16_en_ru/test.target
# wmt16_en_ru/test.source
# wmt16_en_ru/val.source
# wmt16_en_ru/val.target
@sshleifer
sshleifer / .vimrc
Last active August 10, 2020 05:01
vimrc
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Filename: .vimrc "
" Maintainer: Sam Shleifer <[email protected]> "
" URL: http://github.com/sshlefier/dotfiles "
" "
" "
" Sections: "
" 01. Plugins ................. using vundle "
" 02. python .................. General autocmd events "
" 03. Vim options ............ Colors, fonts, etc. "
@sshleifer
sshleifer / generate_cc25.sh
Last active June 21, 2020 19:06
Broken Script to translate with cc25
export langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
export CC25=/Users/shleifer/cc25_pretrain
export outfile=pred_en_ro.txt
export PRETRAIN=$CC25/model.pt
fairseq-generate tmp/ --path $PRETRAIN \
--task translation_from_pretrained_bart -t en_XX -s ro_RO --bpe 'sentencepiece' \
--sentencepiece-vocab $CC25/sentence.bpe.model --sacrebleu --remove-bpe 'sentencepiece' \
--max-sentences 32 --langs $langs --beam 5 > $outfile