Sam Shleifer (sshleifer): assorted GitHub gists
```python
# by stas00 and sshleifer
# uses the huggingface `nlp` library (since renamed to `datasets`)
import nlp
from tqdm import tqdm

dataset = 'wmt19'
s = 'ru'
t = 'en'
pair = f'{s}-{t}'  # e.g. 'ru-en'
ds = nlp.load_dataset(dataset, pair)  # the wmt19 builder takes the pair as its config name
```

Stanford CoreNLP Setup

```shell
# install Java 8 and ant, then download and unzip CoreNLP
sudo apt install openjdk-8-jre-headless
sudo apt-get install ant
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip

# helper wrapping the PTB tokenizer (assumes the CoreNLP jars are on the CLASSPATH)
# -ioFileList means $1 is a file listing the input files to tokenize
# usage: ptb_tokenize <file-list> <output-file>
ptb_tokenize () {
    cat $1 | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $2
}
```
| model | batch_size | sequence_length | MB |
|---|---:|---:|---:|
| t5-large | 1 | 128 | 6558 |
| t5-large | 1 | 512 | 9568 |
| t5-large | 1 | 1024 | 23124 |
| facebook/bart-large | 1 | 128 | 3758 |
| facebook/bart-large | 1 | 512 | 4670 |
| facebook/bart-large | 1 | 1024 | 8888 |
| t5-base | 1 | 128 | 2242 |
| t5-base | 1 | 512 | 3776 |
| t5-base | 1 | 1024 | 9056 |
sshleifer / misc_ideas.md
Last active June 8, 2020 20:00
Misc Transformers ideas

  • Workflow: git pre-commit hook to check style
  • Cleanup: GPT and GPT-2 attention are very similar besides caching
      • fine to just add caching to GPT
  • Infra: test coverage for the T5 causal mask
  • Infra: add a test to ModelTesterMixin that the loss doesn't change if pad tokens are introduced
  • save_pretrained: should mkdir if save_path doesn't exist
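The `save_pretrained` item amounts to a one-line fix; a minimal sketch of the idea (hypothetical helper, not the actual transformers implementation, with a plain file write standing in for `torch.save`):

```python
import os
import tempfile

def save_pretrained(model_state: dict, save_path: str) -> str:
    """Hypothetical sketch: create save_path if missing, then write the weights file."""
    os.makedirs(save_path, exist_ok=True)  # the proposed change: mkdir -p before writing
    out_file = os.path.join(save_path, "pytorch_model.bin")
    with open(out_file, "wb") as f:
        f.write(repr(model_state).encode())  # stand-in for torch.save(model_state, out_file)
    return out_file

# works even though the nested directory does not exist yet
out = save_pretrained({"w": 1}, os.path.join(tempfile.mkdtemp(), "does_not_exist_yet"))
```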

sshleifer / copy_subset.py
Last active June 5, 2020 15:41
Copy a subset of bart layers into a smaller model
```python
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizer
from torch import nn
from typing import List

# maps # layers in student -> which teacher layers to copy
layers_to_copy = {
    1: [11],
    2: [0, 11],
    3: [0, 6, 11],
    4: [0, 4, 8, 11],
    6: [0, 2, 4, 7, 9, 11],
    9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
}
```
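The gist is truncated here; a minimal sketch of how the mapping might be applied (hypothetical `copy_layers` helper, shown with plain lists standing in for the teacher's `nn.ModuleList` of layers):

```python
def copy_layers(teacher_layers, layer_ids):
    """Pick the teacher layers named in layer_ids, in order, to form the student."""
    return [teacher_layers[i] for i in layer_ids]

# e.g. a 6-layer student copies teacher layers 0, 2, 4, 7, 9, 11
teacher = list(range(12))  # stand-in for a 12-layer BART encoder/decoder stack
student = copy_layers(teacher, [0, 2, 4, 7, 9, 11])
```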
sshleifer / s3_suggestions.md
Created May 19, 2020 13:41
S3 Suggestions
  • everything must go under bert besides datasets
  • put random models under your own namespace, like sshleifer/bart-tiny-random
  • use the --dry-run command line arg
  • [FIXME] You can log in to the portal and do things manually at this URL with your Kibana creds (??)
sshleifer / PreTweet Checklist
Created May 15, 2020 16:02
Before you tweet checklist
- for mention in tweet.grep('@'): assert twitter.get(mention) == expected_person
- assert photo has tags
- if thread: numbers make sense or down emoji
- all links work
- read it over once more
sshleifer / tweet.md
Last active July 15, 2020 13:36
Tweet Translations using Marian MTModels

See Model List, Docs

```python
en_tweet = [
    "Today we are releasing 1,008 machine translation models, covering combinations of 140 different languages.",
    "They were trained by @jorgtiedemann with @marian, and converted by @sam_shleifer.",
    "Find your language here",
]
```

en-da: I dag frigiver vi 1.008 maskinoversættelsesmodeller, der dækker kombinationer af 140 forskellige sprog. De blev uddannet af @jorgtiedemann med @marian, og konverteret af @sam_shleifer. Find dit sprog her:

sshleifer / multilingual_groups.py
Last active May 29, 2020 06:33
Multilingual Group Members Mapping
```python
GROUP_MEMBERS = {
    'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
    'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo',
                'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE',
                'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
    'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
    'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
    'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
    'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
    'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv'],
}
```

MarianMTModel Best Practices:

  • Split source-language documents into sentences before passing them through the model.
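A minimal sketch of that preprocessing step (a naive regex splitter, an assumption for illustration; a real pipeline would use a proper sentence segmenter):

```python
import re

def split_into_sentences(document: str) -> list:
    """Naively split a document on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', document.strip()) if s]

sentences = split_into_sentences("First sentence. Second one! A third?")
```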

Model Naming

Model names use the format Helsinki-NLP/opus-mt-{src}-{tgt}. Fully capitalized values for src and tgt reference group names in the GROUP_MEMBERS lookup above.
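A minimal sketch of resolving a model name from language codes (hypothetical `marian_model_name` helper, shown with a small subset of the GROUP_MEMBERS mapping above; group names are illustrative, not a guarantee that every such model exists):

```python
GROUP_MEMBERS = {  # subset of the full mapping above
    'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
    'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv'],
}

def resolve(code: str) -> str:
    """Return the group name containing code, else the code itself."""
    for group, members in GROUP_MEMBERS.items():
        if code in members:
            return group
    return code

def marian_model_name(src: str, tgt: str) -> str:
    """Hypothetical helper building a Helsinki-NLP/opus-mt-{src}-{tgt} name."""
    return f"Helsinki-NLP/opus-mt-{resolve(src)}-{resolve(tgt)}"
```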