
@aditya-malte
Created February 22, 2020 13:41
smallBERTa_Pretraining.ipynb
@mrm8488

mrm8488 commented Feb 28, 2020

Great work! I have executed the Colab you provided https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
and I got this error:

02/27/2020 17:23:08 - INFO - __main__ -   Training new model from scratch
02/27/2020 17:23:16 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1000000000000, cache_dir=None, config_name='./EsperBERTo', device=device(type='cuda'), do_eval=False, do_train=True, eval_all_checkpoints=False, eval_data_file=None, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=0.0001, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path=None, model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='./EsperBERTo-small-v1', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=16, save_steps=2000, save_total_limit=2, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='./EsperBERTo', train_data_file='./oscar.eo.txt', warmup_steps=0, weight_decay=0.0)
02/27/2020 17:23:16 - INFO - __main__ -   Loading features from cached file ./roberta_cached_lm_999999999998_oscar.eo.txt
Traceback (most recent call last):
  File "transformers/examples/run_language_modeling.py", line 799, in <module>
    main()
  File "transformers/examples/run_language_modeling.py", line 749, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "transformers/examples/run_language_modeling.py", line 245, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
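
For context on why the sampler ends up empty: with line_by_line left off, the script builds examples as contiguous chunks of block_size tokens, and with block_size still at its 1000000000000 default (visible in the Namespace above) no chunk is ever filled, so the cached dataset holds zero samples. A tiny illustrative sketch of that arithmetic (the corpus token count is made up):

# Illustrative only: a TextDataset-style loader creates one training example
# per full block of block_size tokens.
corpus_tokens = 5_000_000            # hypothetical tokenized corpus length
block_size = 1_000_000_000_000       # default seen in the log above

num_examples = corpus_tokens // block_size
print(num_examples)  # 0 -> RandomSampler raises "num_samples=0"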

@julien-c

@mrm8488 should be fixed now thanks to huggingface/blog#8

@008karan

008karan commented Mar 6, 2020

I want to train ALBERT; what changes do I need to make in run_pretraining.py?

@julien-c Since training an ALBERT-like model requires generating pre-training data, is the pre-training data generated during training itself?

@Nix07

Nix07 commented Mar 17, 2020

@jbmaxwell You can try other tokenizers like CharBPETokenizer, SentencePieceBPETokenizer, etc., to see if one of them works for you.

To load weights and continue training, you can use the model_name_or_path parameter and point it to the latest checkpoint.
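
For reference, a minimal sketch of training one of those alternative tokenizers with the tokenizers library (the paths, vocab size, and special tokens below are assumptions, not values from this gist):

from tokenizers import CharBPETokenizer, SentencePieceBPETokenizer

# Pick an alternative implementation; the training call has the same shape for both.
tokenizer = CharBPETokenizer()  # or SentencePieceBPETokenizer()
tokenizer.train(
    files=["./data/corpus.txt"],   # hypothetical corpus path
    vocab_size=30_000,             # assumed hyperparameters
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# Recent versions of tokenizers write the vocab/merges files with save_model();
# older versions used save(directory, name) instead.
tokenizer.save_model("./models/charBPE")

And for the second point, resuming training from an existing checkpoint amounts to passing that checkpoint directory via --model_name_or_path instead of leaving it unset.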

@PhilipMay

How do I have to preprocess the corpus when I want to train my own LM for RoBERTa? I think it must be one sentence per row, but does it need empty lines between documents? And is it OK to shuffle the text line by line?

@Shafi2016

I get the following error:

File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 296, in
main()
File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 188, in main
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
File "/usr/local/lib/python3.6/dist-packages/transformers/configuration_auto.py", line 217, in from_pretrained
"in its name: {}".format(pretrained_model_name_or_path, ", ".join(CONFIG_MAPPING.keys()))
ValueError: Unrecognized model in /content/models/smallBERTa. Should have a model_type key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder

This happens after running the following command:

cmd = '''python /content/transformers/examples/language-modeling/run_language_modeling.py --output_dir {0}
--model_type roberta
--mlm
--train_data_file {1}
--eval_data_file {2}
--config_name /content/models/smallBERTa
--tokenizer_name /content/models/smallBERTa
--do_train
--line_by_line
--overwrite_output_dir
--do_eval
--block_size 256
--learning_rate 1e-4
--num_train_epochs 5
--save_total_limit 2
--save_steps 2000
--logging_steps 500
--per_gpu_eval_batch_size 32
--per_gpu_train_batch_size 32
--evaluate_during_training
--seed 42
'''.format(weights_dir, train_path, eval_path)

Please let me know how to fix this error.
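
The traceback above points at a config.json under /content/models/smallBERTa that is missing a model_type key. One way this is commonly addressed is to regenerate the config with transformers itself, which writes that key automatically; a minimal sketch, with assumed architecture values (vocab_size must match your trained tokenizer):

from transformers import RobertaConfig

# Hypothetical small-RoBERTa settings, not values from this thread.
config = RobertaConfig(
    vocab_size=32_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
    type_vocab_size=1,
)
# save_pretrained() writes config.json including "model_type": "roberta",
# which is the key AutoConfig.from_pretrained() is looking for.
config.save_pretrained("/content/models/smallBERTa")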

@RobertHua96

I hope this isn't a silly question, because I'm very new to NLP and AI in general. I find the advantages of a byte-level BPE encoder very enticing, and am hoping to continue pretraining DistilBERT on a custom corpus.

Is it possible to:

  1. Train that byte-level BPE tokenizer on the dataset
  2. Load it in with DistilBERT (from HF's checkpoint)
  3. Continue pretraining DistilBERT with the byte-level BPE tokenizer on the custom corpus?
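
For context, a rough sketch of the mechanics behind those three steps (paths, names, and sizes are illustrative, not from this gist). Note that a freshly trained vocabulary no longer lines up with the checkpoint's original WordPiece vocabulary, so the embedding matrix has to be resized and is effectively re-learned:

from tokenizers import ByteLevelBPETokenizer
from transformers import DistilBertForMaskedLM

# 1. Train a byte-level BPE tokenizer on the custom corpus
bbpe = ByteLevelBPETokenizer()
bbpe.train(
    files=["./custom_corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bbpe.save_model("./bbpe_tokenizer")

# 2. Load the pretrained DistilBERT checkpoint from the Hub
model = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# 3. Resize the token embeddings to the new vocabulary before continuing MLM pretraining
model.resize_token_embeddings(30_000)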

@NianzuMa

NianzuMa commented Aug 2, 2020

Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:

To the Tokenizer:
LM data in a directory containing all samples in separate *.txt files.

There is also this code snippet:

import os
from tqdm import tqdm

i = 0  # file index; data and txt_files_dir come from earlier cells in the gist
for row in tqdm(data.to_list()):
  file_name = os.path.join(txt_files_dir, str(i) + '.txt')
  try:
    with open(file_name, 'w') as f:
      f.write(row)
  except Exception as e:  # catch exceptions (e.g. empty rows)
    print(row, e)
  i += 1

What this does is write each sentence to its own file, rather than putting all 200_000 sentences line by line in a single file.

In contrast, in this tutorial: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=HOk4iZ9YZvec

the file oscar.eo.txt contains all sentences line by line in a single file.

I tried searching the documentation but have no clue which way is correct.

Is it necessary to split each sentence into its own file, which results in 200_000 files?

Thank you for your answer.
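
For context on the two layouts: the tokenizers training API just takes a list of file paths, so a single big file and a directory of small files are passed the same way. A small sketch, with paths and vocab size assumed:

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Option A: one big file with one sample per line
tokenizer.train(files=["oscar.eo.txt"], vocab_size=52_000, min_frequency=2)

# Option B: a directory of per-sample files, globbed into a list
# paths = [str(p) for p in Path("./txt_files_dir").glob("*.txt")]
# tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2)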

@amazingsmash

I'm kind of new to this, but playing around with the code a bit, I noticed that the function call "tokenizer.save()" should be changed to "tokenizer.save_model()".

Let me know whether my hunch is correct. :)

@carlstrath

I get this error at line 20:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    329             A path to the destination Tokenizer file
    330         """
--> 331         return self._tokenizer.save(path, pretty)
    332
    333     def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool

@julien-c

@carlstrath In recent versions of tokenizers I think you can just call .save(path) (cc @n1t0)
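
For example (path assumed), in recent versions this writes the whole tokenizer, vocab and all, to a single JSON file:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# save() now takes just the destination path
tokenizer.save("/content/models/smallBERTa/tokenizer.json")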

@carlstrath

carlstrath commented Jan 29, 2021

Sorry to bother everyone again. I am now getting this error at ln27:

python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory

@aditya-malte
Author

Hi @carlstrath,
(Sorry, I've been a bit busy lately, so I wasn't active.)

This gist was made for specific versions of the transformers and tokenizers libraries. Could you try using it with the versions mentioned at the start?

Meanwhile, I guess it's about time I updated this gist to reflect the changes in its dependencies.
Thanks
Aditya

@aditya-malte
Author

aditya-malte commented Jan 29, 2021

Also, when cloning from git, please make sure you use this GitHub repo at the v2.5.0 tag instead: https://github.com/huggingface/transformers/tree/v2.5.0. (The gist is compatible with that version of transformers; newer versions probably don't contain the required run_language_modeling.py file anymore.)

@sv-v5

sv-v5 commented Sep 11, 2021

I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍

For anyone interested in running through training with an updated version of transformers: I have a write-up of a complete example of training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py) committed on Jun 25, 2021.

https://github.com/sv-v5/train-roberta-ua

Python package versions are locked with pipenv, so the example remains reproducible. Tested on Linux and Windows, on GPU and CPU.

Happy training

@aditya-malte
Author

Hi,
That's great to hear! Also, thanks a lot for making an updated training script. I've been busy lately (earlier with work and now my Master's), so it's much appreciated.

@mgiardinelli

I had to update step #26 from tokenizer.save to tokenizer.save_model. FYI

tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
