
@aditya-malte
Created February 22, 2020 13:41
smallBERTa_Pretraining.ipynb
@mrm8488

mrm8488 commented Feb 28, 2020

Great work! I have executed the Colab you provided https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
and I got this error:

02/27/2020 17:23:08 - INFO - __main__ -   Training new model from scratch
02/27/2020 17:23:16 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1000000000000, cache_dir=None, config_name='./EsperBERTo', device=device(type='cuda'), do_eval=False, do_train=True, eval_all_checkpoints=False, eval_data_file=None, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=0.0001, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path=None, model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='./EsperBERTo-small-v1', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=16, save_steps=2000, save_total_limit=2, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='./EsperBERTo', train_data_file='./oscar.eo.txt', warmup_steps=0, weight_decay=0.0)
02/27/2020 17:23:16 - INFO - __main__ -   Loading features from cached file ./roberta_cached_lm_999999999998_oscar.eo.txt
Traceback (most recent call last):
  File "transformers/examples/run_language_modeling.py", line 799, in <module>
    main()
  File "transformers/examples/run_language_modeling.py", line 749, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "transformers/examples/run_language_modeling.py", line 245, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
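
For context on why the sampler ends up empty: with line_by_line left off, the script builds examples as contiguous chunks of block_size tokens, and with block_size still at its 1000000000000 default (visible in the Namespace above) no chunk is ever filled, so the cached dataset holds zero samples. A tiny illustrative sketch of that arithmetic (the corpus token count is made up):

# Illustrative only: a TextDataset-style loader creates one training example
# per full block of block_size tokens.
corpus_tokens = 5_000_000            # hypothetical tokenized corpus length
block_size = 1_000_000_000_000       # default seen in the log above

num_examples = corpus_tokens // block_size
print(num_examples)  # 0 -> RandomSampler raises "num_samples=0"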

@julien-c

@mrm8488 should be fixed now thanks to huggingface/blog#8

@008karan

008karan commented Mar 6, 2020

I want to train ALBERT; what changes do I need to make in run_pretraining.py?

@julien-c Since training an ALBERT-like model requires generating pre-training data, is the pre-training data generated during training itself?

@Nix07

Nix07 commented Mar 17, 2020

@jbmaxwell You can try other tokenizers like CharBPETokenizer, SentencePieceBPETokenizer, etc., to see if one of them works for you.

To load weights and continue training, you can use the model_name_or_path parameter and point it to the latest checkpoint.
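
For reference, a minimal sketch of training one of those alternative tokenizers with the tokenizers library (the paths, vocab size, and special tokens below are assumptions, not values from this gist):

from tokenizers import CharBPETokenizer, SentencePieceBPETokenizer

# Pick an alternative implementation; the training call has the same shape for both.
tokenizer = CharBPETokenizer()  # or SentencePieceBPETokenizer()
tokenizer.train(
    files=["./data/corpus.txt"],   # hypothetical corpus path
    vocab_size=30_000,             # assumed hyperparameters
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# Recent versions of tokenizers write the vocab/merges files with save_model();
# older versions used save(directory, name) instead.
tokenizer.save_model("./models/charBPE")

And for the second point, resuming training from an existing checkpoint amounts to passing that checkpoint directory via --model_name_or_path instead of leaving it unset.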

@PhilipMay

How do I have to preprocess the corpus when I want to train my own LM for RoBERTa? I think it must be one sentence per row, but does it need empty lines between documents? And is it OK to shuffle the text line by line?

@Shafi2016

I get the following error:

File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 296, in
main()
File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 188, in main
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
File "/usr/local/lib/python3.6/dist-packages/transformers/configuration_auto.py", line 217, in from_pretrained
"in its name: {}".format(pretrained_model_name_or_path, ", ".join(CONFIG_MAPPING.keys()))
ValueError: Unrecognized model in /content/models/smallBERTa. Should have a model_type key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder

This happens after running the following command:

cmd = '''python /content/transformers/examples/language-modeling/run_language_modeling.py --output_dir {0}
--model_type roberta
--mlm
--train_data_file {1}
--eval_data_file {2}
--config_name /content/models/smallBERTa
--tokenizer_name /content/models/smallBERTa
--do_train
--line_by_line
--overwrite_output_dir
--do_eval
--block_size 256
--learning_rate 1e-4
--num_train_epochs 5
--save_total_limit 2
--save_steps 2000
--logging_steps 500
--per_gpu_eval_batch_size 32
--per_gpu_train_batch_size 32
--evaluate_during_training
--seed 42
'''.format(weights_dir, train_path, eval_path)

Please let me know how to fix this error.
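
The traceback above points at a config.json under /content/models/smallBERTa that is missing a model_type key. One way this is commonly addressed is to regenerate the config with transformers itself, which writes that key automatically; a minimal sketch, with assumed architecture values (vocab_size must match your trained tokenizer):

from transformers import RobertaConfig

# Hypothetical small-RoBERTa settings, not values from this thread.
config = RobertaConfig(
    vocab_size=32_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
    type_vocab_size=1,
)
# save_pretrained() writes config.json including "model_type": "roberta",
# which is the key AutoConfig.from_pretrained() is looking for.
config.save_pretrained("/content/models/smallBERTa")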

@RobertHua96

I hope this isn't a silly question, because I'm very new to NLP and AI in general. I find the advantages of a byte-level BPE encoder very enticing, and am hoping to continue pretraining DistilBERT on a custom corpus.

Is it possible to:

  1. Train that byte-level BPE tokenizer on the dataset
  2. Load it in with DistilBERT (from HF's checkpoint)
  3. Continue pretraining DistilBERT with the byte-level BPE tokenizer on the custom corpus?
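
For context, a rough sketch of the mechanics behind those three steps (paths, names, and sizes are illustrative, not from this gist). Note that a freshly trained vocabulary no longer lines up with the checkpoint's original WordPiece vocabulary, so the embedding matrix has to be resized and is effectively re-learned:

from tokenizers import ByteLevelBPETokenizer
from transformers import DistilBertForMaskedLM

# 1. Train a byte-level BPE tokenizer on the custom corpus
bbpe = ByteLevelBPETokenizer()
bbpe.train(
    files=["./custom_corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bbpe.save_model("./bbpe_tokenizer")

# 2. Load the pretrained DistilBERT checkpoint from the Hub
model = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# 3. Resize the token embeddings to the new vocabulary before continuing MLM pretraining
model.resize_token_embeddings(30_000)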

@NianzuMa

NianzuMa commented Aug 2, 2020

Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:

To the Tokenizer:
LM data in a directory containing all samples in separate *.txt files.

There is also this code snippet:

import os
from tqdm import tqdm

i = 0  # file index; data and txt_files_dir come from earlier cells in the gist
for row in tqdm(data.to_list()):
  file_name = os.path.join(txt_files_dir, str(i) + '.txt')
  try:
    with open(file_name, 'w') as f:
      f.write(row)
  except Exception as e:  # catch exceptions (e.g. empty rows)
    print(row, e)
  i += 1

What this does is write each sentence to its own file, rather than putting all 200_000 sentences line by line in a single file.

In contrast, in this tutorial: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=HOk4iZ9YZvec

the file oscar.eo.txt contains all sentences line by line in a single file.

I tried searching the documentation but have no clue which way is correct.

Is it necessary to split each sentence into its own file, which results in 200_000 files?

Thank you for your answer.
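
For context on the two layouts: the tokenizers training API just takes a list of file paths, so a single big file and a directory of small files are passed the same way. A small sketch, with paths and vocab size assumed:

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Option A: one big file with one sample per line
tokenizer.train(files=["oscar.eo.txt"], vocab_size=52_000, min_frequency=2)

# Option B: a directory of per-sample files, globbed into a list
# paths = [str(p) for p in Path("./txt_files_dir").glob("*.txt")]
# tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2)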

@amazingsmash

I'm kind of new to this, but playing around with the code a bit, I noticed that the function call "tokenizer.save()" should be changed to "tokenizer.save_model()".

Let me know whether my hunch is correct. :)

@carlstrath

I get this error at line 20:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    329             A path to the destination Tokenizer file
    330         """
--> 331         return self._tokenizer.save(path, pretty)
    332
    333     def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool

@julien-c

@carlstrath In recent versions of tokenizers I think you can just call .save(path) (cc @n1t0)
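
For example (path assumed), in recent versions this writes the whole tokenizer, vocab and all, to a single JSON file:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# save() now takes just the destination path
tokenizer.save("/content/models/smallBERTa/tokenizer.json")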

@carlstrath

carlstrath commented Jan 29, 2021

Sorry to bother everyone again. I am now getting this error at ln27:

python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory

@aditya-malte
Author

Hi @carlstrath,
(Sorry, I've been a bit busy lately, so I wasn't active.)

This gist was made for specific versions of the transformers and tokenizers libraries. Could you try using it with the versions mentioned at the start?

Meanwhile, I guess it's about time I updated this gist to reflect the changes in its dependencies.
Thanks
Aditya

@aditya-malte
Author

aditya-malte commented Jan 29, 2021

Also, when cloning from git, please make sure you use this GitHub repo at the v2.5.0 tag instead: https://github.com/huggingface/transformers/tree/v2.5.0. (The gist is compatible with that version of transformers; newer versions probably don't contain the required run_language_modeling.py file anymore.)

@sv-v5

sv-v5 commented Sep 11, 2021

I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍

For anyone interested in running through training with an updated version of transformers: I have a write-up of a complete example of training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py) committed on Jun 25, 2021.

https://github.com/sv-v5/train-roberta-ua

Python package versions are locked with pipenv, so the example remains reproducible. Tested on Linux and Windows, on GPU and CPU.

Happy training

@aditya-malte
Author

Hi,
That's great to hear! Also, thanks a lot for making an updated training script. I've been busy lately (earlier with work and now my Master's), so it's much appreciated.

@mgiardinelli

I had to update step #26 from tokenizer.save to tokenizer.save_model. FYI

tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
