I get the following error
File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 296, in
main()
File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 188, in main
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
File "/usr/local/lib/python3.6/dist-packages/transformers/configuration_auto.py", line 217, in from_pretrained
"in its name: {}".format(pretrained_model_name_or_path, ", ".join(CONFIG_MAPPING.keys()))
ValueError: Unrecognized model in /content/models/smallBERTa. Should have a model_type
key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
This happens after running the following code:
cmd = '''python /content/transformers/examples/language-modeling/run_language_modeling.py --output_dir {0}
    --model_type roberta
    --mlm
    --train_data_file {1}
    --eval_data_file {2}
    --config_name /content/models/smallBERTa
    --tokenizer_name /content/models/smallBERTa
    --do_train
    --line_by_line
    --overwrite_output_dir
    --do_eval
    --block_size 256
    --learning_rate 1e-4
    --num_train_epochs 5
    --save_total_limit 2
    --save_steps 2000
    --logging_steps 500
    --per_gpu_eval_batch_size 32
    --per_gpu_train_batch_size 32
    --evaluate_during_training
    --seed 42
'''.format(weights_dir, train_path, eval_path)
Please let me know how to fix this error
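From the traceback, it looks like the config.json under /content/models/smallBERTa has no model_type key, which AutoConfig needs in order to pick the architecture. A minimal sketch of how such a config could be generated (the sizes below are placeholders, not values from the gist):
from transformers import RobertaConfig

# RobertaConfig.save_pretrained() writes a config.json containing
# "model_type": "roberta", which AutoConfig.from_pretrained() looks for.
config = RobertaConfig(
    vocab_size=32_000,           # placeholder; should match the trained tokenizer
    max_position_embeddings=514,
    num_hidden_layers=4,
    num_attention_heads=6,
    type_vocab_size=1,
)
config.save_pretrained("/content/models/smallBERTa")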
I hope this isn't a silly question, because I'm very new to NLP and AI in general. I find the advantages of a byte-level BPE ("bytepiece") encoder very enticing, and am hoping to continue pretraining DistilBERT on a custom corpus.
Is it possible to:
- Train that bytepiece encoder on the dataset (roughly as sketched below)
- Load it in with DistilBERT (from HF's checkpoint)
- Continue pretraining DistilBERT with the bytepiece tokenizer on the custom corpus?
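For the first step, this is roughly what I have in mind (a sketch using the tokenizers library's ByteLevelBPETokenizer; the file names and vocab size are placeholders):
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the custom corpus (plain-text files).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],   # placeholder path to the corpus
    vocab_size=30_522,         # placeholder; matches DistilBERT's default vocab size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# Writes vocab.json and merges.txt to the output directory.
tokenizer.save_model("my_tokenizer_dir")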
Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:
"To the Tokenizer: LM data in a directory containing all samples in separate *.txt files."
There is also this code snippet:
import os
from tqdm import tqdm

i = 0  # running index used to name each output file
for row in tqdm(data.to_list()):
    file_name = os.path.join(txt_files_dir, str(i) + '.txt')
    try:
        # write each sample to its own .txt file
        f = open(file_name, 'w')
        f.write(row)
        f.close()
    except Exception as e:  # catch exceptions (e.g. empty rows)
        print(row, e)
    i += 1
What this does is write each sentence to its own file, rather than putting 200_000 sentences line by line into a single file.
In contrast, in this tutorial: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=HOk4iZ9YZvec
the file oscar.eo.txt contains all sentences line by line in a single file.
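In other words, something like this (just a sketch; I'm assuming data is the same list of sentence strings as above, and the file name is a placeholder):
# Alternative layout: all sentences in one file, one sentence per line.
with open("all_sentences.txt", "w") as f:
    for row in data.to_list():
        if row:  # skip empty rows
            f.write(row.strip() + "\n")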
I tried to search the documentation but have no clue which way is correct.
Is it necessary to split each sentence into its own file, which results in 200_000 files?
Thank you for your answer.
I'm kind of new to this, but playing around with the code a bit, I noticed that the call tokenizer.save() should be changed to tokenizer.save_model().
Let me know whether my hunch is correct. :)
I get this error at line 20:
TypeError                                 Traceback (most recent call last)
<ipython-input-…> in <module>()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    329             A path to the destination Tokenizer file
    330         """
--> 331         return self._tokenizer.save(path, pretty)
    332
    333     def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool
@carlstrath In recent versions of tokenizers, I think you can just call .save(path).
(cc @n1t0)
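If I understand the API change correctly, the difference is roughly this (a sketch against a recent tokenizers release; paths are placeholders):
# Older API: tokenizer.save(directory, name) wrote vocab.json/merges.txt.
# Newer API: tokenizer.save(path) writes a single tokenizer.json file,
# while tokenizer.save_model(directory) writes the vocab/merges files.
tokenizer.save("/content/models/smallBERTa/tokenizer.json")
tokenizer.save_model("/content/models/smallBERTa")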
Sorry to bother everyone again. I am now getting this error at line 27:
python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory
Hi @carlstrath,
(Sorry I’ve been a bit busy lately so wasn’t active).
This gist was made for specific versions of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?
Meanwhile, I guess it’s about time now that I update this gist to reflect changes in the dependencies.
Thanks
Aditya
Also, while cloning from git, please ensure you use this repo instead: https://github.com/huggingface/transformers/tree/v2.5.0. (The gist is compatible with that version of transformers; the newer one probably doesn't contain the required run_language_modeling.py file.)
I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍
For anyone interested in running through training with an updated transformers: I have a write-up of a complete example of training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py, committed on Jun 25, 2021): https://github.com/sv-v5/train-roberta-ua
Python package versions are locked with pipenv, so the example remains reproducible. Tested on Linux and Windows, on GPU and CPU.
Happy training
Hi,
That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now my Master’s), so your updated script is much appreciated.
I had to update step #26 from tokenizer.save to tokenizer.save_model. FYI
tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
How do I need to preprocess the corpus when I want to train my own LM for RoBERTa? I think it should be one sentence per line, but does it need empty lines between documents? And is it OK to shuffle the text line by line?