I'm kinda new to this, but playing around with the code I noticed that the call tokenizer.save() should be changed to tokenizer.save_model().
Let me know whether my hunch is correct. :)
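For reference, a minimal sketch of the API difference, assuming a recent release of the tokenizers library; the corpus contents, paths, and vocabulary size below are placeholders, not the gist's actual values:

```python
# Hedged sketch of the save() -> save_model() API change, assuming a
# recent `tokenizers` release. Paths and corpus contents are placeholders.
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

workdir = tempfile.mkdtemp()
corpus = os.path.join(workdir, "corpus.txt")
with open(corpus, "w", encoding="utf-8") as f:
    f.write("this is a tiny placeholder corpus\n" * 50)

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=[corpus], vocab_size=300, min_frequency=1)

# Old gist call -- raises TypeError on newer versions, because save()
# now has the signature save(path, pretty: bool):
# tokenizer.save(workdir, "smallBERTa")

# Newer API: save_model(directory, prefix) writes the vocabulary files
# (smallBERTa-vocab.json and smallBERTa-merges.txt) and returns their paths.
saved = tokenizer.save_model(workdir, "smallBERTa")
```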
I get this error in line 20
TypeError Traceback (most recent call last)
in ()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
329 A path to the destination Tokenizer file
330 """
--> 331 return self._tokenizer.save(path, pretty)
332
333 def to_str(self, pretty: bool = False):
TypeError: Can't convert 'smallBERTa' to PyBool
@carlstrath In recent versions of tokenizers
I think you can just call .save(path)
(cc @n1t0)
Sorry to bother everyone again. I am now getting this error at line 27:
python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory
Hi @carlstrath,
(Sorry I’ve been a bit busy lately so wasn’t active).
This gist was made for a specific version of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?
Meanwhile, I guess it’s about time now that I update this gist to reflect changes in the dependencies.
Thanks
Aditya
Also, while cloning from git, please ensure you use this GitHub repo instead: https://github.com/huggingface/transformers/tree/v2.5.0. (The gist is compatible with that version of huggingface transformers; the newer one probably doesn't contain the required run_language_modeling.py file.)
I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍
For anyone interested in running through training with an updated transformers: I have a write-up here of a complete example of training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py), committed on Jun 25, 2021.
https://github.com/sv-v5/train-roberta-ua
Python package versions are locked with pipenv so the example remains reproducible. Tested on Linux and Windows, on both GPU and CPU.
Happy training
Hi,
That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now my Master’s), so your updated script is much appreciated.
I had to update step #26 from tokenizer.save to tokenizer.save_model. FYI
tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:
Also, there is a code snippet:
What this does is separate each sentence into its own file, rather than putting 200_000 sentences line by line into a single file.
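To make the two corpus layouts concrete, here is a stdlib-only sketch; the sentences and directory names are made up for illustration:

```python
# Stdlib-only sketch contrasting the two corpus layouts discussed above.
# The sentences and directory names are made up for illustration.
import os
import tempfile

sentences = ["first sentence", "second sentence", "third sentence"]
root = tempfile.mkdtemp()

# Layout 1 (this gist): one file per sentence.
per_file_dir = os.path.join(root, "per_sentence")
os.makedirs(per_file_dir)
for i, s in enumerate(sentences):
    with open(os.path.join(per_file_dir, f"{i}.txt"), "w", encoding="utf-8") as f:
        f.write(s)

# Layout 2 (as in oscar.eo.txt): all sentences, one per line, in a single file.
single_file = os.path.join(root, "corpus.txt")
with open(single_file, "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")

# Either layout can be passed to a tokenizer trainer that takes a list of
# files: a list of the per-sentence files, or just [single_file].
```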
In contrast, in this tutorial: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=HOk4iZ9YZvec
the file
oscar.eo.txt
contains all sentences line by line in a single file. I tried to search the documentation but have no clue which approach is correct.
Is it necessary to split each sentence into its own file, which results in 200_000 files?
Thank you for your answer.