I think you mean custom “Dataset Loader”, because the code above already uses a custom tokenizer
I see what you mean—a custom parametrization of the BPE tokenizer. But my use case is very specialized (music), so I actually want a very specific tokenization. But yes, I may be able to specify it the way you've done here. I'll think more about that. Thanks!
ps - I did get it running with the given tokenizer, so that's a huge step forward!
You’re welcome :)
Also, do share this gist on your network
I'm struggling with trying to use a fixed vocabulary. My vocab.txt (for music) is small, and I want to avoid wordpieces, so that I don't have to predict multiple, adjacent pieces/tokens to get a "complete" prediction/"word". So all I want to do is load a vocab.txt and tokenize. Super simple, but I can't find a way to do that.
(If I can't find a way to do this, I'll just settle with the BPE tokenizer and figure out a way around the problems when I deploy it.)
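Just in case it's useful, here is a minimal sketch of loading a fixed vocab.txt and tokenizing with the tokenizers library's WordLevel model. This is an assumption on my part rather than anything from the gist: it presumes a recent tokenizers version, one token per line in vocab.txt, whitespace-separated input, and an illustrative <unk> token and example string.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build the id mapping straight from vocab.txt (one token per line).
with open("vocab.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f if line.strip()]
vocab = {tok: i for i, tok in enumerate(tokens)}
vocab.setdefault("<unk>", len(vocab))  # the unk token must be in the vocab

# WordLevel does no subword splitting, so every prediction is a whole token.
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("note_C4 dur_8 note_E4 dur_8")  # made-up music tokens
print(encoding.tokens, encoding.ids)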
Curious; is there a simple way to load weights and continue training?
Great work! I have executed the Colab you provided https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
and I got this error:
02/27/2020 17:23:08 - INFO - __main__ - Training new model from scratch
02/27/2020 17:23:16 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1000000000000, cache_dir=None, config_name='./EsperBERTo', device=device(type='cuda'), do_eval=False, do_train=True, eval_all_checkpoints=False, eval_data_file=None, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=0.0001, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path=None, model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='./EsperBERTo-small-v1', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=16, save_steps=2000, save_total_limit=2, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='./EsperBERTo', train_data_file='./oscar.eo.txt', warmup_steps=0, weight_decay=0.0)
02/27/2020 17:23:16 - INFO - __main__ - Loading features from cached file ./roberta_cached_lm_999999999998_oscar.eo.txt
Traceback (most recent call last):
File "transformers/examples/run_language_modeling.py", line 799, in <module>
main()
File "transformers/examples/run_language_modeling.py", line 749, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "transformers/examples/run_language_modeling.py", line 245, in train
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 94, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
@mrm8488 should be fixed now thanks to huggingface/blog#8
I want to train ALBERT. What changes do I need to make in run_pretraining.py for that?
@julien-c Since training an ALBERT-like model requires generating pre-training data, is the pre-training data generated during training itself?
@jbmaxwell You can try other tokenizers, such as CharBPETokenizer or SentencePieceBPETokenizer, to check whether one of them works for you.
To load weights and continue training, you can use the model_name_or_path
parameter and point it to the latest checkpoint.
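For example (a hedged sketch reusing the run_language_modeling.py flags that appear elsewhere in this thread; the checkpoint-2000 directory name is a placeholder for whichever checkpoint the script actually saved under your output_dir):

cmd = '''python transformers/examples/run_language_modeling.py --output_dir ./EsperBERTo-small-v1 \
    --model_type roberta \
    --model_name_or_path ./EsperBERTo-small-v1/checkpoint-2000 \
    --tokenizer_name ./EsperBERTo \
    --mlm \
    --do_train \
    --train_data_file ./oscar.eo.txt \
    --overwrite_output_dir
'''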
How do I need to preprocess the corpus when I want to train my own LM for RoBERTa? I think it should be one sentence per line. But does it need empty lines between documents? And is it OK to shuffle the text line by line?
I get the following error
File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 296, in
main()
File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 188, in main
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
File "/usr/local/lib/python3.6/dist-packages/transformers/configuration_auto.py", line 217, in from_pretrained
"in its name: {}".format(pretrained_model_name_or_path, ", ".join(CONFIG_MAPPING.keys()))
ValueError: Unrecognized model in /content/models/smallBERTa. Should have a model_type
key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
After running this code:
cmd = '''python /content/transformers/examples/language-modeling/run_language_modeling.py --output_dir {0} \
    --model_type roberta \
    --mlm \
    --train_data_file {1} \
    --eval_data_file {2} \
    --config_name /content/models/smallBERTa \
    --tokenizer_name /content/models/smallBERTa \
    --do_train \
    --line_by_line \
    --overwrite_output_dir \
    --do_eval \
    --block_size 256 \
    --learning_rate 1e-4 \
    --num_train_epochs 5 \
    --save_total_limit 2 \
    --save_steps 2000 \
    --logging_steps 500 \
    --per_gpu_eval_batch_size 32 \
    --per_gpu_train_batch_size 32 \
    --evaluate_during_training \
    --seed 42
'''.format(weights_dir, train_path, eval_path)
Please let me know how to fix this error
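One possible fix, sketched from the error message itself rather than from the thread: AutoConfig expects a model_type key in the config.json it is pointed at, which for this setup would be "roberta" (the path below is taken from the error above).

import json

config_path = "/content/models/smallBERTa/config.json"  # path from the error message

with open(config_path) as f:
    config = json.load(f)

# AutoConfig complains when config.json lacks a model_type key;
# for a RoBERTa-style model that key should be "roberta".
config.setdefault("model_type", "roberta")

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)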
I hope this isn't a silly question because I'm very new to NLP and AI in general. I find the advantages of a bytepiece encoder very enticing - and am hoping to continue pretraining Distilbert on a custom corpus.
Is it possible to:
- Train that bytepiece encoder on the dataset
- Load it in with Distilbert (From HF's checkpoint)
- Continue pretraining Distilbert with the bytepiece tokenizer on custom corpus?
Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:
To the Tokenizer:
LM data in a directory containing all samples in separate *.txt files.
There is also this code snippet:
for row in tqdm(data.to_list()):
    file_name = os.path.join(txt_files_dir, str(i) + '.txt')
    try:
        f = open(file_name, 'w')
        f.write(row)
        f.close()
    except Exception as e:  # catch exceptions (e.g. empty rows)
        print(row, e)
    i += 1
What this does is write each sentence to its own file, rather than putting 200_000 sentences line by line into a single file.
In contrast, in this tutorial: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=HOk4iZ9YZvec
the file oscar.eo.txt contains all sentences line by line in a single file.
I tried to search for the documentation but have no clue which way to do is correct.
Is it necessary to split each sentence into one file, which results in 200_000 files?
Thank you for your answer.
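For what it's worth, here is a hedged sketch (mirroring the ByteLevelBPETokenizer usage in the blog notebook) showing that train() simply takes a list of .txt paths, so either layout can be fed in; the paths and hyperparameters are illustrative.

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Either many small files or one big line-by-line file works:
# train() just takes a list of paths.
paths = [str(p) for p in Path("./txt_files_dir").glob("*.txt")]  # many small files
# paths = ["./oscar.eo.txt"]                                      # or a single file

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)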
I'm kinda new to this, but playing around with the code a bit I noticed that the call to tokenizer.save() should be changed to tokenizer.save_model().
Let me know whether my hunch is correct. :)
I get this error in line 20
TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    329             A path to the destination Tokenizer file
    330         """
--> 331         return self._tokenizer.save(path, pretty)
    332
    333     def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool
@carlstrath In recent versions of tokenizers
I think you can just call .save(path)
(cc @n1t0)
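To illustrate the difference (a hedged sketch assuming a recent tokenizers release; the paths are placeholders): save_model() writes the model files (vocab/merges) into a directory, while save() writes a single tokenizer JSON file.

# Writes vocab.json and merges.txt into the directory.
tokenizer.save_model("/content/models/smallBERTa")

# Writes one self-contained tokenizer.json file (newer API).
tokenizer.save("/content/models/smallBERTa/tokenizer.json")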
Sorry to bother everyone again. I am now getting this error in line 27:
python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory
Hi @carlstrath,
(Sorry, I've been a bit busy lately, so I wasn't active.)
This gist was made for specific versions of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?
Meanwhile, I guess it’s about time now that I update this gist to reflect changes in the dependencies.
Thanks
Aditya
Also, while cloning from git, please ensure you use this GitHub repo instead: https://github.com/huggingface/transformers/tree/v2.5.0. (The gist is compatible with that version of transformers; the newer one probably doesn't contain the required run_language_modeling.py file.)
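In a Colab cell that could look something like this (just a sketch that checks out the tag from the link above):

# Clone the transformers repo at the v2.5.0 tag the gist targets,
# which still ships examples/run_language_modeling.py.
!git clone --branch v2.5.0 https://github.com/huggingface/transformers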
I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍
For anyone interested in running through training with an updated transformers: I have a write-up here of a complete example on training from scratch using transformers 4.10
and the updated run_language_modeling.py
script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py) committed on Jun 25, 2021.
https://github.com/sv-v5/train-roberta-ua
Python package versions are locked with pipenv so the example remains reproducible. Tested on Linux and Windows, on GPU and CPU.
Happy training
Hi,
That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now my Master’s), so your updated script is much appreciated.
I had to update step #26 from tokenizer.save to tokenizer.save_model. FYI
tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
First off, thanks so much for sharing this; it definitely helped me get a lot further along!
I was hoping to use my own tokenizer though, so I'm guessing the only way would be to write the tokenizer, then just replace the LineByLineTextDataset() call in load_and_cache_examples() with my custom dataset, yes?
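If it helps, here is a rough, hedged sketch of what such a drop-in replacement might look like (the class name, file format, and the encode(...).ids call are assumptions about your custom tokenizer, not anything from the gist):

import torch
from torch.utils.data import Dataset

class CustomLineDataset(Dataset):
    """Hypothetical stand-in for LineByLineTextDataset: one training
    example per non-empty line, encoded with a custom tokenizer."""

    def __init__(self, tokenizer, file_path, block_size=256):
        with open(file_path, encoding="utf-8") as f:
            lines = [ln for ln in f.read().splitlines() if ln.strip()]
        # Assumes the custom tokenizer exposes encode(text).ids
        # (as tokenizers-library tokenizers do); adapt if yours differs.
        self.examples = [tokenizer.encode(ln).ids[:block_size] for ln in lines]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)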