When you say "arguments for training", where exactly do you use them? Are you using a huggingface trainer or something else?
If you give me a minimal example of code that can reproduce your problem, it would be easier for me to help.
My first guess is that you should replace
tokenizer_name_or_path="https://huggingface.co/yhavinga/t5-base-dutch/blob/main/tokenizer.json"
with simply
tokenizer_name_or_path="yhavinga/t5-base-dutch"
but without more context, I cannot be sure.
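For example, something like this should be enough to load the tokenizer (a minimal sketch, assuming you use T5TokenizerFast from transformers; not your exact setup):
from transformers import T5TokenizerFast

# Pass the Hub repo id; transformers resolves and downloads tokenizer.json itself,
# so there is no need to point at the raw file URL.
tokenizer = T5TokenizerFast.from_pretrained("yhavinga/t5-base-dutch")
print(tokenizer("Dit is een test.").input_ids)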
Thanks for your response @avidale. Yes, I'm using a HF trainer and the arguments are for that. I made the change you suggested, but I'm getting a different error, shown below:
-> 310 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
311
312 def _EncodeAsIds(self, text, enable_sampling, nbest_size, alpha, add_bos, add_eos, reverse, emit_unk_piece):
TypeError: not a string
I've been following this notebook: https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb#scrollTo=hcKmeIGiI582. Appreciate the help!
Now it looks like a problem with incorrect input.
But again, without knowing the exact code that led to the error, I cannot say for sure.
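For reference, the slow SentencePiece-based tokenizers raise exactly this error when the vocabulary file path they receive is None or otherwise not a string. A hypothetical reproduction (not necessarily your case):
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
# Passing None (or anything that is not a path string) ends up in
# SentencePieceProcessor_LoadFromFile and raises "TypeError: not a string".
sp.LoadFromFile(None)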
Would this gist link help: https://gist.github.com/Sandeep0408/236b164cb09408c920aedb15d5c7e984
If not, I can give you access to the Colab notebook via mail. Thanks!
Hello, I would like to know which version of Python you are using. Also, I saved the model as model.safetensors instead of pytorch_model.bin; do you have a solution for that? Thank you very much.
Should this work with XLMRobertaModel, like e5-large? Or is something fundamentally different being used there? It didn't work out for me.
As far as I can judge from the HF documentation, XLMRobertaTokenizer is based on SentencePiece, just like T5Tokenizer. Thus, in principle, the approach should work; I don't see any fundamental reason why it wouldn't.
Nevertheless, the specific details, such as model parameter names, tokenizer parameter names, special tokens etc. may differ between T5 and XLMRoberta, so my code will surely need some adaptation to work with E5.
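As a starting point, here is a rough sketch of the pieces one would need to inspect before adapting the pruning code (assuming the multilingual checkpoint intfloat/multilingual-e5-large; the attribute names follow the slow tokenizer implementation and may need adjustment):
from transformers import XLMRobertaTokenizer, XLMRobertaModel

tokenizer = XLMRobertaTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = XLMRobertaModel.from_pretrained("intfloat/multilingual-e5-large")

# The underlying SentencePiece model, analogous to T5's spiece.model.
print(tokenizer.sp_model.GetPieceSize())

# XLM-R keeps an offset between SentencePiece ids and model ids
# (<s>, <pad>, </s>, <unk> are remapped and <mask> is appended),
# so the embedding matrix is slightly larger than the SentencePiece vocab.
print(model.get_input_embeddings().weight.shape)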
Hi @avidale. Hope you are doing great! Great work, indeed! I have implemented the above code for Urdu and want to fine-tune it for text summarisation. I just wanted to ask about the arguments that need to be passed while training (for example, learning rate, weight decay, etc.). Given below are the training args for fine-tuning the mt5-base model on an Urdu dataset. What is your suggestion for the argument values for the reduced model (39.9% of the original model's parameters)? That would be a great help.
from transformers import Seq2SeqTrainingArguments

batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned_urdu_mt5-base",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)
I tried using the same args on the reduced model and it overfit.
Hi @MuaazAnsari!
I have two comments on your question:
- I don't expect the optimal hyperparameter values to depend on whether the model vocabulary has been pruned or not.
- Unfortunately, they do depend on the dataset that you train on (including the input and output lengths, dataset size, and task difficulty) and your hardware (e.g. if your GPU memory is limited, you'd have to decrease batch size, but then you might want to decrease learning rate or increase gradient accumulation to compensate). Thus, I cannot suggest parameters that would be universally good.
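For instance, if GPU memory forced you to halve the batch size, a first thing to try might look like this (hypothetical values for illustration, not a recommendation):
args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned_urdu_mt5-base",
    evaluation_strategy="epoch",
    learning_rate=3e-5,             # somewhat lower than before (hypothetical)
    per_device_train_batch_size=4,  # halved to fit into memory
    gradient_accumulation_steps=2,  # keeps the effective batch size at 8
    weight_decay=0.01,
    num_train_epochs=4,             # fewer epochs can also help against overfitting
    predict_with_generate=True,
    logging_steps=logging_steps,
    save_total_limit=3,
)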
Hi @avidale, I'm trying to run sentiment classification on a Dutch dataset, using the tokenizer as:
tokenizer = T5TokenizerFast.from_pretrained('yhavinga/t5-base-dutch')
and the following arguments for training:
model_name_or_path="yhavinga/t5-base-dutch",
tokenizer_name_or_path="https://huggingface.co/yhavinga/t5-base-dutch/blob/main/tokenizer.json"
When trying to train the model, I'm getting the following error:
170 def LoadFromFile(self, arg):
--> 171 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: /tmp/pip-install-ilecn6h0/sentencepiece_3c5f89f9146b4090a668d0f42db59389/bundled/sentencepiece/src/sentencepiece_processor.cc(823) [model_proto->ParseFromArray(serialized.data(), serialized.size())].
I've tried making some changes but still no luck.
Could you help? Thanks in advance!