I have a question. I suppose that if I have to fine-tune on my own text, I only need to extract sentences from my documents and it will work, right?
You can fine-tune the model created by my script exactly the same way as you would fine-tune any other seq2seq transformer.
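For illustration, here is a minimal sketch of what that looks like with the standard Hugging Face API (the local path my-reduced-t5 and the toy sentence pair are placeholders, not part of my script):

from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("my-reduced-t5")   # the pruned model saved earlier
tokenizer = T5Tokenizer.from_pretrained("my-reduced-t5")

inputs = tokenizer("paraphrase: I have a question about fine-tuning.", return_tensors="pt")
labels = tokenizer("I would like to ask something about fine-tuning.", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # standard seq2seq training loss
loss.backward()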
Quick question, how could I feed the BIO format of the sentences as part of the input fed to the model?
@AimeM250 it depends.
Normally, BIO format is used for sequence tagging problems (such as named entity recognition). But for these problems, an encoder-decoder architecture such as T5 is redundant; encoder-only models (such as BERT) are usually enough.
What exactly are you trying to achieve?
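For context, a typical BIO-style tagging setup with an encoder-only model looks roughly like this (the checkpoint and label set are just placeholders):

from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER"]   # replace with your own BIO tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)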
Awesome work, man. It helps a ton! Thanks for your contribution to the community.
I have a question. Why do you get the model with T5ForConditionalGeneration() instead of MT5ForConditionalGeneration()?
Would there be any difference to get the model using MT5ForConditionalGeneration() ?
Because when I finished creating the mT5 model for Spanish, the model.config states that the architecture is MT5ForConditionalGeneration, while yours states that it is T5ForConditionalGeneration. I don't know if it's something I should just ignore, or if I messed up somewhere.
Hi @EDF99,
I never encountered any problems by using MT5ForConditionalGeneration and T5ForConditionalGeneration interchangeably.
The comments in the MT5 source code (https://github.com/huggingface/transformers/blob/v4.30.0/src/transformers/models/mt5/modeling_mt5.py) suggest that most of it was copy-pasted from T5.
This doesn't guarantee that they are 100% compatible, but I have an impression that they are.
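If you want to double-check on your side, a quick sanity test (a sketch I haven't verified end to end, using google/mt5-small as an example) is to load the same checkpoint with both classes and compare the outputs:

import torch
from transformers import T5ForConditionalGeneration, MT5ForConditionalGeneration

model_a = T5ForConditionalGeneration.from_pretrained("google/mt5-small")
model_b = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

ids = torch.tensor([[250, 251, 1]])   # arbitrary token ids, just for comparison
out_a = model_a(input_ids=ids, decoder_input_ids=ids).logits
out_b = model_b(input_ids=ids, decoder_input_ids=ids).logits
print(torch.allclose(out_a, out_b))   # True if the two classes behave identically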
Thanks man @avidale!
Hi, thanks a lot for this, it's super helpful and I've managed to get it to work well.
One question: do you have any idea how could I get this to work with XLM-R? They both use sentencepiece tokenizers, but I can't seem to get it to work.
All of my attempts so far have resulted in an "unknown error" on HF's inference API, and an "index out of range in self" error when I try to use them myself.
@ozzieandthestraw you are talking about dropping tokens in models like https://huggingface.co/xlm-roberta-base, right? I could try adapting my code for this.
@avidale yep, that's the one, thank you. I think the biggest issue is with updating the embeddings, the rest of your code seems to work fine.
@ozzieandthestraw Ok, here is my notebook for updating XLM-Roberta: https://colab.research.google.com/drive/1f-n3zBQjmtMrp7oHzvunHPSC5aIMNe_N?usp=sharing.
The principle is the same as the notebook above. A nice difference is that we no longer need to compile sentencepiece_model.proto manually; nowadays the required objects are already included in the sentencepiece distribution.
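Concretely, the protobuf classes can now be imported directly (the file name below is the usual XLM-R tokenizer file; adjust it to wherever yours is saved):

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("sentencepiece.bpe.model", "rb") as f:
    m.ParseFromString(f.read())
print(len(m.pieces))   # vocabulary size before pruning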
@avidale thank you so much! That works excellently, you are a life saver.
One thing: I changed the line tokenizer_new = XLMRobertaTokenizer.from_pretrained('tmp_tokenizer') to tokenizer_new = XLMRobertaTokenizer.from_pretrained('tmp_tokenizer/sentencepiece.bpe.model').
I am trying to fine-tune a model that works like ChatGPT for the Punjabi language, using mt5-base. However, I am not sure if I should go ahead with it, since it does not even generate text: when I try to use it, I just get a response like <extra_id_0>. I have checked the tokenizers and they work fine with Punjabi. Can anyone please tell me how to go about this?
The dataset I will be using is an instruction-following dataset in the format of Alpaca, and it is of high quality.
I have tried fine-tuning indic-gpt before, but it has a very small context size, i.e. 1024 tokens, so I changed my base model.
Thanks in advance!
Hi @rukaiya-rk-24 !
I have never fine-tuned ChatGPT-like models and I know very little about Punjabi, so I cannot tell you for sure. But what I know is:
- MT5 probably hasn’t seen Punjabi at all during pretraining (or maybe only a little bit of it that got mixed into other languages by accident). The list of languages that MT5 has seen is at https://huggingface.co/google/mt5-base.
- MT5 wasn't pretrained to generate full sentences at all, only to predict missing tokens. Thus, to make it generate long coherent answers, like ChatGPT, you need to train it for a very long time on a very big, diverse dataset. I tried to do this with MT5 for Russian, which is one of its pretraining languages (so it should be easier), but the results were still far from perfect.
So instead of mT5, I would recommend using a model that (1) has been pretrained with Punjabi as one of its languages, and (2) has been pretrained with the autoregressive language modelling task (the one that GPT models also use), so that it can already generate fluent texts. One model that fulfils these criteria is BLOOM, so I suggest picking the largest of the BLOOM models from https://huggingface.co/bigscience?sort_models=likes#models that fits into your memory during fine-tuning (e.g. the 1B version).
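A minimal sketch of what I mean (bigscience/bloom-1b1 is just one example of a smaller BLOOM checkpoint; pick whatever fits your GPU):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1")

prompt = "ਸਤ ਸ੍ਰੀ ਅਕਾਲ"   # a Punjabi greeting, just to sanity-check that the model can generate the script
out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))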
Thank you so much for the help!
Hi @avidale, I'm trying to run sentiment classification on a Dutch dataset, using the tokenizer as:
tokenizer = T5TokenizerFast.from_pretrained('yhavinga/t5-base-dutch')
and the following arguments for training:
model_name_or_path="yhavinga/t5-base-dutch",
tokenizer_name_or_path="https://huggingface.co/yhavinga/t5-base-dutch/blob/main/tokenizer.json"
When trying to train the model, I get the following error:
170 def LoadFromFile(self, arg):
--> 171 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: /tmp/pip-install-ilecn6h0/sentencepiece_3c5f89f9146b4090a668d0f42db59389/bundled/sentencepiece/src/sentencepiece_processor.cc(823) [model_proto->ParseFromArray(serialized.data(), serialized.size())].
I have tried the following changes, but still no luck:
- Changed the versions of transformers (currently v4.35.2), tokenizers (currently 0.15.0), and sentencepiece.
- Changed T5TokenizerFast to AutoTokenizer, but the issue is still persistent.
- Tried running it on an English dataset, which works fine; but whenever I change the tokenizer, model_name_or_path and tokenizer_name_or_path, I face the issue above.
Could you help? Thanks in advance!
When you say "arguments for training", where exactly do you use them? Are you using a huggingface trainer or something else?
If you give me a minimal example of code that can reproduce your problem, it would be easier for me to help.
My first guess is that you should replace
tokenizer_name_or_path="https://huggingface.co/yhavinga/t5-base-dutch/blob/main/tokenizer.json"
with simply
tokenizer_name_or_path="yhavinga/t5-base-dutch"
but without more context, I cannot be sure.
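If it helps, the minimal thing I would try first (just a sketch, separate from your trainer code) is to check that the tokenizer loads and encodes correctly when given the repository ID:

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("yhavinga/t5-base-dutch")
print(tokenizer("Dit is een test.").input_ids)   # should tokenize without errors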
Thanks for your response @avidale. Yes, I'm using a HF trainer and those arguments are for it. I made the change you suggested, but am now getting a different error, shown below:
-> 310 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
311
312 def _EncodeAsIds(self, text, enable_sampling, nbest_size, alpha, add_bos, add_eos, reverse, emit_unk_piece):
TypeError: not a string
I've been following this notebook ("https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb#scrollTo=hcKmeIGiI582"). Appreciate the help!
Now it looks like a problem with incorrect input.
But again, without knowing the exact code that led to the error, I cannot say for sure.
Would this gist link help? https://gist.github.com/Sandeep0408/236b164cb09408c920aedb15d5c7e984
If not, I can give you access to the Colab notebook via mail. Thanks!
Hello, I would like to know what version of Python you are using. I saved the model as model.safetensors instead of pytorch_model.bin; do you have any solution? Thank you very much.
Should this work with XLMRobertaModel, like e5-large? Or is something fundamentally different being used there? It didn't work out for me.
As far as I can judge from the HF documentation, XLMRobertaTokenizer is based on SentencePiece, just like T5Tokenizer. Thus, in principle, the approach should work; I don't see any fundamental reasons why it wouldn't.
Nevertheless, the specific details, such as model parameter names, tokenizer parameter names, special tokens etc. may differ between T5 and XLMRoberta, so my code will surely need some adaptation to work with E5.
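Roughly, the part that needs adapting is where the embedding matrix is rebuilt to match the pruned vocabulary. For an XLM-R-based encoder like E5 it would look something like this (an untested sketch; pruned_sentencepiece.bpe.model and kept_ids are placeholders for the pruned tokenizer file and the old-to-new id mapping you build while pruning):

import torch
from transformers import AutoModel, XLMRobertaTokenizer

model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")
tokenizer = XLMRobertaTokenizer("pruned_sentencepiece.bpe.model")   # hypothetical pruned tokenizer file

kept_ids = list(range(30000))   # placeholder: old vocabulary ids to keep, in the new order

old_emb = model.embeddings.word_embeddings.weight.data
new_emb = old_emb[kept_ids].clone()   # rows of the old embedding matrix, reordered to the new vocabulary
model.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(
    new_emb, freeze=False, padding_idx=tokenizer.pad_token_id
)
model.config.vocab_size = new_emb.shape[0]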
Hi @avidale. Hope you are doing great! Great work, indeed! I have implemented the above code for the Urdu language and want to fine-tune it for text summarisation. I just wanted to ask about the arguments that need to be passed while training (for example, learning rate, weight decay etc.). Given below are the training args I used for fine-tuning the mt5-base model on an Urdu dataset. What would you suggest for the argument values for the reduced model (39.9% of the parameters of the original model)? That would be a great help.
from transformers import Seq2SeqTrainingArguments
batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
output_dir=f"{model_name}-finetuned_urdu_mt5-base",
evaluation_strategy="epoch",
learning_rate=5.6e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
weight_decay=0.01,
save_total_limit=3,
num_train_epochs=num_train_epochs,
predict_with_generate=True,
logging_steps=logging_steps,
push_to_hub=True,
)
I tried using the same args on the reduced model and it overfit.
Hi @MuaazAnsari!
I have two comments on your question:
- I don't expect the optimal hyperparameter values to depend on whether the model vocabulary has been pruned or not.
- Unfortunately, they do depend on the dataset that you train on (including the input and output lengths, dataset size, and task difficulty) and your hardware (e.g. if your GPU memory is limited, you'd have to decrease batch size, but then you might want to decrease learning rate or increase gradient accumulation to compensate). Thus, I cannot suggest parameters that would be universally good.
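Just as an illustration of the batch-size trade-off (these numbers are examples, not a recommendation for your dataset):

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="mt5-reduced-urdu-summarization",   # placeholder name
    learning_rate=5.6e-5,
    per_device_train_batch_size=4,        # smaller batch if GPU memory is tight...
    gradient_accumulation_steps=2,        # ...compensated by accumulating gradients (effective batch of 8)
    weight_decay=0.01,
    num_train_epochs=4,                   # fewer epochs is one simple remedy against overfitting
    evaluation_strategy="epoch",
    predict_with_generate=True,
)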
Nice work