Last active September 12, 2019 07:38
Fine-tunes Bert language model on Google Colab
import os
import pandas as pd
from tqdm import tqdm

## Combine train and test reviews into one corpus for LM fine-tuning.
## train_df, test_df, and directory_path are assumed to be defined earlier in the notebook.
lm_df = pd.concat([train_df[['review']], test_df[['review']]])
lm_df.review = lm_df.review.str.lower()
tqdm.pandas()
## pregenerate_training_data.py expects one sentence per line, with a blank
## line between documents; here each review is written as a one-line document.
changed_text = lm_df.review.apply(lambda x: x + "\n" + "\n")
with open(os.path.join(directory_path, 'data_lm.txt'), "w") as f:
    f.write(''.join(changed_text))
## Convert the raw text into BERT's pregenerated training format.
!python3 pregenerate_training_data.py --train_corpus data_lm.txt --bert_model bert-base-uncased --do_lower_case --output_dir training/ --epochs_to_generate 2 --max_seq_len 256
## Fine-tune the BERT language model on the pregenerated data.
!python3 finetune_on_pregenerated.py --pregenerated_data training/ --bert_model bert-base-uncased --do_lower_case --train_batch_size 16 --output_dir finetuned_lm/ --epochs 2
## The fine-tuned model is saved to finetuned_lm/ and is ready to be used.
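The corpus-formatting step above can be sketched on toy data; the reviews below are hypothetical, not from the actual dataset:

```python
import pandas as pd

# Lowercase each review and terminate it with a blank line, so that the
# pregeneration script treats each review as its own single-line document.
df = pd.DataFrame({'review': ['Great movie!', 'Terrible plot.']})
corpus = ''.join(df.review.str.lower().apply(lambda x: x + "\n\n"))
print(repr(corpus))  # 'great movie!\n\nterrible plot.\n\n'
```

Writing this string to data_lm.txt yields one review per line with blank lines between them, which is the layout the pregeneration step consumes.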