{
"cells": [
{
"cell_type": "markdown",
"id": "517bfefb",
"metadata": {},
"source": [
"# Fine Tune Text Summarizer With Hugging Face\n", | |
"\n", | |
"Trying to adapt and follow: https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb" | |
]
},
{
"cell_type": "markdown",
"id": "fd8916c5",
"metadata": {},
"source": [
"This will fine tune a model to summarize GitHub Issues from the GitHub repo fastai/fastai" | |
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a79d72d2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: ghapi in /opt/conda/lib/python3.7/site-packages (0.1.16)\n",
"Requirement already satisfied: pip in /opt/conda/lib/python3.7/site-packages (from ghapi) (21.0.1)\n",
"Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from ghapi) (20.9)\n",
"Requirement already satisfied: fastcore in /opt/conda/lib/python3.7/site-packages (from ghapi) (1.3.20)\n",
"Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging->ghapi) (2.4.7)\n"
]
}
],
"source": [
"! pip install ghapi"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9ad38913",
"metadata": {},
"outputs": [],
"source": [
"from ghapi.core import GhApi\n",
"from ghapi.all import github_token, paged\n",
"import os, pickle\n",
"from fastcore.all import L"
]
},
{
"cell_type": "markdown",
"id": "7a63efb9",
"metadata": {},
"source": [
"# Get the Data"
]
},
{
"cell_type": "markdown",
"id": "df5d86c7",
"metadata": {},
"source": [
"Uncomment this block and run it if this is the first time running this notebook. You need to have a [personal access token](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token) in an environment variable named `GH_PAT`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "31e3425e",
"metadata": {},
"outputs": [],
"source": [
"# api = GhApi(owner='fastai', repo='fastai', token=os.getenv('GH_PAT'))\n",
"# issues = L(paged(api.issues.list_for_repo, state='all')).concat()\n",
"# pickle.dump(issues, open( \"issues.p\", \"wb\" ) )\n",
"# len(issues)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "37a575f5",
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_metric\n",
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer\n",
"\n",
"model_checkpoint = \"t5-small\"\n",
"metric = load_metric(\"rouge\")\n",
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d9246aae",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n"
]
}
],
"source": [
"print(model.training)"
]
},
{
"cell_type": "markdown",
"id": "ba1048e4",
"metadata": {},
"source": [
"## Process The Data"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "bbbd537f",
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"issues = pickle.load(open( \"issues.p\", \"rb\" ))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e9f1504e",
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"(#2) [{'body': \"First reported bug here:\\r\\nhttps://forums.fast.ai/t/bug-learn-summary-does-not-work-on-2nd-transfer-learning/77897\\r\\n\\r\\n**tldr:** learn.summary() crashes out with the following summary when doing a 2nd cycle of transfer learning\\r\\n```\\r\\n 57 elif val <= self.first_its or val >= self.last_v + self.wait_for or val >= self.total:\\r\\n 58 cur_t = time.time()\\r\\n 59 avg_t = (cur_t - self.start_t) / val\\r\\n 60 self.wait_for = max(int(self.update_every / (avg_t+1e-8)),1)\\r\\n 61 self.pred_t = avg_t * self.total\\r\\n \\r\\n AttributeError: 'NBProgressBar' object has no attribute 'start_t'\\r\\n```\\r\\n\\r\\nWhen I revert back the fastai code (in the fastai2 repo to commit: 59d878d3cf233ea24eb8fd8987098f17edd8c8ef) this crash goes away. I've isolated it to the subsequent commit d9ed4a8337bab36d3680fd787494e83ebd2f9a4b, with refactor learn.summary() that causes this error.\\r\\n\\r\\n\\r\\n\", 'title': 'learn.summary() crashes out on 2nd transfer learning'},{'body': '`get_cv_idxs` returns empty list when `val_pct` specified is > 0.25. This is because of the `cv_idx` parameter is defaulted to 4. More details - http://forums.fast.ai/t/understanding-get-val-idxs/8148\\r\\n\\r\\nFix is based on suggestion by Adrian Galdran in the same forum post.\\r\\n\\r\\nTesting:\\r\\nCurrent Status:\\r\\n<img width=\"755\" alt=\"screen shot 2017-11-21 at 11 50 45 am\" src=\"https://user-images.githubusercontent.com/1437573/33093767-c8dd355a-ceb2-11e7-9f71-172607738106.png\">\\r\\n\\r\\nFix:\\r\\n<img width=\"702\" alt=\"screen shot 2017-11-21 at 11 47 47 am\" src=\"https://user-images.githubusercontent.com/1437573/33093781-d3d1bfd0-ceb2-11e7-91e9-16ab6a60b1d2.png\">\\r\\n', 'title': 'Bug: get_cv_idxs error when val_pct > 0.3'}]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pairs = (issues\n",
"        .filter(lambda x: x.body and x.title and len(x.body) > 10 and len(x.title) > 5)\n",
"        .map(lambda x: {'body':x.body, 'title':x.title}).shuffle()\n",
"        )\n",
"pairs[:2]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0dddd189",
"metadata": {},
"outputs": [],
"source": [
"train_pairs = pairs[:2800]\n",
"eval_pairs = pairs[2800:]"
]
},
{
"cell_type": "markdown",
"id": "a50130fd",
"metadata": {},
"source": [
"### Tokenize"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a03390bd",
"metadata": {},
"outputs": [],
"source": [
"prefix = \"summarize: \"\n",
"\n",
"max_input_length = 1024\n",
"max_target_length = 128\n",
"\n",
"def preprocess_function(examples):\n",
"    inputs = [prefix + doc for doc in examples.map(lambda x: x['body'])]\n",
"    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)\n",
"\n",
"    # Setup the tokenizer for targets\n",
"    with tokenizer.as_target_tokenizer():\n",
"        labels = tokenizer(list(examples.map(lambda x: x['title'])), max_length=max_target_length, truncation=True)\n",
"\n",
"    model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
"    return model_inputs"
]
},
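{
"cell_type": "markdown",
"id": "c1a7e001",
"metadata": {},
"source": [
"A quick, purely illustrative sanity check (the variable `sample` below is just a throwaway name): run `preprocess_function` on a couple of training pairs and decode the results to confirm the `summarize: ` prefix is present and the inputs and labels tokenize as expected."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1a7e002",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sanity check: preprocess two examples and decode them back\n",
"sample = preprocess_function(train_pairs[:2])\n",
"print(len(sample['input_ids']), len(sample['labels']))\n",
"print(tokenizer.decode(sample['input_ids'][0][:20]))\n",
"print(tokenizer.decode(sample['labels'][0]))"
]
},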
{
"cell_type": "code",
"execution_count": 10,
"id": "d5c23c41",
"metadata": {},
"outputs": [],
"source": [
"from torch.utils.data import Dataset\n",
"\n",
"class GH_Issues(Dataset):\n",
"    def __init__(self, data): self.data = data\n",
"\n",
"    def __len__(self):\n",
"        # `data` is a tokenizer BatchEncoding, so len(self.data) would count its\n",
"        # keys rather than its examples; count the encoded inputs instead.\n",
"        return len(self.data['input_ids'])\n",
"\n",
"    def __getitem__(self, idx):\n",
"        return {x: self.data[x][idx] for x in ['input_ids', 'attention_mask', 'labels']}"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0bf442c8",
"metadata": {},
"outputs": [],
"source": [
"tokenized_datasets = {}\n",
"tokenized_datasets['train'] = GH_Issues(preprocess_function(train_pairs))\n",
"tokenized_datasets['eval'] = GH_Issues(preprocess_function(eval_pairs))"
]
},
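{
"cell_type": "markdown",
"id": "c1a7e003",
"metadata": {},
"source": [
"Another small illustrative check (not required for training): the wrapped datasets should report one entry per issue, and indexing them should return a plain dict of `input_ids`, `attention_mask`, and `labels` for a single example; padding is left to the collator defined below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1a7e004",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: dataset sizes and the structure of a single item\n",
"print(len(tokenized_datasets['train']), len(tokenized_datasets['eval']))\n",
"item = tokenized_datasets['train'][0]\n",
"print(list(item.keys()), len(item['input_ids']), len(item['labels']))"
]
},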
{
"cell_type": "markdown",
"id": "b027be98",
"metadata": {},
"source": [
"## Fine Tuning The Model With Trainer"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "b8a62dda",
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"import numpy as np\n",
"\n",
"def compute_metrics(eval_pred):\n",
"    predictions, labels = eval_pred\n",
"    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)\n",
"    # Replace -100 in the labels as we can't decode them.\n",
"    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)\n",
"    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
"    \n",
"    # Rouge expects a newline after each sentence\n",
"    decoded_preds = [\"\\n\".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]\n",
"    decoded_labels = [\"\\n\".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]\n",
"    \n",
"    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)\n",
"    # Extract a few results\n",
"    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}\n",
"    \n",
"    # Add mean generated length\n",
"    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]\n",
"    result[\"gen_len\"] = np.mean(prediction_lens)\n",
"    \n",
"    return {k: round(v, 4) for k, v in result.items()}"
]
},
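{
"cell_type": "markdown",
"id": "c1a7e005",
"metadata": {},
"source": [
"`compute_metrics` uses `nltk.sent_tokenize`, which needs NLTK's `punkt` models. If they are not already present in the environment (an assumption about this particular setup), download them once before evaluation runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1a7e006",
"metadata": {},
"outputs": [],
"source": [
"# sent_tokenize depends on the punkt tokenizer models; fetch them if missing.\n",
"nltk.download('punkt', quiet=True)"
]
},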
{
"cell_type": "code",
"execution_count": 13,
"id": "daf1f968",
"metadata": {},
"outputs": [],
"source": [
"batch_size = 16\n",
"args = Seq2SeqTrainingArguments(\n",
"    \"test-summarization\",\n",
"    evaluation_strategy = \"epoch\",\n",
"    learning_rate=2e-5,\n",
"    per_device_train_batch_size=batch_size,\n",
"    per_device_eval_batch_size=batch_size,\n",
"    weight_decay=0.01,\n",
"    save_total_limit=3,\n",
"    num_train_epochs=5,\n",
"    predict_with_generate=True,\n",
"    fp16=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "954a8422",
"metadata": {},
"outputs": [],
"source": [
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)"
]
},
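{
"cell_type": "markdown",
"id": "c1a7e007",
"metadata": {},
"source": [
"To see what the collator actually feeds the model, it can be called directly on a couple of dataset items (an illustrative peek, not needed for training): it pads `input_ids` and `attention_mask`, pads `labels` with -100 so those positions are ignored by the loss, and, because `model` was passed in, also builds `decoder_input_ids` via `prepare_decoder_input_ids_from_labels`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1a7e008",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative: collate two examples into a padded batch and inspect the shapes\n",
"features = [tokenized_datasets['train'][i] for i in range(2)]\n",
"batch = data_collator(features)\n",
"print({k: v.shape for k, v in batch.items()})"
]
},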
{
"cell_type": "code",
"execution_count": 15,
"id": "73e015ba",
"metadata": {},
"outputs": [],
"source": [
"trainer = Seq2SeqTrainer(\n",
"    model,\n",
"    args,\n",
"    train_dataset=tokenized_datasets[\"train\"],\n",
"    eval_dataset=tokenized_datasets[\"eval\"],\n",
"    data_collator=data_collator,\n",
"    tokenizer=tokenizer,\n",
"    compute_metrics=compute_metrics\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e16b15ca",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
" warnings.warn('Was asked to gather along dimension 0, but all '\n",
"/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate\n",
" \"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate\", UserWarning)\n"
]
},
{
"data": {
"text/html": [
"\n",
" <div>\n",
" <style>\n",
" /* Turns off some styling */\n",
" progress {\n",
" /* gets rid of default border in Firefox and Opera. */\n",
" border: none;\n",
" /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
" background-size: auto;\n",
" }\n",
" </style>\n",
" \n",
" <progress value='5' max='5' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
" [5/5 00:04, Epoch 5/5]\n",
" </div>\n",
" <table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>Epoch</th>\n",
" <th>Training Loss</th>\n",
" <th>Validation Loss</th>\n",
" <th>Rouge1</th>\n",
" <th>Rouge2</th>\n",
" <th>Rougel</th>\n",
" <th>Rougelsum</th>\n",
" <th>Gen Len</th>\n",
" <th>Runtime</th>\n",
" <th>Samples Per Second</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>No log</td>\n",
" <td>4.610332</td>\n",
" <td>11.000000</td>\n",
" <td>0.000000</td>\n",
" <td>11.000000</td>\n",
" <td>11.000000</td>\n",
" <td>19.000000</td>\n",
" <td>0.723500</td>\n",
" <td>4.146000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>No log</td>\n",
" <td>4.610332</td>\n",
" <td>11.000000</td>\n",
" <td>0.000000</td>\n",
" <td>11.000000</td>\n",
" <td>11.000000</td>\n",
" <td>19.000000</td>\n",
" <td>0.673000</td>\n",
" <td>4.458000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>No log</td>\n",
" <td>4.610332</td>\n",
" <td>11.000000</td>\n",
" <td>0.000000</td>\n",
" <td>11.000000</td>\n",
" <td>11.000000</td>\n",
" <td>19.000000</td>\n",
" <td>0.682800</td>\n",
" <td>4.394000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>No log</td>\n",
" <td>4.610332</td>\n",
" <td>11.000000</td>\n",
" <td>0.000000</td>\n",
" <td>11.000000</td>\n",
" <td>11.000000</td>\n",
" <td>19.000000</td>\n",
" <td>0.686600</td>\n",
" <td>4.370000</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>No log</td>\n",
" <td>4.610412</td>\n",
" <td>11.000000</td>\n",
" <td>0.000000</td>\n",
" <td>11.000000</td>\n",
" <td>11.000000</td>\n",
" <td>19.000000</td>\n",
" <td>0.688400</td>\n",
" <td>4.358000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table><p>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"TrainOutput(global_step=5, training_loss=3.04730167388916, metrics={'train_runtime': 7.7661, 'train_samples_per_second': 0.644, 'total_flos': 2052989752320.0, 'epoch': 5.0, 'init_mem_cpu_alloc_delta': 2618056704, 'init_mem_gpu_alloc_delta': 242026496, 'init_mem_cpu_peaked_delta': 164737024, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1319022592, 'train_mem_gpu_alloc_delta': 728832512, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 157294080})"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.train()"
]
},
{
"cell_type": "markdown",
"id": "638ab856",
"metadata": {},
"source": [
"# Questions\n", | |
"\n", | |
"- In the training log, why does `Training Loss` report \"No log\" ? Is this a bug?\n", | |
"- We don't need to enable training before using the trainer? Does the trainer automatically enable training mode, and then disable training mode when its done? While browsing the code for Trainer, it appears that the attribute `is_in_train` is set to True and then later set to False at the end of training, but I am not 100% sure. \n", | |
"- Is there a way to grab the recommended Training arguments for the model I'm using from the Hub rather that manually specifying them myself using `Seq2SeqTrainingArguments`? It seems that the defaults for `Seq2SeqTrainingArguments` are not model-specific, but perhaps there is a way to get this? \n", | |
"- Similarly is there a way to automatically grab the metrics or recommended metrics to use that were trained the model just to ensure consistency from the model hub? \n", | |
"- Why is `model` passed into DataCollatorForSeq2Seq? I can see from the code that it is using the model's `prepare_decoder_input_ids_from_labels` attribute, but isn't that something that would/should also be available in a tokenizer instead? I'm just trying to build a better mental model of what is happening. \n", | |
"- Is there a way to create my own pipeline object similar to the high level magic you have for pretrained models? Like is there a way I can leverage some of the same machinery you have so I don't have to create my own inf? I know I can create it myself, but wondering if there is a way to leverage something you already have so I don't have to build a thing that takes strings, tokenizes, numericalizes, does a forward pass, then decodes that back into a string with beam search etc. " | |
]
},
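{
"cell_type": "markdown",
"id": "c1a7e009",
"metadata": {},
"source": [
"For the last question, here is a rough sketch of the kind of thing I mean, assuming the generic `pipeline` factory accepts the in-memory `model` and `tokenizer` objects from above (rather than a Hub checkpoint name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1a7e010",
"metadata": {},
"outputs": [],
"source": [
"from transformers import pipeline\n",
"\n",
"# Sketch: wrap the fine-tuned model and tokenizer in the stock summarization pipeline\n",
"# instead of hand-rolling tokenize -> generate -> decode with beam search.\n",
"summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)\n",
"summarizer(eval_pairs[0]['body'][:1000], max_length=max_target_length)"
]
},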
{
"cell_type": "code",
"execution_count": null,
"id": "1ec9eef8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
} |