Last active
September 23, 2022 09:43
text_processing_lstm.ipynb
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/Createdd/69da98fe885d034fc459f62922d5ba72/text_processing_lstm.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Disclaimer\n", | |
"\n", | |
"I wrote this during the course of my studies within the AI Master's degree at the JKU Linz in Austria. \n", | |
"\n", | |
"This was one of the exercises that we needed to work on. However, due to copyright I re-wrote the instructions. Nevertheless, I want to give credit to the institute for coming up with the idea of this assignment within the study program. Feel free to check out this [study program](https://www.jku.at/en/degree-programs/types-of-degree-programs/masters-degree-programs/ma-artificial-intelligence/) if you are interested. I think it is an amazing program. \n", | |
"\n", | |
"I am not associated with the institute and this does not reflect the quality of the study program. These are also only parts of the original exercise. The solutions are my own elaborations and not necessarily the best possible ones on this topic. \n", | |
"\n", | |
"\n", | |
"---\n", | |
"\n", | |
"\n", | |
"Furthermore, I will not provide the text dataset that was used. However, I can say that it is quite easy to get Trump speech text. \n", | |
"\n", | |
"\n", | |
"--- \n" | |
], | |
"metadata": { | |
"id": "b83dOFUNDxWv" | |
}, | |
"id": "b83dOFUNDxWv" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Introduction -> Text processing with LSTM and PyTorch\n", | |
"\n", | |
"The goal is to train an LSTM model to generate some text.\n", | |
"\n", | |
"To do that, we need a representation of text that the network can learn from. Here we use character embeddings. We need to \n", | |
"- define an alphabet (a set of characters)\n", | |
"- represent each character by its position (index) in the alphabet \n", | |
"- let the neural network learn an embedding vector (a set of weights) for each position\n", | |
"\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "eiTxHpMT_1Gm" | |
}, | |
"id": "eiTxHpMT_1Gm" | |
}, | |
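The first two steps of the list above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the mini-alphabet and the helper names `to_indices` and `to_text` are made up for this sketch and are not part of the exercise. The third step (learned weights per position) needs a trainable embedding layer and is left out here.

```python
# Hypothetical mini-alphabet for illustration; the notebook uses a larger one.
alphabet = "abc .!"

# Position of each character in the alphabet, and the reverse lookup.
char_to_index = {char: index for index, char in enumerate(alphabet)}
index_to_char = {index: char for index, char in enumerate(alphabet)}


def to_indices(text):
    """Map text to alphabet positions, skipping characters outside the alphabet."""
    return [char_to_index[char] for char in text if char in char_to_index]


def to_text(indices):
    """Map alphabet positions back to characters."""
    return "".join(index_to_char[index] for index in indices)


indices = to_indices("a cab!")
print(indices)
print(to_text(indices))
```

Characters that are not in the alphabet are silently dropped in this sketch, so a round trip only reproduces the input if every character is known.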
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## 1. Encode Characters\n", | |
"\n", | |
"Encode characters as tensors of alphabet indices and decode such tensors back to text." | |
], | |
"metadata": { | |
"id": "lYeW5oLz_x2W" | |
}, | |
"id": "lYeW5oLz_x2W" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "17f16ae5", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "e858a461-029f-435c-ed51-c186f50bfef8" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"cuda:0 True\n" | |
] | |
} | |
], | |
"source": [ | |
"import torch\n", | |
"\n", | |
"ALL_CHARS_NUMS = 'abcdefghijklmnopqrstuvwxyz0123456789 .!?'\n", | |
"\n", | |
"# Use the GPU if one is available, otherwise fall back to the CPU\n", | |
"device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", | |
"print(device, torch.cuda.is_available())\n", | |
"\n", | |
"class Encoder:\n", | |
"    def __init__(self, all_chars):\n", | |
"        # lookups: character -> index and index -> character\n", | |
"        self.encode_map = {char: i for i, char in enumerate(all_chars)}\n", | |
"        self.decode_map = {i: char for i, char in enumerate(all_chars)}\n", | |
"\n", | |
"    def __call__(self, input):\n", | |
"        # A string is encoded. Characters that are not in the alphabet are removed.\n", | |
"        if isinstance(input, str):\n", | |
"            encoded = []\n", | |
"            for char in input:\n", | |
"                if char in self.encode_map:\n", | |
"                    encoded.append(self.encode_map[char])\n", | |
"\n", | |
"            # the indices must be of dtype long, otherwise the embedding lookup and the loss raise errors later on\n", | |
"            encode_res = torch.tensor(encoded, dtype=torch.long).to(device)\n", | |
"\n", | |
"            return encode_res\n", | |
"\n", | |
"        # Any other argument is treated as a torch.Tensor of indices and is decoded,\n", | |
"        # i.e. the method returns the string representation of the input.\n", | |
"        else:\n", | |
"            string_rep = ''\n", | |
"            for i in input:\n", | |
"                char = self.decode_map[i.item()]\n", | |
"                string_rep += char\n", | |
"\n", | |
"            return string_rep" | |
], | |
"id": "17f16ae5" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "wTQG93vJ4xs7", | |
"outputId": "cb32c65c-af3c-4653-ad7d-68d6444f4bd5" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Encode ab22a = tensor([ 0, 1, 28, 28, 0], device='cuda:0')\n", | |
"Decode tensor([ 0, 1, 28, 28, 0], device='cuda:0') = ab22a\n" | |
] | |
} | |
], | |
"source": [ | |
"# test encoder class\n", | |
"encode = Encoder(ALL_CHARS_NUMS)\n", | |
"sample = 'ab22a'\n", | |
"test_encode = encode(sample)\n", | |
"test_decode = encode(test_encode)\n", | |
"\n", | |
"print(f'Encode {sample} = ', test_encode)\n", | |
"print(f'Decode {test_encode} = ', test_decode)\n", | |
"\n", | |
"if test_decode != sample:\n", | |
"    raise AssertionError('Encoder not working')" | |
], | |
"id": "wTQG93vJ4xs7" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## 2. Create Dataset from text file\n", | |
"\n", | |
"We build a dataset class that reads a text file, treats it as one long sequence of characters, encodes it, cuts it into sequences of a fixed length `l`, and returns the encoded sequence at a given index." | |
], | |
"metadata": { | |
"id": "NcsbGmhsEKJ-" | |
}, | |
"id": "NcsbGmhsEKJ-" | |
}, | |
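To make the chunking concrete, here is a minimal sketch in plain Python lists instead of tensors. The function name `split_into_sequences` is hypothetical, and I assume that a shorter trailing chunk is simply dropped so that all sequences have the same length.

```python
def split_into_sequences(items, sequence_length):
    """Cut items into consecutive chunks of equal length.

    A shorter chunk at the end is dropped (an assumption of this sketch),
    so every returned chunk has exactly sequence_length elements.
    """
    full_chunks = len(items) // sequence_length
    return [
        items[i * sequence_length:(i + 1) * sequence_length]
        for i in range(full_chunks)
    ]


encoded = list(range(10))  # stand-in for an encoded text of 10 characters
sequences = split_into_sequences(encoded, 4)
print(sequences)
print(len(sequences))
```

With 10 items and a sequence length of 4, two full sequences remain and the last two items are discarded.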
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "9df917ce" | |
}, | |
"outputs": [], | |
"source": [ | |
"from torch.utils.data import Dataset\n", | |
"\n", | |
"class TextDataset(Dataset):\n", | |
"    def __init__(self, path, l):\n", | |
"        self.l = l\n", | |
"        # load the text from the specified path and lower-case it\n", | |
"        with open(path, \"r\", encoding=\"utf-8\") as f:\n", | |
"            data_str = f.read().lower()\n", | |
"\n", | |
"        # encode the text with the previously defined Encoder\n", | |
"        encoder = Encoder(ALL_CHARS_NUMS)\n", | |
"        data = encoder(data_str).to(device)\n", | |
"\n", | |
"        # split into chunks of the pre-defined sequence length\n", | |
"        # torch.split returns the chunks as views of the original tensor\n", | |
"        data = torch.split(data, l)\n", | |
"\n", | |
"        # torch.stack concatenates the chunks along a new first dimension\n", | |
"        # the last chunk is dropped if it is shorter than l\n", | |
"        self.data = torch.stack(data[:-1]) if len(data[-1]) < l else torch.stack(data)\n", | |
"\n", | |
"    def __len__(self):\n", | |
"        return len(self.data)\n", | |
"\n", | |
"    def __getitem__(self, i):\n", | |
"        return self.data[i]" | |
], | |
"id": "9df917ce" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "ChxpwLlf68XD", | |
"outputId": "0ac66d84-c167-4892-d517-bda07f8cad98" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Encoded data sample:\n", | |
" tensor([18, 15, 4, 4, 2, 7, 36, 27, 37, 37, 37, 19, 7, 0, 13, 10, 36, 24,\n", | |
" 14, 20, 36, 18, 14, 36, 12, 20, 2, 7, 37, 36, 36, 19, 7, 0, 19, 18,\n", | |
" 36, 18, 14, 36, 13, 8, 2, 4, 37, 36, 36, 8, 18, 13, 19, 36, 7, 4,\n", | |
" 36, 0, 36, 6, 17, 4, 0, 19, 36, 6, 20, 24, 37, 36, 36, 7, 4, 36,\n", | |
" 3, 14, 4, 18, 13, 19, 36, 6, 4, 19, 36, 0, 36, 5, 0, 8, 17, 36,\n", | |
" 15, 17, 4, 18, 18, 36, 7, 4, 36, 3], device='cuda:0')\n", | |
"Sample text: speech 1...thank you so much. thats so nice. isnt he a great guy. he doesnt get a fair press he d\n", | |
"length of text 100\n", | |
"length of encoded 8422\n" | |
] | |
} | |
], | |
"source": [ | |
"# test encoded dataset\n", | |
"text_data_encoded = TextDataset('data/trump_train.txt', l=100)\n", | |
"print('Encoded data sample:\\n', text_data_encoded[0])\n", | |
"print(f'Sample text: {encode(text_data_encoded[0])}')\n", | |
"print(f'length of text {encode(text_data_encoded[0]).__len__()}')\n", | |
"print(f'length of encoded {text_data_encoded.__len__()}')" | |
], | |
"id": "ChxpwLlf68XD" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## 3. LSTM Model \n", | |
"\n", | |
"We create a module that consists of \n", | |
"- an embedding layer that maps the alphabet to the embeddings\n", | |
"- an LSTM layer that maps the embeddings to the hidden states\n", | |
"- a linear layer that maps the hidden states back to the alphabet\n", | |
"\n", | |
"The forward pass maps an input sequence of character indices to the logits for every position." | |
], | |
"metadata": { | |
"id": "eYM_Wr6Vx8OU" | |
}, | |
"id": "eYM_Wr6Vx8OU" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "459fe907" | |
}, | |
"outputs": [], | |
"source": [ | |
"import torch.nn as nn\n", | |
"import torch.nn.functional as F\n", | |
"\n", | |
"class NextCharLSTM(nn.Module):\n", | |
"    def __init__(self, alphabet_size, embedding_dim, hidden_dim):\n", | |
"        super().__init__()\n", | |
"\n", | |
"        # nn.Embedding is a lookup table that stores one embedding vector per index.\n", | |
"        # It is often used for word embeddings; here the indices are character positions in the alphabet.\n", | |
"        self.embeddings = nn.Embedding(alphabet_size, embedding_dim)\n", | |
"\n", | |
"        # By default nn.LSTM expects the batch dimension at index 1 (sequence first).\n", | |
"        # Our batches have the batch dimension first, so we set batch_first=True.\n", | |
"        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)\n", | |
"\n", | |
"        # linear layer that maps every hidden state to one logit per character\n", | |
"        self.linear = nn.Linear(hidden_dim, alphabet_size)\n", | |
"\n", | |
"        # move all layers to the GPU if one is available\n", | |
"        self.to(device)\n", | |
"\n", | |
"    def forward(self, inputs):\n", | |
"        embeds = self.embeddings(inputs)\n", | |
"\n", | |
"        # nn.LSTM returns output, (h_n, c_n); output holds the hidden state of every time step\n", | |
"        output, hidden = self.lstm(embeds)\n", | |
"\n", | |
"        # apply the linear layer to the hidden states of all time steps, not only the last one\n", | |
"        logits = self.linear(output)\n", | |
"        return logits" | |
], | |
"id": "459fe907" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## 4. We write an epoch function that\n", | |
"- validates the model if no optimizer is given\n", | |
"- trains the model in a many-to-many setting if an optimizer is given\n", | |
"- performs one training or validation step per mini-batch\n" | |
], | |
"metadata": { | |
"id": "lpoDSoyl0B7R" | |
}, | |
"id": "lpoDSoyl0B7R" | |
}, | |
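The difference between the many-to-many and the many-to-one setting comes down to how inputs and targets are cut out of one sequence. A small sketch on a plain string, with hypothetical helper names of my own choosing:

```python
def many_to_many_pair(sequence):
    """Every position predicts the character that follows it."""
    return sequence[:-1], sequence[1:]


def many_to_one_pair(sequence):
    """The whole prefix predicts only the final character."""
    return sequence[:-1], sequence[-1]


m2m_inputs, m2m_targets = many_to_many_pair("hello")
m2o_inputs, m2o_target = many_to_one_pair("hello")
print(m2m_inputs, "->", m2m_targets)
print(m2o_inputs, "->", m2o_target)
```

In both settings the input is the sequence without its last character. Only the targets differ: one target per position versus a single target for the whole sequence.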
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "13f33250" | |
}, | |
"outputs": [], | |
"source": [ | |
"from torch.utils.data import DataLoader\n", | |
"import numpy as np\n", | |
"\n", | |
"\n", | |
"# define an epoch function that depends on a data loader, the lstm model, an optimizer and\n", | |
"# a flag for the many-to-one setting (this is the bonus task further down)\n", | |
"\n", | |
"def epoch(data_loader, lstm_model, optimizer, many_to_one=False):\n", | |
"    # define loss function and the list of per-batch losses\n", | |
"    loss_function = torch.nn.CrossEntropyLoss()\n", | |
"    loss_function.to(device)\n", | |
"    batch_losses = []\n", | |
"\n", | |
"    # if an optimizer is given we train, otherwise we validate (as instructed)\n", | |
"    # train() puts the modules of the network in training mode, eval() in evaluation mode.\n", | |
"    # Layers such as dropout or batch normalization behave differently depending on the mode.\n", | |
"    train_mode = optimizer is not None\n", | |
"    if train_mode:\n", | |
"        lstm_model.train()\n", | |
"    else:\n", | |
"        lstm_model.eval()\n", | |
"\n", | |
"    for i, batch in enumerate(data_loader):\n", | |
"        # PyTorch accumulates gradients on subsequent backward passes,\n", | |
"        # so we set them to zero before the backpropagation of every mini-batch\n", | |
"        if train_mode:\n", | |
"            optimizer.zero_grad()\n", | |
"\n", | |
"        with torch.set_grad_enabled(train_mode):\n", | |
"            # the input is every character except the last one\n", | |
"            logits = lstm_model(batch[:, :-1])\n", | |
"\n", | |
"            # CrossEntropyLoss expects the class dimension at index 1 (see pytorch docs),\n", | |
"            # so torch.transpose swaps dimensions 1 and 2 of the logits.\n", | |
"            # many-to-one: only the last time step is used and the target is the last character\n", | |
"            # many-to-many: every time step is used and the target is the following character\n", | |
"            if many_to_one:\n", | |
"                x = torch.transpose(logits, 1, 2)[:, :, -1]\n", | |
"                y = batch[:, -1]\n", | |
"            else:\n", | |
"                x = torch.transpose(logits, 1, 2)\n", | |
"                y = batch[:, 1:]\n", | |
"\n", | |
"            loss = loss_function(x, y)\n", | |
"\n", | |
"        batch_losses.append(loss.item())\n", | |
"\n", | |
"        # backward() computes the gradients,\n", | |
"        # optimizer.step() performs a single optimization step (parameter update)\n", | |
"        if train_mode:\n", | |
"            loss.backward()\n", | |
"            optimizer.step()\n", | |
"\n", | |
"    all_losses = np.array(batch_losses)\n", | |
"    return all_losses" | |
], | |
"id": "13f33250" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## 5. Train model and visualize results\n", | |
"\n", | |
"Putting everything together with a pre-defined set of hyperparameters.\n", | |
"We will validate on a separate dataset to see if the learning actually generalizes well." | |
], | |
"metadata": { | |
"id": "t_N6JpfG5SYg" | |
}, | |
"id": "t_N6JpfG5SYg" | |
}, | |
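To judge the loss values during training, it helps to know what a model that has learned nothing would score. The sketch below assumes the loss is the natural-log cross-entropy (which is what `torch.nn.CrossEntropyLoss` computes) and an alphabet of 40 characters as defined above. The function names are hypothetical.

```python
import math


def uniform_cross_entropy(alphabet_size):
    """Cross-entropy in nats of guessing every character with equal probability."""
    return -math.log(1.0 / alphabet_size)


def perplexity(cross_entropy):
    """Effective number of equally likely choices that a loss value corresponds to."""
    return math.exp(cross_entropy)


baseline = uniform_cross_entropy(40)
print(f"uniform baseline loss: {baseline:.3f}")
print(f"perplexity at the baseline: {perplexity(baseline):.1f}")
```

A loss clearly below this baseline means the model does better than uniform guessing, and the perplexity gives a rough feeling for how many characters it is still undecided between.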
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "AQxsNd7WeYsy" | |
}, | |
"outputs": [], | |
"source": [ | |
"# To show the results of the learning we use the generate_text function that is developed further down in the notebook\n", | |
"# It shows whether the training makes sense not only in terms of decreasing loss but also in terms of the generated text\n", | |
"\n", | |
"####################################################\n", | |
"# Copied from Exercise 8\n", | |
"####################################################\n", | |
"from torch.distributions import Categorical\n", | |
"\n", | |
"def generate_text(seed_text, encoder, lstm_model, text_length, top_k_characters=1):\n", | |
"    # set up model for evaluation\n", | |
"    lstm_model.eval()\n", | |
"    result = encoder(seed_text.lower())\n", | |
"\n", | |
"    # disable grad computations\n", | |
"    with torch.no_grad():\n", | |
"        # append one sampled character per iteration\n", | |
"        for i in range(text_length):\n", | |
"            logits = lstm_model(result.view(1, -1))\n", | |
"            # use softmax for proper topk computation\n", | |
"            softmax = torch.nn.functional.softmax(logits, dim=2)\n", | |
"            topk = torch.topk(softmax, top_k_characters, 2)\n", | |
"\n", | |
"            # probability distribution over the top-k characters at the last position\n", | |
"            categorical = Categorical(topk.values[:, -1:])\n", | |
"            sample = categorical.sample()\n", | |
"\n", | |
"            # concatenate the sampled character to the result\n", | |
"            result = torch.cat((result, topk.indices[0, -1, sample].view(-1)))\n", | |
"\n", | |
"    return encoder(result)\n", | |
"####################################################" | |
], | |
"id": "AQxsNd7WeYsy" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "8987ae83", | |
"outputId": "88ce0236-1099-441d-ae73-b064d57a420d" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Initially:\n", | |
"-> america is n.n.bn1yn1n.yln11ny.y.lm1an.nb.nnbn1nnb1yl1n.y..ylnb..19lqmymymlal11y.nnb.y111919ymmabn.lqnyn1.n1ynl \n", | |
" ----------------------------------------------------------------------------------------------------\n" | |
] | |
} | |
], | |
"source": [ | |
"import matplotlib.pyplot as plt\n", | |
"import os\n", | |
"from tqdm import trange\n", | |
"\n", | |
"\n", | |
"# Set all parameters for training the LSTM network\n", | |
"sequence_length = 100\n", | |
"batch_size = 256\n", | |
"embedding_dim = 8\n", | |
"hidden_dim = 512\n", | |
"learning_rate = 1e-3\n", | |
"num_epochs = 50\n", | |
"\n", | |
"start_text = 'America is '\n", | |
"best_valid_loss = None\n", | |
"many_to_one = False\n", | |
"output_file = \"best_model_m2m.pt\"\n", | |
"encoder = Encoder(ALL_CHARS_NUMS)\n", | |
"\n", | |
"train_data = TextDataset(os.path.join('data/', 'trump_train.txt'), l=sequence_length)\n", | |
"valid_data = TextDataset(os.path.join('data/', 'trump_val.txt'), l=sequence_length)\n", | |
"train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)\n", | |
"valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=True)\n", | |
"\n", | |
"all_chars_nums_length = len(ALL_CHARS_NUMS)\n", | |
"m2m_model = NextCharLSTM(all_chars_nums_length, embedding_dim, hidden_dim).to(device)\n", | |
"optim = torch.optim.Adam(params=m2m_model.parameters(), lr=learning_rate)\n", | |
"\n", | |
"generated_text = generate_text(start_text, encoder, m2m_model, 100, 4)\n", | |
"print(f'Initially:\\n-> {generated_text} \\n {\"-\"*100}')" | |
], | |
"id": "8987ae83" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "LfTsh39YqCxG" | |
}, | |
"outputs": [], | |
"source": [ | |
"# Define training by model, corresponding loaders, optimizer, epochs, output file and the many-to-one flag\n", | |
"def train(model, train_loader, valid_loader, optimizer, num_epochs, output_file, many_to_one):\n", | |
"    train_losses = []\n", | |
"    valid_losses = []\n", | |
"\n", | |
"    for i in range(num_epochs):\n", | |
"        # calculate the mean losses of this epoch in the pre-defined setting\n", | |
"        train_loss = epoch(train_loader, model, optimizer, many_to_one)\n", | |
"        train_losses.append(np.mean(train_loss).item())\n", | |
"\n", | |
"        valid_loss = epoch(valid_loader, model, None, many_to_one)\n", | |
"        valid_losses.append(np.mean(valid_loss).item())\n", | |
"\n", | |
"        print(f'epoch {i}\\ntrain loss: {train_losses[-1]}, validation loss: {valid_losses[-1]}\\n')\n", | |
"\n", | |
"        # store the weights with the lowest validation loss in the pre-defined file\n", | |
"        # if only one loss is stored, take that one, else take the smallest\n", | |
"        if len(valid_losses) == 1 or valid_losses[-1] < min(valid_losses[:-1]):\n", | |
"            torch.save(model.state_dict(), output_file)\n", | |
"\n", | |
"        # every 20th epoch generate text to see improvements\n", | |
"        if i % 20 == 0:\n", | |
"            generated_text = generate_text(start_text, encoder, model, 100, 4)\n", | |
"            print(f'Improved text:\\n-> {generated_text} \\n {\"-\"*20}')\n", | |
"\n", | |
"    return train_losses, valid_losses" | |
], | |
"id": "LfTsh39YqCxG" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "Gn1VQ8TVKif6", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"outputId": "3142fcf1-084e-452a-c2c3-3ce14e6c8fdf" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"epoch 0\n", | |
"train loss: 3.0722065260916045, validation loss: 2.8938543796539307/n\n", | |
"Improved text:\n", | |
"-> america is to te ee et atee ae eta e at taa aaa t e a ee t e to t eetea et etee a a e \n", | |
" --------------------\n", | |
"epoch 1\n", | |
"train loss: 2.847360726558801, validation loss: 2.778059959411621/n\n", | |
"epoch 2\n", | |
"train loss: 2.660712401072184, validation loss: 2.6005918979644775/n\n", | |
"epoch 3\n", | |
"train loss: 2.496720465746793, validation loss: 2.477595806121826/n\n", | |
"epoch 4\n", | |
"train loss: 2.3691409934650767, validation loss: 2.3729653358459473/n\n", | |
"epoch 5\n", | |
"train loss: 2.25620529868386, validation loss: 2.266242742538452/n\n", | |
"epoch 6\n", | |
"train loss: 2.1481954328941577, validation loss: 2.1730096340179443/n\n", | |
"epoch 7\n", | |
"train loss: 2.0482463403181597, validation loss: 2.0945851802825928/n\n", | |
"epoch 8\n", | |
"train loss: 1.9594271616502241, validation loss: 2.028244972229004/n\n", | |
"epoch 9\n", | |
"train loss: 1.8832954240567756, validation loss: 1.9642455577850342/n\n", | |
"epoch 10\n", | |
"train loss: 1.8153246207670732, validation loss: 1.9098560810089111/n\n", | |
"epoch 11\n", | |
"train loss: 1.7568744529377331, validation loss: 1.857060194015503/n\n", | |
"epoch 12\n", | |
"train loss: 1.7044664765849258, validation loss: 1.8210113048553467/n\n", | |
"epoch 13\n", | |
"train loss: 1.6578938888780999, validation loss: 1.7762720584869385/n\n", | |
"epoch 14\n", | |
"train loss: 1.6164502555673772, validation loss: 1.7423654794692993/n\n", | |
"epoch 15\n", | |
"train loss: 1.5778644699038882, validation loss: 1.7043498754501343/n\n", | |
"epoch 16\n", | |
"train loss: 1.5420833826065063, validation loss: 1.6718835830688477/n\n", | |
"epoch 17\n", | |
"train loss: 1.5101951974810977, validation loss: 1.6475307941436768/n\n", | |
"epoch 18\n", | |
"train loss: 1.480200590509357, validation loss: 1.621835708618164/n\n", | |
"epoch 19\n", | |
"train loss: 1.4530761169664788, validation loss: 1.5967477560043335/n\n", | |
"epoch 20\n", | |
"train loss: 1.4271232214840976, validation loss: 1.5742368698120117/n\n", | |
"Improved text:\n", | |
"-> america is one to git. they say thank you.w. and you know there. tougs. we will not to did that are trodu lige \n", | |
" --------------------\n", | |
"epoch 21\n", | |
"train loss: 1.4043170141451287, validation loss: 1.5603150129318237/n\n", | |
"epoch 22\n", | |
"train loss: 1.382313695820895, validation loss: 1.5382187366485596/n\n", | |
"epoch 23\n", | |
"train loss: 1.3619441986083984, validation loss: 1.5201952457427979/n\n", | |
"epoch 24\n", | |
"train loss: 1.3428273778973203, validation loss: 1.5054054260253906/n\n", | |
"epoch 25\n", | |
"train loss: 1.3253240007342715, validation loss: 1.49391508102417/n\n", | |
"epoch 26\n", | |
"train loss: 1.3081143877723, validation loss: 1.4762858152389526/n\n", | |
"epoch 27\n", | |
"train loss: 1.292138959422256, validation loss: 1.4650746583938599/n\n", | |
"epoch 28\n", | |
"train loss: 1.2775824106100835, validation loss: 1.4516977071762085/n\n", | |
"epoch 29\n", | |
"train loss: 1.2638496991359827, validation loss: 1.4478967189788818/n\n", | |
"epoch 30\n", | |
"train loss: 1.2505281007651128, validation loss: 1.4315167665481567/n\n", | |
"epoch 31\n", | |
"train loss: 1.2370703148119377, validation loss: 1.4247608184814453/n\n", | |
"epoch 32\n", | |
"train loss: 1.2247026400132612, validation loss: 1.4173210859298706/n\n", | |
"epoch 33\n", | |
"train loss: 1.213089599753871, validation loss: 1.4089386463165283/n\n", | |
"epoch 34\n", | |
"train loss: 1.2022016734787913, validation loss: 1.3989789485931396/n\n", | |
"epoch 35\n", | |
"train loss: 1.1902846350814358, validation loss: 1.3936214447021484/n\n", | |
"epoch 36\n", | |
"train loss: 1.1808778155933728, validation loss: 1.3870627880096436/n\n", | |
"epoch 37\n", | |
"train loss: 1.1707697853897556, validation loss: 1.380878210067749/n\n", | |
"epoch 38\n", | |
"train loss: 1.1600048289154514, validation loss: 1.373460292816162/n\n", | |
"epoch 39\n", | |
"train loss: 1.1505673365159468, validation loss: 1.3698058128356934/n\n", | |
"epoch 40\n", | |
"train loss: 1.1412737514033462, validation loss: 1.363592267036438/n\n", | |
"Improved text:\n", | |
"-> america is an interests. and i watced the poll that he didnt were the reason i cant do their family. im not a c \n", | |
" --------------------\n", | |
"epoch 41\n", | |
"train loss: 1.1321798707499648, validation loss: 1.3571323156356812/n\n", | |
"epoch 42\n", | |
"train loss: 1.1239526813680476, validation loss: 1.3577945232391357/n\n", | |
"epoch 43\n", | |
"train loss: 1.1159283645225293, validation loss: 1.3480942249298096/n\n", | |
"epoch 44\n", | |
"train loss: 1.1068917910257976, validation loss: 1.347411870956421/n\n", | |
"epoch 45\n", | |
"train loss: 1.09801472678329, validation loss: 1.3405795097351074/n\n", | |
"epoch 46\n", | |
"train loss: 1.0899495283762615, validation loss: 1.3408769369125366/n\n", | |
"epoch 47\n", | |
"train loss: 1.081646922862891, validation loss: 1.3338264226913452/n\n", | |
"epoch 48\n", | |
"train loss: 1.0735088081070872, validation loss: 1.3323742151260376/n\n", | |
"epoch 49\n", | |
"train loss: 1.0657582355268074, validation loss: 1.3278988599777222/n\n" | |
] | |
} | |
], | |
"source": [ | |
"train_losses, valid_losses = train(m2m_model, train_loader, valid_loader, optim, num_epochs, output_file, many_to_one)" | |
], | |
"id": "Gn1VQ8TVKif6" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 281 | |
}, | |
"id": "f7M5Et3PuEaa", | |
"outputId": "0810f61a-6c0f-4ea4-af9d-e01df7e56057" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
], | |
"image/png": "\n" | |
}, | |
"metadata": { | |
"needs_background": "light" | |
} | |
} | |
], | |
"source": [ | |
"# Plot the training results\n", | |
"plt.title(\"Loss curves\")\n", | |
"loss_curve, = plt.plot(np.array(train_losses), label='train')\n", | |
"plt.plot(np.array(valid_losses), linestyle='--', label='valid')\n", | |
"plt.legend();" | |
], | |
"id": "f7M5Et3PuEaa" | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## 6. Create a top-k accuracy \n", | |
"\n", | |
"\n", | |
"We check whether the true next character is among the k characters that the model considers most likely.\n", | |
"Since the top-k set grows with k, the accuracy can only stay the same or increase when k increases.\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "oSn04x9E9iVG" | |
}, | |
"id": "oSn04x9E9iVG" | |
}, | |
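The idea behind top-k accuracy can be shown on a handful of made-up score vectors in plain Python. The scores, labels and function names below are invented for this sketch and have nothing to do with the trained model.

```python
def topk_hit(scores, true_index, k):
    """Return True if true_index is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_index in ranked[:k]


def topk_accuracy_simple(all_scores, true_indices, k):
    """Fraction of examples whose true class is among the top-k predictions."""
    hits = sum(topk_hit(s, t, k) for s, t in zip(all_scores, true_indices))
    return hits / len(true_indices)


# Made-up scores for three examples with three classes each.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 2, 1]

for k in (1, 2, 3):
    print(k, topk_accuracy_simple(scores, labels, k))
```

Once k equals the number of classes, every label is inside the top-k set and the accuracy reaches 1.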
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "fe1f70cf" | |
}, | |
"outputs": [], | |
"source": [ | |
"# define the top-k accuracy function\n", | |
"def topk_accuracy(k_list, lstm_model, data_loader, many_to_one):\n", | |
"    # use the lstm model in evaluation mode and create a dict for accuracies\n", | |
"    lstm_model.eval()\n", | |
"    accuracies = {k: [] for k in k_list}\n", | |
"\n", | |
"    # Disabling gradient calculation is useful for inference\n", | |
"    # It will reduce memory consumption for computations\n", | |
"    # We evaluate performance and do not need the gradients here\n", | |
"    with torch.no_grad():\n", | |
"        for batch in data_loader:\n", | |
"            # same alignment as in training: the input is every character except the last one\n", | |
"            logits = lstm_model(batch[:, :-1])\n", | |
"            softmax = torch.nn.functional.softmax(logits, dim=2)\n", | |
"\n", | |
"            # discriminate on which setting is used\n", | |
"            if many_to_one:\n", | |
"                # only the last time step predicts the last character\n", | |
"                targets = batch[:, -1:]\n", | |
"                softmax = softmax[:, -1:, :]\n", | |
"            else:\n", | |
"                # every time step predicts the following character\n", | |
"                targets = batch[:, 1:]\n", | |
"\n", | |
"            for k in k_list:\n", | |
"                # torch.topk returns the k largest elements of the input tensor along the given dimension\n", | |
"                topk = torch.topk(softmax, k, 2)\n", | |
"\n", | |
"                # count the positions whose target is among the top-k indices\n", | |
"                # .unsqueeze(-1) adds a trailing dimension so the targets broadcast against the k indices\n", | |
"                # .item() returns the value of a one-element tensor as a standard Python number\n", | |
"                correct = (topk.indices == targets.unsqueeze(-1)).sum().item()\n", | |
"\n", | |
"                # divide by the number of predictions in the batch:\n", | |
"                # one per sequence in many-to-one, one per position in many-to-many\n", | |
"                accuracies[k].append(correct / targets.numel())\n", | |
"\n", | |
"    for k, accs in accuracies.items():\n", | |
"        # average the per-batch accuracies\n", | |
"        accuracies[k] = np.mean(np.array(accs))\n", | |
"\n", | |
"    return list(accuracies.values())" | |
], | |
"id": "fe1f70cf" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 281 | |
}, | |
"id": "LU0xl0EJLYPE", | |
"outputId": "6aa2aabb-e845-4bd7-febc-01316a08076f" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
], | |
"image/png": "\n" | |
}, | |
"metadata": { | |
"needs_background": "light" | |
} | |
} | |
], | |
"source": [ | |
"m2m_model = NextCharLSTM(len(ALL_CHARS_NUMS), embedding_dim, hidden_dim)\n", | |
"m2m_model.load_state_dict(torch.load('best_model_m2m.pt'))\n", | |
"\n", | |
"# perform calculation on validation data\n", | |
"valid_loader = DataLoader(valid_data, 1, shuffle=False)\n", | |
"k = [i + 1 for i in range(len(ALL_CHARS_NUMS))]\n", | |
"top_k = topk_accuracy(k, m2m_model, valid_loader, many_to_one=False)\n", | |
"\n", | |
"# plot results\n", | |
"plt.title(\"Top-k Accuracies\")\n", | |
"accuracies, = plt.plot(np.array(top_k), label='top-k acc')\n", | |
"plt.legend();" | |
], | |
"id": "LU0xl0EJLYPE" | |
}, | |
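{ | |
"cell_type": "markdown", | |
"source": [ | |
"## 8. Create Text (probabilistic) \n", | |
"\n", | |
"Let's create a function that feeds a seed text into the model and extends it one character at a time. Each new character is sampled from the k most probable next characters according to their predicted probabilities. With `top_k_characters=1` the function always picks the single most likely character." | |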
], | |
"metadata": { | |
"id": "HP1urFvRr_Hb" | |
}, | |
"id": "HP1urFvRr_Hb" | |
}, | |
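The sampling step on its own can be sketched without PyTorch. This is a simplified stand-in with hypothetical names: it takes a list of probabilities, keeps the k largest and draws one index in proportion to them. `random.choices` normalizes the weights itself, which plays the role of renormalizing the truncated distribution.

```python
import random


def sample_from_topk(probabilities, k, rng=random):
    """Draw one index from the k most probable entries, weighted by probability."""
    ranked = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    candidates = ranked[:k]
    weights = [probabilities[i] for i in candidates]
    # choices() treats the weights as relative, so they need not sum to 1
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = [0.1, 0.6, 0.3]
greedy = sample_from_topk(probs, 1)
samples = [sample_from_topk(probs, 2) for _ in range(100)]
print(greedy)
print(sorted(set(samples)))
```

With k=1 the choice is deterministic, which tends to produce repetitive text. A larger k adds variety at the price of more spelling mistakes.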
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "ac5d1a15" | |
}, | |
"outputs": [], | |
"source": [ | |
"from torch.distributions import Categorical\n", | |
"\n", | |
"def text_generate_probabilistic(seed_text, encoder, lstm_model, text_length, top_k_characters=1):\n", | |
"    # set up model for evaluation\n", | |
"    lstm_model.eval()\n", | |
"    result = encoder(seed_text.lower())\n", | |
"\n", | |
"    # disable grad computations\n", | |
"    with torch.no_grad():\n", | |
"        # append one sampled character per iteration\n", | |
"        for i in range(text_length):\n", | |
"            logits = lstm_model(result.view(1, -1))\n", | |
"            # use softmax for proper topk computation\n", | |
"            softmax = torch.nn.functional.softmax(logits, dim=2)\n", | |
"            topk = torch.topk(softmax, top_k_characters, 2)\n", | |
"\n", | |
"            # probability distribution over the top-k characters at the last position\n", | |
"            categorical = Categorical(topk.values[:, -1:])\n", | |
"            sample = categorical.sample()\n", | |
"\n", | |
"            # concatenate the sampled character to the result\n", | |
"            result = torch.cat((result, topk.indices[0, -1, sample].view(-1)))\n", | |
"\n", | |
"    return encoder(result)" | |
], | |
"id": "ac5d1a15" | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"text_generate_probabilistic(start_text, encoder, m2m_model, 500, top_k_characters=1)" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 89 | |
}, | |
"id": "hQnHLzVbprjJ", | |
"outputId": "f0d62c7f-9c18-4c9f-b800-3bc56b86e16e" | |
}, | |
"id": "hQnHLzVbprjJ", | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"'america is not going to happen anymore. we have to be smart that were going to have to do it. i said what they do it is the worst things that were going to be so much more than they want to do it. i dont want to be the way it was a tough and i was going to have to make our country great again. we have to be so much money that we have to do it and i said what they do it is the worst things that were going to be so much more than they want to do it. i dont want to be the way it was a tough and i was going to'" | |
], | |
"application/vnd.google.colaboratory.intrinsic+json": { | |
"type": "string" | |
} | |
}, | |
"metadata": {}, | |
"execution_count": 22 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 89 | |
}, | |
"id": "-vosoMjFbtRM", | |
"outputId": "444cb3d8-a30b-4c68-8b16-9fa5ddb99fe5" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"'america is an amazing all the things outside of this country is an annotance.i will start all of it. because i will tell you that. were going to be six long. and we also done. we have to be true what they did a great thing and i will be saying whats had to do this. i would have had the biggest place.were going to waik number ones whine was going to get into the most incredible. the people that want to be strong. its all over the country. its got to be obama adminisared that have been. what the pollstaria a'" | |
], | |
"application/vnd.google.colaboratory.intrinsic+json": { | |
"type": "string" | |
} | |
}, | |
"metadata": {}, | |
"execution_count": 23 | |
} | |
], | |
"source": [ | |
"text_generate_probabilistic(start_text, encoder, m2m_model, 500, top_k_characters=4)" | |
], | |
"id": "-vosoMjFbtRM" | |
} | |
], | |
"metadata": { | |
"accelerator": "GPU", | |
"colab": { | |
"collapsed_sections": [], | |
"machine_shape": "hm", | |
"name": "text_processing_lstm.ipynb", | |
"provenance": [], | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |