Augmenting the LSTM PoS tagger with Character-level features (PyTorch)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Exercise: Augmenting the LSTM PoS tagger with Character-level features\n",
"\n",
"My proposed solution to [this exercise](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#exercise-augmenting-the-lstm-part-of-speech-tagger-with-character-level-features).\n",
"\n",
"In the [previous example](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#example-an-lstm-for-part-of-speech-tagging), each word had an embedding, which served as the input to our sequence model. Let’s augment the word embeddings with a representation derived from the characters of the word. We expect that this should help significantly, since character-level information like affixes has a large bearing on part-of-speech. For example, words with the affix -ly are almost always tagged as adverbs in English.\n",
"\n",
"\n",
"## Sequence Models and LSTM networks\n",
"\n",
"A few notes about LSTMs; feel free to skip ahead if you already know them.\n",
"In a sequence, the inputs depend on each other through time.\n",
"An LSTM maintains a state that is carried over to the next step, so that information can propagate along as the network passes over the sequence.\n",
"\n",
"Colah's [blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) is a nice piece to read about them. Their chain-like nature makes them appropriate for sequences and lists.\n",
"Standard RNNs are not effective when the relevant context becomes wide.\n",
"LSTMs were introduced by Hochreiter & Schmidhuber explicitly to avoid the long-term dependency problem.\n",
"The core components of an LSTM are the cell state and the gates. Notation:\n",
"* $C_{t-1}$: previous cell state\n",
"* $h_{t-1}$: previous hidden state\n",
"* $x_t$: current input\n",
"1. First, decide what information to keep from the previous cell state (forget gate).\n",
"    * $f_t = \\sigma(W_f \\cdot [h_{t-1}, x_t] + b_f)$\n",
"2. Second, decide what new information to store in the cell state (input gate and candidate state).\n",
"    * $i_t = \\sigma(W_i \\cdot [h_{t-1}, x_t] + b_i)$\n",
"    * $\\tilde{C}_t = \\text{tanh}(W_C \\cdot [h_{t-1}, x_t] + b_C)$\n",
"    * $C_t = f_t * C_{t-1} + i_t * \\tilde{C}_t$\n",
"3. Finally, decide what to output (output gate).\n",
"    * $o_t = \\sigma(W_o \\cdot [h_{t-1}, x_t] + b_o)$\n",
"    * $h_t = o_t * \\text{tanh}(C_t)$\n",
"\n",
"A small numerical sketch of these gate equations follows in the next cell."
]
},
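{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is **not part of the original exercise**: it is a minimal numerical sketch of the gate equations above, using made-up sizes and randomly initialized weights (the names `W_f`, `W_i`, `W_C`, `W_o`, `b_f`, ... are hypothetical, chosen to mirror the formulas). The actual tagger below relies on `nn.LSTM`, which implements the same update internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"torch.manual_seed(0)\n",
"\n",
"# Toy dimensions for the sketch only (not tied to the tagger below)\n",
"input_size, hidden_size = 2, 3\n",
"\n",
"# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]\n",
"W_f, b_f = torch.randn(hidden_size, hidden_size + input_size), torch.zeros(hidden_size)\n",
"W_i, b_i = torch.randn(hidden_size, hidden_size + input_size), torch.zeros(hidden_size)\n",
"W_C, b_C = torch.randn(hidden_size, hidden_size + input_size), torch.zeros(hidden_size)\n",
"W_o, b_o = torch.randn(hidden_size, hidden_size + input_size), torch.zeros(hidden_size)\n",
"\n",
"x_t = torch.randn(input_size)      # current input x_t\n",
"h_prev = torch.zeros(hidden_size)  # previous hidden state h_{t-1}\n",
"C_prev = torch.zeros(hidden_size)  # previous cell state C_{t-1}\n",
"\n",
"hx = torch.cat((h_prev, x_t))          # [h_{t-1}, x_t]\n",
"f_t = torch.sigmoid(W_f @ hx + b_f)    # forget gate\n",
"i_t = torch.sigmoid(W_i @ hx + b_i)    # input gate\n",
"C_tilde = torch.tanh(W_C @ hx + b_C)   # candidate cell state\n",
"C_t = f_t * C_prev + i_t * C_tilde     # new cell state\n",
"o_t = torch.sigmoid(W_o @ hx + b_o)    # output gate\n",
"h_t = o_t * torch.tanh(C_t)            # new hidden state\n",
"\n",
"print(C_t)\n",
"print(h_t)"
]
},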
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing the dataset\n",
"\n",
"We first prepare our toy dataset and a few functions to preprocess it.\n",
"As before, we build a dictionary that helps us turn each sentence into a sequence of indexes.\n",
"As a reminder, such a dictionary holds words as keys and their corresponding indexes as values.\n",
"We do the same for characters, so that each word can be represented as a sequence of character indexes.\n",
"Finally, we generate a dictionary of tensors that represents each word in the vocabulary as a sequence of character indexes."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<torch._C.Generator at 0x106547cf0>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"import torch.autograd as autograd\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
"\n",
"torch.manual_seed(1)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'read': 5, 'the': 0, 'apple': 3, 'everybody': 4, 'book': 7, 'dog': 1, 'ate': 2, 'that': 6}\n",
"{'o': 11, 'v': 8, 'l': 7, 'p': 6, 'r': 0, 't': 4, 'g': 13, 'y': 9, 'h': 5, 'd': 3, 'a': 2, 'k': 12, 'b': 10, 'e': 1}\n",
"Variable containing:\n",
" 4\n",
" 5\n",
" 1\n",
"[torch.LongTensor of size 3]\n",
"\n"
]
}
],
"source": [ | |
"def prepare_sequence(seq, to_ix):\n", | |
" \"\"\"Convert a sequence of things to a matrix\n", | |
" \n", | |
" Args:\n", | |
" seq(list): Sequence of things\n", | |
" to_ix(dict): key value pairs with things as keys and their index as value\n", | |
" \n", | |
" Returns:\n", | |
" matrix with each line correspond to a word in the sequence with its index in \n", | |
" the dictionary to_ix\n", | |
" \"\"\"\n", | |
" idxs = [to_ix[w] for w in seq]\n", | |
" tensor = torch.LongTensor(idxs)\n", | |
" return autograd.Variable(tensor)\n", | |
"\n", | |
"def prepare_words_tensor(word_to_ix, char_to_ix):\n", | |
" \"\"\"Convert words(keys) in the dictionary word_to_ix into\n", | |
" tensors that contains character indexes\n", | |
" \n", | |
" Args:\n", | |
" word_to_ix(dict): key value pairs with words as keys and their indexes as values\n", | |
" char_to_ix(dict): key value pairs with characters as keys and their indexes as values\n", | |
" \n", | |
" Returns:\n", | |
" dict: Contains keys as index of words and values the tensors of words\n", | |
" \"\"\"\n", | |
" list_words_tensor = {}\n", | |
" for word, idx in word_to_ix.items():\n", | |
" list_words_tensor[idx] = prepare_sequence(word, char_to_ix)\n", | |
" return list_words_tensor\n", | |
"\n", | |
"# Toy training data\n", | |
"training_data = [\n", | |
" (\"The dog ate the apple\".casefold().split(), [\"DET\", \"NN\", \"V\", \"DET\", \"NN\"]),\n", | |
" (\"Everybody read that book\".casefold().split(), [\"NN\", \"V\", \"DET\", \"NN\"])\n", | |
"]\n", | |
"\n", | |
"# Preparing our vocabulary of words\n", | |
"word_to_ix = {}\n", | |
"for sent, tags in training_data:\n", | |
" for word in sent:\n", | |
" if word not in word_to_ix:\n", | |
" word_to_ix[word] = len(word_to_ix)\n", | |
"print(word_to_ix)\n", | |
"tag_to_ix = {\"DET\": 0, \"NN\": 1, \"V\":2}\n", | |
"\n", | |
"# Preparing our vocabulary of characters\n", | |
"char_to_ix = {}\n", | |
"for word in word_to_ix.keys():\n", | |
" for c in word:\n", | |
" if c not in char_to_ix:\n", | |
" char_to_ix[c] = len(char_to_ix)\n", | |
"\n", | |
"print(char_to_ix)\n", | |
"\n", | |
"# Preparing representation of words at the character level\n", | |
"list_words_tensor = prepare_words_tensor(word_to_ix, char_to_ix)\n", | |
"print(list_words_tensor[0])" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"class LSTMTaggerAug(nn.Module):\n",
"    def __init__(self, embedding_dim_words, embedding_dim_chars, hidden_dim_words, hidden_dim_chars, vocab_size, tagset_size, charset_size):\n",
"        \"\"\"LSTM Part-of-Speech tagger augmented with character-level features\n",
"\n",
"        Attributes:\n",
"            embedding_dim_words: Embedding dimension of the word features fed to the word-level LSTM\n",
"            embedding_dim_chars: Embedding dimension of the character features fed to the character-level LSTM\n",
"            hidden_dim_words: Output size of the word-level LSTM\n",
"            hidden_dim_chars: Output size of the character-level LSTM\n",
"            vocab_size: Size of the vocabulary of words\n",
"            tagset_size: Size of the set of labels\n",
"            charset_size: Size of the vocabulary of characters\n",
"        \"\"\"\n",
"        super(LSTMTaggerAug, self).__init__()\n",
"        self.hidden_dim_words = hidden_dim_words\n",
"        self.hidden_dim_chars = hidden_dim_chars\n",
"        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim_words)\n",
"        self.char_embeddings = nn.Embedding(charset_size, embedding_dim_chars)\n",
"        # Character-level LSTM: reads one word as a sequence of character embeddings\n",
"        self.lstm_char = nn.LSTM(embedding_dim_chars, hidden_dim_chars)\n",
"        # Word-level LSTM: reads the word embedding concatenated with the character-level summary\n",
"        self.lstm_words = nn.LSTM(embedding_dim_words + hidden_dim_chars, hidden_dim_words)\n",
"        self.hidden2tag = nn.Linear(hidden_dim_words, tagset_size)\n",
"        self.hidden_char = self.init_hidden(c=False)\n",
"        self.hidden_words = self.init_hidden(c=True)\n",
"\n",
"    def init_hidden(self, c=True):\n",
"        \"\"\"Initialize the hidden state of one of the LSTMs\n",
"\n",
"        Args:\n",
"            c(boolean): if True, return an initialized hidden state for the word-level LSTM,\n",
"                otherwise for the character-level LSTM\n",
"        \"\"\"\n",
"        if c:\n",
"            return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim_words)),\n",
"                    autograd.Variable(torch.zeros(1, 1, self.hidden_dim_words)))\n",
"        else:\n",
"            return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim_chars)),\n",
"                    autograd.Variable(torch.zeros(1, 1, self.hidden_dim_chars)))\n",
"\n",
"    def forward(self, sentence_seq, words_tensor_dict):\n",
"        \"\"\"Forward propagation\n",
"\n",
"        Args:\n",
"            sentence_seq(tensor): Sequence of word indexes for the sentence\n",
"            words_tensor_dict(dict): Maps each word index to the tensor of its character indexes\n",
"\n",
"        Returns:\n",
"            tensor: Predicted PoS scores for the sequence, one row per word\n",
"        \"\"\"\n",
"        for ix, word_idx in enumerate(sentence_seq):\n",
"\n",
"            # Character level: run the character-level LSTM over the characters of the word\n",
"            word_chars_tensors = words_tensor_dict[int(word_idx)]\n",
"            char_embeds = self.char_embeddings(word_chars_tensors)\n",
"\n",
"            # Remember that the input of an LSTM is a 3D tensor:\n",
"            # the first axis is the sequence itself,\n",
"            # the second indexes instances in the mini-batch, and\n",
"            # the third indexes elements of the input.\n",
"            lstm_char_out, self.hidden_char = self.lstm_char(\n",
"                char_embeds.view(len(char_embeds), 1, -1), self.hidden_char)\n",
"\n",
"            # Word level: concatenate the word embedding with the final hidden output\n",
"            # of the character-level LSTM, i.e. lstm_char_out[-1]\n",
"            embeds = self.word_embeddings(word_idx).view(1, -1)\n",
"            embeds_cat = torch.cat((embeds, lstm_char_out[-1]), dim=1)\n",
"\n",
"            # Reshape to (seq_len=1, batch=1, features), the shape expected by nn.LSTM\n",
"            lstm_out, self.hidden_words = self.lstm_words(\n",
"                embeds_cat.view(1, 1, -1), self.hidden_words)\n",
"            tag_space = self.hidden2tag(lstm_out.view(1, -1))\n",
"\n",
"            tag_score = F.log_softmax(tag_space, dim=1)\n",
"            if ix == 0:\n",
"                tag_scores = tag_score\n",
"            else:\n",
"                tag_scores = torch.cat((tag_scores, tag_score), 0)\n",
"\n",
"        return tag_scores\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Dimensions used for our LSTMTaggerAug\n",
"EMBEDDING_DIM = 6\n",
"HIDDEN_DIM = 6"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Let's test it\n",
"model = LSTMTaggerAug(EMBEDDING_DIM, EMBEDDING_DIM, HIDDEN_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix), len(char_to_ix))\n",
"loss_function = nn.NLLLoss()\n",
"optimizer = optim.SGD(model.parameters(), lr=0.1)\n",
"\n",
"inputs = prepare_sequence(training_data[0][0], word_to_ix)\n",
"words_tensors = prepare_words_tensor(word_to_ix, char_to_ix)\n",
"tag_scores = model(inputs, words_tensors)\n"
]
},
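{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick sanity check (not in the original gist): `tag_scores` should have one row per word of the sentence and one column per tag, i.e. shape `(5, 3)` for \"the dog ate the apple\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Not in the original gist: the untrained model should already return one row of\n",
"# scores per word, i.e. (sentence length, number of tags) = (5, 3) here.\n",
"print(tag_scores.size())"
]
},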
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Variable containing:\n",
"-0.0337 -3.4797 -6.0577\n",
"-5.1650 -0.0230 -4.0739\n",
"-5.4177 -3.3947 -0.0387\n",
"-0.0336 -3.5511 -5.4472\n",
"-3.7990 -0.0385 -4.1764\n",
"[torch.FloatTensor of size 5x3]\n",
"\n",
"['the', 'dog', 'ate', 'the', 'apple']\n"
]
}
],
"source": [ | |
"# Let's train it !\n", | |
"model = LSTMTaggerAug(EMBEDDING_DIM, EMBEDDING_DIM, HIDDEN_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix), len(char_to_ix))\n", | |
"loss_function = nn.NLLLoss()\n", | |
"optimizer = optim.SGD(model.parameters(), lr=0.1)\n", | |
"\n", | |
"inputs = prepare_sequence(training_data[0][0], word_to_ix)\n", | |
"words_tensors = prepare_words_tensor(word_to_ix, char_to_ix)\n", | |
"tag_scores = model(inputs, words_tensors)\n", | |
"# print(tag_scores)\n", | |
"\n", | |
"for epoch in range(300):\n", | |
" for sentence, tags in training_data:\n", | |
" model.zero_grad()\n", | |
" model.hidden = model.init_hidden()\n", | |
" sentence_in = prepare_sequence(sentence, word_to_ix)\n", | |
" words_tensors = prepare_words_tensor(word_to_ix, char_to_ix)\n", | |
" targets = prepare_sequence(tags, tag_to_ix)\n", | |
" tag_scores = model(sentence_in, words_tensors)\n", | |
" loss = loss_function(tag_scores, targets)\n", | |
" loss.backward(retain_graph=True)\n", | |
" optimizer.step()\n", | |
" \n", | |
"inputs = prepare_sequence(training_data[0][0], word_to_ix)\n", | |
"words_in = prepare_words_tensor(word_to_ix, char_to_ix)\n", | |
"tag_scores = model(inputs, words_in)\n", | |
"print(tag_scores)\n", | |
"print(training_data[0][0])" | |
] | |
}, | |
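{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is **not part of the original gist**: it maps each row of `tag_scores` back to a tag name via the inverse of `tag_to_ix`, so the predictions can be compared directly with the expected tags `DET NN V DET NN`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Not part of the original gist: decode the predicted tags for the first training sentence.\n",
"# Assumes tag_scores, tag_to_ix and training_data from the cells above.\n",
"ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}\n",
"\n",
"# Each row of tag_scores holds log-probabilities over the tags; take the argmax per word\n",
"predicted_ix = torch.max(tag_scores, 1)[1].data.view(-1).tolist()\n",
"predicted_tags = [ix_to_tag[ix] for ix in predicted_ix]\n",
"print(list(zip(training_data[0][0], predicted_tags)))"
]
},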
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |
Hello @jhumigas,
I tried out the code you have compiled for the character level LSTM. It generates the following error (with traceback):
Can you help me figure out what has gone wrong?
Thank you.