{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "In this assignment, you'll see an example in which some text does not tag well, most likely because the training data did not have many examples of the target sentence structure. You'll see the effects of adding a few sentences of training data with the missing sentence structure on the accuracy of the tagger."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "%pprint\n\nimport re\n\nimport nltk\nfrom nltk.corpus import brown\nfrom nltk import word_tokenize",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Pretty printing has been turned OFF\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "First, create datasets and train an ngram backoff tagger as before, using the brown corpus as the training set." | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "def create_data_sets(sentences):\n size = int(len(sentences) * 0.9)\n train_sents = sentences[:size]\n test_sents = sentences[size:]\n return train_sents, test_sents\n\ndef build_backoff_tagger (train_sents):\n t0 = nltk.DefaultTagger('NN')\n t1 = nltk.UnigramTagger(train_sents, backoff=t0)\n t2 = nltk.BigramTagger(train_sents, backoff=t1)\n return t2\n\nbrown_tagged_sents = brown.tagged_sents(categories=['adventure', 'belles_lettres', 'editorial',\n 'fiction', 'government', 'hobbies', 'humor',\n 'learned', 'lore', 'mystery', 'religion',\n 'reviews', 'romance', 'science_fiction'],\n tagset='universal')\n\ntrain_sents, test_sents = create_data_sets(brown_tagged_sents)\n\nngram_tagger = build_backoff_tagger(train_sents)\n\nprint (\"%0.3f\" % ngram_tagger.evaluate(test_sents))", | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "0.923\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
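{
"metadata": {},
"cell_type": "markdown",
"source": "(An aside, assuming the tagger trained above: because the backoff chain ends in `DefaultTagger('NN')`, any word the bigram and unigram taggers never saw in training falls all the way through and comes out tagged 'NN'. The sketch below illustrates this with a made-up sentence; 'zyzzyva' is a hypothetical out-of-vocabulary token, not part of the original assignment.)"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# A minimal check of the backoff chain: an out-of-vocabulary token should\n# fall through the bigram and unigram taggers to the 'NN' default.\nngram_tagger.tag(word_tokenize('Flummox the zyzzyva thoroughly.'))",
"execution_count": null,
"outputs": []
},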
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "Next, read in a file of recipes and tokenize it."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "with open('data/cookbooks.txt', 'r') as text_file:\n cookbooks_corpus = text_file.read()\n\ndef tokenize_text(corpus):\n sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')\n raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences\n \n return [nltk.word_tokenize(word) for word in raw_sents]\n\ncookbook_sents = tokenize_text(cookbooks_corpus)", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Now, in order to see the sentences where errors are occuring, the code below finds sentences that begin with imperatives and prints them out, along with their assigned parts of speech." | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "for sent in cookbook_sents:\n if sent[0] in [\"Wash\", \"Stir\", \"Moisten\", \"Drain\", \"Cook\", \"Pour\", \"Chop\", \"Slice\", \"Season\", \"Mix\", \"Fry\", \"Bake\", \"Roast\", \"Wisk\"]:\n for item in ngram_tagger.tag(sent):\n print(item) \n print()", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "('Wash', 'NN')\n('a', 'DET')\n('quarter', 'NOUN')\n('of', 'ADP')\n('a', 'DET')\n('pound', 'NOUN')\n('of', 'ADP')\n('best', 'ADJ')\n('pearl', 'NN')\n('sago', 'NOUN')\n('thoroughly', 'ADV')\n(',', '.')\n('then', 'ADV')\n('stew', 'NOUN')\n('it', 'PRON')\n('quite', 'ADV')\n('tender', 'ADJ')\n('and', 'CONJ')\n('very', 'ADV')\n('View', 'NN')\n('page', 'NOUN')\n('[', '.')\n('32', 'NUM')\n(']', '.')\n('thick', 'ADJ')\n('in', 'ADP')\n('water', 'NOUN')\n('or', 'CONJ')\n('thick', 'ADJ')\n('broth', 'NOUN')\n(';', '.')\n('(', '.')\n('it', 'PRON')\n('will', 'VERB')\n('require', 'VERB')\n('nearly', 'ADV')\n('or', 'CONJ')\n('quite', 'ADV')\n('a', 'DET')\n('quart', 'NOUN')\n('of', 'ADP')\n('liquid', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('be', 'VERB')\n('poured', 'VERB')\n('to', 'PRT')\n('it', 'PRON')\n('cold', 'ADJ')\n('and', 'CONJ')\n('heated', 'VERB')\n('slowly', 'ADV')\n(';', '.')\n(')', '.')\n('then', 'ADV')\n('mix', 'VERB')\n('gradually', 'ADV')\n('with', 'ADP')\n('it', 'PRON')\n('a', 'DET')\n('pint', 'NOUN')\n('of', 'ADP')\n('good', 'ADJ')\n('boiling', 'VERB')\n('cream', 'NOUN')\n('or', 'CONJ')\n('milk', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('the', 'DET')\n('yolks', 'NN')\n('of', 'ADP')\n('four', 'NUM')\n('fresh', 'ADJ')\n('eggs', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('mingle', 'VERB')\n('the', 'DET')\n('whole', 'ADJ')\n('carefully', 'ADV')\n('with', 'ADP')\n('two', 'NUM')\n('quarts', 'NN')\n('of', 'ADP')\n('strong', 'ADJ')\n('veal', 'NOUN')\n('or', 'CONJ')\n('beef', 'NOUN')\n('stock', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('always', 'ADV')\n('be', 'VERB')\n('kept', 'VERB')\n('ready', 'ADJ')\n('boiling', 'VERB')\n('.', '.')\n\n('Pour', 'NN')\n('it', 'PRON')\n('over', 'PRT')\n('the', 'DET')\n('meat', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('.', '.')\n\n('Bake', 'NOUN')\n('it', 'PRON')\n('slowly', 'ADV')\n('for', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Drain', 'VERB')\n('it', 'PRON')\n('from', 'ADP')\n('fat', 'NOUN')\n(',', '.')\n('unbind', 'NN')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('it', 'PRON')\n('with', 'ADP')\n('a', 'DET')\n('good', 'ADJ')\n('brown', 'ADJ')\n('gravy', 'NOUN')\n(',', '.')\n('or', 'CONJ')\n('any', 'DET')\n('sauce', 'NOUN')\n('preferred', 'VERB')\n(',', '.')\n('or', 'CONJ')\n('with', 'ADP')\n('melted', 'VERB')\n('butter', 'NOUN')\n('in', 'ADP')\n('a', 'DET')\n('tureen', 'NN')\n(',', '.')\n('a', 'DET')\n('cut', 'VERB')\n('lemon', 'NOUN')\n('and', 'CONJ')\n('cayenne', 'NOUN')\n('.', '.')\n\n('Mix', 'NN')\n('all', 'PRT')\n('these', 'DET')\n('ingredients', 'NOUN')\n('well', 'ADV')\n(',', '.')\n('and', 'CONJ')\n('rub', 'VERB')\n('them', 'PRON')\n('well', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n(',', '.')\n('particularly', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('holes', 'NOUN')\n(',', '.')\n('adding', 'VERB')\n('occasionally', 'ADV')\n('a', 'DET')\n('little', 'ADJ')\n('salt', 'NOUN')\n('.', '.')\n\n('Chop', 'NN')\n('some', 'DET')\n('suet', 'NN')\n('very', 'ADV')\n('finely', 'ADV')\n(',', '.')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n('with', 'ADP')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('bake', 'NN')\n('it', 'PRON')\n('in', 'ADP')\n('a', 'DET')\n('moderately', 'ADV')\n('heated', 'VERB')\n('oven', 'NOUN')\n(',', '.')\n('from', 'ADP')\n('five', 'NUM')\n('to', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Pour', 'NN')\n('the', 'DET')\n('pickle', 'NOUN')\n('into', 'ADP')\n('a', 'DET')\n('deep', 'ADJ')\n('earthen', 
'NN')\n('jar', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('when', 'ADV')\n('it', 'PRON')\n('is', 'VERB')\n('cold', 'ADJ')\n('lay', 'ADJ')\n('in', 'ADP')\n('the', 'DET')\n('meat', 'NOUN')\n('so', 'ADP')\n('that', 'DET')\n('every', 'DET')\n('part', 'NOUN')\n('is', 'VERB')\n('covered', 'VERB')\n('.', '.')\n\n('Season', 'NN')\n('with', 'ADP')\n('pepper', 'NOUN')\n('and', 'CONJ')\n('salt', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('sprinkle', 'VERB')\n('with', 'ADP')\n('oat-meal', 'NN')\n(';', '.')\n('chop', 'NOUN')\n('a', 'DET')\n('half', 'PRT')\n('handful', 'NOUN')\n('of', 'ADP')\n('parsley', 'NOUN')\n('and', 'CONJ')\n('thyme', 'NN')\n('and', 'CONJ')\n('throw', 'VERB')\n('in', 'ADP')\n(';', '.')\n('boil', 'VERB')\n('a', 'DET')\n('large', 'ADJ')\n('onion', 'NOUN')\n('nearly', 'ADV')\n('tender', 'ADJ')\n(',', '.')\n('chop', 'NOUN')\n('it', 'PRON')\n('and', 'CONJ')\n('mix', 'VERB')\n('it', 'PRON')\n('in', 'ADP')\n(';', '.')\n('add', 'VERB')\n('sufficient', 'ADJ')\n('broth', 'NOUN')\n('or', 'CONJ')\n('skim-milk', 'NN')\n('and', 'CONJ')\n('water', 'NOUN')\n('to', 'PRT')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n(';', '.')\n('let', 'VERB')\n('it', 'PRON')\n('simmer', 'VERB')\n('two', 'NUM')\n('hours', 'NOUN')\n(';', '.')\n('then', 'ADV')\n('thicken', 'VERB')\n('with', 'ADP')\n('a', 'DET')\n('little', 'ADJ')\n('oatmeal', 'NN')\n(',', '.')\n('and', 'CONJ')\n('add', 'VERB')\n('a', 'DET')\n('dessert', 'NOUN')\n('spoonful', 'NOUN')\n('of', 'ADP')\n('mushroom', 'NOUN')\n('or', 'CONJ')\n('walnut', 'NOUN')\n('catsup', 'NOUN')\n(';', '.')\n('stir', 'VERB')\n('well', 'ADV')\n(',', '.')\n('boil', 'VERB')\n('a', 'DET')\n('minute', 'NOUN')\n('and', 'CONJ')\n('serve', 'VERB')\n('with', 'ADP')\n('pieces', 'NOUN')\n('of', 'ADP')\n('bread', 'NOUN')\n('toasted', 'VERB')\n('.', '.')\n\n('Pour', 'NN')\n('it', 'PRON')\n('on', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n('boiling', 'VERB')\n('hot', 'ADJ')\n('and', 'CONJ')\n('cover', 'VERB')\n('closely', 'ADV')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Notice that most of the initial words are incorrectly tagged as nouns rather than verbs. How can we fix this? One way is to label a few rather generic sentences with the structure we are interested in, add them to the start of the training data, and then retrain the tagger." | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "cooking_action_sents = [[('Strain', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Mix', 'VB'), ('them', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Season', 'VB'), ('them', 'PPS'), ('with', 'IN'), ('pepper', 'NN'), ('.', '.')], \n [('Wash', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Chop', 'VB'), ('the', 'AT'), ('greens', 'NNS'), ('.', '.')],\n [('Slice', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Bake', 'VB'), ('the', 'AT'), ('cake', 'NN'), ('.', '.')],\n [('Pour', 'VB'), ('into', 'IN'), ('a', 'AT'), ('mold', 'NN'), ('.', '.')],\n [('Stir', 'VB'), ('the', 'AT'), ('mixture', 'NN'), ('.', '.')],\n [('Moisten', 'VB'), ('the', 'AT'), ('grains', 'NNS'), ('.', '.')],\n [('Cook', 'VB'), ('the', 'AT'), ('duck', 'NN'), ('.', '.')],\n [('Drain', 'VB'), ('for', 'IN'), ('one', 'CD'), ('day', 'NN'), ('.', '.')]]\n\n\nall_tagged_sents = cooking_action_sents + brown_tagged_sents\n\ntrain_sents, test_sents = create_data_sets(all_tagged_sents)\n\nngram_tagger_all_sents = build_backoff_tagger(train_sents)\n\nprint (\"%0.3f\" % ngram_tagger_all_sents.evaluate(test_sents))", | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "0.923\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
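{
"metadata": {},
"cell_type": "markdown",
"source": "(A sketch, not part of the original assignment: for tag consistency one could instead label the seed sentences with the universal tagset, since the rest of the training data uses universal tags. The cell below shows a hypothetical variant with a few of the same sentences relabeled; only the tags change.)"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# Hypothetical variant: seed sentences labeled with universal tags\n# ('VERB', 'PRON', 'ADV', ...) so they match the tagset of brown_tagged_sents.\ncooking_action_sents_universal = [[('Strain', 'VERB'), ('it', 'PRON'), ('well', 'ADV'), ('.', '.')],\n                                  [('Mix', 'VERB'), ('them', 'PRON'), ('well', 'ADV'), ('.', '.')],\n                                  [('Season', 'VERB'), ('them', 'PRON'), ('with', 'ADP'), ('pepper', 'NOUN'), ('.', '.')],\n                                  [('Wash', 'VERB'), ('it', 'PRON'), ('well', 'ADV'), ('.', '.')],\n                                  [('Chop', 'VERB'), ('the', 'DET'), ('greens', 'NOUN'), ('.', '.')],\n                                  [('Bake', 'VERB'), ('the', 'DET'), ('cake', 'NOUN'), ('.', '.')]]\n\n# retrain exactly as above, but with tag-consistent seed data\ntrain_sents_u, test_sents_u = create_data_sets(cooking_action_sents_universal + brown_tagged_sents)\ntagger_universal = build_backoff_tagger(train_sents_u)",
"execution_count": null,
"outputs": []
},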
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "for sent in cookbook_sents:\n    if sent[0] in [\"Wash\", \"Stir\", \"Moisten\", \"Drain\", \"Cook\", \"Pour\", \"Chop\", \"Slice\", \"Season\", \"Mix\", \"Fry\", \"Bake\", \"Roast\", \"Wisk\"]:\n        for item in ngram_tagger_all_sents.tag(sent):\n            print(item)\n        print()",
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": "('Wash', 'VB')\n('a', 'DET')\n('quarter', 'NOUN')\n('of', 'ADP')\n('a', 'DET')\n('pound', 'NOUN')\n('of', 'ADP')\n('best', 'ADJ')\n('pearl', 'NN')\n('sago', 'NOUN')\n('thoroughly', 'ADV')\n(',', '.')\n('then', 'ADV')\n('stew', 'NOUN')\n('it', 'PRON')\n('quite', 'ADV')\n('tender', 'ADJ')\n('and', 'CONJ')\n('very', 'ADV')\n('View', 'NN')\n('page', 'NOUN')\n('[', '.')\n('32', 'NUM')\n(']', '.')\n('thick', 'ADJ')\n('in', 'ADP')\n('water', 'NOUN')\n('or', 'CONJ')\n('thick', 'ADJ')\n('broth', 'NOUN')\n(';', '.')\n('(', '.')\n('it', 'PRON')\n('will', 'VERB')\n('require', 'VERB')\n('nearly', 'ADV')\n('or', 'CONJ')\n('quite', 'ADV')\n('a', 'DET')\n('quart', 'NOUN')\n('of', 'ADP')\n('liquid', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('be', 'VERB')\n('poured', 'VERB')\n('to', 'PRT')\n('it', 'PRON')\n('cold', 'ADJ')\n('and', 'CONJ')\n('heated', 'VERB')\n('slowly', 'ADV')\n(';', '.')\n(')', '.')\n('then', 'ADV')\n('mix', 'VERB')\n('gradually', 'ADV')\n('with', 'ADP')\n('it', 'PRON')\n('a', 'DET')\n('pint', 'NOUN')\n('of', 'ADP')\n('good', 'ADJ')\n('boiling', 'VERB')\n('cream', 'NOUN')\n('or', 'CONJ')\n('milk', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('the', 'DET')\n('yolks', 'NN')\n('of', 'ADP')\n('four', 'NUM')\n('fresh', 'ADJ')\n('eggs', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('mingle', 'VERB')\n('the', 'DET')\n('whole', 'ADJ')\n('carefully', 'ADV')\n('with', 'ADP')\n('two', 'NUM')\n('quarts', 'NN')\n('of', 'ADP')\n('strong', 'ADJ')\n('veal', 'NOUN')\n('or', 'CONJ')\n('beef', 'NOUN')\n('stock', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('always', 'ADV')\n('be', 'VERB')\n('kept', 'VERB')\n('ready', 'ADJ')\n('boiling', 'VERB')\n('.', '.')\n\n('Pour', 'VB')\n('it', 'PPS')\n('over', 'ADP')\n('the', 'DET')\n('meat', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('.', '.')\n\n('Bake', 'NOUN')\n('it', 'PRON')\n('slowly', 'ADV')\n('for', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Drain', 'VERB')\n('it', 'PRON')\n('from', 'ADP')\n('fat', 'NOUN')\n(',', '.')\n('unbind', 'NN')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('it', 'PRON')\n('with', 'ADP')\n('a', 'DET')\n('good', 'ADJ')\n('brown', 'ADJ')\n('gravy', 'NOUN')\n(',', '.')\n('or', 'CONJ')\n('any', 'DET')\n('sauce', 'NOUN')\n('preferred', 'VERB')\n(',', '.')\n('or', 'CONJ')\n('with', 'ADP')\n('melted', 'VERB')\n('butter', 'NOUN')\n('in', 'ADP')\n('a', 'DET')\n('tureen', 'NN')\n(',', '.')\n('a', 'DET')\n('cut', 'VERB')\n('lemon', 'NOUN')\n('and', 'CONJ')\n('cayenne', 'NOUN')\n('.', '.')\n\n('Mix', 'VB')\n('all', 'PRT')\n('these', 'DET')\n('ingredients', 'NOUN')\n('well', 'ADV')\n(',', '.')\n('and', 'CONJ')\n('rub', 'VERB')\n('them', 'PRON')\n('well', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n(',', '.')\n('particularly', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('holes', 'NOUN')\n(',', '.')\n('adding', 'VERB')\n('occasionally', 'ADV')\n('a', 'DET')\n('little', 'ADJ')\n('salt', 'NOUN')\n('.', '.')\n\n('Chop', 'VB')\n('some', 'DET')\n('suet', 'NN')\n('very', 'ADV')\n('finely', 'ADV')\n(',', '.')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n('with', 'ADP')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('bake', 'NN')\n('it', 'PRON')\n('in', 'ADP')\n('a', 'DET')\n('moderately', 'ADV')\n('heated', 'VERB')\n('oven', 'NOUN')\n(',', '.')\n('from', 'ADP')\n('five', 'NUM')\n('to', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Pour', 'VB')\n('the', 'AT')\n('pickle', 'NOUN')\n('into', 'ADP')\n('a', 'DET')\n('deep', 'ADJ')\n('earthen', 
'NN')\n('jar', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('when', 'ADV')\n('it', 'PRON')\n('is', 'VERB')\n('cold', 'ADJ')\n('lay', 'ADJ')\n('in', 'ADP')\n('the', 'DET')\n('meat', 'NOUN')\n('so', 'ADP')\n('that', 'DET')\n('every', 'DET')\n('part', 'NOUN')\n('is', 'VERB')\n('covered', 'VERB')\n('.', '.')\n\n('Season', 'VB')\n('with', 'ADP')\n('pepper', 'NOUN')\n('and', 'CONJ')\n('salt', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('sprinkle', 'VERB')\n('with', 'ADP')\n('oat-meal', 'NN')\n(';', '.')\n('chop', 'NOUN')\n('a', 'DET')\n('half', 'PRT')\n('handful', 'NOUN')\n('of', 'ADP')\n('parsley', 'NOUN')\n('and', 'CONJ')\n('thyme', 'NN')\n('and', 'CONJ')\n('throw', 'VERB')\n('in', 'ADP')\n(';', '.')\n('boil', 'VERB')\n('a', 'DET')\n('large', 'ADJ')\n('onion', 'NOUN')\n('nearly', 'ADV')\n('tender', 'ADJ')\n(',', '.')\n('chop', 'NOUN')\n('it', 'PRON')\n('and', 'CONJ')\n('mix', 'VERB')\n('it', 'PRON')\n('in', 'ADP')\n(';', '.')\n('add', 'VERB')\n('sufficient', 'ADJ')\n('broth', 'NOUN')\n('or', 'CONJ')\n('skim-milk', 'NN')\n('and', 'CONJ')\n('water', 'NOUN')\n('to', 'PRT')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n(';', '.')\n('let', 'VERB')\n('it', 'PRON')\n('simmer', 'VERB')\n('two', 'NUM')\n('hours', 'NOUN')\n(';', '.')\n('then', 'ADV')\n('thicken', 'VERB')\n('with', 'ADP')\n('a', 'DET')\n('little', 'ADJ')\n('oatmeal', 'NN')\n(',', '.')\n('and', 'CONJ')\n('add', 'VERB')\n('a', 'DET')\n('dessert', 'NOUN')\n('spoonful', 'NOUN')\n('of', 'ADP')\n('mushroom', 'NOUN')\n('or', 'CONJ')\n('walnut', 'NOUN')\n('catsup', 'NOUN')\n(';', '.')\n('stir', 'VERB')\n('well', 'ADV')\n(',', '.')\n('boil', 'VERB')\n('a', 'DET')\n('minute', 'NOUN')\n('and', 'CONJ')\n('serve', 'VERB')\n('with', 'ADP')\n('pieces', 'NOUN')\n('of', 'ADP')\n('bread', 'NOUN')\n('toasted', 'VERB')\n('.', '.')\n\n('Pour', 'VB')\n('it', 'PPS')\n('on', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n('boiling', 'VERB')\n('hot', 'ADJ')\n('and', 'CONJ')\n('cover', 'VERB')\n('closely', 'ADV')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "How well is this working? " | |
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "with open ('text-collection/jsm-collection.txt', 'r', encoding='utf-8') as jsm:\n t = jsm.read()\n\n# remove chapter and section headings\nt = re.sub('\\s+', ' ',\n re.sub(r'[A-Z]{2,}', '',\n re.sub('((?<=[A-Z])\\sI | I\\s(?=[A-Z]))', ' ', t)))\n\njsm_sents = tokenize_text(t)", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "for sent in jsm_sents[:2]:\n for item in ngram_tagger.tag(sent):\n print(item) \n print()", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "('It', 'PRON')\n('seems', 'VERB')\n('proper', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('prefix', 'NN')\n('to', 'PRT')\n('the', 'DET')\n('following', 'VERB')\n('biographical', 'ADJ')\n('sketch', 'NOUN')\n('some', 'DET')\n('mention', 'NOUN')\n('of', 'ADP')\n('the', 'DET')\n('reasons', 'NOUN')\n('which', 'DET')\n('have', 'VERB')\n('made', 'VERB')\n('me', 'PRON')\n('think', 'VERB')\n('it', 'PRON')\n('desirable', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('leave', 'VERB')\n('behind', 'ADP')\n('me', 'PRON')\n('such', 'PRT')\n('a', 'DET')\n('memorial', 'NOUN')\n('of', 'ADP')\n('so', 'ADV')\n('uneventful', 'NN')\n('a', 'DET')\n('life', 'NOUN')\n('as', 'ADP')\n('mine', 'PRON')\n('.', '.')\n\n('I', 'PRON')\n('do', 'VERB')\n('not', 'ADV')\n('for', 'ADP')\n('a', 'DET')\n('moment', 'NOUN')\n('imagine', 'VERB')\n('that', 'ADP')\n('any', 'DET')\n('part', 'NOUN')\n('of', 'ADP')\n('what', 'DET')\n('I', 'PRON')\n('have', 'VERB')\n('to', 'PRT')\n('relate', 'VERB')\n('can', 'VERB')\n('be', 'VERB')\n('interesting', 'ADJ')\n('to', 'PRT')\n('the', 'DET')\n('public', 'ADJ')\n('as', 'ADP')\n('a', 'DET')\n('narrative', 'NOUN')\n('or', 'CONJ')\n('as', 'ADP')\n('being', 'VERB')\n('connected', 'VERB')\n('with', 'ADP')\n('myself', 'PRON')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "jsm_sents[1000:1005]", | |
"execution_count": 9, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[['This', 'distinction', 'at', 'once', 'made', 'my', 'mind', 'clear', 'as', 'to', 'what', 'was', 'perplexing', 'me', 'in', 'respect', 'to', 'the', 'philosophy', 'of', 'politics', '.'], ['I', 'now', 'saw', ',', 'that', 'a', 'science', 'is', 'either', 'deductive', 'or', 'experimental', ',', 'according', 'as', ',', 'in', 'the', 'province', 'it', 'deals', 'with', ',', 'the', 'effects', 'of', 'causes', 'when', 'conjoined', ',', 'are', 'or', 'are', 'not', 'the', 'sums', 'of', 'the', 'effects', 'which', 'the', 'same', 'causes', 'produce', 'when', 'separate', '.'], ['It', 'followed', 'that', 'politics', 'must', 'be', 'a', 'deductive', 'science', '.'], ['It', 'thus', 'appeared', ',', 'that', 'both', 'Macaulay', 'and', 'my', 'father', 'were', 'wrong', ';', 'the', 'one', 'in', 'assimilating', 'the', 'method', 'of', 'philosophizing', 'in', 'politics', 'to', 'the', 'purely', 'experimental', 'method', 'of', 'chemistry', ';', 'while', 'the', 'other', ',', 'though', 'right', 'in', 'adopting', 'a', 'deductive', 'method', ',', 'had', 'made', 'a', 'wrong', 'selection', 'of', 'one', ',', 'having', 'taken', 'as', 'the', 'type', 'of', 'deduction', ',', 'not', 'the', 'appropriate', 'process', ',', 'that', 'of', 'the', 'deductive', 'branches', 'of', 'natural', 'philosophy', ',', 'but', 'the', 'inappropriate', 'one', 'of', 'pure', 'geometry', ',', 'which', ',', 'not', 'being', 'a', 'science', 'of', 'causation', 'at', 'all', ',', 'does', 'not', 'require', 'or', 'admit', 'of', 'any', 'summing-up', 'of', 'effects', '.'], ['A', 'foundation', 'was', 'thus', 'laid', 'in', 'my', 'thoughts', 'for', 'the', 'principal', 'chapters', 'of', 'what', 'I', 'afterwards', 'published', 'on', 'the', 'Logic', 'of', 'the', 'Moral', 'Sciences', ';', 'and', 'my', 'new', 'position', 'in', 'respect', 'to', 'my', 'old', 'political', 'creed', ',', 'now', 'became', 'perfectly', 'definite', '.']]" | |
},
"metadata": {},
"execution_count": 9
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "jsm_sents_tagged = [[('This', 'PRON'), ('distinction', 'NOUN'), ('made', 'VERB'), ('my', 'DET'), ('mind', 'NOUN'), ('clear', 'ADJ'), ('.', '.')],\n [('A', 'DET'), ('foundation', 'NOUN'), ('was', 'VERB'), ('thus', 'ADV'), ('laid', 'VERB'), ('in', 'ADP'), ('my', 'DET'), ('thoughts', 'NOUN'), ('.', '.')],\n [('It', 'PRON'), ('followed', 'VERB'), ('that', 'PRON'), ('politics', 'NOUN'), ('must', 'VERB'), ('be', 'VERB'), ('a', 'DET'), ('deductive', 'ADJ'), ('science', 'NOUN'), ('.', '.')], \n [('A', 'DET'), ('science', 'NOUN'), ('is', 'VERB'), ('either', 'CONJ'), ('deductive', 'ADJ'), ('or', 'CONJ'), ('experimental', 'ADJ'), ('.', '.')]]\n\n\nall_tagged_sents = jsm_sents_tagged + brown_tagged_sents\n\ntrain_sents, test_sents = create_data_sets(all_tagged_sents)\n\nngram_tagger_all_sents = build_backoff_tagger(train_sents)\n\nprint (\"%0.3f\" % ngram_tagger_all_sents.evaluate(test_sents))", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "0.923\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "for sent in jsm_sents[:2]:\n for item in ngram_tagger.tag(sent):\n print(item) \n print()", | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "('It', 'PRON')\n('seems', 'VERB')\n('proper', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('prefix', 'NN')\n('to', 'PRT')\n('the', 'DET')\n('following', 'VERB')\n('biographical', 'ADJ')\n('sketch', 'NOUN')\n('some', 'DET')\n('mention', 'NOUN')\n('of', 'ADP')\n('the', 'DET')\n('reasons', 'NOUN')\n('which', 'DET')\n('have', 'VERB')\n('made', 'VERB')\n('me', 'PRON')\n('think', 'VERB')\n('it', 'PRON')\n('desirable', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('leave', 'VERB')\n('behind', 'ADP')\n('me', 'PRON')\n('such', 'PRT')\n('a', 'DET')\n('memorial', 'NOUN')\n('of', 'ADP')\n('so', 'ADV')\n('uneventful', 'NN')\n('a', 'DET')\n('life', 'NOUN')\n('as', 'ADP')\n('mine', 'PRON')\n('.', '.')\n\n('I', 'PRON')\n('do', 'VERB')\n('not', 'ADV')\n('for', 'ADP')\n('a', 'DET')\n('moment', 'NOUN')\n('imagine', 'VERB')\n('that', 'ADP')\n('any', 'DET')\n('part', 'NOUN')\n('of', 'ADP')\n('what', 'DET')\n('I', 'PRON')\n('have', 'VERB')\n('to', 'PRT')\n('relate', 'VERB')\n('can', 'VERB')\n('be', 'VERB')\n('interesting', 'ADJ')\n('to', 'PRT')\n('the', 'DET')\n('public', 'ADJ')\n('as', 'ADP')\n('a', 'DET')\n('narrative', 'NOUN')\n('or', 'CONJ')\n('as', 'ADP')\n('being', 'VERB')\n('connected', 'VERB')\n('with', 'ADP')\n('myself', 'PRON')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
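{
"metadata": {},
"cell_type": "markdown",
"source": "(An observation, with a hypothetical check below: the hand-labeled sentences above do not contain the words that mis-tagged in the baseline run, such as 'prefix' and 'uneventful', so retraining may leave these two sentences tagged exactly as before. The cell below simply compares the two taggers on the first sentence.)"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# Hypothetical comparison: does the retrained tagger differ from the\n# baseline tagger on the first autobiography sentence?\nprint(ngram_tagger.tag(jsm_sents[0]) == ngram_tagger_all_sents.tag(jsm_sents[0]))",
"execution_count": null,
"outputs": []
},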
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.4.2",
"codemirror_mode": {
"version": 3,
"name": "ipython"
},
"name": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |