{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "In this assignment, you'll see an example in which some text does not tag well, most likely because the training data did not have many examples of the target sentence structure. You'll see the effects of adding a few sentences of training data with the missing sentence structure on the accuracy of the tagger."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "%pprint\n\nimport re\n\nimport nltk\nfrom nltk.corpus import brown\nfrom nltk import word_tokenize",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Pretty printing has been turned OFF\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "First, create datasets and train an ngram backoff tagger as before, using the brown corpus as the training set." | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "def create_data_sets(sentences):\n size = int(len(sentences) * 0.9)\n train_sents = sentences[:size]\n test_sents = sentences[size:]\n return train_sents, test_sents\n\ndef build_backoff_tagger (train_sents):\n t0 = nltk.DefaultTagger('NN')\n t1 = nltk.UnigramTagger(train_sents, backoff=t0)\n t2 = nltk.BigramTagger(train_sents, backoff=t1)\n return t2\n\nbrown_tagged_sents = brown.tagged_sents(categories=['adventure', 'belles_lettres', 'editorial',\n 'fiction', 'government', 'hobbies', 'humor',\n 'learned', 'lore', 'mystery', 'religion',\n 'reviews', 'romance', 'science_fiction'],\n tagset='universal')\n\ntrain_sents, test_sents = create_data_sets(brown_tagged_sents)\n\nngram_tagger = build_backoff_tagger(train_sents)\n\nprint (\"%0.3f\" % ngram_tagger.evaluate(test_sents))", | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "0.923\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
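{
"metadata": {},
"cell_type": "markdown",
"source": "(An aside, assuming the tagger trained above: because the backoff chain ends in `DefaultTagger('NN')`, any word the bigram and unigram taggers never saw in training falls all the way through and comes out tagged 'NN'. The sketch below illustrates this with a made-up sentence; 'zyzzyva' is a hypothetical out-of-vocabulary token, not part of the original assignment.)"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# A minimal check of the backoff chain: an out-of-vocabulary token should\n# fall through the bigram and unigram taggers to the 'NN' default.\nngram_tagger.tag(word_tokenize('Flummox the zyzzyva thoroughly.'))",
"execution_count": null,
"outputs": []
},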
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "Next, read in a file of recipes and tokenize it."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "with open('data/cookbooks.txt', 'r') as text_file:\n cookbooks_corpus = text_file.read()\n\ndef tokenize_text(corpus):\n sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')\n raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences\n \n return [nltk.word_tokenize(word) for word in raw_sents]\n\ncookbook_sents = tokenize_text(cookbooks_corpus)", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Now, in order to see the sentences where errors are occuring, the code below finds sentences that begin with imperatives and prints them out, along with their assigned parts of speech." | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "for sent in cookbook_sents:\n if sent[0] in [\"Wash\", \"Stir\", \"Moisten\", \"Drain\", \"Cook\", \"Pour\", \"Chop\", \"Slice\", \"Season\", \"Mix\", \"Fry\", \"Bake\", \"Roast\", \"Wisk\"]:\n for item in ngram_tagger.tag(sent):\n print(item) \n print()", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "('Wash', 'NN')\n('a', 'DET')\n('quarter', 'NOUN')\n('of', 'ADP')\n('a', 'DET')\n('pound', 'NOUN')\n('of', 'ADP')\n('best', 'ADJ')\n('pearl', 'NN')\n('sago', 'NOUN')\n('thoroughly', 'ADV')\n(',', '.')\n('then', 'ADV')\n('stew', 'NOUN')\n('it', 'PRON')\n('quite', 'ADV')\n('tender', 'ADJ')\n('and', 'CONJ')\n('very', 'ADV')\n('View', 'NN')\n('page', 'NOUN')\n('[', '.')\n('32', 'NUM')\n(']', '.')\n('thick', 'ADJ')\n('in', 'ADP')\n('water', 'NOUN')\n('or', 'CONJ')\n('thick', 'ADJ')\n('broth', 'NOUN')\n(';', '.')\n('(', '.')\n('it', 'PRON')\n('will', 'VERB')\n('require', 'VERB')\n('nearly', 'ADV')\n('or', 'CONJ')\n('quite', 'ADV')\n('a', 'DET')\n('quart', 'NOUN')\n('of', 'ADP')\n('liquid', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('be', 'VERB')\n('poured', 'VERB')\n('to', 'PRT')\n('it', 'PRON')\n('cold', 'ADJ')\n('and', 'CONJ')\n('heated', 'VERB')\n('slowly', 'ADV')\n(';', '.')\n(')', '.')\n('then', 'ADV')\n('mix', 'VERB')\n('gradually', 'ADV')\n('with', 'ADP')\n('it', 'PRON')\n('a', 'DET')\n('pint', 'NOUN')\n('of', 'ADP')\n('good', 'ADJ')\n('boiling', 'VERB')\n('cream', 'NOUN')\n('or', 'CONJ')\n('milk', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('the', 'DET')\n('yolks', 'NN')\n('of', 'ADP')\n('four', 'NUM')\n('fresh', 'ADJ')\n('eggs', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('mingle', 'VERB')\n('the', 'DET')\n('whole', 'ADJ')\n('carefully', 'ADV')\n('with', 'ADP')\n('two', 'NUM')\n('quarts', 'NN')\n('of', 'ADP')\n('strong', 'ADJ')\n('veal', 'NOUN')\n('or', 'CONJ')\n('beef', 'NOUN')\n('stock', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('always', 'ADV')\n('be', 'VERB')\n('kept', 'VERB')\n('ready', 'ADJ')\n('boiling', 'VERB')\n('.', '.')\n\n('Pour', 'NN')\n('it', 'PRON')\n('over', 'PRT')\n('the', 'DET')\n('meat', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('.', '.')\n\n('Bake', 'NOUN')\n('it', 'PRON')\n('slowly', 'ADV')\n('for', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Drain', 'VERB')\n('it', 'PRON')\n('from', 'ADP')\n('fat', 'NOUN')\n(',', '.')\n('unbind', 'NN')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('it', 'PRON')\n('with', 'ADP')\n('a', 'DET')\n('good', 'ADJ')\n('brown', 'ADJ')\n('gravy', 'NOUN')\n(',', '.')\n('or', 'CONJ')\n('any', 'DET')\n('sauce', 'NOUN')\n('preferred', 'VERB')\n(',', '.')\n('or', 'CONJ')\n('with', 'ADP')\n('melted', 'VERB')\n('butter', 'NOUN')\n('in', 'ADP')\n('a', 'DET')\n('tureen', 'NN')\n(',', '.')\n('a', 'DET')\n('cut', 'VERB')\n('lemon', 'NOUN')\n('and', 'CONJ')\n('cayenne', 'NOUN')\n('.', '.')\n\n('Mix', 'NN')\n('all', 'PRT')\n('these', 'DET')\n('ingredients', 'NOUN')\n('well', 'ADV')\n(',', '.')\n('and', 'CONJ')\n('rub', 'VERB')\n('them', 'PRON')\n('well', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n(',', '.')\n('particularly', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('holes', 'NOUN')\n(',', '.')\n('adding', 'VERB')\n('occasionally', 'ADV')\n('a', 'DET')\n('little', 'ADJ')\n('salt', 'NOUN')\n('.', '.')\n\n('Chop', 'NN')\n('some', 'DET')\n('suet', 'NN')\n('very', 'ADV')\n('finely', 'ADV')\n(',', '.')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n('with', 'ADP')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('bake', 'NN')\n('it', 'PRON')\n('in', 'ADP')\n('a', 'DET')\n('moderately', 'ADV')\n('heated', 'VERB')\n('oven', 'NOUN')\n(',', '.')\n('from', 'ADP')\n('five', 'NUM')\n('to', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Pour', 'NN')\n('the', 'DET')\n('pickle', 'NOUN')\n('into', 'ADP')\n('a', 'DET')\n('deep', 'ADJ')\n('earthen', 
'NN')\n('jar', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('when', 'ADV')\n('it', 'PRON')\n('is', 'VERB')\n('cold', 'ADJ')\n('lay', 'ADJ')\n('in', 'ADP')\n('the', 'DET')\n('meat', 'NOUN')\n('so', 'ADP')\n('that', 'DET')\n('every', 'DET')\n('part', 'NOUN')\n('is', 'VERB')\n('covered', 'VERB')\n('.', '.')\n\n('Season', 'NN')\n('with', 'ADP')\n('pepper', 'NOUN')\n('and', 'CONJ')\n('salt', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('sprinkle', 'VERB')\n('with', 'ADP')\n('oat-meal', 'NN')\n(';', '.')\n('chop', 'NOUN')\n('a', 'DET')\n('half', 'PRT')\n('handful', 'NOUN')\n('of', 'ADP')\n('parsley', 'NOUN')\n('and', 'CONJ')\n('thyme', 'NN')\n('and', 'CONJ')\n('throw', 'VERB')\n('in', 'ADP')\n(';', '.')\n('boil', 'VERB')\n('a', 'DET')\n('large', 'ADJ')\n('onion', 'NOUN')\n('nearly', 'ADV')\n('tender', 'ADJ')\n(',', '.')\n('chop', 'NOUN')\n('it', 'PRON')\n('and', 'CONJ')\n('mix', 'VERB')\n('it', 'PRON')\n('in', 'ADP')\n(';', '.')\n('add', 'VERB')\n('sufficient', 'ADJ')\n('broth', 'NOUN')\n('or', 'CONJ')\n('skim-milk', 'NN')\n('and', 'CONJ')\n('water', 'NOUN')\n('to', 'PRT')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n(';', '.')\n('let', 'VERB')\n('it', 'PRON')\n('simmer', 'VERB')\n('two', 'NUM')\n('hours', 'NOUN')\n(';', '.')\n('then', 'ADV')\n('thicken', 'VERB')\n('with', 'ADP')\n('a', 'DET')\n('little', 'ADJ')\n('oatmeal', 'NN')\n(',', '.')\n('and', 'CONJ')\n('add', 'VERB')\n('a', 'DET')\n('dessert', 'NOUN')\n('spoonful', 'NOUN')\n('of', 'ADP')\n('mushroom', 'NOUN')\n('or', 'CONJ')\n('walnut', 'NOUN')\n('catsup', 'NOUN')\n(';', '.')\n('stir', 'VERB')\n('well', 'ADV')\n(',', '.')\n('boil', 'VERB')\n('a', 'DET')\n('minute', 'NOUN')\n('and', 'CONJ')\n('serve', 'VERB')\n('with', 'ADP')\n('pieces', 'NOUN')\n('of', 'ADP')\n('bread', 'NOUN')\n('toasted', 'VERB')\n('.', '.')\n\n('Pour', 'NN')\n('it', 'PRON')\n('on', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n('boiling', 'VERB')\n('hot', 'ADJ')\n('and', 'CONJ')\n('cover', 'VERB')\n('closely', 'ADV')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Notice that most of the initial words are incorrectly tagged as nouns rather than verbs. How can we fix this? One way is to label a few rather generic sentences with the structure we are interested in, add them to the start of the training data, and then retrain the tagger." | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "cooking_action_sents = [[('Strain', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Mix', 'VB'), ('them', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Season', 'VB'), ('them', 'PPS'), ('with', 'IN'), ('pepper', 'NN'), ('.', '.')], \n [('Wash', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Chop', 'VB'), ('the', 'AT'), ('greens', 'NNS'), ('.', '.')],\n [('Slice', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],\n [('Bake', 'VB'), ('the', 'AT'), ('cake', 'NN'), ('.', '.')],\n [('Pour', 'VB'), ('into', 'IN'), ('a', 'AT'), ('mold', 'NN'), ('.', '.')],\n [('Stir', 'VB'), ('the', 'AT'), ('mixture', 'NN'), ('.', '.')],\n [('Moisten', 'VB'), ('the', 'AT'), ('grains', 'NNS'), ('.', '.')],\n [('Cook', 'VB'), ('the', 'AT'), ('duck', 'NN'), ('.', '.')],\n [('Drain', 'VB'), ('for', 'IN'), ('one', 'CD'), ('day', 'NN'), ('.', '.')]]\n\n\nall_tagged_sents = cooking_action_sents + brown_tagged_sents\n\ntrain_sents, test_sents = create_data_sets(all_tagged_sents)\n\nngram_tagger_all_sents = build_backoff_tagger(train_sents)\n\nprint (\"%0.3f\" % ngram_tagger_all_sents.evaluate(test_sents))", | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "0.923\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
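{
"metadata": {},
"cell_type": "markdown",
"source": "(A sketch, not part of the original assignment: for tag consistency one could instead label the seed sentences with the universal tagset, since the rest of the training data uses universal tags. The cell below shows a hypothetical variant with a few of the same sentences relabeled; only the tags change.)"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# Hypothetical variant: seed sentences labeled with universal tags\n# ('VERB', 'PRON', 'ADV', ...) so they match the tagset of brown_tagged_sents.\ncooking_action_sents_universal = [[('Strain', 'VERB'), ('it', 'PRON'), ('well', 'ADV'), ('.', '.')],\n                                  [('Mix', 'VERB'), ('them', 'PRON'), ('well', 'ADV'), ('.', '.')],\n                                  [('Season', 'VERB'), ('them', 'PRON'), ('with', 'ADP'), ('pepper', 'NOUN'), ('.', '.')],\n                                  [('Wash', 'VERB'), ('it', 'PRON'), ('well', 'ADV'), ('.', '.')],\n                                  [('Chop', 'VERB'), ('the', 'DET'), ('greens', 'NOUN'), ('.', '.')],\n                                  [('Bake', 'VERB'), ('the', 'DET'), ('cake', 'NOUN'), ('.', '.')]]\n\n# retrain exactly as above, but with tag-consistent seed data\ntrain_sents_u, test_sents_u = create_data_sets(cooking_action_sents_universal + brown_tagged_sents)\ntagger_universal = build_backoff_tagger(train_sents_u)",
"execution_count": null,
"outputs": []
},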
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "for sent in cookbook_sents:\n    if sent[0] in [\"Wash\", \"Stir\", \"Moisten\", \"Drain\", \"Cook\", \"Pour\", \"Chop\", \"Slice\", \"Season\", \"Mix\", \"Fry\", \"Bake\", \"Roast\", \"Wisk\"]:\n        for item in ngram_tagger_all_sents.tag(sent):\n            print(item)\n        print()",
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": "('Wash', 'VB')\n('a', 'DET')\n('quarter', 'NOUN')\n('of', 'ADP')\n('a', 'DET')\n('pound', 'NOUN')\n('of', 'ADP')\n('best', 'ADJ')\n('pearl', 'NN')\n('sago', 'NOUN')\n('thoroughly', 'ADV')\n(',', '.')\n('then', 'ADV')\n('stew', 'NOUN')\n('it', 'PRON')\n('quite', 'ADV')\n('tender', 'ADJ')\n('and', 'CONJ')\n('very', 'ADV')\n('View', 'NN')\n('page', 'NOUN')\n('[', '.')\n('32', 'NUM')\n(']', '.')\n('thick', 'ADJ')\n('in', 'ADP')\n('water', 'NOUN')\n('or', 'CONJ')\n('thick', 'ADJ')\n('broth', 'NOUN')\n(';', '.')\n('(', '.')\n('it', 'PRON')\n('will', 'VERB')\n('require', 'VERB')\n('nearly', 'ADV')\n('or', 'CONJ')\n('quite', 'ADV')\n('a', 'DET')\n('quart', 'NOUN')\n('of', 'ADP')\n('liquid', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('be', 'VERB')\n('poured', 'VERB')\n('to', 'PRT')\n('it', 'PRON')\n('cold', 'ADJ')\n('and', 'CONJ')\n('heated', 'VERB')\n('slowly', 'ADV')\n(';', '.')\n(')', '.')\n('then', 'ADV')\n('mix', 'VERB')\n('gradually', 'ADV')\n('with', 'ADP')\n('it', 'PRON')\n('a', 'DET')\n('pint', 'NOUN')\n('of', 'ADP')\n('good', 'ADJ')\n('boiling', 'VERB')\n('cream', 'NOUN')\n('or', 'CONJ')\n('milk', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('the', 'DET')\n('yolks', 'NN')\n('of', 'ADP')\n('four', 'NUM')\n('fresh', 'ADJ')\n('eggs', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('mingle', 'VERB')\n('the', 'DET')\n('whole', 'ADJ')\n('carefully', 'ADV')\n('with', 'ADP')\n('two', 'NUM')\n('quarts', 'NN')\n('of', 'ADP')\n('strong', 'ADJ')\n('veal', 'NOUN')\n('or', 'CONJ')\n('beef', 'NOUN')\n('stock', 'NOUN')\n(',', '.')\n('which', 'DET')\n('should', 'VERB')\n('always', 'ADV')\n('be', 'VERB')\n('kept', 'VERB')\n('ready', 'ADJ')\n('boiling', 'VERB')\n('.', '.')\n\n('Pour', 'VB')\n('it', 'PPS')\n('over', 'ADP')\n('the', 'DET')\n('meat', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('.', '.')\n\n('Bake', 'NOUN')\n('it', 'PRON')\n('slowly', 'ADV')\n('for', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Drain', 'VERB')\n('it', 'PRON')\n('from', 'ADP')\n('fat', 'NOUN')\n(',', '.')\n('unbind', 'NN')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('serve', 'VERB')\n('it', 'PRON')\n('with', 'ADP')\n('a', 'DET')\n('good', 'ADJ')\n('brown', 'ADJ')\n('gravy', 'NOUN')\n(',', '.')\n('or', 'CONJ')\n('any', 'DET')\n('sauce', 'NOUN')\n('preferred', 'VERB')\n(',', '.')\n('or', 'CONJ')\n('with', 'ADP')\n('melted', 'VERB')\n('butter', 'NOUN')\n('in', 'ADP')\n('a', 'DET')\n('tureen', 'NN')\n(',', '.')\n('a', 'DET')\n('cut', 'VERB')\n('lemon', 'NOUN')\n('and', 'CONJ')\n('cayenne', 'NOUN')\n('.', '.')\n\n('Mix', 'VB')\n('all', 'PRT')\n('these', 'DET')\n('ingredients', 'NOUN')\n('well', 'ADV')\n(',', '.')\n('and', 'CONJ')\n('rub', 'VERB')\n('them', 'PRON')\n('well', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n(',', '.')\n('particularly', 'ADV')\n('into', 'ADP')\n('the', 'DET')\n('holes', 'NOUN')\n(',', '.')\n('adding', 'VERB')\n('occasionally', 'ADV')\n('a', 'DET')\n('little', 'ADJ')\n('salt', 'NOUN')\n('.', '.')\n\n('Chop', 'VB')\n('some', 'DET')\n('suet', 'NN')\n('very', 'ADV')\n('finely', 'ADV')\n(',', '.')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n('with', 'ADP')\n('it', 'PRON')\n(',', '.')\n('and', 'CONJ')\n('bake', 'NN')\n('it', 'PRON')\n('in', 'ADP')\n('a', 'DET')\n('moderately', 'ADV')\n('heated', 'VERB')\n('oven', 'NOUN')\n(',', '.')\n('from', 'ADP')\n('five', 'NUM')\n('to', 'ADP')\n('six', 'NUM')\n('hours', 'NOUN')\n('.', '.')\n\n('Pour', 'VB')\n('the', 'AT')\n('pickle', 'NOUN')\n('into', 'ADP')\n('a', 'DET')\n('deep', 'ADJ')\n('earthen', 
'NN')\n('jar', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('when', 'ADV')\n('it', 'PRON')\n('is', 'VERB')\n('cold', 'ADJ')\n('lay', 'ADJ')\n('in', 'ADP')\n('the', 'DET')\n('meat', 'NOUN')\n('so', 'ADP')\n('that', 'DET')\n('every', 'DET')\n('part', 'NOUN')\n('is', 'VERB')\n('covered', 'VERB')\n('.', '.')\n\n('Season', 'VB')\n('with', 'ADP')\n('pepper', 'NOUN')\n('and', 'CONJ')\n('salt', 'NOUN')\n(',', '.')\n('and', 'CONJ')\n('sprinkle', 'VERB')\n('with', 'ADP')\n('oat-meal', 'NN')\n(';', '.')\n('chop', 'NOUN')\n('a', 'DET')\n('half', 'PRT')\n('handful', 'NOUN')\n('of', 'ADP')\n('parsley', 'NOUN')\n('and', 'CONJ')\n('thyme', 'NN')\n('and', 'CONJ')\n('throw', 'VERB')\n('in', 'ADP')\n(';', '.')\n('boil', 'VERB')\n('a', 'DET')\n('large', 'ADJ')\n('onion', 'NOUN')\n('nearly', 'ADV')\n('tender', 'ADJ')\n(',', '.')\n('chop', 'NOUN')\n('it', 'PRON')\n('and', 'CONJ')\n('mix', 'VERB')\n('it', 'PRON')\n('in', 'ADP')\n(';', '.')\n('add', 'VERB')\n('sufficient', 'ADJ')\n('broth', 'NOUN')\n('or', 'CONJ')\n('skim-milk', 'NN')\n('and', 'CONJ')\n('water', 'NOUN')\n('to', 'PRT')\n('cover', 'VERB')\n('the', 'DET')\n('beef', 'NOUN')\n(';', '.')\n('let', 'VERB')\n('it', 'PRON')\n('simmer', 'VERB')\n('two', 'NUM')\n('hours', 'NOUN')\n(';', '.')\n('then', 'ADV')\n('thicken', 'VERB')\n('with', 'ADP')\n('a', 'DET')\n('little', 'ADJ')\n('oatmeal', 'NN')\n(',', '.')\n('and', 'CONJ')\n('add', 'VERB')\n('a', 'DET')\n('dessert', 'NOUN')\n('spoonful', 'NOUN')\n('of', 'ADP')\n('mushroom', 'NOUN')\n('or', 'CONJ')\n('walnut', 'NOUN')\n('catsup', 'NOUN')\n(';', '.')\n('stir', 'VERB')\n('well', 'ADV')\n(',', '.')\n('boil', 'VERB')\n('a', 'DET')\n('minute', 'NOUN')\n('and', 'CONJ')\n('serve', 'VERB')\n('with', 'ADP')\n('pieces', 'NOUN')\n('of', 'ADP')\n('bread', 'NOUN')\n('toasted', 'VERB')\n('.', '.')\n\n('Pour', 'VB')\n('it', 'PPS')\n('on', 'ADP')\n('the', 'DET')\n('beef', 'NOUN')\n('boiling', 'VERB')\n('hot', 'ADJ')\n('and', 'CONJ')\n('cover', 'VERB')\n('closely', 'ADV')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "How well is this working? " | |
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "with open ('text-collection/jsm-collection.txt', 'r', encoding='utf-8') as jsm:\n t = jsm.read()\n\n# remove chapter and section headings\nt = re.sub('\\s+', ' ',\n re.sub(r'[A-Z]{2,}', '',\n re.sub('((?<=[A-Z])\\sI | I\\s(?=[A-Z]))', ' ', t)))\n\njsm_sents = tokenize_text(t)", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "for sent in jsm_sents[:2]:\n for item in ngram_tagger.tag(sent):\n print(item) \n print()", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "('It', 'PRON')\n('seems', 'VERB')\n('proper', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('prefix', 'NN')\n('to', 'PRT')\n('the', 'DET')\n('following', 'VERB')\n('biographical', 'ADJ')\n('sketch', 'NOUN')\n('some', 'DET')\n('mention', 'NOUN')\n('of', 'ADP')\n('the', 'DET')\n('reasons', 'NOUN')\n('which', 'DET')\n('have', 'VERB')\n('made', 'VERB')\n('me', 'PRON')\n('think', 'VERB')\n('it', 'PRON')\n('desirable', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('leave', 'VERB')\n('behind', 'ADP')\n('me', 'PRON')\n('such', 'PRT')\n('a', 'DET')\n('memorial', 'NOUN')\n('of', 'ADP')\n('so', 'ADV')\n('uneventful', 'NN')\n('a', 'DET')\n('life', 'NOUN')\n('as', 'ADP')\n('mine', 'PRON')\n('.', '.')\n\n('I', 'PRON')\n('do', 'VERB')\n('not', 'ADV')\n('for', 'ADP')\n('a', 'DET')\n('moment', 'NOUN')\n('imagine', 'VERB')\n('that', 'ADP')\n('any', 'DET')\n('part', 'NOUN')\n('of', 'ADP')\n('what', 'DET')\n('I', 'PRON')\n('have', 'VERB')\n('to', 'PRT')\n('relate', 'VERB')\n('can', 'VERB')\n('be', 'VERB')\n('interesting', 'ADJ')\n('to', 'PRT')\n('the', 'DET')\n('public', 'ADJ')\n('as', 'ADP')\n('a', 'DET')\n('narrative', 'NOUN')\n('or', 'CONJ')\n('as', 'ADP')\n('being', 'VERB')\n('connected', 'VERB')\n('with', 'ADP')\n('myself', 'PRON')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "jsm_sents[1000:1005]", | |
"execution_count": 9, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[['This', 'distinction', 'at', 'once', 'made', 'my', 'mind', 'clear', 'as', 'to', 'what', 'was', 'perplexing', 'me', 'in', 'respect', 'to', 'the', 'philosophy', 'of', 'politics', '.'], ['I', 'now', 'saw', ',', 'that', 'a', 'science', 'is', 'either', 'deductive', 'or', 'experimental', ',', 'according', 'as', ',', 'in', 'the', 'province', 'it', 'deals', 'with', ',', 'the', 'effects', 'of', 'causes', 'when', 'conjoined', ',', 'are', 'or', 'are', 'not', 'the', 'sums', 'of', 'the', 'effects', 'which', 'the', 'same', 'causes', 'produce', 'when', 'separate', '.'], ['It', 'followed', 'that', 'politics', 'must', 'be', 'a', 'deductive', 'science', '.'], ['It', 'thus', 'appeared', ',', 'that', 'both', 'Macaulay', 'and', 'my', 'father', 'were', 'wrong', ';', 'the', 'one', 'in', 'assimilating', 'the', 'method', 'of', 'philosophizing', 'in', 'politics', 'to', 'the', 'purely', 'experimental', 'method', 'of', 'chemistry', ';', 'while', 'the', 'other', ',', 'though', 'right', 'in', 'adopting', 'a', 'deductive', 'method', ',', 'had', 'made', 'a', 'wrong', 'selection', 'of', 'one', ',', 'having', 'taken', 'as', 'the', 'type', 'of', 'deduction', ',', 'not', 'the', 'appropriate', 'process', ',', 'that', 'of', 'the', 'deductive', 'branches', 'of', 'natural', 'philosophy', ',', 'but', 'the', 'inappropriate', 'one', 'of', 'pure', 'geometry', ',', 'which', ',', 'not', 'being', 'a', 'science', 'of', 'causation', 'at', 'all', ',', 'does', 'not', 'require', 'or', 'admit', 'of', 'any', 'summing-up', 'of', 'effects', '.'], ['A', 'foundation', 'was', 'thus', 'laid', 'in', 'my', 'thoughts', 'for', 'the', 'principal', 'chapters', 'of', 'what', 'I', 'afterwards', 'published', 'on', 'the', 'Logic', 'of', 'the', 'Moral', 'Sciences', ';', 'and', 'my', 'new', 'position', 'in', 'respect', 'to', 'my', 'old', 'political', 'creed', ',', 'now', 'became', 'perfectly', 'definite', '.']]" | |
},
"metadata": {},
"execution_count": 9
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "jsm_sents_tagged = [[('This', 'PRON'), ('distinction', 'NOUN'), ('made', 'VERB'), ('my', 'DET'), ('mind', 'NOUN'), ('clear', 'ADJ'), ('.', '.')],\n [('A', 'DET'), ('foundation', 'NOUN'), ('was', 'VERB'), ('thus', 'ADV'), ('laid', 'VERB'), ('in', 'ADP'), ('my', 'DET'), ('thoughts', 'NOUN'), ('.', '.')],\n [('It', 'PRON'), ('followed', 'VERB'), ('that', 'PRON'), ('politics', 'NOUN'), ('must', 'VERB'), ('be', 'VERB'), ('a', 'DET'), ('deductive', 'ADJ'), ('science', 'NOUN'), ('.', '.')], \n [('A', 'DET'), ('science', 'NOUN'), ('is', 'VERB'), ('either', 'CONJ'), ('deductive', 'ADJ'), ('or', 'CONJ'), ('experimental', 'ADJ'), ('.', '.')]]\n\n\nall_tagged_sents = jsm_sents_tagged + brown_tagged_sents\n\ntrain_sents, test_sents = create_data_sets(all_tagged_sents)\n\nngram_tagger_all_sents = build_backoff_tagger(train_sents)\n\nprint (\"%0.3f\" % ngram_tagger_all_sents.evaluate(test_sents))", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "0.923\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "for sent in jsm_sents[:2]:\n for item in ngram_tagger.tag(sent):\n print(item) \n print()", | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "('It', 'PRON')\n('seems', 'VERB')\n('proper', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('prefix', 'NN')\n('to', 'PRT')\n('the', 'DET')\n('following', 'VERB')\n('biographical', 'ADJ')\n('sketch', 'NOUN')\n('some', 'DET')\n('mention', 'NOUN')\n('of', 'ADP')\n('the', 'DET')\n('reasons', 'NOUN')\n('which', 'DET')\n('have', 'VERB')\n('made', 'VERB')\n('me', 'PRON')\n('think', 'VERB')\n('it', 'PRON')\n('desirable', 'ADJ')\n('that', 'ADP')\n('I', 'PRON')\n('should', 'VERB')\n('leave', 'VERB')\n('behind', 'ADP')\n('me', 'PRON')\n('such', 'PRT')\n('a', 'DET')\n('memorial', 'NOUN')\n('of', 'ADP')\n('so', 'ADV')\n('uneventful', 'NN')\n('a', 'DET')\n('life', 'NOUN')\n('as', 'ADP')\n('mine', 'PRON')\n('.', '.')\n\n('I', 'PRON')\n('do', 'VERB')\n('not', 'ADV')\n('for', 'ADP')\n('a', 'DET')\n('moment', 'NOUN')\n('imagine', 'VERB')\n('that', 'ADP')\n('any', 'DET')\n('part', 'NOUN')\n('of', 'ADP')\n('what', 'DET')\n('I', 'PRON')\n('have', 'VERB')\n('to', 'PRT')\n('relate', 'VERB')\n('can', 'VERB')\n('be', 'VERB')\n('interesting', 'ADJ')\n('to', 'PRT')\n('the', 'DET')\n('public', 'ADJ')\n('as', 'ADP')\n('a', 'DET')\n('narrative', 'NOUN')\n('or', 'CONJ')\n('as', 'ADP')\n('being', 'VERB')\n('connected', 'VERB')\n('with', 'ADP')\n('myself', 'PRON')\n('.', '.')\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
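{
"metadata": {},
"cell_type": "markdown",
"source": "(An observation, with a hypothetical check below: the hand-labeled sentences above do not contain the words that mis-tagged in the baseline run, such as 'prefix' and 'uneventful', so retraining may leave these two sentences tagged exactly as before. The cell below simply compares the two taggers on the first sentence.)"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# Hypothetical comparison: does the retrained tagger differ from the\n# baseline tagger on the first autobiography sentence?\nprint(ngram_tagger.tag(jsm_sents[0]) == ngram_tagger_all_sents.tag(jsm_sents[0]))",
"execution_count": null,
"outputs": []
},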
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.4.2",
"codemirror_mode": {
"version": 3,
"name": "ipython"
},
"name": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |