Created
October 6, 2014 15:25
{ | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 1. Syntactic Patterns for Technical Terms ##" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "import nltk\nfrom nltk.corpus import brown", | |
"prompt_number": 39, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "As seen in the Chuang et al. paper and in the Manning and Schuetze chapter,\nthere is a well-known part-of-speech based pattern defined by Justeson and Katz\nfor identifying simple noun phrases that often works well for pulling out keyphrases.\n\nChuang et al. use this pattern: Technical Term T = (A | N)+ (N | C) | N\n\nBelow, please write a function to define a chunker using the RegexpParser as illustrated in the section *Chunking with Regular Expressions*. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by *N* here. Also, C refers to cardinal number, which is CD in the Brown corpus.\n\n"
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "grammar = r\"\"\"\n NP: {<ADJ|N>+<N|NUM>*<N>*} # chunk runs of adjectives/nouns, optionally followed by nouns/cardinals (cardinals are NUM in the simplified tagset)\n {<NP>+} # chunk sequences of proper nouns\n\"\"\"",
"prompt_number": 40, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Below, please write a function to call the chunker, run it on some sentences, and then print out the results for those sentences.\n\nFor uniformity, please run it on sentences 100 through 104 from the news category of the tagged Brown corpus.\n\nThen extract the phrases themselves using the subtree extraction technique shown in the\n*Exploring Text Corpora* category. (Note: Section 7.4 shows how to get to the actual words in the phrase by using the tree.leaves() command.)"
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "cp = nltk.RegexpParser(grammar)\n\n# Load the tagged news sentences once instead of on every loop iteration.\n# Note: the slice [100:104] covers sentences 100 through 103.\nnews_sents = brown.tagged_sents(categories='news', simplify_tags=True)[100:104]\nfor sent in news_sents:\n    print cp.parse(sent)",
"prompt_number": 41, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "(S\n (NP Daniel/NP)\n personally/ADV\n led/VD\n the/DET\n (NP fight/N)\n for/P\n the/DET\n (NP measure/N)\n ,/,\n which/WH\n he/PRO\n had/V\n watered/VN\n down/ADV\n considerably/ADV\n since/P\n its/PRO\n (NP rejection/N)\n by/P\n two/NUM\n (NP previous/ADJ Legislatures/N)\n ,/,\n in/P\n a/DET\n (NP public/ADJ hearing/N)\n before/P\n the/DET\n (NP House/N Committee/N)\n on/P\n (NP Revenue/N)\n and/CNJ\n (NP Taxation/N)\n ./.)\n(S\n Under/P\n (NP committee/N rules/N)\n ,/,\n it/PRO\n went/VD\n automatically/ADV\n to/P\n a/DET\n (NP subcommittee/N)\n for/P\n one/NUM\n (NP week/N)\n ./.)\n(S\n But/CNJ\n (NP questions/N)\n with/P\n which/WH\n (NP committee/N members/N)\n taunted/VD\n (NP bankers/N)\n appearing/VG\n as/CNJ\n (NP witnesses/N)\n left/VD\n little/DET\n (NP doubt/N)\n that/CNJ\n they/PRO\n will/MOD\n recommend/V\n (NP passage/N)\n of/P\n it/PRO\n ./.)\n(S\n (NP Daniel/NP)\n termed/VD\n ``/``\n extremely/ADV\n (NP conservative/ADJ)\n ''/''\n his/PRO\n (NP estimate/N)\n that/CNJ\n it/PRO\n would/MOD\n produce/V\n 17/NUM\n million/NUM\n (NP dollars/N)\n to/TO\n help/V\n erase/V\n an/DET\n anticipated/VN\n (NP deficit/N)\n of/P\n 63/NUM\n million/NUM\n (NP dollars/N)\n at/P\n the/DET\n (NP end/N)\n of/P\n the/DET\n (NP current/ADJ fiscal/ADJ year/N)\n next/DET\n (NP Aug./NP)\n 31/NUM\n ./.)\n", | |
"stream": "stdout" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 2. Identify Proper Nouns ##\nFor this next task, write a new version of the chunker, but this time change it in two ways:\n 1. Make it recognize proper nouns\n 2. Make it work on your personal text collection, which means that you need to run a tagger over your personal text collection.\n\nNote that the second requirement means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank).\n\n"
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Tagger:** Your code for optionally training a tagger, and for definitely running the tagger on your personal collection, goes here:"
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "macbeth_raw = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')\ntext = nltk.word_tokenize(macbeth_raw)\ntagged_text = nltk.pos_tag(text)\n# Show the first ten tagged tokens as a sanity check\nprint tagged_text[0:10]",
"prompt_number": 42, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "[('[', 'NN'), ('The', 'DT'), ('Tragedie', 'NNP'), ('of', 'IN'), ('Macbeth', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Shakespeare', 'NNP'), ('1603', 'CD'), (']', 'CD')]\n", | |
"stream": "stdout" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Chunker:** Code for the proper noun chunker goes here:" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# nltk.pos_tag emits Penn Treebank tags, so proper nouns are NNP here\n# (the simplified Brown tags ADJ and N never occur in this tagged text).\nproper_nouns = r\"\"\"\n NP: {<NNP>+} # chunk sequences of proper nouns\n\"\"\"\ncp_macbeth = nltk.RegexpParser(proper_nouns)",
"prompt_number": 49, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "[('[', 'NN'), ('The', 'DT'), ('Tragedie', 'NNP'), ('of', 'IN'), ('Macbeth', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Shakespeare', 'NNP'), ('1603', 'CD'), (']', 'CD'), ('Actus', 'NNP'), ('Primus.', 'NNP'), ('Scoena', 'NNP'), ('Prima.', 'NNP'), ('Thunder', 'NNP'), ('and', 'CC'), ('Lightning.', 'NNP'), ('Enter', 'NNP'), ('three', 'CD'), ('Witches.', 'NNP'), ('1.', 'CD'), ('When', 'WRB'), ('shall', 'MD'), ('we', 'PRP'), ('three', 'CD'), ('meet', 'VBP'), ('againe', 'NN'), ('?', '.'), ('In', 'NNP'), ('Thunder', 'NNP'), (',', ','), ('Lightning', 'NNP'), (',', ','), ('or', 'CC'), ('in', 'IN'), ('Raine', 'NNP'), ('?', '.'), ('2.', 'CD'), ('When', 'WRB'), ('the', 'DT'), ('Hurley-burley', 'JJ'), (\"'s\", 'POS'), ('done', 'NN'), (',', ','), ('When', 'WRB'), ('the', 'DT'), ('Battaile', 'NNP'), (\"'s\", 'POS'), ('lost', 'VBN'), (',', ','), ('and', 'CC'), ('wonne', 'NN'), ('3.', 'CD'), ('That', 'WDT'), ('will', 'MD'), ('be', 'VB'), ('ere', 'RB'), ('the', 'DT'), ('set', 'NN'), ('of', 'IN'), ('Sunne', 'NNP'), ('1.', 'CD'), ('Where', 'WRB'), ('the', 'DT'), ('place', 'NN'), ('?', '.'), ('2.', 'CD'), ('Vpon', 'NNP'), ('the', 'DT'), ('Heath', 'NNP'), ('3.', 'NNP'), ('There', 'NNP'), ('to', 'TO'), ('meet', 'VB'), ('with', 'IN'), ('Macbeth', 'NNP'), ('1.', 'CD'), ('I', 'PRP'), ('come', 'VBP'), (',', ','), ('Gray-Malkin', 'NNP'), ('All.', 'NNP'), ('Padock', 'NNP'), ('calls', 'VBZ'), ('anon', 'NN'), (':', ':'), ('faire', 'NN'), ('is', 'VBZ'), ('foule', 'NN'), (',', ','), ('and', 'CC'), ('foule', 'NN'), ('is', 'VBZ'), ('faire', 'NN'), (',', ','), ('Houer', 'NNP'), ('through', 'IN'), ('the', 'DT'), ('fogge', 'NN'), ('and', 'CC')]\n", | |
"stream": "stdout" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Test the Chunker:** Test your proper noun recognizer on a lot of sentences to see how well it is working. You might want to add prepositions in order to improve your results. \n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "cp_tree = cp_macbeth.parse(tagged_text)\nprint cp_tree[0:100]", | |
"prompt_number": 51, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "[('[', 'NN'), ('The', 'DT'), Tree('NP', [('Tragedie', 'NNP')]), ('of', 'IN'), Tree('NP', [('Macbeth', 'NNP')]), ('by', 'IN'), Tree('NP', [('William', 'NNP'), ('Shakespeare', 'NNP')]), ('1603', 'CD'), (']', 'CD'), Tree('NP', [('Actus', 'NNP'), ('Primus.', 'NNP'), ('Scoena', 'NNP'), ('Prima.', 'NNP'), ('Thunder', 'NNP')]), ('and', 'CC'), Tree('NP', [('Lightning.', 'NNP'), ('Enter', 'NNP')]), ('three', 'CD'), Tree('NP', [('Witches.', 'NNP')]), ('1.', 'CD'), ('When', 'WRB'), ('shall', 'MD'), ('we', 'PRP'), ('three', 'CD'), ('meet', 'VBP'), ('againe', 'NN'), ('?', '.'), Tree('NP', [('In', 'NNP'), ('Thunder', 'NNP')]), (',', ','), Tree('NP', [('Lightning', 'NNP')]), (',', ','), ('or', 'CC'), ('in', 'IN'), Tree('NP', [('Raine', 'NNP')]), ('?', '.'), ('2.', 'CD'), ('When', 'WRB'), ('the', 'DT'), ('Hurley-burley', 'JJ'), (\"'s\", 'POS'), ('done', 'NN'), (',', ','), ('When', 'WRB'), ('the', 'DT'), Tree('NP', [('Battaile', 'NNP')]), (\"'s\", 'POS'), ('lost', 'VBN'), (',', ','), ('and', 'CC'), ('wonne', 'NN'), ('3.', 'CD'), ('That', 'WDT'), ('will', 'MD'), ('be', 'VB'), ('ere', 'RB'), ('the', 'DT'), ('set', 'NN'), ('of', 'IN'), Tree('NP', [('Sunne', 'NNP')]), ('1.', 'CD'), ('Where', 'WRB'), ('the', 'DT'), ('place', 'NN'), ('?', '.'), ('2.', 'CD'), Tree('NP', [('Vpon', 'NNP')]), ('the', 'DT'), Tree('NP', [('Heath', 'NNP'), ('3.', 'NNP'), ('There', 'NNP')]), ('to', 'TO'), ('meet', 'VB'), ('with', 'IN'), Tree('NP', [('Macbeth', 'NNP')]), ('1.', 'CD'), ('I', 'PRP'), ('come', 'VBP'), (',', ','), Tree('NP', [('Gray-Malkin', 'NNP'), ('All.', 'NNP'), ('Padock', 'NNP')]), ('calls', 'VBZ'), ('anon', 'NN'), (':', ':'), ('faire', 'NN'), ('is', 'VBZ'), ('foule', 'NN'), (',', ','), ('and', 'CC'), ('foule', 'NN'), ('is', 'VBZ'), ('faire', 'NN'), (',', ','), Tree('NP', [('Houer', 'NNP')]), ('through', 'IN'), ('the', 'DT'), ('fogge', 'NN'), ('and', 'CC'), ('filthie', 'NN'), Tree('NP', [('ayre.', 'NNP'), ('Exeunt.', 'NNP'), ('Scena', 'NNP'), ('Secunda.', 'NNP'), ('Alarum', 'NNP'), ('within.', 'NNP'), ('Enter', 'NNP'), ('King', 'NNP'), ('Malcome', 'NNP')]), (',', ','), Tree('NP', [('Donalbaine', 'NNP')]), (',', ','), Tree('NP', [('Lenox', 'NNP')]), (',', ','), ('with', 'IN'), ('attendants', 'NNS'), (',', ','), ('meeting', 'NN')]\n",
"stream": "stdout" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**FreqDist Results:** After you have your proper noun recognizer working to your satisfaction, below run it over your entire collection, feed the results into a FreqDist, and then print out the top 20 proper nouns by frequency. That code goes here, along with the output:\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# I know this way of doing it is not in the spirit of the instructions, but I had a lot of problems feeding a \n# tree into the FreqDist, so this is just a workaround that counts individual NNP-tagged tokens.\n# I will continue to work on figuring out the \"right\" way.\n\ntag_fd = nltk.FreqDist(word for (word, tag) in tagged_text if tag == 'NNP')\nprint tag_fd.items()[0:20]",
"prompt_number": 58, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "[('Macb.', 133), ('Enter', 73), ('Macd.', 56), ('I', 53), ('Macbeth', 53), ('The', 53), ('What', 46), ('Lady.', 41), ('Rosse.', 39), ('Ile', 35), ('King', 35), ('Banquo', 33), ('Lord', 33), ('My', 33), ('That', 33), ('Exeunt.', 29), ('As', 25), ('Mal.', 25), ('Sir', 25), ('Thane', 25)]\n", | |
"stream": "stdout" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### For Wednesday ###\nJust FYI, in the assignment for Wednesday, October 8, you'll be asked to extend this code a bit more to discover interesting patterns using objects or subjects of verbs, and do a bit of WordNet grouping. This will be posted soon. Note that these exercises are intended to provide you with functions to use directly in your larger assignment."
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "", | |
"prompt_number": 42, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
} | |
], | |
"metadata": {} | |
} | |
], | |
"metadata": { | |
"name": "", | |
"signature": "sha256:37379aa61547128b5a728682b32a82e12e334d551de9d79ae7041f678ae8a2ac" | |
}, | |
"nbformat": 3 | |
} |
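**Counting whole proper-noun phrases (sketch):** The FreqDist cell in the notebook counts individual NNP-tagged tokens, so a name like "William Shakespeare" is split into two entries. The sketch below shows one possible way to count each run of consecutive proper nouns as a single phrase. It is not the notebook author's method: it uses only the standard library, with `itertools.groupby` standing in for the `{<NNP>+}` chunk rule and `collections.Counter` standing in for `nltk.FreqDist`. The function name `proper_noun_phrases` and the `sample` list are hypothetical; the sample tokens are copied from the tagged Macbeth output and are not contiguous in the text. The code is written in Python 3 syntax.

```python
from collections import Counter
from itertools import groupby


def proper_noun_phrases(tagged_tokens, tag="NNP"):
    """Yield each maximal run of consecutive `tag` tokens as one phrase string.

    This approximates what the chunk rule {<NNP>+} captures, without NLTK.
    """
    for is_proper, run in groupby(tagged_tokens, key=lambda pair: pair[1] == tag):
        if is_proper:
            yield " ".join(word for word, _ in run)


# Hypothetical sample: a few (word, tag) pairs taken from the Macbeth output.
sample = [
    ("The", "DT"), ("Tragedie", "NNP"), ("of", "IN"), ("Macbeth", "NNP"),
    ("by", "IN"), ("William", "NNP"), ("Shakespeare", "NNP"), ("1603", "CD"),
    ("to", "TO"), ("meet", "VB"), ("with", "IN"), ("Macbeth", "NNP"),
]

# Phrases are plain strings, so they are hashable and can be counted directly.
phrase_counts = Counter(proper_noun_phrases(sample))
for phrase, count in phrase_counts.most_common(20):
    print(phrase, count)
```

With NLTK itself, the same idea would presumably be to walk the NP subtrees of the chunk tree, join the words from each subtree's `leaves()` into a string, and feed those strings to the FreqDist, since strings are hashable.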