{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## 1. Syntactic Patterns for Technical Terms ##"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "import re\nimport random\n\nimport nltk\nfrom nltk.corpus import brown",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As seen in the Manning and Schuetze chapter, there is a well-known part-of-speech \nbased pattern defined by Justeson and Katz for identifying simple noun phrases \nthat often work well for pulling out keyphrases.\n\n Technical Term T = (A | N)+ (N | C) | N\n\nBelow, write a function to define a chunker using the RegexpParser as illustrated in the NLTK book Chapter 7 section 2.3 *Chunking with Regular Expressions*. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by *N* here. Also, C refers to cardinal number, which is CD in the brown corpus.\n\n" | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "def np_chunker(pos_sentence):\n grammar = r\"\"\"NP: {<JJ.*|NN.*>+<NN.*|CD><NN.*>}\"\"\"\n cp = nltk.RegexpParser(grammar)\n print(cp.parse(pos_sentence), '\\n')", | |
"execution_count": 2, | |
"outputs": [] | |
}, | |
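{
"metadata": {},
"cell_type": "markdown",
"source": "As a quick sanity check before running on the corpus, the sketch below applies `np_chunker` to a small hand-tagged sentence. The sentence and its Brown-style tags are invented for illustration; the adjective-noun-noun run should come out grouped under an `NP` node."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# invented toy sentence with Brown-style tags, for illustration only\ntoy_sent = [('the', 'AT'), ('linear', 'JJ'), ('regression', 'NN'),\n            ('model', 'NN'), ('converged', 'VBD'), ('quickly', 'RB')]\n\n# 'linear/JJ regression/NN model/NN' matches (A | N)+ (N | C) N\nnp_chunker(toy_sent)",
"execution_count": null,
"outputs": []
},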
{
"metadata": {},
"cell_type": "markdown",
"source": "Below, write a function to call the chunker, run it on some sentences, and then print out the results for those sentences.\n\nFor uniformity, please run it on sentences 100 through 104 from the full tagged brown corpus.\n\n " | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "brown_tagged_sents = brown.tagged_sents()", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "brown_sample = brown_tagged_sents[100:105]", | |
"execution_count": 4, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "[np_chunker(sent) for sent in brown_sample]", | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "(S\n Daniel/NP\n personally/RB\n led/VBD\n the/AT\n fight/NN\n for/IN\n the/AT\n measure/NN\n ,/,\n which/WDT\n he/PPS\n had/HVD\n watered/VBN\n down/RP\n considerably/RB\n since/IN\n its/PP$\n rejection/NN\n by/IN\n two/CD\n previous/JJ\n Legislatures/NNS-TL\n ,/,\n in/IN\n a/AT\n public/JJ\n hearing/NN\n before/IN\n the/AT\n House/NN-TL\n Committee/NN-TL\n on/IN-TL\n Revenue/NN-TL\n and/CC-TL\n Taxation/NN-TL\n ./.) \n\n(S\n Under/IN\n committee/NN\n rules/NNS\n ,/,\n it/PPS\n went/VBD\n automatically/RB\n to/IN\n a/AT\n subcommittee/NN\n for/IN\n one/CD\n week/NN\n ./.) \n\n(S\n But/CC\n questions/NNS\n with/IN\n which/WDT\n committee/NN\n members/NNS\n taunted/VBD\n bankers/NNS\n appearing/VBG\n as/CS\n witnesses/NNS\n left/VBD\n little/AP\n doubt/NN\n that/CS\n they/PPSS\n will/MD\n recommend/VB\n passage/NN\n of/IN\n it/PPO\n ./.) \n\n(S\n Daniel/NP\n termed/VBD\n ``/``\n extremely/RB\n conservative/JJ\n ''/''\n his/PP$\n estimate/NN\n that/CS\n it/PPS\n would/MD\n produce/VB\n 17/CD\n million/CD\n dollars/NNS\n to/TO\n help/VB\n erase/VB\n an/AT\n anticipated/VBN\n deficit/NN\n of/IN\n 63/CD\n million/CD\n dollars/NNS\n at/IN\n the/AT\n end/NN\n of/IN\n the/AT\n current/JJ\n fiscal/JJ\n year/NN\n next/AP\n Aug./NP\n 31/CD\n ./.) \n\n(S\n He/PPS\n told/VBD\n the/AT\n committee/NN\n the/AT\n measure/NN\n would/MD\n merely/RB\n provide/VB\n means/NNS\n of/IN\n enforcing/VBG\n the/AT\n escheat/NN\n law/NN\n which/WDT\n has/HVZ\n been/BEN\n on/IN\n the/AT\n books/NNS\n ``/``\n since/IN\n Texas/NP\n was/BEDZ\n a/AT\n republic/NN\n ''/''\n ./.) \n\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[None, None, None, None, None]" | |
}, | |
"metadata": {}, | |
"execution_count": 5 | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "\nThen extract out the phrases themselves on sentences 100 through 160 using the subtree extraction technique shown in the \n*Exploring Text Corpora* category. " | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "grammar = r\"\"\"NP: {<JJ.*|NN.*>+<NN.*|CD><NN.*>}\"\"\"\ncp = nltk.RegexpParser(grammar)\n\nbrown_sample = brown_tagged_sents[100:161]\n\nfor sent in brown_sample:\n tree = cp.parse(sent)\n for subtree in tree.subtrees():\n if subtree.label() == 'NP':\n print(subtree)", | |
"execution_count": 6, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "(NP county-wide/JJ day/NN schools/NNS)\n(NP year's/NN$ capital/NN outlay/NN)\n(NP horse/NN race/NN parimutuels/NNS)\n(NP horse/NN race/NN betting/NN)\n(NP local/JJ option/NN proposal/NN)\n(NP State/NN-TL Health/NN-TL Department's/NN$-TL authority/NN)\n(NP county/NN-TL Hospital/NN-TL District/NN-TL)\n(NP Gulf/NN-TL Coast/NN-TL district/NN)\n(NP State/NN-TL Hospital/NN-TL board/NN)\n(NP tax/NN revision/NN bills/NNS)\n(NP miscellaneous/JJ excise/NN taxes/NNS)\n(NP real/JJ estate/NN brokers/NNS)\n(NP $12/NNS annual/JJ occupation/NN license/NN)\n(NP Natural/JJ gas/NN public/JJ utility/NN companies/NNS)\n(NP underground/JJ storage/NN reservoirs/NNS)\n(NP State/NN-TL Affairs/NNS-TL Committee/NN-TL)\n(NP water/NN development/NN bill/NN)\n(NP local/JJ water/NN project/NN)\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
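{
"metadata": {},
"cell_type": "markdown",
"source": "The same subtree walk can also collect the chunks as plain strings rather than printing `Tree` objects, which is handy for the frequency counts later on. A minimal sketch with a hypothetical helper name (`extract_chunks` is not part of the assignment):"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "def extract_chunks(tagged_sents, parser, label='NP'):\n    # yield each matching chunk as a space-joined string of its words\n    for sent in tagged_sents:\n        for subtree in parser.parse(sent).subtrees():\n            if subtree.label() == label:\n                yield ' '.join(word for word, tag in subtree.leaves())\n\nlist(extract_chunks(brown_sample, cp))[:5]",
"execution_count": null,
"outputs": []
},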
{
"metadata": {},
"cell_type": "markdown",
"source": "## 2. Identify Proper Nouns ##\nFor this next task, write a new version of the chunker, but this time change it in two ways:\n 1. Make it recognize proper nouns\n 2. Make it work on your personal text collection which means that you need to run a tagger over your personal text collection.\n\nNote that the second requirements means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank)\n\n" | |
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Tagger:** Your code for optionally training tagger, and for definitely running tagger on your personal collection goes here:" | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# load data\nwith open ('text-collection/jsm-collection.txt', 'r', encoding='utf-8') as jsm:\n t = jsm.read()\n\n# remove chapter and section headings\nt = re.sub('\\s+', ' ',\n re.sub(r'[A-Z]{2,}', '',\n re.sub('((?<=[A-Z])\\sI | I\\s(?=[A-Z]))', ' ', t)))\n\n# tokenize\ndef tokenize_text(corpus):\n sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')\n raw_sents = sent_tokenizer.tokenize(corpus)\n return [nltk.word_tokenize(word) for word in raw_sents]\njsm_sents = tokenize_text(t)\n\n# tag\njsm_tagged = [nltk.pos_tag(sent) for sent in jsm_sents]", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
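{
"metadata": {},
"cell_type": "markdown",
"source": "The cell above uses the pre-trained `nltk.pos_tag`. The assignment also allows training your own tagger on an existing tagged collection; the sketch below shows one way to do that on the Brown corpus (a bigram tagger backing off to a unigram tagger and then to a default tag). It is an unexecuted alternative, not what the rest of this notebook uses."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# alternative: train a tagger on the tagged Brown corpus\n# (bigram -> unigram -> default 'NN' backoff chain)\nt0 = nltk.DefaultTagger('NN')\nt1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)\nt2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)\n\n# jsm_tagged = [t2.tag(sent) for sent in jsm_sents]  # Brown tagset instead of Penn",
"execution_count": null,
"outputs": []
},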
{
"metadata": {},
"cell_type": "markdown",
"source": "**Chunker:** Code for the proper noun chunker goes here:"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "grammar = r\"\"\"PNOUN: {<N+P.*>+<DT|IN>*<N+P.*>+}\"\"\"\ncp = nltk.RegexpParser(grammar)", | |
"execution_count": 8, | |
"outputs": [] | |
}, | |
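{
"metadata": {},
"cell_type": "markdown",
"source": "A small illustration of what the grammar accepts (the sentence and its Penn-style tags are invented): two proper noun runs joined by a preposition should come out as a single `PNOUN` chunk."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# invented toy sentence, Penn-style tags\ntoy_sent = [('The', 'DT'), ('House', 'NNP'), ('of', 'IN'),\n            ('Commons', 'NNPS'), ('met', 'VBD'), ('today', 'NN')]\n\n# expect: (PNOUN House/NNP of/IN Commons/NNPS)\nprint(cp.parse(toy_sent))",
"execution_count": null,
"outputs": []
},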
{
"metadata": {},
"cell_type": "markdown",
"source": "**Test the Chunker:** Test your proper noun recognizer on a lot of sentences to see how well it is working. You might want to add prepositions in order to improve your results.\n"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "start = random.randint(0, len(jsm_tagged) - 10)\n\nfor sent in jsm_tagged[start : start+10]:\n tree = cp.parse(sent)\n for subtree in tree.subtrees():\n if subtree.label() == 'PNOUN':\n print(subtree)", | |
"execution_count": 25, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "(PNOUN House/NNP of/IN Commons/NNPS)\n(PNOUN _Côté/NNP Gauche_/NNP of/IN the/DT Whig/NNP)\n(PNOUN Charles/NNP Buller/NNP)\n(PNOUN Sir/NNP William/NNP Molesworth/NNP)\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**FreqDist Results:** After you have your proper noun recognizer working to your satisfaction, below run it over your entire collection, feed the results into a FreqDist, and then print out the top 20 proper nouns by frequency. That code goes here, along with the output:\n" | |
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "pnouns = [' '.join([word for word, tag in subtree.leaves()]) for term in jsm_tagged\n for subtree in cp.parse(term).subtrees() if subtree.label() == 'PNOUN']\n\npnouns_fdist = nltk.FreqDist(pnouns)", | |
"execution_count": 10, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "pnouns_fdist.most_common(20)", | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[('United States', 396),\n ('M. Comte', 333),\n ('Dr. Whewell', 121),\n ('Political Economy', 89),\n ('Mr. Mill', 88),\n ('Method of Difference', 81),\n ('_a priori_', 77),\n ('Adam Smith', 72),\n ('Method of Agreement', 63),\n ('Mr. Spencer', 63),\n ('B C', 41),\n ('Archbishop Whately', 39),\n ('New York', 29),\n ('Mr. Ricardo', 29),\n ('Sir William Hamilton', 28),\n ('Great Britain', 28),\n ('Professor Bain', 26),\n ('Mr. Bain', 26),\n ('Chart No', 24),\n ('Deductive Method', 24)]" | |
}, | |
"metadata": {}, | |
"execution_count": 11 | |
} | |
] | |
}, | |
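{
"metadata": {},
"cell_type": "markdown",
"source": "One limitation visible in the list above: variant mentions of the same person, such as 'Mr. Bain' and 'Professor Bain', are counted separately. As a rough follow-up sketch (the merging heuristic is ours, not part of the assignment), one could aggregate counts by the final token of each phrase:"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# hypothetical follow-up: merge variant mentions by their final token\n# (crude; e.g. it also merges 'Method of Agreement' into 'Agreement')\nby_last_token = nltk.FreqDist()\nfor phrase, count in pnouns_fdist.items():\n    by_last_token[phrase.split()[-1]] += count\n\nby_last_token.most_common(20)",
"execution_count": null,
"outputs": []
},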
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"pygments_lexer": "ipython3",
"mimetype": "text/x-python",
"version": "3.4.2",
"codemirror_mode": {
"version": 3,
"name": "ipython"
},
"nbconvert_exporter": "python",
"file_extension": ".py",
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |