{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## 1. Syntactic Patterns for Technical Terms ##"
},
{
"metadata": {},
"cell_type": "code",
"input": "import nltk\nfrom nltk.corpus import brown\nfrom nltk.tag import brill\nimport re\nimport unicodedata\nimport codecs",
"prompt_number": 1,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Questions ##\n\nHow do I get the backoff tagger to work when tagging a new corpus?"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As seen in the Chuang et al. paper and in the Manning and Schuetze chapter,\nthere is a well-known part-of-speech-based pattern defined by Justeson and Katz\nfor identifying simple noun phrases that often works well for pulling out keyphrases.\n\nChuang et al. use this pattern: Technical Term T = (A | N)+ (N | C) | N\n\nBelow, please write a function to define a chunker using the RegexpParser as illustrated in the section *Chunking with Regular Expressions*. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by *N* here. Also, C refers to cardinal number, which is CD in the brown corpus.\n\n"
},
{
"metadata": {},
"cell_type": "code",
"input": "grammar = r\"\"\"\n CHUNK: {<JJ|NN>+<NN|CD>} # chunk technical terms\n\"\"\"\ncp = nltk.RegexpParser(grammar)",
"prompt_number": 2,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
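{
"metadata": {},
"cell_type": "markdown",
"source": "*Aside (a sketch, not part of the assignment answer):* the grammar above only covers the `(A | N)+ (N | C)` branch of the Justeson & Katz pattern. The cell below sketches one way to also cover the bare-noun alternative `| N`, assuming Brown-style tags (JJ* for adjectives, NN* for nouns, CD for cardinal numbers). The names `jk_grammar` and `jk_chunker` are illustrative."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Sketch: spell out both alternatives of the Justeson & Katz pattern\n#   T = (A | N)+ (N | C) | N\n# (Brown tagset assumed: JJ* adjectives, NN* nouns, CD cardinal numbers)\njk_grammar = r\"\"\"\n CHUNK: {<JJ.*|NN.*>+<NN.*|CD>} # (A | N)+ (N | C)\n        {<NN.*>}               # | N\n\"\"\"\njk_chunker = nltk.RegexpParser(jk_grammar)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},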
{
"metadata": {},
"cell_type": "markdown",
"source": "Below, please write a function to call the chunker, run it on some sentences, and then print out the results for those sentences.\n\nFor uniformity, please run it on sentences 100 through 104 from the tagged brown corpus news category.\n\nThen extract the phrases themselves using the subtree extraction technique shown in the \n*Exploring Text Corpora* section. (Note: Section 7.4 shows how to get to the actual words in the phrase by using the tree.leaves() command.)"
},
{
"metadata": {},
"cell_type": "code",
"input": "brown = nltk.corpus.brown.tagged_sents(categories='news')\nfor sents in brown[100:105]:\n    tree = cp.parse(sents)\n    for subtree in tree.subtrees():\n        if subtree.label() == 'CHUNK': print subtree",
"prompt_number": 3,
"outputs": [
{
"output_type": "stream",
"text": "(CHUNK public/JJ hearing/NN)\n(CHUNK current/JJ fiscal/JJ year/NN)\n(CHUNK escheat/NN law/NN)\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
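{
"metadata": {},
"cell_type": "markdown",
"source": "*Sketch to finish the last part of the prompt:* the cell above prints whole subtrees; the cell below pulls out just the phrase text via `subtree.leaves()`, as suggested in the prompt."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Extract just the words of each CHUNK phrase via subtree.leaves()\nfor sents in brown[100:105]:\n    tree = cp.parse(sents)\n    for subtree in tree.subtrees():\n        if subtree.label() == 'CHUNK':\n            print ' '.join(word for word, tag in subtree.leaves())",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},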
{
"metadata": {},
"cell_type": "markdown",
"source": "## 2. Identify Proper Nouns ##\nFor this next task, write a new version of the chunker, but this time change it in two ways:\n 1. Make it recognize proper nouns.\n 2. Make it work on your personal text collection.\n\nNote that the second requirement means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank).\n\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Tagger:** Your code for optionally training a tagger, and for definitely running a tagger on your personal collection, goes here:"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Having problems with non-ASCII values, so going to remove them"
},
{
"metadata": {},
"cell_type": "code",
"input": "def remove_non_ascii(text):\n    # Strip any non-ASCII characters (ord > 127) from each line.\n    new_text = []\n    for line in text:\n        try:\n            for letter in line:\n                if ord(letter) > 127:\n                    line = line.replace(letter, '')\n        except Exception:\n            pass\n        new_text.append(line)\n    return new_text",
"prompt_number": 4,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "r = codecs.open('../Data/ct_criteria_colin.txt', encoding=\"utf-8\")\nr_lines = r.readlines()\n#r_lines = [unicodedata.normalize('NFKD', line).encode('ascii','ignore') for line in r_lines]\nr_lines = remove_non_ascii(r_lines)\nr_string = ' '.join(r_lines)\nlines_split = [re.split(' - ', line) for line in r_lines]",
"prompt_number": 5,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')",
"prompt_number": 6,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "sentence_groups = []\nfor sent_group in lines_split:\n    group_holder = []\n    for sent in sent_group:\n        group_holder.append(sent_tokenizer.tokenize(sent))\n    sentence_groups.append(group_holder)\n    del group_holder",
"prompt_number": 7,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "pattern = r'''(?x) # set flag to allow verbose regexps\n ([A-Z]\\.)+ # abbreviations, e.g. U.S.A.\n | \\w+([-']\\w+)* # words with optional internal hyphens or apostrophes\n | \\$?\\d+(\\.\\d+)?%? # currency and percentages, e.g. $12.40, 82%\n | \\.\\.\\. # ellipsis\n | [][.,;\"'?():\\-_`]+ # these are separate tokens\n '''",
"prompt_number": 8,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "flattened_list = [item for sublist in sentence_groups for item in sublist]",
"prompt_number": 9,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "flattened_list_tokens = [nltk.regexp_tokenize(' '.join(sent), pattern) for sent\n in flattened_list]",
"prompt_number": 10,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "flattened_list_tokens[:3]",
"prompt_number": 11,
"outputs": [
{
"text": "[[u'Inclusion', u'Criteria', u':'],\n [u'Healthy', u'male'],\n [u'18-50', u'years', u'of', u'age']]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 11
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Train tagger"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Training sets"
},
{
"metadata": {},
"cell_type": "code",
"input": "conll_sents = nltk.corpus.conll2007.tagged_sents()\nconll_train = list(conll_sents)",
"prompt_number": 12,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "brown_sents = nltk.corpus.brown.tagged_sents()\nbrown_train = list(brown_sents)",
"prompt_number": 13,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Pattern for the regex tagger"
},
{
"metadata": {},
"cell_type": "code",
"input": "word_patterns = [\n (r'^-?[0-9]+(\\.[0-9]+)?$', 'CD'),\n (r'.*ould$', 'MD'),\n (r'.*ing$', 'VBG'),\n (r'.*ed$', 'VBD'),\n (r'.*ness$', 'NN'),\n (r'.*ment$', 'NN'),\n (r'.*ful$', 'JJ'),\n (r'.*ious$', 'JJ'),\n (r'.*ble$', 'JJ'),\n (r'.*ic$', 'JJ'),\n (r'.*ive$', 'JJ'),\n (r'.*est$', 'JJ'),\n (r'^a$', 'PREP'),\n (r'.*', 'NN') # catch-all: default everything else to NN\n]",
"prompt_number": 14,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
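{
"metadata": {},
"cell_type": "markdown",
"source": "*Illustration only (not part of the original solution):* the catch-all `.*` rule at the end means the RegexpTagger assigns *some* tag to every token, defaulting to NN. The tokens below are made up."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Illustration: the RegexpTagger on its own, showing the catch-all NN default\ndemo_tagger = nltk.tag.RegexpTagger(word_patterns)\nprint demo_tagger.tag(['Karnofsky', 'score', '82', 'improving'])",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},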
{
"metadata": {},
"cell_type": "markdown",
"source": "Backoff tagger function"
},
{
"metadata": {},
"cell_type": "code",
"input": "def backoff_tagger(tagged_sents, tagger_classes, backoff=None):\n    # Train each tagger class in order, chaining each new tagger onto the\n    # previous one as its backoff; return the last (outermost) tagger.\n    if not backoff:\n        backoff = tagger_classes[0](tagged_sents)\n        del tagger_classes[0]\n\n    for cls in tagger_classes:\n        tagger = cls(tagged_sents, backoff=backoff)\n        backoff = tagger\n\n    return backoff",
"prompt_number": 15,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "RAUBT tagger: Regexp, Affix, Unigram, Bigram, and Trigram taggers chained with backoff"
},
{
"metadata": {},
"cell_type": "code",
"input": "raubt_tagger = backoff_tagger(brown_train, [nltk.tag.AffixTagger,\n nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],\n backoff=nltk.tag.RegexpTagger(word_patterns))",
"prompt_number": 16,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
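{
"metadata": {},
"cell_type": "markdown",
"source": "*Rough sanity check (sketch):* the RAUBT tagger was trained on all of `brown_train`, so scoring it on a slice of the same data only gives an optimistic, in-sample number, but it at least confirms the backoff chain is wired up."
},
{
"metadata": {},
"cell_type": "code",
"input": "# In-sample sanity check only: these sentences were part of the training data,\n# so this accuracy is an upper bound, not a real evaluation.\nprint raubt_tagger.evaluate(brown_train[:500])",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},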
{
"metadata": {},
"cell_type": "markdown",
"source": "Brill tagger (in progress)"
},
{
"metadata": {},
"cell_type": "code",
"input": "# raubt_tagger = backoff_tagger(conll_train, [nltk.tag.AffixTagger,\n# nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],\n# backoff=nltk.tag.RegexpTagger(word_patterns))\n \n# templates = [\n# brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),\n# brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),\n# brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),\n# brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),\n# brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),\n# brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),\n# brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),\n# brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),\n# brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),\n# brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))\n# ]\n \n# trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)\n# braubt_tagger = trainer.train(conll_train, max_rules=100, min_score=3)",
"prompt_number": 17,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Tag my corpus"
},
{
"metadata": {},
"cell_type": "code",
"input": "tagged_corpus = [raubt_tagger.tag(sent) for sent in flattened_list_tokens]",
"prompt_number": 18,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Hacked fix for the backoff tagger not working (some tokens come back with a None tag)"
},
{
"metadata": {},
"cell_type": "code",
"input": "reTag = nltk.tag.RegexpTagger(word_patterns)\n\nfor num, sent in enumerate(tagged_corpus):\n    for n, word in enumerate(sent):\n        if word[1] is None:  # the backoff chain left this token untagged\n            print word[0]\n            # re-tag the single token and store the (word, tag) tuple\n            tagged_corpus[num][n] = reTag.tag([word[0]])[0]",
"prompt_number": 19,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
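{
"metadata": {},
"cell_type": "markdown",
"source": "*Quick check (sketch):* confirm the hack above left no `None` tags behind."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Count tokens that are still untagged after the fix above (should be 0)\nprint sum(1 for sent in tagged_corpus for (w, t) in sent if t is None)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},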
{
"metadata": {},
"cell_type": "markdown",
"source": "**Chunker:** Code for the proper noun chunker goes here:"
},
{
"metadata": {},
"cell_type": "code",
"input": "grammar = r\"\"\"\n CHUNK: {<NP.*>+} # chunk sequences of proper noun (NP) tags\n\"\"\"\ncp = nltk.RegexpParser(grammar)",
"prompt_number": 20,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "def chunker(tagged_corpus):\n    # Chunk sequences of proper noun (NP) tags and return all CHUNK subtrees.\n    grammar = r\"\"\"\n    CHUNK: {<NP.*>+} # chunk proper nouns\n    \"\"\"\n    cp = nltk.RegexpParser(grammar)\n\n    results = []\n\n    for sents in tagged_corpus:\n        tree = cp.parse(sents)\n        for subtree in tree.subtrees():\n            if subtree.label() == 'CHUNK':\n                results.append(subtree)\n    return results",
"prompt_number": 21,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Test the Chunker:** Test your proper noun recognizer on a lot of sentences to see how well it is working. You might want to add prepositions in order to improve your results. \n"
},
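{
"metadata": {},
"cell_type": "markdown",
"source": "*Spot check (sketch):* run the chunker over a small slice of the tagged collection and print what it pulls out."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Spot-check the chunker on the first 200 tagged sentences\nfor subtree in chunker(tagged_corpus[:200]):\n    print subtree",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},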
{
"metadata": {},
"cell_type": "markdown",
"source": "**There were no NNP (proper noun) tags in my corpus, so I used plural nouns instead**"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**FreqDist Results:** After you have your proper noun recognizer working to your satisfaction, below run it over your entire collection, feed the results into a FreqDist, and then print out the top 20 proper nouns by frequency. That code goes here, along with the output:\n"
},
{
"metadata": {},
"cell_type": "code",
"input": "NNS_from_corpus = chunker(tagged_corpus)\nNNS_from_corpus[:20]",
"prompt_number": 22,
"outputs": [
{
"text": "[Tree('CHUNK', [(u'Criteria', u'NP')]),\n Tree('CHUNK', [(u'Criteria', u'NP')]),\n Tree('CHUNK', [(u'apnea', u'NP')]),\n Tree('CHUNK', [(u'sarcoma', u'NP')]),\n Tree('CHUNK', [(u'histiocytoma', u'NP')]),\n Tree('CHUNK', [(u'Liposarcoma', u'NP')]),\n Tree('CHUNK', [(u'Leiomyosarcoma', u'NP')]),\n Tree('CHUNK', [(u'Fibrosarcoma', u'NP')]),\n Tree('CHUNK', [(u'Rhabdomyosarcoma', u'NP')]),\n Tree('CHUNK', [(u'sarcoma', u'NP')]),\n Tree('CHUNK', [(u'paraganglioma', u'NP')]),\n Tree('CHUNK', [(u'Neurofibrosarcoma', u'NP')]),\n Tree('CHUNK', [(u'schwannoma', u'NP')]),\n Tree('CHUNK', [(u'sarcoma', u'NP')]),\n Tree('CHUNK', [(u'osteosarcoma', u'NP')]),\n Tree('CHUNK', [(u'chondrosarcoma', u'NP')]),\n Tree('CHUNK', [(u'Angiosarcoma', u'NP')]),\n Tree('CHUNK', [(u'sarcoma', u'NP')]),\n Tree('CHUNK', [(u'sarcoma', u'NP')]),\n Tree('CHUNK', [(u'Karnofsky', u'NP')])]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 22
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "fd = nltk.FreqDist([word[0][0] for word in NNS_from_corpus])",
"prompt_number": 23,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "fd.most_common(20)",
"prompt_number": 24,
"outputs": [
{
"text": "[(u'Criteria', 3570),\n (u'angina', 252),\n (u'L', 197),\n (u'Karnofsky', 100),\n (u'English', 96),\n (u'adenocarcinoma', 96),\n (u'York', 94),\n (u'schizophrenia', 77),\n (u'heparin', 72),\n (u'warfarin', 70),\n (u'melanoma', 64),\n (u'non-melanoma', 62),\n (u'myeloma', 61),\n (u'Child', 50),\n (u'B', 46),\n (u'anesthesia', 46),\n (u'mitomycin', 42),\n (u'Gilbert', 40),\n (u'Crohn', 39),\n (u'B-cell', 39)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 24
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### For Wednesday ###\nJust FYI, in Wednesday's assignment (October 8), you'll be asked to extend this code a bit more to discover interesting patterns using objects or subjects of verbs, and to do a bit of WordNet grouping. That assignment will be posted soon. Note that these exercises are intended to provide you with functions to use directly in your larger assignment."
},
{
"metadata": {},
"cell_type": "code",
"input": "",
"prompt_number": 24,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:1b4941f5783a7297ada0e41af9c105b706947bccc58b8cd3d462a42ae33b8771"
},
"nbformat": 3
}