{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "<h1> Keyphrase Identification Assignment </h1>\n<h3> Ankita Bhosle </h3>\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This notebook contains three algorithms to find keyphrases from texts so that one can understand the gist and general contents of a text just by a glimpse at the keyphrases. The algorithms are composed of simple methods learnt in class so far, which combined together, give some pretty powerful insights. \n\nEach algorithms is run on three collections:\n\n1) My personal collection - The Palace of Illusions, an Indian mythological prose\n\n2) The Brown corpus 'News' collection\n\n3) Mystery text provided by Marti"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h1> Loading and Pre-processing Text"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In this section the texts are normalized, through removal of punctuation and stop words and by lowercasing all words. Normalized texts are then tokenized. In this portion, I tried running the algorithms with and without normalization and found that normalizing the text gives better results. Losing stop words did not distort the meaningful keyphrases, as I had feared. "
},
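{
"metadata": {},
"cell_type": "markdown",
"source": "The three collections below repeat the same normalize-tokenize-filter-tag steps. As a sketch only (not part of the original assignment code), those steps could be bundled into a single helper like the one below; the per-collection cells that follow inline the same steps."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Sketch only: one helper bundling the preprocessing steps repeated per collection below.\nimport string\nimport nltk\nfrom nltk.corpus import stopwords\n\ndef normalize_and_tag(text):\n    exclude = set(string.punctuation)\n    normalized = ''.join(ch for ch in text.lower() if ch not in exclude) # lowercase, strip punctuation\n    tokens = nltk.word_tokenize(normalized) # word tokens\n    meaningful = [w for w in tokens if w not in stopwords.words('english')] # drop stop words\n    return meaningful, nltk.pos_tag(meaningful) # tokens plus POS-tagged tokens",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},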
{
"metadata": {},
"cell_type": "code",
"input": "# Code to load text from url. Please uncomment the code below and comment out lines 2 and 3 from the next block of code.\n\n# from urllib import urlopen\n# url = '<Insert URL Here>'\n# f = urlopen(url)\n# s = f.read()",
"prompt_number": 251,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> My Collection: The Palace of Illusions"
},
{
"metadata": {},
"cell_type": "code",
"input": "import nltk \nf = open('The_Palace_of_Illusions.txt', 'rU') # Load text. Comment this line and the next if loading from url using code above.\ns = f.read()\n\nimport string\nexclude = set(string.punctuation)\nnormalized = ''.join(ch for ch in s.lower() if ch not in exclude) # Remove punctuation, make lower case\ntokens = nltk.word_tokenize(normalized) # Create Word Tokens\n\nfrom nltk.corpus import stopwords\nmeaningful_tokens = [word for word in tokens if word not in (stopwords.words('english'))] # Remove Stop Words\ntagged_text = nltk.pos_tag(meaningful_tokens) # Tagged Tokens\n",
"prompt_number": 147,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')\nsents = sent_tokenizer.tokenize(s) # Create Sentence Tokens",
"prompt_number": 80,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Brown Corpus News Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "from nltk.corpus import brown\ntok = [word.lower() for word in nltk.corpus.brown.words(categories='news')] # Lowercase Tokens\nnormalized_brown = ''.join(ch for ch in str(tok) if ch not in exclude) # Remove Punctuation\nbrown_tokens = nltk.word_tokenize(normalized_brown) # Create word tokens\nmeaningful_brown_tokens = [word for word in brown_tokens if word not in (stopwords.words('english'))] # Remove Stop Words\ntagged_brown_text = nltk.pos_tag(meaningful_brown_tokens) # Tagged Tokens",
"prompt_number": 210,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Mystery Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "f1 = open('mystery.txt', 'rU') # Load Text\ns1 = f1.read() \n\nnormalized_myst = ''.join(ch for ch in s1.lower() if ch not in exclude) # Remove punctuation, make lower case\ntokens_myst = nltk.word_tokenize(normalized_myst) # Create Word Tokens\nmeaningful_myst_tokens = [word for word in tokens_myst if word not in (stopwords.words('english'))] # Remove Stop Words\ntagged_myst_text = nltk.pos_tag(meaningful_myst_tokens) # Tagged Tokens",
"prompt_number": 213,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h1> PART 1: Popular Words, Phrases and their Frequencies"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This part uses n-grams (unigrams, bigrams and trigrams), ie word combinations, along with frequency distribution to find the most common combinations of words. The rationale behind using the frequency distribution is that if a certain combination of words occurs very frequently, it is less likely that it is by mere chance, and hence must be something meaningful. "
},
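{
"metadata": {},
"cell_type": "markdown",
"source": "As a toy illustration of the idea (made-up tokens, not one of the three collections): repeated adjacent pairs float to the top of the frequency distribution."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Toy illustration only: bigram frequencies on a hand-made token list.\ntoy = ['dhai', 'ma', 'said', 'dhai', 'ma', 'smiled']\ntoy_bigrams = nltk.bigrams(toy) # adjacent word pairs\nnltk.FreqDist(toy_bigrams).items()[:3] # most frequent pairs first, e.g. ('dhai', 'ma') with count 2",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},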
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> My Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "unigrams = nltk.FreqDist(meaningful_tokens) #Unigrams: Nothing but word tokens. \nunigrams.items()[:20] #Use FreqDist to rank order them by frequency of occurrence, list 20 most frequent words, with frequency",
"prompt_number": 89,
"outputs": [
{
"text": "[('would', 677),\n ('said', 380),\n ('one', 363),\n ('could', 351),\n ('i\\xe2\\x80\\x99d', 326),\n ('though', 273),\n ('us', 266),\n ('like', 247),\n ('time', 243),\n ('even', 237),\n ('didn\\xe2\\x80\\x99t', 234),\n ('krishna', 204),\n ('arjun', 201),\n ('knew', 201),\n ('karna', 197),\n ('yudhisthir', 194),\n ('face', 186),\n ('eyes', 174),\n ('way', 174),\n ('he\\xe2\\x80\\x99d', 168)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 89
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "bigrams = nltk.bigrams(meaningful_tokens) # Form Bigrams from normalized, tokenized text\nfd = nltk.FreqDist(bigrams)\nfd.items()[:20]",
"prompt_number": 255,
"outputs": [
{
"text": "[(('dhai', 'ma'), 95),\n (('could', 'see'), 25),\n (('didn\\xe2\\x80\\x99t', 'know'), 25),\n (('i\\xe2\\x80\\x99d', 'never'), 22),\n (('krishna', 'said'), 22),\n (('didn\\xe2\\x80\\x99t', 'want'), 21),\n (('blind', 'king'), 20),\n (('yudhisthir', 'said'), 20),\n (('would', 'never'), 19),\n (('don\\xe2\\x80\\x99t', 'know'), 18),\n (('would', 'make'), 18),\n (('king', 'drupad'), 17),\n (('shook', 'head'), 17),\n (('even', 'though'), 16),\n (('let', 'go'), 14),\n (('long', 'time'), 14),\n (('said', '\\xe2\\x80\\x9ci'), 14),\n (('though', 'knew'), 14),\n (('first', 'time'), 13),\n (('later', 'would'), 13)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 255
},
{
"output_type": "stream",
"text": "\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "trigrams = nltk.trigrams(meaningful_tokens) # Form Trigrams from normalized, tokenized text\nfd1 = nltk.FreqDist(trigrams)\nfd1.items()[:20]",
"prompt_number": 252,
"outputs": [
{
"text": "[(('dhai', 'ma', 'said'), 9),\n (('i\\xe2\\x80\\x99d', 'never', 'seen'), 6),\n (('long', 'time', 'ago'), 5),\n (('garland', 'around', 'neck'), 4),\n (('one', 'last', 'time'), 4),\n (('third', 'age', 'man'), 4),\n (('asked', 'dhai', 'ma'), 3),\n (('delighted', 'dhai', 'ma'), 3),\n (('dhai', 'ma', 'would'), 3),\n (('e', 'p', 'l'), 3),\n (('h', 'e', 'p'), 3),\n (('karna', 'would', 'never'), 3),\n (('l', 'c', 'e'), 3),\n (('later', 'would', 'wonder'), 3),\n (('m', 'm', 'm'), 3),\n (('mouth', 'went', 'dry'), 3),\n (('p', 'l', 'c'), 3),\n (('panchaali', 'panchaali', 'panchaali'), 3),\n (('part', 'i\\xe2\\x80\\x99d', 'played'), 3),\n (('put', 'arm', 'around'), 3)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 252
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Brown Corpus News Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "unigrams_brown = nltk.FreqDist(meaningful_brown_tokens) \nunigrams_brown.items()[:20]",
"prompt_number": 218,
"outputs": [
{
"text": "[('said', 406),\n ('mrs', 254),\n ('would', 246),\n ('new', 241),\n ('one', 213),\n ('last', 177),\n ('two', 174),\n ('mr', 170),\n ('first', 158),\n ('state', 153),\n ('president', 142),\n ('year', 142),\n ('home', 132),\n ('also', 129),\n ('years', 118),\n ('made', 107),\n ('time', 103),\n ('three', 101),\n ('house', 97),\n ('week', 94)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 218
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "bigrams_brown = nltk.bigrams(meaningful_brown_tokens)\nfd_brown = nltk.FreqDist(bigrams_brown)\nfd_brown.items()[:20]",
"prompt_number": 220,
"outputs": [
{
"text": "[(('new', 'york'), 52),\n (('per', 'cent'), 50),\n (('mr', 'mrs'), 42),\n (('united', 'states'), 40),\n (('last', 'week'), 35),\n (('last', 'year'), 34),\n (('white', 'house'), 29),\n (('high', 'school'), 23),\n (('home', 'runs'), 23),\n (('president', 'kennedy'), 19),\n (('last', 'night'), 18),\n (('said', 'would'), 15),\n (('san', 'francisco'), 15),\n (('years', 'ago'), 15),\n (('antitrust', 'laws'), 14),\n (('los', 'angeles'), 14),\n (('mr', 'kennedy'), 14),\n (('kansas', 'city'), 13),\n (('premier', 'khrushchev'), 13),\n (('two', 'years'), 12)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 220
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "trigrams_brown = nltk.trigrams(meaningful_brown_tokens)\nfd1_brown = nltk.FreqDist(trigrams_brown)\nfd1_brown.items()[:20]",
"prompt_number": 222,
"outputs": [
{
"text": "[(('mr', 'hawksley', 'said'), 7),\n (('new', 'york', 'city'), 6),\n (('new', 'york', 'yankees'), 6),\n (('10', 'per', 'cent'), 5),\n (('four', 'home', 'runs'), 5),\n (('home', 'rule', 'charter'), 5),\n (('4', 'per', 'cent'), 4),\n (('aged', 'care', 'plan'), 4),\n (('american', 'catholic', 'higher'), 4),\n (('catholic', 'higher', 'education'), 4),\n (('la', 'dolce', 'vita'), 4),\n (('national', 'football', 'league'), 4),\n (('per', 'cent', 'interest'), 4),\n (('potato', 'chip', 'industry'), 4),\n (('two', 'years', 'ago'), 4),\n (('12', 'months', 'ended'), 3),\n (('60', 'home', 'runs'), 3),\n (('annapolis', 'jan', '7'), 3),\n (('anne', 'arundel', 'county'), 3),\n (('announce', 'birth', 'daughter'), 3)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 222
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Mystery Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "unigrams_myst = nltk.FreqDist(meaningful_myst_tokens) \nunigrams_myst.items()[:20]",
"prompt_number": 219,
"outputs": [
{
"text": "[('said', 1434),\n ('mln', 727),\n ('pct', 550),\n ('tonnes', 498),\n ('us', 437),\n ('dlrs', 354),\n ('last', 281),\n ('trade', 279),\n ('dollar', 267),\n ('would', 255),\n ('oil', 246),\n ('wheat', 241),\n ('year', 220),\n ('yen', 218),\n ('new', 217),\n ('japan', 215),\n ('prices', 205),\n ('market', 201),\n ('coffee', 195),\n ('bank', 179)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 219
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "bigrams_myst = nltk.bigrams(meaningful_myst_tokens)\nfd_myst = nltk.FreqDist(bigrams_myst)\nfd_myst.items()[:20]",
"prompt_number": 221,
"outputs": [
{
"text": "[(('mln', 'tonnes'), 209),\n (('last', 'month'), 131),\n (('mln', 'dlrs'), 99),\n (('billion', 'dlrs'), 92),\n (('sources', 'said'), 78),\n (('mln', 'barrels'), 67),\n (('new', 'york'), 65),\n (('department', 'said'), 57),\n (('bank', 'japan'), 54),\n (('last', 'year'), 52),\n (('traders', 'said'), 51),\n (('tonnes', 'vs'), 48),\n (('us', 'agriculture'), 48),\n (('pct', 'sulphur'), 47),\n (('dealers', 'said'), 46),\n (('week', 'ended'), 45),\n (('crude', 'oil'), 43),\n (('heating', 'oil'), 42),\n (('agriculture', 'department'), 41),\n (('official', 'said'), 39)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 221
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "trigrams_myst = nltk.trigrams(meaningful_myst_tokens)\nfd1_myst = nltk.FreqDist(trigrams_myst)\nfd1_myst.items()[:20]",
"prompt_number": 223,
"outputs": [
{
"text": "[(('mln', 'tonnes', 'vs'), 48),\n (('us', 'agriculture', 'department'), 38),\n (('mln', 'tonnes', 'last'), 32),\n (('tonnes', 'last', 'month'), 32),\n (('trade', 'sources', 'said'), 27),\n (('last', 'month', 'exports'), 25),\n (('week', 'ended', 'march'), 22),\n (('agriculture', 'department', 'said'), 17),\n (('tonnes', 'free', 'market'), 15),\n (('ecus', 'per', 'tonne'), 14),\n (('mln', 'last', 'month'), 14),\n (('last', 'month', 'usda'), 13),\n (('month', 'exports', '198586'), 13),\n (('pct', 'rise', 'january'), 13),\n (('free', 'market', 'barley'), 12),\n (('last', 'month', 'stocks'), 12),\n (('pct', 'year', 'ago'), 12),\n (('bank', 'japan', 'intervenes'), 11),\n (('dlrs', '75', 'cts'), 11),\n (('dlrs', 'fob', 'gulf'), 11)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 223
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h1> PART 2: Collocations"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This part uses collocations, that is \"expressions consisting of two or more words that correspond to some conventional way of saying things\", to find keyphrases in the texts. This includes both Statistics-based Collocations (PMI, Chi-squared, etc)\nand Syntactic-Pattern-Based Collocations (using peculiarities in morphology). \n\nIn this part, different statistics based collocations were used as they gave different results. After experimenting with some of them, I decided to keep all to show the difference in results. \n\nFor the second part (Syntactic-pattern based), I had to play with the RegEx a lot to get the pattern right, according to the paper. I also experimented with and without stop words, but with stop words gave better results in this case. "
},
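{
"metadata": {},
"cell_type": "markdown",
"source": "For reference (standard definitions, not specific to this assignment): for a bigram $(w_1, w_2)$, pointwise mutual information compares the observed co-occurrence probability with what independence would predict,\n\n$$\\mathrm{PMI}(w_1, w_2) = \\log_2 \\frac{P(w_1, w_2)}{P(w_1)\\,P(w_2)}$$\n\nso pairs that occur rarely but almost always together score highest. The t-test, chi-squared and likelihood-ratio measures weigh the same observed-vs-expected gap against sample size, which is why they favour frequent pairs such as 'dhai ma' where PMI surfaces rare ones."
},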
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> My collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "from nltk.collocations import *\nbigram_measures = nltk.collocations.BigramAssocMeasures()\ntrigram_measures = nltk.collocations.TrigramAssocMeasures()\nfinder = BigramCollocationFinder.from_words(meaningful_tokens)\nfinder.apply_freq_filter(3)\nfinder.nbest(bigram_measures.pmi, 10) # Find top 10 bigrams using Pointwise Mutual Information (PMI)",
"prompt_number": 135,
"outputs": [
{
"text": "[('nyaya', 'shastra'),\n ('peacock', 'feather'),\n ('chitra', 'banerjee'),\n ('l', 'c'),\n ('p', 'l'),\n ('grand', 'spectacle'),\n ('yadu', 'clan'),\n ('rice', 'pudding'),\n ('c', 'e'),\n ('e', 'p')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 135
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder.score_ngrams(bigram_measures.pmi)[:5] # Calculate PMI for the bigrams",
"prompt_number": 138,
"outputs": [
{
"text": "[(('nyaya', 'shastra'), 14.158951493658556),\n (('peacock', 'feather'), 13.421985899492352),\n (('chitra', 'banerjee'), 13.006948400213506),\n (('l', 'c'), 12.74391399437971),\n (('p', 'l'), 12.74391399437971)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 138
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder.nbest(bigram_measures.student_t, 5) # Find top 5 bigrams using the Student's t test",
"prompt_number": 133,
"outputs": [
{
"text": "[('dhai', 'ma'),\n ('didn\\xe2\\x80\\x99t', 'know'),\n ('could', 'see'),\n ('i\\xe2\\x80\\x99d', 'never'),\n ('didn\\xe2\\x80\\x99t', 'want')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 133
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder.nbest(bigram_measures.chi_sq, 5) # Find top 5 bigrams using the Pearson's Chi-squared test",
"prompt_number": 134,
"outputs": [
{
"text": "[('nyaya', 'shastra'),\n ('dhai', 'ma'),\n ('peacock', 'feather'),\n ('chitra', 'banerjee'),\n ('indra', 'prastha')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 134
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder.nbest(bigram_measures.likelihood_ratio, 5) # Find top 5 bigrams using the Likelihood Ratio",
"prompt_number": 136,
"outputs": [
{
"text": "[('dhai', 'ma'),\n ('blind', 'king'),\n ('shook', 'head'),\n ('king', 'drupad'),\n ('indra', 'prastha')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 136
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder1 = TrigramCollocationFinder.from_words(meaningful_tokens) \nsorted(finder1.nbest(trigram_measures.raw_freq, 10))",
"prompt_number": 126,
"outputs": [
{
"text": "[('asked', 'dhai', 'ma'),\n ('delighted', 'dhai', 'ma'),\n ('dhai', 'ma', 'said'),\n ('dhai', 'ma', 'would'),\n ('e', 'p', 'l'),\n ('garland', 'around', 'neck'),\n ('i\\xe2\\x80\\x99d', 'never', 'seen'),\n ('long', 'time', 'ago'),\n ('one', 'last', 'time'),\n ('third', 'age', 'man')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 126
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder1.score_ngrams(trigram_measures.pmi)[:10] # Find top 10 trigrams using PMI",
"prompt_number": 129,
"outputs": [
{
"text": "[(('1024', 'x', '768'), 31.487827988759424),\n (('111', 'mon', 'coating'), 31.487827988759424),\n (('2007033784', 'eisbn', '9780385525435'), 31.487827988759424),\n (('287', 'sile', 'speeding'), 31.487827988759424),\n (('302', 'loincloth', 'undid'), 31.487827988759424),\n (('324', 'bc\\xe2\\x80\\x94fiction', 'title'), 31.487827988759424),\n (('77', 'contradictory', 'man\\xe2\\x80\\x94he'), 31.487827988759424),\n (('81354\\xe2\\x80\\x94dc22', '2007033784', 'eisbn'), 31.487827988759424),\n (('ageconquering', 'unguents\\xe2\\x80\\x9d', '253'), 31.487827988759424),\n (('agent', 'sandra', 'dijkstra'), 31.487827988759424)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 129
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "tagged_words = nltk.pos_tag(tokens) # Tagging the text (without stop words removed)\ngrammar = \"Pattern: {(<JJ.*>|<NN.*>)(<JJ.*>|<NN.*>|<IN>)<NN.*>?}\" # Tag pattern used by Justeson and Katz to identify likely collocations\ncp = nltk.RegexpParser(grammar) \nresult = cp.parse(tagged_words)\n\ntemp = []\nfor subtree in result.subtrees():\n if subtree.node == 'Pattern':\n temp.append(subtree.leaves())\n[item for item in temp][:20]",
"prompt_number": 160,
"outputs": [
{
"text": "[[('viewing', 'NN'), ('at', 'IN')],\n [('pixels', 'NNS'), ('t', 'NN'), ('h', 'NN')],\n [('e', 'NN'), ('p', 'NN')],\n [('c', 'NN'), ('e', 'NN')],\n [('cognizant', 'JJ'), ('original', 'JJ'), ('v5', 'NNP')],\n [('release', 'NN'), ('october', 'NN')],\n [('chitra', 'NN'), ('banerjee', 'NN'), ('divakaruni', 'NN')],\n [('queen', 'NN'), ('of', 'IN'), ('dreams', 'NNS')],\n [('vine', 'NN'), ('of', 'IN'), ('desire', 'NN')],\n [('unknown', 'NN'), ('errors', 'NNS')],\n [('lives', 'NNS'), ('sister', 'JJR')],\n [('mistress', 'NN'), ('of', 'IN'), ('spices', 'NNS')],\n [('marriage', 'NN'), ('poetry', 'NN')],\n [('yuba', 'NN'), ('city', 'NN'), ('black', 'NN')],\n [('candle', 'NN'), ('for', 'IN')],\n [('young', 'JJ'), ('readers', 'NNS')],\n [('mirror', 'NN'), ('of', 'IN'), ('fire', 'NN')],\n [('conch', 'NN'), ('bearer', 'NN'), ('neela', 'NN')],\n [('victory', 'NN'), ('song', 'IN'), ('doubleday', 'NN')],\n [('new', 'JJ'), ('york', 'NN'), ('london', 'NN')]]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 160
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Brown Corpus News Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "bigram_measures_brown = nltk.collocations.BigramAssocMeasures()\ntrigram_measures_brown = nltk.collocations.TrigramAssocMeasures()\nfinder_brown = BigramCollocationFinder.from_words(meaningful_brown_tokens)\nfinder_brown.apply_freq_filter(3)\nfinder_brown.nbest(bigram_measures_brown.pmi, 10) ",
"prompt_number": 224,
"outputs": [
{
"text": "[('sterling', 'township'),\n ('magnetic', 'tape'),\n ('duncan', 'phyfe'),\n ('dolce', 'vita'),\n ('notre', 'dame'),\n ('scottish', 'rite'),\n ('adlai', 'stevenson'),\n ('import', 'quotas'),\n ('moise', 'tshombe'),\n ('souvanna', 'phouma')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 224
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_brown.score_ngrams(bigram_measures_brown.pmi)[:5]",
"prompt_number": 225,
"outputs": [
{
"text": "[(('sterling', 'township'), 14.074587496158713),\n (('magnetic', 'tape'), 13.65954999687987),\n (('duncan', 'phyfe'), 13.659549996879868),\n (('dolce', 'vita'), 13.33762190199251),\n (('notre', 'dame'), 13.33762190199251)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 225
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_brown.nbest(bigram_measures_brown.student_t, 10)",
"prompt_number": 226,
"outputs": [
{
"text": "[('new', 'york'),\n ('per', 'cent'),\n ('mr', 'mrs'),\n ('united', 'states'),\n ('last', 'week'),\n ('last', 'year'),\n ('white', 'house'),\n ('home', 'runs'),\n ('high', 'school'),\n ('president', 'kennedy')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 226
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_brown.nbest(bigram_measures_brown.chi_sq, 10)",
"prompt_number": 227,
"outputs": [
{
"text": "[('los', 'angeles'),\n ('viet', 'nam'),\n ('hong', 'kong'),\n ('dolce', 'vita'),\n ('notre', 'dame'),\n ('scottish', 'rite'),\n ('duncan', 'phyfe'),\n ('sterling', 'township'),\n ('per', 'cent'),\n ('magnetic', 'tape')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 227
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_brown.nbest(bigram_measures_brown.likelihood_ratio, 10)",
"prompt_number": 228,
"outputs": [
{
"text": "[('per', 'cent'),\n ('new', 'york'),\n ('united', 'states'),\n ('white', 'house'),\n ('last', 'week'),\n ('mr', 'mrs'),\n ('los', 'angeles'),\n ('san', 'francisco'),\n ('home', 'runs'),\n ('last', 'year')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 228
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder1_brown = TrigramCollocationFinder.from_words(meaningful_brown_tokens)\nsorted(finder1_brown.nbest(trigram_measures_brown.raw_freq, 10))",
"prompt_number": 229,
"outputs": [
{
"text": "[('10', 'per', 'cent'),\n ('4', 'per', 'cent'),\n ('aged', 'care', 'plan'),\n ('american', 'catholic', 'higher'),\n ('catholic', 'higher', 'education'),\n ('four', 'home', 'runs'),\n ('home', 'rule', 'charter'),\n ('mr', 'hawksley', 'said'),\n ('new', 'york', 'city'),\n ('new', 'york', 'yankees')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 229
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "tagged_brown_words = nltk.pos_tag(brown_tokens)\ngrammar1 = \"Pattern1: {(<JJ.*>|<NN.*>)(<JJ.*>|<NN.*>|<IN>)<NN.*>?}\"\ncp1 = nltk.RegexpParser(grammar1)\nresult1 = cp1.parse(tagged_brown_words)\n\ntemp1 = []\nfor subtree in result1.subtrees():\n if subtree.node == 'Pattern1':\n temp1.append(subtree.leaves())\n[item for item in temp1][:20]",
"prompt_number": 230,
"outputs": [
{
"text": "[[('fulton', 'NN'), ('county', 'NN'), ('grand', 'NN')],\n [('investigation', 'NN'), ('of', 'IN'), ('atlantas', 'NNS')],\n [('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN')],\n [('evidence', 'NN'), ('that', 'IN')],\n [('termend', 'NN'), ('presentments', 'NNS')],\n [('city', 'NN'), ('executive', 'NN'), ('committee', 'NN')],\n [('overall', 'JJ'), ('charge', 'NN')],\n [('thanks', 'NNS'), ('of', 'IN')],\n [('city', 'NN'), ('of', 'IN'), ('atlanta', 'NN')],\n [('manner', 'NN'), ('in', 'IN')],\n [('septemberoctober', 'NN'), ('term', 'NN'), ('jury', 'NN')],\n [('fulton', 'NN'), ('superior', 'JJ'), ('court', 'NN')],\n [('judge', 'NN'), ('durwood', 'NN'), ('pye', 'NN')],\n [('reports', 'NNS'), ('of', 'IN')],\n [('possible', 'JJ'), ('irregularities', 'NNS')],\n [('hardfought', 'NN'), ('primary', 'NN')],\n [('mayornominate', 'NN'), ('ivan', 'NN'), ('allen', 'NN')],\n [('relative', 'JJ'), ('handful', 'NN')],\n [('such', 'JJ'), ('reports', 'NNS')],\n [('widespread', 'JJ'), ('interest', 'NN')]]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 230
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Mystery Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "bigram_measures_myst = nltk.collocations.BigramAssocMeasures()\ntrigram_measures_myst = nltk.collocations.TrigramAssocMeasures()\nfinder_myst = BigramCollocationFinder.from_words(meaningful_myst_tokens)\nfinder_myst.apply_freq_filter(3)\nfinder_myst.nbest(bigram_measures_myst.pmi, 10)",
"prompt_number": 231,
"outputs": [
{
"text": "[('13204', 'toledoseaforth'),\n ('15001700', '13204'),\n ('20002000', 'apr'),\n ('25455', 'naantalisaudi'),\n ('40003000', '20304'),\n ('bahia', 'blanca'),\n ('burro', 'creek'),\n ('daps', '24274'),\n ('days8000', '13154'),\n ('enquiries', 'antwerplibya')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 231
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_myst.score_ngrams(bigram_measures_myst.pmi)[:5]",
"prompt_number": 232,
"outputs": [
{
"text": "[(('13204', 'toledoseaforth'), 13.985753010751832),\n (('15001700', '13204'), 13.985753010751832),\n (('20002000', 'apr'), 13.985753010751832),\n (('25455', 'naantalisaudi'), 13.985753010751832),\n (('40003000', '20304'), 13.985753010751832)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 232
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_myst.nbest(bigram_measures_myst.student_t, 10)",
"prompt_number": 233,
"outputs": [
{
"text": "[('mln', 'tonnes'),\n ('last', 'month'),\n ('billion', 'dlrs'),\n ('mln', 'dlrs'),\n ('sources', 'said'),\n ('mln', 'barrels'),\n ('new', 'york'),\n ('bank', 'japan'),\n ('department', 'said'),\n ('last', 'year')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 233
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_myst.nbest(bigram_measures_myst.chi_sq, 10)",
"prompt_number": 234,
"outputs": [
{
"text": "[('buenos', 'aires'),\n ('van', 'horick'),\n ('santa', 'fe'),\n ('cape', 'spencer'),\n ('shearson', 'lehman'),\n ('dean', 'witter'),\n ('merrill', 'lynch'),\n ('excluded', 'countertrading'),\n ('hrs', 'edt'),\n ('nihon', 'keizai')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 234
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder_myst.nbest(bigram_measures_myst.likelihood_ratio, 10)",
"prompt_number": 235,
"outputs": [
{
"text": "[('last', 'month'),\n ('mln', 'tonnes'),\n ('new', 'york'),\n ('billion', 'dlrs'),\n ('mln', 'barrels'),\n ('west', 'germany'),\n ('united', 'states'),\n ('heating', 'oil'),\n ('mln', 'dlrs'),\n ('sources', 'said')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 235
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "finder1_myst = TrigramCollocationFinder.from_words(meaningful_myst_tokens)\nsorted(finder1_myst.nbest(trigram_measures_myst.raw_freq, 10))",
"prompt_number": 236,
"outputs": [
{
"text": "[('agriculture', 'department', 'said'),\n ('ecus', 'per', 'tonne'),\n ('last', 'month', 'exports'),\n ('mln', 'tonnes', 'last'),\n ('mln', 'tonnes', 'vs'),\n ('tonnes', 'free', 'market'),\n ('tonnes', 'last', 'month'),\n ('trade', 'sources', 'said'),\n ('us', 'agriculture', 'department'),\n ('week', 'ended', 'march')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 236
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "tagged_myst_words = nltk.pos_tag(tokens_myst)\ngrammar2 = \"Pattern2: {(<JJ.*>|<NN.*>)(<JJ.*>|<NN.*>|<IN>)<NN.*>?}\"\ncp2 = nltk.RegexpParser(grammar2)\nresult2 = cp2.parse(tagged_myst_words)\n\ntemp2 = []\nfor subtree in result2.subtrees():\n if subtree.node == 'Pattern2':\n temp2.append(subtree.leaves())\n[item for item in temp2][:10]",
"prompt_number": 238,
"outputs": [
{
"text": "[[('royal', 'NN'), ('dutch', 'NN'), ('rd', 'NN')],\n [('heavy', 'JJ'), ('fuel', 'NN'), ('prices', 'NNS')],\n [('petroleum', 'NN'), ('corp', 'NN')],\n [('subsidiary', 'NN'), ('of', 'IN')],\n [('royal', 'JJ'), ('dutchshell', 'NN'), ('group', 'NN')],\n [('contract', 'NN'), ('prices', 'NNS')],\n [('heavy', 'JJ'), ('fuel', 'NN')],\n [('barrel', 'NN'), ('effective', 'JJ'), ('tomorrow', 'NN')],\n [('price', 'NN'), ('for', 'IN')],\n [('pct', 'NN'), ('sulphur', 'NN'), ('fuel', 'NN')]]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 238
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h1> PART 3: Semantic-similarity, Higher-level Concepts"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This part uses higher level concepts or hypernyms to generalize the specific words to get an idea of the theme of the text."
},
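{
"metadata": {},
"cell_type": "markdown",
"source": "A quick illustration of what a hypernym lookup returns (a sketch using the same pre-3.0 NLTK WordNet API as the rest of this notebook, where `name` and `definition` are attributes rather than methods):"
},
{
"metadata": {},
"cell_type": "code",
"input": "# Toy illustration only: one step up the WordNet hierarchy.\nfrom nltk.corpus import wordnet as wn\nking = wn.synsets('king', 'n')[0] # first nominal sense of 'king'\n[h.name for h in king.hypernyms()] # its immediate higher-level concepts",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},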
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> My Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "fd_words = nltk.FreqDist(meaningful_tokens) #Using Frequency Distribution to get the most frequent words\nimp_terms = fd_words.keys()[:50] \nprint imp_terms",
"prompt_number": 241,
"outputs": [
{
"output_type": "stream",
"text": "['would', 'said', 'one', 'could', 'i\\xe2\\x80\\x99d', 'though', 'us', 'like', 'time', 'even', 'didn\\xe2\\x80\\x99t', 'krishna', 'arjun', 'knew', 'karna', 'yudhisthir', 'face', 'eyes', 'way', 'he\\xe2\\x80\\x99d', 'made', 'husbands', 'palace', 'duryodhan', 'know', 'man', 'perhaps', 'life', 'thought', 'much', 'never', 'see', 'couldn\\xe2\\x80\\x99t', 'day', 'must', 'back', 'dhri', 'king', 'many', 'kunti', 'father', 'war', 'away', 'around', 'heart', 'women', 'asked', 'bheeshma', 'wanted', 'dhai']\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "from nltk.corpus import wordnet as wn\ndef categories_from_hypernyms(termlist): # Function for getting frequent hypernyms\n hypterms = [] \n for term in termlist: # for each term\n s = wn.synsets(term.lower(), 'n') # get nominal synsets\n for synsts in s: \n lm = synsts.lemmas # extract lemma forms\n for l in lm:\n for word in l.name: # find hypernymns of lemmas\n s1 = wn.synsets(word, 'n')\n for syn1 in s1:\n for hyp in syn1.hypernyms():\n hypterms = hypterms + [hyp.name] \n\n hypfd = nltk.FreqDist(hypterms) # Rank order the hypernyms by frequency\n print \"Show most frequent hypernym results:\"\n return [(count, name, wn.synset(name).definition) for (name, count) in hypfd.items()[:10]] \n \ncategories_from_hypernyms(imp_terms) # Passing most frequent terms to the function",
"prompt_number": 242,
"outputs": [
{
"output_type": "stream",
"text": "Show most frequent hypernym results\n",
"stream": "stdout"
},
{
"text": "[(2240,\n 'letter.n.02',\n 'the conventional characters of the alphabet used to represent speech'),\n (878,\n 'chemical_element.n.01',\n 'any of the more than 100 known substances (of which 92 occur naturally) that cannot be separated into simpler substances and that singly or in combination constitute all matter'),\n (624, 'cardinal_compass_point.n.01', 'one of the four main compass points'),\n (587, 'fat-soluble_vitamin.n.01', 'any vitamin that is soluble in fats'),\n (586,\n 'nucleotide.n.01',\n 'a phosphoric ester of a nucleoside; the basic structural unit of nucleic acids (DNA or RNA)'),\n (523,\n 'metallic_element.n.01',\n 'any of several chemical elements that are usually shiny solids that conduct heat or electricity and can be formed into sheets etc.'),\n (445, 'large_integer.n.01', 'an integer equal to or greater than ten'),\n (437,\n 'gas.n.02',\n 'a fluid in the gaseous state having neither independent shape nor volume and being able to expand indefinitely'),\n (390, 'computer_memory_unit.n.01', 'a unit for measuring computer memory'),\n (389,\n 'antioxidant.n.01',\n 'substance that inhibits oxidation or inhibits reactions promoted by oxygen or peroxides')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 242
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
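{
"metadata": {},
"cell_type": "markdown",
"source": "Note that because `l.name` is a string, the loop `for word in l.name` above iterates over its *characters*; every single letter then matches WordNet entries such as 'e' or 'c', which is why `letter.n.02`, chemical elements and similar senses dominate all three result lists in this part. A corrected sketch (assuming the intent was to collect hypernyms of each term's own nominal senses) would be:"
},
{
"metadata": {},
"cell_type": "code",
"input": "def categories_from_hypernyms_fixed(termlist):\n    # Sketch of the intended logic, without the character-level loop.\n    hypterms = []\n    for term in termlist:\n        for synset in wn.synsets(term.lower(), 'n'): # nominal senses of the term itself\n            for hyp in synset.hypernyms(): # their immediate hypernyms\n                hypterms.append(hyp.name) # pre-3.0 NLTK: .name is an attribute\n    hypfd = nltk.FreqDist(hypterms) # rank hypernyms by frequency\n    return [(count, name, wn.synset(name).definition) for (name, count) in hypfd.items()[:10]]\n\n# categories_from_hypernyms_fixed(imp_terms) # uncomment to compare with the output above",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},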
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Brown Corpus News Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "fd_words_brown = nltk.FreqDist(meaningful_brown_tokens)\nimp_brown_terms = fd_words_brown.keys()[:50] \nprint imp_brown_terms",
"prompt_number": 243,
"outputs": [
{
"output_type": "stream",
"text": "['said', 'mrs', 'would', 'new', 'one', 'last', 'two', 'mr', 'first', 'state', 'president', 'year', 'home', 'also', 'years', 'made', 'time', 'three', 'house', 'week', 'city', 'may', 'could', 'school', 'four', 'day', 'committee', 'members', 'man', 'back', 'government', 'many', 'national', 'us', 'states', 'university', 'bill', 'get', 'high', 'american', 'since', 'work', 'kennedy', 'program', 'john', 'night', 'board', 'administration', 'meeting', 'county']\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "categories_from_hypernyms(imp_brown_terms)",
"prompt_number": 244,
"outputs": [
{
"output_type": "stream",
"text": "Show most frequent hypernym results\n",
"stream": "stdout"
},
{
"text": "[(4474,\n 'letter.n.02',\n 'the conventional characters of the alphabet used to represent speech'),\n (1799,\n 'chemical_element.n.01',\n 'any of the more than 100 known substances (of which 92 occur naturally) that cannot be separated into simpler substances and that singly or in combination constitute all matter'),\n (1166,\n 'nucleotide.n.01',\n 'a phosphoric ester of a nucleoside; the basic structural unit of nucleic acids (DNA or RNA)'),\n (1140, 'cardinal_compass_point.n.01', 'one of the four main compass points'),\n (1042, 'fat-soluble_vitamin.n.01', 'any vitamin that is soluble in fats'),\n (895, 'large_integer.n.01', 'an integer equal to or greater than ten'),\n (889,\n 'gas.n.02',\n 'a fluid in the gaseous state having neither independent shape nor volume and being able to expand indefinitely'),\n (876,\n 'metallic_element.n.01',\n 'any of several chemical elements that are usually shiny solids that conduct heat or electricity and can be formed into sheets etc.'),\n (810, 'computer_memory_unit.n.01', 'a unit for measuring computer memory'),\n (795,\n 'blood_group.n.01',\n 'human blood cells (usually just the red blood cells) that have the same antigens')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 244
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<h4> Mystery Collection"
},
{
"metadata": {},
"cell_type": "code",
"input": "fd_words_myst = nltk.FreqDist(meaningful_myst_tokens)\nimp_myst_terms = fd_words_myst.keys()[:50] \nprint imp_myst_terms",
"prompt_number": 245,
"outputs": [
{
"output_type": "stream",
"text": "['said', 'mln', 'pct', 'tonnes', 'us', 'dlrs', 'last', 'trade', 'dollar', 'would', 'oil', 'wheat', 'year', 'yen', 'new', 'japan', 'prices', 'market', 'coffee', 'bank', 'billion', 'month', 'week', 'export', 'one', 'exports', 'gold', 'price', 'rice', 'stocks', 'may', 'two', 'production', 'report', 'april', 'rise', 'also', 'grain', 'vs', 'gasoline', 'sources', '1987', 'department', 'per', '198687', 'exchange', 'could', 'imports', 'agriculture', 'government']\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "categories_from_hypernyms(imp_myst_terms)",
"prompt_number": 246,
"outputs": [
{
"output_type": "stream",
"text": "Show most frequent hypernym results\n",
"stream": "stdout"
},
{
"text": "[(3359,\n 'letter.n.02',\n 'the conventional characters of the alphabet used to represent speech'),\n (1405,\n 'chemical_element.n.01',\n 'any of the more than 100 known substances (of which 92 occur naturally) that cannot be separated into simpler substances and that singly or in combination constitute all matter'),\n (929,\n 'nucleotide.n.01',\n 'a phosphoric ester of a nucleoside; the basic structural unit of nucleic acids (DNA or RNA)'),\n (852, 'cardinal_compass_point.n.01', 'one of the four main compass points'),\n (756, 'fat-soluble_vitamin.n.01', 'any vitamin that is soluble in fats'),\n (728, 'large_integer.n.01', 'an integer equal to or greater than ten'),\n (662,\n 'metallic_element.n.01',\n 'any of several chemical elements that are usually shiny solids that conduct heat or electricity and can be formed into sheets etc.'),\n (612,\n 'gas.n.02',\n 'a fluid in the gaseous state having neither independent shape nor volume and being able to expand indefinitely'),\n (604,\n 'antioxidant.n.01',\n 'substance that inhibits oxidation or inhibits reactions promoted by oxygen or peroxides'),\n (587,\n 'blood_group.n.01',\n 'human blood cells (usually just the red blood cells) that have the same antigens')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 246
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:fd12473344f53fd375ca4d3300b135660d61eb42efed4d6c77c7998d5d0f6fca"
},
"nbformat": 3
}