Created
November 30, 2016 11:15
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Analysis of the UP Vidhan Sabha headline dataset\n", | |
"\n", | |
"### Notebook by [Aashish K Tiwari]\n", | |
"#### [Persistent Systems Ltd]\n", | |
"#### Data Source: Persistent Systems Limited." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Table of contents\n", | |
"\n", | |
"\n", | |
"1. [Step 1: Loading Dataset](#Step-1:-Loading-Dataset)\n", | |
"\n", | |
"2. [Step 2: Analyzing](#Step-2:-Analyzing)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Step 1: Loading Dataset\n", | |
"\n", | |
"[[ go back to the top ]](#Table-of-contents)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A first look at the dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy\n", | |
"dataset = pd.read_csv('headline_dump.csv')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>headline_text</th>\n", | |
" <th>headline_keywords</th>\n", | |
" <th>headline_keypersons</th>\n", | |
" <th>book_year</th>\n", | |
" <th>book_session</th>\n", | |
" <th>book_volume</th>\n", | |
" <th>book_number</th>\n", | |
" <th>book_proceeding_date</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>पूर्व न्यायाधीश काटजू द्वारा महात्मा गांधी और...</td>\n", | |
" <td>आपत्तिजनक टिप्पणी; पूर्व न्यायाधीश काटजू द्वार...</td>\n", | |
" <td>प्रदीप माथुर, श्री;अध्यक्ष, श्री;मोहम्मद आजम ख...</td>\n", | |
" <td>2015</td>\n", | |
" <td>1</td>\n", | |
" <td>493</td>\n", | |
" <td>1</td>\n", | |
" <td>2015-03-13</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>प्रदेश में जिला पंचायत अध्यक्षों तथा क्षेत्र ...</td>\n", | |
" <td>जिला पंचायत चुनाव; जिला पंचायत अध्यक्ष एवं क्ष...</td>\n", | |
" <td>दलजीत सिंह, श्री;कैलाश यादव, श्री;अनुग्रह नारा...</td>\n", | |
" <td>2015</td>\n", | |
" <td>1</td>\n", | |
" <td>493</td>\n", | |
" <td>1</td>\n", | |
" <td>2015-03-13</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>श्रमिकों के स्वास्थ्य जीवन सुरक्षा एवं सामाजिक...</td>\n", | |
" <td>स्वास्थ्य जीवन सुरक्षा; सामाजिक सुरक्षा अधिनिय...</td>\n", | |
" <td>मनीष असीजा, श्री;सतीश महाना, श्री;शाहिद मंजूर,...</td>\n", | |
" <td>2015</td>\n", | |
" <td>1</td>\n", | |
" <td>493</td>\n", | |
" <td>1</td>\n", | |
" <td>2015-03-13</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>पूर्व न्यायाधीश काटजू द्वारा महात्मा गांधी और ...</td>\n", | |
" <td>आपत्तिजनक टिप्पणी; पूर्व न्यायाधीश काटजू द्वार...</td>\n", | |
" <td>मोहम्मद आजम खां, श्री;सुरेश कुमार खन्ना, श्री;...</td>\n", | |
" <td>2015</td>\n", | |
" <td>1</td>\n", | |
" <td>493</td>\n", | |
" <td>1</td>\n", | |
" <td>2015-03-13</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>जेलों में बन्दियों से काम कराये जाने को मजदूरी...</td>\n", | |
" <td>जेल बन्दी श्रम; मजदूरी; न्यूनतम</td>\n", | |
" <td>दलवीर सिंह, श्री;बलराम यादव, श्री;अध्यक्ष, श्र...</td>\n", | |
" <td>2015</td>\n", | |
" <td>1</td>\n", | |
" <td>493</td>\n", | |
" <td>1</td>\n", | |
" <td>2015-03-13</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" headline_text \\\n", | |
"0 पूर्व न्यायाधीश काटजू द्वारा महात्मा गांधी और... \n", | |
"1 प्रदेश में जिला पंचायत अध्यक्षों तथा क्षेत्र ... \n", | |
"2 श्रमिकों के स्वास्थ्य जीवन सुरक्षा एवं सामाजिक... \n", | |
"3 पूर्व न्यायाधीश काटजू द्वारा महात्मा गांधी और ... \n", | |
"4 जेलों में बन्दियों से काम कराये जाने को मजदूरी... \n", | |
"\n", | |
" headline_keywords \\\n", | |
"0 आपत्तिजनक टिप्पणी; पूर्व न्यायाधीश काटजू द्वार... \n", | |
"1 जिला पंचायत चुनाव; जिला पंचायत अध्यक्ष एवं क्ष... \n", | |
"2 स्वास्थ्य जीवन सुरक्षा; सामाजिक सुरक्षा अधिनिय... \n", | |
"3 आपत्तिजनक टिप्पणी; पूर्व न्यायाधीश काटजू द्वार... \n", | |
"4 जेल बन्दी श्रम; मजदूरी; न्यूनतम \n", | |
"\n", | |
" headline_keypersons book_year book_session \\\n", | |
"0 प्रदीप माथुर, श्री;अध्यक्ष, श्री;मोहम्मद आजम ख... 2015 1 \n", | |
"1 दलजीत सिंह, श्री;कैलाश यादव, श्री;अनुग्रह नारा... 2015 1 \n", | |
"2 मनीष असीजा, श्री;सतीश महाना, श्री;शाहिद मंजूर,... 2015 1 \n", | |
"3 मोहम्मद आजम खां, श्री;सुरेश कुमार खन्ना, श्री;... 2015 1 \n", | |
"4 दलवीर सिंह, श्री;बलराम यादव, श्री;अध्यक्ष, श्र... 2015 1 \n", | |
"\n", | |
" book_volume book_number book_proceeding_date \n", | |
"0 493 1 2015-03-13 \n", | |
"1 493 1 2015-03-13 \n", | |
"2 493 1 2015-03-13 \n", | |
"3 493 1 2015-03-13 \n", | |
"4 493 1 2015-03-13 " | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dataset.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Step 2: Analyzing\n", | |
"\n", | |
"[[ go back to the top ]](#Table-of-contents)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"full_text = dataset.headline_text.str.cat(sep=\" \")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# print full_text" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The text contains common Hindi stop words. The cells below print three of them (के, है, में) from their Unicode escape sequences." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"के\n" | |
] | |
} | |
], | |
"source": [ | |
"print(u'\\u0915\\u0947')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"है\n" | |
] | |
} | |
], | |
"source": [ | |
"print(u'\\u0939\\u0948')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"में\n" | |
] | |
} | |
], | |
"source": [ | |
"print(u'\\u092e\\u0947\\u0902')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"u'\\u092a\\u094d\\u0930\\u0925\\u092e'" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"'प्रथम'.decode('utf-8')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Word tokens" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"INDIC_NLP_LIB_HOME=\"./indic_nlp_library\"\n", | |
"INDIC_NLP_RESOURCES=\"./indic_nlp_resources\"\n", | |
"import sys\n", | |
"sys.path.append('{}/src'.format(INDIC_NLP_LIB_HOME))\n", | |
"from indicnlp import common\n", | |
"common.set_resources_path(INDIC_NLP_RESOURCES)\n", | |
"from indicnlp import loader\n", | |
"loader.load()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from indicnlp.tokenize import indic_tokenize \n", | |
"\n", | |
"indic_string=full_text.decode('utf-8')\n", | |
"\n", | |
"tokens = indic_tokenize.trivial_tokenize(indic_string)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"न्यायाधीश\n" | |
] | |
} | |
], | |
"source": [ | |
"print tokens[1]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import nltk" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from nltk.util import ngrams" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"bigrams=ngrams(tokens,2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Bigrams" | |
] | |
}, | |
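{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A look at 20 bigrams. `bigrams` is a generator, so each run of the next cell consumes 20 more items; the output may therefore not start at the first headline." | |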
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"प्रदेश, में\n", | |
"में, जिला\n", | |
"जिला, पंचायत\n", | |
"पंचायत, अध्यक्षों\n", | |
"अध्यक्षों, तथा\n", | |
"तथा, क्षेत्र\n", | |
"क्षेत्र, पंचायत\n", | |
"पंचायत, अध्यक्षों\n", | |
"अध्यक्षों, का\n", | |
"का, चुनाव\n", | |
"चुनाव, जनता\n", | |
"जनता, द्वारा\n", | |
"द्वारा, सीधे\n", | |
"सीधे, कराये\n", | |
"कराये, जाने\n", | |
"जाने, का\n", | |
"का, सरकार\n", | |
"सरकार, का\n", | |
"का, विचार\n", | |
"विचार, श्रमिकों\n" | |
] | |
} | |
], | |
"source": [ | |
"import itertools\n", | |
"first20 = itertools.islice(bigrams, 20)\n", | |
"for k,v in first20:\n", | |
" print k + \", \" + v" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from nltk.collocations import BigramCollocationFinder\n", | |
"finder = BigramCollocationFinder.from_words(tokens)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# finder.ngram_fd.viewitems()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\r\n", | |
"498ए, भा\n", | |
"\r\n", | |
"अनियमितता, बरते\n", | |
"\r\n", | |
"कोटला, एचवं\n", | |
"\r\n", | |
"दानु, कुईया\n", | |
"\r\n", | |
"रहे, अवैघ\n", | |
"\r\n", | |
"वन्दे, मातरम्\n", | |
"\r\n", | |
"विकासकर्ता, अंशल\n", | |
"\r\n", | |
"हरपाल, बालियान\n", | |
"\r\n", | |
"होकर, रिसौलीतम\n", | |
"12ए, जयराम\n", | |
"1500, राइस\n", | |
"1994, बैच\n", | |
"20009, संस्थात\n", | |
"27ए, प्रजा\n", | |
"3000, कास्तकारों\n", | |
"600, मेगावाट\n", | |
"933, उपचारिकाओं\n", | |
"अंक, तालिका\n", | |
"अंग, प्रत्यारोपण\n", | |
"अखोप, माईनर\n" | |
] | |
} | |
], | |
"source": [ | |
"bigram_measures = nltk.collocations.BigramAssocMeasures()\n", | |
"best20 = finder.nbest(bigram_measures.pmi,20)\n", | |
"for k,v in best20:\n", | |
" print k + \", \" + v" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# fdist = nltk.FreqDist(bigrams)\n", | |
"# for k,v in fdist.items():\n", | |
"# print k,v" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Top 20 bigrams by PMI" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\r\n", | |
"498ए, भा\n", | |
"\r\n", | |
"अनियमितता, बरते\n", | |
"\r\n", | |
"कोटला, एचवं\n", | |
"\r\n", | |
"दानु, कुईया\n", | |
"\r\n", | |
"रहे, अवैघ\n", | |
"\r\n", | |
"वन्दे, मातरम्\n", | |
"\r\n", | |
"विकासकर्ता, अंशल\n", | |
"\r\n", | |
"हरपाल, बालियान\n", | |
"\r\n", | |
"होकर, रिसौलीतम\n", | |
"12ए, जयराम\n", | |
"1500, राइस\n", | |
"1994, बैच\n", | |
"20009, संस्थात\n", | |
"27ए, प्रजा\n", | |
"3000, कास्तकारों\n", | |
"600, मेगावाट\n", | |
"933, उपचारिकाओं\n", | |
"अंक, तालिका\n", | |
"अंग, प्रत्यारोपण\n", | |
"अखोप, माईनर\n" | |
] | |
} | |
], | |
"source": [ | |
"top20=finder.nbest(bigram_measures.pmi,20)\n", | |
"for k,v in top20:\n", | |
" print k + \", \" + v" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Trigrams" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"trigrams=ngrams(tokens,3)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"पूर्व, न्यायाधीश, काटजू\n", | |
"न्यायाधीश, काटजू, द्वारा\n", | |
"काटजू, द्वारा, महात्मा\n", | |
"द्वारा, महात्मा, गांधी\n", | |
"महात्मा, गांधी, और\n", | |
"गांधी, और, सुभाष\n", | |
"और, सुभाष, चन्द्र\n", | |
"सुभाष, चन्द्र, बोस\n", | |
"चन्द्र, बोस, के\n", | |
"बोस, के, बारे\n", | |
"के, बारे, में\n", | |
"बारे, में, की\n", | |
"में, की, गई\n", | |
"की, गई, \r\n", | |
"आपत्तिजनक\n", | |
"गई, \r\n", | |
"आपत्तिजनक, टिप्पणी\n", | |
"\r\n", | |
"आपत्तिजनक, टिप्पणी, का\n", | |
"टिप्पणी, का, प्रकरण\n", | |
"का, प्रकरण, \r\n", | |
"\n", | |
"प्रकरण, \r\n", | |
", प्रदेश\n", | |
"\r\n", | |
", प्रदेश, में\n" | |
] | |
} | |
], | |
"source": [ | |
"first20tri = itertools.islice(trigrams, 20)\n", | |
"for x,y,z in first20tri:\n", | |
" print x + \", \" + y + \", \" + z " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Top 20 trigrams by PMI" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\r\n", | |
"कोटला, एचवं, छारबाग\n", | |
"एबं, इटवां, दरवाजा\n", | |
"कहला, सरौटा, नयां\n", | |
"डेंडीकेडिट, अरबन, ट्रासपोर्ट\n", | |
"तकिया, हाजी, नुसरत\n", | |
"थमैड़ा, नारा, फौलादपुर\n", | |
"दोहरा, मापदण्ड, अपनाने\n", | |
"निर्माता, मेर्सस, अंश\n", | |
"परक, समाविष्ट, विषय\r\n", | |
"\n", | |
"फर्स्ट, रेफरल, यूनिटों\n", | |
"बांदरा, वासा, खैंडी\n", | |
"भिहौना, \r\n", | |
"होकर, रिसौलीतम\n", | |
"मेरिट, आर्डर, डिस्टि्ब्यूशन\n", | |
"युवती, आफरीन, परबीन\n", | |
"लगड़ी, सोनवल, कटया\n", | |
"वंशज, गाडिया, लोहारों\n", | |
"वन्य, पेड़, शीतगृहों\n", | |
"वाईजेड़ा, सिरौनी, के\r\n", | |
"बीच\n", | |
"वीडियो, कान्फ्रेंसिंग, क्रियाशील\n", | |
"सेहुड़ा, नगाहरी, तिघरा\n" | |
] | |
} | |
], | |
"source": [ | |
"from nltk.collocations import TrigramCollocationFinder\n", | |
"finder3 = TrigramCollocationFinder.from_words(tokens)\n", | |
"trigram_measures = nltk.collocations.TrigramAssocMeasures()\n", | |
"top20_tri=finder3.nbest(trigram_measures.pmi,20)\n", | |
"for k,v,z in top20_tri:\n", | |
" print k + \", \" + v + \", \" + z" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"anaconda-cloud": {}, | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.10" | |
}, | |
"widgets": { | |
"state": { | |
"1f12b998a3f7427c85eba50396be46e7": { | |
"views": [ | |
{ | |
"cell_index": 21 | |
} | |
] | |
}, | |
"a17a9c2a9031400a8d6b5d1dffcd1854": { | |
"views": [ | |
{ | |
"cell_index": 21 | |
} | |
] | |
}, | |
"ad1560ae0848463ba9089ba0d226fd0f": { | |
"views": [ | |
{ | |
"cell_index": 20 | |
} | |
] | |
}, | |
"ee6d6fca82da413f8839b4c4b54a840b": { | |
"views": [ | |
{ | |
"cell_index": 21 | |
} | |
] | |
} | |
}, | |
"version": "1.2.0" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
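Both PMI rankings in the notebook are led by tokens that still carry `\r\n` and by pairs that probably occur only once, which is a common tendency of PMI on unfiltered counts. The block below is a minimal standard-library sketch of one way to address this, written for Python 3 rather than the notebook's Python 2 kernel. It is not NLTK's implementation: `clean_tokens`, `top_bigrams_by_pmi` and `min_count` are hypothetical names introduced here, and the toy token list is illustrative only, not taken from `headline_dump.csv`.

```python
from collections import Counter
from math import log2


def clean_tokens(tokens):
    """Strip surrounding whitespace from each token and drop empty results.

    This removes stand-alone '\r\n' tokens and trailing line breaks.
    """
    return [t.strip() for t in tokens if t.strip()]


def top_bigrams_by_pmi(tokens, n=20, min_count=1):
    """Return up to n ((x, y), pmi) pairs, highest PMI first.

    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with probabilities
    estimated from raw counts. Bigrams seen fewer than min_count
    times are skipped (assumed threshold, tune per corpus).
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)

    scored = []
    for (x, y), count in bigram_counts.items():
        if count < min_count:
            continue
        p_xy = count / total_bigrams
        p_x = unigram_counts[x] / total_unigrams
        p_y = unigram_counts[y] / total_unigrams
        scored.append(((x, y), log2(p_xy / (p_x * p_y))))

    # Highest PMI first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:n]


if __name__ == "__main__":
    # Toy input for illustration only.
    raw = ["जिला", "पंचायत", "\r\n", "जिला", "पंचायत", "का", "चुनाव", "का", "विचार"]
    tokens = clean_tokens(raw)
    for (x, y), score in top_bigrams_by_pmi(tokens, n=5, min_count=2):
        print(x + ", " + y, round(score, 3))
```

When staying with NLTK, a comparable effect is usually obtained by cleaning `tokens` before building the finder and calling `finder.apply_freq_filter(...)` before `finder.nbest(...)`. The threshold is a judgment call: a higher value removes more one-off pairs but also hides rare, genuine collocations.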