{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with the 20 Newsgroups Dataset\n",
    "\n",
    "### Notebook by [Aashish K Tiwari](https://gist.github.com/AashishTiwari)\n",
    "#### You can see all my public gists @ https://gist.github.com/AashishTiwari\n",
    "\n",
    "#### [Persistent Systems Ltd]\n",
    "#### Data source: scikit-learn datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of contents\n",
    "\n",
    "\n",
    "1. [Step 1: Analyzing Data](#Step-1:-Analyzing-Data)\n",
    "\n",
    "2. [Step 2: PreProcessing](#Step-2:-PreProcessing)\n",
    "\n",
    "3. [Step 3: Classification](#Step-3:-Classification)\n",
    "\n",
    "4. [Step 4: Conclusion](#Step-4:-Conclusion)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Libraries\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)\n",
    "\n",
    "\n",
    "* **NumPy**: >= v1.11.1\n",
    "* **pandas**: >= v0.18.1\n",
    "* **scikit-learn**: >= v0.17.1\n",
    "* **matplotlib**: >= v1.5.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Analyzing Data\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 20 newsgroups dataset ships with scikit-learn, already divided into training and test sets, so we don't have to do the usual split ourselves (e.g., holding out 30% of the data for testing).\n",
    "Here we first import the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "newsgroups_train = fetch_20newsgroups(subset='train')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sklearn.datasets.base.Bunch"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(newsgroups_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking the attributes of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11314,)"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newsgroups_train.target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numpy.ndarray"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(newsgroups_train.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the different target classes that we want to predict?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['alt.atheism',\n",
       " 'comp.graphics',\n",
       " 'comp.os.ms-windows.misc',\n",
       " 'comp.sys.ibm.pc.hardware',\n",
       " 'comp.sys.mac.hardware',\n",
       " 'comp.windows.x',\n",
       " 'misc.forsale',\n",
       " 'rec.autos',\n",
       " 'rec.motorcycles',\n",
       " 'rec.sport.baseball',\n",
       " 'rec.sport.hockey',\n",
       " 'sci.crypt',\n",
       " 'sci.electronics',\n",
       " 'sci.med',\n",
       " 'sci.space',\n",
       " 'soc.religion.christian',\n",
       " 'talk.politics.guns',\n",
       " 'talk.politics.mideast',\n",
       " 'talk.politics.misc',\n",
       " 'talk.religion.misc']"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newsgroups_train.target_names"
   ]
  },
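  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before modeling, it is worth checking how the 11,314 training documents are spread across the 20 classes; a minimal sketch using NumPy's `bincount` (NumPy is already among the listed libraries):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# target holds one integer label (0-19) per document, so bincount\n",
    "# gives the number of training documents in each class\n",
    "class_counts = np.bincount(newsgroups_train.target)\n",
    "for name, count in zip(newsgroups_train.target_names, class_counts):\n",
    "    print('{}: {}'.format(name, count))"
   ]
  },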
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To dump the full contents of the Bunch, uncomment the line below (the output is very large)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# newsgroups_train.viewvalues()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: PreProcessing\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The actual data resides as plain text in individual files on disk; a couple of the filenames (with their full paths) are shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ '/home/user/scikit_learn_data/20news_home/20news-bydate-train/rec.autos/102994',\n",
       "       '/home/user/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51861'], \n",
       "      dtype='|S91')"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newsgroups_train.filenames[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to convert the text data into numerical vectors\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TF-IDF scores the importance of a word in a document: the score grows with how often the word appears in that document and shrinks with how many other documents also contain it."
   ]
  },
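  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In its common form (scikit-learn's `TfidfVectorizer` uses a smoothed, L2-normalized variant of this), the score of a term $t$ in a document $d$ drawn from a corpus $D$ is\n",
    "\n",
    "$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\times \\log \\frac{|D|}{|\\{d' \\in D : t \\in d'\\}|}$$\n",
    "\n",
    "so a term scores high when it occurs often in $d$ but appears in few other documents."
   ]
  },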
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is an example of the plain bag-of-words approach, followed by the same two sentences vectorized with TF-IDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 1 0 1 1 1 1 0 1]\n",
      " [0 1 1 0 1 0 1 1 1]]\n",
      "{u'on': 2, u'to': 5, u'normal': 1, u'text': 4, u'well': 7, u'how': 0, u'see': 3, u'works': 8, u'vectorizer': 6}\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "text_data = [\n",
    "    'normal text to see how vectorizer works',\n",
    "    'vectorizer works well on normal text'\n",
    "]\n",
    "\n",
    "vectorizer = CountVectorizer()\n",
    "print(vectorizer.fit_transform(text_data).todense())\n",
    "print(vectorizer.vocabulary_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.44610081  0.3174044   0.          0.44610081  0.3174044   0.44610081\n",
      "   0.3174044   0.          0.3174044 ]\n",
      " [ 0.          0.35464863  0.49844628  0.          0.35464863  0.\n",
      "   0.35464863  0.49844628  0.35464863]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "tfvectorizer = TfidfVectorizer()\n",
    "counts = tfvectorizer.fit_transform(text_data).todense()\n",
    "print(counts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{u'how': 0,\n",
       " u'normal': 1,\n",
       " u'on': 2,\n",
       " u'see': 3,\n",
       " u'text': 4,\n",
       " u'to': 5,\n",
       " u'vectorizer': 6,\n",
       " u'well': 7,\n",
       " u'works': 8}"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfvectorizer.vocabulary_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Returning to the 20 newsgroups dataset, we apply the same TF-IDF vectorization to the training data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11314, 130107)"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer()\n",
    "train_vectors = vectorizer.fit_transform(newsgroups_train.data)\n",
    "train_vectors.shape"
   ]
  },
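  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A matrix of 11,314 documents by 130,107 features is only practical because it is stored in sparse form; a quick sketch to confirm how sparse it actually is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# fraction of cells in the TF-IDF matrix that are nonzero\n",
    "density = train_vectors.nnz / float(train_vectors.shape[0] * train_vectors.shape[1])\n",
    "print('density: {:.4%}'.format(density))"
   ]
  },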
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Classification\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the Multinomial Naive Bayes approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn import metrics\n",
    "\n",
    "model = MultinomialNB(alpha=0.01)\n",
    "model.fit(train_vectors, newsgroups_train.target)"
   ]
  },
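  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once fitted, the model classifies any new piece of text in two steps: transform it with the already-fitted vectorizer, then predict. The sentence below is a made-up example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# a hypothetical document; any string works here\n",
    "sample = ['the pitcher threw a no-hitter in the ninth inning']\n",
    "sample_vec = vectorizer.transform(sample)  # reuse the fitted TF-IDF vocabulary\n",
    "print(newsgroups_train.target_names[model.predict(sample_vec)[0]])"
   ]
  },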
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now fetch the test samples and convert them with the same fitted TF-IDF vectorizer (note `transform`, not `fit_transform`, so the training vocabulary is reused)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "newsgroups_test = fetch_20newsgroups(subset='test')\n",
    "vectors_test = vectorizer.transform(newsgroups_test.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.83523632501327671"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score(vectors_test, newsgroups_test.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluating the model with per-class precision, recall, and F1 score,\n",
    "along with the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       0.82      0.78      0.80       319\n",
      "          1       0.69      0.75      0.72       389\n",
      "          2       0.74      0.63      0.68       394\n",
      "          3       0.65      0.75      0.69       392\n",
      "          4       0.83      0.84      0.83       385\n",
      "          5       0.84      0.78      0.81       395\n",
      "          6       0.82      0.78      0.80       390\n",
      "          7       0.89      0.90      0.90       396\n",
      "          8       0.93      0.96      0.95       398\n",
      "          9       0.95      0.94      0.95       397\n",
      "         10       0.95      0.97      0.96       399\n",
      "         11       0.89      0.93      0.91       396\n",
      "         12       0.79      0.77      0.78       393\n",
      "         13       0.89      0.84      0.86       396\n",
      "         14       0.87      0.91      0.89       394\n",
      "         15       0.82      0.95      0.88       398\n",
      "         16       0.76      0.91      0.83       364\n",
      "         17       0.97      0.94      0.96       376\n",
      "         18       0.80      0.64      0.71       310\n",
      "         19       0.76      0.59      0.67       251\n",
      "\n",
      "avg / total       0.84      0.84      0.83      7532\n",
      "\n",
      "[[249   0   0   4   0   1   0   0   1   1   0   1   0   5   5  28   3   3\n",
      "    1  17]\n",
      " [  0 290  15  14  10  23   6   0   0   3   0   4  12   0   7   2   0   2\n",
      "    0   1]\n",
      " [  1  32 248  52   4  20   5   0   2   1   1   6   3   3   5   4   0   0\n",
      "    4   3]\n",
      " [  0  11  26 293  22   1  11   1   0   1   0   1  21   0   4   0   0   0\n",
      "    0   0]\n",
      " [  0   7  10  14 322   1   8   4   1   2   1   2   9   2   1   0   1   0\n",
      "    0   0]\n",
      " [  0  40  14  11   6 307   3   1   2   0   0   3   2   1   4   0   1   0\n",
      "    0   0]\n",
      " [  0   4   6  26   8   0 306  11   9   1   5   0   9   4   1   0   0   0\n",
      "    0   0]\n",
      " [  0   1   1   5   1   0  10 358   6   1   0   0   6   3   1   0   2   0\n",
      "    1   0]\n",
      " [  0   1   0   1   1   0   2   7 383   0   0   0   3   0   0   0   0   0\n",
      "    0   0]\n",
      " [  0   0   0   0   1   0   3   4   0 373  11   1   0   0   2   0   0   2\n",
      "    0   0]\n",
      " [  0   0   0   0   0   1   1   0   0   4 387   2   0   1   0   2   1   0\n",
      "    0   0]\n",
      " [  1   3   1   2   2   1   3   3   0   0   0 370   1   3   2   0   4   0\n",
      "    0   0]\n",
      " [  1   9   9  23   6   2   7   3   2   0   0  13 302   9   5   0   0   1\n",
      "    1   0]\n",
      " [  2  10   1   3   1   3   3   4   1   2   0   4   8 332   2   7   1   2\n",
      "    8   2]\n",
      " [  1   8   0   3   1   3   1   1   0   0   0   2   3   5 359   2   1   0\n",
      "    4   0]\n",
      " [  3   1   1   1   0   0   0   0   1   1   1   0   0   2   1 378   0   0\n",
      "    2   6]\n",
      " [  0   0   0   1   0   0   1   0   2   1   0   5   1   1   1   0 331   0\n",
      "   14   6]\n",
      " [  5   1   0   0   0   1   0   0   0   1   1   0   0   0   0   2   2 355\n",
      "    7   1]\n",
      " [  4   1   0   0   2   0   1   4   0   0   1   3   0   2   9   2  72   0\n",
      "  199  10]\n",
      " [ 35   1   2   0   0   0   0   0   0   0   0   1   0   2   5  33  15   1\n",
      "    7 149]]\n"
     ]
    }
   ],
   "source": [
    "predictions = model.predict(vectors_test)\n",
    "print(metrics.classification_report(newsgroups_test.target, predictions))\n",
    "print(metrics.confusion_matrix(newsgroups_test.target, predictions))"
   ]
  },
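  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw confusion matrix is hard to scan. Since matplotlib is among the listed libraries, a heatmap view (a minimal sketch) makes the off-diagonal confusions easier to spot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "cm = metrics.confusion_matrix(newsgroups_test.target, predictions)\n",
    "plt.figure(figsize=(8, 8))\n",
    "plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)\n",
    "plt.colorbar()\n",
    "plt.xticks(range(20), newsgroups_test.target_names, rotation=90)\n",
    "plt.yticks(range(20), newsgroups_test.target_names)\n",
    "plt.xlabel('Predicted label')\n",
    "plt.ylabel('True label')\n",
    "plt.show()"
   ]
  },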
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding the Best Parameters Using Grid Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best score: 0.916033233162\n",
      "Best parameters: {'alpha': 0.01}\n"
     ]
    }
   ],
   "source": [
    "# these imports match scikit-learn 0.17; from 0.18 onwards the same\n",
    "# classes live in sklearn.model_selection\n",
    "from sklearn.grid_search import GridSearchCV\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.cross_validation import StratifiedKFold\n",
    "\n",
    "# candidate smoothing values for MultinomialNB\n",
    "param_range = [0.1, 0.01, 1.0]\n",
    "\n",
    "naive_bayes_classifier = MultinomialNB()\n",
    "parameter_grid = [{'alpha': param_range}]\n",
    "\n",
    "# stratified 10-fold CV keeps the class proportions in every fold\n",
    "cross_validation = StratifiedKFold(newsgroups_train.target, n_folds=10)\n",
    "\n",
    "grid_search = GridSearchCV(naive_bayes_classifier,\n",
    "                           param_grid=parameter_grid,\n",
    "                           cv=cross_validation)\n",
    "\n",
    "grid_search.fit(train_vectors, newsgroups_train.target)\n",
    "print('Best score: {}'.format(grid_search.best_score_))\n",
    "print('Best parameters: {}'.format(grid_search.best_params_))"
   ]
  },
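  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`GridSearchCV` refits the best estimator on the full training set by default (`refit=True`), so we can score it directly on the held-out test data to see what the tuned alpha gives us there:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# best_estimator_ is already trained on all of train_vectors\n",
    "best_model = grid_search.best_estimator_\n",
    "print(best_model.score(vectors_test, newsgroups_test.target))"
   ]
  },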
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Conclusion\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using Multinomial Naive Bayes:\n",
    "\n",
    "* Test accuracy: 83.52%\n",
    "* Grid search: best score of 91.6% at alpha = 0.01\n",
    "\n",
    "Note that the grid-search score is a cross-validation score on the training set, so it is not directly comparable to the held-out test accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}