{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working on 20 NewsGroup DataSet.\n",
"\n",
"### Notebook by [Aashish K Tiwari](https://gist.github.com/AashishTiwari)\n",
"#### You can see all my public gists @ https://gist.github.com/AashishTiwari\n",
"\n",
"#### [Persistent Systems Ltd]\n",
"#### Data Source: SciKit Learn Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"\n",
"1. [Step 1: Analyzing Data](#Step-1:-Analyzing-data)\n",
"\n",
"3. [Step 2: PreProcessing](#Step-2:-PreProcessing)\n",
"\n",
"4. [Step 3: Classification](#Step-3:-Classification)\n",
"\n",
"5. [Step 4: Conclusion](#Step-4:-Conclusion)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## libraries\n",
"\n",
"[[ go back to the top ]](#Table-of-contents)\n",
"\n",
"\n",
"* **NumPy**: >= V 1.11.1\n",
"* **pandas**: >= V 0.18.1\n",
"* **scikit-learn**: >= V 0.17.1\n",
"* **matplotlib**: >= V 1.5.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Analyzing Data\n",
"\n",
"[[ go back to the top ]](#Table-of-contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"20 newsgroups dataset is available in Sklean package, the data is divided already in training and testing set, so we don't have to do the normal split (holding 30% of data as test data)\n",
"Here we first import the training dataset"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"newsgroups_train = fetch_20newsgroups(subset='train')"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"sklearn.datasets.base.Bunch"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(newsgroups_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Checking attributes of the data."
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(11314,)"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"newsgroups_train.target.shape"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"numpy.ndarray"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(newsgroups_train.target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the different target variables that we wan't to predict?"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['alt.atheism',\n",
" 'comp.graphics',\n",
" 'comp.os.ms-windows.misc',\n",
" 'comp.sys.ibm.pc.hardware',\n",
" 'comp.sys.mac.hardware',\n",
" 'comp.windows.x',\n",
" 'misc.forsale',\n",
" 'rec.autos',\n",
" 'rec.motorcycles',\n",
" 'rec.sport.baseball',\n",
" 'rec.sport.hockey',\n",
" 'sci.crypt',\n",
" 'sci.electronics',\n",
" 'sci.med',\n",
" 'sci.space',\n",
" 'soc.religion.christian',\n",
" 'talk.politics.guns',\n",
" 'talk.politics.mideast',\n",
" 'talk.politics.misc',\n",
" 'talk.religion.misc']"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"newsgroups_train.target_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See actual files"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# newsgroups_train.viewvalues()"
]
},
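{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the raw text, we can print an excerpt of the first training document. A minimal sketch, assuming only that `newsgroups_train.data` is the list of raw message strings (it is in scikit-learn's loader):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Peek at the first few hundred characters of the first raw message\n",
"print(newsgroups_train.data[0][:300])"
]
},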
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: PreProcessing\n",
"\n",
"[[ go back to the top ]](#Table-of-contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The actual data resides in the form of text inside the various files, some of the filenames (with their actual location) can be seen below:"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ '/home/user/scikit_learn_data/20news_home/20news-bydate-train/rec.autos/102994',\n",
" '/home/user/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51861'], \n",
" dtype='|S91')"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"newsgroups_train.filenames[0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using TF-IDF (Term Frequency, Inverse Document Frequency) algorithm to convert the text data into numerical vectors.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TF-IDF is a way to score the importance of words in a document based on how frequently they appear across multiple documents."
]
},
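{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the formulation scikit-learn uses by default (smoothed IDF, with each row L2-normalized afterwards), for a term $t$, a document $d$, $n$ documents in total, and $\\mathrm{df}(t)$ documents containing $t$:\n",
"\n",
"$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\times \\left(\\ln \\frac{1 + n}{1 + \\mathrm{df}(t)} + 1\\right)$$"
]
},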
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is an example of bag-of-words approach and extending it using TF-IDF algorithm"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1 1 0 1 1 1 1 0 1]\n",
" [0 1 1 0 1 0 1 1 1]]\n",
"{u'on': 2, u'to': 5, u'normal': 1, u'text': 4, u'well': 7, u'how': 0, u'see': 3, u'works': 8, u'vectorizer': 6}\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"text_data = [\n",
" 'normal text to see how vectorizer works',\n",
" 'vectorizer works well on normal text'\n",
" ]\n",
"\n",
"\n",
"vectorizer = CountVectorizer()\n",
"print (vectorizer.fit_transform(text_data).todense())\n",
"print(vectorizer.vocabulary_)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.44610081 0.3174044 0. 0.44610081 0.3174044 0.44610081\n",
" 0.3174044 0. 0.3174044 ]\n",
" [ 0. 0.35464863 0.49844628 0. 0.35464863 0.\n",
" 0.35464863 0.49844628 0.35464863]]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"tfvectorizer = TfidfVectorizer()\n",
"counts=tfvectorizer.fit_transform(text_data).todense()\n",
"print(counts)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{u'how': 0,\n",
" u'normal': 1,\n",
" u'on': 2,\n",
" u'see': 3,\n",
" u'text': 4,\n",
" u'to': 5,\n",
" u'vectorizer': 6,\n",
" u'well': 7,\n",
" u'works': 8}"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfvectorizer.vocabulary_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Back to our current example of 20 newsgroup dataset we apply the algorithm to training data"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(11314, 130107)"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"vectorizer = TfidfVectorizer()\n",
"train_vectors = vectorizer.fit_transform(newsgroups_train.data)\n",
"train_vectors.shape"
]
},
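{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is a sparse matrix: 11,314 documents over a vocabulary of 130,107 terms, with only a tiny fraction of entries non-zero. A quick check (a sketch; `fit_transform` returns a SciPy sparse matrix, whose `nnz` attribute counts the stored non-zero entries):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Fraction of non-zero entries in the TF-IDF matrix\n",
"float(train_vectors.nnz) / (train_vectors.shape[0] * train_vectors.shape[1])"
]
},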
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Classification\n",
"\n",
"[[ go back to the top ]](#Table-of-contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using MultiNomial NB approach:"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn import metrics\n",
"model = MultinomialNB(alpha=.01)\n",
"model.fit(train_vectors, newsgroups_train.target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now Fetch the Test samples and apply TF-IDF to convert."
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"newsgroups_test = fetch_20newsgroups(subset='test')\n",
"vectors_test = vectorizer.transform(newsgroups_test.data)"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.83523632501327671"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" model.score(vectors_test, newsgroups_test.target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluating model by looking at the precision recall & F1 Score.\n",
"Also the confusion matrix"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.82 0.78 0.80 319\n",
" 1 0.69 0.75 0.72 389\n",
" 2 0.74 0.63 0.68 394\n",
" 3 0.65 0.75 0.69 392\n",
" 4 0.83 0.84 0.83 385\n",
" 5 0.84 0.78 0.81 395\n",
" 6 0.82 0.78 0.80 390\n",
" 7 0.89 0.90 0.90 396\n",
" 8 0.93 0.96 0.95 398\n",
" 9 0.95 0.94 0.95 397\n",
" 10 0.95 0.97 0.96 399\n",
" 11 0.89 0.93 0.91 396\n",
" 12 0.79 0.77 0.78 393\n",
" 13 0.89 0.84 0.86 396\n",
" 14 0.87 0.91 0.89 394\n",
" 15 0.82 0.95 0.88 398\n",
" 16 0.76 0.91 0.83 364\n",
" 17 0.97 0.94 0.96 376\n",
" 18 0.80 0.64 0.71 310\n",
" 19 0.76 0.59 0.67 251\n",
"\n",
"avg / total 0.84 0.84 0.83 7532\n",
"\n",
"[[249 0 0 4 0 1 0 0 1 1 0 1 0 5 5 28 3 3\n",
" 1 17]\n",
" [ 0 290 15 14 10 23 6 0 0 3 0 4 12 0 7 2 0 2\n",
" 0 1]\n",
" [ 1 32 248 52 4 20 5 0 2 1 1 6 3 3 5 4 0 0\n",
" 4 3]\n",
" [ 0 11 26 293 22 1 11 1 0 1 0 1 21 0 4 0 0 0\n",
" 0 0]\n",
" [ 0 7 10 14 322 1 8 4 1 2 1 2 9 2 1 0 1 0\n",
" 0 0]\n",
" [ 0 40 14 11 6 307 3 1 2 0 0 3 2 1 4 0 1 0\n",
" 0 0]\n",
" [ 0 4 6 26 8 0 306 11 9 1 5 0 9 4 1 0 0 0\n",
" 0 0]\n",
" [ 0 1 1 5 1 0 10 358 6 1 0 0 6 3 1 0 2 0\n",
" 1 0]\n",
" [ 0 1 0 1 1 0 2 7 383 0 0 0 3 0 0 0 0 0\n",
" 0 0]\n",
" [ 0 0 0 0 1 0 3 4 0 373 11 1 0 0 2 0 0 2\n",
" 0 0]\n",
" [ 0 0 0 0 0 1 1 0 0 4 387 2 0 1 0 2 1 0\n",
" 0 0]\n",
" [ 1 3 1 2 2 1 3 3 0 0 0 370 1 3 2 0 4 0\n",
" 0 0]\n",
" [ 1 9 9 23 6 2 7 3 2 0 0 13 302 9 5 0 0 1\n",
" 1 0]\n",
" [ 2 10 1 3 1 3 3 4 1 2 0 4 8 332 2 7 1 2\n",
" 8 2]\n",
" [ 1 8 0 3 1 3 1 1 0 0 0 2 3 5 359 2 1 0\n",
" 4 0]\n",
" [ 3 1 1 1 0 0 0 0 1 1 1 0 0 2 1 378 0 0\n",
" 2 6]\n",
" [ 0 0 0 1 0 0 1 0 2 1 0 5 1 1 1 0 331 0\n",
" 14 6]\n",
" [ 5 1 0 0 0 1 0 0 0 1 1 0 0 0 0 2 2 355\n",
" 7 1]\n",
" [ 4 1 0 0 2 0 1 4 0 0 1 3 0 2 9 2 72 0\n",
" 199 10]\n",
" [ 35 1 2 0 0 0 0 0 0 0 0 1 0 2 5 33 15 1\n",
" 7 149]]\n"
]
}
],
"source": [
"predictions = model.predict(vectors_test)\n",
"print(metrics.classification_report(newsgroups_test.target, predictions))\n",
"print(metrics.confusion_matrix(newsgroups_test.target, predictions))"
]
},
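{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rows of the report are the numeric class labels; passing `target_names` to `classification_report` maps them back to the newsgroup names (a sketch reusing the `metrics` import from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Repeat the report with human-readable newsgroup names\n",
"print(metrics.classification_report(newsgroups_test.target, predictions,\n",
"                                    target_names=newsgroups_test.target_names))"
]
},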
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding Best Params using Grid Search"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best score: 0.916033233162\n",
"Best parameters: {'alpha': 0.01}\n"
]
}
],
"source": [
"from sklearn.grid_search import GridSearchCV\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.cross_validation import StratifiedKFold\n",
"\n",
"param_range = [0.1, 0.01, 1.0]\n",
"\n",
"naive_bayes_classifier = MultinomialNB()\n",
"parameter_grid = [{'alpha': param_range}]\n",
"\n",
"\n",
"cross_validation = StratifiedKFold(newsgroups_train.target, n_folds=10)\n",
"\n",
"grid_search = GridSearchCV(naive_bayes_classifier,\n",
" param_grid=parameter_grid,\n",
" cv=cross_validation)\n",
"\n",
"\n",
"grid_search.fit(train_vectors, newsgroups_train.target)\n",
"print('Best score: {}'.format(grid_search.best_score_))\n",
"print('Best parameters: {}'.format(grid_search.best_params_))"
]
},
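{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (a sketch, assuming the default `refit=True`, under which `grid_search.best_estimator_` is refit on the full training set), we can score the tuned model on the held-out test set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Score the refit best estimator on the held-out test vectors\n",
"grid_search.best_estimator_.score(vectors_test, newsgroups_test.target)"
]
},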
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Conclusion\n",
"\n",
"[[ go back to the top ]](#Table-of-contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using MultiNomial NB :\n",
"Test Accuracy: 83.52%\n",
"Grid Search: Best Score - 91.6% @ alpha=0.01"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}