{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with the 20 Newsgroups Dataset\n",
    "\n",
    "### Notebook by [Aashish K Tiwari](https://gist.github.com/AashishTiwari)\n",
    "#### You can see all my public gists @ https://gist.github.com/AashishTiwari\n",
    "\n",
    "#### [Persistent Systems Ltd]\n",
    "#### Data source: scikit-learn datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of contents\n",
    "\n",
    "\n",
    "1. [Step 1: Analyzing Data](#Step-1:-Analyzing-Data)\n",
    "\n",
    "2. [Step 2: PreProcessing](#Step-2:-PreProcessing)\n",
    "\n",
    "3. [Step 3: Classification](#Step-3:-Classification)\n",
    "\n",
    "4. [Step 4: Conclusion](#Step-4:-Conclusion)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Libraries\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)\n",
    "\n",
    "\n",
    "* **NumPy**: >= v1.11.1\n",
    "* **pandas**: >= v0.18.1\n",
    "* **scikit-learn**: >= v0.17.1\n",
    "* **matplotlib**: >= v1.5.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Analyzing Data\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 20 newsgroups dataset ships with scikit-learn, already divided into training and test sets, so we don't have to do the usual split ourselves (e.g., holding out 30% of the data for testing).\n",
    "Here we first import the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "newsgroups_train = fetch_20newsgroups(subset='train')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sklearn.datasets.base.Bunch"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(newsgroups_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking the attributes of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11314,)"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newsgroups_train.target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numpy.ndarray"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(newsgroups_train.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the different target classes that we want to predict?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['alt.atheism',\n",
       " 'comp.graphics',\n",
       " 'comp.os.ms-windows.misc',\n",
       " 'comp.sys.ibm.pc.hardware',\n",
       " 'comp.sys.mac.hardware',\n",
       " 'comp.windows.x',\n",
       " 'misc.forsale',\n",
       " 'rec.autos',\n",
       " 'rec.motorcycles',\n",
       " 'rec.sport.baseball',\n",
       " 'rec.sport.hockey',\n",
       " 'sci.crypt',\n",
       " 'sci.electronics',\n",
       " 'sci.med',\n",
       " 'sci.space',\n",
       " 'soc.religion.christian',\n",
       " 'talk.politics.guns',\n",
       " 'talk.politics.mideast',\n",
       " 'talk.politics.misc',\n",
       " 'talk.religion.misc']"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newsgroups_train.target_names"
   ]
  },
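  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before modeling, it is worth checking how the 11,314 training documents are spread across the 20 classes; a minimal sketch using NumPy's `bincount` (NumPy is already among the listed libraries):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# target holds one integer label (0-19) per document, so bincount\n",
    "# gives the number of training documents in each class\n",
    "class_counts = np.bincount(newsgroups_train.target)\n",
    "for name, count in zip(newsgroups_train.target_names, class_counts):\n",
    "    print('{}: {}'.format(name, count))"
   ]
  },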
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To dump the full contents of the Bunch, uncomment the line below (the output is very large)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# newsgroups_train.viewvalues()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: PreProcessing\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The actual data resides as plain text in individual files on disk; a couple of the filenames (with their full paths) are shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ '/home/user/scikit_learn_data/20news_home/20news-bydate-train/rec.autos/102994',\n",
       "       '/home/user/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51861'], \n",
       "      dtype='|S91')"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newsgroups_train.filenames[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to convert the text data into numerical vectors\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TF-IDF scores the importance of a word in a document: the score grows with how often the word appears in that document and shrinks with how many other documents also contain it."
   ]
  },
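  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In its common form (scikit-learn's `TfidfVectorizer` uses a smoothed, L2-normalized variant of this), the score of a term $t$ in a document $d$ drawn from a corpus $D$ is\n",
    "\n",
    "$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\times \\log \\frac{|D|}{|\\{d' \\in D : t \\in d'\\}|}$$\n",
    "\n",
    "so a term scores high when it occurs often in $d$ but appears in few other documents."
   ]
  },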
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is an example of the plain bag-of-words approach, followed by the same two sentences vectorized with TF-IDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 1 0 1 1 1 1 0 1]\n",
      " [0 1 1 0 1 0 1 1 1]]\n",
      "{u'on': 2, u'to': 5, u'normal': 1, u'text': 4, u'well': 7, u'how': 0, u'see': 3, u'works': 8, u'vectorizer': 6}\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "text_data = [\n",
    "    'normal text to see how vectorizer works',\n",
    "    'vectorizer works well on normal text'\n",
    "]\n",
    "\n",
    "vectorizer = CountVectorizer()\n",
    "print(vectorizer.fit_transform(text_data).todense())\n",
    "print(vectorizer.vocabulary_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.44610081  0.3174044   0.          0.44610081  0.3174044   0.44610081\n",
      "   0.3174044   0.          0.3174044 ]\n",
      " [ 0.          0.35464863  0.49844628  0.          0.35464863  0.\n",
      "   0.35464863  0.49844628  0.35464863]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "tfvectorizer = TfidfVectorizer()\n",
    "counts = tfvectorizer.fit_transform(text_data).todense()\n",
    "print(counts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{u'how': 0,\n",
       " u'normal': 1,\n",
       " u'on': 2,\n",
       " u'see': 3,\n",
       " u'text': 4,\n",
       " u'to': 5,\n",
       " u'vectorizer': 6,\n",
       " u'well': 7,\n",
       " u'works': 8}"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfvectorizer.vocabulary_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Returning to the 20 newsgroups dataset, we apply the same TF-IDF vectorization to the training data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11314, 130107)"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer()\n",
    "train_vectors = vectorizer.fit_transform(newsgroups_train.data)\n",
    "train_vectors.shape"
   ]
  },
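  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A matrix of 11,314 documents by 130,107 features is only practical because it is stored in sparse form; a quick sketch to confirm how sparse it actually is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# fraction of cells in the TF-IDF matrix that are nonzero\n",
    "density = train_vectors.nnz / float(train_vectors.shape[0] * train_vectors.shape[1])\n",
    "print('density: {:.4%}'.format(density))"
   ]
  },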
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Classification\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the Multinomial Naive Bayes approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn import metrics\n",
    "\n",
    "model = MultinomialNB(alpha=0.01)\n",
    "model.fit(train_vectors, newsgroups_train.target)"
   ]
  },
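  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once fitted, the model classifies any new piece of text in two steps: transform it with the already-fitted vectorizer, then predict. The sentence below is a made-up example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# a hypothetical document; any string works here\n",
    "sample = ['the pitcher threw a no-hitter in the ninth inning']\n",
    "sample_vec = vectorizer.transform(sample)  # reuse the fitted TF-IDF vocabulary\n",
    "print(newsgroups_train.target_names[model.predict(sample_vec)[0]])"
   ]
  },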
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now fetch the test samples and convert them with the same fitted TF-IDF vectorizer (note `transform`, not `fit_transform`, so the training vocabulary is reused)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "newsgroups_test = fetch_20newsgroups(subset='test')\n",
    "vectors_test = vectorizer.transform(newsgroups_test.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.83523632501327671"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score(vectors_test, newsgroups_test.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluating the model with per-class precision, recall, and F1 score,\n",
    "along with the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       0.82      0.78      0.80       319\n",
      "          1       0.69      0.75      0.72       389\n",
      "          2       0.74      0.63      0.68       394\n",
      "          3       0.65      0.75      0.69       392\n",
      "          4       0.83      0.84      0.83       385\n",
      "          5       0.84      0.78      0.81       395\n",
      "          6       0.82      0.78      0.80       390\n",
      "          7       0.89      0.90      0.90       396\n",
      "          8       0.93      0.96      0.95       398\n",
      "          9       0.95      0.94      0.95       397\n",
      "         10       0.95      0.97      0.96       399\n",
      "         11       0.89      0.93      0.91       396\n",
      "         12       0.79      0.77      0.78       393\n",
      "         13       0.89      0.84      0.86       396\n",
      "         14       0.87      0.91      0.89       394\n",
      "         15       0.82      0.95      0.88       398\n",
      "         16       0.76      0.91      0.83       364\n",
      "         17       0.97      0.94      0.96       376\n",
      "         18       0.80      0.64      0.71       310\n",
      "         19       0.76      0.59      0.67       251\n",
      "\n",
      "avg / total       0.84      0.84      0.83      7532\n",
      "\n",
      "[[249   0   0   4   0   1   0   0   1   1   0   1   0   5   5  28   3   3\n",
      "    1  17]\n",
      " [  0 290  15  14  10  23   6   0   0   3   0   4  12   0   7   2   0   2\n",
      "    0   1]\n",
      " [  1  32 248  52   4  20   5   0   2   1   1   6   3   3   5   4   0   0\n",
      "    4   3]\n",
      " [  0  11  26 293  22   1  11   1   0   1   0   1  21   0   4   0   0   0\n",
      "    0   0]\n",
      " [  0   7  10  14 322   1   8   4   1   2   1   2   9   2   1   0   1   0\n",
      "    0   0]\n",
      " [  0  40  14  11   6 307   3   1   2   0   0   3   2   1   4   0   1   0\n",
      "    0   0]\n",
      " [  0   4   6  26   8   0 306  11   9   1   5   0   9   4   1   0   0   0\n",
      "    0   0]\n",
      " [  0   1   1   5   1   0  10 358   6   1   0   0   6   3   1   0   2   0\n",
      "    1   0]\n",
      " [  0   1   0   1   1   0   2   7 383   0   0   0   3   0   0   0   0   0\n",
      "    0   0]\n",
      " [  0   0   0   0   1   0   3   4   0 373  11   1   0   0   2   0   0   2\n",
      "    0   0]\n",
      " [  0   0   0   0   0   1   1   0   0   4 387   2   0   1   0   2   1   0\n",
      "    0   0]\n",
      " [  1   3   1   2   2   1   3   3   0   0   0 370   1   3   2   0   4   0\n",
      "    0   0]\n",
      " [  1   9   9  23   6   2   7   3   2   0   0  13 302   9   5   0   0   1\n",
      "    1   0]\n",
      " [  2  10   1   3   1   3   3   4   1   2   0   4   8 332   2   7   1   2\n",
      "    8   2]\n",
      " [  1   8   0   3   1   3   1   1   0   0   0   2   3   5 359   2   1   0\n",
      "    4   0]\n",
      " [  3   1   1   1   0   0   0   0   1   1   1   0   0   2   1 378   0   0\n",
      "    2   6]\n",
      " [  0   0   0   1   0   0   1   0   2   1   0   5   1   1   1   0 331   0\n",
      "   14   6]\n",
      " [  5   1   0   0   0   1   0   0   0   1   1   0   0   0   0   2   2 355\n",
      "    7   1]\n",
      " [  4   1   0   0   2   0   1   4   0   0   1   3   0   2   9   2  72   0\n",
      "  199  10]\n",
      " [ 35   1   2   0   0   0   0   0   0   0   0   1   0   2   5  33  15   1\n",
      "    7 149]]\n"
     ]
    }
   ],
   "source": [
    "predictions = model.predict(vectors_test)\n",
    "print(metrics.classification_report(newsgroups_test.target, predictions))\n",
    "print(metrics.confusion_matrix(newsgroups_test.target, predictions))"
   ]
  },
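  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw confusion matrix is hard to scan. Since matplotlib is among the listed libraries, a heatmap view (a minimal sketch) makes the off-diagonal confusions easier to spot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "cm = metrics.confusion_matrix(newsgroups_test.target, predictions)\n",
    "plt.figure(figsize=(8, 8))\n",
    "plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)\n",
    "plt.colorbar()\n",
    "plt.xticks(range(20), newsgroups_test.target_names, rotation=90)\n",
    "plt.yticks(range(20), newsgroups_test.target_names)\n",
    "plt.xlabel('Predicted label')\n",
    "plt.ylabel('True label')\n",
    "plt.show()"
   ]
  },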
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding the Best Parameters Using Grid Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best score: 0.916033233162\n",
      "Best parameters: {'alpha': 0.01}\n"
     ]
    }
   ],
   "source": [
    "# these imports match scikit-learn 0.17; from 0.18 onwards the same\n",
    "# classes live in sklearn.model_selection\n",
    "from sklearn.grid_search import GridSearchCV\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.cross_validation import StratifiedKFold\n",
    "\n",
    "# candidate smoothing values for MultinomialNB\n",
    "param_range = [0.1, 0.01, 1.0]\n",
    "\n",
    "naive_bayes_classifier = MultinomialNB()\n",
    "parameter_grid = [{'alpha': param_range}]\n",
    "\n",
    "# stratified 10-fold CV keeps the class proportions in every fold\n",
    "cross_validation = StratifiedKFold(newsgroups_train.target, n_folds=10)\n",
    "\n",
    "grid_search = GridSearchCV(naive_bayes_classifier,\n",
    "                           param_grid=parameter_grid,\n",
    "                           cv=cross_validation)\n",
    "\n",
    "grid_search.fit(train_vectors, newsgroups_train.target)\n",
    "print('Best score: {}'.format(grid_search.best_score_))\n",
    "print('Best parameters: {}'.format(grid_search.best_params_))"
   ]
  },
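  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`GridSearchCV` refits the best estimator on the full training set by default (`refit=True`), so we can score it directly on the held-out test data to see what the tuned alpha gives us there:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# best_estimator_ is already trained on all of train_vectors\n",
    "best_model = grid_search.best_estimator_\n",
    "print(best_model.score(vectors_test, newsgroups_test.target))"
   ]
  },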
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Conclusion\n",
    "\n",
    "[[ go back to the top ]](#Table-of-contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using Multinomial Naive Bayes:\n",
    "\n",
    "* Test accuracy: 83.52%\n",
    "* Grid search: best score of 91.6% at alpha = 0.01\n",
    "\n",
    "Note that the grid-search score is a cross-validation score on the training set, so it is not directly comparable to the held-out test accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}