{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep learning vs. Naive Bayes vs. SVM for sentiment classification\n",
"\n",
"* Benchmark of three algorithms for sentiment classification\n",
"* Based on the IMDB dataset of 50,000 movie reviews, split 50/50 into train and test sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classifiers\n",
"\n",
"* LSTM\n",
"  * Trained with Keras\n",
"  * Features: reviews padded/truncated to 500 word indices\n",
"  * Takes ~8 minutes to train\n",
"  * Takes ~40 seconds to predict\n",
"  * Architecture taken from Chapter 7 of http://machinelearningmastery.com/\n",
"  * Accuracy ~ 0.88\n",
"* Naive Bayes (Bernoulli)\n",
"  * Bag-of-words / tf-idf features\n",
"  * Basic parameter optimization\n",
"  * Takes ~20 seconds to train, ~1 second to predict\n",
"  * Accuracy ~ 0.85\n",
"* SVM\n",
"  * Bag-of-words / tf-idf features\n",
"  * Basic parameter optimization, linear kernel\n",
"  * Takes ~15 minutes to train, ~2 minutes to predict\n",
"  * Accuracy ~ 0.89"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"SVM performed slightly better than the others, but was also quite slow. I doubt further parameter optimization will improve its performance much, but toying with ngram settings might help a bit (a sketch of that idea follows this cell), and allowing more words in the vocabulary could also be worth trying. Naive Bayes performance was still very solid, and it is superfast. The LSTM performed well, trained slightly faster than the SVM, and its architecture can probably be optimized further."
]
},
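{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the ngram idea above (illustrative values, not a tuned or executed run): `TfidfVectorizer` accepts an `ngram_range` parameter, so bigrams could simply be added to the grid that is searched for the Naive Bayes pipeline below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: extend the tf-idf grid with ngram settings (values are\n",
"# illustrative assumptions, not tuned results)\n",
"params_ngrams = {\n",
"    'tfidf__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams\n",
"    'tfidf__min_df': [2, 5],\n",
"    'nb__alpha': [0.3, 1]\n",
"}"
]
},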
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from keras.datasets import imdb\n",
"from keras.models import Sequential\n",
"from keras.layers import Dense, LSTM, Embedding\n",
"from keras.layers.convolutional import MaxPooling1D, Convolution1D\n",
"from keras.preprocessing import sequence\n",
"\n",
"import numpy as np\n",
"\n",
"# Fix the random seed for reproducibility\n",
"np.random.seed(7)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Keep only the 5000 most frequent words in the vocabulary\n",
"top_words = 5000\n",
"(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)"
]
},
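{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loaded reviews are lists of integer word indices, not text. As a quick sanity check (not part of the original run), a review can be decoded back to words with `imdb.get_word_index()`; this assumes the default `index_from=3` offset that `load_data` applies, with indices 0-2 reserved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Decode the first 20 word indices of the first training review.\n",
"# Assumes load_data's default index_from=3 offset (0-2 are reserved indices).\n",
"word_index = imdb.get_word_index()\n",
"index_to_word = {v + 3: k for k, v in word_index.items()}\n",
"print(\" \".join(index_to_word.get(i, \"?\") for i in X_train[0][:20]))"
]
},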
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Keras with LSTM network"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Pad/truncate every review to exactly 500 word indices\n",
"max_review_length = 500\n",
"X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)\n",
"X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)"
]
},
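{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, `pad_sequences` pre-pads short sequences with zeros and truncates long ones from the front by default, e.g.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A short sequence is left-padded with zeros up to maxlen\n",
"sequence.pad_sequences([[1, 2, 3]], maxlen=5)  # -> array([[0, 0, 1, 2, 3]])"
]
},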
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"((25000, 500), (25000, 500))"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"____________________________________________________________________________________________________\n",
"Layer (type)                     Output Shape          Param #     Connected to                     \n",
"====================================================================================================\n",
"embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          \n",
"____________________________________________________________________________________________________\n",
"convolution1d_1 (Convolution1D)  (None, 500, 32)       3104        embedding_1[0][0]                \n",
"____________________________________________________________________________________________________\n",
"maxpooling1d_1 (MaxPooling1D)    (None, 250, 32)       0           convolution1d_1[0][0]            \n",
"____________________________________________________________________________________________________\n",
"lstm_1 (LSTM)                    (None, 100)           53200       maxpooling1d_1[0][0]             \n",
"____________________________________________________________________________________________________\n",
"dense_1 (Dense)                  (None, 1)             101         lstm_1[0][0]                     \n",
"====================================================================================================\n",
"Total params: 216405\n",
"____________________________________________________________________________________________________\n",
"None\n"
]
}
],
"source": [
"embedding_vector_length = 32\n",
"\n",
"model = Sequential()\n",
"# Learn a 32-dimensional embedding for each of the 5000 vocabulary words\n",
"model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))\n",
"# 1D convolution + max-pooling extract and downsample local phrase features\n",
"model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))\n",
"model.add(MaxPooling1D(pool_length=2))\n",
"# LSTM reads the pooled sequence; a single sigmoid unit outputs P(positive)\n",
"model.add(LSTM(100))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"\n",
"model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n",
"print(model.summary())\n"
]
},
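{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter counts above can be checked by hand: the embedding has 5000 × 32 = 160,000 weights; the convolution (32 filters of width 3 over 32 input channels) has 32 × 3 × 32 + 32 = 3,104; the LSTM has 4 × ((32 + 100) × 100 + 100) = 53,200 (four gates, each with input, recurrent, and bias weights); and the dense output layer has 100 + 1 = 101."
]
},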
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/3\n",
"25000/25000 [==============================] - 149s - loss: 0.4475 - acc: 0.7819 \n",
"Epoch 2/3\n",
"25000/25000 [==============================] - 153s - loss: 0.2623 - acc: 0.8956 \n",
"Epoch 3/3\n",
"25000/25000 [==============================] - 153s - loss: 0.2192 - acc: 0.9163 \n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f04c0097f60>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X_train, y_train, nb_epoch=3, batch_size=64)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Predicted probabilities from the sigmoid output, one per test review\n",
"predictions = model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"             precision    recall  f1-score   support\n",
"\n",
"          0       0.91      0.85      0.88     12500\n",
"          1       0.86      0.91      0.89     12500\n",
"\n",
"avg / total       0.88      0.88      0.88     25000\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"# Threshold the predicted probabilities at 0.5 to get class labels\n",
"print(classification_report(y_test, predictions > .5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"\n",
"* ~0.88 accuracy\n",
"* Training takes ~8 minutes\n",
"* Prediction takes ~45 seconds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naive Bayes"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Reload the data, since X_train/X_test were overwritten by the padded arrays\n",
"(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Reviews are lists of integer word indices; join them into space-separated\n",
"# strings so they can be fed to a TfidfVectorizer (the indices have to be\n",
"# converted to str before joining)\n",
"X_sents = np.array([\" \".join(str(s) for s in x) for x in X_train])"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def to_sents(X):\n",
"    \"\"\"Join each review's integer word indices into one space-separated string.\"\"\"\n",
"    sents = list()\n",
"    for x in X:\n",
"        sents.append(\" \".join(str(s) for s in x))\n",
"    return sents\n",
"\n",
"X_train_sents = to_sents(X_train)\n",
"X_test_sents = to_sents(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 45 candidates, totalling 135 fits\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   36.7s\n",
"[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:  2.3min finished\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
"       estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
"        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
"        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
"        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
"        ... vocabulary=None)), ('nb', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))]),\n",
"       fit_params={}, iid=True, n_jobs=-1,\n",
"       param_grid={'tfidf__min_df': [2, 5, 0.02], 'nb__alpha': [0.1, 0.3, 1, 3, 10], 'tfidf__max_df': [1.0, 0.8, 0.5]},\n",
"       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.grid_search import GridSearchCV\n",
"\n",
"pl = Pipeline([\n",
"    ('tfidf', TfidfVectorizer()),\n",
"    ('nb', BernoulliNB())\n",
"])\n",
"\n",
"# Grid over the tf-idf document-frequency cutoffs and the NB smoothing prior\n",
"params = {\n",
"    'tfidf__min_df': [2, 5, .02],\n",
"    'tfidf__max_df': [1.0, .8, .5],\n",
"    'nb__alpha': [0.10, 0.3, 1, 3, 10]\n",
"}\n",
"\n",
"est = GridSearchCV(pl, params, cv=3, n_jobs=-1, verbose=1)\n",
"\n",
"est.fit(X_train_sents, y_train)"
]
},
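{
"cell_type": "markdown",
"metadata": {},
"source": [
"`GridSearchCV` exposes the winning parameter combination and its mean cross-validated score; this inspection step was not part of the original run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Best cross-validated parameters and mean CV accuracy of the refit estimator\n",
"print(est.best_params_)\n",
"print(est.best_score_)"
]
},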
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[10698,  1802],\n",
"       [ 2063, 10437]])"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"from sklearn.metrics import classification_report\n",
"\n",
"predictions = est.predict(X_test_sents)\n",
"\n",
"confusion_matrix(y_test, predictions)"
]
},
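{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows are true labels and columns are predictions, so 1802 negative reviews were misclassified as positive and 2063 positive reviews as negative. Accuracy follows directly: (10698 + 10437) / 25000 ≈ 0.845, matching the ~0.85 reported above."
]
},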
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"             precision    recall  f1-score   support\n",
"\n",
"          0       0.84      0.86      0.85     12500\n",
"          1       0.85      0.83      0.84     12500\n",
"\n",
"avg / total       0.85      0.85      0.85     25000\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, predictions))"
]
},
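{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SVM"
]
},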
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 3 candidates, totalling 9 fits\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 4.2min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:  4.2min remaining:  5.3min\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.8min\n",
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.8min\n",
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.9min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 3.1min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  7.3min remaining:  0.0s\n",
"[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  7.3min finished\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
"       estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
"        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
"        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
"        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
"        ...,\n",
"        max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
"        tol=0.001, verbose=False))]),\n",
"       fit_params={}, iid=True, n_jobs=-1,\n",
"       param_grid={'tfidf__min_df': [3], 'svm__C': [0.3, 1, 3]},\n",
"       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.svm import SVC\n",
"\n",
"pl = Pipeline([\n",
"    ('tfidf', TfidfVectorizer()),\n",
"    ('svm', SVC(kernel=\"linear\"))\n",
"])\n",
"\n",
"# Only the regularization strength C is searched; min_df is fixed at 3\n",
"params = {\n",
"    'tfidf__min_df': [3],\n",
"    'svm__C': [0.3, 1, 3]\n",
"}\n",
"\n",
"est = GridSearchCV(pl, params, cv=3, n_jobs=-1, verbose=2)\n",
"\n",
"est.fit(X_train_sents, y_train)"
]
},
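{
"cell_type": "markdown",
"metadata": {},
"source": [
"An aside on the training time: `SVC(kernel=\"linear\")` trains in time that grows faster than linearly with the number of samples. A sketch of the same pipeline using `LinearSVC` instead (an untested assumption here, but typically far faster at this data size, at the cost of kernel options):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.svm import LinearSVC\n",
"\n",
"# Same features, but a liblinear-based linear SVM; C=1 mirrors the middle\n",
"# of the grid above and would still need validating\n",
"pl_fast = Pipeline([\n",
"    ('tfidf', TfidfVectorizer(min_df=3)),\n",
"    ('svm', LinearSVC(C=1))\n",
"])\n",
"# pl_fast.fit(X_train_sents, y_train)"
]
},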
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
"       estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
"        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
"        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
"        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
"        ...,\n",
"        max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
"        tol=0.001, verbose=False))]),\n",
"       fit_params={}, iid=True, n_jobs=-1,\n",
"       param_grid={'tfidf__min_df': [3], 'svm__C': [0.3, 1, 3]},\n",
"       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"est"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predictions = est.predict(X_test_sents)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"             precision    recall  f1-score   support\n",
"\n",
"          0       0.89      0.88      0.89     12500\n",
"          1       0.88      0.89      0.89     12500\n",
"\n",
"avg / total       0.89      0.89      0.89     25000\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, predictions))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}