{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep learning vs. Naive Bayes vs. SVM for sentiment classification\n",
"\n",
"* Benchmark of three different algorithms for sentiment classification\n",
"* Based on 50000 movie reviews in a 50/50 split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classifiers\n",
"\n",
"* LSTM\n",
" * Trained with Keras\n",
" * Features of first 500 words of the review\n",
" * Takes ~8 minutes to train\n",
" * Takes ~40 seconds to predict\n",
" * Architecture taken from CH7 in http://machinelearningmastery.com/\n",
" * Accuracy ~ 0.88\n",
"* Naive Bayes (Bernoulli)\n",
" * Bag-of-words // tf-idf features\n",
" * Basic parameter optimization\n",
" * Takes ~20 seconds to train, ~1 second to predict\n",
" * Accuracy ~ 0.85\n",
" * SVM \n",
" * Bag-of-words // tf-idf features\n",
" * Basic parameter optimization, linear kernel\n",
" * Takes ~15 minutes to train, ~2 minutes to predict\n",
" * Accuracy ~ 0.89"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"SVM performed slightly better but was also quite slow. I doubt parameter optimization is going to give much better performance, but toying with ngram settings might improve things a bit. Also allowing more words in the dataset could be an idea. Naive Bayes performance was still very solid, and superfast. LSTM had a good performance, slightly quicker to train than SVM, and architecture can probably be optimized. "
]
},
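{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal, unexecuted sketch of the two tweaks mentioned above: adding bigram features via `ngram_range` and varying the vocabulary size via `max_features`. The parameter values are illustrative, not tuned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only (not run here): extend the tf-idf grid with ngram and vocabulary-size options\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.grid_search import GridSearchCV\n",
"\n",
"pl = Pipeline([\n",
"    ('tfidf', TfidfVectorizer()),\n",
"    ('nb', BernoulliNB())\n",
"])\n",
"\n",
"params = {\n",
"    'tfidf__ngram_range': [(1, 1), (1, 2)],    # unigrams vs. unigrams + bigrams\n",
"    'tfidf__max_features': [5000, 20000, None] # None keeps the full vocabulary\n",
"}\n",
"\n",
"est = GridSearchCV(pl, params, cv=3, n_jobs=-1, verbose=1)\n",
"# est.fit(X_train_sents, y_train) would run the search as in the cells below"
]
},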
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from keras.datasets import imdb\n",
"from keras.models import Sequential\n",
"from keras.layers import Dense, LSTM, Embedding\n",
"from keras.layers.convolutional import MaxPooling1D, Convolution1D\n",
"from keras.preprocessing import sequence\n",
"\n",
"import numpy as np\n",
"\n",
"np.random.seed(7)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"top_words = 5000\n",
"(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Keras with LSTM network"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"max_review_length = 500\n",
"X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)\n",
"X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"((25000, 500), (25000, 500))"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"____________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"====================================================================================================\n",
"embedding_1 (Embedding) (None, 500, 32) 160000 embedding_input_1[0][0] \n",
"____________________________________________________________________________________________________\n",
"convolution1d_1 (Convolution1D) (None, 500, 32) 3104 embedding_1[0][0] \n",
"____________________________________________________________________________________________________\n",
"maxpooling1d_1 (MaxPooling1D) (None, 250, 32) 0 convolution1d_1[0][0] \n",
"____________________________________________________________________________________________________\n",
"lstm_1 (LSTM) (None, 100) 53200 maxpooling1d_1[0][0] \n",
"____________________________________________________________________________________________________\n",
"dense_1 (Dense) (None, 1) 101 lstm_1[0][0] \n",
"====================================================================================================\n",
"Total params: 216405\n",
"____________________________________________________________________________________________________\n",
"None\n"
]
}
],
"source": [
"embedding_vector_length = 32\n",
"\n",
"model = Sequential()\n",
"model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))\n",
"model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))\n",
"model.add(MaxPooling1D(pool_length=2))\n",
"model.add(LSTM(100))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"\n",
"model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n",
"print(model.summary())\n"
]
},
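{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter counts in the summary can be checked by hand:\n",
"\n",
"* Embedding: 5000 words × 32 dimensions = 160000\n",
"* Convolution1D: 32 filters × (3 × 32) weights + 32 biases = 3104\n",
"* LSTM: 4 gates × (32 inputs + 100 recurrent + 1 bias) × 100 units = 53200\n",
"* Dense: 100 weights + 1 bias = 101\n",
"\n",
"Total: 160000 + 3104 + 53200 + 101 = 216405, matching the printed total."
]
},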
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/3\n",
"25000/25000 [==============================] - 149s - loss: 0.4475 - acc: 0.7819 \n",
"Epoch 2/3\n",
"25000/25000 [==============================] - 153s - loss: 0.2623 - acc: 0.8956 \n",
"Epoch 3/3\n",
"25000/25000 [==============================] - 153s - loss: 0.2192 - acc: 0.9163 \n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f04c0097f60>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X_train, y_train, nb_epoch=3, batch_size=64)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predictions = model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.91 0.85 0.88 12500\n",
" 1 0.86 0.91 0.89 12500\n",
"\n",
"avg / total 0.88 0.88 0.88 25000\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, predictions > .5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"\n",
"* .88 accuracy\n",
"* Training takes ~ 8 minutes\n",
"* Prediction takes ~ 45 seconds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naive bayes"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"X_sents = np.array(map(lambda x: \" \".join(x), X_train))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def to_sents(X):\n",
" sents = list()\n",
" for x in X:\n",
" sents.append(\" \".join(str(s) for s in x))\n",
" return sents\n",
"\n",
"X_train_sents = to_sents(X_train)\n",
"X_test_sents = to_sents(X_test)"
]
},
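{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted in the comment above, these \"sentences\" are strings of integer word indices rather than actual words, which is all tf-idf needs. For eyeballing the data, the indices can be mapped back to words; the sketch below (not executed here) assumes the Keras default offset of 3 (0 = padding, 1 = start of sequence, 2 = out-of-vocabulary)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: decode the first words of a review for inspection\n",
"word_index = imdb.get_word_index()  # maps word -> frequency rank\n",
"index_word = {rank + 3: word for word, rank in word_index.items()}\n",
"print(\" \".join(index_word.get(i, \"?\") for i in X_train[0][:20]))"
]
},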
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 45 candidates, totalling 135 fits\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 36.7s\n",
"[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed: 2.3min finished\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
" estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
" ... vocabulary=None)), ('nb', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))]),\n",
" fit_params={}, iid=True, n_jobs=-1,\n",
" param_grid={'tfidf__min_df': [2, 5, 0.02], 'nb__alpha': [0.1, 0.3, 1, 3, 10], 'tfidf__max_df': [1.0, 0.8, 0.5]},\n",
" pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.grid_search import GridSearchCV\n",
"\n",
"pl = Pipeline([\n",
" ('tfidf', TfidfVectorizer()),\n",
" ('nb', BernoulliNB())\n",
"])\n",
"\n",
"params = {\n",
" 'tfidf__min_df': [2, 5, .02],\n",
" 'tfidf__max_df': [1.0, .8, .5],\n",
" 'nb__alpha': [0.10, 0.3, 1, 3, 10]\n",
"}\n",
"\n",
"est = GridSearchCV(pl, params, cv=3, n_jobs=-1, verbose=1)\n",
"\n",
"est.fit(X_train_sents, y_train)"
]
},
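{
"cell_type": "markdown",
"metadata": {},
"source": [
"After fitting, `GridSearchCV` exposes the winning configuration through `best_params_` and `best_score_`; the cell below (not executed here) would print them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Inspect the best cross-validated configuration and its mean CV accuracy\n",
"print(est.best_params_)\n",
"print(est.best_score_)"
]
},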
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[10698, 1802],\n",
" [ 2063, 10437]])"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix \n",
"from sklearn.metrics import classification_report\n",
"\n",
"predictions = est.predict(X_test_sents)\n",
"\n",
"confusion_matrix(y_test, predictions)"
]
},
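{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows are true labels, columns are predicted labels, so accuracy is (10698 + 10437) / 25000 ≈ 0.845, consistent with the report below."
]
},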
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.84 0.86 0.85 12500\n",
" 1 0.85 0.83 0.84 12500\n",
"\n",
"avg / total 0.85 0.85 0.85 25000\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, predictions))"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 3 candidates, totalling 9 fits\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 4.2min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 4.2min remaining: 5.3min\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.8min\n",
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.8min\n",
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.9min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 3.1min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 7.3min remaining: 0.0s\n",
"[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 7.3min finished\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
" estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
" ...,\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False))]),\n",
" fit_params={}, iid=True, n_jobs=-1,\n",
" param_grid={'tfidf__min_df': [3], 'svm__C': [0.3, 1, 3]},\n",
" pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.svm import SVC\n",
"\n",
"pl = Pipeline([\n",
" ('tfidf', TfidfVectorizer()),\n",
" ('svm', SVC(kernel=\"linear\"))\n",
"])\n",
"\n",
"params = {\n",
" 'tfidf__min_df': [3],\n",
" 'svm__C': [0.3, 1, 3]\n",
"}\n",
"\n",
"est = GridSearchCV(pl, params, cv=3, n_jobs=-1, verbose=2)\n",
"\n",
"est.fit(X_train_sents, y_train)"
]
},
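{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note on speed: for a linear kernel on 25000 high-dimensional sparse samples, `sklearn.svm.LinearSVC` (backed by liblinear) would likely train much faster than `SVC(kernel=\"linear\")`, which uses libsvm. The drop-in sketch below is untested here, and `C` would still need tuning."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: LinearSVC as a faster drop-in for the linear-kernel SVC\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"pl_fast = Pipeline([\n",
"    ('tfidf', TfidfVectorizer(min_df=3)),\n",
"    ('svm', LinearSVC(C=1))\n",
"])"
]
},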
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
" estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
" ...,\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False))]),\n",
" fit_params={}, iid=True, n_jobs=-1,\n",
" param_grid={'tfidf__min_df': [3], 'svm__C': [0.3, 1, 3]},\n",
" pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"est"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predictions = est.predict(X_test_sents)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.89 0.88 0.89 12500\n",
" 1 0.88 0.89 0.89 12500\n",
"\n",
"avg / total 0.89 0.89 0.89 25000\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, predictions))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}