{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep learning vs. Naive Bayes vs. SVM for sentiment classification\n",
"\n",
"* Benchmark of three algorithms for sentiment classification\n",
"* Based on the IMDB dataset of 50,000 movie reviews, split 50/50 into train and test sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classifiers\n",
"\n",
"* LSTM\n",
"  * Trained with Keras\n",
"  * Features: reviews padded/truncated to 500 word indices\n",
"  * Takes ~8 minutes to train\n",
"  * Takes ~40 seconds to predict\n",
"  * Architecture taken from Chapter 7 of http://machinelearningmastery.com/\n",
"  * Accuracy ~ 0.88\n",
"* Naive Bayes (Bernoulli)\n",
"  * Bag-of-words / tf-idf features\n",
"  * Basic parameter optimization\n",
"  * Takes ~20 seconds to train, ~1 second to predict\n",
"  * Accuracy ~ 0.85\n",
"* SVM\n",
"  * Bag-of-words / tf-idf features\n",
"  * Basic parameter optimization, linear kernel\n",
"  * Takes ~15 minutes to train, ~2 minutes to predict\n",
"  * Accuracy ~ 0.89"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"SVM performed slightly better than the others, but was also quite slow. I doubt further parameter optimization will improve its performance much, but toying with ngram settings might help a bit (a sketch of that idea follows this cell), and allowing more words in the vocabulary could also be worth trying. Naive Bayes performance was still very solid, and it is superfast. The LSTM performed well, trained slightly faster than the SVM, and its architecture can probably be optimized further."
]
},
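{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the ngram idea above (illustrative values, not a tuned or executed run): `TfidfVectorizer` accepts an `ngram_range` parameter, so bigrams could simply be added to the grid that is searched for the Naive Bayes pipeline below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: extend the tf-idf grid with ngram settings (values are\n",
"# illustrative assumptions, not tuned results)\n",
"params_ngrams = {\n",
"    'tfidf__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams\n",
"    'tfidf__min_df': [2, 5],\n",
"    'nb__alpha': [0.3, 1]\n",
"}"
]
},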
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from keras.datasets import imdb\n",
"from keras.models import Sequential\n",
"from keras.layers import Dense, LSTM, Embedding\n",
"from keras.layers.convolutional import MaxPooling1D, Convolution1D\n",
"from keras.preprocessing import sequence\n",
"\n",
"import numpy as np\n",
"\n",
"# Fix the random seed for reproducibility\n",
"np.random.seed(7)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Keep only the 5000 most frequent words in the vocabulary\n",
"top_words = 5000\n",
"(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)"
]
},
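{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loaded reviews are lists of integer word indices, not text. As a quick sanity check (not part of the original run), a review can be decoded back to words with `imdb.get_word_index()`; this assumes the default `index_from=3` offset that `load_data` applies, with indices 0-2 reserved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Decode the first 20 word indices of the first training review.\n",
"# Assumes load_data's default index_from=3 offset (0-2 are reserved indices).\n",
"word_index = imdb.get_word_index()\n",
"index_to_word = {v + 3: k for k, v in word_index.items()}\n",
"print(\" \".join(index_to_word.get(i, \"?\") for i in X_train[0][:20]))"
]
},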
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Keras with LSTM network"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Pad/truncate every review to exactly 500 word indices\n",
"max_review_length = 500\n",
"X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)\n",
"X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)"
]
},
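{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, `pad_sequences` pre-pads short sequences with zeros and truncates long ones from the front by default, e.g.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A short sequence is left-padded with zeros up to maxlen\n",
"sequence.pad_sequences([[1, 2, 3]], maxlen=5)  # -> array([[0, 0, 1, 2, 3]])"
]
},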
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"((25000, 500), (25000, 500))"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"____________________________________________________________________________________________________\n",
"Layer (type)                     Output Shape          Param #     Connected to                     \n",
"====================================================================================================\n",
"embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          \n",
"____________________________________________________________________________________________________\n",
"convolution1d_1 (Convolution1D)  (None, 500, 32)       3104        embedding_1[0][0]                \n",
"____________________________________________________________________________________________________\n",
"maxpooling1d_1 (MaxPooling1D)    (None, 250, 32)       0           convolution1d_1[0][0]            \n",
"____________________________________________________________________________________________________\n",
"lstm_1 (LSTM)                    (None, 100)           53200       maxpooling1d_1[0][0]             \n",
"____________________________________________________________________________________________________\n",
"dense_1 (Dense)                  (None, 1)             101         lstm_1[0][0]                     \n",
"====================================================================================================\n",
"Total params: 216405\n",
"____________________________________________________________________________________________________\n",
"None\n"
]
}
],
"source": [
"embedding_vector_length = 32\n",
"\n",
"model = Sequential()\n",
"# Learn a 32-dimensional embedding for each of the 5000 vocabulary words\n",
"model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))\n",
"# 1D convolution + max-pooling extract and downsample local phrase features\n",
"model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))\n",
"model.add(MaxPooling1D(pool_length=2))\n",
"# LSTM reads the pooled sequence; a single sigmoid unit outputs P(positive)\n",
"model.add(LSTM(100))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"\n",
"model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n",
"print(model.summary())\n"
]
},
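{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter counts above can be checked by hand: the embedding has 5000 × 32 = 160,000 weights; the convolution (32 filters of width 3 over 32 input channels) has 32 × 3 × 32 + 32 = 3,104; the LSTM has 4 × ((32 + 100) × 100 + 100) = 53,200 (four gates, each with input, recurrent, and bias weights); and the dense output layer has 100 + 1 = 101."
]
},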
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/3\n",
"25000/25000 [==============================] - 149s - loss: 0.4475 - acc: 0.7819 \n",
"Epoch 2/3\n",
"25000/25000 [==============================] - 153s - loss: 0.2623 - acc: 0.8956 \n",
"Epoch 3/3\n",
"25000/25000 [==============================] - 153s - loss: 0.2192 - acc: 0.9163 \n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f04c0097f60>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X_train, y_train, nb_epoch=3, batch_size=64)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Predicted probabilities from the sigmoid output, one per test review\n",
"predictions = model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"             precision    recall  f1-score   support\n",
"\n",
"          0       0.91      0.85      0.88     12500\n",
"          1       0.86      0.91      0.89     12500\n",
"\n",
"avg / total       0.88      0.88      0.88     25000\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"# Threshold the predicted probabilities at 0.5 to get class labels\n",
"print(classification_report(y_test, predictions > .5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"\n",
"* ~0.88 accuracy\n",
"* Training takes ~8 minutes\n",
"* Prediction takes ~45 seconds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naive Bayes"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Reload the data, since X_train/X_test were overwritten by the padded arrays\n",
"(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Reviews are lists of integer word indices; join them into space-separated\n",
"# strings so they can be fed to a TfidfVectorizer (the indices have to be\n",
"# converted to str before joining)\n",
"X_sents = np.array([\" \".join(str(s) for s in x) for x in X_train])"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def to_sents(X):\n",
"    \"\"\"Join each review's integer word indices into one space-separated string.\"\"\"\n",
"    sents = list()\n",
"    for x in X:\n",
"        sents.append(\" \".join(str(s) for s in x))\n",
"    return sents\n",
"\n",
"X_train_sents = to_sents(X_train)\n",
"X_test_sents = to_sents(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 45 candidates, totalling 135 fits\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   36.7s\n",
"[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:  2.3min finished\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
"       estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
"        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
"        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
"        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
"        ... vocabulary=None)), ('nb', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))]),\n",
"       fit_params={}, iid=True, n_jobs=-1,\n",
"       param_grid={'tfidf__min_df': [2, 5, 0.02], 'nb__alpha': [0.1, 0.3, 1, 3, 10], 'tfidf__max_df': [1.0, 0.8, 0.5]},\n",
"       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.grid_search import GridSearchCV\n",
"\n",
"pl = Pipeline([\n",
"    ('tfidf', TfidfVectorizer()),\n",
"    ('nb', BernoulliNB())\n",
"])\n",
"\n",
"# Grid over the tf-idf document-frequency cutoffs and the NB smoothing prior\n",
"params = {\n",
"    'tfidf__min_df': [2, 5, .02],\n",
"    'tfidf__max_df': [1.0, .8, .5],\n",
"    'nb__alpha': [0.10, 0.3, 1, 3, 10]\n",
"}\n",
"\n",
"est = GridSearchCV(pl, params, cv=3, n_jobs=-1, verbose=1)\n",
"\n",
"est.fit(X_train_sents, y_train)"
]
},
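{
"cell_type": "markdown",
"metadata": {},
"source": [
"`GridSearchCV` exposes the winning parameter combination and its mean cross-validated score; this inspection step was not part of the original run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Best cross-validated parameters and mean CV accuracy of the refit estimator\n",
"print(est.best_params_)\n",
"print(est.best_score_)"
]
},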
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[10698,  1802],\n",
"       [ 2063, 10437]])"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"from sklearn.metrics import classification_report\n",
"\n",
"predictions = est.predict(X_test_sents)\n",
"\n",
"confusion_matrix(y_test, predictions)"
]
},
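{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows are true labels and columns are predictions, so 1802 negative reviews were misclassified as positive and 2063 positive reviews as negative. Accuracy follows directly: (10698 + 10437) / 25000 ≈ 0.845, matching the ~0.85 reported above."
]
},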
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"             precision    recall  f1-score   support\n",
"\n",
"          0       0.84      0.86      0.85     12500\n",
"          1       0.85      0.83      0.84     12500\n",
"\n",
"avg / total       0.85      0.85      0.85     25000\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, predictions))"
]
},
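{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SVM"
]
},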
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 3 candidates, totalling 9 fits\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=0.3 .....................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=1 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] tfidf__min_df=3, svm__C=3 .......................................\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=1 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 4.2min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 4.2min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:  4.2min remaining:  5.3min\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.8min\n",
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.8min\n",
"[CV] ............................ tfidf__min_df=3, svm__C=0.3 - 4.9min\n",
"[CV] .............................. tfidf__min_df=3, svm__C=3 - 3.1min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  7.3min remaining:  0.0s\n",
"[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  7.3min finished\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
"       estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
"        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
"        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
"        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
"        ...,\n",
"        max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
"        tol=0.001, verbose=False))]),\n",
"       fit_params={}, iid=True, n_jobs=-1,\n",
"       param_grid={'tfidf__min_df': [3], 'svm__C': [0.3, 1, 3]},\n",
"       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.svm import SVC\n",
"\n",
"pl = Pipeline([\n",
"    ('tfidf', TfidfVectorizer()),\n",
"    ('svm', SVC(kernel=\"linear\"))\n",
"])\n",
"\n",
"# Only the regularization strength C is searched; min_df is fixed at 3\n",
"params = {\n",
"    'tfidf__min_df': [3],\n",
"    'svm__C': [0.3, 1, 3]\n",
"}\n",
"\n",
"est = GridSearchCV(pl, params, cv=3, n_jobs=-1, verbose=2)\n",
"\n",
"est.fit(X_train_sents, y_train)"
]
},
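{
"cell_type": "markdown",
"metadata": {},
"source": [
"An aside on the training time: `SVC(kernel=\"linear\")` trains in time that grows faster than linearly with the number of samples. A sketch of the same pipeline using `LinearSVC` instead (an untested assumption here, but typically far faster at this data size, at the cost of kernel options):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.svm import LinearSVC\n",
"\n",
"# Same features, but a liblinear-based linear SVM; C=1 mirrors the middle\n",
"# of the grid above and would still need validating\n",
"pl_fast = Pipeline([\n",
"    ('tfidf', TfidfVectorizer(min_df=3)),\n",
"    ('svm', LinearSVC(C=1))\n",
"])\n",
"# pl_fast.fit(X_train_sents, y_train)"
]
},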
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise',\n",
"       estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
"        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
"        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
"        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n",
"        ...,\n",
"        max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
"        tol=0.001, verbose=False))]),\n",
"       fit_params={}, iid=True, n_jobs=-1,\n",
"       param_grid={'tfidf__min_df': [3], 'svm__C': [0.3, 1, 3]},\n",
"       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"est"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predictions = est.predict(X_test_sents)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"             precision    recall  f1-score   support\n",
"\n",
"          0       0.89      0.88      0.89     12500\n",
"          1       0.88      0.89      0.89     12500\n",
"\n",
"avg / total       0.89      0.89      0.89     25000\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, predictions))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}