Last active
February 3, 2016 08:06
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# [KDD cup 2014 - Predict excitement of projects](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Target: Identify projects that are exceptionally exciting to the business, at the time of posting.\n", | |
"##### Category: Binary classification\n", | |
"##### Evaluation metric: Area under ROC curve" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Works with:\n", | |
"- scikit-learn - version 0.17\n", | |
"- numpy - version 1.10\n", | |
"- pandas - version 0.17.1\n", | |
"- matplotlib - version 2.0\n", | |
"- Jupyter (latest IPython notebook)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### The following features will be covered in this session\n", | |
"- fit/predict/transform model\n", | |
"- train test split\n", | |
"- K-fold cross validation\n", | |
"- Hyper-parameter tuning with grid search\n", | |
"- Behavior of random forest and logistic regression classifiers\n", | |
"- Incremental learning with SGD classifier\n", | |
"- Label encoding\n", | |
"- One hot encoding\n", | |
"- Digitization of numerical attributes\n", | |
"- Area under ROC curve, scoring\n", | |
"- Simple plotting with matplotlib\n", | |
"- Term frequency - inverse document frequency vectorizer\n", | |
"- PCA\n", | |
"- Data manipulation with pandas and numpy" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Download projects.csv, essays.csv and outcomes.csv from the [Get the data](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data) page and save them in the data/ folder inside the working directory" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# To plot inline \n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Importing required features and libraries" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import matplotlib.pyplot as plt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.cross_validation import train_test_split, KFold, cross_val_score\n", | |
"from sklearn.ensemble import RandomForestClassifier\n", | |
"from sklearn.linear_model import LogisticRegression, SGDClassifier\n", | |
"from sklearn.metrics import roc_auc_score, roc_curve, auc\n", | |
"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", | |
"from sklearn.grid_search import GridSearchCV\n", | |
"from sklearn.feature_extraction.text import TfidfVectorizer\n", | |
"from sklearn.decomposition import PCA" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Helper functions" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Return the items of list a that are not in list b, preserving order\n", | |
"def diff(a, b):\n", | |
"    b = set(b)\n", | |
"    return [aa for aa in a if aa not in b]\n", | |
"\n", | |
"# Plot ROC curve\n", | |
"def plot_roc(false_positive_rate, true_positive_rate, auc_score):\n", | |
"    plt.title('Receiver Operating Characteristic')\n", | |
"    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f' % auc_score)\n", | |
"    plt.legend(loc='lower right')\n", | |
"    plt.plot([0, 1], [0, 1], 'r--')\n", | |
"    plt.xlim([-0.1, 1.2])\n", | |
"    plt.ylim([-0.1, 1.2])\n", | |
"    plt.ylabel('True Positive Rate')\n", | |
"    plt.xlabel('False Positive Rate')\n", | |
"    plt.show()\n", | |
"\n", | |
"# Evaluation metrics\n", | |
"def evaluate_model(y_true, y_preds, y_preds_proba):\n", | |
"    # Calculate parameters for ROC curve (column 1 holds the positive-class probabilities)\n", | |
"    fpr, tpr, thresholds = roc_curve(y_true, y_preds_proba[:, 1])\n", | |
"    auc_score = auc(fpr, tpr)\n", | |
"\n", | |
"    # Plot ROC curve\n", | |
"    plot_roc(fpr, tpr, auc_score)\n", | |
"\n", | |
"    # Area under ROC curve score with actual probabilities\n", | |
"    print \"ROC AUC score with probabilities: \", roc_auc_score(y_true, y_preds_proba[:, 1])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Analysis" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Loading CSV files\n", | |
"projects = pd.read_csv('data/projects.csv')\n", | |
"outcomes = pd.read_csv('data/outcomes.csv')\n", | |
"\n", | |
"# Sort by project ID (sort_values replaces the deprecated DataFrame.sort)\n", | |
"projects = projects.sort_values('projectid')\n", | |
"outcomes = outcomes.sort_values('projectid')" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"We will analyze only training data from the data set. Training data will be divided into a train set and a test set. Evaluations will be carried out on the test set." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Fill missing values\n", | |
"projects = projects.fillna(method='pad')  # 'pad' (forward fill) is a naive choice; better imputation methods exist.\n", | |
"\n", | |
"# Extracting training data indices\n", | |
"dates = np.array(projects.date_posted)\n", | |
"train_idx = np.where((dates < '2014-01-01') & (dates > '2012-01-01'))[0]\n", | |
"\n", | |
"# Get training data\n", | |
"training_data = projects.iloc[train_idx].sort_values('projectid')\n", | |
"training_outcomes = outcomes[outcomes.projectid.isin(training_data.projectid)].sort_values('projectid')\n", | |
"\n", | |
"# Get labels\n", | |
"labels = np.array(training_outcomes.is_exciting) == 't'" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train test split\n", | |
"X_train_ids, X_test_ids, y_train, y_test = train_test_split(training_data.projectid, labels,\n", | |
" test_size=0.33, random_state=42)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### A simple random forest model with only categorical attributes" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Checking attribute information of the training data\n", | |
"training_data.info()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Extract only categorical columns\n", | |
"projects_numeric_columns = ['school_latitude', 'school_longitude',\n", | |
" 'fulfillment_labor_materials',\n", | |
" 'total_price_excluding_optional_support',\n", | |
" 'total_price_including_optional_support']\n", | |
"\n", | |
"\n", | |
"projects_id_columns = ['projectid' ,'teacher_acctid', 'schoolid', 'school_ncesid']\n", | |
"projects_categorial_columns = diff(diff(diff(list(training_data.columns), projects_id_columns),\n", | |
" projects_numeric_columns), ['date_posted'])\n", | |
"\n", | |
"projects_categorial_values = np.array(training_data[projects_categorial_columns])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Encode each categorical column as integer codes (a fresh LabelEncoder per column)\n", | |
"label_encoder = LabelEncoder()\n", | |
"categorical_data = label_encoder.fit_transform(projects_categorial_values[:, 0])\n", | |
"\n", | |
"for i in range(1, projects_categorial_values.shape[1]):\n", | |
"    label_encoder = LabelEncoder()\n", | |
"    categorical_data = np.column_stack((categorical_data, label_encoder.fit_transform(projects_categorial_values[:, i])))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Keep project ID to track train test split\n", | |
"project_ids = np.array(training_data.projectid)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"X_train = categorical_data[np.searchsorted(project_ids, X_train_ids)]\n", | |
"X_test = categorical_data[np.searchsorted(project_ids, X_test_ids)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train a random forest classifier (default parameters) with the training set\n", | |
"clf = RandomForestClassifier()\n", | |
"clf.fit(X_train, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test)\n", | |
"pred_probs = clf.predict_proba(X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Get evaluations of the random forest model\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Initialize K-fold CV and run it with different numbers of parallel jobs\n", | |
"kfold_cv = KFold(X_train.shape[0], n_folds=5, shuffle=True, random_state=42)\n", | |
"print \"n-jobs = 1\"\n", | |
"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = 1, verbose=3)\n", | |
"print \"n-jobs = 4\"\n", | |
"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = 4, verbose=3)\n", | |
"print \"n-jobs = -1\"\n", | |
"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = -1, verbose=3)\n", | |
"print \"end...\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Model Selection - Find optimal hyper-parameters for the random forest classifier with grid search" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Initialize parameters grid\n", | |
"param_grid = {'n_estimators': [5, 10, 25]}\n", | |
"\n", | |
"# Initialize grid search CV\n", | |
"grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, scoring='roc_auc', verbose=1, n_jobs=4)\n", | |
"\n", | |
"# Fit data to grid search\n", | |
"grid_search.fit(X_train, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Best hyper parameters\n", | |
"print \"Best n_estimators: \", grid_search.best_estimator_.n_estimators" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Verify with test data\n", | |
"clf = RandomForestClassifier(n_estimators=25)\n", | |
"clf.fit(X_train, y_train)\n", | |
"preds = clf.predict(X_test)\n", | |
"pred_probs = clf.predict_proba(X_test)\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Logistic regression with the same features" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train a logistic regression classifier (default parameters) with the training set\n", | |
"clf = LogisticRegression()\n", | |
"clf.fit(X_train, y_train)\n", | |
"\n", | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test)\n", | |
"pred_probs = clf.predict_proba(X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Get evaluations of the logistic regression model\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Logistic regression with one hot encoded features" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# One hot encoding!\n", | |
"enc = OneHotEncoder()\n", | |
"enc.fit(categorical_data)\n", | |
"X_train_ohe = enc.transform(X_train)\n", | |
"X_test_ohe = enc.transform(X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"print \"Number of features before one hot encoding: \", X_train.shape[1]\n", | |
"print \"Number of features after one hot encoding: \", X_train_ohe.shape[1]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train a logistic regression classifier (default parameters) with the one hot encoded training set\n", | |
"clf = LogisticRegression()\n", | |
"clf.fit(X_train_ohe, y_train)\n", | |
"\n", | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test_ohe)\n", | |
"pred_probs = clf.predict_proba(X_test_ohe)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Get evaluations of the logistic regression model with one hot encoded data\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"### Handling numerical columns" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### PCA" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"print projects_numeric_columns" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"numerical_data = np.array(training_data[projects_numeric_columns])\n", | |
"\n", | |
"# Initialize PCA and fit it on the numerical columns\n", | |
"pca = PCA(n_components=3)\n", | |
"pca.fit(numerical_data)\n", | |
"\n", | |
"X_train = pca.transform(numerical_data[np.searchsorted(project_ids,\n", | |
" X_train_ids)])\n", | |
"X_test = pca.transform(numerical_data[np.searchsorted(project_ids,\n", | |
" X_test_ids)])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"print \"Number of features after dimensionality reduction: \", X_train.shape[1]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train a logistic regression classifier (default parameters) with the training set\n", | |
"clf = LogisticRegression()\n", | |
"clf.fit(X_train, y_train)\n", | |
"\n", | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test)\n", | |
"pred_probs = clf.predict_proba(X_test)\n", | |
"\n", | |
"# Get evaluations of the logistic regression model\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Binning" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Binning numerical data\n", | |
"numerical_dataframe = training_data[projects_numeric_columns]\n", | |
"numerical_data = np.empty(shape=numerical_dataframe.shape[0])\n", | |
"\n", | |
"# Number of bins = 20\n", | |
"number_of_bins = 20\n", | |
"for col in projects_numeric_columns:\n", | |
"    digitized_column = np.digitize(numerical_dataframe[col],\n", | |
"                                   bins=np.linspace(np.min(numerical_dataframe[col]),\n", | |
"                                                    np.max(numerical_dataframe[col]), num=number_of_bins))\n", | |
"    numerical_data = np.column_stack((numerical_data, digitized_column))\n", | |
"\n", | |
"# Drop the uninitialized placeholder column created by np.empty\n", | |
"numerical_data = numerical_data[:, 1:]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"X_train = numerical_data[np.searchsorted(project_ids, X_train_ids)]\n", | |
"X_test = numerical_data[np.searchsorted(project_ids, X_test_ids)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# One hot encoding!\n", | |
"enc = OneHotEncoder()\n", | |
"enc.fit(numerical_data)\n", | |
"X_train_ohe = enc.transform(X_train)\n", | |
"X_test_ohe = enc.transform(X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train a logistic regression classifier (default parameters) with the one hot encoded training set\n", | |
"clf = LogisticRegression()\n", | |
"clf.fit(X_train_ohe, y_train)\n", | |
"\n", | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test_ohe)\n", | |
"pred_probs = clf.predict_proba(X_test_ohe)\n", | |
"\n", | |
"# Get evaluations of the logistic regression model with one hot encoded data\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# delete overall projects data (to save memory)\n", | |
"# reset_selective projects" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Using Essay data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Load essays data file\n", | |
"essays = pd.read_csv('data/essays.csv')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Extract training data from essays data file\n", | |
"training_essays = essays[essays.projectid.isin(training_data.projectid)].sort_values('projectid').fillna(method='pad')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# delete overall essay data (to save memory)\n", | |
"# reset_selective essays" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Initialize and fit TF-IDF vectorizer\n", | |
"tfidf_vectorizer = TfidfVectorizer()\n", | |
"tfidf_vectorizer.fit(training_essays.essay)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Transform training and test data of essays\n", | |
"X_train = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n", | |
" X_train_ids)])\n", | |
"X_test = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n", | |
" X_test_ids)])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"print \"number of features of vectorized essays: \", X_train.shape[1]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train a logistic regression classifier (default parameters) with the training set\n", | |
"clf = LogisticRegression()\n", | |
"clf.fit(X_train, y_train)\n", | |
"\n", | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test)\n", | |
"pred_probs = clf.predict_proba(X_test)\n", | |
"\n", | |
"# Get evaluations of the logistic regression model\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Incremental learning" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Divide data into two halves (// keeps the slice index an integer)\n", | |
"X_train_1 = X_train[:X_train.shape[0] // 2, :]\n", | |
"X_train_2 = X_train[X_train.shape[0] // 2:, :]\n", | |
"y_train_1 = y_train[:y_train.shape[0] // 2]\n", | |
"y_train_2 = y_train[y_train.shape[0] // 2:]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Train an SGD classifier with logistic loss on the first half of the training set\n", | |
"clf = SGDClassifier(loss='log')\n", | |
"clf.fit(X_train_1, y_train_1)\n", | |
"\n", | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test)\n", | |
"pred_probs = clf.predict_proba(X_test)\n", | |
"\n", | |
"# Get evaluations of the SGD classifier trained on the first half\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Partial fit to the trained classifier\n", | |
"clf.partial_fit(X_train_2, y_train_2)\n", | |
"\n", | |
"# Predict values and probabilities\n", | |
"preds = clf.predict(X_test)\n", | |
"pred_probs = clf.predict_proba(X_test)\n", | |
"\n", | |
"# Get evaluations of the logistic regression model\n", | |
"evaluate_model(y_test, preds, pred_probs)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
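
The notebook above scores every model by area under the ROC curve, using `roc_auc_score` on the positive-class probabilities. As a rough illustration of what that number measures, here is a minimal pure-Python sketch. It assumes nothing from the notebook: `roc_auc` is a hypothetical helper (not a scikit-learn function) and the labels and scores are made-up toy values. It computes the AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score, counting ties as half a pair.

```python
def roc_auc(y_true, scores):
    """Brute-force ROC AUC: share of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration only; it compares every positive
    with every negative, so it is far too slow for the full data set.
    """
    positives = [s for y, s in zip(y_true, scores) if y]
    negatives = [s for y, s in zip(y_true, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a correct pair
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (made up for illustration).
y_true = [False, False, True, True]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc(y_true, y_score)
print("ROC AUC: %.2f" % auc_value)  # 3 of the 4 pairs are ordered correctly: 0.75
```

Because only the ordering of the scores matters, a model can reach a high AUC even when its probabilities are poorly calibrated, which is why `evaluate_model` passes probabilities rather than hard class predictions.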