maheshakya · February 3, 2016 08:06
diff --git a/KDDcup_2014_analysis.ipynb b/KDDcup_2014_analysis.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [KDD cup 2014 - Predict excitement of projects](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Target: Identify projects that are exceptionally exciting to the business, at the time of posting.\n",
    "##### Category: Binary classification\n",
    "##### Evaluation metric: Area under ROC curve"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Works with:\n",
    "- scikit-learn - version 0.17\n",
    "- numpy - version 1.10\n",
    "- pandas - version 0.17.1\n",
    "- matplotlib - version 2.0\n",
    "- Jupyter (latest IPython notebook)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The following features will be covered in this session\n",
    "- fit/predict/transform model\n",
    "- Train/test split\n",
    "- K-fold cross validation\n",
    "- Hyper-parameter tuning with grid search\n",
    "- Behavior of random forest and logistic regression classifiers\n",
    "- Incremental learning with SGD classifier\n",
    "- Label encoding\n",
    "- One hot encoding\n",
    "- Digitization of numerical attributes\n",
    "- Area under ROC curve, scoring\n",
    "- Simple plotting with matplotlib\n",
    "- Term frequency-inverse document frequency (TF-IDF) vectorizer\n",
    "- PCA\n",
    "- Data manipulation with pandas and numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Download projects.csv, essays.csv and outcomes.csv from the [Get the data](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data) page and save them in the data/ folder inside the working directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# To plot inline \n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Importing required features and libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import train_test_split, KFold, cross_val_score\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression, SGDClassifier\n",
    "from sklearn.metrics import roc_auc_score, roc_curve, auc\n",
    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
    "from sklearn.grid_search import GridSearchCV\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.decomposition import PCA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Helper functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Return the items of a that are not in b (order preserved)\n",
    "def diff(a, b):\n",
    "    b = set(b)\n",
    "    return [aa for aa in a if aa not in b]\n",
    "\n",
    "# Plot ROC curve\n",
    "def plot_roc(false_positive_rate, true_positive_rate, auc_score):\n",
    "    plt.title('Receiver Operating Characteristic')\n",
    "    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f' % auc_score)\n",
    "    plt.legend(loc='lower right')\n",
    "    plt.plot([0,1],[0,1],'r--')\n",
    "    plt.xlim([-0.1,1.2])\n",
    "    plt.ylim([-0.1,1.2])\n",
    "    plt.ylabel('True Positive Rate')\n",
    "    plt.xlabel('False Positive Rate')\n",
    "    plt.show()\n",
    "\n",
    "# Evaluation metrics\n",
    "def evaluate_model(y_true, y_preds, y_preds_proba):\n",
    "    # Calculate parameters for ROC curve (column 1 holds the positive-class probability)\n",
    "    fpr, tpr, thresholds = roc_curve(y_true, y_preds_proba[:, 1])\n",
    "    auc_score = auc(fpr, tpr)\n",
    "\n",
    "    # Plot ROC curve\n",
    "    plot_roc(fpr, tpr, auc_score)\n",
    "\n",
    "    # Area under ROC curve computed from the predicted probabilities\n",
    "    print \"ROC AUC score with probabilities: \", roc_auc_score(y_true, y_preds_proba[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Loading CSV files\n",
    "projects = pd.read_csv('data/projects.csv')\n",
    "outcomes = pd.read_csv('data/outcomes.csv')\n",
    "\n",
    "# Sort by project ID (DataFrame.sort is deprecated since pandas 0.17)\n",
    "projects = projects.sort_values('projectid')\n",
    "outcomes = outcomes.sort_values('projectid')"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "We will analyze only training data from the data set. Training data will be divided into a train set and a test set. Evaluations will be carried out on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Fill missing values\n",
    "projects = projects.fillna(method='pad') # 'pad' (forward fill) is a naive strategy; better imputation methods exist.\n",
    "\n",
    "# Extracting training data indices\n",
    "dates = np.array(projects.date_posted)\n",
    "train_idx = np.where((dates < '2014-01-01') & (dates > '2012-01-01'))[0]\n",
    "\n",
    "# Get training data\n",
    "training_data = projects.iloc[train_idx].sort_values('projectid')\n",
    "training_outcomes = outcomes[outcomes.projectid.isin(training_data.projectid)].sort_values('projectid')\n",
    "\n",
    "# Get labels\n",
    "labels = np.array(training_outcomes.is_exciting) == 't'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train test split\n",
    "X_train_ids, X_test_ids, y_train, y_test = train_test_split(training_data.projectid, labels,\n",
    "                                                    test_size=0.33, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A simple random forest model with only categorical attributes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Checking attribute information of the training data\n",
    "training_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Extract only categorical columns\n",
    "projects_numeric_columns = ['school_latitude', 'school_longitude',\n",
    "                            'fulfillment_labor_materials',\n",
    "                            'total_price_excluding_optional_support',\n",
    "                            'total_price_including_optional_support']\n",
    "\n",
    "\n",
    "projects_id_columns = ['projectid' ,'teacher_acctid', 'schoolid', 'school_ncesid']\n",
    "projects_categorial_columns = diff(diff(diff(list(training_data.columns), projects_id_columns),\n",
    "                                        projects_numeric_columns), ['date_posted'])\n",
    "\n",
    "projects_categorial_values = np.array(training_data[projects_categorial_columns])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Encode labels\n",
    "label_encoder = LabelEncoder()\n",
    "categorical_data = label_encoder.fit_transform(projects_categorial_values[:, 0])\n",
    "\n",
    "for i in range(1, projects_categorial_values.shape[1]):\n",
    "    label_encoder = LabelEncoder()\n",
    "    categorical_data = np.column_stack((categorical_data, label_encoder.fit_transform(projects_categorial_values[:,i])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Keep project ID to track train test split\n",
    "project_ids = np.array(training_data.projectid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train = categorical_data[np.searchsorted(project_ids, X_train_ids)]\n",
    "X_test = categorical_data[np.searchsorted(project_ids, X_test_ids)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train a random forest classifier (default parameters) with the training set\n",
    "clf = RandomForestClassifier()\n",
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test)\n",
    "pred_probs = clf.predict_proba(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get evaluations of the random forest model\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Initialize K-fold CV and run it with different numbers of parallel jobs\n",
    "kfold_cv = KFold(X_train.shape[0], n_folds=5, shuffle=True, random_state=42)\n",
    "print \"n-jobs = 1\"\n",
    "cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = 1, verbose=3)\n",
    "print \"n-jobs = 4\"\n",
    "cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = 4, verbose=3)\n",
    "print \"n-jobs = -1\"\n",
    "cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = -1, verbose=3)\n",
    "print \"end...\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Model Selection - Find optimal hyper-parameters for Random Forest classifier with grid search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Initialize parameters grid\n",
    "param_grid = {'n_estimators': [5, 10, 25]}\n",
    "\n",
    "# Initialize grid search CV\n",
    "grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, scoring='roc_auc', verbose=1, n_jobs=4)\n",
    "\n",
    "# Fit data to grid search\n",
    "grid_search.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Best hyper parameters\n",
    "print \"Best n_estimators: \", grid_search.best_estimator_.n_estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Verify with test data\n",
    "clf = RandomForestClassifier(n_estimators=25)\n",
    "clf.fit(X_train, y_train)\n",
    "preds = clf.predict(X_test)\n",
    "pred_probs = clf.predict_proba(X_test)\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic regression with the same features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train a logistic regression classifier (default parameters) with the training set\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X_train, y_train)\n",
    "\n",
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test)\n",
    "pred_probs = clf.predict_proba(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get evaluations of the logistic regression model\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic regression with one hot encoded features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# One hot encoding!\n",
    "enc = OneHotEncoder()\n",
    "enc.fit(categorical_data)\n",
    "X_train_ohe = enc.transform(X_train)\n",
    "X_test_ohe = enc.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print \"Number of features before one hot encoding: \", X_train.shape[1]\n",
    "print \"Number of features after one hot encoding: \", X_train_ohe.shape[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train a logistic regression classifier (default parameters) with the one hot encoded training set\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X_train_ohe, y_train)\n",
    "\n",
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test_ohe)\n",
    "pred_probs = clf.predict_proba(X_test_ohe)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get evaluations of the logistic regression model with one hot encoded data\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Handling numerical columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print projects_numeric_columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "numerical_data = np.array(training_data[projects_numeric_columns])\n",
    "\n",
    "# Initialize PCA and fit it on the numerical columns\n",
    "pca = PCA(n_components=3)\n",
    "pca.fit(numerical_data)\n",
    "\n",
    "X_train = pca.transform(numerical_data[np.searchsorted(project_ids,\n",
    "                                                       X_train_ids)])\n",
    "X_test = pca.transform(numerical_data[np.searchsorted(project_ids,\n",
    "                                                      X_test_ids)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print \"Number of features after dimensionality reduction: \", X_train.shape[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train a logistic regression classifier (default parameters) with the training set\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X_train, y_train)\n",
    "\n",
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test)\n",
    "pred_probs = clf.predict_proba(X_test)\n",
    "\n",
    "# Get evaluations of the logistic regression model\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Binning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Binning numerical data\n",
    "numerical_dataframe = training_data[projects_numeric_columns]\n",
    "numerical_data = np.empty(shape=numerical_dataframe.shape[0])\n",
    "\n",
    "# Number of bins = 20\n",
    "number_of_bins = 20\n",
    "for col in projects_numeric_columns:\n",
    "    digitized_column = np.digitize(numerical_dataframe[col],\n",
    "                                   bins=np.linspace(np.min(numerical_dataframe[col]),\n",
    "                                                    np.max(numerical_dataframe[col]), num=number_of_bins))\n",
    "    numerical_data = np.column_stack((numerical_data, digitized_column))\n",
    "\n",
    "numerical_data = numerical_data[:, 1:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train = numerical_data[np.searchsorted(project_ids, X_train_ids)]\n",
    "X_test = numerical_data[np.searchsorted(project_ids, X_test_ids)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# One hot encoding!\n",
    "enc = OneHotEncoder()\n",
    "enc.fit(numerical_data)\n",
    "X_train_ohe = enc.transform(X_train)\n",
    "X_test_ohe = enc.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train a logistic regression classifier (default parameters) with the one hot encoded training set\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X_train_ohe, y_train)\n",
    "\n",
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test_ohe)\n",
    "pred_probs = clf.predict_proba(X_test_ohe)\n",
    "\n",
    "# Get evaluations of the logistic regression model with one hot encoded data\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# delete overall projects data (to save memory)\n",
    "# reset_selective projects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Essay data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Load essays data file\n",
    "essays = pd.read_csv('data/essays.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Extract training data from essays data file\n",
    "training_essays = essays[essays.projectid.isin(training_data.projectid)].sort_values('projectid').fillna(method='pad')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# delete overall essay data (to save memory)\n",
    "# reset_selective essays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Initialize and fit TF-IDF vectorizer\n",
    "tfidf_vectorizer = TfidfVectorizer()\n",
    "tfidf_vectorizer.fit(training_essays.essay)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Transform training and test data of essays\n",
    "X_train = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n",
    "                                                                                     X_train_ids)])\n",
    "X_test = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n",
    "                                                                                     X_test_ids)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print \"number of features of vectorized essays: \", X_train.shape[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train a logistic regression classifier (default parameters) with the training set\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X_train, y_train)\n",
    "\n",
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test)\n",
    "pred_probs = clf.predict_proba(X_test)\n",
    "\n",
    "# Get evaluations of the logistic regression model\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Incremental learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Divide data into two halves (// keeps the slice index an integer)\n",
    "X_train_1 = X_train[:X_train.shape[0]//2, :]\n",
    "X_train_2 = X_train[X_train.shape[0]//2:, :]\n",
    "y_train_1 = y_train[:y_train.shape[0]//2]\n",
    "y_train_2 = y_train[y_train.shape[0]//2:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train an SGD classifier with logistic loss on the first half of the training set\n",
    "clf = SGDClassifier(loss='log')\n",
    "clf.fit(X_train_1, y_train_1)\n",
    "\n",
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test)\n",
    "pred_probs = clf.predict_proba(X_test)\n",
    "\n",
    "# Get evaluations of the SGD classifier\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Continue training the already fitted classifier on the second half with partial_fit\n",
    "clf.partial_fit(X_train_2, y_train_2)\n",
    "\n",
    "# Predict values and probabilities\n",
    "preds = clf.predict(X_test)\n",
    "pred_probs = clf.predict_proba(X_test)\n",
    "\n",
    "# Get evaluations of the incrementally updated SGD classifier\n",
    "evaluate_model(y_test, preds, pred_probs)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# [KDD cup 2014 - Predict excitement of projects](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Target: Identify projects that are exceptionally exciting to the business, at the time of posting.\n",
	"##### Category: Binary classification\n",
	"##### Evaluation metric: Area under ROC curve"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Works with:\n",
	"- scikit-learn - version 0.17\n",
	"- numpy - version 1.10\n",
	"- pandas - version 0.17.1\n",
	"- matplotlib - version 2.0\n",
	"- Jupyter (latest IPython notebook)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### The following features will be covered in this session\n",
	"- fit/predict/transform model\n",
	"- Train/test split\n",
	"- K-fold cross validation\n",
	"- Hyper-parameter tuning with grid search\n",
	"- Behavior of random forest and logistic regression classifiers\n",
	"- Incremental learning with SGD classifier\n",
	"- Label encoding\n",
	"- One hot encoding\n",
	"- Digitization of numerical attributes\n",
	"- Area under ROC curve, scoring\n",
	"- Simple plotting with matplotlib\n",
	"- Term frequency-inverse document frequency (TF-IDF) vectorizer\n",
	"- PCA\n",
	"- Data manipulation with pandas and numpy"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Download projects.csv, essays.csv and outcomes.csv from the [Get the data](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data) page and save them in the data/ folder inside the working directory."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# To plot inline \n",
	"%matplotlib inline"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Importing required features and libraries"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import numpy as np\n",
	"import pandas as pd\n",
	"import matplotlib.pyplot as plt"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"from sklearn.cross_validation import train_test_split, KFold, cross_val_score\n",
	"from sklearn.ensemble import RandomForestClassifier\n",
	"from sklearn.linear_model import LogisticRegression, SGDClassifier\n",
	"from sklearn.metrics import roc_auc_score, roc_curve, auc\n",
	"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
	"from sklearn.grid_search import GridSearchCV\n",
	"from sklearn.feature_extraction.text import TfidfVectorizer\n",
	"from sklearn.decomposition import PCA"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Helper functions"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Return the items of a that are not in b (order preserved)\n",
	"def diff(a, b):\n",
	"    b = set(b)\n",
	"    return [aa for aa in a if aa not in b]\n",
	"\n",
	"# Plot ROC curve\n",
	"def plot_roc(false_positive_rate, true_positive_rate, auc_score):\n",
	"    plt.title('Receiver Operating Characteristic')\n",
	"    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f' % auc_score)\n",
	"    plt.legend(loc='lower right')\n",
	"    plt.plot([0,1],[0,1],'r--')\n",
	"    plt.xlim([-0.1,1.2])\n",
	"    plt.ylim([-0.1,1.2])\n",
	"    plt.ylabel('True Positive Rate')\n",
	"    plt.xlabel('False Positive Rate')\n",
	"    plt.show()\n",
	"\n",
	"# Evaluation metrics\n",
	"def evaluate_model(y_true, y_preds, y_preds_proba):\n",
	"    # Calculate parameters for ROC curve (column 1 holds the positive-class probability)\n",
	"    fpr, tpr, thresholds = roc_curve(y_true, y_preds_proba[:, 1])\n",
	"    auc_score = auc(fpr, tpr)\n",
	"\n",
	"    # Plot ROC curve\n",
	"    plot_roc(fpr, tpr, auc_score)\n",
	"\n",
	"    # Area under ROC curve computed from the predicted probabilities\n",
	"    print \"ROC AUC score with probabilities: \", roc_auc_score(y_true, y_preds_proba[:, 1])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Analysis"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Loading CSV files\n",
	"projects = pd.read_csv('data/projects.csv')\n",
	"outcomes = pd.read_csv('data/outcomes.csv')\n",
	"\n",
	"# Sort by project ID (DataFrame.sort is deprecated since pandas 0.17)\n",
	"projects = projects.sort_values('projectid')\n",
	"outcomes = outcomes.sort_values('projectid')"
	]
	},
	{
	"cell_type": "raw",
	"metadata": {},
	"source": [
	"We will analyze only training data from the data set. Training data will be divided into a train set and a test set. Evaluations will be carried out on the test set."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Fill missing values\n",
	"projects = projects.fillna(method='pad') # 'pad' (forward fill) is a naive strategy; better imputation methods exist.\n",
	"\n",
	"# Extracting training data indices\n",
	"dates = np.array(projects.date_posted)\n",
	"train_idx = np.where((dates < '2014-01-01') & (dates > '2012-01-01'))[0]\n",
	"\n",
	"# Get training data\n",
	"training_data = projects.iloc[train_idx].sort_values('projectid')\n",
	"training_outcomes = outcomes[outcomes.projectid.isin(training_data.projectid)].sort_values('projectid')\n",
	"\n",
	"# Get labels\n",
	"labels = np.array(training_outcomes.is_exciting) == 't'"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train test split\n",
	"X_train_ids, X_test_ids, y_train, y_test = train_test_split(training_data.projectid, labels,\n",
	" test_size=0.33, random_state=42)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### A simple random forest model with only categorical attributes"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Checking attribute information of the training data\n",
	"training_data.info()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Extract only categorical columns\n",
	"projects_numeric_columns = ['school_latitude', 'school_longitude',\n",
	" 'fulfillment_labor_materials',\n",
	" 'total_price_excluding_optional_support',\n",
	" 'total_price_including_optional_support']\n",
	"\n",
	"\n",
	"projects_id_columns = ['projectid' ,'teacher_acctid', 'schoolid', 'school_ncesid']\n",
	"projects_categorial_columns = diff(diff(diff(list(training_data.columns), projects_id_columns),\n",
	" projects_numeric_columns), ['date_posted'])\n",
	"\n",
	"projects_categorial_values = np.array(training_data[projects_categorial_columns])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Encode labels\n",
	"label_encoder = LabelEncoder()\n",
	"categorical_data = label_encoder.fit_transform(projects_categorial_values[:, 0])\n",
	"\n",
	"for i in range(1, projects_categorial_values.shape[1]):\n",
	" label_encoder = LabelEncoder()\n",
	" categorical_data = np.column_stack((categorical_data, label_encoder.fit_transform(projects_categorial_values[:,i])))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Keep project ID to track train test split\n",
	"project_ids = np.array(training_data.projectid)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"X_train = categorical_data[np.searchsorted(project_ids, X_train_ids)]\n",
	"X_test = categorical_data[np.searchsorted(project_ids, X_test_ids)]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train a random forest classifier (default parameters) with the training set\n",
	"clf = RandomForestClassifier()\n",
	"clf.fit(X_train, y_train)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test)\n",
	"pred_probs = clf.predict_proba(X_test)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Get evaluations of the random forest model\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Initialize K-fold CV and run it with n_jobs = 1, 4 and -1\n",
	"kfold_cv = KFold(X_train.shape[0], n_folds=5, shuffle=True, random_state=42)\n",
	"print \"n-jobs = 1\"\n",
	"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs=1, verbose=3)\n",
	"print \"n-jobs = 4\"\n",
	"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs=4, verbose=3)\n",
	"print \"n-jobs = -1\"\n",
	"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs=-1, verbose=3)\n",
	"print \"end...\""
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### Model Selection - Find optimal hyper-parameters for the Random Forest classifier with grid search"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Initialize parameter grid\n",
	"param_grid = {'n_estimators': [5, 10, 25]}\n",
	"\n",
	"# Initialize grid search CV\n",
	"grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, scoring='roc_auc', verbose=1, n_jobs=4)\n",
	"\n",
	"# Fit data to grid search\n",
	"grid_search.fit(X_train, y_train)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Best hyper parameters\n",
	"print \"Best n_estimators: \", grid_search.best_estimator_.n_estimators"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Verify with test data, using the n_estimators value chosen by grid search\n",
	"clf = RandomForestClassifier(n_estimators=grid_search.best_params_['n_estimators'])\n",
	"clf.fit(X_train, y_train)\n",
	"preds = clf.predict(X_test)\n",
	"pred_probs = clf.predict_proba(X_test)\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Logistic regression with the same features"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train a logistic regression classifier (default parameters) on the training set\n",
	"clf = LogisticRegression()\n",
	"clf.fit(X_train, y_train)\n",
	"\n",
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test)\n",
	"pred_probs = clf.predict_proba(X_test)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Get evaluations of the logistic regression model\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Logistic regression with one hot encoded features"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# One hot encoding!\n",
	"enc = OneHotEncoder()\n",
	"enc.fit(categorical_data)\n",
	"X_train_ohe = enc.transform(X_train)\n",
	"X_test_ohe = enc.transform(X_test)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"print \"Number of features before one hot encoding: \", X_train.shape[1]\n",
	"print \"Number of features after one hot encoding: \", X_train_ohe.shape[1]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train a logistic regression classifier (default parameters) on the one hot encoded training set\n",
	"clf = LogisticRegression()\n",
	"clf.fit(X_train_ohe, y_train)\n",
	"\n",
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test_ohe)\n",
	"pred_probs = clf.predict_proba(X_test_ohe)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Get evaluations of the logistic regression model with one hot encoded data\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"collapsed": true
	},
	"source": [
	"### Handling numerical columns"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### PCA"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"print projects_numeric_columns"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"numerical_data = np.array(training_data[projects_numeric_columns])\n",
	"\n",
	"# Fit PCA on the numerical columns, keeping 3 components\n",
	"pca = PCA(n_components=3)\n",
	"pca.fit(numerical_data)\n",
	"\n",
	"X_train = pca.transform(numerical_data[np.searchsorted(project_ids,\n",
	" X_train_ids)])\n",
	"X_test = pca.transform(numerical_data[np.searchsorted(project_ids,\n",
	" X_test_ids)])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"print \"Number of features after dimensionality reduction: \", X_train.shape[1]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train a logistic regression classifier (default parameters) on the PCA-transformed training set\n",
	"clf = LogisticRegression()\n",
	"clf.fit(X_train, y_train)\n",
	"\n",
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test)\n",
	"pred_probs = clf.predict_proba(X_test)\n",
	"\n",
	"# Get evaluations of the logistic regression model\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### Binning"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Bin each numerical column into equal-width intervals\n",
	"numerical_dataframe = training_data[projects_numeric_columns]\n",
	"\n",
	"# Number of equally spaced bin edges per column\n",
	"number_of_bins = 20\n",
	"digitized_columns = []\n",
	"for col in projects_numeric_columns:\n",
	" bin_edges = np.linspace(np.min(numerical_dataframe[col]),\n",
	"  np.max(numerical_dataframe[col]), num=number_of_bins)\n",
	" digitized_columns.append(np.digitize(numerical_dataframe[col], bins=bin_edges))\n",
	"\n",
	"# Stack the digitized columns side by side: one row per project\n",
	"numerical_data = np.column_stack(digitized_columns)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"X_train = numerical_data[np.searchsorted(project_ids, X_train_ids)]\n",
	"X_test = numerical_data[np.searchsorted(project_ids, X_test_ids)]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# One hot encoding!\n",
	"enc = OneHotEncoder()\n",
	"enc.fit(numerical_data)\n",
	"X_train_ohe = enc.transform(X_train)\n",
	"X_test_ohe = enc.transform(X_test)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train a logistic regression classifier (default parameters) on the one hot encoded training set\n",
	"clf = LogisticRegression()\n",
	"clf.fit(X_train_ohe, y_train)\n",
	"\n",
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test_ohe)\n",
	"pred_probs = clf.predict_proba(X_test_ohe)\n",
	"\n",
	"# Get evaluations of the logistic regression model with one hot encoded data\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# delete overall projects data (to save memory)\n",
	"# reset_selective projects"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Using Essay data"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Load essays data file\n",
	"essays = pd.read_csv('data/essays.csv')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Extract training rows from the essays data, sorted by project ID, forward-filling missing values\n",
	"training_essays = essays[essays.projectid.isin(training_data.projectid)].sort_values('projectid').fillna(method='pad')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# delete overall essay data (to save memory)\n",
	"# reset_selective essays"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Initialize and fit TF-IDF vectorizer\n",
	"tfidf_vectorizer = TfidfVectorizer()\n",
	"tfidf_vectorizer.fit(training_essays.essay)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Transform training and test data of essays\n",
	"X_train = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n",
	" X_train_ids)])\n",
	"X_test = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n",
	" X_test_ids)])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"print \"number of features of vectorized essays: \", X_train.shape[1]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train a logistic regression classifier (default parameters) on the TF-IDF training set\n",
	"clf = LogisticRegression()\n",
	"clf.fit(X_train, y_train)\n",
	"\n",
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test)\n",
	"pred_probs = clf.predict_proba(X_test)\n",
	"\n",
	"# Get evaluations of the logistic regression model\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Incremental learning"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Divide the training data into two halves\n",
	"half = X_train.shape[0] // 2\n",
	"X_train_1 = X_train[:half, :]\n",
	"X_train_2 = X_train[half:, :]\n",
	"y_train_1 = y_train[:half]\n",
	"y_train_2 = y_train[half:]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Train an SGD classifier with logistic loss on the first half of the training set\n",
	"clf = SGDClassifier(loss='log')\n",
	"clf.fit(X_train_1, y_train_1)\n",
	"\n",
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test)\n",
	"pred_probs = clf.predict_proba(X_test)\n",
	"\n",
	"# Get evaluations of the SGD model trained on the first half\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Update the already trained classifier with the second half via partial_fit\n",
	"clf.partial_fit(X_train_2, y_train_2)\n",
	"\n",
	"# Predict values and probabilities\n",
	"preds = clf.predict(X_test)\n",
	"pred_probs = clf.predict_proba(X_test)\n",
	"\n",
	"# Get evaluations of the incrementally updated SGD model\n",
	"evaluate_model(y_test, preds, pred_probs)"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 2",
	"language": "python",
	"name": "python2"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.6"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}