Created November 22, 2019 09:22
hyperopt.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.4" | |
}, | |
"colab": { | |
"name": "hyperopt.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
}, | |
"accelerator": "GPU" | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/mkarpp/c2c7de511bd1eaf809140095762d7100/hyperopt.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "RPOiPRRp4ieN", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Bayesian Hyperparameter Optimization\n", | |
"\n", | |
"#### by Matti Karppanen, Advian" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "-2ZYeWHV63va", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Welcome! This notebook runs a lot faster with GPU acceleration. If you wish to do so, please use the Colab menu above to change the runtime to GPU via Runtime -> Change runtime type, and then restart the runtime with Runtime -> Restart runtime.\n", | |
"\n", | |
"For those new to these kinds of notebooks, the simplest way to move down and run the code in a shell is Shift+Enter. Or you can click a code shell and then click the Play button in top left corner of the shell." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Sbxeqf9P4N-c", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"We will be optimizing XGBoost hyperparameters in the Scania Truck Air Pressure System (APS) Failure Prediction dataset. This was a competition in the Industrial Challenge 2016 at The 15th International Symposium on Intelligent Data Analysis (IDA). This dataset was chosen because is has decent size (60000x170), is wholly numeric, and requires only a bit of preparation to start running algorithms.\n", | |
"\n", | |
"This is a binary classification task where we are trying to classify the trucks into those whose APS failed and those whose didn't." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "T4pVlRzy4N-e", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Options\n", | |
"\n", | |
"The more optimization trials you specify the better the optimization results. 30 trials take about 30 minutes to run through the notebook, training XGBoost on a Tesla K80 GPU. So the rate is approximately a minute per trial.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "GE-uWCQp4N-f", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"search_trials = 30" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "aEuHZw114N-h", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Preparation" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "_XazUvwF4N-i", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Download the dataset from the UCI Machine Learning Database." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "lJmqI_IJ4N-j", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"![ ! -f aps_failure_training_set.csv ] && curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv\n", | |
"![ ! -f aps_failure_test_set.csv ] && curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "X3AsW5yJ4N-l", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Import libraries. In addition to the usual Python ML stack and XGBoost, we use Ax.\n", | |
"\n", | |
"Ax is a recently open sourced platform for running experiments, by Facebook. For our purposes here, it has a nice API for Bayesian Optimization. Ax uses PyTorch under the hood." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "7LKkmATz4YM6", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"!pip install ax-platform" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "0I4sXxg94N-m", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import torch\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import xgboost as xgb\n", | |
"import matplotlib as plt\n", | |
"from ax.service.managed_loop import optimize\n", | |
"from sklearn.model_selection import train_test_split\n", | |
"from sklearn.metrics import roc_auc_score, log_loss\n", | |
"\n", | |
"# Dataframe display precision from 6 to 4 digits to enhance readability.\n", | |
"pd.options.display.precision = 4" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "p3UO9XWZ4N-o", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Read the data. The first 20 rows in the CSVs are dataset info, and missing values are marked \"na\" in this dataset. It's very helpful to read the csv with `na_values='na'`. Otherwise read_csv reads any missing values as strings intead of numbers." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "QNBrHYec4N-p", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train = pd.read_csv('aps_failure_training_set.csv', skiprows=20, na_values='na')\n", | |
"test = pd.read_csv('aps_failure_test_set.csv', skiprows=20, na_values='na')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "bVY13z8B4N-r", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train.head(3)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "DxHPJWL54N-u", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"The target variable is a string column with possible values 'pos' and 'neg'. The dataset is inbalanced. Properly preparing an unbalanced dataset is beyond the scope of this notebook. Leave it as is for now." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "XN7YipHP4N-v", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"You may also note at this point that the size of the training data is 60000 x 170." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "cSoyXiP94N-v", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train['class'].value_counts()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "z2EjOiRg4N-y", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"The dataset is numerical with the exception of the target column." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Ne3-jo0C4N-y", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train.dtypes.value_counts()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "th07DlA24N-0", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"There are no negative values in the dataset. This can be seen by looking at the minimun of all column minimums." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "oQlWTjnY4N-1", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train.describe().loc['min'].min()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "7D43ghp24N-3", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"test.describe().loc['min'].min()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "zg13MfVX4N-5", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Therefore, we can replace NA values with -1. This is a reasonable way to deal with missing data given that we are using a decision tree based model. The model can split the data between 0 and -1 if missing values are important." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "URKM2Acf4N-6", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train = train.fillna(-1)\n", | |
"test = test.fillna(-1)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "3kSLTfDi4N-8", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Prepare the train, validation, and test sets in the DMatrix format required by XGBoost. We use a separate validation set in the optimization to avoid overfitting hyperparameters to the test set. Stratify the split by the _y_ column. \n", | |
"\n", | |
"Cross validation is unfortunately many times (more than kfold times) slower and is skipped for speed reasons." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "oCdwCWTc4N-9", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train, val = train_test_split(train, test_size=.25, stratify=train['class'], \n", | |
" random_state=210)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "xVAAOVqi4N-_", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"X_train = train.drop('class', 1)\n", | |
"X_val = val.drop('class', 1)\n", | |
"X_test = test.drop('class', 1)\n", | |
"\n", | |
"y_train = (train['class'] == 'pos').astype(int)\n", | |
"y_val = (val['class'] == 'pos').astype(int)\n", | |
"y_test = (test['class'] == 'pos').astype(int)\n", | |
"\n", | |
"dtrain = xgb.DMatrix(X_train, label=y_train)\n", | |
"dval = xgb.DMatrix(X_val, label=y_val)\n", | |
"dtest = xgb.DMatrix(X_test, label=y_test)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Iiq9Xvt34N_A", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Optimization" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Wf3Vp7RT4N_B", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"This is the custom metric used in the original competition. The metric penalizes false negatives 50 times more than false positives.\n", | |
"\n", | |
" Cost-metric of miss-classification:\n", | |
" Predicted class \tTrue class \t \n", | |
" \tpos \tneg\n", | |
" pos \t- \tCost_1\n", | |
" neg \tCost_2 \t-\n", | |
"\n", | |
" Cost_1 = 10 and cost_2 = 500\n", | |
"\n", | |
" The total cost of a prediction model the sum of ‘Cost_1’ multiplied by the number of\n", | |
" Instances with type 1 failure and ‘Cost_2’ with the number of instances with\n", | |
" type 2 failure, resulting in a ‘Total_cost’.\n", | |
"\n", | |
" In this case Cost_1 refers to the cost that an unnessecary check needs to be done by \n", | |
" an mechanic at an workshop, while Cost_2 refer to the cost of missing a faulty truck, \n", | |
" which may cause a breakdown.\n", | |
"\n", | |
" Total_cost = Cost_1No_Instances + Cost_2No_Instances.\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "eK4NeRgm4N_C", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"def custom_loss_function(y_pred, dmtrx):\n", | |
" # Count 500 * False Negatives + 10 * False Positives.\n", | |
" #\n", | |
" # Minimize this value by choosing the classification threshold that\n", | |
" # minimizes this value.\n", | |
" y_true = dmtrx.get_label()\n", | |
" y_true = y_true[np.argsort(y_pred)]\n", | |
" fp = np.arange(len(y_true)) + 1 - np.cumsum(np.flip(y_true))\n", | |
" fn = np.flip(np.cumsum(y_true))\n", | |
" \n", | |
" return np.min(500 * fn + 10 * fp)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
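{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a quick sanity check of the metric, the cell below runs `custom_loss_function` on a tiny, made-up example (the numbers are illustrative and have nothing to do with the dataset). With predictions that rank all positives above all negatives, there is a threshold with no false positives and no false negatives, so the minimum cost should be 0." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": {}, | |
"source": [ | |
"# Illustrative sanity check with tiny made-up data, not taken from the APS dataset.\n", | |
"toy_labels = np.array([0.0, 0.0, 1.0, 1.0])\n", | |
"toy_preds = np.array([0.1, 0.2, 0.8, 0.9])  # ranks both positives above both negatives\n", | |
"toy_dmatrix = xgb.DMatrix(np.zeros((4, 1)), label=toy_labels)\n", | |
"\n", | |
"# A perfect ranking should give a minimum cost of 0.\n", | |
"custom_loss_function(toy_preds, toy_dmatrix)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |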
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "dPgjG-NH4N_E", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"This metric is somewhat volatile to single false negatives and therefore perhaps not the best metric to use in optimization. We will use AUC instead, since that is also about the ranking of predictions.\n", | |
"\n", | |
"Create the XGBTrainer class for convenience. The idea is to save any common things for all our training as intance variables." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "4vXpCTBxVyFn", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"class XGBTrainer():\n", | |
" def __init__(self, dtrain, dvalidation):\n", | |
" self.dtrain = dtrain\n", | |
" self.dvalidation = dvalidation\n", | |
" self.evallist = [(self.dtrain, 'train'), \n", | |
" (self.dvalidation, 'validation')]\n", | |
" \n", | |
" self.common_params = {'objective': 'binary:logistic',\n", | |
" 'eval_metric': 'auc'}\n", | |
" \n", | |
" # Use GPU training is available\n", | |
" if torch.cuda.is_available():\n", | |
" gpu_params = {'tree_method': 'gpu_hist', 'gpu_id': 0}\n", | |
" self.common_params.update(gpu_params)\n", | |
"\n", | |
" def train_model(self, hyperparams, nround, verbose=False):\n", | |
" # Train xgboost model for n rounds.\n", | |
" \n", | |
" # Combine hyperparams with the common parameters.\n", | |
" params = {**self.common_params, **hyperparams}\n", | |
"\n", | |
" return xgb.train(params, self.dtrain, nround, verbose_eval=verbose)\n", | |
"\n", | |
" def train_early_stop(self, hyperparams, verbose=False):\n", | |
" # Train xgboost model with early stopping.\n", | |
" \n", | |
" # Combine hyperparams with the common parameters.\n", | |
" params = {**self.common_params, **hyperparams}\n", | |
"\n", | |
" # Vary early_stopping_rounds rounds depending on the learning rate.\n", | |
" # But clip it between 10 and 100.\n", | |
" stop_rounds = np.round(np.clip(3 / params['eta'], 10, 100))\n", | |
" \n", | |
" bst = xgb.train(params, self.dtrain, num_boost_round=1000, \n", | |
" evals=self.evallist, verbose_eval=verbose,\n", | |
" early_stopping_rounds=stop_rounds)\n", | |
" return bst\n", | |
"\n", | |
" def train_evaluate(self, hyperparams, verbose=False):\n", | |
" bst = self.train_early_stop(hyperparams, verbose=verbose)\n", | |
" preds = bst.predict(self.dvalidation, ntree_limit=bst.best_ntree_limit)\n", | |
" return roc_auc_score(self.dvalidation.get_label(), preds)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "wcevHmOofSUu", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### XGBoost Hyperparameter Introduction\n", | |
"\n", | |
"The most important hyperparameters for the XGBoost algorithm are the following, in a rough order of importance.\n", | |
"\n", | |
"- `num_boost_round`, how many boosting iterations before stopping the training. We use early stopping to find a good value instead of searching.\n", | |
"- `eta`, the learning rate. How much we adjust our predictions in each step.\n", | |
"- `max_depth`, maximum depth of a single tree in the ensemble.\n", | |
"- `gamma`, a regularization parameter that controls the minimum gain required to make a further split.\n", | |
"- `min_child_weight`, the minimum number of instances in a leaf.\n", | |
"- `subsample`, the portion of training sample to use for each tree.\n", | |
"- `colsample_bytree`, is the same as subsample for columns.\n" | |
] | |
}, | |
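{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"For orientation, here is how these hyperparameters plug into the parameter dictionary passed to `xgb.train`. The values below are illustrative placeholders only, not tuned ones; the optimization in the next section searches for good values." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": {}, | |
"source": [ | |
"# Illustrative only: untuned placeholder values for the hyperparameters listed above.\n", | |
"example_params = {\n", | |
"    'objective': 'binary:logistic',\n", | |
"    'eval_metric': 'auc',\n", | |
"    'eta': 0.1,               # learning rate\n", | |
"    'max_depth': 6,           # maximum depth of a single tree\n", | |
"    'gamma': 1.0,             # minimum gain required for a further split\n", | |
"    'min_child_weight': 1,    # minimum number of instances in a leaf\n", | |
"    'subsample': 0.8,         # fraction of rows sampled per tree\n", | |
"    'colsample_bytree': 0.8,  # fraction of columns sampled per tree\n", | |
"}\n", | |
"\n", | |
"# num_boost_round is passed separately; it is kept tiny here just to illustrate the call.\n", | |
"example_model = xgb.train(example_params, dtrain, num_boost_round=5)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |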
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "oab8zYOU4N_P", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### The Gaussian Process\n", | |
"\n", | |
"Run the optimization for the amount of iterations set by `search_trials` in the beginning of the notebook. We will enter both `eta`, the learning rate, and `gamma`, a regularization parameter, in logarithmic scale. The other hyperparameters to explore are the maximum depth of a single tree, and the number of rounds to run XGBoost. Remember to specify `minimize=True` if you use a minimizing metric.\n", | |
"\n", | |
"The commented out hyperparameters are less important to the result. You may try what happens if you uncomment them. Bear in mind though that it is far harder to find the optimum in a 6D space than in a 3D space.\n", | |
"\n", | |
"This cell takes c.15 minutes to run with a K80 GPU." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "jP5gwa9v4N_R", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"eta_bounds = [1e-3, 1.]\n", | |
"gamma_bounds = [1e-2, 100.]\n", | |
"max_depth_bounds = [2, 12]\n", | |
"\n", | |
"hyperparameter_space = [\n", | |
" {\"name\": \"eta\", \"type\": \"range\", \"bounds\": eta_bounds, \"log_scale\": True},\n", | |
" {\"name\": \"gamma\", \"type\": \"range\", \"bounds\": gamma_bounds, \"log_scale\": True},\n", | |
" {\"name\": \"max_depth\", \"type\": \"range\", \"bounds\": max_depth_bounds},\n", | |
" #{\"name\": \"min_child_weight\", \"type\": \"range\", \"bounds\": [1, 10]},\n", | |
" #{\"name\": \"subsample\", \"type\": \"range\", \"bounds\": [.4, 1.]},\n", | |
" #{\"name\": \"colsample_bytree\", \"type\": \"range\", \"bounds\": [.4, 1.]},\n", | |
"]\n", | |
"\n", | |
"xgb_trainer = XGBTrainer(dtrain, dval)\n", | |
"\n", | |
"parameters, values, experiment, model = optimize(\n", | |
" minimize=False, \n", | |
" total_trials=search_trials,\n", | |
" parameters=hyperparameter_space,\n", | |
" evaluation_function=xgb_trainer.train_evaluate,\n", | |
" objective_name='score',\n", | |
")" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "1BjjjWRK4N_U", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Evaluation\n", | |
"Look at the results." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "UzYLm6Gg4N_X", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"data = experiment.fetch_data().df.rename(columns={\"mean\": \"score\"})\n", | |
"params = pd.DataFrame([t.arm.parameters for t in experiment.trials.values()])\n", | |
"pd.concat([params, data['score']], axis=1)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "MSTU24uv4N_a", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Pick out the parameters of the best iteration and train the whole training dataset with those parameters. \n", | |
"\n", | |
"Here, `best_model` in the model trained on the \"train\" with the best hyperparameters from the optimization process, and `final_model` is trained on all of the data we have available for training, meaning combined \"train\" and \"val\" sets, and using the number of training rounds, from `best_model`'s early stopping." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Fiuf4wMj4N_b", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"X_complete_train = X_train.append(X_val)\n", | |
"y_complete_train = y_train.append(y_val)\n", | |
"d_all = xgb.DMatrix(X_complete_train, label=y_complete_train)\n", | |
"\n", | |
"# Change idxmax to idxmin if you use a minimizing metric.\n", | |
"best_arm_name = data.arm_name[data.score.idxmax()]\n", | |
"best_arm = experiment.arms_by_name[best_arm_name]\n", | |
"\n", | |
"best_model = xgb_trainer.train_early_stop(best_arm.parameters)\n", | |
"\n", | |
"final_trainer = XGBTrainer(d_all, dtest)\n", | |
"final_model = final_trainer.train_model(best_arm.parameters, \n", | |
" best_model.best_ntree_limit)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "EJHr1SztYhWp", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Calculate predictions for the test set using the best model. You may try if `best_model` actually yields better results of if it is actually better to train one final time using all the training data, like was done in `final_model`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "82lRjr8BXaZN", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"final_preds = best_model.predict(dtest)\n", | |
"# final_preds = final_model.predict(dtest)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "2iK8LMReYgim", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Check test set performance." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Nzu6sK2KXeVQ", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"test_score = custom_loss_function(final_preds, dtest)\n", | |
"test_logloss = log_loss(y_test, final_preds)\n", | |
"test_auc = roc_auc_score(y_test, final_preds)\n", | |
"\n", | |
"print(f\"Logloss (test set): {test_logloss:.5f} \")\n", | |
"print(f\"AUC (test set)): {test_auc:.4f} \")\n", | |
"print(f\"Custom score (test set): {test_score} \")" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "7AB-f-vD4N_f", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"best_arm.parameters" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "XuGSxoaHu3Vu" | |
}, | |
"source": [ | |
"Let's see how this compares to Random Search." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Xs0BHnur4N_h", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Random Search" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "AsIPS2LC4N_i", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Initiate sets of random hyperparameters using the same bounds an in the Bayesian case. Take into account the logarithmic scale for both eta and gamma." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "og4IN1yB4N_i", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"rdf = pd.DataFrame({\n", | |
" 'eta': 10 ** np.random.uniform(np.log10(eta_bounds[0]), \n", | |
" np.log10(eta_bounds[1]), \n", | |
" search_trials),\n", | |
" 'gamma': 10 ** np.random.uniform(np.log10(gamma_bounds[0]), \n", | |
" np.log10(gamma_bounds[1]), \n", | |
" search_trials),\n", | |
" 'max_depth': np.random.randint(max_depth_bounds[0], \n", | |
" max_depth_bounds[1], \n", | |
" search_trials)\n", | |
"})" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "sSw5ZLx06SX-", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"\n", | |
"In this cell we use a list comprehension to train as many XGBoost models with Random Search as we did with Bayesian optimization above. This takes about 15 minutes to run with a K80." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "8OExq2X84N_k", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"rdf['score'] = [ xgb_trainer.train_evaluate(p) for p in rdf.to_dict('records') ]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Nc2IfyAx4N_o", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Comparison" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "T4CM050V4N_p", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Plot the best result against the amount of trials for both datasets. Higher is better." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "gmMJeUcf4N_p", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"data['rs_score'] = rdf['score']" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "JViVfjbj4N_s", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"%matplotlib inline\n", | |
"data[['score', 'rs_score']].cummax().plot(ylim=[.995, .9975 ])" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "aqfcwrr14N_v", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Also look at the result by trial. Notice how Bayesian Optimization (left) searches more promising parts of the hyperparameter space after it learns more about problem, while Random Search continues to be... random." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "PY6A8bRw4N_w", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"data[['score', 'rs_score']].plot(subplots=True, layout=(1, 2), \n", | |
" sharey=True, figsize=(12, 4))" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "40nKp9_k0pTe", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Further Opportunities\n", | |
"A few ideas about how to improve the process:\n", | |
"\n", | |
"- The boundaries of the hyperparameter search space might be too close together, or shifted too much in some direction. \n", | |
"- Try a different optimization metric instead of AUC (Logloss, AUPRC).\n", | |
"- Evaluate more than one training process per hyperparameter combination to reduce noise in the optimization process.\n", | |
"- More trials leads to better convergence.\n", | |
"- Uncomment some more hyperparameters. My guess these have little effect but could be worth the try.\n", | |
"- Use cross-validatation instead of a validation set.\n" | |
] | |
} | |
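{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A minimal sketch of the cross-validation idea above, assuming `xgb.cv` and the objects defined earlier in the notebook; the fold count and early-stopping value are arbitrary choices, not tuned ones." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": {}, | |
"source": [ | |
"# Sketch: evaluate one hyperparameter combination with 5-fold cross-validation\n", | |
"# instead of the single held-out validation set used above.\n", | |
"def train_evaluate_cv(hyperparams, nfold=5):\n", | |
"    # Combine the searched hyperparameters with the common parameters\n", | |
"    # (these include eval_metric='auc', so xgb.cv reports 'test-auc-mean').\n", | |
"    params = {**xgb_trainer.common_params, **hyperparams}\n", | |
"    cv_results = xgb.cv(params, dtrain, num_boost_round=1000, nfold=nfold,\n", | |
"                        stratified=True, early_stopping_rounds=50, seed=210)\n", | |
"    # Return the best mean validation AUC across boosting rounds.\n", | |
"    return cv_results['test-auc-mean'].max()\n", | |
"\n", | |
"# This could replace xgb_trainer.train_evaluate as the evaluation_function in optimize(),\n", | |
"# at the cost of roughly nfold times more training per trial. For example:\n", | |
"# train_evaluate_cv(best_arm.parameters)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
} | |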
] | |
} |