Created
July 22, 2021 14:17
Binary Classification.ipynb
| { | |
| "nbformat": 4, | |
| "nbformat_minor": 0, | |
| "metadata": { | |
| "colab": { | |
| "name": "Binary Classification.ipynb", | |
| "private_outputs": true, | |
| "provenance": [], | |
| "collapsed_sections": [], | |
| "include_colab_link": true | |
| }, | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3" | |
| } | |
| }, | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "view-in-github", | |
| "colab_type": "text" | |
| }, | |
| "source": [ | |
| "<a href=\"https://colab.research.google.com/gist/johnleung8888/1a5a8eb872056033b03db2fe78d30dd9/binary-classification.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "wDlWLbfkJtvu", | |
| "cellView": "form" | |
| }, | |
| "source": [ | |
| "#@title Copyright 2020 Google LLC. Double-click here for license information.\n", | |
| "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", | |
| "# you may not use this file except in compliance with the License.\n", | |
| "# You may obtain a copy of the License at\n", | |
| "#\n", | |
| "# https://www.apache.org/licenses/LICENSE-2.0\n", | |
| "#\n", | |
| "# Unless required by applicable law or agreed to in writing, software\n", | |
| "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", | |
| "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", | |
| "# See the License for the specific language governing permissions and\n", | |
| "# limitations under the License." | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "TL5y5fY9Jy_x" | |
| }, | |
| "source": [ | |
| "# Binary Classification\n", | |
| "\n", | |
| "So far, you've only created regression models. That is, you created models that produced floating-point predictions, such as \"houses in this neighborhood cost N thousand dollars.\" In this Colab, you'll create and evaluate a binary [classification model](https://developers.google.com/machine-learning/glossary/#classification_model). That is, you'll create a model that answers a binary question. In this exercise, the binary question will be, \"Are houses in this neighborhood above a certain price?\"\n", | |
| "\n", | |
| "\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "yuw8rRl9lNuL" | |
| }, | |
| "source": [ | |
| "## Learning Objectives:\n", | |
| "\n", | |
| "After doing this Colab, you'll know how to:\n", | |
| "\n", | |
| " * Convert a regression question into a classification question.\n", | |
| " * Modify the classification threshold and determine how that modification influences the model.\n", | |
| " * Experiment with different classification metrics to determine your model's effectiveness." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "44OdC-OglN9D" | |
| }, | |
| "source": [ | |
| "## The Dataset\n", | |
| " \n", | |
| "Like several of the previous Colabs, this Colab uses the [California Housing Dataset](https://developers.google.com/machine-learning/crash-course/california-housing-data-description)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "xchnxAsaKKqO" | |
| }, | |
| "source": [ | |
| "## Use the right version of TensorFlow\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "BDWhxaj2OMSv" | |
| }, | |
| "source": [ | |
| "The following hidden code cell ensures that the Colab will run on TensorFlow 2.X." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "qBpGi_GD14-p" | |
| }, | |
| "source": [ | |
| "#@title Run on TensorFlow 2.x\n", | |
| "%tensorflow_version 2.x" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "5iuw6-JOGf7I" | |
| }, | |
| "source": [ | |
| "## Call the import statements\n", | |
| "\n", | |
| "The following code imports the necessary modules." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "9n9_cTveKmse" | |
| }, | |
| "source": [ | |
| "#@title Load the imports\n", | |
| "\n", | |
| "# from __future__ import absolute_import, division, print_function, unicode_literals\n", | |
| "\n", | |
| "import numpy as np\n", | |
| "import pandas as pd\n", | |
| "import tensorflow as tf\n", | |
| "from tensorflow.keras import layers\n", | |
| "from matplotlib import pyplot as plt\n", | |
| "\n", | |
| "# The following lines adjust the granularity of reporting.\n", | |
| "pd.options.display.max_rows = 10\n", | |
| "pd.options.display.float_format = \"{:.1f}\".format\n", | |
| "# tf.keras.backend.set_floatx('float32')\n", | |
| "\n", | |
| "print(\"Ran the import statements.\")" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "X_TaJhU4KcuY" | |
| }, | |
| "source": [ | |
| "## Load the datasets from the internet\n", | |
| "\n", | |
| "The following code cell loads the separate .csv files and creates the following two pandas DataFrames:\n", | |
| "\n", | |
| "* `train_df`, which contains the training set\n", | |
| "* `test_df`, which contains the test set" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "JZlvdpyYKx7V" | |
| }, | |
| "source": [ | |
| "train_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\")\n", | |
| "test_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv\")\n", | |
| "train_df = train_df.reindex(np.random.permutation(train_df.index)) # shuffle the training set" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "q_vuAQq0Cvrp" | |
| }, | |
| "source": [ | |
| "Unlike some of the previous Colabs, the preceding code cell did not scale the label (`median_house_value`). The following section (\"Normalize values\") provides an alternative approach." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "_G6y-XcEmk6r" | |
| }, | |
| "source": [ | |
| "## Normalize values\n", | |
| "\n", | |
| "When creating a model with multiple features, the values of each feature should cover roughly the same range. For example, if one feature's range spans 500 to 100,000 and another feature's range spans 2 to 12, then the model will be difficult or impossible to train. Therefore, you should \n", | |
| "[normalize](https://developers.google.com/machine-learning/glossary/#normalization) features in a multi-feature model. \n", | |
| "\n", | |
| "The following code cell normalizes datasets by converting each raw value (including the label) to its Z-score. A **Z-score** is the number of standard deviations from the mean for a particular raw value. For example, consider a feature having the following characteristics:\n", | |
| "\n", | |
| " * The mean is 60.\n", | |
| " * The standard deviation is 10.\n", | |
| "\n", | |
| "The raw value 75 would have a Z-score of +1.5:\n", | |
| "\n", | |
| "```\n", | |
| " Z-score = (75 - 60) / 10 = +1.5\n", | |
| "```\n", | |
| "\n", | |
| "The raw value 38 would have a Z-score of -2.2:\n", | |
| "\n", | |
| "```\n", | |
| " Z-score = (38 - 60) / 10 = -2.2\n", | |
| "```" | |
| ] | |
| }, | |
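| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "As a quick check of the arithmetic above, the following cell is a minimal sketch that recomputes both Z-scores in plain Python. The helper name `z_score` is an illustrative assumption; it is not used elsewhere in this Colab." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "def z_score(raw_value, mean, std):\n", | |
| "  \"\"\"Return how many standard deviations raw_value lies from mean.\"\"\"\n", | |
| "  return (raw_value - mean) / std\n", | |
| "\n", | |
| "# Example feature from the text above: mean 60, standard deviation 10.\n", | |
| "print(z_score(75, 60, 10))  # 1.5\n", | |
| "print(z_score(38, 60, 10))  # -2.2" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |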
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "n7nuAHoZIgVI" | |
| }, | |
| "source": [ | |
| "# Calculate the Z-scores of each column in the training set and\n", | |
| "# write those Z-scores into a new pandas DataFrame named train_df_norm.\n", | |
| "train_df_mean = train_df.mean()\n", | |
| "train_df_std = train_df.std()\n", | |
| "train_df_norm = (train_df - train_df_mean)/train_df_std\n", | |
| "\n", | |
| "# Examine some of the values of the normalized training set. Notice that most \n", | |
| "# Z-scores fall between -2 and +2.\n", | |
| "train_df_norm.head()" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "QoW-59jVFF2I" | |
| }, | |
| "source": [ | |
| "# Calculate the Z-scores of each column in the test set and\n", | |
| "# write those Z-scores into a new pandas DataFrame named test_df_norm.\n", | |
| "# Use the training set's mean and standard deviation so that both\n", | |
| "# sets are transformed with exactly the same values.\n", | |
| "test_df_norm = (test_df - train_df_mean)/train_df_std" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "-swmXtWnZGis" | |
| }, | |
| "source": [ | |
| "## Task 1: Create a binary label\n", | |
| "\n", | |
| "In binary classification problems, the label for every example must be either 0 or 1. Unfortunately, the natural label in the California Housing Dataset, `median_house_value`, contains floating-point values like 80,100 or 85,700 rather than 0s and 1s, while the normalized version of `median_house_value` contains floating-point values primarily between -3 and +3.\n", | |
| "\n", | |
| "Your task is to create a new column named `median_house_value_is_high` in both the training set and the test set. If the `median_house_value` is higher than a certain arbitrary value (defined by `threshold`), then set `median_house_value_is_high` to 1. Otherwise, set `median_house_value_is_high` to 0.\n", | |
| "\n", | |
| "**Hint:** Each cell in the `median_house_value_is_high` column must hold `1` or `0`, not `True` or `False`. To convert `True` and `False` to `1` and `0`, call the pandas method `astype(float)`." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "d4kWfWA8bhKW" | |
| }, | |
| "source": [ | |
| "threshold = 265000 # This is the 75th percentile for median house values.\n", | |
| "train_df_norm[\"median_house_value_is_high\"] = (train_df[\"median_house_value\"] > threshold).astype(float)\n", | |
| "test_df_norm[\"median_house_value_is_high\"] = (test_df[\"median_house_value\"] > threshold).astype(float)\n", | |
| "\n", | |
| "# Print the first 80 cells of the training set (pandas\n", | |
| "# abbreviates the display), just to make sure that\n", | |
| "# your code created only 0s and 1s in the newly created\n", | |
| "# median_house_value_is_high column.\n", | |
| "train_df_norm[\"median_house_value_is_high\"].head(80)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "b8b2fNmHO-iU" | |
| }, | |
| "source": [ | |
| "#@title Double-click for possible solutions.\n", | |
| "\n", | |
| "# We arbitrarily set the threshold to 265,000, which is \n", | |
| "# the 75th percentile for median house values. Every neighborhood\n", | |
| "# with a median house price above 265,000 will be labeled 1, \n", | |
| "# and all other neighborhoods will be labeled 0.\n", | |
| "threshold = 265000\n", | |
| "train_df_norm[\"median_house_value_is_high\"] = (train_df[\"median_house_value\"] > threshold).astype(float)\n", | |
| "test_df_norm[\"median_house_value_is_high\"] = (test_df[\"median_house_value\"] > threshold).astype(float) \n", | |
| "train_df_norm[\"median_house_value_is_high\"].head(8000)\n", | |
| "\n", | |
| "\n", | |
| "# Alternatively, instead of picking the threshold\n", | |
| "# based on raw house values, you can work with Z-scores.\n", | |
| "# For example, the following possible solution uses a Z-score\n", | |
| "# of +1.0 as the threshold, meaning that roughly 16% of the\n", | |
| "# values in median_house_value_is_high would be labeled 1 if\n", | |
| "# house values were approximately normally distributed.\n", | |
| "\n", | |
| "# threshold_in_Z = 1.0 \n", | |
| "# train_df_norm[\"median_house_value_is_high\"] = (train_df_norm[\"median_house_value\"] > threshold_in_Z).astype(float)\n", | |
| "# test_df_norm[\"median_house_value_is_high\"] = (test_df_norm[\"median_house_value\"] > threshold_in_Z).astype(float) \n" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "8kir8UTUXSV8" | |
| }, | |
| "source": [ | |
| "## Represent features in feature columns\n", | |
| "\n", | |
| "This code cell specifies the features that you'll ultimately train the model on and how each of those features will be represented. The transformations (collected in `feature_layer`) don't actually get applied until you pass a DataFrame to it, which will happen when we train the model. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "3tmmZIDw4JEC" | |
| }, | |
| "source": [ | |
| "# Create an empty list that will eventually hold all created feature columns.\n", | |
| "feature_columns = []\n", | |
| "\n", | |
| "# Create a numerical feature column to represent median_income.\n", | |
| "median_income = tf.feature_column.numeric_column(\"median_income\")\n", | |
| "feature_columns.append(median_income)\n", | |
| "\n", | |
| "# Create a numerical feature column to represent total_rooms.\n", | |
| "tr = tf.feature_column.numeric_column(\"total_rooms\")\n", | |
| "feature_columns.append(tr)\n", | |
| "\n", | |
| "# Convert the list of feature columns into a layer that will later be fed into\n", | |
| "# the model. \n", | |
| "feature_layer = layers.DenseFeatures(feature_columns)\n", | |
| "\n", | |
| "# Print the first 3 and last 3 rows of the feature_layer's output when applied\n", | |
| "# to train_df_norm:\n", | |
| "feature_layer(dict(train_df_norm))" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "3014ezH3C7jT" | |
| }, | |
| "source": [ | |
| "## Define functions that build and train a model\n", | |
| "\n", | |
| "The following code cell defines two functions:\n", | |
| "\n", | |
| " * `create_model(my_learning_rate, feature_layer, my_metrics)`, which defines the model's\n", | |
| " topology.\n", | |
| " * `train_model(model, dataset, epochs, label_name, batch_size, shuffle)`, which uses input features and labels to train the model.\n", | |
| "\n", | |
| "Prior exercises used [ReLU](https://developers.google.com/machine-learning/glossary#ReLU) as the [activation function](https://developers.google.com/machine-learning/glossary#activation_function). By contrast, this exercise uses [sigmoid](https://developers.google.com/machine-learning/glossary#sigmoid_function) as the activation function. " | |
| ] | |
| }, | |
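| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "To make the role of sigmoid concrete, the following cell is a small illustrative sketch and is not part of the model code. It applies the standard formula 1 / (1 + e^-x) to a few hypothetical raw outputs and compares each result with an assumed classification threshold of 0.5. The names `sigmoid`, `raw_outputs`, and `assumed_threshold` are assumptions made for this illustration." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "import numpy as np\n", | |
| "\n", | |
| "def sigmoid(x):\n", | |
| "  \"\"\"Map any real value to the open interval (0, 1).\"\"\"\n", | |
| "  return 1.0 / (1.0 + np.exp(-x))\n", | |
| "\n", | |
| "# Hypothetical pre-activation values, chosen only for illustration.\n", | |
| "raw_outputs = np.array([-2.0, 0.0, 2.0])\n", | |
| "probabilities = sigmoid(raw_outputs)\n", | |
| "\n", | |
| "# Assumed threshold: probabilities above it are classified as 1.\n", | |
| "assumed_threshold = 0.5\n", | |
| "predictions = (probabilities > assumed_threshold).astype(float)\n", | |
| "\n", | |
| "print(probabilities)\n", | |
| "print(predictions)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |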
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "pedD5GhlDC-y" | |
| }, | |
| "source": [ | |
| "#@title Define the functions that create and train a model.\n", | |
| "def create_model(my_learning_rate, feature_layer, my_metrics):\n", | |
| " \"\"\"Create and compile a simple classification model.\"\"\"\n", | |
| " # Most simple tf.keras models are sequential.\n", | |
| " model = tf.keras.models.Sequential()\n", | |
| "\n", | |
| " # Add the feature layer (the list of features and how they are represented)\n", | |
| " # to the model.\n", | |
| " model.add(feature_layer)\n", | |
| "\n", | |
| " # Funnel the regression value through a sigmoid function.\n", | |
| " model.add(tf.keras.layers.Dense(units=1, activation=tf.sigmoid))\n", | |
| "\n", | |
| " # Call the compile method to construct the layers into a model that\n", | |
| " # TensorFlow can execute. Notice that we're using a different loss\n", | |
| " # function for classification than for regression. \n", | |
| " model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),\n", | |
| " loss=tf.keras.losses.BinaryCrossentropy(),\n", | |
| " metrics=my_metrics)\n", | |
| "\n", | |
| " return model \n", | |
| "\n", | |
| "\n", | |
| "def train_model(model, dataset, epochs, label_name,\n", | |
| " batch_size=None, shuffle=True):\n", | |
| " \"\"\"Feed a dataset into the model in order to train it.\"\"\"\n", | |
| "\n", | |
| " # The x parameter of tf.keras.Model.fit can be a list of arrays, where\n", | |
| " # each array contains the data for one feature. Here, we're passing\n", | |
| " # every column in the dataset. Note that the feature_layer will filter\n", | |
| " # away most of those columns, leaving only the desired columns and their\n", | |
| " # representations as features.\n", | |
| " features = {name:np.array(value) for name, value in dataset.items()}\n", | |
| " label = np.array(features.pop(label_name)) \n", | |
| " history = model.fit(x=features, y=label, batch_size=batch_size,\n", | |
| " epochs=epochs, shuffle=shuffle)\n", | |
| " \n", | |
| " # The list of epochs is stored separately from the rest of history.\n", | |
| " epochs = history.epoch\n", | |
| "\n", | |
| " # Isolate the classification metric for each epoch.\n", | |
| " hist = pd.DataFrame(history.history)\n", | |
| "\n", | |
| " return epochs, hist \n", | |
| "\n", | |
| "print(\"Defined the create_model and train_model functions.\") " | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "Ak_TMAzGOIFq" | |
| }, | |
| "source": [ | |
| "## Define a plotting function\n", | |
| "\n", | |
| "The following [matplotlib](https://developers.google.com/machine-learning/glossary/#matplotlib) function plots one or more curves, showing how various classification metrics change with each epoch." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "QF0BFRXTOeR3" | |
| }, | |
| "source": [ | |
| "#@title Define the plotting function.\n", | |
| "def plot_curve(epochs, hist, list_of_metrics):\n", | |
| " \"\"\"Plot a curve of one or more classification metrics vs. epoch.\"\"\" \n", | |
| " # Each entry in list_of_metrics should be one of the metric names shown in:\n", | |
| " # https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#define_the_model_and_metrics \n", | |
| "\n", | |
| " plt.figure()\n", | |
| " plt.xlabel(\"Epoch\")\n", | |
| " plt.ylabel(\"Value\")\n", | |
| "\n", | |
| " for m in list_of_metrics:\n", | |
| " x = hist[m]\n", | |
| " plt.plot(epochs[1:], x[1:], label=m)\n", | |
| "\n", | |
| " plt.legend()\n", | |
| "\n", | |
| "print(\"Defined the plot_curve function.\")" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "D-IXYVfvM4gD" | |
| }, | |
| "source": [ | |
| "## Invoke the creating, training, and plotting functions\n", | |
| "\n", | |
| "The following code cell specifies the hyperparameters, and then invokes the \n", | |
| "functions to create and train the model, and then to plot the results." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "nj3v5EKQFY8s", | |
| "cellView": "both" | |
| }, | |
| "source": [ | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.001\n", | |
| "epochs = 20\n", | |
| "batch_size = 100\n", | |
| "label_name = \"median_house_value_is_high\"\n", | |
| "classification_threshold = 0.35\n", | |
| "\n", | |
| "# Establish the metrics the model will measure.\n", | |
| "METRICS = [\n", | |
| " tf.keras.metrics.BinaryAccuracy(name='accuracy', \n", | |
| " threshold=classification_threshold),\n", | |
| " ]\n", | |
| "\n", | |
| "# Establish the model's topology.\n", | |
| "my_model = create_model(learning_rate, feature_layer, METRICS)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, hist = train_model(my_model, train_df_norm, epochs, \n", | |
| " label_name, batch_size)\n", | |
| "\n", | |
| "# Plot a graph of the metric(s) vs. epochs.\n", | |
| "list_of_metrics_to_plot = ['accuracy'] \n", | |
| "\n", | |
| "plot_curve(epochs, hist, list_of_metrics_to_plot)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "FF64TpqkbOpn" | |
| }, | |
| "source": [ | |
| "Accuracy should gradually improve during training (until it can \n", | |
| "improve no more)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "1xNqWWos_zyk" | |
| }, | |
| "source": [ | |
| "## Evaluate the model against the test set\n", | |
| "\n", | |
| "At the end of model training, you ended up with a certain accuracy against the *training set*. Invoke the following code cell to determine your model's accuracy against the *test set*." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "eJorkMlDmtHf" | |
| }, | |
| "source": [ | |
| "features = {name:np.array(value) for name, value in test_df_norm.items()}\n", | |
| "label = np.array(features.pop(label_name))\n", | |
| "\n", | |
| "my_model.evaluate(x = features, y = label, batch_size=batch_size)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "q7cHkFXalXV5" | |
| }, | |
| "source": [ | |
| "## Task 2: How accurate is your model really?\n", | |
| "\n", | |
| "Is your model valuable?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "rUvCrQkulwjV" | |
| }, | |
| "source": [ | |
| "#@title Double-click for a possible answer to Task 2.\n", | |
| "\n", | |
| "# A perfect model would make 100% accurate predictions.\n", | |
| "# Our model makes 80% accurate predictions. 80% sounds\n", | |
| "# good, but note that a model that always guesses \n", | |
| "# \"median_house_value_is_high is False\" would be 75% \n", | |
| "# accurate. " | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
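| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The 75% baseline mentioned in the answer above can be reproduced with a short sketch. The labels below are hypothetical (one positive in every four examples, mirroring a 75th-percentile threshold) rather than the real dataset. On the real data, the same idea would presumably apply to the `median_house_value_is_high` column." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "import numpy as np\n", | |
| "\n", | |
| "# Hypothetical labels: 25% positive (1.0), 75% negative (0.0).\n", | |
| "hypothetical_labels = np.array([0.0, 0.0, 0.0, 1.0] * 25)\n", | |
| "\n", | |
| "# A model that always predicts 0 is correct whenever the true label is 0.\n", | |
| "always_zero_accuracy = np.mean(hypothetical_labels == 0.0)\n", | |
| "print(always_zero_accuracy)  # 0.75" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |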
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "C8crSCCVf6gm" | |
| }, | |
| "source": [ | |
| "## Task 3: Add precision and recall as metrics\n", | |
| "\n", | |
| "Relying solely on accuracy, particularly for a class-imbalanced data set (like ours), can be a poor way to judge a classification model. Modify the code in the following code cell to enable the model to measure not only accuracy but also precision and recall. We have\n", | |
| "added accuracy and precision; your task is to add recall. See the [TensorFlow Reference](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Recall) for details.\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "r-k1MD2XArmO" | |
| }, | |
| "source": [ | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.001\n", | |
| "epochs = 20\n", | |
| "batch_size = 100\n", | |
| "classification_threshold = 0.35\n", | |
| "label_name = \"median_house_value_is_high\"\n", | |
| "\n", | |
| "# Modify the following definition of METRICS to generate\n", | |
| "# not only accuracy and precision, but also recall:\n", | |
| "METRICS = [\n", | |
| " tf.keras.metrics.BinaryAccuracy(name='accuracy', \n", | |
| " threshold=classification_threshold),\n", | |
| " tf.keras.metrics.Precision(thresholds=classification_threshold,\n", | |
| " name='precision' \n", | |
| " ),\n", | |
| " tf.keras.metrics.Recall(thresholds=classification_threshold, name='recall')\n", | |
| "]\n", | |
| "\n", | |
| "# Establish the model's topology.\n", | |
| "my_model = create_model(learning_rate, feature_layer, METRICS)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, hist = train_model(my_model, train_df_norm, epochs, \n", | |
| " label_name, batch_size)\n", | |
| "\n", | |
| "# Plot metrics vs. epochs\n", | |
| "list_of_metrics_to_plot = ['accuracy', 'precision', 'recall'] \n", | |
| "plot_curve(epochs, hist, list_of_metrics_to_plot)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "Ax87gOyDBhAu" | |
| }, | |
| "source": [ | |
| "#@title Double-click to view the solution for Task 3.\n", | |
| "\n", | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.001\n", | |
| "epochs = 20\n", | |
| "batch_size = 100\n", | |
| "classification_threshold = 0.35\n", | |
| "label_name = \"median_house_value_is_high\"\n", | |
| "\n", | |
| "# Here is the updated definition of METRICS:\n", | |
| "METRICS = [\n", | |
| " tf.keras.metrics.BinaryAccuracy(name='accuracy', \n", | |
| " threshold=classification_threshold),\n", | |
| " tf.keras.metrics.Precision(thresholds=classification_threshold,\n", | |
| " name='precision' \n", | |
| " ),\n", | |
| " tf.keras.metrics.Recall(thresholds=classification_threshold,\n", | |
| " name=\"recall\"),\n", | |
| "]\n", | |
| "\n", | |
| "# Establish the model's topology.\n", | |
| "my_model = create_model(learning_rate, feature_layer, METRICS)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, hist = train_model(my_model, train_df_norm, epochs, \n", | |
| " label_name, batch_size)\n", | |
| "\n", | |
| "# Plot metrics vs. epochs\n", | |
| "list_of_metrics_to_plot = ['accuracy', \"precision\", \"recall\"] \n", | |
| "plot_curve(epochs, hist, list_of_metrics_to_plot)\n", | |
| "\n", | |
| "\n", | |
| "# The new graphs suggest that precision and recall are \n", | |
| "# somewhat in conflict. That is, improvements to one of\n", | |
| "# those metrics may hurt the other metric." | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
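| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The tension between precision and recall noted in the solution above can be seen on a toy example. The following cell is a minimal sketch under stated assumptions: the labels and predicted probabilities are made up for illustration, and `precision_recall` is a hypothetical helper, not a TensorFlow API. It counts true positives, false positives, and false negatives at several thresholds so that you can watch one metric rise as the other falls." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "import numpy as np\n", | |
| "\n", | |
| "# Hypothetical labels and predicted probabilities, for illustration only.\n", | |
| "toy_labels = np.array([0, 0, 0, 0, 1, 0, 1, 1])\n", | |
| "toy_probs = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])\n", | |
| "\n", | |
| "def precision_recall(labels, probs, threshold):\n", | |
| "  \"\"\"Return (precision, recall) for predictions made at threshold.\"\"\"\n", | |
| "  predicted = probs > threshold\n", | |
| "  tp = np.sum(predicted & (labels == 1))   # true positives\n", | |
| "  fp = np.sum(predicted & (labels == 0))   # false positives\n", | |
| "  fn = np.sum(~predicted & (labels == 1))  # false negatives\n", | |
| "  precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0\n", | |
| "  recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0\n", | |
| "  return precision, recall\n", | |
| "\n", | |
| "for t in (0.35, 0.5, 0.65):\n", | |
| "  p, r = precision_recall(toy_labels, toy_probs, t)\n", | |
| "  print(f\"threshold={t}: precision={p:.2f}, recall={r:.2f}\")" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |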
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "cAsB85iKSXLe" | |
| }, | |
| "source": [ | |
| "## Task 4: Experiment with the classification threshold (if time permits)\n", | |
| "\n", | |
| "Experiment with different values for `classification_threshold` in the code cell within \"Invoke the creating, training, and plotting functions.\" What value of `classification_threshold` produces the highest accuracy?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "FLPDYI7Sphnj" | |
| }, | |
| "source": [ | |
| "#@title Double-click to view the solution for Task 4.\n", | |
| "\n", | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.001\n", | |
| "epochs = 20\n", | |
| "batch_size = 100\n", | |
| "classification_threshold = 0.52\n", | |
| "label_name = \"median_house_value_is_high\"\n", | |
| "\n", | |
| "# Here is the updated definition of METRICS:\n", | |
| "METRICS = [\n", | |
| " tf.keras.metrics.BinaryAccuracy(name='accuracy', \n", | |
| " threshold=classification_threshold),\n", | |
| " tf.keras.metrics.Precision(thresholds=classification_threshold,\n", | |
| " name='precision' \n", | |
| " ),\n", | |
| " tf.keras.metrics.Recall(thresholds=classification_threshold,\n", | |
| " name=\"recall\"),\n", | |
| "]\n", | |
| "\n", | |
| "# Establish the model's topology.\n", | |
| "my_model = create_model(learning_rate, feature_layer, METRICS)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, hist = train_model(my_model, train_df_norm, epochs, \n", | |
| " label_name, batch_size)\n", | |
| "\n", | |
| "# Plot metrics vs. epochs\n", | |
| "list_of_metrics_to_plot = ['accuracy', \"precision\", \"recall\"] \n", | |
| "plot_curve(epochs, hist, list_of_metrics_to_plot)\n", | |
| "\n", | |
| "# A `classification_threshold` of slightly over 0.5\n", | |
| "# appears to produce the highest accuracy (about 83%).\n", | |
| "# Raising the `classification_threshold` to 0.9 drops \n", | |
| "# accuracy by about 5%. Lowering the \n", | |
| "# `classification_threshold` to 0.3 drops accuracy by \n", | |
| "# about 3%. " | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "XBGRS0Ndduus" | |
| }, | |
| "source": [ | |
| "## Task 5: Summarize model performance (if time permits)\n", | |
| "\n", | |
| "If time permits, add one more metric that attempts to summarize the model's overall performance. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "vwNE6syoFvWe" | |
| }, | |
| "source": [ | |
| "#@title Double-click to view the solution for Task 5.\n", | |
| "\n", | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.001\n", | |
| "epochs = 20\n", | |
| "batch_size = 100\n", | |
| "label_name = \"median_house_value_is_high\"\n", | |
| "\n", | |
| "# AUC is a reasonable \"summary\" metric for \n", | |
| "# classification models.\n", | |
| "# Here is the updated definition of METRICS to \n", | |
| "# measure AUC:\n", | |
| "METRICS = [\n", | |
| " tf.keras.metrics.AUC(num_thresholds=100, name='auc'),\n", | |
| "]\n", | |
| "\n", | |
| "# Establish the model's topology.\n", | |
| "my_model = create_model(learning_rate, feature_layer, METRICS)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, hist = train_model(my_model, train_df_norm, epochs, \n", | |
| " label_name, batch_size)\n", | |
| "\n", | |
| "# Plot metrics vs. epochs\n", | |
| "list_of_metrics_to_plot = ['auc'] \n", | |
| "plot_curve(epochs, hist, list_of_metrics_to_plot)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| } | |
| ] | |
| } |