Representation with a Feature Cross.ipynb
| { | |
| "nbformat": 4, | |
| "nbformat_minor": 0, | |
| "metadata": { | |
| "colab": { | |
| "name": "Representation with a Feature Cross.ipynb", | |
| "private_outputs": true, | |
| "provenance": [], | |
| "collapsed_sections": [], | |
| "include_colab_link": true | |
| }, | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3" | |
| } | |
| }, | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "view-in-github", | |
| "colab_type": "text" | |
| }, | |
| "source": [ | |
| "<a href=\"https://colab.research.google.com/gist/johnleung8888/c56dfe83c7727f576ac2ed3c7c90c2c3/representation-with-a-feature-cross.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "wDlWLbfkJtvu", | |
| "cellView": "form" | |
| }, | |
| "source": [ | |
| "#@title Copyright 2020 Google LLC. Double-click for license information.\n", | |
| "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", | |
| "# you may not use this file except in compliance with the License.\n", | |
| "# You may obtain a copy of the License at\n", | |
| "#\n", | |
| "# https://www.apache.org/licenses/LICENSE-2.0\n", | |
| "#\n", | |
| "# Unless required by applicable law or agreed to in writing, software\n", | |
| "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", | |
| "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", | |
| "# See the License for the specific language governing permissions and\n", | |
| "# limitations under the License." | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "TL5y5fY9Jy_x" | |
| }, | |
| "source": [ | |
| "# Representation with a Feature Cross\n", | |
| "\n", | |
| "In this exercise, you'll experiment with different ways to represent features." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "RXWoPIezkzgI" | |
| }, | |
| "source": [ | |
| "## Learning Objectives:\n", | |
| "\n", | |
| "After doing this Colab, you'll know how to:\n", | |
| "\n", | |
| " * Use [`tf.feature_column`](https://www.tensorflow.org/api_docs/python/tf/feature_column) methods to represent features in different ways.\n", | |
| " * Represent features as [bins](https://developers.google.com/machine-learning/glossary/#bucketing). \n", | |
| " * Cross bins to create a [feature cross](https://developers.google.com/machine-learning/glossary/#feature_cross). " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "lH_g3Hsfkzzb" | |
| }, | |
| "source": [ | |
| "## The Dataset\n", | |
| " \n", | |
| "Like several of the previous Colabs, this exercise uses the [California Housing Dataset](https://developers.google.com/machine-learning/crash-course/california-housing-data-description)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "xchnxAsaKKqO" | |
| }, | |
| "source": [ | |
| "## Use the right version of TensorFlow\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "qBpGi_GD14-p" | |
| }, | |
| "source": [ | |
| "%tensorflow_version 2.x" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "5iuw6-JOGf7I" | |
| }, | |
| "source": [ | |
| "## Call the import statements\n", | |
| "\n", | |
| "The following code imports the necessary code to run the code in the rest of this Colaboratory." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "9n9_cTveKmse" | |
| }, | |
| "source": [ | |
| "#@title Load the imports\n", | |
| "\n", | |
| "# from __future__ import absolute_import, division, print_function, unicode_literals\n", | |
| "\n", | |
| "import numpy as np\n", | |
| "import pandas as pd\n", | |
| "import tensorflow as tf\n", | |
| "from tensorflow import feature_column\n", | |
| "from tensorflow.keras import layers\n", | |
| "\n", | |
| "from matplotlib import pyplot as plt\n", | |
| "\n", | |
| "# The following lines adjust the granularity of reporting.\n", | |
| "pd.options.display.max_rows = 10\n", | |
| "pd.options.display.float_format = \"{:.1f}\".format\n", | |
| "\n", | |
| "tf.keras.backend.set_floatx('float32')\n", | |
| "\n", | |
| "print(\"Imported the modules.\")" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "X_TaJhU4KcuY" | |
| }, | |
| "source": [ | |
| "## Load, scale, and shuffle the examples\n", | |
| "\n", | |
| "The following code cell loads the separate .csv files and creates the following two pandas DataFrames:\n", | |
| "\n", | |
| "* `train_df`, which contains the training set\n", | |
| "* `test_df`, which contains the test set\n", | |
| "\n", | |
| "The code cell then scales the `median_house_value` to a more human-friendly range and then shuffles the examples." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "JZlvdpyYKx7V" | |
| }, | |
| "source": [ | |
| "# Load the dataset\n", | |
| "train_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\")\n", | |
| "test_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv\")\n", | |
| "\n", | |
| "# Scale the labels\n", | |
| "scale_factor = 1000.0\n", | |
| "# Scale the training set's label.\n", | |
| "train_df[\"median_house_value\"] /= scale_factor \n", | |
| "\n", | |
| "# Scale the test set's label\n", | |
| "test_df[\"median_house_value\"] /= scale_factor\n", | |
| "\n", | |
| "# Shuffle the examples\n", | |
| "train_df = train_df.reindex(np.random.permutation(train_df.index))" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "8kir8UTUXSV8" | |
| }, | |
| "source": [ | |
| "## Represent latitude and longitude as floating-point values\n", | |
| "\n", | |
| "Previous Colabs trained on only a single feature or a single synthetic feature. By contrast, this exercise trains on two features. Furthermore, this Colab introduces **feature columns**, which provide a sophisticated way to represent features. \n", | |
| "\n", | |
| "You create feature columns as possible:\n", | |
| "\n", | |
| " * Call a [`tf.feature_column`](https://www.tensorflow.org/api_docs/python/tf/feature_column) method to represent a single feature, single feature cross, or single synthetic feature in the desired way. For example, to represent a certain feature as floating-point values, call [`tf.feature_column.numeric_column`](https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column). To represent a certain feature as a series of buckets or bins, call [`tf.feature_column.bucketized_column`](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column).\n", | |
| " * Assemble the created representations into a Python list. \n", | |
| "\n", | |
| "A neighborhood's location is typically the most important feature in determining a house's value. The California Housing dataset provides two features, `latitude` and `longitude` that identify each neighborhood's location. \n", | |
| "\n", | |
| "The following code cell calls [`tf.feature_column.numeric_column`](https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column) twice, first to represent `latitude` as floating-point value and a second time to represent `longitude` as floating-point values. \n", | |
| "\n", | |
| "This code cell specifies the features that you'll ultimately train the model on and how each of those features will be represented. The transformations (collected in `fp_feature_layer`) don't actually get applied until you pass a DataFrame to it, which will happen when we train the model. \n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "3tmmZIDw4JEC" | |
| }, | |
| "source": [ | |
| "# Create an empty list that will eventually hold all feature columns.\n", | |
| "feature_columns = []\n", | |
| "\n", | |
| "# Create a numerical feature column to represent latitude.\n", | |
| "latitude = tf.feature_column.numeric_column(\"latitude\")\n", | |
| "feature_columns.append(latitude)\n", | |
| "\n", | |
| "# Create a numerical feature column to represent longitude.\n", | |
| "longitude = tf.feature_column.numeric_column(\"longitude\")\n", | |
| "feature_columns.append(longitude)\n", | |
| "\n", | |
| "# Convert the list of feature columns into a layer that will ultimately become\n", | |
| "# part of the model. Understanding layers is not important right now.\n", | |
| "fp_feature_layer = layers.DenseFeatures(feature_columns)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "Q2x8sHKnAy3Q" | |
| }, | |
| "source": [ | |
| "When used, the layer processes the raw inputs, according to the transformations described by the feature columns, and packs the result into a numeric array. (The model will train on this numeric array.) " | |
| ] | |
| }, | |
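| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "To make that concrete, the following optional cell (not part of the original exercise) calls `fp_feature_layer` on a tiny hand-made batch. The feature names match the DataFrame's columns; the sample coordinate values are made up." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# A minimal sketch: pass a dict of {feature name: array} through the layer\n", | |
| "# to see the numeric array the model will train on. Values are made up.\n", | |
| "sample_batch = {\n", | |
| "    \"latitude\": np.array([34.1, 36.5]),\n", | |
| "    \"longitude\": np.array([-118.3, -121.9]),\n", | |
| "}\n", | |
| "\n", | |
| "# Each row holds one example's latitude and longitude as float32 values.\n", | |
| "print(fp_feature_layer(sample_batch))" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |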
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "3014ezH3C7jT" | |
| }, | |
| "source": [ | |
| "## Define functions that create and train a model, and a plotting function\n", | |
| "\n", | |
| "The following code defines three functions:\n", | |
| "\n", | |
| " * `create_model`, which tells TensorFlow to build a linear regression model and to use the `feature_layer_as_fp` as the representation of the model's features.\n", | |
| " * `train_model`, which will ultimately train the model from training set examples.\n", | |
| " * `plot_the_loss_curve`, which generates a loss curve." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "pedD5GhlDC-y" | |
| }, | |
| "source": [ | |
| "#@title Define functions to create and train a model, and a plotting function\n", | |
| "def create_model(my_learning_rate, feature_layer):\n", | |
| " \"\"\"Create and compile a simple linear regression model.\"\"\"\n", | |
| " # Most simple tf.keras models are sequential.\n", | |
| " model = tf.keras.models.Sequential()\n", | |
| "\n", | |
| " # Add the layer containing the feature columns to the model.\n", | |
| " model.add(feature_layer)\n", | |
| "\n", | |
| " # Add one linear layer to the model to yield a simple linear regressor.\n", | |
| " model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))\n", | |
| "\n", | |
| " # Construct the layers into a model that TensorFlow can execute.\n", | |
| " model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=my_learning_rate),\n", | |
| " loss=\"mean_squared_error\",\n", | |
| " metrics=[tf.keras.metrics.RootMeanSquaredError()])\n", | |
| "\n", | |
| " return model \n", | |
| "\n", | |
| "\n", | |
| "def train_model(model, dataset, epochs, batch_size, label_name):\n", | |
| " \"\"\"Feed a dataset into the model in order to train it.\"\"\"\n", | |
| "\n", | |
| " features = {name:np.array(value) for name, value in dataset.items()}\n", | |
| " label = np.array(features.pop(label_name))\n", | |
| " history = model.fit(x=features, y=label, batch_size=batch_size,\n", | |
| " epochs=epochs, shuffle=True)\n", | |
| "\n", | |
| " # The list of epochs is stored separately from the rest of history.\n", | |
| " epochs = history.epoch\n", | |
| " \n", | |
| " # Isolate the mean absolute error for each epoch.\n", | |
| " hist = pd.DataFrame(history.history)\n", | |
| " rmse = hist[\"root_mean_squared_error\"]\n", | |
| "\n", | |
| " return epochs, rmse \n", | |
| "\n", | |
| "\n", | |
| "def plot_the_loss_curve(epochs, rmse):\n", | |
| " \"\"\"Plot a curve of loss vs. epoch.\"\"\"\n", | |
| "\n", | |
| " plt.figure()\n", | |
| " plt.xlabel(\"Epoch\")\n", | |
| " plt.ylabel(\"Root Mean Squared Error\")\n", | |
| "\n", | |
| " plt.plot(epochs, rmse, label=\"Loss\")\n", | |
| " plt.legend()\n", | |
| " plt.ylim([rmse.min()*0.94, rmse.max()* 1.05])\n", | |
| " plt.show() \n", | |
| "\n", | |
| "print(\"Defined the create_model, train_model, and plot_the_loss_curve functions.\")" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "D-IXYVfvM4gD" | |
| }, | |
| "source": [ | |
| "## Train the model with floating-point representations\n", | |
| "\n", | |
| "The following code cell calls the functions you just created to train, plot, and evaluate a model." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "nj3v5EKQFY8s", | |
| "cellView": "both" | |
| }, | |
| "source": [ | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.05\n", | |
| "epochs = 30\n", | |
| "batch_size = 100\n", | |
| "label_name = 'median_house_value'\n", | |
| "\n", | |
| "# Create and compile the model's topography.\n", | |
| "my_model = create_model(learning_rate, fp_feature_layer)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)\n", | |
| "\n", | |
| "plot_the_loss_curve(epochs, rmse)\n", | |
| "\n", | |
| "print(\"\\n: Evaluate the new model against the test set:\")\n", | |
| "test_features = {name:np.array(value) for name, value in test_df.items()}\n", | |
| "test_label = np.array(test_features.pop(label_name))\n", | |
| "my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "dbyWNS6T2fIT" | |
| }, | |
| "source": [ | |
| "## Task 1: Why aren't floating-point values a good way to represent latitude and longitude?\n", | |
| "\n", | |
| "Are floating-point values a good way to represent `latitude` and `longitude`? " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "VJLDCu5M2hXX" | |
| }, | |
| "source": [ | |
| "#@title Double-click to view an answer to Task 1.\n", | |
| "\n", | |
| "# No. Representing latitude and longitude as \n", | |
| "# floating-point values does not have much \n", | |
| "# predictive power. For example, neighborhoods at \n", | |
| "# latitude 35 are not 36/35 more valuable \n", | |
| "# (or 35/36 less valuable) than houses at \n", | |
| "# latitude 36.\n", | |
| "\n", | |
| "# Representing `latitude` and `longitude` as \n", | |
| "# floating-point values provides almost no \n", | |
| "# predictive power. We're only using the raw values \n", | |
| "# to establish a baseline for future experiments \n", | |
| "# with better representations." | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "Na8TPoPYx-0k" | |
| }, | |
| "source": [ | |
| "## Represent latitude and longitude in buckets\n", | |
| "\n", | |
| "The following code cell represents latitude and longitude in buckets (bins). Each bin represents all the neighborhoods within a single degree. For example,\n", | |
| "neighborhoods at latitude 35.4 and 35.8 are in the same bucket, but neighborhoods in latitude 35.4 and 36.2 are in different buckets. \n", | |
| "\n", | |
| "The model will learn a separate weight for each bucket. For example, the model will learn one weight for all the neighborhoods in the \"35\" bin\", a different weight for neighborhoods in the \"36\" bin, and so on. This representation will create approximately 20 buckets:\n", | |
| " \n", | |
| " * 10 buckets for `latitude`. \n", | |
| " * 10 buckets for `longitude`. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "WLTUFiaUyIpx" | |
| }, | |
| "source": [ | |
| "resolution_in_degrees = 1.0 \n", | |
| "\n", | |
| "# Create a new empty list that will eventually hold the generated feature column.\n", | |
| "feature_columns = []\n", | |
| "\n", | |
| "# Create a bucket feature column for latitude.\n", | |
| "latitude_as_a_numeric_column = tf.feature_column.numeric_column(\"latitude\")\n", | |
| "latitude_boundaries = list(np.arange(int(min(train_df['latitude'])), \n", | |
| " int(max(train_df['latitude'])), \n", | |
| " resolution_in_degrees))\n", | |
| "latitude = tf.feature_column.bucketized_column(latitude_as_a_numeric_column, \n", | |
| " latitude_boundaries)\n", | |
| "feature_columns.append(latitude)\n", | |
| "\n", | |
| "# Create a bucket feature column for longitude.\n", | |
| "longitude_as_a_numeric_column = tf.feature_column.numeric_column(\"longitude\")\n", | |
| "longitude_boundaries = list(np.arange(int(min(train_df['longitude'])), \n", | |
| " int(max(train_df['longitude'])), \n", | |
| " resolution_in_degrees))\n", | |
| "longitude = tf.feature_column.bucketized_column(longitude_as_a_numeric_column, \n", | |
| " longitude_boundaries)\n", | |
| "feature_columns.append(longitude)\n", | |
| "\n", | |
| "# Convert the list of feature columns into a layer that will ultimately become\n", | |
| "# part of the model. Understanding layers is not important right now.\n", | |
| "buckets_feature_layer = layers.DenseFeatures(feature_columns)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
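| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "As an optional check (not part of the original exercise), the next cell prints the computed latitude bucket boundaries, then runs a tiny made-up batch through `buckets_feature_layer` to show the one-hot vectors the model will actually see." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# A minimal sketch: inspect the bucket boundaries and the resulting\n", | |
| "# one-hot encoding. The sample coordinates below are made-up values.\n", | |
| "print(\"latitude boundaries:\", latitude_boundaries)\n", | |
| "\n", | |
| "sample_batch = {\n", | |
| "    \"latitude\": np.array([34.2, 37.8]),\n", | |
| "    \"longitude\": np.array([-118.5, -122.4]),\n", | |
| "}\n", | |
| "\n", | |
| "# Each example becomes two concatenated one-hot vectors: a single 1.0 marks\n", | |
| "# its latitude bucket, and another single 1.0 marks its longitude bucket.\n", | |
| "print(buckets_feature_layer(sample_batch))" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |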
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "aZsFzoPQ4pFm" | |
| }, | |
| "source": [ | |
| "## Train the model with bucket representations\n", | |
| "\n", | |
| "Run the following code cell to train the model with bucket representations rather than floating-point representations:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "UnDrghxBzLvD" | |
| }, | |
| "source": [ | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.04\n", | |
| "epochs = 35\n", | |
| "\n", | |
| "# Build the model, this time passing in the buckets_feature_layer.\n", | |
| "my_model = create_model(learning_rate, buckets_feature_layer)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)\n", | |
| "\n", | |
| "plot_the_loss_curve(epochs, rmse)\n", | |
| "\n", | |
| "print(\"\\n: Evaluate the new model against the test set:\")\n", | |
| "my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "Wb-bIKsN5M48" | |
| }, | |
| "source": [ | |
| "## Task 2: Did buckets outperform floating-point representations?\n", | |
| "\n", | |
| "Compare the model's `root_mean_squared_error` values for the two representations (floating-point vs. buckets)? Which model produced lower losses? " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "6sUlX1335UCb" | |
| }, | |
| "source": [ | |
| "#@title Double-click for an answer to Task 2.\n", | |
| "\n", | |
| "# Bucket representation outperformed \n", | |
| "# floating-point representations. \n", | |
| "# However, you can still do far better." | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "ab6-bhUvxbTL" | |
| }, | |
| "source": [ | |
| "## Task 3: What is a better way to represent location?\n", | |
| "\n", | |
| "Buckets are a big improvement over floating-point values. Can you identify an even better way to identify location with `latitude` and `longitude`?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "5no5X0OFCwf7" | |
| }, | |
| "source": [ | |
| "#@title Double-click to view an answer to Task 3.\n", | |
| "\n", | |
| "# Representing location as a feature cross should \n", | |
| "# produce better results.\n", | |
| "\n", | |
| "# In Task 2, you represented latitude in \n", | |
| "# one-dimensional buckets and longitude in \n", | |
| "# another series of one-dimensional buckets. \n", | |
| "# Real-world locations, however, exist in \n", | |
| "# two dimension. Therefore, you should\n", | |
| "# represent location as a two-dimensional feature\n", | |
| "# cross. That is, you'll cross the 10 or so latitude \n", | |
| "# buckets with the 10 or so longitude buckets to \n", | |
| "# create a grid of 100 cells. \n", | |
| "\n", | |
| "# The model will learn separate weights for each \n", | |
| "# of the cells." | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "T1ulCDiyGB6g" | |
| }, | |
| "source": [ | |
| "## Represent location as a feature cross\n", | |
| "\n", | |
| "The following code cell represents location as a feature cross. That is, the following code cell first creates buckets and then calls `tf.feature_column.crossed_column` to cross the buckets.\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "HunsuEzqn21s", | |
| "cellView": "both" | |
| }, | |
| "source": [ | |
| "resolution_in_degrees = 1.0 \n", | |
| "\n", | |
| "# Create a new empty list that will eventually hold the generated feature column.\n", | |
| "feature_columns = []\n", | |
| "\n", | |
| "# Create a bucket feature column for latitude.\n", | |
| "latitude_as_a_numeric_column = tf.feature_column.numeric_column(\"latitude\")\n", | |
| "latitude_boundaries = list(np.arange(int(min(train_df['latitude'])), int(max(train_df['latitude'])), resolution_in_degrees))\n", | |
| "latitude = tf.feature_column.bucketized_column(latitude_as_a_numeric_column, latitude_boundaries)\n", | |
| "\n", | |
| "# Create a bucket feature column for longitude.\n", | |
| "longitude_as_a_numeric_column = tf.feature_column.numeric_column(\"longitude\")\n", | |
| "longitude_boundaries = list(np.arange(int(min(train_df['longitude'])), int(max(train_df['longitude'])), resolution_in_degrees))\n", | |
| "longitude = tf.feature_column.bucketized_column(longitude_as_a_numeric_column, longitude_boundaries)\n", | |
| "\n", | |
| "# Create a feature cross of latitude and longitude.\n", | |
| "latitude_x_longitude = tf.feature_column.crossed_column([latitude, longitude], hash_bucket_size=100)\n", | |
| "crossed_feature = tf.feature_column.indicator_column(latitude_x_longitude)\n", | |
| "feature_columns.append(crossed_feature)\n", | |
| "\n", | |
| "# Convert the list of feature columns into a layer that will later be fed into\n", | |
| "# the model. \n", | |
| "feature_cross_feature_layer = layers.DenseFeatures(feature_columns)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
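| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Optionally (this cell is not in the original exercise), you can inspect the crossed representation: each (latitude bucket, longitude bucket) pair is hashed into one of the `hash_bucket_size=100` slots, and the indicator column one-hot encodes that slot. The sample coordinates are made up." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# A minimal sketch: run a made-up batch through the feature-cross layer.\n", | |
| "sample_batch = {\n", | |
| "    \"latitude\": np.array([34.2, 37.8]),\n", | |
| "    \"longitude\": np.array([-118.5, -122.4]),\n", | |
| "}\n", | |
| "\n", | |
| "output = feature_cross_feature_layer(sample_batch)\n", | |
| "\n", | |
| "# One 100-way one-hot row per example, so the shape is (2, 100).\n", | |
| "print(output.shape)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |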
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "akRgNnnH3VXJ" | |
| }, | |
| "source": [ | |
| "Invoke the following code cell to test your solution for Task 3. Please ignore the warning messages." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "qn2PRDBEr5ni" | |
| }, | |
| "source": [ | |
| "# The following variables are the hyperparameters.\n", | |
| "learning_rate = 0.04\n", | |
| "epochs = 35\n", | |
| "\n", | |
| "# Build the model, this time passing in the feature_cross_feature_layer: \n", | |
| "my_model = create_model(learning_rate, feature_cross_feature_layer)\n", | |
| "\n", | |
| "# Train the model on the training set.\n", | |
| "epochs, rmse = train_model(my_model, train_df, epochs, batch_size, label_name)\n", | |
| "\n", | |
| "plot_the_loss_curve(epochs, rmse)\n", | |
| "\n", | |
| "print(\"\\n: Evaluate the new model against the test set:\")\n", | |
| "my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "bCT-l1GaWNQE" | |
| }, | |
| "source": [ | |
| "## Task 4: Did the feature cross outperform buckets?\n", | |
| "\n", | |
| "Compare the model's `root_mean_squared_error` values for the two representations (buckets vs. feature cross)? Which model produced\n", | |
| "lower losses? " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "HUzdWDcs5rCi" | |
| }, | |
| "source": [ | |
| "#@title Double-click for an answer to this question.\n", | |
| "\n", | |
| "# Yes, representing these features as a feature \n", | |
| "# cross produced much lower loss values than \n", | |
| "# representing these features as buckets" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "J9Iw3ljfXqSQ" | |
| }, | |
| "source": [ | |
| "## Task 5: Adjust the resolution of the feature cross\n", | |
| "\n", | |
| "Return to the code cell in the \"Represent location as a feature cross\" section. Notice that `resolution_in_degrees` is set to 1.0. Therefore, each cell represents an area of 1.0 degree of latitude by 1.0 degree of longitude, which corresponds to a cell of 110 km by 90 km. This resolution defines a rather large neighborhood. \n", | |
| "\n", | |
| "Experiment with `resolution_in_degrees` to answer the following questions:\n", | |
| "\n", | |
| " 1. What value of `resolution_in_degrees` produces the best results (lowest loss value)?\n", | |
| " 2. Why does loss increase when the value of `resolution_in_degrees` drops below a certain value?\n", | |
| "\n", | |
| "Finally, answer the following question:\n", | |
| "\n", | |
| " 3. What feature (that does not exist in the California Housing Dataset) would\n", | |
| " be a better proxy for location than latitude X longitude." | |
| ] | |
| }, | |
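| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Before opening the answer below, you can automate the experiment. The following optional sketch is not part of the original exercise: it wraps the feature-cross construction from the previous section in a hypothetical helper, `build_feature_cross_layer`, and sweeps a few candidate resolutions, reusing `create_model`, `train_model`, and the earlier hyperparameters. The `hash_bucket_size` of 1000 is an assumption chosen to leave room for grids finer than 1 degree." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# Optional sketch for Task 5. build_feature_cross_layer is a hypothetical\n", | |
| "# helper, not part of the original Colab; hash_bucket_size=1000 is an\n", | |
| "# assumption sized for grids finer than 1 degree.\n", | |
| "def build_feature_cross_layer(resolution_in_degrees):\n", | |
| "  latitude_numeric = tf.feature_column.numeric_column(\"latitude\")\n", | |
| "  latitude_boundaries = list(np.arange(int(min(train_df['latitude'])),\n", | |
| "                                       int(max(train_df['latitude'])),\n", | |
| "                                       resolution_in_degrees))\n", | |
| "  latitude = tf.feature_column.bucketized_column(latitude_numeric,\n", | |
| "                                                 latitude_boundaries)\n", | |
| "\n", | |
| "  longitude_numeric = tf.feature_column.numeric_column(\"longitude\")\n", | |
| "  longitude_boundaries = list(np.arange(int(min(train_df['longitude'])),\n", | |
| "                                        int(max(train_df['longitude'])),\n", | |
| "                                        resolution_in_degrees))\n", | |
| "  longitude = tf.feature_column.bucketized_column(longitude_numeric,\n", | |
| "                                                  longitude_boundaries)\n", | |
| "\n", | |
| "  crossed = tf.feature_column.crossed_column([latitude, longitude],\n", | |
| "                                             hash_bucket_size=1000)\n", | |
| "  return layers.DenseFeatures([tf.feature_column.indicator_column(crossed)])\n", | |
| "\n", | |
| "# Re-declare the hyperparameters locally; the earlier cells reuse the name\n", | |
| "# `epochs` for the returned list of epoch indices.\n", | |
| "learning_rate = 0.04\n", | |
| "num_epochs = 35\n", | |
| "\n", | |
| "# Train and evaluate one model per candidate resolution.\n", | |
| "for resolution in [1.0, 0.4, 0.1]:\n", | |
| "  my_model = create_model(learning_rate, build_feature_cross_layer(resolution))\n", | |
| "  epoch_list, rmse = train_model(my_model, train_df, num_epochs, batch_size,\n", | |
| "                                 label_name)\n", | |
| "  print(\"\\nresolution_in_degrees =\", resolution)\n", | |
| "  my_model.evaluate(x=test_features, y=test_label, batch_size=batch_size)" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |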
| { | |
| "cell_type": "code", | |
| "metadata": { | |
| "id": "71WWwlhx4h0X" | |
| }, | |
| "source": [ | |
| "#@title Double-click for possible answers to Task 5.\n", | |
| "\n", | |
| "#1. A resolution of ~0.4 degree provides the best \n", | |
| "# results.\n", | |
| "\n", | |
| "#2. Below ~0.4 degree, loss increases because the \n", | |
| "# dataset does not contain enough examples in \n", | |
| "# each cell to accurately predict prices for \n", | |
| "# those cells.\n", | |
| "\n", | |
| "#3. Postal code would be a far better feature \n", | |
| "# than latitude X longitude, assuming that \n", | |
| "# the dataset contained sufficient examples \n", | |
| "# in each postal code." | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| } | |
| ] | |
| } |