@tdhopper
Created June 29, 2020 18:12
data/
.env/
__pycache__
.ipynb_checkpoints
Training Report.pdf
*.tar.gz
handler_venv/
*.nbconvert.*
static/
env/
*.pkl
.env
.ipynb_checkpoints
**/__pycache__

Energy Efficiency Predictor

Problem

The goal of this challenge is to build a regression model and deploy it with Docker. The dataset for the challenge is available at https://archive.ics.uci.edu/ml/datasets/Energy+efficiency. You should be able to run the Docker image and then curl the container, sending JSON containing the attributes of a new building and getting back a JSON response with the heating and cooling loads predicted by your trained model. The code should be written in Python, but you can use whichever libraries you like to train and deploy the model.

Solution

I built a simple multi-output Lasso regression model with scikit-learn and serve it with Flask.

You can see the notebook used to train the model at https://9whioydhmb.execute-api.us-east-1.amazonaws.com/carbonrelay.

The model can be tested by POSTing a JSON object of input values to an AWS endpoint, e.g.:

curl -d '{
    "relative_compactness": 0.98,
    "surface_area": 514.5,
    "wall_area": 294.0,
    "roof_area": 110.25,
    "overall_height": 7.0,
    "orientation": 2.0,
    "glazing_area": 0.0,
    "glazing_area_distribution": 0.0
}' -H 'Content-Type: application/json' https://9whioydhmb.execute-api.us-east-1.amazonaws.com/carbonrelay/predict

The HTTP response is JSON containing heating_load and cooling_load fields.
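The request and response shapes described above can also be exercised from Python. The following is a minimal sketch using only the standard library; the helper names `build_predict_request` and `parse_prediction` are hypothetical and not part of this gist, and the sketch assumes the response holds exactly the two numeric fields named above.

```python
import json
from urllib import request

# The eight input attributes used in the curl example above.
FEATURES = (
    "relative_compactness",
    "surface_area",
    "wall_area",
    "roof_area",
    "overall_height",
    "orientation",
    "glazing_area",
    "glazing_area_distribution",
)


def build_predict_request(base_url, features):
    """Build (but do not send) a JSON POST request for the /predict route."""
    missing = [name for name in FEATURES if name not in features]
    if missing:
        raise ValueError(f"missing features: {missing}")
    body = json.dumps({name: features[name] for name in FEATURES}).encode("utf-8")
    return request.Request(
        base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(raw_body):
    """Extract (heating_load, cooling_load) from a JSON response body."""
    payload = json.loads(raw_body)
    return float(payload["heating_load"]), float(payload["cooling_load"])


example = {
    "relative_compactness": 0.98,
    "surface_area": 514.5,
    "wall_area": 294.0,
    "roof_area": 110.25,
    "overall_height": 7.0,
    "orientation": 2.0,
    "glazing_area": 0.0,
    "glazing_area_distribution": 0.0,
}
req = build_predict_request(
    "https://9whioydhmb.execute-api.us-east-1.amazonaws.com/carbonrelay", example
)

# Sending the request needs network access and a live endpoint:
# with request.urlopen(req) as resp:
#     heating, cooling = parse_prediction(resp.read())
```

Building the request separately from sending it keeps the payload check testable without a running server.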

Building and Testing

Prerequisites

  • Running Docker client

Steps

Run $ docker-compose up to train the model and start the web server on port 5000. Once this completes, you should be able to run:

curl -d '{
    "relative_compactness": 0.98,
    "surface_area": 514.5,
    "wall_area": 294.0,
    "roof_area": 110.25,
    "overall_height": 7.0,
    "orientation": 2.0,
    "glazing_area": 0.0,
    "glazing_area_distribution": 0.0
}' -H 'Content-Type: application/json' http://127.0.0.1:5000/predict

You can also view the rendered Jupyter notebook, with training details and model performance, at http://127.0.0.1:5000/.
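The notebook judges the model by comparing cross-validated R^2 scores against a mean-only baseline. As a rough illustration of why such a baseline scores at or slightly below 0 on held-out data, here is a small standard-library sketch. The numbers are made up, it uses a single output rather than the two outputs in the dataset, and `r_squared` is a hypothetical helper rather than the scikit-learn function.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up heating loads, split into a training fold and a held-out fold.
train_y = [10.0, 20.0, 30.0, 40.0]
test_y = [12.0, 18.0, 33.0]

# The baseline always predicts the mean of the training targets.
baseline = mean(train_y)

# Scored on its own training data, the mean predictor gets exactly 0.
in_sample = r_squared(train_y, [baseline] * len(train_y))

# Scored on held-out data, the training mean differs from the held-out mean,
# so SS_res exceeds SS_tot and the score drops below 0.
held_out = r_squared(test_y, [baseline] * len(test_y))

print(in_sample, held_out)
```

A fitted model is only doing useful work if its held-out R^2 is clearly above this baseline level.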

import joblib
import pandas as pd
from flask import Flask, request

app = Flask("energy_efficiency")


@app.route("/")
def report():
    # Serve the rendered training notebook from the static/ directory.
    return app.send_static_file("Train.html")


@app.route("/predict", methods=["POST"])
def predict():
    # Build a one-row DataFrame from the posted JSON object.
    data = pd.DataFrame.from_records([request.get_json()])
    pipe = joblib.load("model.pkl")
    output = pipe.predict(data)
    # One row in, two targets out: heating load and cooling load.
    assert output.shape == (1, 2)
    return {
        "heating_load": output[0][0],
        "cooling_load": output[0][1],
    }
version: '3'
services:
  web-server:
    build: .
    entrypoint: flask run -h 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - .:/home
  train:
    build: .
    entrypoint: /bin/sh
    command: /home/train.sh
    volumes:
      - .:/home
FROM fnndsc/ubuntu-python3
RUN apt -qq install --yes build-essential
RUN pip install --quiet --upgrade pip
WORKDIR /home
COPY . ./
RUN pip install -r requirements.txt
ENV FLASK_APP=app
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
scikit-learn
flask
jupyter
pandas
xlrd
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train a dummy model and a linear regression model on the Energy Efficiency dataset."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import joblib\n",
"\n",
"from sklearn.linear_model import LinearRegression, Lasso\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.dummy import DummyRegressor\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.compose import ColumnTransformer"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"feature_columns = [\n",
" \"relative_compactness\",\n",
" \"surface_area\",\n",
" \"wall_area\",\n",
" \"roof_area\",\n",
" \"overall_height\",\n",
" \"orientation\",\n",
" \"glazing_area\",\n",
" \"glazing_area_distribution\",\n",
"]\n",
"\n",
"output_columns = [\n",
" \"heating_load\",\n",
" \"cooling_load\",\n",
"]\n",
"\n",
"data = pd.read_excel(\"ENB2012_data.xlsx\", \n",
" names=feature_columns + output_columns,\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>relative_compactness</th>\n",
" <th>surface_area</th>\n",
" <th>wall_area</th>\n",
" <th>roof_area</th>\n",
" <th>overall_height</th>\n",
" <th>orientation</th>\n",
" <th>glazing_area</th>\n",
" <th>glazing_area_distribution</th>\n",
" <th>heating_load</th>\n",
" <th>cooling_load</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.00000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.00000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.764167</td>\n",
" <td>671.708333</td>\n",
" <td>318.500000</td>\n",
" <td>176.604167</td>\n",
" <td>5.25000</td>\n",
" <td>3.500000</td>\n",
" <td>0.234375</td>\n",
" <td>2.81250</td>\n",
" <td>22.307195</td>\n",
" <td>24.587760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.105777</td>\n",
" <td>88.086116</td>\n",
" <td>43.626481</td>\n",
" <td>45.165950</td>\n",
" <td>1.75114</td>\n",
" <td>1.118763</td>\n",
" <td>0.133221</td>\n",
" <td>1.55096</td>\n",
" <td>10.090204</td>\n",
" <td>9.513306</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.620000</td>\n",
" <td>514.500000</td>\n",
" <td>245.000000</td>\n",
" <td>110.250000</td>\n",
" <td>3.50000</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>6.010000</td>\n",
" <td>10.900000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.682500</td>\n",
" <td>606.375000</td>\n",
" <td>294.000000</td>\n",
" <td>140.875000</td>\n",
" <td>3.50000</td>\n",
" <td>2.750000</td>\n",
" <td>0.100000</td>\n",
" <td>1.75000</td>\n",
" <td>12.992500</td>\n",
" <td>15.620000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.750000</td>\n",
" <td>673.750000</td>\n",
" <td>318.500000</td>\n",
" <td>183.750000</td>\n",
" <td>5.25000</td>\n",
" <td>3.500000</td>\n",
" <td>0.250000</td>\n",
" <td>3.00000</td>\n",
" <td>18.950000</td>\n",
" <td>22.080000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.830000</td>\n",
" <td>741.125000</td>\n",
" <td>343.000000</td>\n",
" <td>220.500000</td>\n",
" <td>7.00000</td>\n",
" <td>4.250000</td>\n",
" <td>0.400000</td>\n",
" <td>4.00000</td>\n",
" <td>31.667500</td>\n",
" <td>33.132500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>0.980000</td>\n",
" <td>808.500000</td>\n",
" <td>416.500000</td>\n",
" <td>220.500000</td>\n",
" <td>7.00000</td>\n",
" <td>5.000000</td>\n",
" <td>0.400000</td>\n",
" <td>5.00000</td>\n",
" <td>43.100000</td>\n",
" <td>48.030000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" relative_compactness surface_area wall_area roof_area \\\n",
"count 768.000000 768.000000 768.000000 768.000000 \n",
"mean 0.764167 671.708333 318.500000 176.604167 \n",
"std 0.105777 88.086116 43.626481 45.165950 \n",
"min 0.620000 514.500000 245.000000 110.250000 \n",
"25% 0.682500 606.375000 294.000000 140.875000 \n",
"50% 0.750000 673.750000 318.500000 183.750000 \n",
"75% 0.830000 741.125000 343.000000 220.500000 \n",
"max 0.980000 808.500000 416.500000 220.500000 \n",
"\n",
" overall_height orientation glazing_area glazing_area_distribution \\\n",
"count 768.00000 768.000000 768.000000 768.00000 \n",
"mean 5.25000 3.500000 0.234375 2.81250 \n",
"std 1.75114 1.118763 0.133221 1.55096 \n",
"min 3.50000 2.000000 0.000000 0.00000 \n",
"25% 3.50000 2.750000 0.100000 1.75000 \n",
"50% 5.25000 3.500000 0.250000 3.00000 \n",
"75% 7.00000 4.250000 0.400000 4.00000 \n",
"max 7.00000 5.000000 0.400000 5.00000 \n",
"\n",
" heating_load cooling_load \n",
"count 768.000000 768.000000 \n",
"mean 22.307195 24.587760 \n",
"std 10.090204 9.513306 \n",
"min 6.010000 10.900000 \n",
"25% 12.992500 15.620000 \n",
"50% 18.950000 22.080000 \n",
"75% 31.667500 33.132500 \n",
"max 43.100000 48.030000 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, test with a dummy regressor (which predicts the mean output) as a baseline. Use 10-fold cross-validation to estimate the R^2 value. We would expect R^2 values around 0."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-0.12217272, -0.15209695, -0.04929655, -0.00102648, -0.03166066,\n",
" -0.06108127, -0.00106225, -0.07762155, -0.21376269, -0.00233177])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dummy = DummyRegressor()\n",
"\n",
"cross_val_score(dummy, \n",
" data[feature_columns], \n",
" data[output_columns], \n",
" cv=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It appears from above that `orientation` is a categorical variable while the others are continuous. \n",
"\n",
"I train a simple model: one-hot encode `orientation`, standardize the features (remove the mean \n",
"and scale to unit variance), then fit a Lasso regression."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"ct = ColumnTransformer(\n",
" [(\"orientation\", OneHotEncoder(), [\"orientation\"])], remainder=\"passthrough\"\n",
")\n",
"\n",
"pipe = Pipeline([(\"transform\", ct), (\"norm\", StandardScaler()), (\"regress\", Lasso())])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.6419667 , 0.88427361, 0.87404772, 0.85625164, 0.90037526,\n",
" 0.85565515, 0.89379651, 0.84736809, 0.83821679, 0.901613 ])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The model R^2 value increases significantly over the dummy model. \n",
"\n",
"cross_val_score(pipe, data[feature_columns], data[output_columns], cv=10)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['model.pkl']"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Train model on full training set.\n",
"\n",
"pipe.fit(data[feature_columns], data[output_columns])\n",
"joblib.dump(pipe, 'model.pkl')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
#!/bin/bash
jupyter nbconvert --to html --execute Train.ipynb --ExecutePreprocessor.timeout=600
mkdir -p static
mv Train.html static
{
"carbonrelay": {
"app_function": "app.app",
"aws_region": "us-east-1",
"project_name": "energy_efficiency",
"runtime": "python3.8",
"s3_bucket": "zappa-carbon-relay-takehome",
"slim_handler": true
}
}