Notes for https://www.kaggle.com/learn/intro-to-machine-learning
Misc notes:
- The Melbourne dataset made available for download with the kernel seems to be corrupted (451KB instead of 1.99MB; read_csv fails on it, and opening it in any text editor just shows gibberish).
- The melb dataset is a snapshot, which means you can get to the original dataset by Tony Pino.
- Would be worth going through https://www.kaggle.com/learn/pandas to (re)learn how to clean datasets, because attempting to build a model with Pino's dataset fails; a minimal cleaning sketch follows below.
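A sketch of that cleanup (assumptions: the dataset loads into a DataFrame named melbourne_data, and the file path is just a placeholder). Dropping rows with missing values should be enough to let the decision tree models further down fit:
import pandas as pd
melbourne_data = pd.read_csv("melb_data.csv")  # placeholder path
# Drop any row containing a missing value; DecisionTreeRegressor cannot handle NaNs.
melbourne_data = melbourne_data.dropna(axis=0)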
Display columns in dataset:
>>> dataset.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
'Longtitude', 'Regionname', 'Propertycount'],
dtype='object')
Access a column with dot notation:
>>> dataset.Suburb
0 Abbotsford
1 Abbotsford
2 Abbotsford
3 Abbotsford
4 Abbotsford
...
34852 Yarraville
34853 Yarraville
34854 Yarraville
34855 Yarraville
34856 Yarraville
Name: Suburb, Length: 34857, dtype: object
This returns a pandas Series.
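Dot notation only works when the column name is a valid Python identifier and doesn't clash with a DataFrame attribute; bracket notation is the equivalent, more general form (a quick sketch on the same dataset variable):
>>> dataset["Suburb"]             # same Series as dataset.Suburb
>>> dataset[["Suburb", "Rooms"]]  # passing a list of names returns a DataFrame instead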
- You can manipulate the output of a .describe() method call. .describe() returns a table containing count, mean, std (standard deviation), min, the 25%, 50%, and 75% percentiles, and max for each numeric column.
- You can access these values like so: .describe()[column_name]["max"]
- A fuller example below:
import pandas as pd
home_data = ...  # (dataset loading elided)
# What is the average lot size (rounded to the nearest integer)?
avg_lot_size = round(home_data.describe()["LotArea"]["mean"])
# As of today, how old is the newest home (current year - the year in which it was built)?
newest_home_age = 2019 - home_data.describe()["YearBuilt"]["max"]
- Select the prediction target. The prediction target when working in pandas is usually a Series assigned to y. You can select a column using dot notation and assign it to y:
y = dataset.Price
>>> y
0              NaN
1        1480000.0
2        1035000.0
3              NaN
4        1465000.0
           ...
34852    1480000.0
34853     888000.0
34854     705000.0
34855    1140000.0
34856    1020000.0
Name: Price, Length: 34857, dtype: float64
- Select 'features', or the columns to input in the model. These are column names, usually assigned as a list, e.g. melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']. Assign these to X: X = melbourne_data[melbourne_features]
>>> X = m[melb_features]
>>> X
       Rooms  Bathroom  Landsize  Lattitude  Longtitude
0          2       1.0     126.0  -37.80140   144.99580
1          2       1.0     202.0  -37.79960   144.99840
2          2       1.0     156.0  -37.80790   144.99340
3          3       2.0       0.0  -37.81140   145.01160
4          3       2.0     134.0  -37.80930   144.99440
...      ...       ...       ...        ...         ...
34852      4       1.0     593.0  -37.81053   144.88467
34853      2       2.0      98.0  -37.81551   144.88826
34854      2       1.0     220.0  -37.82286   144.87856
34855      3       NaN       NaN        NaN         NaN
34856      2       1.0     250.0  -37.81810   144.89351
- Import scikit-learn's DecisionTreeRegressor:
from sklearn.tree import DecisionTreeRegressor
- Build a simple model (warning: make sure your data is clean!):
# Define model. Specify a number for random_state to ensure the same results each run, i.e. this is the seed.
>>> melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model. **This mutates melbourne_model**
>>> melbourne_model.fit(X, y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
- Make a prediction:
>>> print("Making predictions for the following 5 houses:")
>>> print(X.head())
>>> print("The predictions are")
>>> print(melbourne_model.predict(X.head()))
Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
Make sure that you validate your model with data that is not in your training dataset.
- MAE: Mean absolute error.
- The prediction error for each data point is error = actual - predicted.
- The MAE is then the average of the absolute values of all errors.
- You can compute it with the mean_absolute_error function from the scikit-learn library:
from sklearn.metrics import mean_absolute_error
predictions = melbourne_model.predict(X)
MAE = mean_absolute_error(y, predictions)
# NB: we shouldn't be computing this against the same 'y' we trained on; instead,
# we should use target values from a separate validation set.
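For intuition, the same quantity can be computed by hand; a tiny sketch with made-up numbers (not from the dataset):
actual = [100, 150, 200]
predicted = [110, 140, 230]
errors = [a - p for a, p in zip(actual, predicted)]   # [-10, 10, -30]
mae = sum(abs(e) for e in errors) / len(errors)       # (10 + 10 + 30) / 3 ≈ 16.67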
Sidenote: I have always been extremely annoyed that names are not consistent even in very established Python libraries, e.g. DecisionTreeRegressor and mean_absolute_error both come from scikit-learn (albeit one is a class from the tree submodule and the other a function from the metrics submodule), one using CamelCase and the other snake_case.
Use the train_test_split function from sklearn.model_selection.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
# split the features and target into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
melb_model = DecisionTreeRegressor()
melb_model.fit(train_X, train_y)
predictions = melb_model.predict(val_X)
# compare actual values from the validation set against predictions made from the same features
mae = mean_absolute_error(val_y, predictions)
Question: How does this measure your model's effectiveness?
- Greater tree depth -> more likely to overfit.
- Shallower trees -> more likely to underfit.
You can control tree depth indirectly with the max_leaf_nodes option when calling DecisionTreeRegressor:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # fit a tree with the given max_leaf_nodes and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
You can also control it directly with max_depth:
model = DecisionTreeRegressor(max_depth=<max_depth>, random_state=0)
Documentation for sklearn.tree.DecisionTreeRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
To optimize your model's fit, you can loop through a list of max_leaf_nodes values, compute the validation MAE for each one, and select the max_leaf_nodes value that produces the lowest error.
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
args = [train_X, val_X, train_y, val_y]
for el in candidate_max_leaf_nodes:
    print("Max leaf nodes: %d\tMAE: %f" % (el, get_mae(el, *args)))
NOTE: You must call model.fit(any_X, any_y) before you can call anything else on the model, e.g. model.get_depth().
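A quick illustration of this (a sketch; the exact error message depends on the scikit-learn version):
from sklearn.tree import DecisionTreeRegressor
from sklearn.exceptions import NotFittedError
model = DecisionTreeRegressor(random_state=0)
try:
    model.get_depth()            # called before fit()
except NotFittedError as err:
    print("Model not fitted yet:", err)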
Eyeball the optimal max_leaf_nodes value for your model and plug it into a final model:
final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)
final_model.fit(X, y)  # fit on all the data
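Instead of eyeballing the printed table, the best value can also be picked programmatically; a small sketch assuming the get_mae helper and the train/validation split defined above:
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)   # the max_leaf_nodes value with the lowest MAE
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)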
- Optimizing a tree by hand is error-prone.
- Need to make manual decisions and eyeball tree depth and optimum model fit.
- Using a random forest allows you to delegate that work to your software.
- A random forest generates several decision trees using your data, and makes a prediction by taking an average of the predictions of each generated component tree.
Import the RandomForestRegressor class:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
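The "average of the component trees" description can be checked directly; a sketch assuming the fitted forest_model and val_X from above (the individual fitted trees are exposed as forest_model.estimators_):
import numpy as np
per_tree_preds = [tree.predict(val_X.to_numpy()) for tree in forest_model.estimators_]
manual_mean = np.mean(per_tree_preds, axis=0)
print(np.allclose(manual_mean, forest_model.predict(val_X)))  # expect True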
- Tree depth: a measure of how many splits a decision tree model makes before coming to a prediction. E.g. depth == 1 -> 1 split -> 2 leaf nodes.
- Overfitting: high accuracy on the training dataset, but poor performance on validation/real-world data.
- Underfitting: the model fails to capture important patterns and distinctions in the data, so it performs poorly even on the training data. The shallower the tree, the more likely it is to underfit.
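A rough way to see both failure modes, assuming the train/validation split from the earlier train_test_split example: compare the MAE on the training data against the MAE on the validation data for a very shallow and a very deep tree.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
for depth in (1, 50):  # 1 is likely to underfit, 50 is likely to overfit
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    print("depth=%d\ttrain MAE=%.0f\tval MAE=%.0f" % (depth, train_mae, val_mae))
# An underfit tree shows a high MAE on both sets; an overfit tree shows a low
# training MAE but a noticeably higher validation MAE.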