Kaggle Intro to ML notes

Notes for https://www.kaggle.com/learn/intro-to-machine-learning

Misc notes:

  • The Melbourne dataset made available for download with the kernel seems to be corrupted (451 KB instead of 1.99 MB; read_csv fails on it, and opening it in any text editor just shows gibberish).
  • The Melbourne dataset used in the course is a snapshot, which means you can track down the original dataset by Tony Pino.
  • It would be worth going through https://www.kaggle.com/learn/pandas to (re)learn how to clean datasets, because attempting to build a model directly on Pino's dataset fails (it contains missing values).

Basic data exploration
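
The examples below assume the Melbourne dataset has been loaded into a pandas DataFrame; a minimal sketch (the file path is a placeholder for wherever you saved the snapshot):

import pandas as pd

# Load the Melbourne housing snapshot; adjust the path to your copy
dataset = pd.read_csv("melb_data.csv")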

Display columns in dataset:

>>> dataset.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

Access a column with dot notation:

>>> dataset.Suburb
0        Abbotsford
1        Abbotsford
2        Abbotsford
3        Abbotsford
4        Abbotsford
            ...
34852    Yarraville
34853    Yarraville
34854    Yarraville
34855    Yarraville
34856    Yarraville
Name: Suburb, Length: 34857, dtype: object

This returns a pandas Series.
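
Bracket notation is equivalent, and also works for column names that aren't valid Python identifiers (e.g. names containing spaces):

# Same Series as dataset.Suburb
suburbs = dataset["Suburb"]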

Manipulating .describe() output

  • You can manipulate the output of a .describe() method call.
  • .describe() prints a table containing count, mean, std (standard deviation), min, percentile values 25%, 50%, 75%, and max.
  • You can access these values like so: .describe()[column_name]["max"]
  • A fuller example below:
import pandas as pd

home_data = pd.read_csv(...)  # load the home data (path elided)

# What is the average lot size (rounded to the nearest integer)?
avg_lot_size = round(home_data.describe()["LotArea"]["mean"])

# As of today, how old is the newest home (current year minus the year it was built)?
newest_home_age = 2019 - home_data.describe()["YearBuilt"]["max"]
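
Note that .describe() returns a regular DataFrame (statistics as the index), so .loc works as well:

# Equivalent to home_data.describe()["YearBuilt"]["max"]
newest_year = home_data.describe().loc["max", "YearBuilt"]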

Decision Tree Models

First model

  1. Select the prediction target. When working in pandas, the prediction target is usually a Series, conventionally assigned to y. You can select a column using dot notation and assign it to y:

     y = dataset.Price
     >>> y
     0              NaN
     1        1480000.0
     2        1035000.0
     3              NaN
     4        1465000.0
             ...
     34852    1480000.0
     34853     888000.0
     34854     705000.0
     34855    1140000.0
     34856    1020000.0
     Name: Price, Length: 34857, dtype: float64
    
  2. Select 'features', i.e. the columns to use as model input. These are column names, usually collected in a list, e.g. melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']. Assign the selected columns to X: X = melbourne_data[melbourne_features]

     >>> X = melbourne_data[melbourne_features]
     >>> X
            Rooms  Bathroom  Landsize  Lattitude  Longtitude
     0          2       1.0     126.0  -37.80140   144.99580
     1          2       1.0     202.0  -37.79960   144.99840
     2          2       1.0     156.0  -37.80790   144.99340
     3          3       2.0       0.0  -37.81140   145.01160
     4          3       2.0     134.0  -37.80930   144.99440
     ...      ...       ...       ...        ...         ...
     34852      4       1.0     593.0  -37.81053   144.88467
     34853      2       2.0      98.0  -37.81551   144.88826
     34854      2       1.0     220.0  -37.82286   144.87856
     34855      3       NaN       NaN        NaN         NaN
     34856      2       1.0     250.0  -37.81810   144.89351
    
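Note the missing values in the snapshot (e.g. index 34855 above, and the NaN prices in y). The Kaggle lesson sidesteps this by dropping rows with missing values before selecting the target and features:

melbourne_data = melbourne_data.dropna(axis=0)  # drop rows with any missing values

y = melbourne_data.Price
X = melbourne_data[melbourne_features]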

Building the model

  1. Import scikit-learn:

     from sklearn.tree import DecisionTreeRegressor
    
  2. Build a simple model (warning: make sure your data is clean!):

     # Define model. Specify a number for random_state to get the same results each run (it seeds the RNG)
     >>> melbourne_model = DecisionTreeRegressor(random_state=1)
    
     # Fit model. **This mutates melbourne_model**
     >>> melbourne_model.fit(X, y)
    
     DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                   max_leaf_nodes=None, min_impurity_decrease=0.0,
                   min_impurity_split=None, min_samples_leaf=1,
                   min_samples_split=2, min_weight_fraction_leaf=0.0,
                   presort=False, random_state=1, splitter='best')
    
  3. Make a prediction:

     >>> print("Making predictions for the following 5 houses:")
     >>> print(X.head())
     >>> print("The predictions are")
     >>> print(melbourne_model.predict(X.head()))
     Making predictions for the following 5 houses:
        Rooms  Bathroom  Landsize  Lattitude  Longtitude
     1      2       1.0     156.0   -37.8079    144.9934
     2      3       2.0     134.0   -37.8093    144.9944
     4      4       1.0     120.0   -37.8072    144.9941
     6      3       2.0     245.0   -37.8024    144.9993
     7      2       1.0     256.0   -37.8060    144.9954
     The predictions are
     [1035000. 1465000. 1600000. 1876000. 1636000.]

Model validation

Make sure that you validate your model with data that is not in your training dataset.

  • MAE: Mean absolute error.

  • Prediction error for each data point is

      error = actual - predicted
    
  • The MAE then is the average of the absolute values of all errors (a by-hand pandas version follows this list).

  • You can do this using the mean_absolute_error method from the scikit-learn library:

      from sklearn.metrics import mean_absolute_error
    
      predictions = melbourne_model.predict(X)
      MAE = mean_absolute_error(y, predictions)
      # NB: computing MAE against the same y used for training gives an
      # overly optimistic "in-sample" score; validate with target values
      # from a held-out dataset instead (see the next section)
    
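For reference, here is MAE computed by hand in pandas (a minimal sketch; y_actual and y_pred stand for any pair of actual/predicted values):

# Mean of the absolute prediction errors; equivalent to
# sklearn.metrics.mean_absolute_error(y_actual, y_pred)
mae = (y_actual - y_pred).abs().mean()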

Sidenote: Have always been extremely annoyed that naming conventions are inconsistent even in very established Python libraries, e.g. DecisionTreeRegressor and mean_absolute_error both come from scikit-learn (the former from the tree submodule, the latter from metrics), yet one uses CamelCase and the other snake_case. (In fairness, this follows PEP 8: DecisionTreeRegressor is a class, mean_absolute_error a function.)

Splitting your data into training and validation datasets

Use the train_test_split method from sklearn.model_selection.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

melb_model = DecisionTreeRegressor()
melb_model.fit(train_X, train_y)

predictions = melb_model.predict(val_X)

# compares actual values from the validation data set to predictions using the same features
mae = mean_absolute_error(val_y, predictions)

Question: How does this measure your model's effectiveness? Because val_X and val_y were held out of training, this MAE estimates how far off, on average, the model's predictions will be on unseen data.

Controlling tree depth

You can control tree depth indirectly with the max_leaf_nodes option when calling DecisionTreeRegressor:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

You can also control it directly with max_depth:

model = DecisionTreeRegressor(max_depth=5, random_state=0)  # e.g. a maximum of 5 levels

Documentation for sklearn.tree.DecisionTreeRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

Optimizing for fit

To optimize your model's fit, you can loop over a list of candidate max_leaf_nodes values, compute the validation MAE for each, and select the value that produces the lowest MAE.

Reusing the get_mae helper defined above:

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
args = [train_X, val_X, train_y, val_y]

for el in candidate_max_leaf_nodes:
    print("Max leaf nodes: %d\tMAE: %f" %(el, get_mae(el, *args)))
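
You can also pick the best candidate programmatically instead of reading the printout; a small sketch using the same helper and variables:

# Map each candidate leaf count to its validation MAE, then take the minimum
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)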

NOTE: You must call model.fit(any_X, any_y) before calling inspection methods such as model.get_depth(); scikit-learn raises NotFittedError otherwise.

Pick the optimal max_leaf_nodes value (eyeballed from the printout, or best_tree_size from the sketch above) and plug it into your final model:

final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)

final_model.fit(X, y)  # fit on all available data once the hyperparameter is chosen

Random Forests

  • Optimizing a tree by hand is error-prone: you have to make manual decisions and eyeball tree depth and model fit.
  • Using a random forest lets you delegate that work to your software.
  • A random forest generates several decision trees from your data, and makes a prediction by averaging the predictions of its component trees.

Import and use the RandomForestRegressor class:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
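
As with a single tree, you can tune the forest; a minimal sketch that varies the number of trees (n_estimators; the candidate values here are arbitrary):

# Validation MAE for a few forest sizes; more trees usually help,
# with diminishing returns and higher compute cost
for n in [10, 50, 100]:
    model = RandomForestRegressor(n_estimators=n, random_state=1)
    model.fit(train_X, train_y)
    print(n, mean_absolute_error(val_y, model.predict(val_X)))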

Glossary

  • Tree depth: A measure of how many splits a decision tree model makes before arriving at a prediction. E.g. depth == 1 -> 1 split -> 2 leaf nodes.
  • Overfitting: The model scores highly on the training dataset, but performs poorly on validation/real-world data.
  • Underfitting: The model fails to capture important patterns and distinctions in the data, so it performs poorly even on training data. The shallower the tree, the more likely it is to underfit.