Kaggle Intro to ML notes

Notes for https://www.kaggle.com/learn/intro-to-machine-learning

Misc notes:

  • The Melbourne dataset made available for download with the kernel seems to be corrupted (451 KB instead of 1.99 MB; read_csv fails on it, and opening it in any text editor just shows gibberish).
  • The Melbourne dataset used in the course is a snapshot, which means you can track down the original dataset by Tony Pino.
  • It would be worth going through https://www.kaggle.com/learn/pandas to (re)learn how to clean datasets, because attempting to build a model directly on Pino's dataset fails (it contains missing values).

Basic data exploration
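
The examples below assume the Melbourne dataset has been loaded into a pandas DataFrame; a minimal sketch (the file path is a placeholder for wherever you saved the snapshot):

import pandas as pd

# Load the Melbourne housing snapshot; adjust the path to your copy
dataset = pd.read_csv("melb_data.csv")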

Display columns in dataset:

>>> dataset.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

Access a column with dot notation:

>>> dataset.Suburb
0        Abbotsford
1        Abbotsford
2        Abbotsford
3        Abbotsford
4        Abbotsford
            ...
34852    Yarraville
34853    Yarraville
34854    Yarraville
34855    Yarraville
34856    Yarraville
Name: Suburb, Length: 34857, dtype: object

This returns a pandas Series.
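
Bracket notation is equivalent, and also works for column names that aren't valid Python identifiers (e.g. names containing spaces):

# Same Series as dataset.Suburb
suburbs = dataset["Suburb"]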

Manipulating .describe() output

  • You can manipulate the output of a .describe() method call.
  • .describe() prints a table containing count, mean, std (standard deviation), min, percentile values 25%, 50%, 75%, and max.
  • You can access these values like so: .describe()[column_name]["max"]
  • A fuller example below:
import pandas as pd

home_data = pd.read_csv(...)  # load the home data (path elided)

# What is the average lot size (rounded to the nearest integer)?
avg_lot_size = round(home_data.describe()["LotArea"]["mean"])

# As of today, how old is the newest home (current year minus the year it was built)?
newest_home_age = 2019 - home_data.describe()["YearBuilt"]["max"]
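
Note that .describe() returns a regular DataFrame (statistics as the index), so .loc works as well:

# Equivalent to home_data.describe()["YearBuilt"]["max"]
newest_year = home_data.describe().loc["max", "YearBuilt"]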

Decision Tree Models

First model

  1. Select the prediction target. When working in pandas, the prediction target is usually a Series, conventionally assigned to y. You can select a column using dot notation and assign it to y:

     y = dataset.Price
     >>> y
     0              NaN
     1        1480000.0
     2        1035000.0
     3              NaN
     4        1465000.0
             ...
     34852    1480000.0
     34853     888000.0
     34854     705000.0
     34855    1140000.0
     34856    1020000.0
     Name: Price, Length: 34857, dtype: float64
    
  2. Select 'features', i.e. the columns to use as model input. These are column names, usually collected in a list, e.g. melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']. Assign the selected columns to X: X = melbourne_data[melbourne_features]

     >>> X = melbourne_data[melbourne_features]
     >>> X
            Rooms  Bathroom  Landsize  Lattitude  Longtitude
     0          2       1.0     126.0  -37.80140   144.99580
     1          2       1.0     202.0  -37.79960   144.99840
     2          2       1.0     156.0  -37.80790   144.99340
     3          3       2.0       0.0  -37.81140   145.01160
     4          3       2.0     134.0  -37.80930   144.99440
     ...      ...       ...       ...        ...         ...
     34852      4       1.0     593.0  -37.81053   144.88467
     34853      2       2.0      98.0  -37.81551   144.88826
     34854      2       1.0     220.0  -37.82286   144.87856
     34855      3       NaN       NaN        NaN         NaN
     34856      2       1.0     250.0  -37.81810   144.89351
    
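Note the missing values in the snapshot (e.g. index 34855 above, and the NaN prices in y). The Kaggle lesson sidesteps this by dropping rows with missing values before selecting the target and features:

melbourne_data = melbourne_data.dropna(axis=0)  # drop rows with any missing values

y = melbourne_data.Price
X = melbourne_data[melbourne_features]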

Building the model

  1. Import scikit-learn:

     from sklearn.tree import DecisionTreeRegressor
    
  2. Build a simple model (warning: make sure your data is clean!):

     # Define model. Specify a number for random_state to get the same results each run (it seeds the RNG)
     >>> melbourne_model = DecisionTreeRegressor(random_state=1)
    
     # Fit model. **This mutates melbourne_model**
     >>> melbourne_model.fit(X, y)
    
     DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                   max_leaf_nodes=None, min_impurity_decrease=0.0,
                   min_impurity_split=None, min_samples_leaf=1,
                   min_samples_split=2, min_weight_fraction_leaf=0.0,
                   presort=False, random_state=1, splitter='best')
    
  3. Make a prediction:

     >>> print("Making predictions for the following 5 houses:")
     >>> print(X.head())
     >>> print("The predictions are")
     >>> print(melbourne_model.predict(X.head()))
     Making predictions for the following 5 houses:
        Rooms  Bathroom  Landsize  Lattitude  Longtitude
     1      2       1.0     156.0   -37.8079    144.9934
     2      3       2.0     134.0   -37.8093    144.9944
     4      4       1.0     120.0   -37.8072    144.9941
     6      3       2.0     245.0   -37.8024    144.9993
     7      2       1.0     256.0   -37.8060    144.9954
     The predictions are
     [1035000. 1465000. 1600000. 1876000. 1636000.]

Model validation

Make sure that you validate your model with data that is not in your training dataset.

  • MAE: Mean absolute error.

  • Prediction error for each data point is

      error = actual - predicted
    
  • The MAE then is the average of the absolute values of all errors (a by-hand pandas version follows this list).

  • You can do this using the mean_absolute_error method from the scikit-learn library:

      from sklearn.metrics import mean_absolute_error
    
      predictions = melbourne_model.predict(X)
      MAE = mean_absolute_error(y, predictions)
      # NB: computing MAE against the same y used for training gives an
      # overly optimistic "in-sample" score; validate with target values
      # from a held-out dataset instead (see the next section)
    
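For reference, here is MAE computed by hand in pandas (a minimal sketch; y_actual and y_pred stand for any pair of actual/predicted values):

# Mean of the absolute prediction errors; equivalent to
# sklearn.metrics.mean_absolute_error(y_actual, y_pred)
mae = (y_actual - y_pred).abs().mean()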

Sidenote: Have always been extremely annoyed that naming conventions are inconsistent even in very established Python libraries, e.g. DecisionTreeRegressor and mean_absolute_error both come from scikit-learn (the former from the tree submodule, the latter from metrics), yet one uses CamelCase and the other snake_case. (In fairness, this follows PEP 8: DecisionTreeRegressor is a class, mean_absolute_error a function.)

Splitting your data into training and validation datasets

Use the train_test_split method from sklearn.model_selection.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

melb_model = DecisionTreeRegressor()
melb_model.fit(train_X, train_y)

predictions = melb_model.predict(val_X)

# compares actual values from the validation data set to predictions using the same features
mae = mean_absolute_error(val_y, predictions)

Question: How does this measure your model's effectiveness? Because val_X and val_y were held out of training, this MAE estimates how far off, on average, the model's predictions will be on unseen data.

Controlling tree depth

You can control tree depth indirectly with the max_leaf_nodes option when calling DecisionTreeRegressor:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

You can also control it directly with max_depth:

model = DecisionTreeRegressor(max_depth=5, random_state=0)  # e.g. a maximum of 5 levels

Documentation for sklearn.tree.DecisionTreeRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

Optimizing for fit

To optimize your model's fit, you can loop over a list of candidate max_leaf_nodes values, compute the validation MAE for each, and select the value that produces the lowest MAE.

Reusing the get_mae helper defined above:

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
args = [train_X, val_X, train_y, val_y]

for el in candidate_max_leaf_nodes:
    print("Max leaf nodes: %d\tMAE: %f" %(el, get_mae(el, *args)))
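
You can also pick the best candidate programmatically instead of reading the printout; a small sketch using the same helper and variables:

# Map each candidate leaf count to its validation MAE, then take the minimum
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)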

NOTE: You must call model.fit(any_X, any_y) before calling inspection methods such as model.get_depth(); scikit-learn raises NotFittedError otherwise.

Pick the optimal max_leaf_nodes value (eyeballed from the printout, or best_tree_size from the sketch above) and plug it into your final model:

final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)

final_model.fit(X, y)  # fit on all available data once the hyperparameter is chosen

Random Forests

  • Optimizing a tree by hand is error-prone: you have to make manual decisions and eyeball tree depth and model fit.
  • Using a random forest lets you delegate that work to your software.
  • A random forest generates several decision trees from your data, and makes a prediction by averaging the predictions of its component trees.

Import and use the RandomForestRegressor class:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
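
As with a single tree, you can tune the forest; a minimal sketch that varies the number of trees (n_estimators; the candidate values here are arbitrary):

# Validation MAE for a few forest sizes; more trees usually help,
# with diminishing returns and higher compute cost
for n in [10, 50, 100]:
    model = RandomForestRegressor(n_estimators=n, random_state=1)
    model.fit(train_X, train_y)
    print(n, mean_absolute_error(val_y, model.predict(val_X)))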

Glossary

  • Tree depth: A measure of how many splits a decision tree model makes before arriving at a prediction. E.g. depth == 1 -> 1 split -> 2 leaf nodes.
  • Overfitting: The model scores highly on the training dataset, but performs poorly on validation/real-world data.
  • Underfitting: The model fails to capture important patterns and distinctions in the data, so it performs poorly even on training data. The shallower the tree, the more likely it is to underfit.