@ylogx
Last active May 25, 2024 22:10
XGBoost Incremental Learning
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
" \"This module will be removed in 0.20.\", DeprecationWarning)\n"
]
}
],
"source": [
"# !conda install -yc conda-forge xgboost\n",
"import xgboost as xgb\n",
"import sklearn.datasets\n",
"import sklearn.metrics\n",
"import sklearn.feature_selection\n",
"import sklearn.feature_extraction\n",
"import sklearn.cross_validation\n",
"import sklearn.model_selection\n",
"import tqdm"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.6'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xgb.__version__"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['data', 'feature_names', 'DESCR', 'target']\n",
"['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'\n",
" 'B' 'LSTAT']\n"
]
}
],
"source": [
"df = sklearn.datasets.load_boston()\n",
"print(df.keys())\n",
"print(df['feature_names'])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"X = df['data']\n",
"y = df['target']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 33.8, 23.7, 20.5, 12.8, 50. , 17.4, 8.8, 17.8, 26.4, 18.2])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_tr[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One shot learning\n",
"Train with all the training data. Only one iteration over the dataset."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9.7116581444822518"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"one_shot_model = xgb.train({\n",
" 'update':'refresh',\n",
" 'process_type': 'update',\n",
" 'refresh_leaf': True,\n",
" 'silent': False,\n",
"}, dtrain=xgb.DMatrix(x_tr, y_tr))\n",
"y_pr = one_shot_model.predict(xgb.DMatrix(x_te))\n",
"sklearn.metrics.mean_squared_error(y_te, y_pr)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 13.41692734, 10.1039257 , 26.06602859, 14.74896526,\n",
" 19.46399117, 22.82827187, 21.09622765, 18.83269501,\n",
" 27.70256996, 34.56838226], dtype=float32)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pr[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One shot iterative training\n",
"Exploit the xgb_model parameter of xgb.train to iterate over the training data multiple times."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 0: 9.71165814448\n",
"Iteration 1: 7.90938546712\n",
"Iteration 2: 7.83283545287\n",
"Iteration 3: 7.90989805123\n",
"Iteration 4: 7.93978549112\n"
]
}
],
"source": [
"iteration = 5\n",
"one_shot_model_itr = None\n",
"for i in range(iteration):\n",
" one_shot_model_itr = xgb.train({\n",
" 'update':'refresh',\n",
" 'process_type': 'update',\n",
" 'refresh_leaf': True,\n",
" 'silent': False,\n",
" }, dtrain=xgb.DMatrix(x_tr, y_tr), xgb_model=one_shot_model_itr)\n",
" y_pr = one_shot_model_itr.predict(xgb.DMatrix(x_te))\n",
" print('Iteration {}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So xgboost models are able to improve when you iterate over data multiple times."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Iterative Incremental Learning"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSE itr@0: 239.680186067\n",
"MSE itr@1: 111.044669451\n",
"MSE itr@2: 57.7185741392\n",
"MSE itr@3: 35.7994472176\n",
"MSE itr@4: 26.2178656072\n",
"MSE itr@5: 20.3012679934\n",
"MSE itr@6: 17.0486683066\n",
"MSE itr@7: 14.9458533528\n",
"MSE itr@8: 13.5863551796\n",
"MSE itr@9: 12.5722084078\n",
"MSE itr@10: 12.0621747382\n",
"MSE itr@11: 11.8287598733\n",
"MSE itr@12: 11.6878301253\n",
"MSE itr@13: 11.4897400114\n",
"MSE itr@14: 11.4627225743\n",
"MSE itr@15: 11.5417849176\n",
"MSE itr@16: 11.4022054245\n",
"MSE itr@17: 11.2675483456\n",
"MSE itr@18: 11.3866442707\n",
"MSE itr@19: 11.3504530668\n",
"MSE itr@20: 11.3818182553\n",
"MSE itr@21: 11.5099846894\n",
"MSE itr@22: 11.5365974758\n",
"MSE itr@23: 11.7541341329\n",
"MSE itr@24: 11.9677214525\n",
"MSE at the end: 11.9677214525\n"
]
}
],
"source": [
"batch_size = 50\n",
"iterations = 25\n",
"model = None\n",
"for i in range(iterations):\n",
" for start in range(0, len(x_tr), batch_size):\n",
" model = xgb.train({\n",
" 'learning_rate': 0.007,\n",
" 'update':'refresh',\n",
" 'process_type': 'update',\n",
" 'refresh_leaf': True,\n",
" #'reg_lambda': 3, # L2\n",
" 'reg_alpha': 3, # L1\n",
" 'silent': False,\n",
" }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)\n",
"\n",
" y_pr = model.predict(xgb.DMatrix(x_te))\n",
" #print(' MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))\n",
" print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))\n",
"\n",
"y_pr = model.predict(xgb.DMatrix(x_te))\n",
"print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"MSE is decreasing with each iteration. Hence, the xgboost model is learning incrementally."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@StudyExchange

Referring to "How can I implement incremental training for xgboost?", I gave it a try, but the results leave me puzzled.

%%time
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target

X = pd.DataFrame(X, columns=features)
y = pd.Series(y, index=X.index)


# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx,test_idx in rs.split(X):  # this looks silly
    pass

train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

num_round = 15
verbose_eval = 5
watch_list = [(xg_test, 'xg_test')]

params = {'objective': 'reg:linear', 'verbose': False}
print('full train\t'); 
model_0 = xgb.train(params, xg_train_0, num_round, watch_list, verbose_eval=verbose_eval)
print('model 1 \t'); 
model_1 = xgb.train(params, xg_train_1, num_round, watch_list, verbose_eval=verbose_eval)
model_1.save_model('model_1.model')
print('model 2 \t'); 
model_2_v1 = xgb.train(params, xg_train_2, num_round, watch_list, verbose_eval=verbose_eval)
print('model 1+2\t, this logs show continue train, but got a test score same to model 1?'); 
model_2_v2 = xgb.train(params, xg_train_2, num_round, watch_list, verbose_eval=verbose_eval, xgb_model=model_1)

params.update({
    'process_type': 'update',
    'updater': 'refresh',
    'refresh_leaf': True,
})
print('model 1+update2\t, this logs do not show continue train, but got a best test score?'); 
model_2_v2_update = xgb.train(params, xg_train_2, num_round, watch_list, verbose_eval=verbose_eval, xgb_model=model_1)

print('full train\t', mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t', mse(model_1.predict(xg_test), y_test))  
print('model 2 \t', mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t', mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t', mse(model_2_v2_update.predict(xg_test), y_test))  # "after"

Output:

full train	
[0]	xg_test-rmse:16.9311
[5]	xg_test-rmse:5.36819
[10]	xg_test-rmse:4.41758
[14]	xg_test-rmse:4.28357
model 1 	
[0]	xg_test-rmse:17.078
[5]	xg_test-rmse:6.0592
[10]	xg_test-rmse:5.04216
[14]	xg_test-rmse:4.94968
model 2 	
[0]	xg_test-rmse:16.9631
[5]	xg_test-rmse:6.08084
[10]	xg_test-rmse:5.18633
[14]	xg_test-rmse:5.12085
model 1+2	, this logs show continue train, but got a test score same to model 1?
[0]	xg_test-rmse:4.72028
[5]	xg_test-rmse:4.79626
[10]	xg_test-rmse:4.96232
[14]	xg_test-rmse:4.96374
model 1+update2	, this logs do not show continue train, but got a best test score?
[0]	xg_test-rmse:17.0353
[5]	xg_test-rmse:5.0438
[10]	xg_test-rmse:3.9661
[14]	xg_test-rmse:3.91295
full train	 18.348929164523298
model 1 	 24.499370663166886
model 2 	 26.223108502105553
model 1+2	 24.63867891225536
model 1+update2	 15.311159054248336
CPU times: user 17.1 s, sys: 4.19 s, total: 21.3 s
Wall time: 3.01 s
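
A likely explanation for the two puzzling results, sketched under the assumption that the booster objects above are still in scope: passing xgb_model appends new boosting rounds on top of model_1, so the eval log starts from an already fitted model, while process_type='update' with updater='refresh' keeps the 15 existing trees and only refreshes their node statistics and leaf values on xg_train_2. Counting trees makes the difference visible:

print(len(model_1.get_dump()))            # 15 trees built on the first half of the training data
print(len(model_2_v2.get_dump()))         # 30 trees: model_1's trees plus 15 newly boosted rounds
print(len(model_2_v2_update.get_dump()))  # 15 trees: same structure as model_1, refreshed on xg_train_2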

@Venkatesh-Sreedharan

Hey, I am not able to replicate this code as it is. How do you manage to run incremental iterative learning for the first time with the model defined as 'None'? The xgboost {process_type: 'update'} parameter does not allow building new trees, so the very first iteration breaks because there are no existing trees to update. Am I missing something here? Please do clarify. This comes up even for one-shot iterative learning.
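
One way around the missing-trees failure, as a rough sketch (not the gist author's code; it assumes a reasonably recent xgboost and uses fetch_california_housing as a stand-in for the removed Boston data): grow the trees with an ordinary training call on the first batch, and only switch to the refresh-only update on later batches.

import xgboost as xgb
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.fetch_california_housing(return_X_y=True)
x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(X, y, random_state=0)

batch_size = 2000
num_round = 50
model = None
for start in range(0, len(x_tr), batch_size):
    dtrain = xgb.DMatrix(x_tr[start:start + batch_size], y_tr[start:start + batch_size])
    if model is None:
        # first batch: ordinary boosting, so the booster actually has trees
        model = xgb.train({'max_depth': 3}, dtrain, num_boost_round=num_round)
    else:
        # later batches: refresh the existing trees' statistics and leaves only;
        # num_boost_round must not exceed the number of trees already in the model
        model = xgb.train({
            'process_type': 'update',
            'updater': 'refresh',
            'refresh_leaf': True,
        }, dtrain, num_boost_round=num_round, xgb_model=model)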

@DataSphereX

Hi,

I am also unable to adapt this approach to my use case, for example a telecom churn prediction: as and when a new customer is added, how do I reuse the old model and continue training it instead of retraining on the complete data?

@karelin

karelin commented Oct 2, 2019

Just tried the notebook (xgboost version 0.90). Unfortunately, the call to xgb.train (for one shot learning) raises the error "XGBoostError: [13:44:55] src/gbm/gbtree.cc:278: Check failed: model_.trees.size() < model_.trees_to_update.size() (0 vs. 0) :"

@trivialfis

I ran the gist on the master branch and it works fine. It should be fixed by the new model IO routines.

@luisvivasg

I also got the same error as Karelin, and I have the same question as Venkatesh:

Check failed: model_.trees.size() < model_.trees_to_update.size() (0 vs. 0) :

I saw somewhere that you need to pass the number of trees created in the first iteration, but I cannot get that number, and it is never set in the code above.
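
For what it's worth, that tree count can be read off an existing booster. A small sketch, assuming a Booster object named model and a reasonably recent xgboost (num_boosted_rounds appeared around 1.4; the length of the text dump gives the same count on older releases for single-output models):

n_trees = model.num_boosted_rounds()   # recent xgboost releases
n_trees = len(model.get_dump())        # equivalent count on older releases
# this is the value that num_boost_round may not exceed when training with
# process_type='update', since refresh only rewrites trees that already exist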

@c3-varun

Same issue on XGBoost 1.4.0. Has anyone figured this out yet?

@pjbhaumik

Hi,
I have found the solution. Per the xgboost documentation, the parameter 'update' should be 'updater'; this is a mistake in the notebook above. If you fix it, then you will see the right results.

model = xgb.train({
'learning_rate': 0.007,
'updater':'refresh',
'process_type': 'update',
'refresh_leaf': True,
#'reg_lambda': 3, # L2
'reg_alpha': 3, # L1
'silent': False,
}, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)

@marymlucas

marymlucas commented Jul 14, 2023

Thank you for this gist. How can we implement this in a pipeline?

I am unable to test on the Boston dataset as it's been removed from sklearn, but on a different dataset I get a mismatch in the number of columns. Even though I use the same pipeline, the saved model seems to have one fewer feature than the new training data and I am unable to figure out why.

Update: disregard the column mismatch, I figured it out. I was using handle_unknown='ignore' in OneHotEncoder, but one of the features has too few samples of a particular category, hence the mismatch.
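
On the pipeline question, a rough sketch (my own, with hypothetical column names, an X_all frame, and a batches iterable standing in for real data): fit the preprocessing once on data that covers every category, then keep only the booster incremental. Fixing the encoder up front also avoids the column-count mismatch that comes from re-fitting it on each batch.

import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['plan_type'])],
    remainder='passthrough')
preprocess.fit(X_all)   # fit the encoder once so the feature layout stays fixed

booster = None
for X_batch, y_batch in batches:   # any iterable of (features, target) chunks
    dtrain = xgb.DMatrix(preprocess.transform(X_batch), label=y_batch)
    booster = xgb.train({'max_depth': 3}, dtrain,
                        num_boost_round=20,
                        xgb_model=booster)   # continue boosting from the previous batch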

@Jason2Brownlee

Great example!

Few people know that xgboost is able to perform incremental learning by adding boosting rounds.
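
A minimal sketch of that pattern, assuming the train/test split from the notebook above and a reasonably recent xgboost:

import xgboost as xgb

dtrain = xgb.DMatrix(x_tr, label=y_tr)
booster = xgb.train({'max_depth': 3}, dtrain, num_boost_round=50)

# later: add more boosting rounds (possibly on new data) by passing the old booster back in
booster = xgb.train({'max_depth': 3}, dtrain, num_boost_round=25, xgb_model=booster)
print(booster.num_boosted_rounds())   # 75 on recent releases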
