title | layout
--- | ---
Notes on xgboost: Extreme Gradient Boosting Machine | default
- Computation in C++
- tree-based models
- easy to use, install, R/python interface
- Automatic parallel computation on a single machine
- Tunable parameters
- requires a sparse matrix (dgCMatrix), which is more memory efficient
- need to provide input features, target, objective, number of iterations
- the trained model can score / predict on new data
- cross validation measures a model's predictive power
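As a concrete illustration of the list above, here is a minimal sketch using xgboost's Python interface (the data is random placeholder data and the parameter values are arbitrary, not recommendations):

```python
import numpy as np
import xgboost as xgb
from scipy.sparse import csr_matrix

# Sparse feature matrix (the R interface uses dgCMatrix; scipy sparse works in Python)
X = csr_matrix(np.random.rand(100, 10))
y = np.random.randint(0, 2, size=100)

dtrain = xgb.DMatrix(X, label=y)           # input features + target

params = {
    "objective": "binary:logistic",        # objective / loss function
    "eval_metric": "logloss",
    "max_depth": 3,
    "eta": 0.1,
    "nthread": 4,                          # parallel computation on one machine
}

bst = xgb.train(params, dtrain, num_boost_round=50)   # number of iterations

# Score / predict on new data
X_new = csr_matrix(np.random.rand(5, 10))
preds = bst.predict(xgb.DMatrix(X_new))

# Cross-validation to measure the model's predictive power
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5)
```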
- To train a model we need to optimize a loss function:
- Root Mean Square Error (RMSE) is typically used for regression
- Logloss is for binary/logistic classification
- mlogloss is used for multiclass classification (multinomial logistic)
$$ F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} \ln(p_{ij}) = \sum_{j=1}^{M} \left( -\frac{1}{N}\sum_{i=1}^{N} y_{ij} \ln(p_{ij}) \right) = \sum_{j=1}^{M} F_j $$
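These metrics are straightforward to compute directly; the functions below are my own small numpy illustrations mirroring the definitions above (not xgboost code):

```python
import numpy as np

def rmse(y, y_pred):
    # root mean square error for regression
    return np.sqrt(np.mean((y - y_pred) ** 2))

def logloss(y, p, eps=1e-15):
    # y in {0, 1}, p = predicted probability of class 1
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mlogloss(Y, P, eps=1e-15):
    # Y: one-hot labels of shape (N, M), P: predicted probabilities (N, M)
    P = np.clip(P, eps, 1 - eps)
    return -np.mean(np.sum(Y * np.log(P), axis=1))
```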
- Regularization is another important part of the model. A good regularization term controls the complexity of the model, which prevents over-fitting.
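In xgboost these regularization knobs are exposed as parameters; a sketch of the main ones (the values are arbitrary examples):

```python
params = {
    "lambda": 1.0,            # L2 penalty on leaf weights (alias: reg_lambda)
    "alpha": 0.0,             # L1 penalty on leaf weights (alias: reg_alpha)
    "gamma": 0.1,             # minimum loss reduction required to make a further split
    "max_depth": 4,           # caps tree depth, another way to limit complexity
    "min_child_weight": 1.0,  # minimum sum of instance weight (hessian) in a child
}
```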
- The training objective is: objective = loss function + regularization
$$ obj(\Theta) = L(\Theta) + \Omega(\Theta) $$
where the loss function $L$ controls the predictive power and the regularization term $\Omega$ controls the simplicity (or rather, the complexity) of the model.
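For reference, the xgboost documentation writes this objective with the loss summed over the $N$ training examples and the regularization summed over the $K$ trees, with a tree-complexity term of the following standard form (this expansion is not spelled out in the notes above):

$$ obj(\Theta) = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} $$

where $T$ is the number of leaves of a tree and $w_j$ are its leaf weights.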
In xgboost, gradient descent is used to optimize the objective. Gradient descent is a first-order optimization algorithm that takes iterative steps toward a local minimum. The performance can be improved by considering both the first- and second-order derivatives.
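Concretely, the xgboost derivation approximates the objective at boosting round $t$ with a second-order Taylor expansion of the loss around the previous prediction (standard notation, not defined in the notes above):

$$ obj^{(t)} \approx \sum_{i=1}^{N} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) $$

with $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^{2}_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$; this is the sense in which both first- and second-order information is used.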
- how can we define a tree that improves the prediction along the gradient?
- decision tree (binary splits): each data point flows to one of the leaves, following the decision at each node
- internal node: each internal node splits the flow of data points by one of the features (variables)
- the condition on the edge specifies which data can flow through
- leaves: data points reach a leaf and are assigned a weight, which is the prediction
- how to find a good structure?
- how to choose features to split?
- when to split?
- how to assign a prediction score?
- In each split we want to greedily find the best point that optimizes the objective (a simplified sketch follows this list) by:
- sorting the feature values
- scanning for the best splitting point
- choosing the best feature
- calculating the gain of each candidate split (classical decision trees use the Gini index or entropy; in xgboost the gain is derived from the gradient statistics)
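In the xgboost derivation, a leaf with gradient sums $G$ and $H$ has structure score $\frac{G^2}{H+\lambda}$ and optimal weight $w^{\ast} = -\frac{G}{H+\lambda}$; the gain of a split is half the improvement in structure score minus the split penalty $\gamma$. The code below is my own simplified illustration of the greedy scan for a single feature (real xgboost also handles missing values, approximate quantile splits, column subsampling, and more):

```python
import numpy as np

def leaf_weight(g_sum, h_sum, reg_lambda=1.0):
    """Optimal prediction score (weight) assigned to a leaf."""
    return -g_sum / (h_sum + reg_lambda)

def best_split(x, g, h, reg_lambda=1.0, gamma=0.0):
    """Greedily scan one feature for the split point with the highest gain.

    x: feature values; g, h: first/second-order gradient statistics (numpy arrays).
    """
    order = np.argsort(x)                  # 1. sort by the feature value
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def score(gs, hs):                     # structure score of a leaf
        return gs * gs / (hs + reg_lambda)

    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    for i in range(len(x) - 1):            # 2. scan every candidate split point
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[i + 1]) / 2.0
    return best_gain, best_threshold
```

Running this scan over every feature and keeping the feature/threshold pair with the largest gain corresponds to "choosing the best feature" above.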
Excellent notes on the algorithm!
I note that even knowing the algorithm's internals does not tell you how to configure it; in practice we still end up tuning xgboost with a grid search.
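For what it's worth, that grid search is usually run through the scikit-learn wrapper; a minimal sketch (the parameter grid is just an example, and X_train / y_train stand in for your own data):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 300],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
# search.fit(X_train, y_train)    # fit on your own training data
# print(search.best_params_)      # best configuration found by the grid search
```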