title | layout
--- | ---
Notes on xgboost: Extreme Gradient Boosting Machine | default
- Computation in C++
- tree-based models
- easy to use, install, R/python interface
- Automatic parallel computation on a single machine
- Tunable parameters
- requires a sparse matrix (dgCMatrix), which is more memory efficient
- need to provide input features, target, objective, number of iterations
- the trained model can score / predict on new data
- cross validation measures a model's predictive power
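As a concrete illustration of the list above, here is a minimal sketch using xgboost's Python interface (the data is random placeholder data and the parameter values are arbitrary, not recommendations):

```python
import numpy as np
import xgboost as xgb
from scipy.sparse import csr_matrix

# Sparse feature matrix (the R interface uses dgCMatrix; scipy sparse works in Python)
X = csr_matrix(np.random.rand(100, 10))
y = np.random.randint(0, 2, size=100)

dtrain = xgb.DMatrix(X, label=y)           # input features + target

params = {
    "objective": "binary:logistic",        # objective / loss function
    "eval_metric": "logloss",
    "max_depth": 3,
    "eta": 0.1,
    "nthread": 4,                          # parallel computation on one machine
}

bst = xgb.train(params, dtrain, num_boost_round=50)   # number of iterations

# Score / predict on new data
X_new = csr_matrix(np.random.rand(5, 10))
preds = bst.predict(xgb.DMatrix(X_new))

# Cross-validation to measure the model's predictive power
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5)
```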
- To train a model we need to optimize a loss function:
- Root Mean Square Error (RMSE) is typically used for regression
- Logloss is for binary/logistic classification
- mlogloss is used for multiclass classification (multinomial logistic)
$$ F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} \ln(p_{ij}) = \sum_{j=1}^{M} \left( -\frac{1}{N}\sum_{i=1}^{N} y_{ij} \ln(p_{ij}) \right) = \sum_{j=1}^{M} F_j $$
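These metrics are straightforward to compute directly; the functions below are my own small numpy illustrations mirroring the definitions above (not xgboost code):

```python
import numpy as np

def rmse(y, y_pred):
    # root mean square error for regression
    return np.sqrt(np.mean((y - y_pred) ** 2))

def logloss(y, p, eps=1e-15):
    # y in {0, 1}, p = predicted probability of class 1
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mlogloss(Y, P, eps=1e-15):
    # Y: one-hot labels of shape (N, M), P: predicted probabilities (N, M)
    P = np.clip(P, eps, 1 - eps)
    return -np.mean(np.sum(Y * np.log(P), axis=1))
```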
- Regularization is another important part of the model. A good regularization term controls the complexity of the model, which prevents over-fitting.
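In xgboost these regularization knobs are exposed as parameters; a sketch of the main ones (the values are arbitrary examples):

```python
params = {
    "lambda": 1.0,            # L2 penalty on leaf weights (alias: reg_lambda)
    "alpha": 0.0,             # L1 penalty on leaf weights (alias: reg_alpha)
    "gamma": 0.1,             # minimum loss reduction required to make a further split
    "max_depth": 4,           # caps tree depth, another way to limit complexity
    "min_child_weight": 1.0,  # minimum sum of instance weight (hessian) in a child
}
```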
- The training objective is: objective = loss function + regularization
$$ obj(\Theta) = L(\Theta) + \Omega(\Theta) $$
where the loss function $L$ controls the predictive power and the regularization term $\Omega$ controls the simplicity (or rather, the complexity) of the model.
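For reference, the xgboost documentation writes this objective with the loss summed over the $N$ training examples and the regularization summed over the $K$ trees, with a tree-complexity term of the following standard form (this expansion is not spelled out in the notes above):

$$ obj(\Theta) = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} $$

where $T$ is the number of leaves of a tree and $w_j$ are its leaf weights.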
In xgboost, gradient descent is used to optimize the objective. Gradient descent is a first-order optimization algorithm that takes iterative steps toward a local minimum. The performance can be improved by considering both the first- and second-order derivatives.
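Concretely, the xgboost derivation approximates the objective at boosting round $t$ with a second-order Taylor expansion of the loss around the previous prediction (standard notation, not defined in the notes above):

$$ obj^{(t)} \approx \sum_{i=1}^{N} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) $$

with $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^{2}_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$; this is the sense in which both first- and second-order information is used.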
- how can we define a tree that improves the prediction along the gradient?
- decision tree (binary splits): each data point flows to one of the leaves, following the decision at each node
- internal node: each internal node splits the flow of data points by one of the features (variables)
- the condition on the edge specifies which data can flow through
- leaves: data points reach a leaf and are assigned a weight, which is the prediction
- how to find a good structure?
- how to choose features to split?
- when to split?
- how to assign a prediction score?
- In each split we want to greedily find the best point that optimizes the objective (a simplified sketch follows this list) by:
- sorting the feature values
- scanning for the best splitting point
- choosing the best feature
- calculating the gain of each candidate split (classical decision trees use the Gini index or entropy; in xgboost the gain is derived from the gradient statistics)
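In the xgboost derivation, a leaf with gradient sums $G$ and $H$ has structure score $\frac{G^2}{H+\lambda}$ and optimal weight $w^{\ast} = -\frac{G}{H+\lambda}$; the gain of a split is half the improvement in structure score minus the split penalty $\gamma$. The code below is my own simplified illustration of the greedy scan for a single feature (real xgboost also handles missing values, approximate quantile splits, column subsampling, and more):

```python
import numpy as np

def leaf_weight(g_sum, h_sum, reg_lambda=1.0):
    """Optimal prediction score (weight) assigned to a leaf."""
    return -g_sum / (h_sum + reg_lambda)

def best_split(x, g, h, reg_lambda=1.0, gamma=0.0):
    """Greedily scan one feature for the split point with the highest gain.

    x: feature values; g, h: first/second-order gradient statistics (numpy arrays).
    """
    order = np.argsort(x)                  # 1. sort by the feature value
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def score(gs, hs):                     # structure score of a leaf
        return gs * gs / (hs + reg_lambda)

    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    for i in range(len(x) - 1):            # 2. scan every candidate split point
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[i + 1]) / 2.0
    return best_gain, best_threshold
```

Running this scan over every feature and keeping the feature/threshold pair with the largest gain corresponds to "choosing the best feature" above.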
Excellent notes on the algorithm!
I note that even knowing the algorithm's internals does not tell you how to configure it; in practice we still end up tuning xgboost with a grid search.
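For what it's worth, that grid search is usually run through the scikit-learn wrapper; a minimal sketch (the parameter grid is just an example, and X_train / y_train stand in for your own data):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 300],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
# search.fit(X_train, y_train)    # fit on your own training data
# print(search.best_params_)      # best configuration found by the grid search
```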