xg_boost

Tree boosting

Regularized objective

$$\hat y_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F}$$

where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ is the space of regression trees.

$q$ maps each example to its corresponding leaf index and $w_i$ is the weight of leaf $i$. Unlike decision trees, a regression tree carries a continuous score $w_i$ on each leaf. To predict, we sum the scores of the corresponding leaves across all the weak learners.
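A minimal sketch of this prediction step, where each tree is represented by a leaf-index function `q` and a weight array `w` (an illustrative structure, not XGBoost's internal one):

```python
def predict(x, trees):
    """Ensemble prediction: sum the leaf score w_{q(x)} of each tree.

    `trees` is a list of (q, w) pairs where q(x) returns the leaf index
    of example x and w is the array of leaf weights.
    """
    return sum(w[q(x)] for q, w in trees)
```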

To learn them we minimize the regularized objective:

$$\mathcal{L}(\phi) = \sum_i l(\hat y_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

where $l$ is a differentiable convex loss function and $T$ the number of leaves of a tree. The second term $\Omega$ penalizes model complexity.

Gradient Tree Boosting

Since this ensemble model includes functions as parameters, it cannot be optimized with traditional methods in Euclidean space. The model is instead trained in an additive manner (greedily adding the weak learner that most improves the model):

$$\mathcal{L}^{(t)} = \sum_i l\left(y_i, \hat y_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

See notes on [[gradient_boosting]]
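A rough sketch of this additive loop, assuming a hypothetical `fit_tree(X, g, h)` that builds a single regression tree from the gradient statistics defined in the next paragraph (not the library API):

```python
import numpy as np

def boost(X, y, n_rounds, grad, hess, eta=0.3):
    """Additive training: each round fits a new tree to the gradient
    statistics of the current prediction and adds it to the ensemble."""
    y_pred = np.zeros_like(y, dtype=float)
    trees = []
    for t in range(n_rounds):
        g = grad(y, y_pred)        # first-order gradients of the loss
        h = hess(y, y_pred)        # second-order gradients of the loss
        tree = fit_tree(X, g, h)   # hypothetical tree learner
        y_pred += eta * tree.predict(X)   # eta = shrinkage, see below
        trees.append(tree)
    return trees
```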

Taking a second-order Taylor approximation of the loss term with respect to the function, we can quickly optimize the (convex) objective by setting its gradient to zero and find the optimal weights given a tree structure $q$:

With $g_i = \partial_{\hat y^{(t-1)}} l(y_i, \hat y^{(t-1)})$, $h_i = \partial^2_{\hat y^{(t-1)}} l(y_i, \hat y^{(t-1)})$ and $I_j$ the set of examples assigned to leaf $j$, the optimal weight of leaf $j$ and the resulting objective value are:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^T \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

We use this score as a quality measure of tree structure $q$.
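A sketch of this computation for a fixed assignment of examples to leaves, with `g`, `h` holding the per-example gradient statistics and `leaf_of` an illustrative array mapping each example to its leaf:

```python
import numpy as np

def structure_score(g, h, leaf_of, n_leaves, lam, gamma):
    """Optimal leaf weights and structure score for a fixed tree structure."""
    G = np.bincount(leaf_of, weights=g, minlength=n_leaves)  # G_j = sum of g_i in leaf j
    H = np.bincount(leaf_of, weights=h, minlength=n_leaves)  # H_j = sum of h_i in leaf j
    w_opt = -G / (H + lam)                                   # w_j* = -G_j / (H_j + lambda)
    score = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * n_leaves
    return w_opt, score
```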

Since we can't enumerate all tree structures we start from a single leaf and add branches in a greedy fashion.

Loss reduction after a split is given by:

$$\mathcal{L}_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$$

where $I = I_L \cup I_R$ are the examples in the node before the split.
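The same quantity in code, with `left` a boolean mask picking the examples that go to the left child (illustrative names, matching the formula above):

```python
import numpy as np

def split_gain(g, h, left, lam, gamma):
    """Loss reduction of splitting a node into `left` and its complement;
    a positive gain means the split improves the objective."""
    G, H = g.sum(), h.sum()
    G_L, H_L = g[left].sum(), h[left].sum()
    G_R, H_R = G - G_L, H - H_L
    return 0.5 * (G_L ** 2 / (H_L + lam)
                  + G_R ** 2 / (H_R + lam)
                  - G ** 2 / (H + lam)) - gamma
```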

Shrinkage

Scales newly added weights by a factor $\eta$ after each step of tree boosting. Similar to a learning rate, it reduces the influence of each individual tree and leaves space for future trees to improve the model.

Column (feature) subsampling prevents overfitting even more than row subsampling.
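In the xgboost library these knobs are exposed as `eta`, `colsample_bytree` and `subsample`; a minimal usage sketch with toy data (parameter values are arbitrary):

```python
import numpy as np
import xgboost as xgb

# toy data so the snippet runs; replace with real features/labels
X_train = np.random.rand(100, 5)
y_train = np.random.rand(100)

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {
    "eta": 0.1,               # shrinkage applied to each new tree
    "colsample_bytree": 0.8,  # column (feature) subsampling per tree
    "subsample": 0.8,         # row subsampling
    "lambda": 1.0,            # L2 penalty on leaf weights
    "gamma": 0.0,             # minimum loss reduction required to split
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```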

Split finding

(Basic) Exact Greedy Algorithm

Enumerates all possible splits over all the features. Computationally demanding.
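A simplified sketch of the exact greedy scan for a single feature: sort the examples by feature value, sweep the split point left to right, and keep running sums of $g$ and $h$ (illustrative code, ties between feature values ignored):

```python
import numpy as np

def best_split_exact(x, g, h, lam, gamma):
    """Exact greedy split search along one feature `x`."""
    order = np.argsort(x)
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_threshold = -np.inf, None
    for i in order[:-1]:          # the last position leaves the right child empty
        G_L += g[i]
        H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam)
                      + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, x[i]
    return best_gain, best_threshold
```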

Approximate algorithm

The exact greedy algorithm is not feasible when the data does not fit in memory (or in a distributed setting). The approximate algorithm instead proposes candidate split points from quantiles of each feature's distribution and aggregates the gradient statistics over the resulting buckets to evaluate splits.
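A sketch of the candidate-proposal step: pick split candidates from quantiles of the feature and only score those (the paper uses a weighted quantile sketch; plain `np.percentile` here is a simplification):

```python
import numpy as np

def candidate_splits(x, n_buckets=32):
    """Propose candidate split points as approximate quantiles of the
    feature, instead of scanning every distinct feature value."""
    qs = np.linspace(0, 100, n_buckets + 1)[1:-1]   # interior percentile levels
    return np.unique(np.percentile(x, qs))
```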