## Tree boosting

**Regularized objective**

$\hat y_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F}$, where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ with $q : \mathbb{R}^m \to \{1, \dots, T\}$ and $w \in \mathbb{R}^T$

$q$ maps each example to its corresponding leaf index, and $w_i$ is the weight of leaf $i$. Unlike decision trees, a regression tree carries a continuous score $w_i$ on each leaf. The final prediction sums the scores of the corresponding leaves across all the weak learners.
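A minimal sketch of this representation: a tree is just a leaf-index function $q$ plus a weight vector $w$, and $f(x) = w_{q(x)}$. The stump `q` below is a hypothetical example, not part of the original notes.

```python
import numpy as np

def tree_predict(x, q, w):
    """f(x) = w[q(x)]: map the example to a leaf index, read off its score."""
    return w[q(x)]

# Hypothetical depth-1 tree (stump): leaf 0 if feature 0 < 0, else leaf 1.
q = lambda x: int(x[0] >= 0.0)
w = np.array([-1.0, 2.0])   # continuous leaf scores, not class labels
```

An ensemble prediction is then the sum of `tree_predict` over all $K$ trees.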

To learn them we minimize the regularized objective:

$\mathcal{L}(\phi) = \sum_i l(\hat y_i, y_i) + \sum_k \Omega(f_k), \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$

where $l$ is a differentiable convex loss function and $T$ the number of leaves. The second term penalizes model complexity.
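The objective can be sketched directly. This is a minimal illustration assuming squared error for $l$; `regularized_objective` and its arguments are names chosen here, and each tree is represented only by its leaf-weight vector.

```python
import numpy as np

def regularized_objective(y, y_hat, leaf_weights_per_tree, gamma=1.0, lam=1.0):
    """Training loss (squared error as the convex l) plus the complexity
    penalty Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, summed over trees.
    `leaf_weights_per_tree`: one 1-D array of leaf weights per tree."""
    loss = np.sum((y_hat - y) ** 2)
    penalty = sum(gamma * len(w) + 0.5 * lam * np.dot(w, w)
                  for w in leaf_weights_per_tree)
    return loss + penalty
```

With perfect predictions the loss term vanishes and only the complexity penalty remains, which is exactly what the regularizer is meant to trade off against.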

**Gradient Tree Boosting**

Since this ensemble model includes functions as parameters, it cannot be optimized with traditional methods in Euclidean space. Instead, the model is trained in an additive manner: at each step $t$ we greedily add the weak learner $f_t$ that most improves the model:

$\mathcal{L}^{(t)} = \sum_i l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)$

See notes on [[gradient_boosting]]
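The additive training loop can be sketched as follows. `fit_additive` and `stump` are illustrative names assumed here; the constant-predictor `stump` stands in for a real regression tree fitted to minimize $l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)$.

```python
import numpy as np

def fit_additive(X, y, fit_weak_learner, n_rounds=10):
    """Greedy additive training: each round fits a new weak learner against
    the current predictions, then adds its output to them."""
    y_hat = np.zeros_like(y, dtype=float)
    learners = []
    for _ in range(n_rounds):
        f_t = fit_weak_learner(X, y, y_hat)   # best improvement given y_hat
        y_hat += f_t(X)
        learners.append(f_t)
    return learners, y_hat

def stump(X, y, y_hat):
    """Toy weak learner: a single leaf predicting the mean residual."""
    c = float(np.mean(y - y_hat))
    return lambda Xq: np.full(len(Xq), c)
```

After the first round the toy learner already predicts the mean; subsequent rounds add nothing, which shows why a single-leaf learner is too weak on its own.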

Taking a second-order approximation of the loss term with respect to the function, we can quickly optimize the (convex) objective by setting its gradient to zero, which yields the optimal weights for a given tree structure $q$.

With $g_i = \partial_{\hat y^{(t-1)}} l(y_i, \hat y^{(t-1)})$, $h_i = \partial^2_{\hat y^{(t-1)}} l(y_i, \hat y^{(t-1)})$ and $I_j = \{i \mid q(x_i) = j\}$ the set of examples in leaf $j$:

$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$

$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^T \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$

We use this score as a quality measure of tree structure $q$.
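The per-leaf computation is small enough to write out. A minimal sketch, assuming `g` and `h` hold the first- and second-order gradient statistics of the examples in one leaf (function and argument names are chosen here):

```python
import numpy as np

def leaf_weight_and_score(g, h, lam=1.0):
    """Optimal weight w* = -G / (H + lambda) for one leaf, and that leaf's
    contribution -G^2 / (2 * (H + lambda)) to the structure score."""
    G, H = np.sum(g), np.sum(h)
    w_star = -G / (H + lam)
    score = -0.5 * G ** 2 / (H + lam)
    return w_star, score
```

Summing `score` over all leaves and adding $\gamma T$ gives the structure score used to compare candidate trees.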

Since we can't enumerate all tree structures we start from a single leaf and add branches in a greedy fashion.

Loss reduction after a split is given by:

$\mathcal{L}_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$

where $g_i$ and $h_i$ are the first- and second-order gradients of the loss, and $I_L$, $I_R$ are the example sets of the left and right child, with $I = I_L \cup I_R$.
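As code, the gain is the parent's structure-score term subtracted from the children's, minus the $\gamma$ cost of the extra leaf. A minimal sketch with assumed names, where `left_mask` selects the examples going to the left child:

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=1.0):
    """Loss reduction from splitting a node into left/right children."""
    def term(gs, hs):
        return np.sum(gs) ** 2 / (np.sum(hs) + lam)
    gL, hL = g[left_mask], h[left_mask]
    gR, hR = g[~left_mask], h[~left_mask]
    return 0.5 * (term(gL, hL) + term(gR, hR) - term(g, h)) - gamma
```

A negative gain means the split does not pay for its $\gamma$ penalty, which is how the formula acts as built-in pruning.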

**Shrinkage**

Scales newly added weights by a factor $\eta$ after each step of tree boosting. Similar to a learning rate, it reduces the influence of each individual tree and leaves room for future trees to improve the model.
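A toy illustration of the effect (names and setup assumed here): each round a "tree" fits the residual exactly, but only $\eta$ of its prediction is added, so the residual shrinks by a factor $(1 - \eta)$ per round instead of vanishing in one step.

```python
import numpy as np

def boost_constant(y, n_rounds, eta):
    """Illustrative shrinkage: each round's learner predicts the full
    residual, but only eta of it is added to the model."""
    y_hat = np.zeros_like(y, dtype=float)
    for _ in range(n_rounds):
        y_hat += eta * (y - y_hat)   # shrunken update
    return y_hat
```

After $n$ rounds the prediction is $y\,(1 - (1-\eta)^n)$: the model still converges, just more conservatively.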

**Column (feature) subsampling** prevents overfitting even more than row subsampling.

## Split finding

**(Basic) Exact Greedy Algorithm**

Enumerates all possible splits over all the features. Computationally demanding.
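For a single feature, the standard trick is to sort by feature value once and sweep left-to-right, maintaining running gradient sums so every threshold is evaluated in $O(1)$. A minimal sketch with assumed names:

```python
import numpy as np

def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split finding on one feature: sort by value, sweep
    accumulating G_L, H_L, and keep the threshold with the best gain."""
    order = np.argsort(x)
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(xs) - 1):
        GL += gs[i]; HL += hs[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam)
                      - G ** 2 / (H + lam)) - gamma
        if xs[i] != xs[i + 1] and gain > best_gain:  # never split between ties
            best_gain, best_thr = gain, 0.5 * (xs[i] + xs[i + 1])
    return best_thr, best_gain
```

Repeating this sweep for every feature is what makes the exact algorithm expensive on large data.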

**Approximate algorithm**

The exact greedy algorithm is not feasible when the data does not fit in memory. Instead, the approximate algorithm proposes candidate split points from percentiles of the feature distribution and searches only among those.
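The candidate-proposal step can be sketched in a few lines (an unweighted simplification; names assumed here):

```python
import numpy as np

def candidate_splits(x, n_buckets=4):
    """Propose candidate thresholds at evenly spaced percentiles of the
    feature values, instead of considering every distinct value."""
    qs = np.linspace(0, 100, n_buckets + 1)[1:-1]   # interior percentiles
    return np.percentile(x, qs)
```

Gradient statistics are then aggregated per bucket, and the best split is chosen among the candidates only.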