Gradient Boosting

Greedy Function Approximation: A Gradient Boosting Machine, Jerome Friedman (1999)

See original paper

One of my favorite papers.

1. Function estimation

Given a training sample $\{y_i, x_i\}_1^N$, the goal is to obtain an approximation $\hat{F}$ of the function $F^*: x\rightarrow y$ that minimizes the expected value of some loss function $L(y, F(x))$ over the joint distribution of all $(y, x)$ values.

$$F^* = \arg \min_F E_{y,x} L(y, F(x)) = \arg \min_F E_x [E_y(L(y, F(x)))\vert x]$$

Frequently employed loss functions:

  • squared error $(y-F)^2$ (regression)
  • absolute error $\lvert y - F\rvert$ (regression)
  • negative binomial log-likelihood $\log (1 + e^{-2yF})$ (classification, $y\in\{-1, 1\}$)

Common procedure is to restrict $F$ to be a member of a parameterized class of functions $F(x; P)$, where $P=\{P_1, P_2, \dots\}$ is a finite set of parameters.

In this paper, we focus on "additive" expansions of the form:

$$F(x; \{\beta_m, a_m\}_1^M) = \sum_{m=1}^M \beta_m h(x; a_m)$$

Generic function $h(x; a)$ is a simple function of the input variables, parameterized by $a = \{a_1, a_2, \dots\}$.

Such expansions are at the heart of many function approximation methods such as neural networks, radial basis functions, MARS, wavelets and support vector machines.

In this paper, each of the functions $h(x; a_m)$ is a small regression tree. The parameters $a_m$ are the splitting variables, split locations, and the terminal node means of the individual trees.

1.1 Numerical optimization

Choosing a parameterized model changes the function optimization problem to one of parameter optimization: $P^* = \arg \min_P E_{y,x}L(y, F(x;P))$

The solution for the parameters is expressed in the form $P^* = \sum_{m=0}^M p_m$, where $p_0$ is an initial guess and $\{p_m\}_1^M$ are successive increments.

1.2 Steepest-descent

Denote $\Phi(P) = E_{y,x}L(y, F(x;P))$ and $g_m = \big[ \frac{\partial \Phi(P)}{\partial P_j} \big]_{P = P_{m-1}}$.

The step is taken to be $p_m = -\rho_m g_m$, where the line search $\rho_m = \arg \min_\rho \Phi(P_{m-1} - \rho g_m)$ is performed along the steepest-descent direction defined by the negative gradient $-g_m$.

2. Numerical optimization in function space

Non-parametric approach: we consider $F(x)$ evaluated at each point $x$ to be a parameter and seek to minimize:

$$\Phi(F)=E_{y,x}L(y, F(x)) = E_x[E_y(L(y, F(x)))\vert x]$$

In function space there are an infinite number of such parameters, but in a data set only a finite number $\{ F(x_i) \}_1^N$ are involved. Following the numerical-optimization paradigm:

$$F^*(x)=\sum_{m=0}^M f_m(x)$$

where $f_0$ is an initial guess and the $f_m$ are incremental functions.

For steepest descent: $f_m(x) = -\rho_m g_m(x)$ with:

$$g_m(x) = \bigg[ \frac{\partial E_y [L(y,F(x))\vert x]}{\partial F(x)} \bigg]_{F(x)=F_{m-1}(x)}$$

Assuming sufficient regularity that one can interchange differentiation and integration (see the Leibniz integral rule):

$$g_m(x) = E_y \bigg[ \frac{\partial L(y, F(x))}{\partial F(x)} \bigg]_{F(x) = F_{m-1}(x)}$$

The multiplier $\rho_m$ is given by the line search: $\rho_m = \arg \min_\rho E_{y,x}L(y,F_{m-1}(x)-\rho g_m(x))$

3. Finite data

For a finite data sample, the expectation cannot be estimated accurately. Moreover, this approach would not ensure that the functions $f_m$ generalize to unseen data points. We impose smoothness on the solution by assuming a parameterized form:

$$F(x; \{\beta_m, a_m\}_1^M) = \sum_{m=1}^M \beta_m h(x; a_m)$$

and minimize the data-based estimate of expected loss:

$$\{\beta_m, a_m\}_1^M = \arg \min_{\{\beta_m', a_m'\}_1^M} \sum_{i=1}^N L\bigg(y_i, \sum_{m=1}^M \beta_m' h(x_i; a_m')\bigg)$$

If infeasible, one can try a greedy-stagewise approach. For $m=1,2, \dots, M$:

$$(\beta_m, a_m) = \arg \min_{\beta, a} \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \beta h(x_i; a))$$

and then: $F_m(x)=F_{m-1}(x) + \beta_m h(x; a_m)$

Here, unlike in the stepwise approach, previously entered terms are not readjusted when new ones are added.

  • in signal processing, this stagewise strategy is called matching pursuit (squared-error loss). Functions $h$ are called basis functions and are taken from an over-complete wavelet-like dictionary
  • in machine learning, it is called boosting and the loss is either the exponential loss criterion $e^{-yF}$ or the negative binomial log-likelihood. Function $h$ is called a weak learner or base learner and is a classification tree.

Given the approximation $F_{m-1}$, the function $\beta_m h(x; a_m)$ (i.e. after optimizing $\beta_m, a_m$) can be viewed as the best greedy step towards the estimate of $F^*$, under the constraint that the step direction $h(x; a_m)$ be a member of the class of functions $h(x;a)$. It can thus be regarded as a steepest-descent step.

The unconstrained negative gradient:

$$-g_m(x_i) = - \bigg[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg]_{F(x) = F_{m-1}(x)}$$

gives the best steepest-descent step direction $-g_m = \{-g_m(x_i)\}_1^N$ in the $N$-dimensional data space at $F_{m-1}(x)$.

However, this gradient is only defined at the data points $x_i$ and cannot be generalized to other $x$. We therefore choose the member of the parameterized class $h(x; a)$ that produces $h_m = \{h(x_i; a_m)\}_1^N$ most parallel to $-g_m$ (i.e. most highly correlated with it). It can be obtained by solving the least-squares problem:

$$a_m = \arg \min_{a, \beta} \sum_{i=1}^N [-g_m(x_i) - \beta h(x_i; a)]^2$$

We then use $h(x; a_m)$ in place of the unconstrained negative gradient in the line search:

$$\rho_m = \arg \min_\rho \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m))$$

and: $F_m(x) = F_{m-1}(x) + \rho_m h(x;a_m)$

Algorithm: Gradient Boost

$$F_0(x) = \arg\min_\rho \sum_{i=1}^N L(y_i, \rho)$$

For $m=1$ to $M$ do:

$$\tilde{y}_i = -\bigg[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg]_{F(x)=F_{m-1}(x)}, \quad i=1, \dots, N$$

$$a_m = \arg\min_{a,\beta} \sum_{i=1}^N [\tilde{y}_i - \beta h(x_i; a)]^2$$

$$\rho_m = \arg \min_\rho \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m))$$

$$F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$$
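
A minimal Python sketch of this loop (my own function and parameter names; scikit-learn's `DecisionTreeRegressor` stands in for the generic base learner $h(x;a)$ and scipy's `minimize_scalar` performs the line searches):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, negative_gradient, M=100, max_leaf_nodes=8):
    """Generic gradient boosting loop following the algorithm above.

    `loss(y, F)` returns element-wise losses; `negative_gradient(y, F)`
    returns the pseudo-responses -dL/dF evaluated at F.
    """
    # F_0(x) = argmin_rho sum_i L(y_i, rho)
    F0 = minimize_scalar(lambda rho: loss(y, np.full(y.shape, rho)).sum()).x
    F = np.full(y.shape, F0, dtype=float)
    trees, rhos = [], []
    for _ in range(M):
        # Pseudo-responses y_tilde_i = -[dL/dF] at F_{m-1}
        y_tilde = negative_gradient(y, F)
        # Fit h(x; a_m) to the pseudo-responses by least squares
        h = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, y_tilde)
        h_pred = h.predict(X)
        # Line search: rho_m = argmin_rho sum_i L(y_i, F_{m-1}(x_i) + rho * h(x_i))
        rho = minimize_scalar(lambda r: loss(y, F + r * h_pred).sum()).x
        # F_m(x) = F_{m-1}(x) + rho_m * h(x; a_m)
        F = F + rho * h_pred
        trees.append(h)
        rhos.append(rho)
    return F0, trees, rhos
```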

4. Applications: additive modeling

4.1. Least-squares regression

$L(y,F)=(y-F)^2/2$, so the pseudo-responses are simply the current residuals $\tilde{y}_i = y_i - F_{m-1}(x_i)$. Therefore $a_m, \beta_m$ are obtained by fitting the residuals, and the line search produces $\rho_m = \beta_m$ (we can thus skip the $\rho_m$ optimization step).

(least-squares algorithm, using any base learner $h$)

Question: How do we learn $a_m$ and $\beta_m$ at the same time? In least squares, the feature vector is fixed beforehand.

Answer: this is still generic. Here $h$ is any base learner; see section 4.3 for using regression trees specifically.
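
Plugged into the sketch above, least-squares boosting only needs the squared-error loss and its residual pseudo-responses (variable names are my own):

```python
import numpy as np

# Squared-error loss: pseudo-responses are plain residuals and rho_m = beta_m.
ls_loss = lambda y, F: 0.5 * (y - F) ** 2
ls_negative_gradient = lambda y, F: y - F

# Usage with the earlier sketch:
# F0, trees, rhos = gradient_boost(X_train, y_train, ls_loss, ls_negative_gradient)
```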

4.2 Least-absolute-deviation (LAD) regression.

$L(y, F) = \lvert y - F \rvert$ and $\tilde{y}_i = \text{sign}(y_i - F_{m-1}(x_i))$.

Therefore $h$ is fit (by least squares) to the sign of the current residuals and the line search becomes:

$$\begin{aligned}\rho_m & = \arg \min_{\rho} \sum_{i=1}^N \lvert y_i - F_{m-1}(x_i) - \rho h(x_i; a_m)\rvert \\ & = \arg\min_\rho \sum_{i=1}^N \lvert h(x_i; a_m)\rvert \cdot \bigg\lvert \frac{y_i - F_{m-1}(x_i)}{h(x_i; a_m)} - \rho \bigg\rvert \\ & = \underbrace{\text{median}_W}_{\text{weighted median}} \bigg\{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; a_m)} \bigg\}, \quad w_i = \lvert h(x_i; a_m) \rvert \end{aligned}$$

(using any base learner $h$)
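
A small sketch of the weighted median used in this line search (my own helper, taking the smallest value whose cumulative weight reaches half the total weight):

```python
import numpy as np

def weighted_median(values, weights):
    # Sort by value; return the first value whose cumulative weight
    # reaches half of the total weight.
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    return values[np.searchsorted(cumulative, 0.5 * cumulative[-1])]

# LAD line search: rho_m = median_W{(y_i - F_{m-1}(x_i)) / h(x_i; a_m)}, w_i = |h(x_i; a_m)|
# rho_m = weighted_median((y - F) / h_pred, np.abs(h_pred))
```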

4.3. Regression trees

Base learner is a $J$-terminal node regression tree. Each regression tree model has the additive form:

$$h(x; \{b_j, R_j\}_1^J) = \sum_{j=1}^J b_j 1(x\in R_j)$$

i.e. if $x\in R_j$ then $h(x) = b_j$, where the regions $\{R_j\}_1^J$ form a partition of the $x$-space.

The update becomes:

$$F_m(x) = F_{m-1}(x) + \underbrace{\rho_m \sum_{j=1}^J b_{jm}1(x\in R_{jm})}_{\sum_j \gamma_{jm}1(x\in R_{jm}),\ \gamma_{jm} = \rho_m b_{jm}}$$

Where $\{ R_{jm} \}_1^J$ are the regions defined by the terminal nodes at the $m$-th iteration. They are constructed to predict the pseudo-responses $\{\tilde{y}_i\}_1^N$ by least squares.

We can view this as adding $J$ separate basis functions at each step. One can thus improve the quality of the fit by using the optimal coefficients for each of these separate basis functions:

$$\{\gamma_{jm}\}_{j=1}^J = \arg \min_{\{\gamma_j\}_1^J} \sum_{i=1}^N L\bigg( y_i, F_{m-1}(x_i) + \sum_{j=1}^J \gamma_j 1(x_i\in R_{jm}) \bigg)$$

Since the regions are disjoint, this reduces to:

$$\gamma_{jm} = \arg \min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$$

(the sum over all data points reduces to a separate sum within each region)
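
A sketch of this per-terminal-node update for a fitted scikit-learn tree (leaf membership comes from `tree.apply`; the optimizer and names are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def terminal_node_gammas(tree, X, y, F_prev, loss):
    # gamma_jm = argmin_gamma sum_{x_i in R_jm} L(y_i, F_{m-1}(x_i) + gamma)
    leaf_ids = tree.apply(X)              # terminal node reached by each sample
    gammas = {}
    for leaf in np.unique(leaf_ids):
        in_region = leaf_ids == leaf
        gammas[leaf] = minimize_scalar(
            lambda g: loss(y[in_region], F_prev[in_region] + g).sum()
        ).x
    return gammas   # F_m(x) = F_{m-1}(x) + gammas[leaf reached by x]
```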

Note:

In case of LAD regression, why not directly minimize $\text{tree}_m(x) = \arg \min_{J\text{-node tree}} \sum_{i=1}^N \lvert y_i - F_{m-1}(x_i) - \text{tree}(x_i)\rvert$?

Using squared-error loss to search for splits is much faster than mean-absolute-deviation. Why? (see CS289 lecture notes)

4.4 M-Regression

Attempts resistance to long-tailed error distributions and outliers while maintaining high efficiency for normally distributed errors. Consider the Huber loss function.

$$L(y,F) = \begin{cases}\frac{1}{2}(y-F)^2 & \lvert y - F \rvert \leq \delta \\ \delta(\lvert y-F\rvert - \delta/2) & \text{otherwise}\end{cases}$$

The transition point $\delta$ defines those residual values that are considered to be outliers, subject to absolute rather than squared error loss. Common practice is to choose $\delta$ to be the $\alpha$-quantile of the distribution of $\lvert y-F^*(x)\rvert$, where $(1-\alpha)$ controls the break-down point (fraction of observations that can be arbitrarily modified without seriously degrading the quality of the result). One can use $F_{m-1}$ as an approximation of $F^*$:

$$\delta_m = \text{quantile}_\alpha \{\lvert y_i - F_{m-1}(x_i)\rvert\}_1^N$$

With regression trees as base learners, we can do a separate update in each terminal node $R_{jm}$ as seen in 4.3.

The solution to each update can be approximated by a single step of the standard iterative procedure (see Huber 1964), starting at the median of residuals:

$$\tilde{r}_{jm} = \text{median}_{x_i\in R_{jm}} \{y_i - F_{m-1}(x_i)\}$$

The approximation is:

$$\gamma_{jm} = \tilde{r}_{jm} + \frac{1}{\lvert R_{jm}\rvert} \sum_{x_i \in R_{jm}} \text{sign}(r_{m-1}(x_i) - \tilde{r}_{jm})\cdot \min(\delta_m, \lvert r_{m-1}(x_i) - \tilde{r}_{jm}\rvert)$$

where $r_{m-1}(x_i) = y_i - F_{m-1}(x_i)$ is the current residual.
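
A sketch of this single-step approximation inside one terminal node (names are mine; `residuals` holds $r_{m-1}(x_i)$ for the samples falling in $R_{jm}$):

```python
import numpy as np

def huber_delta(y, F_prev, alpha=0.9):
    # delta_m: alpha-quantile of the absolute residuals |y_i - F_{m-1}(x_i)|
    return np.quantile(np.abs(y - F_prev), alpha)

def huber_terminal_node_gamma(residuals, delta):
    # One Huber step started at the median of the in-node residuals:
    # gamma = r_tilde + mean( sign(r - r_tilde) * min(delta, |r - r_tilde|) )
    r_tilde = np.median(residuals)
    diff = residuals - r_tilde
    return r_tilde + np.mean(np.sign(diff) * np.minimum(delta, np.abs(diff)))
```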

According to motivations underlying robust regression, this algorithm should have properties:

  • similar to that of least-squares boosting for normally distributed errors
  • similar to that of least-absolute-deviation regression with very long tailed distributions
  • superior to both for error distributions with only moderately long tails

4.5 Two-class logistic regression and classification

Loss function is the negative binomial log-likelihood $L(y,F) = \log(1+\exp(-2yF))$, $y\in\{-1, 1\}$, where $F(x)=\frac{1}{2}\log \bigg[ \frac{P(y=1\vert x)}{P(y=-1\vert x)} \bigg]$ (half the log-odds).

With regression trees as base learners, we again use separate updates in each terminal node $R_{jm}$. There is no closed-form solution to the minimization, thus we approximate it by a single Newton-Raphson step (see Friedman, Hastie, Tibshirani, "Additive logistic regression: a statistical view of boosting" (2000)):

$$\gamma_{jm} = \sum_{x_i \in R_{jm}} \tilde{y}_i \bigg/ \sum_{x_i \in R_{jm}} \lvert \tilde{y}_i\rvert(2-\lvert \tilde{y}_i\rvert)$$

The log-odds can be inverted to get the probability estimates, which can be used for classification:

$$\hat{y}_i = 2\cdot \mathbb{I}[c(-1, 1)\hat{P}(y=1\vert x_i) > c(1,-1)\hat{P}(y=-1\vert x_i)] - 1$$

where $c(\hat{y}, y)$ is the cost of predicting $\hat{y}$ when the truth is $y$.
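
A sketch of the two-class building blocks (helper names are mine; the pseudo-response is the negative gradient of the loss above, $\tilde{y}_i = 2y_i/(1+\exp(2y_iF_{m-1}(x_i)))$):

```python
import numpy as np

def pseudo_response(y, F):
    # Negative gradient of log(1 + exp(-2yF)):  y_tilde = 2y / (1 + exp(2yF))
    return 2.0 * y / (1.0 + np.exp(2.0 * y * F))

def terminal_node_gamma(y_tilde):
    # Single Newton-Raphson step within one terminal node R_jm
    return y_tilde.sum() / (np.abs(y_tilde) * (2.0 - np.abs(y_tilde))).sum()

def prob_positive(F):
    # Invert the half log-odds: P(y = 1 | x) = 1 / (1 + exp(-2F))
    return 1.0 / (1.0 + np.exp(-2.0 * F))
```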

4.5.1 Influence trimming

With the binomial log-likelihood, the empirical loss function is:

$$\begin{aligned}\phi_m(\rho, a) & = \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \rho h(x_i; a))\\& =\sum_{i=1}^N \log[1 + \exp(-2y_iF_{m-1}(x_i))\cdot\exp(-2y_i\rho h(x_i; a))]\end{aligned}$$

If $y_iF_{m-1}(x_i)$ is very large, then $\rho h(x_i; a)$ will have almost no influence on the loss. This suggests that all observations $(y_i, x_i)$ for which $y_iF_{m-1}(x_i)$ is very large can be deleted from all computations of the $m$-th iteration.

$$w_i = \exp(-2y_i F_{m-1}(x_i))$$

can be viewed as a measure of the influence of the $i$-th observation on the estimate $\rho_m h(x; a_m)$.

We can also use the second derivative of the loss with respect to $F$ as $w_i$.

Influence trimming deletes all observations with $w_i$-values less than $w_{l(\alpha)}$, where $l(\alpha)$ is the number of weights, arranged in ascending order, whose sum equals $\alpha$ times the total sum of weights (typical values are $\alpha \in [0.05, 0.2]$).

90% to 95% of the observations are often deleted without sacrificing accuracy, resulting in a corresponding reduction in computation by a factor of 10 to 20.
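
A sketch of the trimming rule using the first-derivative weights $w_i = \exp(-2y_iF_{m-1}(x_i))$ (my own helper):

```python
import numpy as np

def influence_trim_mask(y, F, alpha=0.1):
    # Keep observations whose weight w_i = exp(-2 y_i F(x_i)) is at least
    # w_{l(alpha)}, where the l(alpha) smallest weights sum to alpha * total weight.
    w = np.exp(-2.0 * y * F)
    order = np.argsort(w)
    cumulative = np.cumsum(w[order])
    l_alpha = np.searchsorted(cumulative, alpha * w.sum())
    return w >= w[order][l_alpha]   # boolean mask of observations kept this iteration
```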

4.6 Multi-class logistic regression and classification

$$L(\{y_k, F_k(x)\}_1^K) = -\sum_{k=1}^K y_k \log p_k (x)$$

where $y_k = \mathbb{I}[\text{class}=k]\in\{0,1\}$ and $p_k(x) = \mathbb{P}(y_k=1\vert x)$.

$$p_k(x) = \text{softmax}(F)_k = \exp(F_k(x))\bigg/ \sum_{l=1}^K \exp (F_l (x))$$

Thus, $K$ trees with $J$ terminal nodes each are induced at each iteration $m$ to predict the corresponding current residuals for each class. Since the regions corresponding to different classes overlap, one cannot compute the updates separately within each region of each tree. An approximate solution, a single Newton-Raphson step with a diagonal approximation to the Hessian, decomposes the problem into a separate calculation for each terminal node of each tree:

$$\gamma_{jkm} = \frac{K-1}{K} \frac{\sum_{x_i\in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i\in R_{jkm}} \lvert \tilde{y}_{ik}\rvert (1- \lvert \tilde{y}_{ik}\rvert)}$$

The final probability estimates at step $M$ can be used for classification:

$$\tilde{k}(x) = \arg \min_{1\leq k\leq K} \sum_{k'=1}^K c(k,k') p_{k'M}(x)$$
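
A sketch of the multi-class pieces (names are mine; `Y` is the $N\times K$ one-hot label matrix and `F` the $N\times K$ matrix of current scores; the pseudo-response $\tilde{y}_{ik} = y_{ik} - p_k(x_i)$ is the negative gradient of the loss above):

```python
import numpy as np

def softmax(F):
    # p_k(x) = exp(F_k(x)) / sum_l exp(F_l(x)), row-wise, with a stability shift
    Z = np.exp(F - F.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def multiclass_pseudo_responses(Y, F):
    # y_tilde_ik = y_ik - p_k(x_i): one tree per class is then fit to each column
    return Y - softmax(F)

def terminal_node_gamma_k(y_tilde_in_node, K):
    # Newton step with a diagonal Hessian approximation inside one node R_jkm
    num = y_tilde_in_node.sum()
    den = (np.abs(y_tilde_in_node) * (1.0 - np.abs(y_tilde_in_node))).sum()
    return (K - 1.0) / K * num / den
```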

For $K=2$, this algorithm is equivalent to the two-class logistic algorithm of 4.5.

Influence trimming for multi-class is implemented the same way as for two-class.

5. Regularization

For additive expansions, a natural regularization is the number of components $M$. This places an implicit prior belief that "sparse" approximations involving fewer terms are likely to provide better predictions.

However, regularization through shrinkage provides superior results to simply restricting the number of components (Copas 1983).

A simple shrinkage strategy is to introduce a learning rate $\nu \in (0, 1]$:

$$F_m(x) = F_{m-1}(x) + \nu \rho_m h(x; a_m)$$

There is a trade-off between $\nu$ and $M$: decreasing $\nu$ increases the best value for $M$, while increasing $M$ increases computation time proportionately.

Misclassification error is not sensitive to over-fitting: misclassification error-rate decreases well after the logistic likelihood has reached its optimum. Error-rate depends only on the sign of $F_M(x)$, whereas the likelihood is affected by both magnitude and sign. Over-fitting degrades the quality of the estimate without affecting (and sometimes improving) the sign.

7. Tree boosting

  • Meta-parameters of the GradientBoost procedure: $M$ and $\nu$
  • Meta-parameter of the base learner: number of terminal nodes $J$.

Best choice for $J$ depends on the highest order of the dominant interactions among the variables.

Consider "ANOVA" expansion of a function:

$$F(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j,x_k) + \sum_{j,k,l}f_{jkl}(x_j,x_k,x_l) + \dots$$

  • first sum is the "main effects" component of $F(x)$. Referred to as an "additive" model because the contributions of each $x_j$, $f_j(x_j)$, are added. Provides the closest such approximation to $F(x)$. This is a more restrictive definition of "additive" than the one given at the beginning.
  • second sum: two-variable "interaction effects"
  • and so on...

Highest interaction order possible is limited by the number of input variables $n$.

Purely additive approximations are also produced by the "naive"-Bayes method.

In boosting regression trees, the interaction order can be controlled by limiting the size of the individual trees induced at each iteration:

  • a tree with $J$ terminal nodes produces a function with interaction order at most $\min(J-1, n)$.
  • Since boosting is additive, the order of the entire approximation can be no larger than the largest among its individual components.
  • Best tree size $J$ is governed by the effective interaction order of the target $F^*$.
  • As this is usually unknown, $J$ becomes a meta-parameter to be estimated using a model selection criterion. It is unlikely that large trees would ever be necessary or desirable.

8. Interpretation

8.1 Relative importance of input variables

$$I_j = \bigg(E_x\bigg[ \frac{\partial \hat{F}(x)}{\partial x_j} \bigg]^2 \cdot \text{var}_x[x_j]\bigg)^{1/2}$$

This does not strictly exist for the piecewise-constant approximations produced by decision trees. Breiman et al (1983) propose a surrogate measure:

$$\hat{I}^2_j (T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \mathbb{I}(v_t = j)$$

  • summation over the non-terminal nodes $t$ of the $J$-terminal node tree $T$
  • $v_t$ is the splitting variable associated with node $t$
  • $\hat{i}_t^2$ is the empirical improvement in squared error as a result of the split
  • For a collection of decision trees $\{T_m\}_1^M$: $\hat{I}^2_j = \frac{1}{M} \sum_{m=1}^M \hat{I}_j^2 (T_m)$ (a generalization justified by heuristic arguments in the paper; see the sketch below)
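
A sketch of this surrogate measure read off fitted scikit-learn regression trees; for the squared-error split criterion, the weighted impurity decrease at a split equals the squared-error improvement $\hat{i}_t^2$ (helper names are mine):

```python
import numpy as np

def tree_squared_importance(tree, n_features):
    # I_hat_j^2(T): sum, over internal nodes t that split on variable j, of the
    # squared-error improvement produced by the split.
    t = tree.tree_
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                      # terminal node: no split
            continue
        n, n_l, n_r = (t.weighted_n_node_samples[node],
                       t.weighted_n_node_samples[left],
                       t.weighted_n_node_samples[right])
        improvement = n * t.impurity[node] - n_l * t.impurity[left] - n_r * t.impurity[right]
        imp[t.feature[node]] += improvement
    return imp

def ensemble_importance(trees, n_features):
    # Average the per-tree squared measures over the boosted ensemble, then take the root.
    return np.sqrt(np.mean([tree_squared_importance(t, n_features) for t in trees], axis=0))
```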

8.2 Partial dependence plots

Useful for visualizing functions of high dimensional arguments.

Let $z_l$ be a chosen "target" subset, of size $l$, of the input variables $x$:

$z_l = \{z_1, \dots, z_l\}\subset \{x_1, \dots, x_n\}$, with $z_{\setminus l}$ the complement subset.

One can condition $\hat{F}(x)$ on particular values of $z_{\setminus l}$ and consider it as a function only of $z_l$: $\hat{F}_{z_{\setminus l}} = \hat{F}(z_l \vert z_{\setminus l})$

In general, $\hat{F}_{z_{\setminus l}}$ depends on the particular values chosen for $z_{\setminus l}$.

If this dependence is not too strong, the average function:

$$\hat{F}_l(z_l) = E_{z_{\setminus l}}\big[\hat{F}(x)\big]=\int \hat{F}(z_l, z_{\setminus l})\,p_{\setminus l}(z_{\setminus l})\,dz_{\setminus l}$$

can represent a useful summary of the "partial dependence" of $\hat{F}(x)$ on $z_l$.

$p_{\setminus l}(z_{\setminus l})$ is the marginal probability density of $z_{\setminus l}$:

$$p_{\setminus l}(z_{\setminus l}) = \int \underbrace{p(x)}_{\text{joint density of all inputs }x}\,dz_l$$

This complement marginal density can be estimated from the training data and finally:

$$\bar{F}_l(z_l)=\frac{1}{N}\sum_{i=1}^N \hat{F}(z_l, z_{i,\setminus l})$$
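
A brute-force sketch of this estimate: fix the target variables at the chosen values, keep the complement variables at their observed training values, and average the model's predictions (function names are mine):

```python
import numpy as np

def partial_dependence(predict, X_train, target_idx, z_value):
    # F_bar_l(z_l) = (1/N) sum_i F_hat(z_l, z_{i, \l})
    X = np.array(X_train, dtype=float, copy=True)
    X[:, target_idx] = z_value     # clamp the target variable(s) at z_l
    return predict(X).mean()
```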

For regression trees based on single-variable splits, the partial dependence of $\hat{F}(x)$ on $z_l$ can be evaluated without reference to the data. For a specific set of values for $z_l$, a weighted traversal of each tree is performed. At the root, a weight of one is assigned. For each non-terminal node visited, if its split variable is in the target subset $z_l$, the weight is not modified. If it is in the complement subset $z_{\setminus l}$, the weight is multiplied by the fraction of observations that went left or right, respectively. When the traversal is complete, $\bar{F}_l(z_l)$ is the weighted average of the $\hat{F}(x)$ values over the terminal nodes.
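
A sketch of that weighted traversal for a single fitted scikit-learn tree (my own helper; `z_values` maps the indices of the target variables to their chosen values):

```python
import numpy as np

def tree_partial_dependence(tree, target_features, z_values):
    # Splits on target variables follow the branch chosen by z_values; splits on
    # complement variables send the weight down both branches in proportion to
    # the training observations that went each way.
    t = tree.tree_
    total, stack = 0.0, [(0, 1.0)]          # (node_id, weight), starting at the root
    while stack:
        node, weight = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                      # terminal node: accumulate weighted value
            total += weight * t.value[node][0][0]
            continue
        feature = t.feature[node]
        if feature in target_features:
            follow = left if z_values[feature] <= t.threshold[node] else right
            stack.append((follow, weight))
        else:
            n, n_l = t.weighted_n_node_samples[node], t.weighted_n_node_samples[left]
            stack.append((left, weight * n_l / n))
            stack.append((right, weight * (n - n_l) / n))
    return total
```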

When the dependence of $\hat{F}(x)$ on $z_l$ is additive or multiplicative (i.e. $\hat{F}(x)=\hat{F}_l(z_l) + \text{ or } \times\, \hat{F}_{\setminus l}(z_{\setminus l})$), then $\bar{F}_l(z_l)$ provides a complete description of the nature of the variation of $\hat{F}(x)$ on the variable subset $z_l$.

Therefore, subsets that group together influential inputs that have complex (nonfactorable) interactions between them will provide the most revealing partial dependence plots.

As a diagnostic, we can compute $\hat{F}_l(z_l)$ and $\hat{F}_{\setminus l}(z_{\setminus l})$ and check the correlation of their sum (or product) with $\hat{F}(x)$.

Interpretability of larger trees is questionable (too unstable).

10. Data Mining

Gradient boosting with trees is robust: it is invariant under all (strictly) monotone transformations of the individual input variables, since tree splits depend only on the ordering of each input's values. E.g. using $x_j$, $\log x_j$, $\exp x_j$, or $x_j^a$ as the $j$-th input variable yields the same result, so there is no need for variable transformations. As a consequence, sensitivity to long-tailed distributions and outliers in the inputs is also eliminated (intuitively, the spread between data points can be made arbitrarily small by applying, say, the log function).

LAD- and M-regression are especially robust to outliers (absolute loss and Huber loss respectively).

Feature selection: trees are robust against the addition of irrelevant input variables. All tree-based models handle missing values in a unified and elegant manner (Classification And Regression Trees, Breiman et al 1983). (Why? In CART this is done with surrogate splits: an observation missing the chosen split variable is routed using the best correlated alternative variable.) There is no need to consider external imputation schemes.

Principal disadvantage of single-tree models is inaccuracy due to piecewise-constant approximations. Boosting alleviates this: the approximation is still piecewise constant, but the granularity is much finer.