Given a training sample $\{y_i, x_i\}_1^N$, the goal is to obtain an approximation $\hat{F}$ of the function $F^*: x \to y$ that minimizes the expected value of some loss function $L(y, F(x))$ over the joint distribution of all $(y, x)$ values.
A common procedure is to restrict $F$ to be a member of a parameterized class of functions $F(x; P)$, where $P = \{P_1, P_2, \ldots\}$ is a finite set of parameters.
In this paper, we focus on "additive" expansions of the form:
$F(x; \{\beta_m, a_m\}_1^M) = \sum_{m=1}^M \beta_m h(x; a_m)$
The generic function $h(x; a)$ is a simple function of the input variables, parameterized by $a = \{a_1, a_2, \ldots\}$.
Such expansions are at the heart of many function approximation methods such as neural networks, radial basis functions, MARS, wavelets and support vector machines.
In this paper, each of the functions $h(x; a_m)$ is a small regression tree. The parameters $a_m$ are the splitting variables, split locations, and the terminal node means of the individual trees.
1.1 Numerical optimization
Choosing a parameterized model changes the function optimization problem to one of parameter optimization: $P^* = \arg\min_P E_{y,x} L(y, F(x; P))$
The solution for the parameters is expressed in the form $P^* = \sum_{m=0}^M p_m$, where $p_0$ is an initial guess and $\{p_m\}_1^M$ are successive increments.
1.2 Steepest-descent
Denote $\Phi(P) = E_{y,x} L(y, F(x; P))$ and $g_m = \left[ \frac{\partial \Phi(P)}{\partial P_j} \right]_{P = P_{m-1}}$.
The step is taken to be $p_m = -\rho_m g_m$, where we perform the line search $\rho_m = \arg\min_\rho \Phi(P_{m-1} - \rho g_m)$ along the steepest-descent direction defined by the negative gradient $-g_m$.
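As a toy illustration (not from the paper), here is a minimal sketch of parameter-space steepest descent with a line search; the quadratic least-squares objective stands in for $\Phi(P)$ and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy surrogate for Phi(P) = E_{y,x} L(y, F(x;P)): a simple least-squares objective.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def Phi(P):
    return 0.5 * np.mean((y - X @ P) ** 2)

def grad_Phi(P):
    return -X.T @ (y - X @ P) / len(y)

P = np.zeros(3)                                                  # initial guess p_0
for m in range(25):
    g_m = grad_Phi(P)                                            # g_m = [dPhi/dP_j] at P_{m-1}
    rho_m = minimize_scalar(lambda rho: Phi(P - rho * g_m)).x    # line search along -g_m
    P = P - rho_m * g_m                                          # increment p_m = -rho_m g_m
print(P)   # close to the true coefficients [1, -2, 0.5]
```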
2. Numerical optimization in function space
Non-parametric approach: we consider F(x) evaluated at each point x to be a parameter and seek to minimize:
$\Phi(F) = E_{y,x} L(y, F(x)) = E_x \left[ E_y ( L(y, F(x)) ) \mid x \right]$
In function space there is an infinite number of such parameters, but in a data set only a finite number $\{F(x_i)\}_1^N$. Following the numerical optimization paradigm:
$F^*(x) = \sum_{m=0}^M f_m(x)$
where $f_0$ is an initial guess and the $f_m$ are incremental functions.
For steepest descent: $f_m(x) = -\rho_m g_m(x)$ with:
$g_m(x) = \left[ \frac{\partial E_y [ L(y, F(x)) \mid x ]}{\partial F(x)} \right]_{F(x) = F_{m-1}(x)}$
Assuming sufficient regularity that one can interchange differentiation and integration (see the Leibniz integral rule):
$g_m(x) = E_y \left[ \frac{\partial L(y, F(x))}{\partial F(x)} \,\middle|\, x \right]_{F(x) = F_{m-1}(x)}$
The multiplier $\rho_m$ is given by the line search: $\rho_m = \arg\min_\rho E_{y,x} L(y, F_{m-1}(x) - \rho g_m(x))$
3. Finite data
For a finite data sample, the expectation cannot be estimated accurately at each $x_i$. Moreover, this approach would not ensure that the functions $f_m$ generalize to unseen data points. We impose smoothness on the solution by assuming a parameterized form:
$F(x; \{\beta_m, a_m\}_1^M) = \sum_{m=1}^M \beta_m h(x; a_m)$
and minimize the data-based estimate of expected loss in a greedy stagewise fashion: for $m = 1, 2, \ldots, M$,
$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \beta h(x_i; a))$
$F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m)$
Here previously entered terms are not readjusted, unlike in stepwise approaches.
In signal processing, this stagewise strategy is called matching pursuit (squared-error loss). The functions $h$ are called basis functions and are taken from an over-complete wavelet-like dictionary.
In machine learning, it is called boosting and the loss is either the exponential loss criterion $e^{-yF}$ or the negative binomial log-likelihood. The function $h$ is called a weak learner or base learner and is a classification tree.
Given the approximation $F_{m-1}$, the function $\beta_m h(x; a_m)$ (i.e. after optimizing $\beta_m, a_m$) can be viewed as the best greedy step towards the estimate of $F^*$, under the constraint that the step direction $h(x; a_m)$ be a member of the class of functions $h(x; a)$. It can thus be regarded as a steepest-descent step.
The unconstrained negative gradient:
$-g_m(x_i) = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$
gives the best steepest-descent step direction $-g_m = \{-g_m(x_i)\}_1^N$ in the $N$-dimensional data space at $F_{m-1}(x)$.
However, this gradient is only defined at the data points $x_i$ and cannot be generalized to other $x$. We therefore choose the member of the parameterized class $h(x; a)$ that produces $h_m = \{h(x_i; a_m)\}_1^N$ most parallel to $-g_m$ (i.e. most highly correlated with it). It can be obtained by solving the least-squares problem:
$a_m = \arg\min_{a, \beta} \sum_{i=1}^N \left[ -g_m(x_i) - \beta h(x_i; a) \right]^2$
We then use $h(x; a_m)$ in place of the unconstrained negative gradient in the line search, $\rho_m = \arg\min_\rho \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m))$, and update the approximation: $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$.
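A minimal sketch of this generic procedure, assuming a differentiable loss and (for concreteness) a small sklearn regression tree as the base learner fit by least squares to the negative gradient; `loss` and `neg_gradient` are caller-supplied callables, and initializing with the mean of $y$ is a simplification of the paper's constant fit.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, neg_gradient, M=100, max_leaf_nodes=8):
    """Generic sketch: loss(y, F) -> scalar, neg_gradient(y, F) -> per-sample -g_m(x_i)."""
    F0 = np.mean(y)                               # f_0: constant initial guess (simplification)
    F = np.full(len(y), F0)
    learners, rhos = [], []
    for m in range(M):
        g = neg_gradient(y, F)                                              # -g_m, known only at the data points
        h = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, g)  # least-squares fit to -g_m
        h_x = h.predict(X)
        rho = minimize_scalar(lambda r: loss(y, F + r * h_x)).x             # line search along h(x; a_m)
        F = F + rho * h_x                                                   # F_m = F_{m-1} + rho_m h(x; a_m)
        learners.append(h)
        rhos.append(rho)
    return F0, learners, rhos
```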
4.1 Least-squares (LS) regression
$L(y, F) = (y - F)^2 / 2$, and the negative derivative w.r.t. $F$ gives the current residuals $\tilde y_i = y_i - F_{m-1}(x_i)$. Therefore $a_m, \beta_m$ are obtained by fitting the residuals, and the line search produces $\rho_m = \beta_m$ (we can thus skip the $\rho_m$ optimization step).
(this holds for any base learner $h$)
Question: How do we learn $a_m$ and $\beta_m$ at the same time? In least squares, the feature vector is fixed beforehand.
Answer: this is still theoretical. Here $h$ is any base learner; see section 4.3 about using regression trees specifically (a sketch with a tree base learner follows below).
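For the least-squares case, a minimal sketch using a regression tree as the base learner (an illustrative choice; the text above allows any $h$): each tree is fit to the current residuals, and its leaf means already absorb $\rho_m = \beta_m$, so no separate line search is needed.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost(X, y, M=100, max_leaf_nodes=8):
    """LS boosting sketch: the negative gradient is the residual."""
    F0 = np.mean(y)
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):
        residual = y - F                       # ~y_i = y_i - F_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residual)
        F = F + tree.predict(X)                # rho_m = beta_m is absorbed in the leaf means
        trees.append(tree)
    return F0, trees

def ls_boost_predict(F0, trees, X):
    return F0 + sum(tree.predict(X) for tree in trees)
```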
4.2 Least-absolute-deviation (LAD) regression.
$L(y, F) = |y - F|$ and $\tilde y_i = \operatorname{sign}(y_i - F_{m-1}(x_i))$.
Therefore $h$ is fit (by least-squares) to the sign of the current residuals, and the line search becomes the weighted median $\rho_m = \operatorname{median}_W \left\{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; a_m)} \right\}_1^N$ with weights $w_i = |h(x_i; a_m)|$.
4.3 Regression trees
Here the base learner is a $J$-terminal-node regression tree, $h(x; \{R_{jm}\}_1^J) = \sum_{j=1}^J \bar y_{jm} \, 1(x \in R_{jm})$, where $\{R_{jm}\}_1^J$ are the regions defined by the terminal nodes at the $m$-th iteration and $\bar y_{jm}$ is the mean of the pseudo-responses in region $j$. The trees are constructed to predict the pseudo-responses $\{\tilde y_i\}_1^N$ by least-squares.
We can view this as adding J separate basis functions at each step. One can thus improve the quality of the fit by using the optimal coefficients for each of these separate basis functions:
$\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$ (the single line search over all data points is replaced by a separate line search within each region)
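A sketch of this per-node update for the LAD case, assuming the $m$-th tree has already been fit by least squares to the pseudo-responses $\operatorname{sign}(y_i - F_{m-1}(x_i))$; for absolute loss, the per-region line search above reduces to the median of the current residuals in each terminal node (sklearn's `apply()` recovers the terminal-node membership).

```python
import numpy as np

def lad_leaf_update(tree, X, y, F_prev):
    """gamma_jm = median of the residuals y_i - F_{m-1}(x_i) within each terminal node R_jm."""
    leaf = tree.apply(X)                                  # terminal-node id of each observation
    residual = y - F_prev
    gamma = {j: np.median(residual[leaf == j]) for j in np.unique(leaf)}
    return F_prev + np.array([gamma[j] for j in leaf])    # F_m evaluated at the training points
```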
Note:
In the case of LAD regression, why not directly minimize $\operatorname{tree}_m(x) = \arg\min_{J\text{-node tree}} \sum_{i=1}^N |y_i - F_{m-1}(x_i) - \operatorname{tree}(x_i)|$?
Using squared-error loss to search for splits is much faster than mean-absolute-deviation. Why? Squared-error split gains can be updated from running sums and counts in constant time per candidate split, whereas absolute-error gains require medians within each candidate partition (see CS289 lecture notes).
4.4 M-Regression
M-regression attempts resistance to long-tailed error distributions and outliers while maintaining high efficiency for normally distributed errors. Consider the Huber loss function:
$L(y, F) = \frac{1}{2}(y - F)^2$ if $|y - F| \le \delta$, and $\delta (|y - F| - \delta/2)$ otherwise.
The transition point $\delta$ defines those residual values that are considered to be outliers, subject to absolute rather than squared error loss. Common practice is to choose $\delta$ to be the $\alpha$-quantile of the distribution of $|y - F^*(x)|$, where $(1 - \alpha)$ controls the breakdown point (the fraction of observations that can be arbitrarily modified without seriously degrading the quality of the result). One can use $F_{m-1}$ as an approximation of $F^*$:
$\delta_m = \operatorname{quantile}_\alpha \{ |y_i - F_{m-1}(x_i)| \}_1^N$
With regression trees as base learners, we can do a separate update in each terminal node $R_{jm}$, as seen in 4.3.
The solution for each node can be approximated by a single step of the standard iterative procedure (see Huber 1964), starting at the median $\tilde r_{jm} = \operatorname{median}_{x_i \in R_{jm}} \{ r_{m-1}(x_i) \}$ of the current residuals $r_{m-1}(x_i) = y_i - F_{m-1}(x_i)$ in the node:
$\gamma_{jm} = \tilde r_{jm} + \frac{1}{N_{jm}} \sum_{x_i \in R_{jm}} \operatorname{sign}(r_{m-1}(x_i) - \tilde r_{jm}) \cdot \min(\delta_m, |r_{m-1}(x_i) - \tilde r_{jm}|)$
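A sketch of that single-step estimate, with `residual` holding the current residuals $r_{m-1}(x_i)$ of the observations falling in one terminal node and `delta` the $\alpha$-quantile $\delta_m$ computed over the whole sample (names are illustrative):

```python
import numpy as np

def huber_leaf_gamma(residual, delta):
    """One step of the iterative location estimate, started at the node median."""
    r_med = np.median(residual)                    # starting point ~r_jm
    d = residual - r_med
    return r_med + np.mean(np.sign(d) * np.minimum(delta, np.abs(d)))

# delta_m itself is the alpha-quantile of |y_i - F_{m-1}(x_i)| over all observations, e.g.
# delta = np.quantile(np.abs(y - F_prev), alpha)
```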
According to motivations underlying robust regression, this algorithm should have properties:
similar to that of least-squares boosting for normally distributed errors
similar to that of least-absolute-deviation regression with very long tailed distributions
superior to both for error distributions with only moderately long tails
4.5 Two-class logistic regression and classification
The loss function is the negative binomial log-likelihood: $L(y, F) = \log(1 + \exp(-2 y F))$, $y \in \{-1, 1\}$, where $F(x) = \frac{1}{2} \log \left[ \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} \right]$ (half the log-odds).
With regression trees as base learners, we again use separate updates in each terminal node $R_{jm}$. There is no closed-form solution to the minimization, so we approximate it by a single Newton-Raphson step (see Friedman, Hastie, Tibshirani, "Additive logistic regression: a statistical view of boosting", 2000):
$\gamma_{jm} = \sum_{x_i \in R_{jm}} \tilde y_i \Big/ \sum_{x_i \in R_{jm}} |\tilde y_i| (2 - |\tilde y_i|)$, where $\tilde y_i = 2 y_i / (1 + \exp(2 y_i F_{m-1}(x_i)))$ are the pseudo-responses.
The log-odds can be inverted, $\hat P(y = 1 \mid x) = 1 / (1 + e^{-2 F_M(x)})$, to get probability estimates, which can be used for classification:
$\hat y(x) = 2 \cdot 1\!\left[ c(-1, 1) \hat P(y = 1 \mid x) > c(1, -1) \hat P(y = -1 \mid x) \right] - 1$
where $c(\hat y, y)$ is the cost of predicting $\hat y$ when the truth is $y$.
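A sketch of the two-class pieces, assuming labels coded as $y \in \{-1, +1\}$ and `F_prev`/`F` holding the current half log-odds at the data points:

```python
import numpy as np

def logistic_pseudo_response(y, F_prev):
    """Negative gradient of log(1 + exp(-2 y F)): ~y_i = 2 y_i / (1 + exp(2 y_i F_{m-1}(x_i)))."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * F_prev))

def logistic_leaf_gamma(y_tilde_leaf):
    """Single Newton-Raphson step for the constant in one terminal node."""
    return y_tilde_leaf.sum() / (np.abs(y_tilde_leaf) * (2.0 - np.abs(y_tilde_leaf))).sum()

def prob_positive(F):
    """Invert the half log-odds: P(y = 1 | x) = 1 / (1 + exp(-2 F(x)))."""
    return 1.0 / (1.0 + np.exp(-2.0 * F))
```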
4.5.1 Influence trimming
Using the binomial log-likelihood, the empirical loss function at the $m$-th iteration is $\phi_m(\rho, a) = \sum_{i=1}^N \log \left( 1 + \exp(-2 y_i (F_{m-1}(x_i) + \rho h(x_i; a))) \right)$.
If $y_i F_{m-1}(x_i)$ is very large, then $\rho h(x_i; a)$ has almost no influence on the loss. This suggests that all observations $(y_i, x_i)$ for which $y_i F_{m-1}(x_i)$ is very large can be deleted from all computations of the $m$-th iteration.
$w_i = \exp(-2 y_i F_{m-1}(x_i))$
can be viewed as a measure of the influence of the $i$-th observation on the estimate $\rho_m h(x; a_m)$.
We can also use the second derivative of the loss with respect to F as wi.
Influence trimming deletes all observations with $w_i$-values less than $w_{l(\alpha)}$, where $l(\alpha)$ is the largest number of weights which, arranged in ascending order, sum to $\alpha$ times the total sum of weights (typical values are $\alpha \in [0.05, 0.2]$).
90% to 95% of the observations are often deleted without sacrificing accuracy, resulting in a corresponding reduction in computation by a factor of 10 to 20.
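A sketch of the trimming rule, assuming two-class labels in $\{-1, +1\}$; it drops the observations whose smallest weights sum to a fraction $\alpha$ of the total weight:

```python
import numpy as np

def influence_trim(y, F_prev, alpha=0.1):
    """Return a boolean mask keeping only observations with weight above w_{l(alpha)}."""
    w = np.exp(-2.0 * y * F_prev)                       # influence weights w_i
    order = np.argsort(w)                               # ascending
    cum = np.cumsum(w[order])
    l_alpha = np.searchsorted(cum, alpha * w.sum())     # number of smallest weights to drop
    keep = np.ones(len(y), dtype=bool)
    keep[order[:l_alpha]] = False
    return keep
```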
4.6 Multi-class logistic regression and classification
$L(\{y_k, F_k(x)\}_1^K) = -\sum_{k=1}^K y_k \log p_k(x)$
where $y_k = 1[\text{class} = k] \in \{0, 1\}$ and $p_k(x) = P(y_k = 1 \mid x)$.
$p_k(x) = \operatorname{softmax}(F)_k = \exp(F_k(x)) \Big/ \sum_{l=1}^K \exp(F_l(x))$
Thus, $K$ trees with $J$ terminal nodes each are induced at each iteration $m$ to predict the corresponding current residuals $\tilde y_{ik} = y_{ik} - p_{k, m-1}(x_i)$ for each class. Since the regions corresponding to different classes overlap, one cannot separate the calculation within each region of each tree. An approximate solution with a single Newton-Raphson step, using a diagonal approximation to the Hessian, decomposes the problem into a separate calculation for each terminal node of each tree:
$\gamma_{jkm} = \frac{K - 1}{K} \cdot \sum_{x_i \in R_{jkm}} \tilde y_{ik} \Big/ \sum_{x_i \in R_{jkm}} |\tilde y_{ik}| (1 - |\tilde y_{ik}|)$
The final probability estimates at step M can be used for classification:
$\tilde k(x) = \arg\min_{1 \le k \le K} \sum_{k'=1}^K c(k, k') \, p_{k' M}(x)$
For K=2, this algorithm is equivalent to binary logistic regression.
Influence trimming for multi-class is implemented the same way as for two-class.
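A sketch of the $K$-class pieces: softmax probabilities, the per-class residuals used as pseudo-responses, and the per-leaf Newton step with its $(K-1)/K$ factor; `F` is assumed to be an $N \times K$ array of the current $F_k(x_i)$ and `Y` the one-hot labels.

```python
import numpy as np

def softmax_probs(F):
    """p_k(x) = exp(F_k(x)) / sum_l exp(F_l(x)), computed row-wise on an (N, K) array."""
    e = np.exp(F - F.max(axis=1, keepdims=True))    # subtract the row max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def multiclass_pseudo_responses(Y, F):
    """Residuals ~y_ik = y_ik - p_k(x_i) (the negative gradients of the multinomial log-likelihood)."""
    return Y - softmax_probs(F)

def multiclass_leaf_gamma(y_tilde_leaf, K):
    """Single diagonal-Newton step for one terminal node of class k's tree."""
    num = y_tilde_leaf.sum()
    den = (np.abs(y_tilde_leaf) * (1.0 - np.abs(y_tilde_leaf))).sum()
    return (K - 1) / K * num / den
```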
5. Regularization
For additive expansions, a natural regularization is the number of components M. This places an implicit prior belief that "sparse" approximations involving fewer terms are likely to provide better predictions.
However, regularization through shrinkage provides superior results to restricting the number of components (Copas 1983).
A simple shrinkage strategy is to introduce a learning rate $\nu \in (0, 1]$:
$F_m(x) = F_{m-1}(x) + \nu \rho_m h(x; a_m)$
There is a trade-off between $\nu$ and $M$: decreasing $\nu$ increases the best value for $M$, and increasing $M$ increases computation time proportionately.
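In code, shrinkage is a one-line change to the update in the generic sketch above (illustrative names as before):

```python
def shrunken_update(F_prev, rho_m, h_m_x, nu=0.1):
    """F_m(x) = F_{m-1}(x) + nu * rho_m * h(x; a_m); a smaller nu typically requires a larger M."""
    return F_prev + nu * rho_m * h_m_x
```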
Misclassification error is not sensitive to over-fitting: the misclassification error rate keeps decreasing well after the logistic likelihood has reached its optimum. The error rate depends only on the sign of $F_M(x)$, whereas the likelihood is affected by both magnitude and sign. Over-fitting degrades the quality of the magnitude estimate without affecting (and sometimes improving) the sign.
7. Tree boosting
Meta-parameters of the GradientBoost procedure: M and ν
Meta-parameter of the base learner: number of terminal nodes J.
Best choice for J depends on the highest order of the dominant interactions among the variables.
The target function can be expanded as $F(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots$
The first sum is the "main effects" component of $F(x)$. It is referred to as an "additive" model because the contributions of the individual $x_j$, the $f_j(x_j)$, are added. It provides the closest such approximation to $F(x)$. This is a more restrictive definition of "additive" than the one provided at the beginning.
second sum: two-variable "interaction effects"
and so on...
Highest interaction order possible is limited by the number of input variables n.
Purely additive approximations are also produced by the "naive"-Bayes method.
In boosting regression trees, the interaction order can be controlled by limiting the size of the individual trees induced at each iteration:
a tree with J terminal nodes produces a function with interaction order at most min(J−1,n).
Since boosting is additive, order of the entire approximation can be no larger than the largest among its individual components.
Best tree size J is governed by the effective interaction order of the target F∗.
As this is usually unknown, $J$ becomes a meta-parameter to be estimated using a model selection criterion. It is unlikely that large trees would ever be necessary or desirable.
8. Interpretation
8.1 Relative importance of input variables
$I_j = \left( E_x \left[ \frac{\partial \hat F(x)}{\partial x_j} \right]^2 \cdot \operatorname{var}_x[x_j] \right)^{1/2}$
Does not strictly exist for piecewise constant approximations produced by decision trees. Breiman et al (1983) propose surrogate measure:
$\hat I_j^2(T) = \sum_{t=1}^{J-1} \hat i_t^2 \, 1(v_t = j)$
the summation is over the non-terminal nodes $t$ of the $J$-terminal-node tree $T$
$v_t$ is the splitting variable associated with node $t$
$\hat i_t^2$ is the empirical improvement in squared error as a result of the split
For a collection of decision trees $\{T_m\}_1^M$: $\hat I_j^2 = \frac{1}{M} \sum_{m=1}^M \hat I_j^2(T_m)$ (a generalization justified by heuristic arguments in the paper)
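A sketch of this surrogate measure for an ensemble of fitted sklearn regression trees (squared-error impurity), reading the split statistics directly off each `tree_` object:

```python
import numpy as np

def tree_importance_sq(est, n_features):
    """Per-variable sum of empirical squared-error improvements i_t^2 over the internal nodes of one tree."""
    t = est.tree_
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                                   # terminal node: no split
            continue
        w = t.weighted_n_node_samples
        gain = (w[node] * t.impurity[node]
                - w[left] * t.impurity[left]
                - w[right] * t.impurity[right])          # reduction in sum of squared errors
        imp[t.feature[node]] += gain
    return imp

def ensemble_importance(trees, n_features):
    """Average the squared importances over the M trees, then take the square root."""
    return np.sqrt(np.mean([tree_importance_sq(t, n_features) for t in trees], axis=0))
```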
8.2 Partial dependence plots
Useful for visualizing functions of high dimensional arguments.
Let $z_l$ be a chosen "target" subset of size $l$ of the input variables $x$:
$z_l = \{z_1, \ldots, z_l\} \subset \{x_1, \ldots, x_n\}$, and $z_{\setminus l}$ the complement subset.
One can condition $\hat F(x)$ on particular values of $z_{\setminus l}$ and consider it as a function only of $z_l$: $\hat F_{z_{\setminus l}}(z_l) = \hat F(z_l \mid z_{\setminus l})$
In general, $\hat F_{z_{\setminus l}}$ depends on the particular values chosen for $z_{\setminus l}$.
If this dependence is not too strong, the average function
$\bar F_l(z_l) = E_{z_{\setminus l}} [\hat F(x)] = \int \hat F(z_l, z_{\setminus l}) \, p_{\setminus l}(z_{\setminus l}) \, dz_{\setminus l}$
can represent a useful summary of the "partial dependence" of $\hat F(x)$ on $z_l$.
$p_{\setminus l}(z_{\setminus l})$ is the marginal probability density of $z_{\setminus l}$:
$p_{\setminus l}(z_{\setminus l}) = \int p(x) \, dz_l$, where $p(x)$ is the joint density of all the inputs $x$.
This complement marginal density can be estimated from the training data, giving finally:
$\bar F_l(z_l) = \frac{1}{N} \sum_{i=1}^N \hat F(z_l, z_{i, \setminus l})$
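This data-based estimate is simple to sketch directly, with `F_hat` any fitted prediction function and `target_idx` the column indices of $z_l$ (illustrative names):

```python
import numpy as np

def partial_dependence_brute(F_hat, X_train, target_idx, z_values):
    """(1/N) * sum_i F_hat(z_l, z_{i, complement}): clamp the target columns, keep each row's own complement values."""
    X = np.array(X_train, dtype=float)       # copy of the training inputs
    X[:, target_idx] = z_values
    return F_hat(X).mean()
```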
For regression trees based on single-variable splits, the partial dependence of $\hat F(x)$ can be evaluated without reference to the data. For a specific set of values for $z_l$, a weighted traversal is performed. At the root, a weight value of one is assigned. For each non-terminal node visited, if its split variable is in the target subset $z_l$, the weight is not modified and the appropriate single branch is followed. If it is in the complement subset $z_{\setminus l}$, both branches are followed, and the weight passed to each branch is multiplied by the fraction of training observations that went left or right, respectively. When the traversal is complete, $\bar F_l(z_l)$ is the weighted average over the $\hat F(x)$ values of the terminal nodes.
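A sketch of that weighted traversal for a single fitted sklearn regression tree (`est.tree_`), with `z_values` a dict mapping each target-feature index to its chosen value; the split statistics stored in the tree stand in for the training-data fractions:

```python
def tree_partial_dependence(tree_, z_values):
    """Weighted traversal: target-variable splits follow one branch with unchanged weight;
    complement-variable splits follow both branches with weights proportional to the observed branch fractions."""
    def recurse(node, weight):
        left, right = tree_.children_left[node], tree_.children_right[node]
        if left == -1:                                       # terminal node
            return weight * tree_.value[node][0, 0]
        j = tree_.feature[node]
        if j in z_values:                                    # target variable: follow the matching branch
            branch = left if z_values[j] <= tree_.threshold[node] else right
            return recurse(branch, weight)
        w = tree_.weighted_n_node_samples                    # complement variable: split the weight
        return (recurse(left, weight * w[left] / w[node]) +
                recurse(right, weight * w[right] / w[node]))
    return recurse(0, 1.0)

# Summing tree_partial_dependence over the boosted trees (times their coefficients, plus the
# initial constant) gives the partial dependence of the whole ensemble.
```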
When the dependence of $\hat F(x)$ on $z_l$ is additive or multiplicative, i.e. $\hat F(x) = \hat F_l(z_l) + \hat F_{\setminus l}(z_{\setminus l})$ or $\hat F(x) = \hat F_l(z_l) \cdot \hat F_{\setminus l}(z_{\setminus l})$, then $\bar F_l(z_l)$ provides a complete description of the nature of the variation of $\hat F(x)$ on the variable subset $z_l$.
Therefore, subsets that group together influential inputs that have complex (non-factorable) interactions between them will provide the most revealing partial dependence plots.
As a diagnostic, we can compute $\hat F_l(z_l)$ and $\hat F_{\setminus l}(z_{\setminus l})$ and compute the correlation of their sum/product with $\hat F(x)$.
Interpretability of larger trees is questionable (too unstable).
10. Data Mining
Gradient boosting of trees is robust: it is invariant under all (strictly) monotone transformations of the individual input variables. E.g. using $x_j$, $\log x_j$, $\exp x_j$, or $x_j^a$ as the $j$-th input variable yields the same result, so there is no need for variable transformations. As a consequence, sensitivity to long-tailed distributions and outliers in the inputs is also eliminated (why? because tree splits depend only on the ordering of the input values, and a monotone transformation such as log can shrink the spread between extreme points arbitrarily).
LAD- and M-regression are especially robust to outliers (absolute loss and Huber loss respectively).
Feature selection: trees are robust against the addition of irrelevant input variables. All tree-based models handle missing values in a unified and elegant manner (Classification And Regression Trees, Breiman et al 1983) (why? in CART this is done with surrogate splits, so a missing split variable is simply replaced by its best surrogate at that node). No need to consider external imputation schemes.
The principal disadvantage of single-tree models is inaccuracy due to their piecewise-constant approximations. This is largely solved by boosting (the approximation is still piecewise constant, but the granularity is much finer).