Filter methods
filter methods = only consider statistical characteristics of the data (without training a model)
Mutual information
https://thuijskens.github.io/2017/10/07/feature-selection/
$I(X;Y) = \int_X \int_Y p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$
If X and Y are completely unrelated (independent), then $p(x,y)=p(x)p(y)$ and the integral equals zero.
Problem formulation: find the optimal subset of features $\tilde{S}$ that maximizes the mutual information of $X_S$ with the target $y$, given a maximum cardinality $k$:
$\tilde{S} = \arg\max_{S} I(X_S; y) \quad \text{s.t.} \quad |S| = k$
Set of possible combinations of features grows exponentially. NP-hard optimisation problem.
Not well suited to low sample sizes or high-dimensional spaces (check the stability of the estimate on bootstrapped samples).
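For example, the stability of the estimate can be eyeballed with a quick bootstrap check; a minimal sketch, where the synthetic dataset and the number of resamples are arbitrary choices:

```python
# Quick bootstrap check of how stable the MI estimates are
# (the synthetic dataset and the number of resamples are arbitrary choices).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

scores = []
for _ in range(50):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
    scores.append(mutual_info_classif(X[idx], y[idx], random_state=0))
scores = np.array(scores)

# A large std relative to the mean signals an unstable estimate
# (typical with few samples or many features).
print(scores.mean(axis=0))
print(scores.std(axis=0))
```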
Greedy solution
At step $t$, select the next feature $f^{t+1}$ to add to the set such that:
$f^{t+1} = \arg\max_{i \notin S^t} I(X_{S^t \cup \{i\}}; y)$
Computing the joint mutual information involves integrating over a $t$-dimensional space ($X_{S^t}$), which is intractable.
Assumption 1
Selected features $X_S$ are independent and class-conditionally independent given the unselected candidate feature $x_i$ under consideration. We can then decompose:
$f^{t+1} = \arg\max_{i \notin S^t} \; I(x_i; y) - \left[ I(x_i; x_{S^t}) - I(x_i; x_{S^t} \mid y) \right]$
First term = relevancy
Second term = redundancy
Lower-dimensional approximation
$I(x_i; x_{S^t}) \approx \alpha \sum_{k=1}^{t} I(x_{f^k}; x_i)$
$I(x_i; x_{S^t} \mid y) \approx \beta \sum_{k=1}^{t} I(x_{f^k}; x_i \mid y)$
Assumption 2
All features are pairwise independent, i.e.: $p(x_i x_j) = p(x_i)\,p(x_j) \Rightarrow \sum_k I(x_j; x_k) = 0$
Assumption 3
All features are pairwise class-conditionally independent: $p(x_i x_j \mid y) = p(x_i \mid y)\,p(x_j \mid y) \Rightarrow \sum_k I(x_j; x_k \mid y) = 0$
- larger α indicates stronger belief in assumption 2
- larger β indicates stronger belief in assumption 3
Examples for specific values:
- Joint mutual information (JMI): $\alpha = \frac{1}{t}$, $\beta = \frac{1}{t}$
- Maximum relevancy minimum redundancy (mRMR): $\alpha = \frac{1}{t}$, $\beta = 0$
- Mutual info maximisation (MIM): $\alpha = \beta = 0$
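Putting the greedy step and the pairwise approximation together, a rough sketch of the selection loop. The helper names, the per-class approximation of the conditional MI, and the use of sklearn's MI estimators are my own choices, not from the source:

```python
# Rough sketch of the greedy MI-based selection loop (names and the per-class
# conditional-MI approximation are illustrative, not from the source).
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def pairwise_mi(a, b):
    """I(a; b) between two (continuous) feature columns."""
    return mutual_info_regression(a.reshape(-1, 1), b, random_state=0)[0]


def conditional_mi(a, b, y):
    """Approximate I(a; b | y) as sum_c p(y=c) * I(a; b | y=c)."""
    total = 0.0
    for c in np.unique(y):
        mask = y == c
        if mask.sum() > 5:              # need a few samples per class
            total += mask.mean() * pairwise_mi(a[mask], b[mask])
    return total


def greedy_mi_selection(X, y, k, alpha=1.0, beta=1.0):
    """Score: I(x_i;y) - [(alpha/t) sum_k I(x_fk;x_i) - (beta/t) sum_k I(x_fk;x_i|y)].

    alpha=beta=1 -> JMI, alpha=1, beta=0 -> mRMR, alpha=beta=0 -> MIM.
    """
    relevance = mutual_info_classif(X, y, random_state=0)   # I(x_i; y)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        t = max(len(selected), 1)

        def score(i):
            redundancy = (alpha / t) * sum(pairwise_mi(X[:, f], X[:, i]) for f in selected)
            conditional = (beta / t) * sum(conditional_mi(X[:, f], X[:, i], y) for f in selected)
            return relevance[i] - (redundancy - conditional)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```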
Mutual information can also be read as the reduction in entropy of the target after observing the feature: $I(X;y) = H(y) - H(y \mid X)$.
Implemented in sklearn as mutual_info_classif.
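Minimal usage sketch (the iris dataset and $k=2$ are placeholder choices):

```python
# Minimal usage sketch of sklearn's MI-based filter (dataset and k are placeholders).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(X_new.shape)  # (150, 2): the two features with the highest estimated I(x_i; y)
```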
Chi-squared test: categorical feature, categorical target
Assume a categorical target $y=(y_1,\dots,y_k)$ and a categorical feature $X=(X_1,\dots,X_d)$, both following a multinomial distribution (they contain the counts of each category) over $n$ data points. If the target is not categorical, bin it into intervals first.
Define the variable $Z_{ij}$ representing the contingency table between $X$ and $y$: $P(Z_{ij}) = P(X_i, y_j)$
Under the independence assumption ($H_0$): $P(Z_{ij}) = P(X_i)\,P(y_j) \Rightarrow E_0[Z_{ij}] = n\,P(X_i)\,P(y_j)$
When $n \to \infty$, $X_k/n$ should approach a normal distribution and the statistic:
$T = \sum_{i,j} \frac{(Z_{ij} - E_0[Z_{ij}])^2}{E_0[Z_{ij}]} \;\to\; \chi^2_{(k-1)(d-1)}$
There are $(k-1)(d-1)$ degrees of freedom.
The p-value is then $P(\chi^2_{(k-1)(d-1)} > t)$, where $t$ is the observed statistic.
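A quick sketch with scipy (the contingency table below is made-up example data); sklearn.feature_selection.chi2 applies the same idea to score non-negative features against a categorical target:

```python
# Sketch: chi-squared test of independence on a contingency table Z_ij
# (the counts are made-up example data).
import numpy as np
from scipy.stats import chi2_contingency

# rows = d = 3 feature categories, columns = k = 2 target categories
Z = np.array([[20, 30],
              [25, 25],
              [35, 15]])
stat, p_value, dof, expected = chi2_contingency(Z)
print(stat, p_value, dof)  # dof = (k-1)(d-1) = 1 * 2 = 2
```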
Chi-squared test: goodness of fit
Source: All of Statistics, chap. 10.8
We want to check whether the data comes from an assumed parametric model $\mathcal{F} = \{ f(x;\theta) : \theta \in \Theta \}$.
Divide data into k disjoint intervals and compute the probability that an observation falls into interval j under the assumed model:
$p_j(\theta) = \int_{I_j} f(x;\theta)\,dx$
Let $n$ be the number of data points and $N_j$ the number of observations that fall into $I_j$. The multinomial likelihood is:
$Q(\theta) = \prod_{j=1}^{k} p_j(\theta)^{N_j}$
Maximizing $Q(\theta)$ yields the estimate $\tilde{\theta}$.
Now define the test statistic: $Q = \sum_{j=1}^{k} \frac{(N_j - n\,p_j(\tilde{\theta}))^2}{n\,p_j(\tilde{\theta})}$
Let $H_0$ be the null hypothesis that the data are IID draws from the model $\mathcal{F}$. Under $H_0$, the statistic $Q$ converges in distribution to a $\chi^2_{k-1-|\theta|}$ random variable, where $|\theta|$ is the number of estimated parameters. The approximate p-value is $P(\chi^2_{k-1-|\theta|} > q)$, where $q$ is the observed value of $Q$.
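A minimal sketch of this test; the normal candidate model, the quantile binning, and the use of the ordinary MLE instead of the multinomial one are illustrative simplifications:

```python
# Sketch: chi-squared goodness-of-fit test of a fitted normal model.
# Normal model, quantile bins and ordinary MLE are illustrative simplifications.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)
n, k = len(data), 10

mu, sigma = stats.norm.fit(data)                     # theta~ = (mu, sigma)

edges = np.quantile(data, np.linspace(0, 1, k + 1))  # k disjoint intervals I_j
N_j = np.histogram(data, bins=edges)[0]              # observed counts

cdf_edges = edges.copy()
cdf_edges[0], cdf_edges[-1] = -np.inf, np.inf        # outer intervals cover the tails
p_j = np.diff(stats.norm.cdf(cdf_edges, loc=mu, scale=sigma))   # p_j(theta~)

Q = np.sum((N_j - n * p_j) ** 2 / (n * p_j))
dof = k - 1 - 2                                      # k - 1 - |theta|, 2 fitted params
print(Q, stats.chi2.sf(Q, dof))                      # statistic and approximate p-value
```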
Limitations:
- if we fail to reject the null, we don't know if it's because the model is good or the test didn't have enough power
- doesn't provide a ranking of the features
- better to use nonparametric methods whenever possible
Fisher's score
pass
ANOVA: categorical feature, continuous target
PCA, CCA https://www.slideshare.net/VishalPatel321/feature-reduction-techniques
Wrapper methods
- wrapper methods = use learning algo and select relevant features based on performance
- Better predictive accuracy but computationally more expensive
Forward feature selection
Greedily add features from the empty set until best performance is achieved.
Backward feature elimination
Greedily remove features from the entire set.
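Both of the above are available in sklearn as SequentialFeatureSelector; a minimal sketch, where the pipeline, dataset and number of selected features are placeholder choices:

```python
# Sketch: greedy forward / backward selection via sklearn's SequentialFeatureSelector
# (the pipeline, dataset and n_features_to_select are placeholder choices).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# direction="backward" gives backward feature elimination instead
sfs = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```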
Exhaustive feature selection (brute force)
Evaluate each subset of features.
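A brute-force sketch with itertools, only workable for a handful of features; the estimator and dataset are placeholder choices:

```python
# Brute-force sketch: score every feature subset with cross-validation
# (estimator and dataset are placeholders; cost grows as 2^d).
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
best_score, best_subset = -np.inf, None
for r in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), r):
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset
print(best_subset, best_score)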
Recursive feature elimination
- Start with the entire feature set
- Compute feature importances
- Prune the least important features
- Repeat until the desired number of features is reached (see the sketch below)
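A minimal sketch with sklearn's RFE; the estimator, step size and target number of features are placeholder choices:

```python
# Sketch: recursive feature elimination with sklearn's RFE
# (estimator, step and n_features_to_select are placeholder choices).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=10, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 = kept; larger values were pruned earlier
```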
Embedded methods
I.e. feature selection is part of the model
L1 regularization
(includes LASSO, LARS)
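A minimal sketch of L1-based selection with SelectFromModel; the LassoCV estimator and the regression dataset are placeholder choices:

```python
# Sketch: L1-based selection with SelectFromModel
# (LassoCV and the diabetes regression dataset are placeholder choices).
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X, y = load_diabetes(return_X_y=True)
selector = SelectFromModel(LassoCV(cv=5))  # keeps features whose coefficient is not shrunk to ~0
X_new = selector.fit_transform(X, y)
print(selector.get_support(indices=True), X_new.shape)
```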
Tree pruning based on feature importance
Random feature
- Add random (noise) features to the data and train a decision tree / random forest
- Look at the feature importances and keep only the features that rank above the noise features (see the sketch below)
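A rough sketch of this trick; the number of noise columns and the forest settings are my own choices:

```python
# Sketch of the random-feature trick: append noise columns, train a forest,
# keep only the real features whose importance beats the best noise column
# (number of noise columns and forest settings are illustrative choices).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_noise = 5
X_aug = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_aug, y)
importances = model.feature_importances_
noise_level = importances[-n_noise:].max()             # best-scoring noise column
kept = np.where(importances[: X.shape[1]] > noise_level)[0]
print(kept)
```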