Feature Selection

Filter methods

filter methods = only consider statistical characteristics of the data (without training a model)

Mutual information


I(X;Y) = \int_X\int_Y p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,dx\,dy

If X and Y are independent, then p(x,y) = p(x)p(y) and the integral equals zero.
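A quick numeric check of this property, computing MI directly from a discrete joint probability table (the distributions below are illustrative):

```python
import numpy as np

def mutual_information(p_xy):
    """Discrete mutual information (in nats) from a joint table p(x, y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = p_xy > 0                        # 0 * log(0) terms contribute nothing
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

# Independent joint: p(x, y) = p(x) p(y)  ->  MI = 0
px = np.array([0.3, 0.7])
py = np.array([0.5, 0.5])
print(mutual_information(np.outer(px, py)))   # ≈ 0.0

# Perfectly dependent: X = Y  ->  MI = H(X)
print(mutual_information(np.diag([0.5, 0.5])))  # log(2) ≈ 0.693
```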

Problem formulation: find the optimal subset of features \tilde S that maximizes the mutual information of X_S with the target y, given a maximum cardinality k:

\tilde S = \arg\max_S I(X_S; y) \quad \text{s.t. } \lvert S\rvert = k

The set of possible feature combinations grows exponentially, so this is an NP-hard optimisation problem.

Not adapted to low sample sizes or high-dimensional spaces (check the stability of the estimate on bootstrapped samples).

Greedy solution

At step t, select the next feature f_{t+1} to add to the set such that:

f_{t+1} = \arg\max_{i \notin S^t} I(X_{S^t \cup \{i\}}; y)

Computing the joint mutual information involves integrating over a t-dimensional space (X_{S^t}), which is intractable.

Assumption 1

Selected features X_{S^t} are independent and class-conditionally independent given the unselected feature x_i under consideration. We can then decompose:

f_{t+1} = \arg\max_{i \notin S^t} \; I(x_i; y) - \big[ I(x_i; x_{S^t}) - I(x_i; x_{S^t} \mid y) \big]

First term = relevancy

Second term = redundancy

Lower-dimensional approximation

I(x_i; x_{S^t}) \approx \alpha \sum_{k=1}^t I(x_{f_k}; x_i)

I(x_i; x_{S^t} \mid y) \approx \beta \sum_{k=1}^t I(x_{f_k}; x_i \mid y)

Assumption 2

All features are pairwise independent, i.e.: p(x_i, x_j) = p(x_i)p(x_j) \Rightarrow \sum I(X_j; X_k) = 0

Assumption 3

All features are pairwise class-conditionally independent: p(x_i, x_j \mid y) = p(x_i \mid y)p(x_j \mid y) \Rightarrow \sum I(X_j; X_k \mid y) = 0

  • a larger \alpha indicates stronger belief in assumption 2
  • a larger \beta indicates stronger belief in assumption 3

Examples for specific values:

  • Joint mutual information: \alpha = \frac{1}{t}, \beta = \frac{1}{t}
  • Maximum relevancy minimum redundancy: \alpha = \frac{1}{t}, \beta = 0
  • Mutual info maximisation: \alpha = \beta = 0
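The greedy criterion with \alpha = \frac{1}{t}, \beta = 0 (mRMR) can be sketched for discrete features; the synthetic data and the `mrmr` helper are illustrative, not a library API:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Greedy mRMR for discrete features: relevance I(x_i; y) minus the
    mean pairwise redundancy I(x_i; x_f) over already-selected features f."""
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(X.shape[1]):
            if i in selected:
                continue
            relevance = mutual_info_score(X[:, i], y)
            redundancy = (np.mean([mutual_info_score(X[:, i], X[:, f])
                                   for f in selected]) if selected else 0.0)
            if relevance - redundancy > best_score:
                best, best_score = i, relevance - redundancy
            # (strict > keeps the first feature in case of ties)
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
informative = y ^ (rng.random(500) < 0.05)   # noisy copy of the target
duplicate = informative.copy()               # fully redundant with `informative`
noise = rng.integers(0, 2, 500)
X = np.column_stack([noise, informative, duplicate])
print(mrmr(X, y, 2))  # informative feature first; its duplicate is then penalised away
```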

Reduction in entropy

mutual_info_classif in sklearn
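A minimal usage sketch on synthetic data (only feature 0 carries signal, by construction):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)  # only feature 0 matters

mi = mutual_info_classif(X, y, random_state=0)
print(mi)  # the estimate for feature 0 should dominate

# SelectKBest turns the scores into an actual feature filter
selector = SelectKBest(mutual_info_classif, k=1).fit(X, y)
print(selector.get_support())
```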

Chi-squared test: categorical feature, categorical target

Assume a categorical target y = (y_0, \dots, y_k) and a categorical feature X = (X_1, \dots, X_d), both following a multinomial distribution (they contain the counts of each category) over n data points. If a variable is not multinomial, bin it into intervals first.

Define the variable Z_{ij} representing the contingency table between X and y: P(Z_{ij}) = P(X_i, y_j)

Under the independence assumption (H_0): P(Z_{ij}) = P(X_i)P(y_j) \Rightarrow \mathbb{E}_0[Z_{ij}] = n P(X_i) P(y_j)

When n \to \infty, the normalised counts Z_{ij}/n approach a normal distribution and the statistic:

T = \sum_{i,j} \frac{(Z_{ij} - \mathbb{E}_0[Z_{ij}])^2}{\mathbb{E}_0[Z_{ij}]} \to \chi^2_{(k-1)(d-1)}

There are (k-1)(d-1) degrees of freedom.

The p-value is then \mathbb{P}\big(\chi^2_{(k-1)(d-1)} > t\big), where t is the observed statistic.
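scipy implements exactly this test on a contingency table (the table below is made up; for feature selection, `sklearn.feature_selection.chi2` wraps the same idea):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table Z between a 3-category feature and a 2-class target
Z = np.array([[30, 10],
              [25, 15],
              [ 5, 35]])

stat, p_value, dof, expected = chi2_contingency(Z)
print(dof)      # (3-1)*(2-1) = 2
print(p_value)  # small -> reject the independence hypothesis H0
```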

Chi-squared test: goodness of fit

Source: All of Statistics, chap. 10.8

We want to check whether the data comes from an assumed parametric model \mathcal{F} = \{f(x;\theta) : \theta \in \Theta\}.

Divide the data into k disjoint intervals and compute the probability that an observation falls into interval I_j under the assumed model:

p_j(\theta) = \int_{I_j} f(x;\theta)\,dx

Let n be the number of data points and N_j the number of observations that fall into I_j. The multinomial likelihood is:

Q(\theta) = \prod_{j=1}^k p_j(\theta)^{N_j}

Maximizing Q(\theta) yields estimates \tilde\theta.

Now define the test statistic: Q = \sum_{j=1}^k \frac{(N_j - n p_j(\tilde\theta))^2}{n p_j(\tilde\theta)}

Let H_0 be the null hypothesis that the data are IID draws from the model \mathcal{F}. Under H_0, the statistic Q converges in distribution to a \chi^2_{k-1-\lvert\theta\rvert} random variable. The approximate p-value is \mathbb{P}(\chi^2_{k-1-\lvert\theta\rvert} > q), where q is the observed value of Q.


  • if we fail to reject the null, we don't know whether the model is good or the test simply lacked power
  • the test gives no ranking of the features, however
  • better to use nonparametric methods whenever possible
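The recipe above (bin, fit, compare observed vs expected counts) can be sketched with scipy; the normal model and the choice of 10 quantile bins are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

# Fit the assumed model, then bin the data into k = 10 roughly equiprobable intervals
mu, sigma = data.mean(), data.std()
edges = np.quantile(data, np.linspace(0, 1, 11))
edges[0], edges[-1] = -np.inf, np.inf              # cover the whole real line
observed, _ = np.histogram(data, bins=edges)
expected = len(data) * np.diff(stats.norm(mu, sigma).cdf(edges))

# ddof=2 accounts for the 2 estimated parameters: dof = k - 1 - |theta| = 7
stat, p = stats.chisquare(observed, expected, ddof=2)
print(p)  # large -> no evidence against the normal model
```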

Fisher's score


ANOVA: categorical feature, continuous target

PCA, CCA https://www.slideshare.net/VishalPatel321/feature-reduction-techniques

Wrapper methods

  • wrapper methods = use a learning algorithm and select relevant features based on its performance
  • Better predictive accuracy but computationally more expensive

Forward feature selection

Greedily add features from the empty set until best performance is achieved.

Backward feature elimination

Greedily remove features from entire set.
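Both directions are available in sklearn's `SequentialFeatureSelector`; the estimator and dataset below are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# direction="forward" greedily adds features; "backward" greedily removes them.
# Each candidate step is scored by cross-validated performance (cv=5).
sfs = SequentialFeatureSelector(model, n_features_to_select=2,
                                direction="forward", cv=5).fit(X, y)
print(sfs.get_support())  # boolean mask of the 2 kept features
```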

Exhaustive feature selection (brute force)

Evaluate each subset of features.

Recursive feature elimination

  • Start with the entire set
  • compute feature importances
  • prune the least important features
  • re-iterate until the desired number of features is reached
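These steps are what sklearn's `RFE` implements (the synthetic dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# step=1: refit the model and drop the single least important feature per iteration
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=3, step=1).fit(X, y)
print(rfe.support_)   # kept features
print(rfe.ranking_)   # 1 = kept; higher = eliminated earlier
```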

Embedded methods

I.e. feature selection is part of the model

L1 regularization

(includes LASSO, LARS)
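A sketch of LASSO as an embedded selector on synthetic data (only 2 of the 8 features carry signal, by construction; the alpha value is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# The L1 penalty drives irrelevant coefficients exactly to zero;
# standardizing first makes the penalty act evenly across features.
lasso = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y)
print(lasso.coef_)
print(np.flatnonzero(lasso.coef_))  # indices of the surviving features
```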

Tree pruning based on feature importance

Random feature

  • Train a tree-based model with an added random (noise) feature
  • look at feature importances and keep only the features ranked above the noise feature
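A sketch of the probe trick, using a random forest rather than a single tree (the dataset is synthetic; with `shuffle=False` the informative features are columns 0-2):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Append a pure-noise probe column; any feature it outranks is suspect
rng = np.random.default_rng(0)
X_probe = np.column_stack([X, rng.normal(size=len(X))])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_probe, y)
importances = forest.feature_importances_
threshold = importances[-1]              # importance of the noise probe
keep = np.flatnonzero(importances[:-1] > threshold)
print(keep)                              # should include the informative columns 0-2
```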