Feature Selection

Filter methods

filter methods = only consider statistical characteristics of the data (without training a model)

Mutual information

https://thuijskens.github.io/2017/10/07/feature-selection/

$$I(X;Y) = \int_X \int_Y p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy$$

If $X$ and $Y$ are completely unrelated (independent), then $p(x,y) = p(x)p(y)$ and the integral equals zero.

Problem formulation: find the optimal subset of features $\tilde S$ that maximizes the mutual information of $X_S$ with the target $y$, given a maximum cardinality $k$:

$$\tilde S = \arg\max_S I(X_S; y) \quad \text{s.t. } \lvert S \rvert = k$$

The set of possible feature subsets grows exponentially with the number of features, so this is an NP-hard optimisation problem.

Mutual information estimates are not well adapted to low sample sizes or high-dimensional spaces (check the stability of the estimate on bootstrapped samples).

Greedy solution

At step $t$, select the next feature $f_{t+1}$ to add to the set such that:

$$f_{t+1} = \arg\max_{i \not\in S^t} I(X_{S^t \cup \{i\}}; y)$$

Computing the joint mutual information involves integrating over a $t$-dimensional space ($X_{S^t}$), which is intractable.

Assumption 1

Selected features $X_S$ are independent and class-conditionally independent given the unselected feature $X_k$ under consideration. We can then decompose:

$$f_t = \arg\max_{i \not\in S^t} \; I(x_i; y) - \big[ I(x_i; x_{S^t}) - I(x_i; x_{S^t} \vert y) \big]$$

First term = relevancy

Second term = redundancy

Lower-dimensional approximation:

$$I(x_i; x_{S^t}) \approx \alpha \sum_{k=1}^t I(x_{f_k}; x_i)$$

$$I(x_i; x_{S^t} \vert y) \approx \beta \sum_{k=1}^t I(x_{f_k}; x_i \vert y)$$

Assumption 2

All features are pairwise independent, i.e. $p(x_i x_j) = p(x_i)p(x_j) \Rightarrow \sum I(X_j; X_k) = 0$

Assumption 3

All features are pairwise class-conditionally independent: $p(x_i x_j \vert y) = p(x_i \vert y) p(x_j \vert y) \Rightarrow \sum I(X_j; X_k \vert y) = 0$

  • larger $\alpha$ indicates stronger belief in assumption 2
  • larger $\beta$ indicates stronger belief in assumption 3

Examples for specific values:

  • Joint mutual information: $\alpha = \frac{1}{t}$, $\beta = \frac{1}{t}$
  • Maximum relevancy minimum redundancy (mRMR): $\alpha = \frac{1}{t}$, $\beta = 0$ (see the sketch below)
  • Mutual information maximisation: $\alpha = \beta = 0$
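A minimal sketch of the greedy criterion above for the mRMR case ($\beta = 0$), assuming discrete features so that pairwise mutual information can be estimated with sklearn's mutual_info_score; the function name and defaults are illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def greedy_mrmr(X, y, k, alpha=None):
    """Greedily select k discrete features maximising relevancy minus redundancy.

    At each step, pick the feature i maximising
    I(x_i; y) - alpha * sum_{f in selected} I(x_f; x_i)   (i.e. beta = 0).
    """
    n_features = X.shape[1]
    # Relevancy term I(x_i; y), computed once per feature
    relevance = np.array([mutual_info_score(X[:, i], y) for i in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while len(selected) < k:
        # alpha = 1/t by default, matching the mRMR weighting above
        a = alpha if alpha is not None else 1.0 / max(len(selected), 1)
        scores = []
        for i in remaining:
            redundancy = sum(mutual_info_score(X[:, f], X[:, i]) for f in selected)
            scores.append(relevance[i] - a * redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

For continuous features, the pairwise mutual information estimates would have to be replaced, e.g. by binning first or by using sklearn's nearest-neighbour MI estimators.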

Reduction in entropy: mutual information can also be read as the reduction in the entropy of the target after observing the feature, $I(X; y) = H(y) - H(y \vert X)$.

mutual_info_classif in sklearn (mutual_info_regression for continuous targets)
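A short usage sketch with SelectKBest (the dataset and k are just an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest estimated mutual information with y
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.scores_)  # one mutual information estimate per feature
```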

Chi-squared test: categorical feature, categorical target

Assume a categorical target $y = (y_0, \dots, y_k)$ and a categorical feature $X = (X_1, \dots, X_d)$, both following a multinomial distribution (they contain the counts of each category) over $n$ data points. If a variable is not categorical, bin it into intervals first.

Define the variable $Z_{ij}$ representing the contingency table between $X$ and $y$: $P(Z_{ij}) = P(X_i, y_j)$

Under the independence assumption ($H_0$): $P(Z_{ij}) = P(X_i)P(y_j) \Rightarrow \mathbb{E}_0[Z_{ij}] = n P(X_i) P(y_j)$

When $n \to \infty$, the normalised counts $X_k/n$ approach a normal distribution and the statistic:

$$T = \sum_{i,j} \frac{(Z_{ij} - \mathbb{E}_0[Z_{ij}])^2}{\mathbb{E}_0[Z_{ij}]} \to \chi^2_{(k-1)(d-1)}$$

There are $(k-1)(d-1)$ degrees of freedom.

The $p$-value is then $\mathbb{P}\big(\chi^2_{(k-1)(d-1)} > t\big)$, where $t$ is the observed value of the statistic.
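A small sketch with scipy's chi2_contingency, which computes exactly this statistic from the contingency table (the counts below are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table Z_ij: rows = feature categories, columns = target classes
Z = np.array([[30, 10],
              [20, 25],
              [10, 35]])

t, p_value, dof, expected = chi2_contingency(Z)
print(t, p_value, dof)  # statistic, p-value, (k-1)(d-1) degrees of freedom
```

For feature selection directly, sklearn's SelectKBest with the chi2 score function applies the same test to each non-negative (count-like) feature column.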

Chi-squared test: goodness of fit

Source: All of Statistics, chap. 10.8

We want to check whether the data come from an assumed parametric model $\mathcal{F} = \{f(x;\theta) : \theta \in \Theta\}$.

Divide the data into $k$ disjoint intervals $I_1, \dots, I_k$ and compute the probability that an observation falls into interval $j$ under the assumed model:

$$p_j(\theta) = \int_{I_j} f(x;\theta) \, dx$$

Let $n$ be the number of data points and $N_j$ the number of observations that fall into $I_j$. The multinomial likelihood is:

$$Q(\theta) = \prod_{j=1}^k p_j(\theta)^{N_j}$$

Maximizing $Q(\theta)$ yields the estimate $\tilde\theta$.

Now define the test statistic: $$Q = \sum_{j=1}^k \frac{(N_j - n p_j(\tilde\theta))^2}{n p_j(\tilde\theta)}$$

Let $H_0$ be the null hypothesis that the data are IID draws from the model $\mathcal{F}$. Under $H_0$, the statistic $Q$ converges in distribution to a $\chi^2_{k-1-\lvert\theta\rvert}$ random variable, where $\lvert\theta\rvert$ is the number of estimated parameters. The approximate $p$-value is given by $\mathbb{P}(\chi^2_{k-1-\lvert\theta\rvert} > q)$, where $q$ is the observed value of $Q$.
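A rough numerical sketch assuming a Gaussian model $\mathcal{F}$; the bin count, sample size, and the use of the raw-data MLE (as an approximation to the binned-likelihood estimate $\tilde\theta$) are arbitrary choices here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)
n = data.size

# Fit the assumed model (Gaussian) by maximum likelihood on the raw data,
# used here as an approximation to the binned-likelihood estimate.
mu, sigma = stats.norm.fit(data)

# Divide the range of the data into k disjoint intervals I_1, ..., I_k
k = 10
edges = np.linspace(data.min(), data.max(), k + 1)
N_j, _ = np.histogram(data, bins=edges)

# p_j(theta): probability of each interval under the fitted model
cdf = stats.norm.cdf(edges, loc=mu, scale=sigma)
cdf[0], cdf[-1] = 0.0, 1.0  # fold the tails into the outer bins
p_j = np.diff(cdf)

# Test statistic Q and approximate p-value with k - 1 - |theta| degrees of freedom
Q = np.sum((N_j - n * p_j) ** 2 / (n * p_j))
p_value = stats.chi2.sf(Q, df=k - 1 - 2)  # 2 estimated parameters (mu, sigma)
print(Q, p_value)
```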

Limitations:

  • if we fail to reject the null, we don't know if it's because the model is good or the test didn't have enough power
  • it does not provide a ranking of the features, however
  • better to use nonparametric methods whenever possible

Fisher's score

pass

ANOVA: categorical feature, continuous target
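A minimal sketch of the one-way ANOVA F-test between a categorical feature and a continuous target, using scipy (the data are simulated):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
feature = rng.integers(0, 3, size=300)         # categorical feature with 3 levels
target = rng.normal(size=300) + 0.5 * feature  # continuous target, mean shifts per level

# One group of target values per feature category
groups = [target[feature == level] for level in np.unique(feature)]
f_stat, p_value = f_oneway(*groups)
print(f_stat, p_value)  # small p-value => group means differ, i.e. the feature is informative
```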

PCA, CCA https://www.slideshare.net/VishalPatel321/feature-reduction-techniques

Wrapper methods

  • wrapper methods = use a learning algorithm and select relevant features based on model performance
  • Better predictive accuracy but computationally more expensive

Forward feature selection

Starting from the empty set, greedily add the feature that most improves performance until no further improvement (or the desired number of features) is reached.

Backward feature elimination

Starting from the full set, greedily remove the feature whose removal hurts performance the least.
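Both greedy directions are implemented in scikit-learn's SequentialFeatureSelector; a minimal sketch (estimator, dataset, and number of features are just examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# direction="forward" adds features greedily, direction="backward" removes them
sfs = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```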

Exhaustive feature selection (brute force)

Evaluate every possible subset of features and keep the best-performing one ($2^d$ subsets for $d$ features).
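A brute-force sketch with cross-validation over every subset, only feasible for a handful of features (estimator and dataset are illustrative):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -np.inf, None
for r in range(1, n_features + 1):
    for subset in combinations(range(n_features), r):
        # Score each candidate subset with 5-fold cross-validation
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset
print(best_subset, best_score)
```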

Recursive feature elimination

  • Start with the entire feature set
  • Compute feature importances
  • Prune the least important feature(s)
  • Repeat until the desired number of features remains (see the sketch below)
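scikit-learn implements this as RFE; a small sketch (the estimator and target count are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly drop the least important feature until 5 remain
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.get_support(indices=True))  # selected feature indices
print(rfe.ranking_)                   # 1 = selected, higher = eliminated earlier
```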

Embedded methods

I.e., feature selection happens as part of model training.

L1 regularization

(includes LASSO, LARS)
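A minimal sketch using a Lasso inside SelectFromModel, keeping only the features with non-zero coefficients (the regularisation strength is arbitrary here):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives some coefficients to exactly zero; those features are dropped
lasso = Lasso(alpha=0.1).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
X_reduced = selector.transform(X)
print(selector.get_support(indices=True))  # indices of the surviving features
```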

Tree pruning based on feature importance

Random feature

  • Add one or more purely random (noise) features to the dataset and train a decision tree or random forest
  • Look at the feature importances and only keep the real features ranked above the noise features (see the sketch below)
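A sketch of the noise-feature trick with a random forest; keeping only real features more important than the best noise column is one common convention, not a fixed recipe.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

# Append a few purely random (noise) columns to the real features
n_noise = 3
X_aug = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
importances = forest.feature_importances_

# Keep only the real features whose importance exceeds the best noise feature
noise_max = importances[-n_noise:].max()
keep = np.where(importances[:-n_noise] > noise_max)[0]
print(keep)
```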