feature-selection

filter methods = only consider statistical characteristics of the data (without training a model)

https://thuijskens.github.io/2017/10/07/feature-selection/

$I(X;Y) = \int_X\int_Y p(x,y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right)dx\,dy$

If $X$ and $Y$ are completely unrelated (independent) then $p(x,y) = p(x)p(y)$, the log term vanishes, and the integral equals zero.

Problem formulation: find the optimal subset of features $\tilde S$ that maximizes mutual info of $X_S$ with the target $y$ given a maximum cardinality $k$:

$\tilde S = \arg \max_S I(X_S;y)\text{ s.t. }\lvert S\rvert = k$

The set of possible feature subsets grows exponentially with the number of features, so this is an NP-hard optimisation problem.

Mutual information estimates are not well adapted to low sample sizes or high-dimensional spaces (check the stability of the estimate on bootstrapped samples).

At step $t$, select the next feature $f_{t+1}$ to add to the set such that:

$f_{t+1}=\arg\max_{i\not\in S^t} I(X_{S^t \cup \{i\}}; y)$

Computing joint mutual information involves integrating over a $t$-dimensional space ($X_{S^t}$), which is intractable.

**Assumption 1**

Selected features $x_{S^t}$ are independent and class-conditionally independent given the candidate feature $x_i$ under consideration. We can then decompose:

$f_{t+1}=\arg\max_{i\not\in S^t}\; I(x_i;y) - \big[I(x_i;x_{S^t}) - I(x_i;x_{S^t}\vert y)\big]$

First term = relevancy

Second term = redundancy

**lower-dimensional approximation**

$I(x_i; x_{S^t}) \approx \alpha \sum_{k=1}^t I(x_{f_k};x_i)$

$I(x_i; x_{S^t}\vert y) \approx \beta \sum_{k=1}^t I(x_{f_k};x_i\vert y)$

**Assumption 2**

All features are pairwise independent, i.e.: $p(x_i, x_j) = p(x_i) p(x_j)\Rightarrow \sum_{k=1}^t I(x_{f_k}; x_i) = 0$

**Assumption 3**

All features are pairwise class-conditionally independent: $p(x_i, x_j \vert y) = p(x_i\vert y)p(x_j \vert y) \Rightarrow \sum_{k=1}^t I(x_{f_k};x_i\vert y) = 0$

- larger $\alpha$ indicates stronger belief in assumption 2
- larger $\beta$ indicates stronger belief in assumption 3

Examples for specific values:

- Joint mutual information: $\alpha=\frac{1}{t}, \beta=\frac{1}{t}$
- Maximum relevancy minimum redundancy: $\alpha=\frac{1}{t}, \beta=0$
- Mutual info maximisation: $\alpha=\beta=0$
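A minimal sketch of the greedy criterion in its mRMR flavour ($\alpha = 1/t$, $\beta = 0$), using sklearn's MI estimators for the relevancy and redundancy terms; the `mrmr_select` helper and the synthetic dataset are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    """Greedy mRMR: at each step pick the feature maximising
    relevancy I(x_i; y) minus mean redundancy with the selected set."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        def score(i):
            if not selected:
                return relevance[i]
            # mean pairwise MI with already-selected features (the 1/t factor)
            red = mutual_info_regression(X[:, selected], X[:, i],
                                         random_state=0).mean()
            return relevance[i] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)
print(mrmr_select(X, y, k=3))
```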

Mutual information can also be read as the reduction in entropy of $y$ from observing $x$: $I(x;y) = H(y) - H(y\vert x)$

`mutual_info_classif` in sklearn
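A quick usage sketch on the iris dataset (the feature names are just for readability):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
# one MI score per feature; higher = more informative about the class
mi = mutual_info_classif(X, y, random_state=0)
names = ["sepal length", "sepal width", "petal length", "petal width"]
for name, score in zip(names, mi):
    print(f"{name}: {score:.3f}")
```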

Assume a **categorical target** $y$ taking $k$ values and a **categorical feature** $X$ taking $d$ values, both summarized by their category counts (multinomial distributions) over $n$ data points. If a variable is continuous, bin it into intervals first.

Define variable $Z_{ij}$ representing the contingency table between $X$ and $y$: $P(Z_{ij}) = P(X_i, y_j)$

Under independence assumption ($H_0$): $P(Z_{ij}) = P(X_i)P(y_j) \Rightarrow \mathbb{E}_0[Z_{ij}] = nP(X_i)P(y_j)$

When $n \to \infty$, the normalized counts $Z_{ij}/n$ approach a normal distribution (multinomial CLT) and the statistic:

$T = \sum_{i,j} \frac{(Z_{ij} - \mathbb{E}_0[Z_{ij}])^2}{\mathbb{E}_0[Z_{ij}]} \to \chi^2_{(k-1)(d-1)}$

There are $(k-1)(d-1)$ degrees of freedom.

The $p$-value is then: $\mathbb{P}\big(\chi^2_{(k-1)(d-1)} > t\big)$ where $t$ is the observed statistic.
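A sketch with `scipy.stats.chi2_contingency` on a made-up $2\times 3$ contingency table $Z_{ij}$:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical counts of (feature category, class) pairs
Z = np.array([[30, 14, 6],
              [10, 26, 24]])

# expected = n * P(X_i) * P(y_j) under the independence null
T, p, dof, expected = chi2_contingency(Z)
print(f"T = {T:.2f}, dof = {dof}, p = {p:.4f}")
```

Here $k = 2$ and $d = 3$, so the statistic has $(2-1)(3-1) = 2$ degrees of freedom.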

Source: *All of statistics, chap. 10.8*

We want to check whether the data comes from an assumed parametric model $\mathcal{F}=\{f(x;\theta):\theta\in\Theta\}$.

Divide data into $k$ disjoint intervals and compute the probability that an observation falls into interval $j$ under the assumed model:

$p_j(\theta)=\int_{I_j} f(x;\theta)dx$

Let $n$ be the number of data points and $N_j$ the number of observations that fall into $I_j$. The multinomial likelihood is:

$Q(\theta)=\prod_{j=1}^k p_j(\theta)^{N_j}$

Maximizing $Q(\theta)$ yields estimates $\tilde\theta$.

Now define the test statistic: $Q=\sum_{j=1}^k \frac{(N_j-np_j(\tilde\theta))^2}{np_j(\tilde\theta)}$

Let $H_0$ be the null hypothesis that data are IID draws from model $\mathcal{F}$. Under $H_0$, the statistic $Q$ converges in distribution to a $\chi^2_{k-1-\lvert \theta\rvert}$ random variable, where $\lvert \theta\rvert$ is the number of estimated parameters. The approximate p-value is $\mathbb{P}(\chi^2_{k-1-\lvert \theta\rvert} > q)$ where $q$ is the observed value of $Q$.
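A sketch of the binned goodness-of-fit test, assuming $\theta$ is fully known rather than estimated (so $\lvert\theta\rvert = 0$ and there are $k-1$ degrees of freedom); `scipy.stats.chisquare` computes $Q$ and the p-value:

```python
import numpy as np
from scipy.stats import norm, chisquare

rng = np.random.default_rng(0)
data = rng.normal(size=500)  # pretend this is the observed sample

# k disjoint intervals with equal probability p_j = 1/k under the assumed N(0,1) model
k = 10
edges = norm.ppf(np.linspace(0, 1, k + 1))   # quantile-based interval edges
N_j = np.histogram(data, bins=edges)[0]      # observed counts per interval
expected = np.full(k, len(data) / k)         # n * p_j(theta)

Q, p = chisquare(N_j, expected)              # dof = k - 1 since |theta| = 0
print(f"Q = {Q:.2f}, p = {p:.3f}")
```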

Limitations:

- if we fail to reject the null, we don't know whether the model fits well or the test simply lacked power
- the test does not produce a ranking of the features
- better to use nonparametric methods whenever possible


PCA, CCA https://www.slideshare.net/VishalPatel321/feature-reduction-techniques

- wrapper methods = use a learning algorithm and select relevant features based on model performance
- better predictive accuracy, but computationally more expensive

Forward selection: greedily add features starting from the empty set until performance stops improving.

Backward elimination: greedily remove features starting from the full set.
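Both greedy directions are available in sklearn as `SequentialFeatureSelector`; a minimal sketch (the estimator and dataset choices are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# forward: grow from the empty set; backward: prune from the full set
forward = SequentialFeatureSelector(est, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(est, n_features_to_select=2,
                                     direction="backward").fit(X, y)
print(forward.get_support(), backward.get_support())
```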

Exhaustive search: evaluate every possible subset of features.

Recursive feature elimination:

- Start with the entire feature set
- compute feature importances
- prune the least important features
- re-iterate until the desired number of features is reached
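This pruning loop is implemented by sklearn's `RFE`; a minimal sketch with a made-up dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
# drop the single least important feature each round until 3 remain
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=3, step=1).fit(X, y)
print(rfe.support_)   # mask of kept features
print(rfe.ranking_)   # 1 = kept; higher = eliminated earlier
```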

Embedded methods: feature selection is part of training the model itself.

(includes LASSO, LARS)

- Train a decision tree (or forest) with added random noise ("shadow") features
- look at feature importances and keep only the features that score above the noise features
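A rough sketch of this shadow-feature idea (a simplified Boruta-style heuristic; the dataset, forest settings, and threshold rule are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# append pure-noise "shadow" features as an importance baseline
noise = rng.normal(size=(X.shape[0], 6))
X_aug = np.hstack([X, noise])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
imp = forest.feature_importances_
threshold = imp[6:].max()              # best importance achieved by noise
keep = np.where(imp[:6] > threshold)[0]  # real features that beat the noise
print(keep)
```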