Detecting Data Drift

Types

Four types of drift:

  • concept drift: change in $P(y\vert x)$, a shift in the relationship between inputs and the target output. If we have real-time ground truth labels, a change in performance is the strongest indicator
  • prediction drift: change in $P(\hat{y}\vert x)$
  • label drift: change in $P(y)$, a shift in the ground truth distribution
  • feature drift: change in $P(X)$

"Virtual concept drift" has no influence on performance:

concept_drift.png

Causes

  • externalities cause real change in data distribution
  • data integrity issues due to data engineering or data collection

How to detect data drift?

There is no ground truth during inference, so we cannot use performance metrics like AUC, precision, recall, etc.

Population Stability Index (PSI)

Mostly used in the financial industry

Take one training population and one scoring population. Divide each into, say, $n$ equal buckets, denoted $\{B_1^\mathcal{D}, \dots, B_n^\mathcal{D}\}$ where $\mathcal{D}$ is either the training or scoring dataset. For bucket $i$, let $P_i^t$ denote the percentage of training data in bucket $B_i^\text{training}$ and $P_i^s$ the percentage of scoring data in bucket $B_i^\text{scoring}$.

$$\text{PSI} = \sum_{1 \leq i \leq n} (P_i^s - P_i^t)\cdot \ln \frac{P_i^s}{P_i^t}$$

  • PSI < 0.1: no significant population change
  • 0.1 < PSI < 0.2: slight population change
  • PSI >= 0.2: significant population change
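
As a sketch, PSI can be computed in a few lines of NumPy. The equal-frequency buckets derived from training quantiles, the bucket count, and the epsilon smoothing for empty buckets are all implementation choices here, not part of the definition:

```python
import numpy as np

def psi(training, scoring, n_buckets=10):
    """Population Stability Index between a training and a scoring sample."""
    # Bucket edges from training quantiles, widened so every scoring value
    # falls into some bucket.
    edges = np.quantile(training, np.linspace(0, 1, n_buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    counts_t, _ = np.histogram(training, bins=edges)
    counts_s, _ = np.histogram(scoring, bins=edges)

    # Convert counts to percentages; epsilon avoids log(0) on empty buckets.
    eps = 1e-6
    p_t = np.clip(counts_t / counts_t.sum(), eps, None)
    p_s = np.clip(counts_s / counts_s.sum(), eps, None)

    return float(np.sum((p_s - p_t) * np.log(p_s / p_t)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
score = rng.normal(0.5, 1, 10_000)  # shifted scoring population
print(psi(train, score))            # above 0.2 => significant change
```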

Kullback-Leibler divergence

Average number of bits wasted by encoding events from $P$ using a code based on $Q$:

$$D_{KL}(P\Vert Q) = \sum_{x\in \chi} P(x) \log \frac{P(x)}{Q(x)}$$

Standard values?
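
A quick sketch of estimating KL divergence from two samples by binning them on a shared grid; the grid, bin count, and smoothing constant are assumptions of this sketch. `scipy.stats.entropy(p, q)` computes $\sum_x p(x)\log\frac{p(x)}{q(x)}$ and normalizes its inputs to sum to 1:

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
p_sample = rng.normal(0, 1, 10_000)    # reference sample
q_sample = rng.normal(0.3, 1, 10_000)  # drifted sample

# Shared bin edges so both histograms are comparable.
edges = np.histogram_bin_edges(np.concatenate([p_sample, q_sample]), bins=20)
p, _ = np.histogram(p_sample, bins=edges)
q, _ = np.histogram(q_sample, bins=edges)

# Smooth to avoid an infinite divergence when a bin of q is empty.
p, q = p + 1e-9, q + 1e-9
print(entropy(p, q, base=2))  # base 2 => bits, matching the coding view
```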

Jensen-Shannon divergence

Based on KL divergence, except it is symmetric and always has a finite value

$$JSD(P \Vert Q) = \frac{1}{2}D_{KL}(P\Vert M) + \frac{1}{2}D_{KL}(Q\Vert M), \quad M=\frac{1}{2}(P+Q)$$
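
SciPy exposes this as `scipy.spatial.distance.jensenshannon`, which returns the JS *distance* (the square root of the divergence), so square it to recover JSD; the toy distributions below are purely illustrative:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.10, 0.40, 0.50])
q = np.array([0.80, 0.15, 0.05])

# jensenshannon returns sqrt(JSD); with base=2 the divergence lies in [0, 1].
jsd = jensenshannon(p, q, base=2) ** 2
print(jsd)
```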

Kolmogorov-Smirnov test

Nonparametric test (does not assume a specific distribution)

Largest absolute difference between two distribution functions:

$$D = \sup_x \lvert F_\text{training}(x) - F_\text{scoring}(x)\rvert$$

where the empirical distribution function for $n$ i.i.d. samples is given by: $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{X_i \in (-\infty, x]}$

If the scoring distribution comes from the reference distribution, then $D \to 0$ almost surely as $n\to \infty$.

Recall: a sequence $X_n$ converges almost surely towards $X$ if $P(\lim_{n\to \infty} X_n = X) = 1$.
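
In practice the two-sample version is available as `scipy.stats.ks_2samp`; the drifted sample below is synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training = rng.normal(0, 1, 5_000)
scoring = rng.normal(0.1, 1.2, 5_000)  # mildly drifted sample

# statistic is D = sup_x |F_training(x) - F_scoring(x)|; a small p-value
# rejects the hypothesis that both samples share a distribution.
result = ks_2samp(training, scoring)
print(result.statistic, result.pvalue)
```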

Frameworks

scikit-multiflow for streaming data.
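
For example, a minimal sketch using scikit-multiflow's ADWIN detector on a univariate stream; the synthetic stream and its change point at index 1000 are illustrative:

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN()
rng = np.random.default_rng(0)

# Synthetic stream: the mean shifts at index 1000.
stream = np.concatenate([rng.normal(0, 1, 1000), rng.normal(2, 1, 1000)])

for i, value in enumerate(stream):
    adwin.add_element(value)
    if adwin.detected_change():
        print(f"Change detected at index {i}")
```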

Papers to read

  • https://arxiv.org/pdf/1010.4784.pdf
  • https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2279&rep=rep1&type=pdf
  • https://link.springer.com/chapter/10.1007/978-3-540-28645-5_29