calibration

Source

Calibration refers to improving our model such that the predicted probability distribution matches the probabilities actually observed in the data.

Calibration plot (=q-q plot)

Assume we are predicting probabilities for 2 classes $\{-1, 1\}$.

To produce a calibration plot

  • sort by predicted probability $\hat p_i = \hat P(y=1 \vert X_i)$
  • define bins $B_i$ between $0$ and $1$ and compute $\hat p(B_i) = \frac{\sum_{k,\, \hat p_k \in B_i} \mathbb{I}_{y_k = 1}}{\lvert B_i \rvert}$
  • plot $\hat p(B_i)$ against the expected true probability for the bin, $p(B_i)$.

How do we derive $p(B_i)$? Think of it as first sampling a probability $p \sim \mathcal{U}[0,1]$, then, if $p \in [l, r]$, sampling a random variable $X \sim \mathrm{Bernoulli}(p)$ taking values in $\{0, 1\}$.

What is the fraction of positives (i.e. the expectation of $X$) as $p$ varies between $l$ and $r$?

$$\mathbb{E}_{p\sim\mathcal{U}[0,1]}\big[\mathbb{E}_p[X] \,\vert\, l \leq p \leq r\big] = \int_{p=l}^{r} \frac{p}{P(p\in[l,r])}\, dp = \Big[\tfrac{1}{2}p^2\Big]_l^r \big/ (r-l) = \frac{1}{2}\frac{r^2-l^2}{r-l} = \frac{r+l}{2}$$
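As a sanity check, the conditional expectation above can be verified by simulation (a minimal sketch; the bin edges `l` and `r` are arbitrary choices):

```python
import numpy as np

# Verify E[X | l <= p <= r] = (l + r) / 2 by Monte Carlo simulation.
rng = np.random.default_rng(0)
l, r = 0.2, 0.5

p = rng.uniform(0, 1, size=1_000_000)
p = p[(l <= p) & (p <= r)]            # condition on p falling in the bin
x = rng.uniform(size=p.shape) < p     # X ~ Bernoulli(p)
print(x.mean())                       # close to (l + r) / 2 = 0.35
```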

A perfectly calibrated model produces the identity line:

perfect_calibration.jpeg

Pseudo-code:


```python
import math

import numpy as np
import pandas as pd
import seaborn as sns


def plot_calibration_curve_binary(label, y_proba):
    # Expected true proba within a bin [l, r]; per the derivation above
    # this is the bin midpoint (l + r) / 2.
    def expected_proba(l, r):
        return (l + r) / 2

    df = pd.DataFrame({
        "label": label,
        "y_proba": y_proba,
    })

    # Sturges' rule of thumb for the number of bins N given n data points:
    # N = ceil(log2(n) + 1)
    N = math.ceil(math.log2(len(df)) + 1)
    # N + 1 edges give N bins
    bins = pd.cut(df["y_proba"], bins=np.linspace(0, 1, N + 1))

    calibration_df = df.groupby(bins, observed=False).agg(
        mean=("label", "mean"), count=("label", "count")
    )
    calibration_df["expected_proba"] = [
        expected_proba(interval.left, interval.right)
        for interval in calibration_df.index
    ]

    sns.scatterplot(
        x=calibration_df["expected_proba"], y=calibration_df["mean"], label="y_proba"
    )
    sns.lineplot(x=[0, 1], y=[0, 1], color="red")  # perfect calibration line
    sns.despine()  # remove top and right spines
```

Sigmoid / Platt calibration

Logistic regression on our model output:

$$\hat P_\text{new}(y \vert X) = \frac{1}{1 + \exp\big(-[\alpha \hat P(y\vert X) + \beta]\big)}$$

Optimized over $\alpha$ and $\beta$.
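A minimal sketch of this fit, assuming SciPy is available and using the log-loss as the objective; `platt_fit` and `platt_predict` are illustrative names, and in practice the parameters should be fit on a held-out calibration set:

```python
import numpy as np
from scipy.optimize import minimize


def platt_fit(p_hat, y):
    """Fit alpha, beta by minimizing the log-loss of
    sigmoid(alpha * p_hat + beta) against labels y in {0, 1}."""
    def nll(params):
        alpha, beta = params
        z = alpha * np.asarray(p_hat) + beta
        # -log sigmoid(z) = logaddexp(0, -z); -log(1 - sigmoid(z)) = logaddexp(0, z)
        return np.mean(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))
    return minimize(nll, x0=[1.0, 0.0]).x


def platt_predict(p_hat, alpha, beta):
    """Calibrated probabilities from the fitted sigmoid."""
    return 1 / (1 + np.exp(-(alpha * np.asarray(p_hat) + beta)))
```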

Isotonic regression

Let $(\hat p_1, p_1), \dots, (\hat p_n, p_n)$ be pairs of predicted and observed probabilities. Isotonic regression seeks a weighted least-squares fit $\hat p_i^\text{new} \approx p_i$ s.t. $\hat p_i^\text{new} \leq \hat p_j^\text{new}$ whenever $\hat p_i \leq \hat p_j$.

The objective is: $\min \sum_{i=1}^n w_i (\hat p_i^\text{new} - p_i)^2 \text{ s.t. } \hat p_1^\text{new} \leq \dots \leq \hat p_n^\text{new}$, assuming the $\hat p_i$'s are sorted in increasing order.

This yields a piecewise-constant, non-decreasing function. The objective is solved with the pool adjacent violators algorithm. See these notes.
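A minimal sketch of the pool adjacent violators algorithm (pure Python, uniform weights by default; variable names are illustrative):

```python
def pava(y, w=None):
    """Pool adjacent violators: weighted least-squares fit of a
    non-decreasing sequence to y."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    # Each block holds [weighted mean, total weight, number of points pooled].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    # Expand the blocks back into a full-length fitted sequence.
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return out
```

For example, `pava([1, 3, 2, 4])` pools the violating pair `3, 2` into their mean, giving the piecewise-constant fit `[1, 2.5, 2.5, 4]`.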