Online Learning Evaluation (Max Halford)


Online models

  • Online models (models trained using SGD) are usually weaker than batch models when trained on the same amount of data.
  • However, this discrepancy tends to get smaller as the size of the training data increases.
  • Online and batch models aren't meant to solve the same problem: batch models are meant to be used when you can afford to retrain your model from scratch every so often. Online models, on the contrary, are meant to be used when you want your model to learn from a stream of data, and therefore never have to restart from scratch.
  • Learning from a stream of data is something a batch model can't do (see the sketch after this list), and it is very different from the usual train/test split paradigm that machine learning practitioners are used to. In fact, there are other ways to evaluate the performance of an online model that make more sense than, say, cross-validation.
  • Cross-validation assumes that the model is static. For online models, progressive validation provides a better measure of the model's performance over its lifetime.
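
Not from the article, but a minimal sketch of the contrast between the two regimes, assuming scikit-learn's `SGDClassifier` and a synthetic binary classification stream (both are my choices, not the author's):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stream of 1,000 instances with 5 features (assumed for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Batch regime: retrain from scratch on all accumulated data every so often.
batch_model = SGDClassifier(random_state=42).fit(X, y)

# Online regime: the model updates itself one instance at a time and is never rebuilt.
online_model = SGDClassifier(random_state=42)
for x_i, y_i in zip(X, y):
    online_model.partial_fit(x_i.reshape(1, -1), [y_i], classes=[0, 1])
```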

Progressive validation

  • Given a new instance $(x_i, y_i)$, produce a prediction $\hat{y}_i$ and update the running metric (see the sketch after this list).
  • Common metrics such as accuracy, MSE, AUC, etc. can be updated online.
  • The model is trained and validated on all the data in a single pass.
  • The same processing produces both the learning signal and the validation signal.
  • Models based on the gradient of a loss function have to compute a prediction anyway, in which case progressive validation can essentially be performed for free.
  • Progressive validation can be overly optimistic if the data contains seasonal patterns. In a real setting, there might be a delay between observing $x_i$ and receiving $y_i$, as in click-through rate prediction (meaning the distribution of $y_i$ can change during this delay). Plain progressive validation has access to both $x_i$ and $y_i$ at the same time.
  • The gold standard for handling delays is to keep a log file with event times.
  • When no log file is available, use delayed progressive validation: add a delay variable $d$ (constant, dynamic, a random variable, ...) and feed the ground truth to the model only when it would become available in a production setting. That way progressive validation reproduces the real sequence of events (see the sketch below the plot).
  • An exponential moving average of the performance metric can uncover cyclic patterns in how the model performs over time, as the plot further down illustrates.
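
A minimal sketch of the predict-then-learn loop described above, assuming the `river` library and one of its bundled binary classification datasets (neither is named in these notes):

```python
from river import datasets, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)  # 1. predict on the incoming instance
    metric.update(y, y_pred)       # 2. update the running metric with the ground truth
    model.learn_one(x, y)          # 3. only then let the model learn from (x, y)

print(metric)  # running accuracy over the whole stream, computed in a single pass
```

The loop mirrors the point above: a gradient-based model needs a prediction for its update anyway, so the validation signal comes almost for free.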

[Figure: progressive_validation.svg — exponential moving average of the model's performance over time]
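
A hedged sketch of delayed progressive validation combined with an exponential moving average of the per-instance accuracy (the kind of curve the plot above refers to). The constant delay of 50 instances, the smoothing factor, and the use of `river` are all assumptions made for illustration:

```python
import collections
from river import datasets, linear_model, metrics, preprocessing

D = 50        # assumed: y_i only becomes available D instances after x_i
ALPHA = 0.05  # assumed EMA smoothing factor

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()
ema = None
pending = collections.deque()  # instances waiting for their ground truth

for i, (x, y) in enumerate(datasets.Phishing()):
    # Predict as soon as x_i arrives and score the prediction ...
    y_pred = model.predict_one(x)
    metric.update(y, y_pred)
    correct = float(y_pred == y)
    ema = correct if ema is None else ALPHA * correct + (1 - ALPHA) * ema
    pending.append((i, x, y))

    # ... but only let the model learn from ground truths that would already be
    # available in production. Labels still "in flight" when the stream ends are
    # simply never used for learning.
    while pending and pending[0][0] <= i - D:
        _, x_old, y_old = pending.popleft()
        model.learn_one(x_old, y_old)

print(metric)
print(f"EMA of per-instance accuracy: {ema:.3f}")
```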