1. Generative pre-training

Language modeling objective on unlabeled data using auto-regressive model:

$$L_1(\{u_1, \dots, u_n\}) = \sum_i \log P(u_i \vert u_{i-k}, \dots, u_{i-1}; \Theta)$$

where $k$ is the size of the context window and the $u_i$ are the tokens in the corpus.
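As a minimal sketch of how this objective is evaluated, the snippet below sums the log-probability of each token given at most $k$ preceding tokens. The names `lm_log_likelihood`, `cond_prob`, and `uniform_model` are hypothetical, and the uniform toy model is only a stand-in for the actual network parameterised by $\Theta$.

```python
import math

def lm_log_likelihood(tokens, k, cond_prob):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the token sequence."""
    total = 0.0
    for i, token in enumerate(tokens):
        # The context holds at most k previous tokens (fewer at the start).
        context = tuple(tokens[max(0, i - k):i])
        total += math.log(cond_prob(token, context))
    return total

# Toy stand-in for the model (an assumption for illustration only):
# a uniform distribution over a 4-word vocabulary, ignoring the context.
def uniform_model(token, context):
    return 0.25

corpus = ["the", "cat", "sat", "down"]
l1 = lm_log_likelihood(corpus, k=2, cond_prob=uniform_model)
print(l1)  # 4 * log(0.25), about -5.545
```

Training maximises this quantity, so a model that assigns higher probability to the observed next tokens yields a value closer to zero.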

BooksCorpus dataset is used for training (7,000 unique unpublished books from a variety of genres). It contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.

The 1B Word Benchmark used by ELMo is approximately the same size but is shuffled at a sentence level - destroying long-range structure.

2. Discriminative fine-tuning

Task-specific input adaptation and corresponding supervised objective

$$L_2(\mathcal{D}) = \sum_{(x,y)} \log P(y \vert x^1, \dots, x^m)$$

where $P(y \vert x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$. $h_l^m$ is the final transformer block's activation and $W_y$ is a task-specific parameter learned during fine-tuning.

Including language modeling as an auxiliary objective during fine-tuning improves generalization: $L_3 = L_2 + \lambda L_1$
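The classification term and the combined objective can be sketched in plain Python as follows. This is an illustrative toy, not the original implementation: the helper names, the 2-dimensional activation `h`, the weights `W_y`, the placeholder value for $L_1$, and the choice `lam=0.5` are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def class_log_prob(h, W_y, label):
    """log softmax(h W_y)[label], with h of length d and W_y of shape d x classes."""
    num_classes = len(W_y[0])
    logits = [sum(h[d] * W_y[d][c] for d in range(len(h)))
              for c in range(num_classes)]
    return math.log(softmax(logits)[label])

def combined_objective(l2, l1, lam):
    """L_3 = L_2 + lambda * L_1."""
    return l2 + lam * l1

# Toy values (assumptions): one example, two classes.
h = [1.0, -1.0]                  # stand-in for the final block's activation
W_y = [[0.5, 0.0], [0.0, 0.5]]   # stand-in for the task-specific weights
l2 = class_log_prob(h, W_y, label=0)
l3 = combined_objective(l2, l1=-5.0, lam=0.5)  # l1 is a placeholder LM term
print(l2, l3)
```

In practice $L_2$ sums `class_log_prob` over all labeled pairs $(x, y)$, and $L_1$ is computed on the same fine-tuning text.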

Pre-trained model is trained on contiguous sequences of text, thus inputs for fine-tuning tasks need to be adapted to a traversal-style approach:

[Figure: input transformations]

Embeddings for the delimiter tokens are new parameters introduced during fine-tuning.
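One way to picture the traversal-style adaptation is as simple token-list concatenation with start, delimiter, and extract markers. The sketch below is an assumption about how such transformations could look: the marker strings and function names are hypothetical, and the exact per-task layouts are not spelled out in these notes.

```python
# Hypothetical marker strings; real vocabularies may use different symbols.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    """Premise and hypothesis joined into one contiguous sequence."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(text_a, text_b):
    """Similarity has no inherent order, so both orderings are produced."""
    return [entailment_input(text_a, text_b),
            entailment_input(text_b, text_a)]

def multiple_choice_inputs(context, answers):
    """One sequence per candidate answer; each is scored independently."""
    return [[START, *context, DELIM, *answer, EXTRACT] for answer in answers]

print(entailment_input(["it", "rains"], ["it", "is", "wet"]))
```

Each resulting sequence is fed through the pre-trained model, and the activation at the final (extract) position is what the task-specific layer consumes.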


  • multi-layer Transformer decoder
  • provides more structured memory for handling long-term dependencies than attention-augmented RNNs.


(improvements are absolute)

  • 86.5 / +8.9% on commonsense reasoning (Stories Cloze Test)
  • 59 / +5.7% on question answering (RACE)
  • 81.4 / +1.5% on textual entailment (MultiNLI) (judge relationship as entailment, contradiction or neutral)
  • 72.8 / +5.5% on GLUE multi-task benchmark

Larger fine-tuning datasets benefit from the language model auxiliary objective but smaller datasets do not.

The Transformer beats LSTM-based architectures on almost all datasets.


  • Zero-shot behavior: perform task without supervised fine-tuning
  • earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model; word embeddings were adopted later
  • used the ftfy library to fix broken Unicode and the spaCy tokenizer to tokenize the text
  • Matthews correlation coefficient: a measure of the quality of binary classification. Computed from the confusion matrix and regarded as a balanced measure that can be used even under class imbalance (where it is more informative than the F1 score). See Wikipedia.
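To make the last point concrete, here is a small sketch that computes the Matthews correlation coefficient from confusion-matrix counts and compares it with F1 on an imbalanced toy case. The function names and counts are made up for illustration. Returning 0 when the denominator is zero is a common convention, assumed here.

```python
import math

def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): report 0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0

def f1_from_counts(tp, fp, fn):
    """F1 score from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Imbalanced toy data: 95 positives, 5 negatives, and a classifier that
# predicts "positive" for everything.
tp, fp, fn, tn = 95, 5, 0, 0
f1 = f1_from_counts(tp, fp, fn)
mcc = mcc_from_counts(tp, fp, fn, tn)
print(f1, mcc)  # F1 is about 0.974, MCC is 0.0
```

The trivial classifier looks strong under F1 but scores 0 under MCC, which is why MCC is considered the more balanced measure here.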