Summary:
Two model variants: a transformer encoder and a deep averaging network (DAN; Iyyer et al., 2015)
A single encoding model is used to feed multiple downstream tasks
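The deep averaging network variant can be sketched in a few lines: average the token embeddings, then pass the average through feed-forward layers to get a fixed-size sentence vector. This is a minimal illustrative sketch, assuming a toy vocabulary and randomly initialized weights, not the trained parameters of any released model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and dimensions, for illustration only.
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
embed_dim, hidden_dim = 8, 8
E = rng.normal(size=(len(vocab), embed_dim))    # token embedding table
W1 = rng.normal(size=(embed_dim, hidden_dim))   # feed-forward layer 1
W2 = rng.normal(size=(hidden_dim, hidden_dim))  # feed-forward layer 2

def dan_encode(tokens):
    """Deep averaging network: average the token embeddings,
    then transform the average with feed-forward layers."""
    avg = E[[vocab[t] for t in tokens]].mean(axis=0)
    h = np.tanh(avg @ W1)
    return np.tanh(h @ W2)

vec = dan_encode(["the", "cat", "sat"])
```

The resulting fixed-size vector is what a single encoder can hand to multiple downstream task heads.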
Note: switching to a SentencePiece (subword) vocabulary instead of a word-level vocabulary significantly reduces vocabulary size, which is a major contributor to model size (good for on-device or browser-based implementations)