Bag of tricks for efficient text classification, 2016 (FAIR). pdf

Baselines for text classification:

- neural nets (though slow training and inference)
- linear classifiers such as logistic regression or SVM trained on bag of words (BoW) offers strong baseline and have potential to scale to very large corpus.

Goal: scale these baselines to very large corpus and output space

Result: linear models with a rank constraint and a fast loss approximation can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art.

They can train on 1.5B tokens in 10 minutes.

## Model architecture

A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier (logistic regression or an SVM). However, linear classifiers do not share parameters among features and classes. `Q: why?`

This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices or to use multilayer neural networks.

Minimize the negative log likelihood:

$\frac{-1}{N}\sum_{n=1}^N y_n \log(\text{softmax}(BAx_n))$

where:

- the first weight matrix $A$ is a look-up table over the words (embeddings table).
- the word representations are then averaged into a text representation, which is in turn fed to a linear classifier.
- softmax computes a probability distribution over the classes.
- trained on multiple CPUs with SGD and a linearly decaying learning rate.

## Hierarchical softmax

Complexity of softmax is $O(\text{n classes} \times \text{embed dim})$

Hierarchical softmax is based on Huffman coding tree, computational complexity drops to $O(\text{embed dim} \times \log_2(\text{n classes}))$. `Q: how does hierarchical softmax work again?`

## N-gram features

Bag of words is invariant to word order but taking explicitly this order into account is often computationally very expensive. Instead, we use a bag of n-grams as additional features to capture some partial information about the local word order.

We maintain a fast and memory efficient mapping of the n-grams by using the *hashing trick*. `Q: ref?`

In practice, they use 10 hidden units, bigrams for an added 1-4% performance.