Bag of Tricks for Efficient Text Classification, 2016 (FAIR) [pdf]
Baselines for text classification:
Goal: scale these baselines to very large corpora and output spaces
Result: linear models with a rank constraint and a fast loss approximation can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art.
They can train on 1.5B tokens in 10 minutes.
A simple and efficient baseline for sentence classification is to represent sentences as a bag of words (BoW) and train a linear classifier (logistic regression or an SVM). However, linear classifiers do not share parameters among features and classes. Q: why? A: each class gets its own independent weight vector over the features, so nothing learned for one class transfers to another.
This possibly limits their generalization in the context of large output spaces where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low-rank matrices or to use multilayer neural networks.
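For concreteness, a minimal sketch of the BoW-plus-linear-classifier baseline using scikit-learn; the toy corpus, labels, and default hyperparameters are illustrative and not from the paper.

```python
# Minimal sketch of the BoW + linear classifier baseline (not the paper's code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, made up for illustration.
texts = ["great movie, loved it", "terrible plot and acting", "what a fantastic film"]
labels = [1, 0, 1]  # 1 = positive, 0 = negative

# Bag of words -> one independent weight vector per class: no parameters are
# shared across classes, which is the limitation discussed above.
baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["loved the acting"]))
```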
Minimize the negative log likelihood over the $N$ documents:
$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(BAx_n)\big)$$
where $x_n$ is the normalized bag of features of the $n$-th document, $y_n$ its label, $A$ is the word lookup matrix, $B$ the classifier weight matrix, and $f$ the softmax function.
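A small numpy sketch of the rank-constrained model behind this loss: $A$, $B$, and $f$ match the symbols above, while the sizes, the initialization, and the toy document are assumptions made for illustration.

```python
import numpy as np

V, h, k = 10000, 10, 5                     # vocab size, hidden size (the rank), classes (illustrative)
rng = np.random.default_rng(0)
A = rng.normal(scale=0.01, size=(V, h))    # word lookup matrix (stored transposed vs. the equation)
B = rng.normal(scale=0.01, size=(h, k))    # classifier weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(word_ids):
    """Averaging the word vectors gives A x_n, with x_n the normalized BoW vector."""
    hidden = A[word_ids].mean(axis=0)
    return softmax(hidden @ B)             # f(B A x_n)

def nll(word_ids, label):
    """One term of the negative log likelihood above."""
    return -np.log(forward(word_ids)[label])

print(nll([3, 17, 256], label=2))          # hypothetical document of three word ids
```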
Complexity of the softmax is $O(kh)$, where $k$ is the number of classes and $h$ the dimension of the text representation.
Hierarchical softmax is based on a Huffman coding tree; the computational complexity drops to $O(h \log_2 k)$. Q: how does hierarchical softmax work again? A: the classes are the leaves of a binary tree, each internal node has its own parameters, and $P(\text{class})$ is the product of the binary (sigmoid) decisions along the path from the root to that leaf, so only the nodes on one path are evaluated instead of all $k$ outputs.
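To make the answer above concrete, a sketch of hierarchical softmax scoring a single class over a Huffman tree; the class frequencies, hidden size, and random parameters are all illustrative, not the paper's.

```python
import heapq
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative class frequencies; fastText builds the Huffman tree from label counts.
freqs = {"sports": 50, "politics": 30, "tech": 15, "music": 5}
h = 10  # hidden dimension

# Build a Huffman tree: entries are (freq, tiebreak, node), where a node is either
# a class name (leaf) or a (left, right) pair (internal node).
heap = [(f, i, name) for i, (name, f) in enumerate(freqs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    f1, _, n1 = heapq.heappop(heap)
    f2, _, n2 = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, counter, (n1, n2)))
    counter += 1
root = heap[0][2]

# Record, for every class, its root-to-leaf path as (internal node id, go-left?) pairs.
paths, node_ids = {}, {}
def collect(node, path):
    if isinstance(node, str):                         # leaf = class label
        paths[node] = path
        return
    nid = node_ids.setdefault(id(node), len(node_ids))  # one parameter vector per internal node
    collect(node[0], path + [(nid, True)])
    collect(node[1], path + [(nid, False)])
collect(root, [])

node_vecs = np.random.randn(len(node_ids), h) * 0.01  # parameters of the internal nodes

def class_prob(hidden, label):
    """P(label | hidden) = product of the sigmoid branch decisions along the path."""
    p = 1.0
    for nid, go_left in paths[label]:
        s = sigmoid(hidden @ node_vecs[nid])
        p *= s if go_left else (1.0 - s)
    return p

hidden = np.random.randn(h)
print({c: round(class_prob(hidden, c), 3) for c in freqs})
# The probabilities over all classes still sum to 1, but scoring one class only
# touches the nodes on its path, i.e. O(log2 k) of them, instead of all k outputs.
```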
Bag of words is invariant to word order, but taking this order explicitly into account is often computationally very expensive. Instead, we use a bag of n-grams as additional features to capture some partial information about the local word order.
We maintain a fast and memory-efficient mapping of the n-grams by using the hashing trick. Q: ref? A: the standard reference for the hashing trick is Weinberger et al. (2009), "Feature Hashing for Large Scale Multitask Learning".
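A sketch of the bigram features and the hashing trick under simple assumptions: the bucket count and the MD5-based hash are stand-ins for illustration only (fastText has its own hash function and uses far more bins).

```python
import hashlib

NUM_BUCKETS = 1_000_000  # illustrative; a real system would use many more bins

def word_ngrams(tokens, n=2):
    """Word n-grams (here bigrams) of a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bucket_id(gram, buckets=NUM_BUCKETS):
    """Deterministic stand-in hash; fastText uses its own hashing function."""
    digest = hashlib.md5(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % buckets

tokens = "the movie was surprisingly good".split()
bigrams = word_ngrams(tokens)
print(bigrams)
print([bucket_id(g) for g in bigrams])  # extra feature ids; collisions are accepted to bound memory
```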
In practice, they use 10 hidden units; adding bigram features improves performance by 1-4%.
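For reference, a hedged usage sketch with the open-source fastText Python bindings, wiring up the settings mentioned here (10-dimensional hidden layer, bigrams) plus hierarchical softmax; `train.txt` is a hypothetical training file with one `__label__<class> <text>` example per line.

```python
# Assumes `pip install fasttext` and a hypothetical train.txt in the
# "__label__<class> <text>" format; hyperparameters mirror the note above.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    dim=10,         # 10 hidden units
    wordNgrams=2,   # add bigram features
    loss="hs",      # hierarchical softmax
)
print(model.predict("example sentence to classify"))
```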