TLDR:
factorization machines (FMs) learn a linear model with interaction features, where each interaction weight is given by the dot product of the two features' latent vectors (there is one latent vector per feature). This makes the interaction weights dependent on each other (contrary to SVMs, where they are independent).
comparison to matrix factorization (MF):
FMs are NOT learned with MF although MF can be mimicked by a certain specification of factorization machines.
FMs bridged the gap between MF and models that could incorporate additional features (e.g., context, user demographics).
Unlike MF, which assumes a fixed user-item matrix, FMs can model any interaction matrix derived from feature engineering.
FMs underperformed more specialized models in RecSys, particularly when well-tuned MF techniques (e.g., ALS- or SGD-based MF) were used.
The mathematical expression of FMs is elegant in that it encompasses other model classes such as MF. However, FMs can struggle to scale efficiently to very large datasets due to the increased complexity of handling all pairwise interactions. They were eventually overtaken by neural models like NCF, which are more expressive and scalable.
This is different from matrix factorization: we are not learning over a feedback matrix, but rather over a classical supervised dataset of pairs $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(m)}, y^{(m)})\}$, with feature vectors $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and targets $y^{(i)}$.

Given a feature vector $\mathbf{x} \in \mathbb{R}^n$, an interaction model of degree 2 is given by:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$$
It captures all single and pairwise interactions between the features: $w_0$ is the global bias, $w_i$ is the weight of feature $i$, and the interaction weight between features $i$ and $j$ is the dot product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ of their latent vectors $\mathbf{v}_i, \mathbf{v}_j \in \mathbb{R}^k$.
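A minimal NumPy sketch of this degree-2 prediction (the naive $O(kn^2)$ double loop over feature pairs; the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Degree-2 FM prediction: w0 + sum_i w_i x_i + sum_{i<j} <V[i], V[j]> x_i x_j.

    x: (n,) feature vector, w0: scalar bias, w: (n,) linear weights,
    V: (n, k) matrix of latent vectors, one row per feature.
    """
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

# tiny example with random parameters
rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.random(n)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
print(fm_predict_naive(x, w0, w, V))
```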
By sharing the latent vectors rather than learning an independent weight per interaction, we can estimate higher-order interactions even under sparsity. Factorization machines break the independence of the interaction parameters by factorizing them: the data from one interaction helps estimate the latent vectors of the features involved, and therefore the weights of related interactions.
In recommender systems, one reason for high sparsity is the presence of large categorical variable domains (e.g. item ids, user ids)
In the paper, they reformulate the model to compute it in linear time.
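The key identity behind this (from the paper) rewrites the pairwise term without the sum over pairs:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left( \Big( \sum_{i=1}^{n} v_{i,f}\, x_i \Big)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right)$$

which costs $O(kn)$, and only the non-zero entries of $\mathbf{x}$ contribute. A sketch of the fast version, continuing the NumPy snippet above and matching `fm_predict_naive`:

```python
def fm_predict_fast(x, w0, w, V):
    """Same prediction as fm_predict_naive, in O(kn) via the reformulation
    sum_{i<j} <V[i],V[j]> x_i x_j = 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]."""
    s = V.T @ x                     # (k,) per-factor sums: sum_i v_{i,f} x_i
    s_sq = (V ** 2).T @ (x ** 2)    # (k,) per-factor sums of squares
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s_sq)

# sanity check against the naive version (illustrative)
assert np.isclose(fm_predict_fast(x, w0, w, V), fm_predict_naive(x, w0, w, V))
```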
We can use different losses depending on the prediction task: squared error for regression, hinge or logit loss for binary classification, and pairwise losses for ranking.
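For instance, continuing the snippets above (illustrative only), with $\hat{y}$ the FM output:

```python
y_true = 1.0                                   # example target; use {-1, +1} for classification
y_hat = fm_predict_fast(x, w0, w, V)

loss_regression = (y_hat - y_true) ** 2                    # squared error for regression
loss_classification = np.log1p(np.exp(-y_true * y_hat))    # logit loss for binary classification
```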
For $d = 3$, the interaction term becomes:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \sum_{l=j+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j, \mathbf{v}_l \rangle\, x_i x_j x_l$$

$\langle \mathbf{v}_i, \mathbf{v}_j, \mathbf{v}_l \rangle$ just means we're summing over the element-wise product of the 3 latent vectors: $\sum_{f=1}^{k} v_{i,f}\, v_{j,f}\, v_{l,f}$.
The generalized formula is ugly and given in the paper. The specificity is that they actually learn one latent matrix $V^{(l)} \in \mathbb{R}^{n \times k_l}$ for each interaction degree $l = 2, \dots, d$.
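For concreteness, a naive sketch of the $d = 3$ term (illustrative naming, continuing the NumPy snippets above; only practical for small $n$):

```python
from itertools import combinations

def three_way_term(x, V3):
    """sum_{i<j<l} <V3[i], V3[j], V3[l]> x_i x_j x_l, where the triple bracket
    is the sum of the element-wise product of the 3 latent vectors."""
    return sum(
        np.sum(V3[i] * V3[j] * V3[l]) * x[i] * x[j] * x[l]
        for i, j, l in combinations(range(len(x)), 3)
    )
```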
Equivalence: notably, the paper shows that factorization machines can be equivalent to matrix factorization and other models when designing the proper input features. For matrix factorization, the input vector is the concatenation of a one-hot encoding of the user set $U$ and a one-hot encoding of the item set $I$, i.e. $\mathbf{x} \in \{0, 1\}^{|U| + |I|}$ is a vector with ones at user index $u$ and item index $|U| + i$ (since we concatenated the 2).
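With that encoding, only two entries of $\mathbf{x}$ are non-zero, so the degree-2 FM prediction collapses to biased matrix factorization:

$$\hat{y}(\mathbf{x}) = w_0 + w_u + w_{|U| + i} + \langle \mathbf{v}_u, \mathbf{v}_{|U| + i} \rangle$$

i.e. a global bias, a user bias, an item bias, and a dot product between a user embedding and an item embedding.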