Collaborative filtering:
memory-based: e.g. item-based or user-based nearest-neighbor methods (limited on sparse and large-scale data, since they compute pairwise similarity scores)
model-based: e.g. latent factor models such as matrix factorization or neural networks
Implicit and explicit feedback
recommendation tasks:
rating prediction
top-$n$ recommendation (item ranking)
sequence-aware recommendation
click-through rate prediction: $CTR=\frac{\#\text{Clicks}}{\#\text{Impressions}}$. Usually a binary classification problem (click or no click).
cold-start recommendation
Factorize the user-item interaction matrix (e.g. rating matrix) into a product of two lower-rank matrices, capturing the low-rank structure of the user-item interactions.
Evaluation metric: root mean squared error between observed and predicted interactions.
Reconstruction error: $\arg\min_{\underbrace{W,V}_{\text{weights}},\underbrace{\mu,b}_{\text{biases}}} \sum_{i \in\text{columns of } R} \lVert R_i - h(R_i)\rVert_\text{observed entries}^2 + \lambda(\lVert W\rVert_F^2 + \lVert V\rVert_F^2)$ where $R$ is the ratings matrix
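A minimal numpy sketch of the masked reconstruction objective above, assuming a toy ratings matrix and a plain low-rank factorization $R \approx PQ^T$ in place of the encoder $h$ (all names and sizes here are illustrative):

```python
import numpy as np

# Toy ratings matrix R with 0 marking unobserved entries (illustrative data)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0]])
mask = R > 0  # restrict the error to observed entries only

# Low-rank factors: R ~ P @ Q.T, with P (users x k) and Q (items x k)
rng = np.random.default_rng(0)
k = 2
P = rng.normal(size=(2, k))
Q = rng.normal(size=(3, k))
pred = P @ Q.T

lam = 0.1
# Squared error over observed entries plus Frobenius-norm regularization
loss = np.sum(((R - pred) * mask) ** 2) + lam * (np.sum(P**2) + np.sum(Q**2))
```

Minimizing this loss (e.g. by gradient descent on `P` and `Q`) recovers the low-rank structure of the observed interactions.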
Three types of ranking losses:
See original paper Bpr: bayesian personalized ranking from implicit feedback
Training data: tuples $(u, i, j)$, each representing that user $u$ prefers item $i$ over item $j$ (pairs are formed from both positive and missing entries, i.e. it is assumed the user prefers each observed positive item over all non-observed items).
BPR aims to maximize the posterior probability of model parameters given desired personalized ranking for user $u$: $p(\Theta \vert \text{ranking}_u)$
$\begin{aligned} \ln p(\Theta\vert \text{ranking}_u) & \propto \ln p(\text{ranking}_u\vert \Theta)p(\Theta) \\& = \ln \prod_{(u,i,j)}\underbrace{\sigma}_{\text{sigmoid function}}(\hat y_{ui} - \hat y_{uj})p(\Theta) \\& = \sum_{(u,i,j)} \ln \sigma(\hat y_{ui} - \hat y_{uj}) + \ln \underbrace{p(\Theta)}_{\sim \mathcal{N}(0, \lambda_\Theta I)} \\& = \sum_{(u,i,j)} \ln \sigma(\hat y_{ui} - \hat y_{uj}) - \lambda_\Theta \lVert \Theta\rVert^2\end{aligned}$
Where $i$ is taken from the positive items and $j$ is taken from the negative items for user $u$. $\hat y_{ui}$ is the predicted score of user $u$ to item $i$.
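A sketch of the (negated, to minimize) BPR objective from the derivation above, over arrays of predicted scores for positive items $i$ and sampled negative items $j$ (function name and regularization handling are illustrative):

```python
import numpy as np

def bpr_loss(y_pos, y_neg, theta_norm_sq=0.0, lam=0.01):
    """Negative BPR objective: -sum ln sigma(y_ui - y_uj) + lam * ||Theta||^2."""
    diff = y_pos - y_neg
    sigma = 1.0 / (1.0 + np.exp(-diff))  # sigmoid of the score difference
    return -np.sum(np.log(sigma)) + lam * theta_norm_sq

# A correctly ranked pair (positive scored higher) yields a smaller loss
good = bpr_loss(np.array([2.0]), np.array([0.0]))
bad = bpr_loss(np.array([0.0]), np.array([2.0]))
```

The loss shrinks as the model pushes $\hat y_{ui}$ above $\hat y_{uj}$ for every sampled triple.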
Often used in Support Vector Machine classifiers.
$\sum_{(u,i,j)} \max(m-\hat y_{ui} + \hat y_{uj}, 0)$
where $m$ is the safety margin. It aims to push negative items away from positive items.
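The hinge ranking loss above in numpy (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def hinge_ranking_loss(y_pos, y_neg, m=1.0):
    # sum max(m - y_ui + y_uj, 0): zero once each positive beats
    # its negative by at least the safety margin m
    return np.sum(np.maximum(m - y_pos + y_neg, 0.0))
```

For example, a positive item scored 3.0 against a negative scored 1.0 already clears a margin of 1 and contributes zero loss.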
$\begin{aligned}x & = p_u \odot q_i \quad \text{(element-wise product)} \\\hat y_{u,i} & = \sigma(h^T x)\end{aligned}$
$\hat y_{u,i}$ is the prediction score of user $u$ for item $i$:
$\hat y_{u,i} = \sigma(h^T[\text{GMF}(p_u, q_i), \text{MLP}([U_u, V_i])])$ where $U_u, V_i$ are the user and item embeddings of the MLP branch.
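A sketch of scoring with just the GMF branch, assuming the element-wise interaction $x = p_u \odot q_i$ followed by $\sigma(h^T x)$ (embedding sizes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

k = 4
rng = np.random.default_rng(1)
p_u = rng.normal(size=k)   # GMF user embedding
q_i = rng.normal(size=k)   # GMF item embedding
h = rng.normal(size=k)     # weights of the output layer

x = p_u * q_i              # element-wise (generalized MF) interaction
y_hat = sigmoid(h @ x)     # prediction score in (0, 1)
```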
Split dataset by time into training and test sets.
$\text{Hit@}r = \frac{1}{\lvert\text{user set}\rvert}\sum_{u\in \text{user set}} \mathbb{I}(\text{rank}_{u, g_u} \leq r)$
Where $\text{rank}_{u, g_u}$ denotes the ranking of ground truth item $g_u$ of the user $u$ (the one that the user clicked on) in the predicted recommendation list (ideal ranking is 1).
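A one-function sketch of Hit@$r$, assuming we already have each user's ground-truth rank (function name is illustrative):

```python
import numpy as np

def hit_at_r(ranks, r):
    """ranks[u] = rank of user u's ground-truth item in the list (1 = best)."""
    ranks = np.asarray(ranks)
    return np.mean(ranks <= r)  # fraction of users whose item made the top r
```

For instance, ground-truth ranks 1, 4, 12 over three users give Hit@10 = 2/3.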
Proportion of times where the predicted rank of the ground truth item was better than that of irrelevant items.
$\text{AUC}=\frac{1}{\lvert\text{user set}\rvert}\sum_{u\in\text{user set}} \frac{1}{\lvert \mathcal{D}\rvert}\sum_{j\in \mathcal{D}} \mathbb{I}(\text{rank}_{u, g_u} < \text{rank}_{u, j})$
Where $\mathcal{D} = \text{item set} \setminus \text{candidate items of user } u$
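A per-user sketch of the inner AUC term: the fraction of irrelevant items ranked worse than the ground-truth item (function name is illustrative):

```python
import numpy as np

def auc_user(rank_gt, ranks_irrelevant):
    """Fraction of irrelevant items whose rank is worse (larger) than the ground truth's."""
    ranks_irrelevant = np.asarray(ranks_irrelevant)
    return np.mean(rank_gt < ranks_irrelevant)
```

E.g. a ground-truth item at rank 2 against irrelevant items at ranks 1, 3, 4, 5 yields AUC 3/4; averaging over users gives the metric above.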
Convolutional Sequence Embedding Recommendation
Let $S^{u} = (\underbrace{S_1^u}_{\text{first item}}, S_2^u, \dots)$ denote the ordered sequence of items for user $u$.
Suppose we take the previous $L$ items into consideration. The embedding matrix representing former interactions at time step $t$ can be constructed as $E^{(u, t)} = [q_{S_{t-L}^{u}}, \dots, q_{S_{t-1}^{u}}]^T \in \mathbb{R}^{L\times k}$ where $q_i$ is the $i$-th row of the item embedding matrix $Q\in\mathbb{R}^{n\times k}$.
$E^{(u, t)}$ can be used to infer the transient interest of user $u$ at time-step $t$ and is used as input to the subsequent components.
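Constructing $E^{(u,t)}$ is just stacking the embeddings of the last $L$ items before step $t$; a sketch with illustrative sizes:

```python
import numpy as np

n, k, L = 10, 3, 4                  # items, embedding dim, window length
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, k))         # item embedding matrix, row i = q_i
S_u = [7, 2, 5, 5, 1, 9]            # interaction sequence of user u (item ids)

t = 5                               # time step whose item we want to predict
window = S_u[t - L:t]               # the L items preceding step t
E_ut = Q[window]                    # stacked embeddings: shape (L, k)
```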
$E^{(u, t)}$ is passed through a horizontal convolutional layer and a vertical convolutional layer in parallel. Their outputs are concatenated and passed into a fully-connected layer to get the final representation.
$\begin{aligned}o & = \text{HConv}(E^{(u, t)}, \text{horizontal filters})\\o' & = \text{VConv}(E^{(u, t)}, \text{vertical filters})\\z & = \sigma(W[o,o']^T + b)\\\hat y_{u,i,t} & = v_i [z, p_u]^T + b_i \\\end{aligned}$
Where $V$ is another item embedding matrix, $P$ is the user embedding matrix for user's general tastes.
Model can be learned with BPR or Hinge loss.
Data generation process:
See also Session-based recommendations with RNNs
In addition to interaction data (sparse and noisy), integrate features of items, profiles of users, and the context in which the interaction occurred.
Let $x\in\mathbb{R}^d$ denote feature vectors of one sample, $y$ corresponding label (either real-valued label or class label). 2-way factorization machine defined as:
$\hat{y}(x) = w_0 + \sum_{i=1}^d w_i x_i + \sum_{i=1}^{d}\sum_{j=i+1}^{d}\underbrace{\langle v_i, v_j\rangle}_{\text{interaction between }i\text{-th and }j\text{-th features}}x_i x_j$
where $w_0\in\mathbb{R}$ is the global bias; $w\in\mathbb{R}^d$ holds the feature weights, $w_i$ being the weight of the $i$-th feature; $V\in\mathbb{R}^{d\times k}$ represents the feature embeddings ($k$ being the dimensionality of the latent factors)
First two terms correspond to linear regression and last term is an extension of matrix factorization.
Naive evaluation of the pairwise term costs $\mathcal{O}(kd^2)$.
It can be reformulated using the fact that the total sum is twice the triangular sum plus the diagonal sum:
$\sum_{i=1}^{d}\sum_{j=i+1}^{d}\langle v_i, v_j\rangle x_i x_j = \frac{1}{2}\sum_{l=1}^{k}\left[\left(\sum_{i=1}^{d} v_{i,l} x_i\right)^2 - \sum_{i=1}^{d} v_{i,l}^2 x_i^2\right]$
Time complexity becomes linear: $\mathcal{O}(kd)$. Moreover, only non-zero elements need to be computed (efficient for sparse features).
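A numpy check that the reformulated $\mathcal{O}(kd)$ expression matches the naive $\mathcal{O}(kd^2)$ double sum (sizes are illustrative):

```python
import numpy as np

d, k = 5, 3
rng = np.random.default_rng(0)
V = rng.normal(size=(d, k))   # feature embeddings, row i = v_i
x = rng.normal(size=d)        # feature vector of one sample

# Naive O(k d^2): explicit double sum over feature pairs (i < j)
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(d) for j in range(i + 1, d))

# Reformulated O(k d): 0.5 * sum_l [(sum_i v_il x_i)^2 - sum_i v_il^2 x_i^2]
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
```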
Loss functions: MSE for regression, cross-entropy for classification, BPR loss for ranking.
Factorization machines still combine interactions via a linear model. Higher-order feature interactions (order $> 2$) usually lead to numerical instability and high computational complexity.
Deep neural networks can learn sophisticated (nonlinear + higher order) feature interactions with rich feature representations.
DeepFM (Guo et al., 2017) has two components: factorization machine and multi-layer perceptron.
FM component's output is: $\hat{y}^{(FM)}$ (see previous section)
MLP component:
Let $e_i\in\mathbb{R}^k$ be the latent feature vector of feature $i$.
$\begin{aligned}z^{(0)} & = [e_1,\dots,e_f] \\z^{(l)} & = \text{activation}(W^{(l)}z^{(l-1)} + b^{(l)})\end{aligned}$
Output is $\hat{y}^{(MLP)}$
Both outputs are combined in sigmoid function: $\hat{y} = \sigma(\hat{y}^{(FM)} + \hat{y}^{(MLP)})$.
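A minimal forward-pass sketch of the MLP component and the sigmoid combination above; layer sizes, the ReLU activation, and the placeholder FM output are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
f, k = 3, 4
e = [rng.normal(size=k) for _ in range(f)]   # latent vectors e_1..e_f

# MLP component: concatenate field embeddings, one hidden layer
z0 = np.concatenate(e)                        # z^(0), shape (f*k,)
W1, b1 = rng.normal(size=(8, f * k)), np.zeros(8)
w_out = rng.normal(size=8)
z1 = np.maximum(W1 @ z0 + b1, 0.0)            # ReLU activation (assumed)
y_mlp = w_out @ z1                            # y_hat^(MLP)

y_fm = 0.4                                    # placeholder FM-component output
y_hat = sigmoid(y_fm + y_mlp)                 # combined CTR prediction in (0, 1)
```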
See also Neural factorization machines for sparse predictive analytics