memory-based: e.g. item-based or user-based using nearest neighbor (limitations in dealing with sparse and large-scale data since it computes similarity scores)
model-based: e.g. latent factor models such as matrix factorization or neural networks
Implicit and explicit feedback
recommendation tasks:
rating prediction
top-n recommendation (item ranking)
sequence-aware recommendation
click-through rate prediction: $\text{CTR} = \frac{\#\text{Clicks}}{\#\text{Impressions}}$. Usually treated as a binary classification problem (click or no click).
cold-start recommendation
Matrix factorization
Factorize the user-item interaction matrix (e.g. rating matrix) into a product of two lower-rank matrices, capturing the low-rank structure of the user-item interactions.
Evaluation metric: root mean squared error between observed and predicted interactions.
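A minimal sketch of matrix factorization trained with SGD on observed ratings (the array names, toy data, and hyperparameters are illustrative, not from the course):

```python
import numpy as np

# Predicted rating = dot(user factor, item factor); minimize squared error
# over observed entries with an L2 penalty on the factors.
num_users, num_items, k = 4, 5, 3
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(num_users, k))  # user latent factors
Q = rng.normal(scale=0.1, size=(num_items, k))  # item latent factors

# Observed (user, item, rating) triples from a sparse interaction matrix.
observed = [(0, 1, 4.0), (0, 3, 2.0), (1, 0, 5.0), (2, 2, 3.0), (3, 4, 1.0)]

lr, lam = 0.05, 0.01
for epoch in range(200):
    for u, i, r in observed:
        p_u, q_i = P[u].copy(), Q[i].copy()
        err = r - p_u @ q_i                      # prediction error
        P[u] += lr * (err * q_i - lam * p_u)     # SGD step with L2 regularization
        Q[i] += lr * (err * p_u - lam * q_i)

rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in observed]))
print(f"training RMSE: {rmse:.3f}")
```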
AutoRec: rating prediction with autoencoders
Matrix factorization is a linear model and thus does not capture nonlinear relationships
AutoRec differs from a traditional autoencoder in that, rather than learning the latent representations, it focuses on reconstructing the output layer.
it uses a partially observed interaction matrix as input, aiming to reconstruct a completed rating matrix.
Two variants: user-based and item-based
Item-based AutoRec
Reconstruction error: $\underset{\mathbf{W},\mathbf{V},\mu,b}{\operatorname{argmin}} \sum_{i \in \text{columns of } \mathbf{R}} \lVert \mathbf{R}_i - h(\mathbf{R}_i) \rVert_{\text{observed entries}}^2 + \lambda\left(\lVert \mathbf{W} \rVert_F^2 + \lVert \mathbf{V} \rVert_F^2\right)$, where $\mathbf{R}$ is the rating matrix, $\mathbf{W}, \mathbf{V}$ are the weights and $\mu, b$ the biases.
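A minimal item-based AutoRec sketch in PyTorch, assuming a dense rating matrix with zeros for missing entries and a mask so only observed entries contribute to the loss (module names, sizes, and the toy data are illustrative):

```python
import torch
from torch import nn

class AutoRec(nn.Module):
    """Item-based AutoRec: encode a rating column, decode a completed column."""
    def __init__(self, num_users, num_hidden=500, dropout=0.05):
        super().__init__()
        self.encoder = nn.Linear(num_users, num_hidden)  # V and mu
        self.decoder = nn.Linear(num_hidden, num_users)  # W and b
        self.dropout = nn.Dropout(dropout)

    def forward(self, r):  # r: batch of item columns, shape (batch, num_users)
        h = torch.sigmoid(self.encoder(r))
        return self.decoder(self.dropout(h))

# Toy usage: reconstruct observed entries only (mask = 1 where a rating exists).
num_users, num_items = 100, 20
R = torch.randint(0, 6, (num_items, num_users)).float()  # 0 means "missing"
mask = (R > 0).float()

model = AutoRec(num_users)
# weight_decay stands in for the Frobenius-norm penalty on W and V
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
for _ in range(10):
    loss = (((model(R) - R) * mask) ** 2).sum() / mask.sum()  # masked MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```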
Personalized Ranking for Recommender Systems
the previous methods only use explicit feedback (expensive to collect, while most of the signal lies in implicit feedback)
ratings might not be missing at random but because of users' preferences
Three types of ranking losses:
pointwise approach: consider a single interaction at a time (e.g. matrix factorization and AutoRec are optimized with pointwise objectives)
pairwise approach: consider a pair of items for each user and aim to approximate the optimal ordering for that pair. More suitable for ranking task. (see next sections: Bayesian Personalized Ranking loss and Hinge loss)
Listwise approach: approximate the ordering of the entire list of items (e.g. directly optimizing ranking measures such as normalized discounted cumulative gain). More complex and compute-intensive.
Bayesian personalized ranking (BPR) loss
Training data: tuples $(u, i, j)$, each representing that user $u$ prefers item $i$ over item $j$ (positive and negative pairs, i.e. missing values; assumes the user prefers the positive item over all other non-observed items).
BPR aims to maximize the posterior probability of the model parameters given the desired personalized ranking for user $u$: $p(\Theta \mid \text{ranking}_u)$.
This leads to maximizing $\sum_{(u,i,j)} \ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) - \lambda_\Theta \lVert \Theta \rVert^2$, where $i$ is taken from the positive items and $j$ from the negative items for user $u$, and $\hat{y}_{ui}$ is the predicted score of user $u$ for item $i$.
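A one-function sketch of the (negated, so it can be minimized) BPR objective in PyTorch, with the regularization term left to the optimizer's weight decay; the scores below are illustrative:

```python
import torch

def bpr_loss(pos_scores, neg_scores):
    """Negative BPR objective for sampled (u, i, j) triples:
    -ln sigma(y_ui - y_uj), averaged over the batch."""
    return -torch.log(torch.sigmoid(pos_scores - neg_scores)).mean()

pos = torch.tensor([2.1, 0.3, 1.5])   # \hat{y}_{ui} for positive items
neg = torch.tensor([0.4, 0.9, -0.2])  # \hat{y}_{uj} for sampled negatives
print(bpr_loss(pos, neg))
```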
Hinge Loss
Often used in Support Vector Machine classifiers.
$\sum_{(u,i,j)} \max(m - \hat{y}_{ui} + \hat{y}_{uj}, 0)$
where $m$ is the safety margin. It aims to push negative items away from positive items.
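The same kind of sketch for the pairwise hinge loss (the margin value is illustrative):

```python
import torch

def hinge_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Penalize negative items whose score comes within `margin`
    of the positive item's score."""
    return torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()

pos = torch.tensor([2.1, 0.3, 1.5])
neg = torch.tensor([0.4, 0.9, -0.2])
print(hinge_ranking_loss(pos, neg))
```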
Neural Collaborative Filtering for Personalized Ranking
implicit feedback (clicks, buys, watches)
NeuMF
NeuMF (neural matrix factorization, He et al., 2017) leverages the non-linearity of neural networks to replace the dot product used in matrix factorization
unlike the rating prediction task in [AutoRec](d2l_recommender_systems_course#AutoRec rating prediction with autoencoders), the model generates a ranked recommendation list for each user based on implicit feedback
fuses two subnetworks: generalized matrix factorization (GMF) and multi-layer perceptron (MLP)
trained using a pairwise ranking loss, so it is important to include negative sampling.
Generalized matrix factorization (GMF)
$\mathbf{x} = \mathbf{p}_u \odot \mathbf{q}_i$ (elementwise product), $\hat{y}_{u,i} = \sigma(\mathbf{h}^\top \mathbf{x})$
$\mathbf{p}_u$ is the $u$-th row of $\mathbf{P} \in \mathbb{R}^{m \times k}$ (the user embeddings)
$\mathbf{q}_i$ is the $i$-th row of $\mathbf{Q} \in \mathbb{R}^{n \times k}$ (the item embeddings)
m is the number of users
n is the number of items
k is the number of latent dimensions
Multi-layer perceptron
uses the concatenation of user and item embeddings $[\mathbf{u}_u, \mathbf{v}_i]$ (a separate set of embeddings from the GMF layer) as input and passes it through a multi-layer perceptron
Final layer
$\hat{y}_{u,i}$ is the prediction score of user $u$ for item $i$:
$\hat{y}_{u,i} = \sigma\!\left(\mathbf{h}^\top \left[\mathrm{GMF}(\mathbf{p}_u, \mathbf{q}_i), \mathrm{MLP}([\mathbf{u}_u, \mathbf{v}_i])\right]\right)$
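A compact PyTorch sketch of the two-branch architecture (layer sizes, embedding dimension, and the sigmoid output are illustrative choices):

```python
import torch
from torch import nn

class NeuMF(nn.Module):
    """GMF branch and MLP branch with separate embeddings,
    fused by a final linear prediction layer."""
    def __init__(self, num_users, num_items, k=16, mlp_dims=(32, 16, 8)):
        super().__init__()
        self.P = nn.Embedding(num_users, k)  # GMF user embeddings
        self.Q = nn.Embedding(num_items, k)  # GMF item embeddings
        self.U = nn.Embedding(num_users, k)  # MLP user embeddings
        self.V = nn.Embedding(num_items, k)  # MLP item embeddings
        layers, in_dim = [], 2 * k
        for d in mlp_dims:
            layers += [nn.Linear(in_dim, d), nn.ReLU()]
            in_dim = d
        self.mlp = nn.Sequential(*layers)
        self.out = nn.Linear(k + mlp_dims[-1], 1)  # h^T [GMF, MLP]

    def forward(self, user, item):
        gmf = self.P(user) * self.Q(item)  # elementwise product
        mlp = self.mlp(torch.cat([self.U(user), self.V(item)], dim=-1))
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], dim=-1))).squeeze(-1)

model = NeuMF(num_users=100, num_items=50)
scores = model(torch.tensor([0, 1]), torch.tensor([3, 7]))  # scores for 2 (u, i) pairs
```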
Evaluation
Split dataset by time into training and test sets.
Hit rate at given cut-off (Hit @ r)
$\text{Hit@}r = \frac{1}{|\text{user set}|} \sum_{u \in \text{user set}} \mathbb{1}(\text{rank}_{u, g_u} \le r)$
where $\text{rank}_{u, g_u}$ denotes the rank of the ground-truth item $g_u$ of user $u$ (the one the user clicked on) in the predicted recommendation list (the ideal rank is 1).
Area under curve (AUC)
Proportion of times where the predicted rank of the ground truth item was better than that of irrelevant items.
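A small sketch of both metrics for a single user, assuming an array of model scores over all candidate items and the index of the ground-truth item (function and variable names are illustrative); averaging over the user set gives the reported numbers:

```python
import numpy as np

def hit_at_r(scores, ground_truth, r=10):
    """1 if the ground-truth item lands in the top-r of the ranked list."""
    rank = int((scores > scores[ground_truth]).sum()) + 1  # 1-based rank
    return int(rank <= r)

def auc_single_user(scores, ground_truth):
    """Fraction of irrelevant items ranked below the ground-truth item."""
    others = np.delete(scores, ground_truth)
    return float((scores[ground_truth] > others).mean())

scores = np.array([0.1, 0.9, 0.3, 0.7])  # scores for 4 candidate items
print(hit_at_r(scores, ground_truth=1, r=2), auc_single_user(scores, ground_truth=1))
```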
Sequence-Aware Recommender Systems
in the previous sections, we abstracted the recommendation task as a matrix completion problem without considering users' short-term behaviors
take sequentially-ordered user interaction logs into account
modeling users' temporal behavioral patterns and discovering their interest drift
point-level pattern: indicates the impact of a single item in the historical sequence on the target item
union-level pattern: implies the influence of several previous actions on the subsequent target
CASER
Convolutional Sequence Embedding Recommendation
Let $S^u = (S_1^u, S_2^u, \dots)$ denote the ordered sequence of items for user $u$ ($S_1^u$ being the first item).
Suppose we take the previous $L$ items into consideration. The embedding matrix representing the former interactions for time step $t$ can be constructed as $\mathbf{E}^{(u,t)} = [\mathbf{q}_{S_{t-L}^u}, \dots, \mathbf{q}_{S_{t-1}^u}]^\top \in \mathbb{R}^{L \times k}$, where $\mathbf{q}_i$ is the $i$-th row of the item embeddings $\mathbf{Q} \in \mathbb{R}^{n \times k}$.
$\mathbf{E}^{(u,t)}$ can be used to infer the transient interest of user $u$ at time step $t$ and is used as input to the subsequent components.
$\mathbf{E}^{(u,t)}$ is passed through a horizontal convolutional layer and a vertical convolutional layer in parallel. Their outputs are concatenated and passed into a fully-connected layer to get the final representation.
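A rough PyTorch sketch of that convolutional core over $\mathbf{E}^{(u,t)}$ (filter counts, window sizes, and the output dimension are illustrative assumptions):

```python
import torch
from torch import nn

class CaserBlock(nn.Module):
    """Horizontal filters slide over consecutive items, a vertical filter
    spans the whole window; the outputs are concatenated and fed to a
    fully-connected layer."""
    def __init__(self, L=5, k=16, n_h=4, n_v=2, out_dim=32):
        super().__init__()
        # horizontal convolutions with window heights 1..L over the item axis
        self.h_convs = nn.ModuleList(
            [nn.Conv2d(1, n_h, kernel_size=(h, k)) for h in range(1, L + 1)]
        )
        self.v_conv = nn.Conv2d(1, n_v, kernel_size=(L, 1))  # vertical convolution
        self.fc = nn.Linear(n_h * L + n_v * k, out_dim)

    def forward(self, E):  # E: (batch, L, k) stacked item embeddings
        x = E.unsqueeze(1)  # add a channel dim -> (batch, 1, L, k)
        h_outs = [torch.relu(conv(x)).squeeze(3).max(dim=2).values  # max-pool over time
                  for conv in self.h_convs]
        v_out = torch.relu(self.v_conv(x)).flatten(start_dim=1)
        return self.fc(torch.cat(h_outs + [v_out], dim=1))

block = CaserBlock()
rep = block(torch.randn(8, 5, 16))  # 8 users' last-5-item embeddings -> (8, 32)
```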
Feature-Rich Recommender Systems
In addition to interaction data (sparse and noisy), integrate features of the items, profiles of the users, and the context in which the interaction occurred.
Factorization Machines
supervised algo used for classification, regression and ranking tasks
generalization of linear regression and matrix factorization
reminiscent of support vector machines with polynomial kernel
can model $\chi$-way variable interactions ($\chi$ is the polynomial order, usually 2); a fast optimization algo reduces the polynomial computation time to linear complexity
2-way factorization machines
Let $\mathbf{x} \in \mathbb{R}^d$ denote the feature vector of one sample and $y$ the corresponding label (either a real-valued label or a class label). The 2-way factorization machine is defined as:
$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^d w_i x_i + \sum_{i=1}^d \sum_{j=i+1}^d \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$ (the last sum models the interaction between the $i$-th and $j$-th features)
where $w_0 \in \mathbb{R}$ is the global bias; $\mathbf{w} \in \mathbb{R}^d$ stores the weights of the individual variables; $\mathbf{V} \in \mathbb{R}^{d \times k}$ represents the feature embeddings ($k$ being the dimensionality of the latent factors)
First two terms correspond to linear regression and last term is an extension of matrix factorization.
Efficient optimization criterion
Naive evaluation of the pairwise interaction term costs $\mathcal{O}(kd^2)$.
It can be reformulated as $\sum_{i=1}^d \sum_{j=i+1}^d \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{l=1}^k \left( \left(\sum_{i=1}^d v_{i,l} x_i\right)^2 - \sum_{i=1}^d v_{i,l}^2 x_i^2 \right)$: the full double sum is twice the upper-triangular sum, so take half of the total sum minus the diagonal.
Time complexity becomes linear: $\mathcal{O}(kd)$. Moreover, only non-zero elements need to be computed (efficient for sparse features).
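A small sketch of the 2-way FM prediction using this reformulated pairwise term (shapes and the toy inputs are illustrative):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2-way factorization machine prediction.
    x: (d,) feature vector, w0: scalar bias, w: (d,) linear weights,
    V: (d, k) feature embeddings. Pairwise term in O(kd):
    0.5 * sum_l [ (sum_i v_il x_i)^2 - sum_i v_il^2 x_i^2 ]."""
    linear = w0 + w @ x
    xv = V.T @ x  # (k,) vector of sum_i v_il x_i
    pairwise = 0.5 * np.sum(xv ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

d, k = 6, 3
rng = np.random.default_rng(0)
print(fm_predict(rng.random(d), 0.1, rng.normal(size=d), rng.normal(size=(d, k))))
```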
Loss functions: MSE for regression, cross-entropy for classification, BPR loss for ranking.
Deep Factorization Machines
A factorization machine still combines feature interactions in a linear fashion (second-order only). Higher-order feature interactions (order > 2) usually lead to numerical instability and high computational complexity.
Deep neural networks can learn sophisticated (nonlinear + higher order) feature interactions with rich feature representations.