d2l_recommender_systems_course

source

Overview

  • Collaborative filtering:

    • memory-based: e.g. item-based or user-based using nearest neighbor (limitations in dealing with sparse and large-scale data since it computes similarity scores)

    • model-based: e.g. latent factor models such as matrix factorization or neural networks

  • Implicit and explicit feedback

  • recommendation tasks:

    • rating prediction

    • top-nn recommendation (item ranking)

    • sequence-aware recommendation

    • click-through rate prediction: CTR=#Clicks#ImpressionsCTR=\frac{\#\text{Clicks}}{\#\text{Impressions}}. Usually a binary classification problem (click or no click).

    • cold-start recommendation

Matrix factorization

Factorize the user-item interaction matrix (e.g. rating matrix) into a product of two lower-rank matrices, capturing the low-rank structure of the user-item interactions.

See this paragraph for details.

Evaluation metric: root mean squared error between observed and predicted interactions.

AutoRec: rating prediction with autoencoders

  • Matrix factorization is a linear model and thus does not capture nonlinear relationships
  • AutoRec differs from a traditional autoencoder in that, rather than learning the latent representations, it focuses on reconstructing the output layer.
  • it uses a partially observed interaction matrix as input, aiming to reconstruct a completed rating matrix.
  • Two variants: user-based and item-based

Item-based AutoRec

autorec_architecture.png

Reconstruction error: argminW,Vweights,μ,bbiasesicolumns of RRih(Ri)observed entries2+λ(WF2+VF2)\arg\min_{\underbrace{W,V}_{\text{weights}},\underbrace{\mu,b}_{\text{biases}}} \sum_{i \in\text{columns of } R} \lVert R_i - h(R_i)\rVert_\text{observed entries}^2 + \lambda(\lVert W\rVert_F^2 + \lVert V\rVert_F^2) where RR is the ratings matrix

Personalized Ranking for Recommender Systems

  • former methods only use explicit feedback (expensive to collect and most signals lie in implicit feedback)
  • ratings might not be missing at random but because of users' preferences

Three types of ranking losses:

  • pointwise approach: consider a single interaction at a time (e.g. matrix factorization and AutoRec are optimized with pointwise objectives)
  • pairwise approach: consider a pair of items for each user and aim to approximate the optimal ordering for that pair. More suitable for ranking task. (see next sections: Bayesian Personalized Ranking loss and Hinge loss)
  • Listwise approach: approximate the ordering of the entire list of items (e.g. direct optimizing the ranking measures such as normalized discounted cumulative gain). More complex and compute-intensive

Bayesian Personalized Ranking Loss

See original paper Bpr: bayesian personalized ranking from implicit feedback

Training data: tuples (u,i,j)(u, i, j) which represents that the user uu prefers item ii over item jj (both positive and negative pairs i.e. missing values, assumes user prefers positive item over all other non-observed items).

BPR aims to maximize the posterior probability of model parameters given desired personalized ranking for user uu: p(Θrankingu)p(\Theta \vert \text{ranking}_u)

lnp(Θrankingu)lnp(rankinguΘ)p(Θ)=ln(u,i,k)σsigmoid function(y^uiy^uj)p(Θ)=(u,i,j)lnσ(y^uiy^uj)+lnp(Θ)N(0,λΘI)=(u,i,j)σ(y^uiy^uj)λΘΘ2\begin{aligned} \ln p(\Theta\vert \text{ranking}_u) & \propto \ln p(\text{ranking}_u\vert \Theta)p(\Theta) \\& = \ln \prod_{(u,i,k)}\underbrace{\sigma}_{\text{sigmoid function}}(\hat y_{ui} - \hat y_{uj})p(\Theta) \\& = \sum_{(u,i,j)} \ln \sigma(\hat y_{ui} - \hat y_{uj}) + \ln \underbrace{p(\Theta)}_{\sim \mathcal{N}(0, \lambda_\Theta I)} \\& = \sum_{(u,i,j)} \sigma(\hat y_{ui} - \hat y_{uj}) - \lambda_\Theta \lVert \Theta\rVert^2\end{aligned}

Where ii is taken from the positive items and jj is taken from the negative items for user uu. y^ui\hat y_{ui} is the predicted score of user uu to item ii.

Hinge Loss

Often used in Support Vector Machine classifiers.

(u,i,j)max(my^ui+y^uj,0)\sum_{(u,i,j)} \max(m-\hat y_{ui} + \hat y_{uj}, 0)

where mm is the safety margin. It aims to push negative items away from positive items.

Neural Collaborative Filtering for Personalized Ranking

  • implicit feedback (clicks, buys, watches)

NeuMF

  • NeuMF (for neural matrix factorization, He et al, 2017) leverages non-linearity of neural nets to replace dot products in matrix factorization
  • unlike the rating prediction task in [AutoRec](d2l_recommender_systems_course#AutoRec rating prediction with autoencoders), the model generates a ranked recommendation list to each user based on the implicit feedback
  • fuses two subnetworks: generalized matrix factorization (GMF) and multi-layer perceptron (MLP)
  • trained using pairwise ranking loss. Thus important to include negative sampling.

Generalized matrix factorization (GMF)

x=puqiy^u,i=σ(hTx)\begin{aligned}x & = p_u \cdot q_i \\\hat y_{u,i} & = \sigma(h^T x)\end{aligned}

  • pup_u is the uu-th row of PRm×kP \in \mathbb{R}^{m\times k} (the user embeddings)
  • qiq_i is the ii-th row of QRn×kQ \in \mathbb{R}^{n\times k} (the item embeddings)
  • mm is the number of users
  • nn is the number of items
  • kk is the number of latent dimensions

Multi-layer perceptron

  • uses concatenation of user and item embeddings as input [uu,vi][u_u, v_i] (not the same ones as the GMF layer) and passes it through a multi-layer perceptron

Final layer

y^u,i\hat y_{u,i} is the prediction score of user uu for item ii:

y^u,i=σ(hT[GMF(pu,qi),MLP([ui,vi])\hat y_{u,i} = \sigma(h^T[\text{GMF}(p_u, q_i), \text{MLP}([u_i, v_i])

neumf_architecture.svg

Evaluation

Split dataset by time into training and test sets.

Hit rate at given cut-off (Hit @ r)

Hit @ r=1user setuuser setI(ranku,gur)\text{Hit @ r} = \frac{1}{\text{user set}}\sum_{u\in \text{user set}} \mathbb{I}(\text{rank}_{u, g_u} \leq r)

Where ranku,gu\text{rank}_{u, g_u} denotes the ranking of ground truth item gug_u of the user uu (the one that the user clicked on) in the predicted recommendation list (ideal ranking is 1).

Area under curve (AUC)

Proportion of times where the predicted rank of the ground truth item was better than that of irrelevant items.

AUC=1user setuuser set1DjDI(ranku,gu<ranku,j)\text{AUC}=\frac{1}{\text{user set}}\sum_{u\in\text{user set}} \frac{1}{\lvert \mathcal{D}\rvert}\sum_{j\in \mathcal{D}} \mathbb{I}(\text{rank}_{u, g_u} < \text{rank}_{u, j})

Where D=item setcandidate items of user u\mathcal{D} = \text{item set} - \text{candidate items of user u}

Sequence-Aware Recommender Systems

  • in previous sections, we abstract the recommendation task as a matrix completion problem without considering users' short-term behaviors
  • take sequentially-ordered user interaction logs into account
  • modeling users' temporal behavioral patterns and discovering their interest drift
  • point-level pattern: indicates impact of single item in historical sequence on target item
  • union-level pattern: implies influences of several previous actions on subsequent target

CASER

Convolutional Sequence Embedding Recommendation

Let sequence (S1ufirst item,S2u,)=Su(\underbrace{S_1^u}_{\text{first item}}, S_2^u \dots, )=S^{u} denote the ordered sequence of items for user uu.

Suppose we take previous LL items into consideration. Embedding matrix that represents former interactions for time step tt can be contstructed: E(u,t)=[qStLu,,qSt1u]TRL×kE^{(u, t)} = [q_{S_{t-L}^{u}}, \dots, q_{S_{t-1}^{u}}]^T \in \mathbb{R}^{L\times k} where qiq_i is the ii-th row of item embeddings QRn×kQ\in\mathbb{R}^{n\times k}.

E(u,t)E^{(u, t)} can be used to infer the transient interest of user uu at time-step tt and is used as input to the subsequent components.

E(u,t)E^{(u, t)} is passed through an horizontal convolutional layer and a vertical convolutional layer in parallel. Their outputs are concatenated and passed into a fully-connected layer to get the final representation.

o=HConv(E(u,t),horizontal filters)o=VConv(E(u,t),vertical filters)z=σ(W[o,o]T+b)y^u,i,t=vi[z,pu]T+bi\begin{aligned}o & = \text{HConv}(E^{(u, t)}, \text{horizontal filters})\\o' & = \text{VConv}(E^{(u, t)}, \text{vertical filters})\\z & = \sigma(W[o,o']^T + b)\\\hat y_{u,i,t} & = v_i [z, p_u]^T + b_i \\\end{aligned}

Where VV is another item embedding matrix, PP is the user embedding matrix for user's general tastes.

Model can be learned with BPR or Hinge loss.

Data generation process:

caser-seq-data.svg

See also Session-based recommendations with RNNs

Feature-rich recommender systems

In addition to interaction data (sparse and noisy), integrate features of items, profiles of users and in which context the interaction occured.

Factorization Machines

  • supervised algo used for classification, regression and ranking tasks
  • generalization of linear regression and matrix factorization
  • reminiscent of support vector machines with polynomial kernel
  • can model χ\chi-way variable interactions (χ\chi is polynoimal order, usually 2), fast optimization algo reduces polynomial computation time to linear complexity

2-way factorization machines

Let xRdx\in\mathbb{R}^d denote feature vectors of one sample, yy corresponding label (either real-valued label or class label). 2-way factorization machine defined as:

y^(x)=w0+i=1dwixi+i=1dj=i+1d<vi,vj>interaction between i-th and j-th featurexixj\hat{y}(x) = w_0 + \sum_{i=1}^d w_i x_i + \sum_{i=1}^{d}\sum_{j=i+1}^{d}\underbrace{<v_i, v_j>}_{\text{interaction between i-th and j-th feature}}x_i x_j

where w0Rw_0\in\mathbb{R} is the global bias; wRdw\in\mathbb{R}^d denotes weights of ii-th variable; VRd×kV\in\mathbb{R}^{d\times k} represents feature embeddings (kk being dimensionality of latent factors)

First two terms correspond to linear regression and last term is an extension of matrix factorization.

Efficient optimization criterion

Naive evaluation leads to O(kd2)\mathcal{O}(kd^2)

factorization_machines_reformulation.png

First line: total sum is twice the triangular sum - sum of diagonal.

Time complexity becomes linear: O(kd)\mathcal{O}(kd). Moreover, only non-zero elements need to be computed (efficient for sparse features).

Loss functions: MSE for regression, cross-entropy for classification, BPR loss for ranking.

Deep Factorization Machines

Factorization machines still combines interactions in a linear model. Higher order feature interactions (> 2) usually lead to numerical instability and high computational complexity.

Deep neural networks can learn sophisticated (nonlinear + higher order) feature interactions with rich feature representations.

DeepFM (Guo et al., 2017) has two components: factorization machine and multi-layer perceptron.

FM component's output is: y^(FM)\hat{y}^{(FM)} (see previous section)

MLP component:

Let eiRke_i\in\mathbb{R}^k be the latent feature vector of feature ii.

z(0)=[e1,,ef]z(l)=activation(W(l)z(l1)+b(l))\begin{aligned}z^{(0)} & = [e_1,\dots,e_f] \\z^{(l)} & = \text{activation}(W^{(l)}z^{(l-1)} + b^{(l)})\end{aligned}

Output is y^(MLP)\hat{y}^{(MLP)}

Both outputs are combined in sigmoid function: y^=σ(y^(FM)+y^(MLP))\hat{y} = \sigma(\hat{y}^{(FM)} + \hat{y}^{(MLP)}).

deepfm_architecture.svg

See also Neural factorization machines for sparse predictive analytics