home page recommendations: personalized to a user based on their interests
related item recommendations
Common architecture:
candidate generation (reduce corpus: billions to a few 100k)
scoring (reduce corpus further: a few 100k to ~10; the system can use a more complex model)
re-ranking: apply additional constraints (e.g. remove items the user explicitly disliked) and boosts (e.g. boost the score of fresher content)
Candidate generation
Content-based filtering: use similarity between items
Collaborative filtering: similarities between queries (users) and items
Similarity metrics:
dot product
cosine similarity = dot product of normalized vectors
euclidean distance: $\|q - x\| = \sqrt{\sum_i (q_i - x_i)^2}$. When q and x are normalized, q, x and q−x form a triangle whose sides q and x have length 1, and $\|q - x\|^2 = 2 - 2\cos(q, x)$. Therefore, for normalized embeddings, ranking by smallest euclidean distance is equivalent to ranking by largest cosine similarity or dot product.
Items that appear frequently in the dataset have embeddings with larger norms. If capturing popularity is desirable, you can use the dot product. In practice, you can define a variant that puts less emphasis on the norm: $\|q\|^{\alpha} \|x\|^{\alpha} \cos(q, x)$ with $\alpha \in (0, 1)$ (see the sketch below).
Initialization: if embeddings are initialized with a large norm, rare items (which are updated less frequently) will retain larger norms.
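A minimal NumPy sketch of the three similarity measures and the norm-damped variant (function and variable names are illustrative):

```python
import numpy as np

def dot(q, x):
    return float(np.dot(q, x))

def cosine(q, x):
    return float(np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x)))

def euclidean(q, x):
    return float(np.linalg.norm(q - x))

def damped_similarity(q, x, alpha=0.5):
    # ||q||^alpha * ||x||^alpha * cos(q, x): interpolates between cosine
    # similarity (alpha -> 0) and the dot product (alpha -> 1).
    return (np.linalg.norm(q) ** alpha) * (np.linalg.norm(x) ** alpha) * cosine(q, x)

q = np.array([1.0, 2.0, 0.5])
x = np.array([0.5, 1.0, 1.0])
print(dot(q, x), cosine(q, x), euclidean(q, x), damped_similarity(q, x))
```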
Content-based filtering
recommendations specific to a single user
q (user) and x (item) share the same feature space
evaluate the similarity between q and x (e.g. dot product)
model can only make recommendations based on the existing interests of the user. In other words, the model has limited ability to expand on the user's existing interests.
features are hand-engineered, i.e. they're not learned embeddings (see the sketch below).
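A minimal sketch of content-based scoring with hand-engineered binary features and a dot-product similarity (the feature space and item names are made up):

```python
import numpy as np

# Shared hand-engineered feature space, e.g. [comedy, drama, documentary, french, english].
user_profile = np.array([1, 0, 1, 1, 0])     # aggregated from the user's own history
items = {
    "item_a": np.array([1, 0, 0, 1, 0]),
    "item_b": np.array([0, 1, 0, 0, 1]),
    "item_c": np.array([1, 0, 1, 0, 1]),
}

# Score each item by its dot product with the user profile and rank.
scores = {name: int(np.dot(user_profile, feats)) for name, feats in items.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```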
Collaborative filtering
recommendations are based both on similar users and on items similar to the user's history (recommendations are thus more serendipitous; users can discover new interests)
embeddings can be learned automatically, without relying on hand-engineering of features
feedback matrix where rows = users and columns = items
explicit feedback (likes, ratings) vs implicit feedback (watch time)
Matrix factorization
Simple embedding model.
Given a feedback matrix $A \in \mathbb{R}^{m \times n}$, where m is the number of users and n the number of items:
learn a user embedding matrix $U \in \mathbb{R}^{m \times d}$, where d is the embedding dimension
learn an item embedding matrix $V \in \mathbb{R}^{n \times d}$
such that $A \approx U V^\top$
A is very sparse
Objective functions
Naive least squares over observed entries: $\sum_{(i,j) \in \text{obs}} (A_{ij} - \langle U_i, V_j \rangle)^2$. If we only sum over observed entries, setting every entry of U and V to $1/\sqrt{d}$ makes every prediction equal to 1, which already achieves minimal loss on a binary feedback matrix. The model does not learn how to place the embeddings of irrelevant movies. This is known as folding.
Treat unobserved entries as zero and minimize the Frobenius norm: $\|A - U V^\top\|_F^2$. Solvable using singular value decomposition. However, since A is very sparse, the solution $U V^\top$ will likely be close to zero, which generalizes poorly.
gravity, a global prior that pushes the prediction of any pair towards zero: $\frac{1}{nm} \sum_{(i,j) \notin \text{obs}} \langle U_i, V_j \rangle^2$
balance the two regularization terms with hyper-parameter weights (see the sketch below)
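A minimal NumPy sketch of the resulting objective: squared error over observed entries plus a weighted gravity term (the weight and data layout are illustrative; a full objective may weight additional regularizers):

```python
import numpy as np

def mf_objective(U, V, observed, w_gravity=0.1):
    """Observed squared error plus the gravity term over unobserved pairs.

    U: (m, d) user embeddings, V: (n, d) item embeddings,
    observed: list of (i, j, a_ij) triples for the observed entries of A.
    """
    m, n = U.shape[0], V.shape[0]
    preds = U @ V.T                       # all pairwise <U_i, V_j>, dense for the sketch
    obs_loss = sum((a_ij - preds[i, j]) ** 2 for i, j, a_ij in observed)
    # Gravity over unobserved pairs = sum over all pairs minus the observed ones.
    gravity = (np.sum(preds ** 2) - sum(preds[i, j] ** 2 for i, j, _ in observed)) / (m * n)
    return obs_loss + w_gravity * gravity

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3)) / np.sqrt(3)
V = rng.normal(size=(5, 3)) / np.sqrt(3)
print(mf_objective(U, V, observed=[(0, 1, 1.0), (2, 3, 1.0)]))
```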
Folding
Some categories of queries and items (such as the language of a user / video on YouTube) may only interact with each other.
the model learns how to place the query/item embeddings within each category, but embeddings from different categories may end up in the same region of the embedding space by chance.
This phenomenon is known as folding and can lead to spurious recommendations (model predicts a high score for an item from a different group)
(Figure: illustration of the folding phenomenon; squares = queries, circles = items, colors = categories.)
Optimizer
Stochastic Gradient Descent (generic)
Weighted Alternating Least Squares: specialized for this objective
Objective is convex in U with V fixed, and in V with U fixed, but not jointly convex
WALS alternates between fixing U and solving for V, and vice-versa, until convergence (see the sketch below)
WALS converges faster than SGD
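A toy alternating-least-squares sketch (unweighted ridge solves for clarity; a real WALS step would additionally down-weight the unobserved entries):

```python
import numpy as np

def als_step(X_fixed, A_dense, mask, reg=0.1):
    """Solve for one factor matrix with the other one fixed (ridge least squares per row)."""
    d = X_fixed.shape[1]
    out = np.zeros((A_dense.shape[1], d))
    for j in range(A_dense.shape[1]):
        rows = mask[:, j]                                   # users who interacted with item j
        Xo, ao = X_fixed[rows], A_dense[rows, j]
        out[j] = np.linalg.solve(Xo.T @ Xo + reg * np.eye(d), Xo.T @ ao)
    return out

rng = np.random.default_rng(0)
m, n, d = 6, 8, 3
A = (rng.random((m, n)) < 0.4).astype(float)                # toy implicit-feedback matrix
mask = A > 0
U = rng.normal(scale=1 / np.sqrt(d), size=(m, d))
for _ in range(10):                                          # alternate until convergence
    V = als_step(U, A, mask)                                 # fix U, solve for V
    U = als_step(V, A.T, mask.T)                             # fix V, solve for U
```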
Cold start problem
Given a new item $i_0$ with a few user interactions, we can generate its embedding by keeping U fixed and performing one iteration of WALS (a least-squares solve) for $v_{i_0}$ (see the sketch below)
heuristics-based alternative: average the embeddings of items from the same category
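A sketch of the WALS-style cold-start projection: a single ridge least-squares solve for the new item's embedding, keeping U fixed (indices and values are made up):

```python
import numpy as np

def embed_new_item(U, user_ids, feedback, reg=0.1):
    """Embed a new item into the existing space: one least-squares solve with U fixed."""
    d = U.shape[1]
    Uo = U[user_ids]                                   # the few users who interacted with it
    return np.linalg.solve(Uo.T @ Uo + reg * np.eye(d), Uo.T @ np.asarray(feedback, float))

rng = np.random.default_rng(1)
U = rng.normal(size=(100, 8))                          # existing user embedding matrix
v_new = embed_new_item(U, user_ids=[3, 17, 42], feedback=[1.0, 1.0, 1.0])
print(v_new)
```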
Including side features for query/item
Define block matrix:
$\bar{A} = \begin{pmatrix} A & F_u \\ F_i^\top & 0 \end{pmatrix}$
where $F_u$ is the user features matrix and $F_i$ is the item features matrix.
Since A is sparse, we can use tf.SparseTensor(indices, values, dense_shape) for efficient representation (only store the non-zero entries).
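For example (indices and shapes are made up), the loss over observed entries can be computed directly from the sparse representation:

```python
import tensorflow as tf

num_users, num_items, d = 5, 10, 4

# Only the observed (user, item, value) entries are stored.
observed = [(0, 2, 1.0), (0, 7, 1.0), (3, 1, 1.0)]
A_sparse = tf.SparseTensor(indices=[(u, i) for u, i, _ in observed],
                           values=[v for _, _, v in observed],
                           dense_shape=[num_users, num_items])

U = tf.Variable(tf.random.normal([num_users, d], stddev=0.5))
V = tf.Variable(tf.random.normal([num_items, d], stddev=0.5))

# Gather predictions only at the observed indices and compare to the stored values.
preds = tf.gather_nd(tf.matmul(U, V, transpose_b=True), A_sparse.indices)
loss = tf.reduce_sum(tf.square(A_sparse.values - preds))
```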
Initialization. Let an embedding vector $X = \sigma Z \in \mathbb{R}^n$, where each entry of Z follows a standard normal distribution, so each entry $X_i \sim \mathcal{N}(0, \sigma^2)$.
$\|X\|^2 = \sum_i \sigma^2 Z_i^2 \sim \sigma^2 \chi^2(n)$. Expected squared norm: $\mathbb{E}\|X\|^2 = \sigma^2 \mathbb{E}[\chi^2(n)] = n \sigma^2$, so $\mathbb{E}\|X\| \approx \sigma \sqrt{n}$. Therefore $\sigma$ should be chosen with the dimension in mind (e.g. $\sigma = 1/\sqrt{n}$ gives an expected squared norm of 1).
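A quick numerical check of the expectation above (n and σ are arbitrary):

```python
import numpy as np

n, sigma = 64, 0.1
Z = np.random.default_rng(0).normal(size=(10_000, n))     # 10k samples of Z ~ N(0, I_n)
X = sigma * Z
print(np.mean(np.sum(X ** 2, axis=1)), n * sigma ** 2)    # empirical E||X||^2 vs n * sigma^2
```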
Deep neural networks
Multiclass prediction problem using softmax model:
input is user query:
dense features: watch time, time since last watch
sparse features: watch history, country
output is probability vector over corpus of items, representing the probability of interacting with the item (e.g. click or watch probability)
$\hat{p} = \mathrm{softmax}(f(x)\, W^\top)$, where $W \in \mathbb{R}^{|\text{items}| \times d}$ is the learned item embedding matrix and $f(x) \in \mathbb{R}^d$ is the output of the neural net, i.e. the embedding of the query.
The softmax turns the dot products (similarities) between the query embedding and each item embedding into a probability distribution over items.
Loss: cross-entropy loss between predicted probability distribution and ground truth (items the user has interacted with, as a normalized multi-hot distribution)
computing the gradient on the entire corpus is prohibitively expensive. If we only train on positive pairs, however, the model suffers from folding. We use negative sampling to include items that are irrelevant to the query in the loss (see the sketch below).
if we sample uniformly, some negatives might not contribute enough to the gradient (they are too obvious). Hard negatives: we can assign a higher sampling probability to items j with a higher score $\langle f(x), V_j \rangle$ (they are harder to distinguish from the true positives).
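A minimal TensorFlow sketch of the softmax model, with a full softmax and a sampled-softmax variant (the tower architecture and sizes are illustrative; note that tf.nn.sampled_softmax_loss draws random negatives from a log-uniform distribution by default, not hard negatives):

```python
import tensorflow as tf

num_items, d = 10_000, 32

# Query tower: maps the query features to the query embedding f(x).
query_tower = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(d),
])
# Learned item embedding matrix W (one row per item).
item_embeddings = tf.Variable(tf.random.normal([num_items, d], stddev=1 / d ** 0.5))

def full_softmax_loss(query_features, positive_item_ids):
    f_x = query_tower(query_features)                           # (batch, d)
    logits = tf.matmul(f_x, item_embeddings, transpose_b=True)  # (batch, num_items)
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=positive_item_ids, logits=logits))

def negative_sampled_loss(query_features, positive_item_ids, num_sampled=100):
    # Only num_sampled sampled negatives (plus the positives) enter the loss.
    f_x = query_tower(query_features)
    return tf.reduce_mean(tf.nn.sampled_softmax_loss(
        weights=item_embeddings,
        biases=tf.zeros([num_items]),
        labels=tf.expand_dims(tf.cast(positive_item_ids, tf.int64), 1),
        inputs=f_x,
        num_sampled=num_sampled,
        num_classes=num_items))

query_features = tf.random.normal([4, 16])                      # toy dense query features
positive_item_ids = tf.constant([12, 5, 900, 42])
print(full_softmax_loss(query_features, positive_item_ids).numpy())
print(negative_sampled_loss(query_features, positive_item_ids).numpy())
```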
How to include item features
Use a two-tower neural net (sketched below):
* one NN maps query features to a query embedding
* one NN maps item features to an item embedding
* output is the dot product of the two
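A minimal two-tower sketch (architectures and feature dimensions are illustrative):

```python
import tensorflow as tf

d = 32  # shared embedding dimension

query_tower = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                   tf.keras.layers.Dense(d)])
item_tower = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                  tf.keras.layers.Dense(d)])

def affinity(query_features, item_features):
    q = query_tower(query_features)          # (batch, d) query embedding
    x = item_tower(item_features)            # (batch, d) item embedding
    return tf.reduce_sum(q * x, axis=-1)     # dot product of the two towers

print(affinity(tf.random.normal([2, 16]), tf.random.normal([2, 10])))
```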
Matrix factorization vs deep neural networks
matrix factorization is better for large corpora: easier to scale, cheaper to query (the query embedding does not need to be computed at query time, it's just a lookup in the user embedding matrix), and less prone to folding (by adjusting the unobserved weight in WALS)
DNN models can better capture personalized preferences. They are preferable for scoring because they can use more features and are more expressive. It is usually acceptable for DNN models to fold, since you mostly care about ranking a prefiltered set of candidates assumed to be relevant.
Retrieval
This is a nearest neighbors problem. Return top k items according to similarity score.
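A minimal exact top-k retrieval sketch by dot-product score (for very large corpora this lookup would typically be replaced by approximate nearest neighbours):

```python
import numpy as np

def top_k_items(query_embedding, item_embeddings, k=10):
    scores = item_embeddings @ query_embedding     # similarity score for every item
    top = np.argpartition(-scores, k)[:k]          # unordered indices of the k best items
    return top[np.argsort(-scores[top])]           # sort the k best by score

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(100_000, 32))
query_embedding = rng.normal(size=32)
print(top_k_items(query_embedding, item_embeddings, k=5))
```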