Two kinds of recommendations:
Common architecture:
candidate generation (reduce corpus: billions to a few hundred thousand)
scoring (reduce corpus: a few hundred thousand to ~10; the system can afford a more complex model)
re-ranking: apply additional constraints (e.g. remove items the user explicitly disliked, or boost the score of fresher content)
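The three-stage funnel above can be sketched as follows; all function names, corpus sizes, and the random "models" are illustrative, not from a real system:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_candidates(corpus_size, k=100_000):
    # Candidate generation: a cheap model / index lookup shrinks the corpus.
    return rng.choice(corpus_size, size=k, replace=False)

def score(candidates, k=10):
    # Scoring: a more expensive model ranks the surviving candidates.
    # (Random scores stand in for a real scoring model.)
    scores = rng.random(len(candidates))
    top = np.argsort(scores)[::-1][:k]
    return candidates[top], scores[top]

def rerank(items, scores, disliked, freshness_boost):
    # Re-ranking: drop explicitly disliked items, then boost fresher content.
    keep = ~np.isin(items, list(disliked))
    items, scores = items[keep], scores[keep]
    boosted = scores + np.array([freshness_boost.get(int(i), 0.0) for i in items])
    return items[np.argsort(boosted)[::-1]]

candidates = generate_candidates(corpus_size=1_000_000)
items, scores = score(candidates)
final = rerank(items, scores, disliked={int(items[0])}, freshness_boost={})
```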
Content-based filtering: use similarity between items
Collaborative filtering: similarities between queries (users) and items
Similarity metrics:
dot product
cosine similarity = dot product of normalized vectors
euclidean distance = $\lVert q-x\rVert = \sqrt{\sum_i (q_i - x_i)^2}$. When $q$ and $x$ are normalized, $\lVert q\rVert = \lVert x\rVert = 1$, so:
$\frac{1}{2}\lVert q-x\rVert^2 = \frac{1}{2}(q^Tq - 2q^Tx + x^Tx) = \frac{1}{2}(1 - 2\cos(q,x)+1) = 1-\cos(q,x)$
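A quick numerical check of this identity (random unit vectors, illustrative):

```python
import numpy as np

# For unit-norm q, x: 1/2 * ||q - x||^2 == 1 - cos(q, x).
rng = np.random.default_rng(42)
q = rng.standard_normal(8)
x = rng.standard_normal(8)
q /= np.linalg.norm(q)
x /= np.linalg.norm(x)

lhs = 0.5 * np.linalg.norm(q - x) ** 2
rhs = 1.0 - q @ x   # cos(q, x) reduces to the dot product for unit vectors
assert np.isclose(lhs, rhs)
```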
Items that appear frequently in the dataset have embeddings with larger norms. If capturing popularity is desirable, you can use dot product. In practice, you can define a variant that puts less emphasis on the norm: $\lVert q\rVert^\alpha \lVert x\rVert^\alpha \cos(q,x), \alpha \in (0,1)$
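The norm-softened variant interpolates between cosine similarity ($\alpha=0$) and the dot product ($\alpha=1$); a minimal sketch with illustrative values:

```python
import numpy as np

def scaled_similarity(q, x, alpha=0.5):
    """||q||^alpha * ||x||^alpha * cos(q, x), with alpha in (0, 1).

    alpha=1 recovers the raw dot product; alpha=0 recovers cosine similarity.
    """
    cos = q @ x / (np.linalg.norm(q) * np.linalg.norm(x))
    return (np.linalg.norm(q) * np.linalg.norm(x)) ** alpha * cos

q = np.array([1.0, 2.0])
x = np.array([2.0, 0.0])
assert np.isclose(scaled_similarity(q, x, alpha=1.0), q @ x)
```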
Initialization: if embeddings are initialized with a large norm, rare items (which are updated less frequently) will retain larger norms.
Simple embedding model (matrix factorization).
Given feedback matrix $A\in \mathbb{R}^{m\times n}$ where $m$ is the number of users and $n$ the number of items, learn a user embedding matrix $U\in \mathbb{R}^{m\times d}$ and an item embedding matrix $V\in \mathbb{R}^{n\times d}$ such that: $A\approx UV^T$
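A minimal sketch of fitting $A \approx UV^T$ by gradient descent on the observed entries; sizes, learning rate, and iteration count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 20, 15, 4
A = rng.random((m, d)) @ rng.random((d, n))   # synthetic rank-d feedback matrix
mask = rng.random((m, n)) < 0.5               # which entries are observed

U = 0.1 * rng.standard_normal((m, d))         # user embeddings
V = 0.1 * rng.standard_normal((n, d))         # item embeddings

def rmse(U, V):
    err = (U @ V.T - A) * mask
    return np.sqrt((err ** 2).sum() / mask.sum())

rmse_before = rmse(U, V)
lr = 0.05
for _ in range(500):
    E = (U @ V.T - A) * mask                  # error on observed entries only
    U, V = U - lr * E @ V, V - lr * E.T @ U   # simultaneous gradient steps
rmse_after = rmse(U, V)
```

In practice alternating least squares (WALS) or SGD with regularization is used instead of this plain loop.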
Folding: embeddings of unrelated groups can end up close together in embedding space, producing spurious cross-group recommendations. Illustration of the folding phenomenon: squares = entities, circles = items, colors = categories
Define block matrix:
$\bar{A} = \begin{pmatrix}A & F_i\\F_u & \vec{0}\end{pmatrix}$
Where $F_u$ is the user features matrix and $F_i$ is the item features matrix.
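A shape check for the block matrix $\bar{A}$; the dimensions below are illustrative, and the exact orientation/transposition of $F_u$ and $F_i$ depends on convention (shapes are chosen here so the blocks conform):

```python
import numpy as np

m, n, du, di = 5, 7, 3, 2
A = np.zeros((m, n))        # m users x n items feedback
F_i = np.zeros((m, di))     # item-feature block appended as extra columns
F_u = np.zeros((du, n))     # user-feature block appended as extra rows
A_bar = np.block([[A, F_i], [F_u, np.zeros((du, di))]])
assert A_bar.shape == (m + du, n + di)
```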
Since $A$ is sparse, we can use tf.SparseTensor(indices, values, dense_shape) for an efficient representation (store only the non-zero entries).
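The same (indices, values, shape) idea, sketched with scipy.sparse to avoid a TensorFlow dependency; the ratings below are made up:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Store only the observed ratings of a 4 x 5 feedback matrix A.
users = np.array([0, 0, 2])         # row indices of observed entries
items = np.array([1, 3, 2])         # column indices of observed entries
ratings = np.array([5.0, 3.0, 4.0])
A = coo_matrix((ratings, (users, items)), shape=(4, 5))

assert A.nnz == 3                   # only the non-zeros are stored
assert A.toarray()[2, 2] == 4.0
```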
Initialization. Let the ratings vector $X = \sigma Z$ where each entry of $Z$ follows a standard normal distribution. We have that $X \sim \mathcal{N}(0, \sigma^2)$.
$\lVert X\rVert^2 = \sum_i \sigma^2 Z_i^2 \sim \sigma^2\chi^2(n)$. Expected squared norm: $\mathbb{E} \lVert X\rVert^2 = \sigma^2 \mathbb{E}[\chi^2(n)] = n \sigma^2$, so the typical norm is $\sqrt{\mathbb{E}\lVert X\rVert^2} = \sigma \sqrt{n}$. To initialize embeddings at a target norm, pick $\sigma$ accordingly (e.g. $\sigma = 1/\sqrt{n}$ for approximately unit-norm vectors).
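An empirical check of $\mathbb{E}\lVert X\rVert^2 = n\sigma^2$ (sample size and parameters are illustrative):

```python
import numpy as np

# Draw 10k vectors X = sigma * Z with Z ~ N(0, I_n) and average ||X||^2.
rng = np.random.default_rng(0)
n, sigma = 50, 0.3
X = sigma * rng.standard_normal((10_000, n))
mean_sq_norm = (np.linalg.norm(X, axis=1) ** 2).mean()
assert np.isclose(mean_sq_norm, n * sigma ** 2, rtol=0.05)
```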
Multiclass prediction problem using softmax model:
input is user query:
dense features: watch time, time since last watch
sparse features: watch history, country
output is probability vector over corpus of items, representing the probability of interacting with the item (e.g. click or watch probability)
Use two-tower neural net:
* one NN maps query features to query embedding
* one NN maps item features to item embedding
* output is the dot product of the two
Serving is then a nearest-neighbors problem: return the top $k$ items according to the similarity score.
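A minimal two-tower sketch; each "tower" is a single linear layer here for brevity (real towers are deeper networks), and all dimensions and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_query, d_item, d_emb, n_items = 6, 4, 8, 1000

W_q = rng.standard_normal((d_query, d_emb))   # query tower "weights"
W_i = rng.standard_normal((d_item, d_emb))    # item tower "weights"

query_features = rng.standard_normal(d_query)
item_features = rng.standard_normal((n_items, d_item))

q_emb = query_features @ W_q                  # query embedding
item_emb = item_features @ W_i                # one embedding per item
scores = item_emb @ q_emb                     # dot-product similarity

k = 10
top_k = np.argsort(scores)[::-1][:k]          # top-k retrieval
```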
For large-scale retrieval, you can:
precompute the scores offline
use approximate nearest neighbor search at serving time instead of exact scoring
A recommendation system may have multiple candidate generation models
the scoring model combines their candidates, scores them and ranks them accordingly
why not let the candidate generator score?
scores of different candidate generation models might not be comparable
with a smaller pool of candidates, scoring model can afford to use more features and parameters and thus better capture context
objective functions:
maximize click rate: the system recommends click-bait. Poor user experience; users' interest may quickly fade.
maximize watch time: system recommends very long videos. Poor user experience. Multiple short watches can be just as good as one long watch.
increase diversity and maximize session watch time: recommend shorter videos, but ones that are more likely to increase engagement.
positional bias: items that appear lower on the screen are less likely to be clicked.
freshness:
re-run training (warm start) as often as possible
new users: average a cluster of embeddings based on user features or use a neural net that takes user features as input
add document age as a feature
diversity (lack of diversity causes boring user experience):
train multiple candidate generators using different sources
train multiple rankers using different objective functions
re-rank items based on genre or metadata
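A hypothetical genre-based re-ranking pass that caps how many items per genre survive; the function name, cap, and data are illustrative:

```python
def rerank_with_genre_cap(ranked_items, genres, max_per_genre=2):
    # Walk the ranked list, keeping at most max_per_genre items per genre.
    counts, out = {}, []
    for item in ranked_items:
        g = genres[item]
        if counts.get(g, 0) < max_per_genre:
            out.append(item)
            counts[g] = counts.get(g, 0) + 1
    return out

genres = {1: "news", 2: "news", 3: "news", 4: "music", 5: "sports"}
# Item 3 is dropped: two "news" items already rank above it.
assert rerank_with_genre_cap([1, 2, 3, 4, 5], genres) == [1, 2, 4, 5]
```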
fairness:
track metrics for each demographic group to watch for biases
make separate models for underserved groups