A Contextual-Bandit Approach to Personalized News Article Recommendation (2012)
Collaborative filtering works by "recognizing similarities across users based on their consumption history" (user-based neighborhood methods, latent factors). It is well adapted when:
Being unable to recommend new items, collaborative filtering is often combined with content-based filtering in so-called "hybrid" approaches.
Content-based filtering "helps to identify new items which well match an existing user's consumption profile, but the recommended items are always similar to the items previously taken by the user".
Multi-armed bandit
A gambler can play $K$ levers associated with their own rewards distributions. The gambler iteratively plays one lever per round and aims to maximize the sum of rewards. The goal is to identify the best arm.
Formally this is equivalent to a one-state Markov decision process.
Contextual bandit
Without context, the gambler must identify the single arm that has the highest expected reward.
In a contextual bandit setting, the gambler still has to choose between arms but also sees a $d$-dimensional feature vector. The objective is to choose the arm that has highest expected reward given the context vector.
In the context of the paper, the levers are news articles and the rewards are clicks. The learning algorithm "sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks in the long run". The set of arms thus dynamically changes as well.
Context includes user demographics, historical activities, article categories, time (to model temporal change)
Formally, the algorithm observes the current user $u_t$ and a set $A_t$ of arms or actions together with their feature vectors $x_{t,a}$ for $a\in A_t$ and time step $t$. The feature vector summarizes information of both the user and action, and is referred to as the context.
Based on observed payoffs in previous trials, choose an action $a$ and receive payoff $r_{t,a_t}$.
Improve the arm-selection strategy with the new observation $(x_{t,a_t}, a_t, r_{t,a_t})$.
The total payoff over $T$ time steps or trials is defined as $\sum_t^T r_{t,a_t}$. The objective is to minimize regret with respect to the optimal arm selection strategy. Regret is defined as:
$R(T):= \mathbb{E}\bigg[\sum_t^T r_{t,a_t}^*\bigg] - \mathbb{E}\bigg[\sum_t^T r_{t,a_t}\bigg]$
In the context of news recommendations, actions = articles and reward = click, therefore, the expected payoff is exactly the click through rate (CTR) of the article.
The hope is that the bandit algorithm "generalizes CTR information from one article / user to another" by leveraging contextual information relating to users, articles and their interactions.
The fundamental challenge in bandit problems is the balance between exploration and exploitation. Some classical methods are:
By assuming that the expected rewards are linear in the context parameters, they compute a confidence interval in closed form and call this algorithm LinUCB.
$\mathbb{E}[r_{t,a_t}\vert x_{t,a}]=x_{t,a}^T \theta_a^*$
Where $\theta_a^*$ is an unknown weight vector for a single arm. The linear models are disjoint in that the parameters are not shared among different arms.
Applied to the current problem, this means that the probability of a click on an article is a linear combination of the context vector. However, this is not the same as modeling the entire problem with a linear regression because each article has its set of weights.
At time step $t$, they estimate $\theta_{t,a}$ using ridge regression:
$\hat\theta_{t,a} = (X_{\leq t, a}^TX_{\leq t,a} + I_d)^{-1}X_{\leq t, a}^Ty_{t,a}$
The feature matrix $X_{\leq t, a}$ contains all previously observed contexts for article $a$. For ease of notation, I will drop the time part.
The UCB arm-selection strategy derived in the paper is: at each trial $t$, choose:
$a_t:=\arg\max_{a\in A_t}\big(x_{t,a}^T\hat\theta_{t,a} + \alpha\sqrt{x_{t,a}^T(X_{a}^TX_{a} + I_d)^{-1}x_{t,a}}\big)$
Where $\alpha$ is the only input parameter. It determines the tightness of the confidence bound: $\alpha = 1+\sqrt{\ln(2/\delta)/2}$ where the absolute error between estimated payoff $x_{t,a}^T\hat\theta_{t,a}$ and true payoff $\mathbb{E}[r_{t,a}\vert x_{t,a}]$ is less than the confidence interval with probability $1-\delta$.
Finally, they claim that this approach is valid even with non-stationary input features.
Hybrid linear models
They propose adding a weight vector that is shared among all arms. They use this weight to learn how to handle user / article interaction features (e.g. users that prefer only articles about politics).
Formally, the model becomes:
$\mathbb{E}[r_{t,a}\vert x_{t,a}]=z_{t,a}^T \beta^* + x_{t,a}^T \theta_a^*$
where $\beta^*$ is shared and $\theta_a^*$ is arm-specific.
In their experiments, $z_{t,a}$ contains user/article interaction features while $x_{t,a}$ contains user features only.
They give the pseudo-code for the new upper confidence bound formula.
In their experiments, they have $d$ features for both users and articles. They construct the features as follows:
fit a bilinear model via logistic regression for click probability given raw user and article features: $\hat p=\phi_u^TW\phi_a$
project user features onto the induced space $\psi_u = \phi_u^T W$. Then, apply K-means to group users in the induced space $\psi_u$ into $5$ clusters.
the final user feature $x_{t,a}$ contains five entries corresponding to the membership of that user in the $5$ clusters (computed with a gaussian kernel and summed to unity) + a constant feature.
Question: it's unclear whether $\phi_a$ represents all articles or article categories only? They speak multiple times about projecting user features onto article categories.
To compute $z_{t,a}$, they perform the same dimensionality reduction as above to obtain a 6-dimensional article feature. $z_{t,a}$ is simply its outer product with $x_{t,a}$, i.e. a $6\times 6=36$ dimensional vector.
$z_{t,a}$ thus contains user-article interaction information while $x_{t,a}$ contains user information only.
They simplify the ranking problem by only choosing which article to put as the main story.
Because training logs are already created by an existing policy, they assign a small percentage of traffic to a random bucket. The bucket selects an article to promote uniformly at random.
For evaluation, they sift through the logs and only keep articles that match what the policy under evaluation would choose.
They show that this is equivalent to evaluating the policy on the real world.
Question: how does the system deal with new users? new articles?
The covariance matrix $A_a := (X_a^T X_a + I_d)$ is initialized to the identify matrix and the response vector $b_a := X_{a}^T y_{t,a}$ is initialized to zero. The confidence bound will be large, encouraging exploration of the article.
We can adjust exploration by adjusting $\alpha$.
New users should have a feature vector $x_{t,a}$, they are handled just like any other user.
Question: how does this relate to other approaches such as Q-learning?
The multi-armed bandit problem is formally equivalent to a one-state Markov decision process. However, because we don't incorporate future rewards as in Q-learning, we cannot model long-term interactions. In practice, the model might prioritize click-bait content that guarantees a short-term reward over content that promotes long-term user activity.