linUCB-yahoo-contextual-bandit

A Contextual-Bandit Approach to Personalized News Article Recommendation (2012)

link to the paper

Overview of recommender systems

Collaborative filtering works by "recognizing similarities across users based on their consumption history" (user-based neighborhood methods, latent factors). It is well adapted when:

  • there is a high overlap in historical consumption across users
  • the content universe is almost static

Being unable to recommend new items, collaborative filtering is often combined with content-based filtering in so-called "hybrid" approaches.

Content-based filtering "helps to identify new items which well match an existing user's consumption profile, but the recommended items are always similar to the items previously taken by the user".

Problem formulation

Multi-armed bandit

A gambler can play $K$ levers, each associated with its own reward distribution. The gambler iteratively plays one lever per round and aims to maximize the sum of rewards; the goal is to identify the best arm.

Formally this is equivalent to a one-state Markov decision process.

Contextual bandit

Without context, the gambler must identify the single arm that has the highest expected reward.

In a contextual bandit setting, the gambler still has to choose between arms but also sees a $d$-dimensional feature vector. The objective is to choose the arm with the highest expected reward given that context vector.

In the context of the paper, the levers are news articles and the rewards are clicks. The learning algorithm "sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks in the long run". The set of arms thus dynamically changes as well.

The context includes user demographics, historical activities, article categories, and time (to model temporal change).

Formally, at each time step $t$, the algorithm observes the current user $u_t$ and a set $A_t$ of arms or actions together with their feature vectors $x_{t,a}$ for $a \in A_t$. The feature vector summarizes information about both the user and the action, and is referred to as the context.

Based on the payoffs observed in previous trials, it chooses an action $a_t \in A_t$ and receives payoff $r_{t,a_t}$.

It then improves its arm-selection strategy with the new observation $(x_{t,a_t}, a_t, r_{t,a_t})$.

The total payoff over $T$ time steps or trials is defined as $\sum_{t=1}^T r_{t,a_t}$. The objective is to minimize regret with respect to the optimal arm-selection strategy. Regret is defined as:

$$R(T) := \mathbb{E}\bigg[\sum_{t=1}^T r_{t,a_t^*}\bigg] - \mathbb{E}\bigg[\sum_{t=1}^T r_{t,a_t}\bigg]$$

where $a_t^*$ is the arm with the maximum expected payoff at trial $t$.
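
To make the protocol concrete, here is a minimal sketch of the interaction loop. The `policy` and `environment` objects are hypothetical placeholders of my own, not part of the paper:

```python
def run_bandit(policy, environment, T):
    """Generic contextual-bandit loop: observe contexts, pick an arm, learn from the payoff."""
    total_payoff = 0.0
    for t in range(T):
        # The environment reveals the current arm set and one context vector per arm.
        arm_ids, contexts = environment.observe(t)   # contexts: dict arm_id -> x_{t,a}
        a_t = policy.select(arm_ids, contexts)       # choose an arm based on past payoffs
        r_t = environment.reward(t, a_t)             # e.g. 1 if the user clicked, 0 otherwise
        policy.update(a_t, contexts[a_t], r_t)       # improve the strategy with (x_{t,a_t}, a_t, r_{t,a_t})
        total_payoff += r_t
    return total_payoff
```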

In the context of news recommendation, actions correspond to articles and the reward is a click; therefore, the expected payoff of an article is exactly its click-through rate (CTR).

The hope is that the bandit algorithm "generalizes CTR information from one article / user to another" by leveraging contextual information relating to users, articles and their interactions.

The fundamental challenge in bandit problems is the balance between exploration and exploitation. Some classical methods are:

  • $\epsilon$-greedy (by decaying $\epsilon$ appropriately, the per-step regret $R(T)/T$ converges to $0$ with probability $1$)
  • upper-confidence-bound (UCB) algorithms: estimate both the mean payoff $\hat\mu_{t,a}$ of each arm and a corresponding confidence interval $c_{t,a}$ (which typically shrinks as arm $a$ is played more often). Then select the arm that achieves the highest upper confidence bound, $\arg\max_a(\hat\mu_{t,a} + c_{t,a})$; see the sketch after this list.
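
For illustration, a minimal context-free UCB selection rule might look as follows. This is a sketch only; the $\sqrt{2\ln t / n_a}$ width follows the classical UCB1 recipe rather than anything specific to the paper:

```python
import numpy as np

def ucb_select(counts, reward_sums, t, alpha=1.0):
    """Pick the arm with the highest upper confidence bound.

    counts[a]      -- number of times arm a has been played so far
    reward_sums[a] -- sum of payoffs observed for arm a
    t              -- current trial index (t >= 1)
    """
    scores = []
    for a, (n_a, s_a) in enumerate(zip(counts, reward_sums)):
        if n_a == 0:
            return a                                    # play every arm at least once
        mu_hat = s_a / n_a                              # estimated mean payoff
        c_t_a = alpha * np.sqrt(2.0 * np.log(t) / n_a)  # confidence width shrinks as n_a grows
        scores.append(mu_hat + c_t_a)
    return int(np.argmax(scores))
```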

LinUCB Algorithm

By assuming that the expected payoff is linear in the context features, they can compute the confidence interval in closed form; the resulting algorithm is called LinUCB.

$$\mathbb{E}[r_{t,a}\mid x_{t,a}] = x_{t,a}^T \theta_a^*$$

where $\theta_a^*$ is an unknown weight vector for a single arm. The linear models are disjoint in that the parameters are not shared among different arms.

Applied to the current problem, this means that the click probability of an article is a linear function of the context vector. However, this is not the same as modeling the entire problem with a single linear regression, because each article has its own set of weights.

At time step $t$, they estimate $\theta_{t,a}$ using ridge regression:

$$\hat\theta_{t,a} = (X_{\leq t,a}^T X_{\leq t,a} + I_d)^{-1} X_{\leq t,a}^T y_{\leq t,a}$$

The feature matrix $X_{\leq t,a}$ contains all contexts previously observed when article $a$ was displayed, and $y_{\leq t,a}$ contains the corresponding click responses. For ease of notation, I will drop the time subscript from here on.
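
As a sketch, this per-arm ridge estimate can be computed directly with NumPy (variable names are mine):

```python
import numpy as np

def ridge_estimate(X_a, y_a):
    """Closed-form ridge estimate of theta for one arm.

    X_a: (n, d) matrix of contexts observed when arm a was chosen
    y_a: (n,)  vector of observed payoffs (clicks) for those trials
    """
    d = X_a.shape[1]
    A_a = X_a.T @ X_a + np.eye(d)      # regularized design matrix
    b_a = X_a.T @ y_a                  # response vector
    return np.linalg.solve(A_a, b_a)   # theta_hat_a = A_a^{-1} b_a
```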

The UCB arm-selection strategy derived in the paper is: at each trial $t$, choose:

$$a_t := \arg\max_{a\in A_t}\Big(x_{t,a}^T\hat\theta_{t,a} + \alpha\sqrt{x_{t,a}^T(X_a^T X_a + I_d)^{-1}x_{t,a}}\Big)$$

where $\alpha$ is the only input parameter. It determines the tightness of the confidence bound: with $\alpha = 1+\sqrt{\ln(2/\delta)/2}$, the absolute error between the estimated payoff $x_{t,a}^T\hat\theta_{t,a}$ and the true expected payoff $\mathbb{E}[r_{t,a}\mid x_{t,a}]$ is bounded by the confidence interval with probability at least $1-\delta$.
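
Putting the pieces together, here is a minimal sketch of disjoint LinUCB that maintains $A_a = X_a^T X_a + I_d$ and $b_a = X_a^T y_a$ incrementally instead of refitting from scratch (class and method names are mine, not from the paper):

```python
import numpy as np

class DisjointLinUCB:
    """Sketch of LinUCB with disjoint linear models (one theta per arm)."""

    def __init__(self, d, alpha=1.0):
        self.d = d
        self.alpha = alpha
        self.A = {}  # arm -> d x d matrix  A_a = X_a^T X_a + I_d
        self.b = {}  # arm -> d vector      b_a = X_a^T y_a

    def _init_arm(self, a):
        # New arms start at A_a = I_d, b_a = 0: a wide confidence bound encourages exploration.
        self.A[a] = np.eye(self.d)
        self.b[a] = np.zeros(self.d)

    def select(self, arm_ids, contexts):
        """contexts: dict arm_id -> context vector x_{t,a} of dimension d."""
        scores = {}
        for a in arm_ids:
            if a not in self.A:
                self._init_arm(a)
            A_inv = np.linalg.inv(self.A[a])
            x = contexts[a]
            theta_hat = A_inv @ self.b[a]
            # Estimated payoff plus the exploration bonus (upper confidence bound).
            scores[a] = x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(scores, key=scores.get)

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
```

The `select`/`update` interface matches the generic interaction loop sketched earlier.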

Finally, they claim that this approach is valid even with non-stationary input features.

Hybrid linear models

They propose adding a weight vector that is shared among all arms. This shared weight is used to capture user/article interaction features (e.g. users who prefer only articles about politics).

Formally, the model becomes:

$$\mathbb{E}[r_{t,a}\mid x_{t,a}] = z_{t,a}^T \beta^* + x_{t,a}^T \theta_a^*$$

where $\beta^*$ is shared across all arms and $\theta_a^*$ is arm-specific.

In their experiments, $z_{t,a}$ contains user/article interaction features while $x_{t,a}$ contains user features only.

They give the pseudo-code for the new upper confidence bound formula.
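
The full pseudo-code also maintains cross-covariance matrices $B_a$ between the shared and arm-specific parts, which enter both $\hat\theta_a$ and the confidence width. The simplified sketch below only illustrates the structure of the hybrid score and drops those correction terms:

```python
import numpy as np

def hybrid_score(z, x, A_0, b_0, A_a, b_a, alpha=1.0):
    """Simplified hybrid LinUCB score: shared part z^T beta_hat + arm part x^T theta_hat_a.

    NOTE: a sketch only. The paper's pseudo-code additionally tracks B_a (the
    cross-covariance between z and x for arm a) and uses it in both the estimate
    of theta_hat_a and the confidence width; those terms are omitted here.
    """
    beta_hat = np.linalg.solve(A_0, b_0)    # shared coefficients (user/article interactions)
    theta_hat = np.linalg.solve(A_a, b_a)   # arm-specific coefficients
    width = np.sqrt(z @ np.linalg.solve(A_0, z) + x @ np.linalg.solve(A_a, x))
    return z @ beta_hat + x @ theta_hat + alpha * width
```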

In their experiments, they have $d$ features for both users and articles. They construct the features as follows:

  1. fit a bilinear model via logistic regression for the click probability given raw user and article features: $\hat p = \phi_u^T W \phi_a$

  2. project user features into the induced space: $\psi_u = \phi_u^T W$. Then apply K-means to group users in the induced space $\psi_u$ into $5$ clusters.

  3. the final user feature $x_{t,a}$ contains five entries corresponding to the membership of that user in the $5$ clusters (computed with a Gaussian kernel and normalized to sum to unity), plus a constant feature; see the sketch after this list.
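
A hedged sketch of this construction, assuming the projection matrix `W` (from the bilinear logistic fit), the K-means `centroids`, and a Gaussian-kernel `bandwidth` are given (the bandwidth is my assumption; the paper does not specify it here):

```python
import numpy as np

def user_feature(phi_u, W, centroids, bandwidth=1.0):
    """Build the 6-dimensional user feature: 5 soft cluster memberships + a constant."""
    psi_u = phi_u @ W                                        # project raw user features into the induced space
    sq_dist = np.sum((centroids - psi_u) ** 2, axis=1)       # squared distance to each of the 5 centroids
    membership = np.exp(-sq_dist / (2.0 * bandwidth ** 2))   # Gaussian kernel memberships
    membership /= membership.sum()                           # normalize so the memberships sum to unity
    return np.concatenate([membership, [1.0]])               # append the constant feature
```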

Question: does $\phi_a$ represent individual articles or article categories only? They speak multiple times about projecting user features onto article categories.

To compute $z_{t,a}$, they perform the same dimensionality reduction as above to obtain a 6-dimensional article feature. $z_{t,a}$ is simply its outer product with $x_{t,a}$, i.e. a $6\times 6 = 36$-dimensional vector.

$z_{t,a}$ thus contains user-article interaction information while $x_{t,a}$ contains user information only.
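
A one-line sketch of this construction:

```python
import numpy as np

def interaction_feature(x_user, x_article):
    """Outer product of the 6-dim user and article features, flattened to 36 dimensions."""
    return np.outer(x_article, x_user).ravel()
```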

Evaluation

They simplify the ranking problem to choosing which single article to display as the main story.

Because the training logs were generated by an existing serving policy, they assign a small percentage of traffic to a random bucket, in which the article to promote is selected uniformly at random.

For evaluation, they sift through the random-bucket logs and only keep the events whose displayed article matches what the policy under evaluation would have chosen.

They show that this offline procedure is equivalent to evaluating the policy on live traffic.
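
A minimal sketch of this replay evaluator, assuming logged events of the form `(arm_ids, contexts, shown_article, click)` recorded under the uniformly random logging policy (the tuple layout is my assumption):

```python
def replay_evaluate(policy, logged_events):
    """Offline replay on random-bucket logs: keep only events where the evaluated
    policy picks the article that was actually shown, and average their clicks."""
    matched, clicks = 0, 0
    for arm_ids, contexts, shown, click in logged_events:
        chosen = policy.select(arm_ids, contexts)
        if chosen == shown:                               # event matches the policy under evaluation
            matched += 1
            clicks += click
            policy.update(shown, contexts[shown], click)  # learn only from matched events
    return clicks / matched if matched else 0.0           # estimated CTR of the policy
```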

Question: how does the system deal with new users? new articles?

The covariance matrix $A_a := X_a^T X_a + I_d$ is initialized to the identity matrix and the response vector $b_a := X_a^T y_a$ is initialized to zero. The confidence bound will therefore be large, encouraging exploration of the article.

We can adjust the amount of exploration by tuning $\alpha$.

New users still come with a feature vector $x_{t,a}$, so they are handled just like any other user.

Question: how does this relate to other approaches such as Q-learning?

The multi-armed bandit problem is formally equivalent to a one-state Markov decision process. However, because we don't incorporate future rewards as in Q-learning, we cannot model long-term interactions. In practice, the model might prioritize click-bait content that guarantees a short-term reward over content that promotes long-term user activity.