microsoft-deep-RL-for-news

a.k.a. DRN: A Deep Reinforcement Learning Framework for News Recommendation (2018)

  • Dynamic changes in news recommendations require some sort of online update, and make RL an attractive solution.
  • Current methods only model the immediate reward (e.g. click-through rate), ignoring long-term rewards and scheduling strategies.
  • They don't consider user feedback other than click / no-click labels (e.g. how frequently the user returns).
  • They are exploitative and keep serving the same items to users

In a deep Q network, future rewards are modeled with a Markov decision process, whereas multi-armed bandit methods only capture the reward of the current iteration. The network outputs a single score for a specific user-news combination.

Features:

  • News features: 417-dimensional features, including click counts in the last 1h, 6h, 24h, 1 week and 1 year.
  • User features: aggregated features of the news that the user clicked on in the last 1h, 6h, 24h, 1 week and 1 year, plus total click counts over the same windows; $413 \times 5 = 2065$ dimensions.
  • User-news features: 25-dimensional features describing the interaction between the user and a specific piece of news (frequency of impression, topic and topic category in the user's history).
  • Context features: 32-dimensional features describing the context of the request, such as time and weekday.
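
As a concrete (hypothetical) illustration of "one score per user-news pair", here is a minimal sketch that concatenates the four feature groups above and feeds them to a small MLP; the layer sizes are my own, not the paper's:

```python
import torch
import torch.nn as nn

# Feature dimensions taken from the bullets above.
NEWS_DIM, USER_DIM, USER_NEWS_DIM, CONTEXT_DIM = 417, 2065, 25, 32

class SimpleQNetwork(nn.Module):
    """Hypothetical scorer: one scalar Q value per user-news pair."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        input_dim = NEWS_DIM + USER_DIM + USER_NEWS_DIM + CONTEXT_DIM  # 2539
        self.mlp = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, news, user, user_news, context):
        x = torch.cat([news, user, user_news, context], dim=-1)
        return self.mlp(x).squeeze(-1)  # shape: (batch,)

# Score 8 candidate articles for one user (random placeholders instead of real features).
q_net = SimpleQNetwork()
scores = q_net(torch.randn(8, NEWS_DIM), torch.randn(8, USER_DIM),
               torch.randn(8, USER_NEWS_DIM), torch.randn(8, CONTEXT_DIM))
```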

The reward is modeled using a double deep Q network:

$$y_{s,a,t} = r_{a,t+1} + \gamma\, Q\!\left(s_{a,t+1}, \arg\max_{a'} Q(s_{a,t+1}, a'; W_t);\; W_t'\right)$$

where $r_{a,t+1}$ is the immediate reward for taking action $a$ at time $t$ (the reward is delayed by 1). More critically, $W_t$ and $W_t'$ are two different sets of parameters of the DQN that are switched every few iterations. The goal is to mitigate maximization bias (this is the "double" part of the double deep Q network). The future-reward discount $\gamma$ is set to $0.4$.

Question: what is the training objective? MSE loss?
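
These notes don't settle it; below is a minimal sketch assuming the standard DQN squared-error (MSE) objective against the target $y_{s,a,t}$ above. Here `q_net` scores already-concatenated (state, candidate-news) feature vectors, and all batch keys are placeholders:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.4  # future-reward discount used in the paper

def double_dqn_loss(q_net, target_net, batch):
    """Assumed MSE objective against the double-DQN target y_{s,a,t}.

    batch["sa"]        : features of the (state, chosen news) pair, shape (B, D)
    batch["reward"]    : immediate reward r_{a,t+1}, shape (B,)
    batch["next_cand"] : features of (next state, candidate news) pairs, shape (B, K, D)
    """
    q_sa = q_net(batch["sa"]).squeeze(-1)                     # Q(s, a; W_t)
    with torch.no_grad():
        B, K, D = batch["next_cand"].shape
        flat = batch["next_cand"].reshape(B * K, D)
        q_online = q_net(flat).reshape(B, K)                  # scores under W_t
        q_target = target_net(flat).reshape(B, K)             # scores under W_t'
        best = q_online.argmax(dim=1, keepdim=True)           # argmax_{a'} Q(.; W_t)
        y = batch["reward"] + GAMMA * q_target.gather(1, best).squeeze(1)
    return F.mse_loss(q_sa, y)                                # assumed training objective
```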

Moreover, the paper also models the Q function as the sum of the value function and advantage function:

$$Q(s,a) = V(s) + A(s,a)$$

The value function only depends on user features and context features (e.g. whether the user is active, or has read enough news today does not depend on the news article).

The advantage function only depends on news features and user-news interactions.
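
A minimal sketch of this decomposition with the feature split just described (layer sizes are mine, not the paper's):

```python
import torch
import torch.nn as nn

NEWS_DIM, USER_DIM, USER_NEWS_DIM, CONTEXT_DIM = 417, 2065, 25, 32

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a) with the feature split described above."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        # V(s): user + context features only (state)
        self.value = nn.Sequential(
            nn.Linear(USER_DIM + CONTEXT_DIM, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # A(s, a): news + user-news interaction features only (action)
        self.advantage = nn.Sequential(
            nn.Linear(NEWS_DIM + USER_NEWS_DIM, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, news, user, user_news, context):
        v = self.value(torch.cat([user, context], dim=-1))
        a = self.advantage(torch.cat([news, user_news], dim=-1))
        return (v + a).squeeze(-1)
```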

Exploration

State-of-the-art RL methods apply $\epsilon$-greedy strategies or Upper Confidence Bound (UCB). However, both strategies can harm recommendation performance in the short term, because the explored items can be too random.

Instead, they apply a "dueling bandit gradient descent":

  1. create a perturbed set of weights: $\tilde{W} = (1 + \alpha \cdot \mathcal{U}(-1,1))\cdot W$, where $\alpha := 0.1$.

  2. generate recommendation lists $L$ (from the current network $Q$) and $\tilde{L}$ (from the explore network $\tilde{Q}$, parameterized by $\tilde{W}$)

  3. "probabilistic interleave": randomly select between list LL and L~\tilde{L}. Suppose LL is selected, then an item ii from LL is put into L~\tilde{L} with a probability determined by its ranking in LL. Then L~\tilde{L} is recommended to the user. If the items recommended by the explore network Q~\tilde{Q} receive a better feedback, the agent updates the network QQ towards Q~\tilde{Q} using the equation: W:=W+ηW~W := W + \eta \tilde{W} where η:=0.05\eta := 0.05. If not, the network remains unchanged.

Question: in practice, how do you measure which network received the better feedback? Do you take the mean reward over the period and update $Q$ towards $\tilde{Q}$ only if $\tilde{Q}$'s reward is strictly higher (no margin)?
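
A rough sketch of one exploration step under these rules, assuming the comparison is a simple mean-reward check with no margin (my reading, not something the paper confirms) and omitting the interleaving itself:

```python
import copy
import torch

ALPHA, ETA = 0.1, 0.05  # disturbance scale and update coefficient from the paper

def make_explore_network(q_net):
    """Explore network Q~: W~ = (1 + alpha * U(-1, 1)) * W."""
    explore_net = copy.deepcopy(q_net)
    with torch.no_grad():
        for p in explore_net.parameters():
            p.mul_(1 + ALPHA * (2 * torch.rand_like(p) - 1))
    return explore_net

def minor_update(q_net, explore_net, mean_reward_exploit, mean_reward_explore):
    """If the explored items got better feedback, move W towards W~."""
    if mean_reward_explore > mean_reward_exploit:  # assumed: strict comparison, no margin
        with torch.no_grad():
            for p, p_tilde in zip(q_net.parameters(), explore_net.parameters()):
                p.add_(ETA * p_tilde)              # W := W + eta * W~, as stated above
```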

The DQN is trained by experience replay. They call this a "major update" in the paper, and perform it every hour.

They call the dueling bandit gradient descent a "minor update" and perform it every 30 minutes.
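
Putting the two cadences together, a hypothetical serving loop could look like this (only the two intervals come from the paper; every function here is a placeholder):

```python
import time

MINOR_EVERY = 30 * 60   # dueling bandit gradient descent, every 30 minutes
MAJOR_EVERY = 60 * 60   # experience-replay retraining, every hour

def serve_forever(q_net, replay_buffer, collect_feedback, train_from_replay, dbgd_step):
    last_minor = last_major = time.time()
    while True:
        # Log impressions, clicks and user-activeness signals as transitions.
        replay_buffer.extend(collect_feedback(q_net))
        now = time.time()
        if now - last_minor >= MINOR_EVERY:
            dbgd_step(q_net)                         # minor update (exploration-driven)
            last_minor = now
        if now - last_major >= MAJOR_EVERY:
            train_from_replay(q_net, replay_buffer)  # major update (experience replay)
            last_major = now
```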

Evaluation

In terms of evaluation metrics, they use CTR, precision@5 and nDCG. They also evaluate recommendation diversity by computing the intra-list similarity (ILS), which is the average pairwise similarity of the items in a list. They use cosine similarity as a similarity measure:

$$\text{ILS}(L)=\frac{\sum_{(b_i,b_j)\in L,\, b_i\ne b_j}\cos(b_i, b_j)}{\sum_{(b_i,b_j)\in L,\, b_i\ne b_j} 1}$$

Question: cosine similarity of what? Are $b_i$ and $b_j$ vectors? Do they use the raw features, or do they encode them? Unclear.
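
For reference, a minimal sketch of the metric, assuming $b_i$ and $b_j$ are simply the news feature vectors (my guess; the paper may represent items differently):

```python
import numpy as np

def intra_list_similarity(item_vectors: np.ndarray) -> float:
    """Average pairwise cosine similarity over all ordered pairs (i, j) with i != j."""
    x = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
    sims = x @ x.T                              # cosine similarity matrix
    n = len(x)
    off_diagonal_sum = sims.sum() - np.trace(sims)
    return off_diagonal_sum / (n * (n - 1))     # lower ILS = more diverse list

# Example: a recommended list of 5 articles with 417-dimensional feature vectors.
ils = intra_list_similarity(np.random.rand(5, 417))
```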

Sadly, they don't use their double Q network in conjunction with an upper confidence bound exploration approach, but they do test $\epsilon$-greedy.

From the results, it looks like a simple UCB approach could be a performant baseline.

Appendix: rewarding user activeness

The assumption is that better recommendations influence whether users want to use the application again. They use equations from survival analysis to model user return.

In survival analysis, the hazard function represents the instantaneous rate of occurrence of the event at time $t$, given that it has not occurred before $t$:

$$\lambda(t)=\lim_{dt\to 0} \frac{P(t\leq T<t+dt \mid T\geq t)}{dt}$$

The cumulative probability that the event has occurred by time $t$ is denoted $F(t)=P(T\leq t)$. Let us consider a small change $dF$:

$$dF(t) = P(t\leq T\leq t+dt)$$

Thus:

$$\frac{P(t\leq T\leq t+dt)}{P(T\geq t)} = P(t\leq T\leq t+dt \mid T\geq t)=\lambda(t)\,dt$$

Finally: $dF(t) = \lambda(t)\,dt\,(1-F(t))$.

We denote by $S(t)$ the survival function, which is the probability that the event has not occurred by time $t$, i.e., the probability that the system survives up to time $t$. We'll relate it to the hazard function.

Then the survival function is $S(t)=1-F(t)=1-P(T\leq t)$ and $dF(t)=S(t)\,\lambda(t)\,dt$.

Dividing both sides by $dt$:

$$\frac{dF(t)}{dt}=\lambda(t)\,S(t)$$

Since $F(t)=1-S(t)$, we also have $\frac{dF(t)}{dt}=-\frac{dS(t)}{dt}$. Putting both together yields the first-order differential equation:

$$\frac{dS(t)}{dt}=-\lambda(t)\,S(t)$$

Integrating both sides from $0$ to $t$ and assuming $S(0) = 1$ (the system has survived at $t=0$):

$$\ln S(t) = -\int_0^t \lambda(x)\, dx$$

The expected lifespan $T_0$ is $T_0 = \int_0^\infty S(t)\, dt$.
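
For the constant-hazard case used below, these formulas specialize to (a quick sanity check, not spelled out in the paper):

$$S(t) = e^{-\lambda_0 t}, \qquad T_0 = \int_0^\infty e^{-\lambda_0 t}\, dt = \frac{1}{\lambda_0}.$$

With $\lambda_0 = 1.2\times 10^{-5}\ \text{s}^{-1}$ this gives $T_0 \approx 83{,}333\ \text{s} \approx 23\ \text{h}$, roughly consistent with the $T_0 := 24\,\text{h}$ chosen below.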

In the paper, they assume $\lambda$ to be a constant $\lambda_0$. The dynamics are given by the following logic:

  • When a user returns, their score is boosted, $S(t) := S(t) + S_a$, and then resumes decaying according to $S(t) = \exp\left(-\int_0^t \lambda(x)\, dx\right)$
  • $S_0 := 0.5$ to represent the random initial state of a user
  • $T_0 := 24\,\text{h}$
  • $\lambda_0 := 1.2\times 10^{-5}\ \text{second}^{-1}$
  • $S_a := 0.32$ such that, after one daily request, the user returns to the initial state: $S_0 e^{-\lambda_0 T_0} + S_a = S_0$ (checked numerically below).
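
A quick numeric check of that last identity, with the values listed above:

```python
import math

S0 = 0.5            # initial user activeness score
LAMBDA0 = 1.2e-5    # constant decay rate, per second
T0 = 24 * 3600      # one day, in seconds

# S_a such that S0 * exp(-lambda0 * T0) + S_a == S0 for a once-a-day user
S_a = S0 * (1 - math.exp(-LAMBDA0 * T0))
print(round(S_a, 3))  # -> 0.323, matching the paper's S_a := 0.32
```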

The click / no-click and user-activeness labels are combined linearly: $r = r_{\text{click}} + \beta r_{\text{active}}$. In the experiments, $\beta = 0.05$, which is quite low and makes us wonder why they went through all the trouble.

It's also worth noting that they downsample the click / no-click ratio to approximately 1:11 for "better model fitting purposes".