microsoft-deep-RL-for-news

a.k.a. DRN: A Deep Reinforcement Learning Framework for News Recommendation (2018)

  • Dynamic changes in news recommendations require some sort of online update, and make RL an attractive solution.
  • Current methods only model the immediate reward (e.g. click-through rate), ignoring long-term rewards and scheduling strategies.
  • They don't consider user feedback other than click / no-click labels (e.g. how frequently the user returns).
  • They are exploitative and keep serving the same items to users

In a deep Q network, future rewards are modeled with a Markov decision process, whereas multi-armed bandit methods only capture the reward of the current iteration. The network outputs a single score for a specific user-news combination.

Features:

  • News features: 417-dimensional features, including click counts in the last 1h, 6h, 24h, 1 week and 1 year.
  • User features: aggregated features of the news that the user clicked on in the last 1h, 6h, 24h, 1 week and 1 year, plus total click counts over the same windows; $413 \times 5 = 2065$ dimensions.
  • User-news features: 25-dimensional features describing the interaction between the user and a specific piece of news (frequency of impression, topic and topic category in the user's history).
  • Context features: 32-dimensional features describing the context of the request, such as time and weekday.
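
As a concrete (hypothetical) illustration of "one score per user-news pair", here is a minimal sketch that concatenates the four feature groups above and feeds them to a small MLP; the layer sizes are my own, not the paper's:

```python
import torch
import torch.nn as nn

# Feature dimensions taken from the bullets above.
NEWS_DIM, USER_DIM, USER_NEWS_DIM, CONTEXT_DIM = 417, 2065, 25, 32

class SimpleQNetwork(nn.Module):
    """Hypothetical scorer: one scalar Q value per user-news pair."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        input_dim = NEWS_DIM + USER_DIM + USER_NEWS_DIM + CONTEXT_DIM  # 2539
        self.mlp = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, news, user, user_news, context):
        x = torch.cat([news, user, user_news, context], dim=-1)
        return self.mlp(x).squeeze(-1)  # shape: (batch,)

# Score 8 candidate articles for one user (random placeholders instead of real features).
q_net = SimpleQNetwork()
scores = q_net(torch.randn(8, NEWS_DIM), torch.randn(8, USER_DIM),
               torch.randn(8, USER_NEWS_DIM), torch.randn(8, CONTEXT_DIM))
```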

The reward is modeled using a double deep Q network:

$$y_{s,a,t} = r_{a,t+1} + \gamma\, Q\!\left(s_{a,t+1}, \arg\max_{a'} Q(s_{a,t+1}, a'; W_t);\; W_t'\right)$$

where $r_{a,t+1}$ is the immediate reward for taking action $a$ at time $t$ (the reward is delayed by 1). More critically, $W_t$ and $W_t'$ are two different sets of parameters of the DQN that are switched every few iterations. The goal is to mitigate maximization bias (this is the "double" part of the double deep Q network). The future-reward discount $\gamma$ is set to $0.4$.

Question: what is the training objective? MSE loss?
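
These notes don't settle it; below is a minimal sketch assuming the standard DQN squared-error (MSE) objective against the target $y_{s,a,t}$ above. Here `q_net` scores already-concatenated (state, candidate-news) feature vectors, and all batch keys are placeholders:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.4  # future-reward discount used in the paper

def double_dqn_loss(q_net, target_net, batch):
    """Assumed MSE objective against the double-DQN target y_{s,a,t}.

    batch["sa"]        : features of the (state, chosen news) pair, shape (B, D)
    batch["reward"]    : immediate reward r_{a,t+1}, shape (B,)
    batch["next_cand"] : features of (next state, candidate news) pairs, shape (B, K, D)
    """
    q_sa = q_net(batch["sa"]).squeeze(-1)                     # Q(s, a; W_t)
    with torch.no_grad():
        B, K, D = batch["next_cand"].shape
        flat = batch["next_cand"].reshape(B * K, D)
        q_online = q_net(flat).reshape(B, K)                  # scores under W_t
        q_target = target_net(flat).reshape(B, K)             # scores under W_t'
        best = q_online.argmax(dim=1, keepdim=True)           # argmax_{a'} Q(.; W_t)
        y = batch["reward"] + GAMMA * q_target.gather(1, best).squeeze(1)
    return F.mse_loss(q_sa, y)                                # assumed training objective
```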

Moreover, the paper also models the Q function as the sum of the value function and advantage function:

$$Q(s,a) = V(s) + A(s,a)$$

The value function only depends on user features and context features (e.g. whether the user is active, or has read enough news today does not depend on the news article).

The advantage function only depends on news features and user-news interactions.
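
A minimal sketch of this decomposition with the feature split just described (layer sizes are mine, not the paper's):

```python
import torch
import torch.nn as nn

NEWS_DIM, USER_DIM, USER_NEWS_DIM, CONTEXT_DIM = 417, 2065, 25, 32

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a) with the feature split described above."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        # V(s): user + context features only (state)
        self.value = nn.Sequential(
            nn.Linear(USER_DIM + CONTEXT_DIM, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # A(s, a): news + user-news interaction features only (action)
        self.advantage = nn.Sequential(
            nn.Linear(NEWS_DIM + USER_NEWS_DIM, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, news, user, user_news, context):
        v = self.value(torch.cat([user, context], dim=-1))
        a = self.advantage(torch.cat([news, user_news], dim=-1))
        return (v + a).squeeze(-1)
```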

Exploration

State-of-the-art RL methods apply $\epsilon$-greedy strategies or Upper Confidence Bound (UCB). However, both strategies can harm recommendation performance in the short term, because the explored items can be too random.

Instead, they apply a "dueling bandit gradient descent":

  1. create a perturbed set of weights: $\tilde{W} = (1 + \alpha \cdot \mathcal{U}(-1,1))\cdot W$, where $\alpha := 0.1$.

  2. generate recommendation lists $L$ (from the current network $Q$) and $\tilde{L}$ (from the explore network $\tilde{Q}$, parameterized by $\tilde{W}$)

  3. "probabilistic interleave": randomly select between list LL and L~\tilde{L}. Suppose LL is selected, then an item ii from LL is put into L~\tilde{L} with a probability determined by its ranking in LL. Then L~\tilde{L} is recommended to the user. If the items recommended by the explore network Q~\tilde{Q} receive a better feedback, the agent updates the network QQ towards Q~\tilde{Q} using the equation: W:=W+ηW~W := W + \eta \tilde{W} where η:=0.05\eta := 0.05. If not, the network remains unchanged.

Question: in practice, how do you measure which network received the better feedback? Do you take the mean reward over the period and update $Q$ towards $\tilde{Q}$ only if $\tilde{Q}$'s reward is strictly higher (no margin)?
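
A rough sketch of one exploration step under these rules, assuming the comparison is a simple mean-reward check with no margin (my reading, not something the paper confirms) and omitting the interleaving itself:

```python
import copy
import torch

ALPHA, ETA = 0.1, 0.05  # disturbance scale and update coefficient from the paper

def make_explore_network(q_net):
    """Explore network Q~: W~ = (1 + alpha * U(-1, 1)) * W."""
    explore_net = copy.deepcopy(q_net)
    with torch.no_grad():
        for p in explore_net.parameters():
            p.mul_(1 + ALPHA * (2 * torch.rand_like(p) - 1))
    return explore_net

def minor_update(q_net, explore_net, mean_reward_exploit, mean_reward_explore):
    """If the explored items got better feedback, move W towards W~."""
    if mean_reward_explore > mean_reward_exploit:  # assumed: strict comparison, no margin
        with torch.no_grad():
            for p, p_tilde in zip(q_net.parameters(), explore_net.parameters()):
                p.add_(ETA * p_tilde)              # W := W + eta * W~, as stated above
```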

The DQN is trained by experience replay. They call this a "major update" in the paper, and perform it every hour.

They call the dueling bandit gradient descent a "minor update" and perform it every 30 minutes.
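
Putting the two cadences together, a hypothetical serving loop could look like this (only the two intervals come from the paper; every function here is a placeholder):

```python
import time

MINOR_EVERY = 30 * 60   # dueling bandit gradient descent, every 30 minutes
MAJOR_EVERY = 60 * 60   # experience-replay retraining, every hour

def serve_forever(q_net, replay_buffer, collect_feedback, train_from_replay, dbgd_step):
    last_minor = last_major = time.time()
    while True:
        # Log impressions, clicks and user-activeness signals as transitions.
        replay_buffer.extend(collect_feedback(q_net))
        now = time.time()
        if now - last_minor >= MINOR_EVERY:
            dbgd_step(q_net)                         # minor update (exploration-driven)
            last_minor = now
        if now - last_major >= MAJOR_EVERY:
            train_from_replay(q_net, replay_buffer)  # major update (experience replay)
            last_major = now
```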

Evaluation

In terms of evaluation metrics, they use CTR, precision@5 and nDCG. They also evaluate recommendation diversity by computing the intra-list similarity (ILS), which is the average pairwise similarity of the items in a list. They use cosine similarity as a similarity measure:

$$\text{ILS}(L)=\frac{\sum_{(b_i,b_j)\in L,\, b_i\ne b_j}\cos(b_i, b_j)}{\sum_{(b_i,b_j)\in L,\, b_i\ne b_j} 1}$$

Question: cosine similarity of what? Are $b_i$ and $b_j$ vectors? Do they use the raw features, or do they encode them? Unclear.
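
For reference, a minimal sketch of the metric, assuming $b_i$ and $b_j$ are simply the news feature vectors (my guess; the paper may represent items differently):

```python
import numpy as np

def intra_list_similarity(item_vectors: np.ndarray) -> float:
    """Average pairwise cosine similarity over all ordered pairs (i, j) with i != j."""
    x = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
    sims = x @ x.T                              # cosine similarity matrix
    n = len(x)
    off_diagonal_sum = sims.sum() - np.trace(sims)
    return off_diagonal_sum / (n * (n - 1))     # lower ILS = more diverse list

# Example: a recommended list of 5 articles with 417-dimensional feature vectors.
ils = intra_list_similarity(np.random.rand(5, 417))
```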

Sadly, they don't use their double Q network in conjunction with an upper confidence bound exploration approach, but they do test $\epsilon$-greedy.

From the results, it looks like a simple UCB approach could be a performant baseline.

Appendix: rewarding user activeness

The assumption is that better recommendations influence whether users want to use the application again. They use equations from survival analysis to model user return.

In survival analysis, the hazard function represents the instantaneous rate of occurrence of the event at time $t$, given that it has not occurred before $t$:

$$\lambda(t)=\lim_{dt\to 0} \frac{P(t\leq T<t+dt \mid T\geq t)}{dt}$$

The cumulative probability that the event has occurred by time $t$ is denoted $F(t)=P(T\leq t)$. Let us consider a small change $dF$:

$$dF(t) = P(t\leq T\leq t+dt)$$

Thus:

$$\frac{P(t\leq T\leq t+dt)}{P(T\geq t)} = P(t\leq T\leq t+dt \mid T\geq t)=\lambda(t)\,dt$$

Finally: $dF(t) = \lambda(t)\,dt\,(1-F(t))$.

We denote by $S(t)$ the survival function, which is the probability that the event has not occurred by time $t$, i.e., the probability that the system survives up to time $t$. We'll relate it to the hazard function.

Then the survival function is $S(t)=1-F(t)=1-P(T\leq t)$ and $dF(t)=S(t)\,\lambda(t)\,dt$.

Dividing both sides by $dt$:

$$\frac{dF(t)}{dt}=\lambda(t)\,S(t)$$

Since $F(t)=1-S(t)$, we also have $\frac{dF(t)}{dt}=-\frac{dS(t)}{dt}$. Putting both together yields the first-order differential equation:

$$\frac{dS(t)}{dt}=-\lambda(t)\,S(t)$$

Integrating both sides from $0$ to $t$ and assuming $S(0) = 1$ (the system has survived at $t=0$):

$$\ln S(t) = -\int_0^t \lambda(x)\, dx$$

The expected lifespan $T_0$ is $T_0 = \int_0^\infty S(t)\, dt$.
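
For the constant-hazard case used below, these formulas specialize to (a quick sanity check, not spelled out in the paper):

$$S(t) = e^{-\lambda_0 t}, \qquad T_0 = \int_0^\infty e^{-\lambda_0 t}\, dt = \frac{1}{\lambda_0}.$$

With $\lambda_0 = 1.2\times 10^{-5}\ \text{s}^{-1}$ this gives $T_0 \approx 83{,}333\ \text{s} \approx 23\ \text{h}$, roughly consistent with the $T_0 := 24\,\text{h}$ chosen below.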

In the paper, they assume $\lambda$ to be a constant $\lambda_0$. The dynamics are given by the following logic:

  • When a user returns, their score is boosted, $S(t) := S(t) + S_a$, and then resumes decaying according to $S(t) = \exp\left(-\int_0^t \lambda(x)\, dx\right)$
  • $S_0 := 0.5$ to represent the random initial state of a user
  • $T_0 := 24\,\text{h}$
  • $\lambda_0 := 1.2\times 10^{-5}\ \text{second}^{-1}$
  • $S_a := 0.32$ such that, after one daily request, the user returns to the initial state: $S_0 e^{-\lambda_0 T_0} + S_a = S_0$ (checked numerically below).
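
A quick numeric check of that last identity, with the values listed above:

```python
import math

S0 = 0.5            # initial user activeness score
LAMBDA0 = 1.2e-5    # constant decay rate, per second
T0 = 24 * 3600      # one day, in seconds

# S_a such that S0 * exp(-lambda0 * T0) + S_a == S0 for a once-a-day user
S_a = S0 * (1 - math.exp(-LAMBDA0 * T0))
print(round(S_a, 3))  # -> 0.323, matching the paper's S_a := 0.32
```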

The click / no-click and user-activeness labels are combined linearly: $r = r_{\text{click}} + \beta r_{\text{active}}$. In the experiments, $\beta = 0.05$, which is quite low and makes us wonder why they went through all the trouble.

It's also worth noting that they downsample the click / no-click ratio to approximately 1:11 for "better model fitting purposes".