DRN: A Deep Reinforcement Learning Framework for News Recommendation (2018)
In a deep Q network, future rewards are modeled with a Markov Decision Process, whereas multi-armed bandit methods can only capture the reward of the current iteration. The network outputs a single score for a specific user-news pair.
Features: news features, user features, user-news interaction features, and context features.
The reward is modeled using a double deep Q network:

$$y_{s,a,t} = r_{a,t+1} + \gamma\, Q\big(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; W_t);\ W'_t\big)$$

where $r_{a,t+1}$ is the immediate reward for taking action $a$ at time $t$ (the reward is delayed by 1). More critically, $W_t$ and $W'_t$ are two different sets of parameters of the DQN that are switched every few iterations. The goal is to mitigate maximization bias (this is the "double" part of the double deep Q network). Future rewards are discounted by a factor $\gamma$.
Question: what is the training objective? MSE loss?
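The paper does not spell this out, but a natural reading is a squared temporal-difference error on the double-DQN target. Below is a minimal sketch under that assumption; `q_w` and `q_w_prime` stand for the two parameter sets $W_t$ and $W'_t$, and both are assumed to be callables scoring a (state, action-features) pair:

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_w, q_w_prime, s, a, r, s_next, candidate_actions, gamma):
    """Sketch of the double-DQN target and an (assumed) MSE training objective.

    s, a:              (1, d_s) and (1, d_a) feature tensors for the current step
    candidate_actions: (n, d_a) features of the candidate news items at t+1
    """
    with torch.no_grad():
        s_next_rep = s_next.expand(candidate_actions.size(0), -1)
        # action selection with W ...
        best = q_w(s_next_rep, candidate_actions).squeeze(-1).argmax()
        # ... value estimation with W' (the "double" part)
        target = r + gamma * q_w_prime(s_next_rep[best:best + 1],
                                       candidate_actions[best:best + 1]).squeeze(-1)
    # squared TD error between the current estimate and the target
    return F.mse_loss(q_w(s, a).squeeze(-1), target)
```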
Moreover, the paper also models the Q function as the sum of a value function and an advantage function:

$$Q(s, a) = V(s) + A(s, a)$$
The value function only depends on user features and context features (e.g., whether the user is active or has already read a lot of news today does not depend on the specific news article).
The advantage function only depends on news features and user-news interactions.
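A sketch of what this value/advantage split could look like, following the description above (the layer sizes and exact feature routing are my assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Q(s, a) = V(s) + A(s, a): V sees only user + context features,
    while A sees news + user-news interaction features."""

    def __init__(self, d_user, d_context, d_news, d_interaction, hidden=64):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(d_user + d_context, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(d_news + d_interaction, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, user, context, news, interaction):
        v = self.value(torch.cat([user, context], dim=-1))
        a = self.advantage(torch.cat([news, interaction], dim=-1))
        return v + a  # one score per user-news pair
```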
State-of-the-art RL methods apply $\epsilon$-greedy strategies or Upper Confidence Bound (UCB). However, both strategies can harm recommendation performance in the short term, because the explored items can be too random.
Instead, they apply a "dueling bandit gradient descent":
create a disturbed set of weights: $\tilde{W} = W + \Delta W$ with $\Delta W = \alpha \cdot \mathrm{rand}(-1, 1) \cdot W$, where $\alpha$ is the exploration coefficient and $\mathrm{rand}(-1, 1)$ is a uniform random number between $-1$ and $1$.
generate recommendation lists $L$ (from the current network $Q$) and $\tilde{L}$ (from the explore network $\tilde{Q}$)
"probabilistic interleave": randomly select between list and . Suppose is selected, then an item from is put into with a probability determined by its ranking in . Then is recommended to the user. If the items recommended by the explore network receive a better feedback, the agent updates the network towards using the equation: where . If not, the network remains unchanged.
Question: in practice, how do you measure which network received the highest feedback? Do you take the mean reward over the period and update if it's strictly higher? (no margin?)
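My best guess, sketched below: compare the summed feedback credited to each network on the interleaved list and update only if the explore network does strictly better. The functions `recommend` and `collect_feedback`, the fair-coin interleave, and this comparison rule are all assumptions, not the paper's exact procedure:

```python
import copy
import random
import torch

def explore_step(q_net, alpha, eta, recommend, collect_feedback):
    """One minor-update cycle: perturb weights, interleave, compare feedback, maybe update."""
    # 1. perturbed copy of the network: W~ = W + alpha * rand(-1, 1) * W
    q_tilde = copy.deepcopy(q_net)
    deltas = {}
    with torch.no_grad():
        for (name, w), w_tilde in zip(q_net.named_parameters(), q_tilde.parameters()):
            delta = alpha * (2 * torch.rand_like(w) - 1) * w
            w_tilde.add_(delta)
            deltas[name] = delta

    # 2. recommendation lists from the current and explore networks
    exploit_list = recommend(q_net)
    explore_list = recommend(q_tilde)

    # 3. probabilistic interleave (the paper weights picks by rank; a fair coin here)
    merged, source = [], []
    for a, b in zip(exploit_list, explore_list):
        pick, src = (a, "exploit") if random.random() < 0.5 else (b, "explore")
        if pick not in merged:
            merged.append(pick)
            source.append(src)

    # 4. credit observed feedback (e.g. clicks) to the network that supplied each item
    feedback = collect_feedback(merged)
    exploit_reward = sum(f for f, s in zip(feedback, source) if s == "exploit")
    explore_reward = sum(f for f, s in zip(feedback, source) if s == "explore")

    # 5. move W towards W~ only if the explore network did strictly better
    if explore_reward > exploit_reward:
        with torch.no_grad():
            for name, w in q_net.named_parameters():
                w.add_(eta * deltas[name])
```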
A DQN is trained with experience replay: past transitions are stored and periodically replayed to update the network. They call this a "major update" in the paper, and perform it every hour.
They call the dueling bandit gradient descent a "minor update" and perform it every 30 minutes.
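A toy illustration of how the two cadences could be wired together; the buffer size, batch size, and the `agent.*` method names are placeholders, not from the paper:

```python
import random
import time
from collections import deque

MINOR_EVERY = 30 * 60   # dueling bandit gradient descent, every 30 minutes
MAJOR_EVERY = 60 * 60   # experience replay over stored transitions, every hour

def serve_forever(agent, now=time.time):
    replay_buffer = deque(maxlen=100_000)   # (state, action, reward, next_state) tuples
    last_minor = last_major = now()
    while True:
        replay_buffer.append(agent.serve_one_request())  # recommend + log feedback
        if now() - last_minor >= MINOR_EVERY:
            agent.minor_update()                         # explore-network comparison
            last_minor = now()
        if now() - last_major >= MAJOR_EVERY:
            batch = random.sample(list(replay_buffer), min(len(replay_buffer), 1024))
            agent.major_update(batch)                    # experience replay
            last_major = now()
```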
In terms of evaluation metrics, they use CTR, precision@5 and nDCG. They also evaluate recommendation diversity by computing the intra-list similarity (ILS), which is the average pairwise similarity of the items in a list. They use cosine similarity as the similarity measure:

$$\mathrm{ILS}(L) = \frac{\sum_{i \in L} \sum_{j \in L,\, j \neq i} \cos(i, j)}{\sum_{i \in L} \sum_{j \in L,\, j \neq i} 1}$$
Question: cosine similarity of what? Are $i$ and $j$ raw item feature vectors, or some learned encoding of the news items? Unclear.
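A small sketch of the metric as I understand it, assuming each item is represented by some feature vector (which representation the paper actually uses is exactly the open question above):

```python
import numpy as np

def intra_list_similarity(item_vectors):
    """Average pairwise cosine similarity of the items in one recommended list.
    item_vectors: array of shape (n_items, n_features); lower ILS = more diverse."""
    v = np.asarray(item_vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # unit-normalize each row
    sims = v @ v.T                                     # pairwise cosine similarities
    n = len(v)
    off_diagonal = sims.sum() - np.trace(sims)         # drop the self-similarities
    return off_diagonal / (n * (n - 1))
```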
Sadly, they don't use their double Q network in conjunction with an upper-confidence-bound exploration approach, but they do test $\epsilon$-greedy.
From the results, it looks like a simple UCB approach could be a performant baseline.
The assumption is that better recommendations influence whether users want to use the application again. They use equations from survival analysis to model user return.
In survival analysis, the hazard function $\lambda(t)$ represents the instantaneous rate of occurrence of the event at time $t$, given that it has not occurred before $t$:

$$\lambda(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt}$$

The cumulative probability that the event has occurred by time $t$ is denoted $F(t) = \Pr(T \le t)$. Let us consider a small change $dt$ in $t$:

$$F(t + dt) - F(t) = \Pr(t \le T < t + dt) = \Pr(t \le T < t + dt \mid T \ge t)\,\Pr(T \ge t)$$

Thus:

$$F(t + dt) - F(t) \approx \lambda(t)\, dt\, \Pr(T \ge t)$$

Finally:

$$\frac{dF(t)}{dt} = \lambda(t)\, \Pr(T \ge t)$$

We denote $S(t)$ the survival function, which is the probability that the event has not occurred by time $t$, i.e., the probability that the system survives up to time $t$. We'll relate it to the hazard function.

Then the survival function is $S(t) = \Pr(T > t) = 1 - F(t)$, and $\Pr(T \ge t) = S(t)$ for a continuous lifetime $T$.

Differentiating both sides with respect to $t$:

$$\frac{dS(t)}{dt} = -\frac{dF(t)}{dt}$$

Since $\frac{dF(t)}{dt} = \lambda(t)\, \Pr(T \ge t)$, we also have: $\frac{dF(t)}{dt} = \lambda(t)\, S(t)$. Putting both together yields the first-order differential equation:

$$\frac{dS(t)}{dt} = -\lambda(t)\, S(t)$$

Integrating both sides from $0$ to $t$ and assuming $S(0) = 1$ (the system has survived at $t = 0$):

$$S(t) = \exp\left(-\int_0^t \lambda(u)\, du\right)$$

The expected lifespan is: $\mathbb{E}[T] = \int_0^\infty S(t)\, dt$.
In the paper, they assume $\lambda(t)$ to be a constant $\lambda_0$. The user-activeness dynamics then follow a simple logic: activeness decays over time according to the survival function, and it is bumped back up each time the user returns to the app.
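Under this constant-hazard assumption the formulas above close nicely (a quick check, not reproduced from the paper):

$$S(t) = \exp\left(-\int_0^t \lambda_0\, du\right) = e^{-\lambda_0 t}, \qquad \mathbb{E}[T] = \int_0^\infty e^{-\lambda_0 t}\, dt = \frac{1}{\lambda_0}$$

so the expected return time is simply $1/\lambda_0$.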
The click / no-click reward and the user-activeness reward are combined linearly: $r = r_{\text{click}} + \beta\, r_{\text{active}}$. In the experiments, $\beta$ is set quite low, which makes us wonder why they went through all the struggle.
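A toy sketch of how such an activeness signal and the combined reward might be maintained under the constant-hazard model; the initial value, the size of the bump on return, the cap at 1, and the name `beta` are placeholders, not the paper's constants:

```python
import math

class UserActiveness:
    """Survival-style activeness: decays as exp(-lambda0 * dt) between events,
    gets bumped up each time the user returns, and is capped at 1."""

    def __init__(self, lambda0, s0=0.5, bump=0.3):
        self.lambda0, self.value, self.bump = lambda0, s0, bump
        self.last_t = 0.0

    def observe(self, t, returned):
        self.value *= math.exp(-self.lambda0 * (t - self.last_t))  # decay since last event
        if returned:
            self.value = min(1.0, self.value + self.bump)          # bump on user return
        self.last_t = t
        return self.value

def total_reward(r_click, r_active, beta):
    # linear combination of the click reward and the activeness reward
    return r_click + beta * r_active
```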
It's also worth noting that they downsample the click / no-click ratio to approximately 1:11 for "better model fitting purposes".