multi-gate-mixture-of-experts

Google, 2018

Multi-task learning: recommendation systems often need to optimize multiple objectives at the same time, e.g. click-through rate, watch time, and explicit feedback (ratings).

However, in practice multi-task models often underperform single-task models on their respective sub-tasks because they fail to model the relationships among tasks (some of which can conflict).

MMoE explicitly models task relationships and learns task-specific functions on top of a shared representation, allocating shared parameters per task via gating rather than adding many new parameters for each task.

Shared-Bottom model structure

Given $K$ tasks, the model consists of a shared-bottom network $f$ (in MMoE, a group of $n$ expert networks) which follows the input layer, and $K$ tower networks $h^k$, $k=1,\dots,K$.

The output of the shared bottom $f$ is the concatenation of the outputs of the $n$ expert networks, each implemented as a single layer. The paper uses $n=8$.

Each tower network (also a single layer) is associated with its own gate $g^k$ that determines how to weigh each expert network: $g^k(x) = \text{softmax}(W_{g,k}\, x)$ where $W_{g,k}\in\mathbb{R}^{n\times d}$ and $d$ is the input dimension.

The re-weighted shared bottom output for tower network $k$ is then:

$f^k(x)=\sum_{i=1}^{n} g^k(x)_i\, f_i(x)$, where $f_i(x)$ is the output of the $i$-th expert.

The output of task $k$ is then: $y_k = h^k(f^k(x))$.

Each gating network helps the model learn task-specific information.
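A minimal NumPy sketch of the forward pass described above; all dimensions, weight shapes, and the random initialization are hypothetical (the paper only fixes $n=8$ experts), and training is omitted:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: input dim d, n experts, K tasks, expert output dim
d, n, K, e_dim = 16, 8, 2, 32
rng = np.random.default_rng(0)
x = rng.normal(size=d)  # a single input example

# Expert networks: each a single layer mapping d -> e_dim (ReLU assumed)
W_expert = rng.normal(size=(n, e_dim, d))
experts = np.maximum(0.0, W_expert @ x)          # shape (n, e_dim)

# One gate per task: g^k(x) = softmax(W_{g,k} x), with W_{g,k} in R^{n x d}
W_gate = rng.normal(size=(K, n, d))

outputs = []
for k in range(K):
    g = softmax(W_gate[k] @ x)                   # (n,) expert weights, sum to 1
    f_k = (g[:, None] * experts).sum(axis=0)     # f^k(x): weighted expert mix, (e_dim,)
    # Tower network h^k: a single layer e_dim -> 1 (scalar task output)
    w_tower = rng.normal(size=e_dim)
    outputs.append(w_tower @ f_k)                # y_k = h^k(f^k(x))

print(len(outputs))  # one scalar output per task
```

Note how the experts are computed once and shared: only the gate weights $W_{g,k}$ and the towers are task-specific, which is what keeps the per-task parameter count small.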

`Question`

: loss function?