Multi-Gate Mixture of Experts

Google, 2018

Multi-task learning: Recommendation systems often need to optimize multiple objectives at the same time, e.g. click-through rate, watch time, and explicit feedback (ratings).

However, in practice, multi-task models often underperform single-task models on their respective sub-tasks because they fail to model relationships among tasks (some of which can conflict).

MMoE explicitly models task relationships and learns task-specific functions that leverage shared representations, allocating existing parameters across tasks instead of adding many new parameters per task.

MMoE model structure

// include figure 1.c

Given $K$ tasks, the model consists of a shared-bottom network $f$ ($n$ multi-layer perceptrons) that follows the input layer, and $K$ tower networks $h^k$, where $k = 1, \dots, K$.

The output of the shared bottom $f$ is the output of $n$ expert networks, each implemented as a single layer ($n = 8$ in the paper).

Each tower network (also a single layer) is associated with its own gate $g^k$ that determines how to weigh the expert networks: $g^k(x) = \text{softmax}(W_{g,k}\, x)$, where $W_{g,k} \in \mathbb{R}^{n \times d}$ and $d$ is the input dimension.
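A minimal sketch of one gating network in plain Python (the weight matrix `W` and input `x` below are illustrative toy values, not from the paper):

```python
import math

def softmax(z):
    # numerically stable softmax over a list of logits
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def gate(W, x):
    # W: n x d matrix (one row per expert), x: input vector of length d
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return softmax(logits)

# toy example: n=3 experts, d=2 input features
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [2.0, 0.0]
weights = gate(W, x)  # one weight per expert, summing to 1
```

Each task gets its own `W`, so different tasks can put their mass on different experts while the experts themselves stay shared.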

The re-weighted shared bottom output for tower network kk is then:

$f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, f_i(x)$, where $f_i(x)$ is the output of expert $i$.

The output of task $k$ is then: $y_k = h^k(f^k(x))$.
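Putting the pieces together, a toy end-to-end forward pass (a sketch under simplifying assumptions: linear single-layer experts with ReLU, scalar linear towers; all weights below are illustrative, not the paper's):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, u) for u in v]

def mmoe_forward(x, experts, gates, towers):
    # experts: n weight matrices (h x d); gates: K matrices (n x d);
    # towers: K weight vectors (length h), each producing a scalar y_k
    f = [relu(matvec(We, x)) for We in experts]       # f_i(x) for each expert
    ys = []
    for Wg, wt in zip(gates, towers):
        g = softmax(matvec(Wg, x))                    # g^k(x): weights over experts
        # f^k(x) = sum_i g^k(x)_i * f_i(x)
        fk = [sum(g[i] * f[i][j] for i in range(len(f)))
              for j in range(len(f[0]))]
        ys.append(sum(w * v for w, v in zip(wt, fk))) # y_k = h^k(f^k(x))
    return ys

# toy shapes: d=2 inputs, n=2 experts with h=2 units, K=2 tasks
experts = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
gates = [[[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.2], [0.3, 0.1]]]
towers = [[1.0, 1.0], [0.5, -0.5]]
outputs = mmoe_forward([1.0, 2.0], experts, gates, towers)
```

Note the key property: all $K$ tasks read from the same expert outputs `f`, and only the cheap per-task gate and tower parameters are duplicated.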

Each gating network lets its task weight the shared experts differently, so the model learns task-specific information without duplicating the experts.

Question: loss function?
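The notes leave the loss unspecified. A common multi-task setup (an assumption here, not confirmed by these notes) is a weighted sum of per-task losses:

```python
def multitask_loss(per_task_losses, weights=None):
    # weighted sum of per-task losses; uniform weights by default
    # (an assumption -- the loss is left as an open question above)
    if weights is None:
        weights = [1.0] * len(per_task_losses)
    return sum(w * l for w, l in zip(weights, per_task_losses))
```

The task weights then become hyperparameters that trade off the objectives (e.g. CTR vs. watch time).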