This paper explores improving zero-shot generalization by routing each token in a language model to different specialized expert modules at each layer. Their method, PHATGOOSE (Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts), trains a gate for each expert module that determines which tokens should use that module, then routes tokens among the experts at inference time. Experiments find that PHATGOOSE outperforms prior routing methods and sometimes matches explicit multitask training.
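To make the routing mechanics concrete, here is a minimal sketch of per-token top-k gating among expert modules in the spirit of PHATGOOSE. This is not the paper's implementation: the class name `TokenRouter`, the cosine-similarity scoring, the softmax mixing, and all dimensions are illustrative assumptions, and the gate-training step (done with the base model and experts frozen) is omitted.

```python
# Illustrative sketch only -- names, scoring, and mixing are assumptions,
# not PHATGOOSE's reference implementation.
import torch
import torch.nn.functional as F


class TokenRouter(torch.nn.Module):
    """Per-token top-k routing among a set of expert modules."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        # One learned gate vector per expert module.
        self.gates = torch.nn.Parameter(torch.randn(num_experts, d_model))
        self.top_k = top_k

    def forward(self, x: torch.Tensor, experts) -> torch.Tensor:
        # x: (batch, seq, d_model); experts: list of callables (e.g. LoRA modules).
        # Score each token against each gate via cosine similarity.
        scores = F.normalize(x, dim=-1) @ F.normalize(self.gates, dim=-1).T
        weights, idx = scores.topk(self.top_k, dim=-1)       # (batch, seq, k)
        weights = weights.softmax(dim=-1)                    # mix the chosen experts
        # Run every expert, then keep only each token's top-k outputs.
        expert_outs = torch.stack([e(x) for e in experts], dim=-2)  # (b, s, E, d)
        index = idx.unsqueeze(-1).expand(*idx.shape, x.size(-1))    # (b, s, k, d)
        chosen = torch.gather(expert_outs, -2, index)               # (b, s, k, d)
        return (weights.unsqueeze(-1) * chosen).sum(dim=-2)         # (b, s, d)


# Toy usage, with linear layers standing in for trained expert modules.
experts = [torch.nn.Linear(16, 16) for _ in range(4)]
router = TokenRouter(d_model=16, num_experts=4, top_k=2)
out = router(torch.randn(2, 8, 16), experts)  # (2, 8, 16)
```

The sketch runs every expert on every token for simplicity; an efficient implementation would dispatch each token only to its selected experts.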