Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
16 January 2024
Proposes ExFlow to accelerate GPT MoE inference via inter-layer expert affinity
Exploits affinity in tokens' expert routing across MoE layers
Designs integer programming for optimal cross-GPU expert placement
Applies directly to pre-trained models without accuracy loss
Cuts cross-GPU routing latency by up to 67% and improves inference throughput by up to 2.2x
Accelerating GPT MoE Inference via Inter-Layer Expert Affinity
This paper proposes ExFlow, a method to accelerate inference for Generative Pre-trained Transformer (GPT) models that use the Mixture-of-Experts (MoE) technique. ExFlow exploits the affinity between a token's routing decisions across consecutive MoE layers, optimizing expert placement across GPUs to greatly reduce communication overhead. The integer-programming formulation applies directly to pre-trained models, requiring no retraining and incurring no accuracy loss. Experiments show up to 67% lower cross-GPU routing latency and up to 2.2x higher throughput than prior methods.
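The summary describes two ingredients: measuring how often a token routed to one expert is next routed to a given expert in the following layer, and using that affinity to co-locate experts on the same GPU. Below is a minimal NumPy sketch of the idea. The routing-trace layout, the synthetic random data, `num_gpus`, and the even per-GPU capacity constraint are all assumptions for illustration, and a simple greedy heuristic stands in for the paper's actual integer-programming formulation, which is not detailed here.

```python
import numpy as np

# Hypothetical routing trace: the expert chosen for each token at each MoE layer.
# Shape: (num_tokens, num_layers); entries are expert indices in [0, num_experts).
rng = np.random.default_rng(0)
num_tokens, num_layers, num_experts, num_gpus = 10_000, 4, 8, 2
routes = rng.integers(0, num_experts, size=(num_tokens, num_layers))

# Inter-layer affinity: affinity[l][i, j] counts tokens routed to expert i
# at layer l and then to expert j at layer l + 1.
affinity = np.zeros((num_layers - 1, num_experts, num_experts))
for l in range(num_layers - 1):
    np.add.at(affinity[l], (routes[:, l], routes[:, l + 1]), 1)

# Greedy stand-in for the paper's integer program: place each layer-(l+1)
# expert on the GPU whose layer-l experts feed it the most tokens, subject
# to an even per-GPU expert capacity.
capacity = num_experts // num_gpus
placement = [np.repeat(np.arange(num_gpus), capacity)]  # layer 0: round-robin
for l in range(num_layers - 1):
    # traffic[j, g] = tokens flowing into expert j from experts placed on GPU g
    traffic = np.zeros((num_experts, num_gpus))
    for i in range(num_experts):
        traffic[:, placement[l][i]] += affinity[l][i]
    slots = [capacity] * num_gpus
    layer_place = np.full(num_experts, -1)
    # Assign experts in order of their strongest single-GPU pull.
    for j in np.argsort(-traffic.max(axis=1)):
        for g in np.argsort(-traffic[j]):
            if slots[g] > 0:
                layer_place[j] = g
                slots[g] -= 1
                break
    placement.append(layer_place)

# Fraction of layer-to-layer token hops that stay on one GPU, i.e. hops
# that need no cross-GPU communication under this placement.
local = sum(
    (placement[l][routes[:, l]] == placement[l + 1][routes[:, l + 1]]).mean()
    for l in range(num_layers - 1)
) / (num_layers - 1)
print(f"intra-GPU hop fraction: {local:.2f}")
```

On a uniform random trace like this one the intra-GPU fraction hovers near 1/num_gpus; the paper's gains come from real traces, where routing is skewed and affinity-aware placement keeps far more hops local.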