Accelerating GPT MoE Inference via Inter-Layer Expert Affinity

Published on:

16 January 2024

Primary Category:

Machine Learning

Paper Authors:

Jinghan Yao,

Quentin Anthony,

Aamir Shafi,

Hari Subramoni,

Dhabaleswar K.,



Key Details

Proposes ExFlow to accelerate GPT MoE inference via inter-layer expert affinity

Exploits affinity in tokens' expert routing across MoE layers

Formulates an integer program for optimal cross-GPU expert placement

Applies directly to pre-trained models without accuracy loss

Cuts cross-GPU latency by up to 67% and improves throughput by up to 2.2x

AI generated summary

This paper proposes ExFlow, a method for accelerating inference in Generative Pre-trained Transformer (GPT) models that use Mixture of Experts (MoE) layers. ExFlow exploits inter-layer expert affinity: the correlation between the experts a given token is routed to in different MoE layers. Placing experts with high affinity on the same GPU reduces the cross-GPU communication needed to route tokens. The placement is computed by integer programming and applies directly to pre-trained models without retraining. Experiments show up to 67% lower cross-GPU latency and up to 2.2x higher throughput than prior methods.
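The placement idea can be illustrated with a toy sketch. It is not the paper's formulation: the routing trace is invented, every function and variable name is hypothetical, and an exhaustive search stands in for the integer program, which is only practical at this tiny scale. It counts how often tokens move between each pair of experts in two consecutive MoE layers, then picks the balanced layer-1 placement that sends the fewest tokens across GPUs.

```python
from collections import Counter
from itertools import permutations


def cross_gpu_tokens(counts, place0, place1):
    """Count tokens whose layer-0 and layer-1 experts sit on different GPUs."""
    return sum(n for (e0, e1), n in counts.items() if place0[e0] != place1[e1])


def best_placement(counts, num_experts, num_gpus):
    """Exhaustively search balanced layer-1 placements (layer 0 stays fixed).

    Assumption: brute force replaces the integer program for illustration.
    """
    per_gpu = num_experts // num_gpus
    # Layer 0 uses the naive contiguous placement: experts 0..k-1 on GPU 0, etc.
    place0 = [e // per_gpu for e in range(num_experts)]
    best_cost, best_place1 = None, None
    for perm in permutations(range(num_experts)):
        # Expert perm[slot] occupies slot `slot`; slots are split evenly by GPU.
        place1 = [0] * num_experts
        for slot, expert in enumerate(perm):
            place1[expert] = slot // per_gpu
        cost = cross_gpu_tokens(counts, place0, place1)
        if best_cost is None or cost < best_cost:
            best_cost, best_place1 = cost, place1
    return place0, best_place1, best_cost


# Invented routing trace: one (layer-0 expert, layer-1 expert) pair per token.
routes = (
    [(0, 2)] * 5 + [(1, 3)] * 4 + [(2, 0)] * 3 + [(3, 1)] * 2
    + [(0, 0)] * 1 + [(2, 3)] * 1
)
counts = Counter(routes)

place0, place1, cost = best_placement(counts, num_experts=4, num_gpus=2)
naive_cost = cross_gpu_tokens(counts, place0, place0)

print("naive cross-GPU tokens:", naive_cost)    # 14 of 16 tokens
print("optimized cross-GPU tokens:", cost)      # 2 of 16 tokens
print("layer-1 expert -> GPU:", place1)         # [1, 1, 0, 0]
```

In this toy trace, using the same contiguous placement in both layers sends 14 of 16 tokens across GPUs, while the affinity-aware placement sends 2. Only the placement changes, not the routing decisions, which matches the summary's point that no retraining is needed.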
