
Parallel decoding for faster language model inference

Published on: 18 April 2024

Primary Category: Computation and Language

Paper Authors: Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao


Key Details

- Proposes hidden transfer, which predicts pseudo hidden states for future tokens
- Transfers intermediate hidden states into pseudo states that are refined over subsequent layers
- Uses tree attention to verify candidate sequences, keeping generation lossless
- Outperforms existing single-model acceleration techniques
- Reports analytical experiments demonstrating the method's effectiveness

AI generated summary


This paper proposes a parallel decoding method called hidden transfer to accelerate inference for large language models. It transfers intermediate hidden states into pseudo hidden states for future tokens, allowing the model to predict multiple tokens in a single forward pass. Tree attention then verifies the candidate sequences, ensuring generation remains lossless. Experiments show that the pseudo hidden states absorb semantic information from the context, and that the method achieves better predictive accuracy and acceleration than existing techniques.
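The summary above can be made concrete with a toy sketch of the draft-then-verify loop that such lossless acceleration methods rely on. The paper's drafter is built from pseudo hidden states produced by hidden transfer, and verification uses tree attention over a candidate tree; in this simplified stand-in (all names and the toy next-token rule are hypothetical), a drafter guesses several future tokens and a verifier accepts only the longest prefix that matches plain greedy decoding, which is what makes the output lossless.

```python
def base_model_next(prefix):
    """Stand-in for the LLM's greedy next-token rule (hypothetical toy rule)."""
    return (sum(prefix) + len(prefix)) % 7

def draft(prefix, k):
    """Stand-in drafter: guesses k future tokens; deliberately wrong on odd steps
    to mimic an imperfect predictor (the paper drafts from pseudo hidden states)."""
    guesses, p = [], list(prefix)
    for i in range(k):
        g = base_model_next(p) if i % 2 == 0 else 0
        guesses.append(g)
        p.append(g)
    return guesses

def parallel_decode(prompt, n_tokens, k=4):
    """Generate n_tokens; each round drafts k candidates, then verifies them.

    Verification accepts the longest prefix agreeing with greedy decoding and
    always appends one verified token, so the result matches token-by-token
    greedy decoding exactly (lossless), just in fewer rounds.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        candidates = draft(out, k)
        p = list(out)
        for g in candidates:
            if g != base_model_next(p):  # first mismatch: stop accepting
                break
            p.append(g)
        p.append(base_model_next(p))  # verifier's own token guarantees progress
        out = p
    return out[len(prompt):len(prompt) + n_tokens]
```

Because every accepted token is checked against the base rule, the speedup comes only from accepting multiple correct draft tokens per round, never from changing the output distribution.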

