Modular speech translation

5 October 2023

Computation and Language

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot


Uses modular encoders and decoders for speech and text

Enables zero-shot cross-modal speech translation

Trains modules to fit a shared embedding space

Shows gains from multilingual training

Outperforms supervised approaches on some languages


This paper shows that independently trained speech and text modules can be combined to perform competitive zero-shot cross-modal speech translation. The key ideas are: 1) a shared fixed-size sentence embedding space, 2) encoders and decoders trained separately to fit that space, and 3) cross-lingual transfer through multilingual training. On some languages, the method even outperforms supervised approaches.
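The modular idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three-dimensional embedding space, the lookup-table text encoder, the linear "speech encoder" fitted with a squared-error objective, and the nearest-neighbour "decoder" are stand-ins for real neural modules. What it aims to show is the composition pattern: the speech encoder is trained only to match a frozen text embedding space, the text decoder only ever sees text embeddings, and the two are combined at inference without having been trained together.

```python
DIM = 3  # toy size of the shared, fixed-size sentence embedding (assumption)

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Frozen "teacher" text encoder: it defines the shared embedding space.
# Here it is a plain lookup table standing in for a real sentence encoder.
SENTENCES = {
    "hello": [1.0, 0.0, 0.0],
    "thank you": [0.0, 1.0, 0.0],
    "goodbye": [0.0, 0.0, 1.0],
}

def text_encoder(sentence):
    return SENTENCES[sentence]

# Hypothetical speech features: a fixed distortion of the sentence meaning,
# standing in for audio features of the spoken sentence.
DISTORT = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]]

def speech_features(sentence):
    return matvec(DISTORT, SENTENCES[sentence])

def train_speech_encoder(pairs, epochs=2000, lr=0.02):
    """Fit a linear student W so that W @ speech_features is close to the
    teacher embedding, using per-example gradient steps on squared error."""
    w = [[0.0] * DIM for _ in range(DIM)]
    for _ in range(epochs):
        for x, target in pairs:
            pred = matvec(w, x)
            for i in range(DIM):
                err = pred[i] - target[i]
                for j in range(DIM):
                    w[i][j] -= lr * err * x[j]
    return w

def make_text_decoder(translations):
    """Toy target-language decoder built from text embeddings only:
    it returns the stored translation whose embedding is nearest."""
    memory = [(text_encoder(src), tgt) for src, tgt in translations.items()]

    def decode(embedding):
        def sq_dist(stored):
            return sum((a - b) ** 2 for a, b in zip(stored, embedding))
        return min(memory, key=lambda item: sq_dist(item[0]))[1]

    return decode

# Module 1: speech encoder trained to fit the shared space.
W = train_speech_encoder(
    [(speech_features(s), text_encoder(s)) for s in SENTENCES]
)

def speech_encoder(features):
    return matvec(W, features)

# Module 2: French text decoder, which never sees speech.
FRENCH = {"hello": "bonjour", "thank you": "merci", "goodbye": "au revoir"}
decode_fr = make_text_decoder(FRENCH)

# Composition at inference: speech in, French text out.
def translate_speech(sentence):
    return decode_fr(speech_encoder(speech_features(sentence)))

print(translate_speech("thank you"))
```

Because both modules agree only on the embedding space, adding a new input language or modality in this sketch would mean training one more encoder against the same frozen space, with the existing decoders left untouched.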

