Mixture-of-Experts Language Model

Mixtral of Experts

8 January 2024

Machine Learning

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed


Mixtral is a sparse mixture-of-experts model with 47B total parameters and 8 experts per layer.

At each layer, a router selects 2 of the 8 experts to process each token, so only 13B parameters are active per token.

It matches or outperforms Llama 2 70B and GPT-3.5 on the evaluated benchmarks.

It is much stronger than Llama 2 70B on math, code, and multilingual tasks.

An instruction-finetuned version outperforms the compared models on human-evaluation benchmarks.

Mixture-of-Experts Language Model

The paper introduces Mixtral 8x7B, a sparse mixture-of-experts transformer language model. At every layer, a router network assigns each token to 2 of the 8 experts and combines their outputs. As a result, the model has 47B parameters in total, but only 13B are active for any given token. Mixtral matches or outperforms Llama 2 70B and GPT-3.5 on the evaluated benchmarks, with much better performance on math and code. An instruction-finetuned version also outperforms the compared models on human evaluations.
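The routing step described above can be illustrated with a minimal sketch. It assumes the router produces one score (logit) per expert, keeps the two highest-scoring experts, normalizes their scores with a softmax, and returns the weighted sum of the two expert outputs. The names `top2_route`, `moe_layer`, and `make_expert` are hypothetical, and the toy experts simply scale their input, whereas the real experts are feed-forward blocks; this is not the authors' implementation.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for the top-2 experts."""
    # Indices of the two largest router logits.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top2 = order[:2]
    # Softmax restricted to the two selected logits, so the weights sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(token, router_logits, experts):
    """Combine the outputs of the two selected experts, weighted by the router."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(router_logits):
        expert_out = experts[idx](token)  # only 2 of the 8 experts are evaluated
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

def make_expert(scale):
    """Toy stand-in for an expert: scales the token vector (assumption for illustration)."""
    return lambda vec: [scale * v for v in vec]

experts = [make_expert(k + 1) for k in range(8)]            # 8 experts per layer
token = [1.0, -2.0]                                          # toy token representation
router_logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]   # made-up router scores

routes = top2_route(router_logits)
output = moe_layer(token, router_logits, experts)
print(routes)
print(output)
```

Because the six unselected experts are never called, the compute per token depends on the size of two experts rather than all eight, which is the intuition behind the gap between total and active parameters.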

