
Mixture-of-Experts Language Model

Paper Title:

Mixtral of Experts

Published on:

8 January 2024

Primary Category:

Machine Learning

Paper Authors:

Albert Q. Jiang,

Alexandre Sablayrolles,

Antoine Roux,

Arthur Mensch,

Blanche Savary,

Chris Bamford,

Devendra Singh Chaplot,

Diego de las Casas,

Emma Bou Hanna,

Florian Bressand,

Gianna Lengyel,

Guillaume Bour,

Guillaume Lample,

Lélio Renard Lavaud,

Lucile Saulnier,

Marie-Anne Lachaux,

Pierre Stock,

Sandeep Subramanian,

Sophia Yang,

Szymon Antoniak,

Teven Le Scao,

Théophile Gervet,

Thibaut Lavril,

Thomas Wang,

Timothée Lacroix,

William El Sayed


Key Details

Mixtral is a 47B parameter sparse mixture-of-experts model, with 8 experts per layer

A router assigns 2 of the 8 experts to each token, so only 13B parameters are active per token

Matches or outperforms Llama 2 70B and GPT-3.5 across most benchmarks

Substantially stronger than Llama 2 70B on mathematics, code generation, and multilingual tasks

An instruct-finetuned version surpasses comparable chat models on human evaluation benchmarks

AI-generated summary

The paper introduces Mixtral 8x7B, a sparse mixture-of-experts transformer language model. At every layer, a router network selects 2 of 8 experts to process each token, so while the model holds 47B parameters in total, only 13B are active per token. Mixtral matches or outperforms Llama 2 70B and GPT-3.5 on most benchmarks, with notably stronger mathematics and code performance. An instruct-finetuned version also surpasses comparable chat models on human evaluations.
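The top-2 routing step described above can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: the "experts" here are plain linear maps for illustration, and the router is a single linear layer followed by a softmax over the selected experts' scores, mirroring the described gating.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_layer(token, gate_w, experts, k=2):
    """Sparse mixture-of-experts step for one token.

    A linear router scores all experts, keeps the top-k, renormalizes
    their scores with a softmax, and returns the weighted sum of the
    selected experts' outputs. Only k expert networks run per token,
    which is why active parameters are far fewer than total parameters.
    """
    logits = gate_w @ token                  # one score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best-scoring experts
    weights = softmax(logits[top_k])         # renormalize over the selected experts
    return sum(w * experts[i](token) for w, i in zip(weights, top_k))

# Toy setup: 8 experts, 2 active per token, as in Mixtral.
rng = np.random.default_rng(0)
d = 16
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(8)]
gate_w = rng.normal(size=(8, d))
out = moe_layer(rng.normal(size=d), gate_w, experts, k=2)
print(out.shape)  # (16,)
```

In the real model each expert is a full feed-forward block and the routing runs per token per layer, so different tokens (and the same token at different layers) can be served by different experts.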
