Benchmarking language models on software development

Published on:

13 March 2024

Primary Category:

Computation and Language

Paper Authors:

Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen


Key Details

DevBench tests language models on software design, environment setup, coding, acceptance testing, and unit testing

It uses 22 real-world repositories in Python, C/C++, Java, and JavaScript, with verified test cases

Models evaluated include GPT-3.5, GPT-4, CodeLlama, and DeepSeek Coder

Current models fail most tasks, struggling with complex code structures and builds

Error analysis points to model weaknesses with Makefiles, function arguments, and advanced programming concepts

AI-generated summary

The paper introduces DevBench, a benchmark that evaluates language models across the key stages of software development. Its tasks cover software design, environment setup, coding, and testing on real-world repositories in several programming languages. Experiments show that current models struggle with DevBench, particularly in understanding complex code structures and managing builds.
