Benchmarking language models on software development

Paper Title:

DevBench: A Comprehensive Benchmark for Software Development

Published on:

13 March 2024

Primary Category:

Computation and Language

Paper Authors:

Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen

Key Details

• DevBench tests language models on five tasks: software design, environment setup, coding, acceptance testing, and unit testing
• It uses 22 real-world repositories in Python, C/C++, Java, and JavaScript, with verified test cases
• Models evaluated include GPT-3.5, GPT-4, CodeLlama, and DeepSeek Coder
• Current models fail most tasks, struggling with complex code structures and build processes
• The analysis identifies weaknesses in handling Makefiles, function arguments, and advanced programming concepts

AI generated summary

The paper introduces DevBench, a benchmark that evaluates language models across the key stages of software development: software design, environment setup, coding, acceptance testing, and unit testing. Its tasks are drawn from 22 real-world repositories in Python, C/C++, Java, and JavaScript. Experiments show that current models, including GPT-4, fail most of these tasks, struggling in particular with complex code structures and build processes.