Published on:
13 March 2024
Primary Category:
Computation and Language
Paper Authors:
Bowen Li,
Wenhan Wu,
Ziwei Tang,
Lin Shi,
John Yang,
Jinyang Li,
Shunyu Yao,
Chen Qian,
Binyuan Hui,
Qicheng Zhang,
Zhiyin Yu,
He Du,
Ping Yang,
Dahua Lin,
Chao Peng,
Kai Chen
DevBench tests language models on software design, environment setup, coding, acceptance testing, and unit testing
It uses 22 real-world repositories in Python, C/C++, Java, and JavaScript, with verified test cases
Models evaluated include GPT-3.5, GPT-4, CodeLlama, and DeepSeek Coder
Current models fail most tasks, struggling with complex code structures and builds
Error analysis points to weaknesses in handling Makefiles, function arguments, and advanced programming concepts
Benchmarking language models on software development
The paper introduces DevBench, a new benchmark that evaluates language models' capabilities across the key stages of software development. Its tasks cover software design, environment setup, coding, and testing on real-world repositories in several programming languages. Experiments show that current models struggle with these tasks, particularly with understanding complex code structures and managing builds.