Evaluating language models as conversational agents

Published on:

7 August 2023

Primary Category:

Artificial Intelligence

Paper Authors:

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang


Key Details

Proposes AgentBench, a new benchmark with 8 interactive environments to test language models as agents

Evaluates 27 API-based and open-source language models with a custom evaluation toolkit

Finds top commercial models show promise as agents but open-source models lag behind

Identifies reasoning, planning, and instruction following as key areas for improvement

AI-generated summary

This paper introduces a benchmark that systematically evaluates how well language models act as conversational agents in interactive environments. It tests models across 8 distinct tasks grounded in real-world scenarios such as operating systems, games, and web browsing.
