AgentBench: Evaluating LLMs as Agents
7 August 2023
Proposes AGENTBENCH, a new benchmark with 8 interactive environments to test language models as agents
Evaluates 27 API-based and open-source language models with a custom evaluation toolkit
Finds that top commercial models show promise as agents, while open-source models lag behind
Identifies reasoning, planning, and instruction following as key areas for improvement
Evaluating language models as agents
This paper introduces a benchmark for systematically evaluating the ability of language models to act as agents in interactive environments. It tests models across 8 distinct tasks grounded in real-world scenarios such as operating systems, games, and web browsing.
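To make the evaluation setup concrete, below is a minimal sketch of the kind of multi-turn agent-environment loop such a benchmark runs: the model receives an observation, emits an action, and gets feedback until the task succeeds or a turn limit is reached. All names here (`Environment`, `call_model`, `run_episode`) are hypothetical illustrations, not the AgentBench toolkit's actual API.

```python
# Minimal sketch of a multi-turn agent-environment evaluation loop.
# Hypothetical illustration only; not the AgentBench toolkit's real API.

from dataclasses import dataclass, field

@dataclass
class Environment:
    """Toy text environment: the agent must produce the target shell command."""
    target: str = "ls -la"
    max_turns: int = 5
    history: list = field(default_factory=list)

    def observe(self) -> str:
        # Initial task instruction shown to the model.
        return "Task: list all files, including hidden ones, in long format."

    def step(self, action: str) -> tuple[str, bool]:
        """Apply the agent's action; return (feedback, done)."""
        self.history.append(action)
        if action.strip() == self.target:
            return "Success.", True
        return "That command did not accomplish the task. Try again.", False

def call_model(prompt: str) -> str:
    """Stand-in for an LLM API call; a real harness would query the model here."""
    return "ls -la"

def run_episode(env: Environment) -> bool:
    """Observe, act, receive feedback, repeat until done or the turn limit."""
    prompt = env.observe()
    for _ in range(env.max_turns):
        action = call_model(prompt)
        feedback, done = env.step(action)
        if done:
            return True
        prompt = feedback  # feed environment feedback back to the model
    return False

if __name__ == "__main__":
    print("solved:", run_episode(Environment()))
```

A real harness would replace `call_model` with a call to the model under test and aggregate success rates per environment.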