Evaluating LLMs on Real-World Tasks

Published on: 9 November 2023

Primary Category: Computation and Language

Paper Authors: Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu

Key Details

- Proposes a hierarchical task tree with 800+ real-world tasks to evaluate LLMs comprehensively (see the sketch after this list)
- Designs detailed standards and processes for consistent human evaluation
- Releases a test set of 3,000+ instances across knowledge domains and difficulty levels
- Enables standardized assessment of human alignment for English and Chinese LLMs
- Analyzes the feasibility of partially automating evaluation with a strong LLM such as GPT-4
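
To make the hierarchical task tree concrete, below is a minimal sketch of how such a hierarchy might be represented in Python. The TaskNode class, its field names, and the sample categories are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """Hypothetical node in a hierarchical task tree: internal nodes are
    task categories, leaves are concrete tasks holding test instances."""
    name: str
    children: list["TaskNode"] = field(default_factory=list)
    instances: list[str] = field(default_factory=list)  # test prompts (leaves only)

    def leaf_tasks(self):
        """Yield every leaf task, enabling per-task score breakdowns."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaf_tasks()

# Illustrative categories only; the paper's tree covers 800+ tasks.
root = TaskNode("root", children=[
    TaskNode("reasoning", children=[
        TaskNode("math word problems", instances=["A train leaves station A..."]),
        TaskNode("logical deduction", instances=["If all X are Y, and..."]),
    ]),
    TaskNode("writing", children=[
        TaskNode("summarization", instances=["Summarize the following article: ..."]),
    ]),
])

print(sum(1 for _ in root.leaf_tasks()))  # prints 3
```

A tree like this makes it easy to report scores at any level of granularity, from a single task up to a whole capability branch.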

AI-Generated Summary

This paper proposes a comprehensive framework for evaluating how well large language models follow human instructions across diverse real-world tasks. A hierarchical task tree with over 800 tasks is constructed to assess models in depth, and detailed evaluation standards and processes help human evaluators make consistent judgments. A released test set of 3,000+ instances spans a range of difficulty levels and knowledge domains. The methodology enables standardized assessment of human alignment for English and Chinese LLMs, and an analysis shows that evaluation can be partially automated with a strong LLM such as GPT-4. Overall, the framework supports thorough benchmarking of safe, human-aligned LLMs for real-world applications.
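
As a rough illustration of the partial-automation idea, the sketch below uses GPT-4 to grade a model's answer against a reference. The prompt wording, the 1-5 scoring rubric, and the use of the official OpenAI Python client are assumptions for illustration, not the paper's exact protocol.

```python
from openai import OpenAI  # official OpenAI Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical grading rubric; the paper's evaluation standards differ in detail.
JUDGE_PROMPT = """You are grading a model's answer to an instruction.
Instruction: {instruction}
Reference answer: {reference}
Model answer: {answer}
Rate the model answer from 1 (poor) to 5 (excellent) for helpfulness and
correctness. Reply with the number only."""

def judge(instruction: str, reference: str, answer: str) -> int:
    """Ask GPT-4 to score one test instance; returns an integer from 1 to 5."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instruction=instruction, reference=reference, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())
```

In practice, such automated scores would be spot-checked against human judgments before replacing them, which is the feasibility question the paper analyzes.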
