
Language models struggle with web task combinations

Published on:

30 November 2023

Primary Category:

Machine Learning

Paper Authors:

Hiroki Furuta,

Yutaka Matsuo,

Aleksandra Faust,

Izzeddin Gur


Key Details

Prompted models like GPT-3.5 achieve 94% success on base web tasks

But they drop to 25% success on compositional web tasks

Transferred models finetuned on base tasks degrade less, from 85% to 55%

Rebalancing the training data helps a new model, HTML-T5++, reach 95% success on base tasks and 62% on compositional tasks

Models also struggle when instruction order is reversed

AI generated summary


This paper introduces a new benchmark called CompWoB to test language model agents on compositional web automation tasks. It finds that while prompted models like GPT-3.5 achieve 94% success on base tasks, they drop to 25% on compositional tasks. In contrast, transferred models finetuned on base tasks degrade less, from 85% to 55%. By rebalancing the training data, a new model called HTML-T5++ reaches 95% success on base tasks and 62% on compositional tasks. Models also struggle when the order of instructions is reversed. So while language models show promise as web agents, the paper emphasizes the need for greater robustness to task composition before real-world use.
