Implicit bias towards max-margin token selection in self-attention

Published on: 8 February 2024

Primary Category: Machine Learning

Paper Authors: Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis

Key Details

Shows global convergence towards max-margin token selection with normalized GD (a sketch of the corresponding max-margin program follows this list)

Derives first finite-time convergence rates for key-query matrices

Demonstrates accelerated convergence with normalized GD and the Polyak step-size

Extends analysis to joint training of attention and prediction layers
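
For concreteness, the hard-margin program below sketches the kind of max-margin token-selection solution these bullets refer to. The notation (tokens x_{i,t}, query z_i, optimal-token index opt_i) is an illustrative assumption in the style of the attention-as-SVM literature, not copied verbatim from the paper.

    W^{\mathrm{mm}} = \arg\min_{W} \|W\|_F
    \quad \text{subject to} \quad
    (x_{i,\mathrm{opt}_i} - x_{i,t})^\top W z_i \ge 1
    \quad \text{for all } t \neq \mathrm{opt}_i \text{ and all inputs } i

Here x_{i,t} is the t-th token of input i, z_i its query token, and opt_i indexes the token whose selection minimizes the training loss. Roughly, the implicit-bias statement is that the key-query matrix trained by (normalized) GD aligns, after normalization, with a separator of this type, which is what the global-convergence bullet refers to.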

AI-generated summary

This paper studies the implicit bias of gradient descent when training a self-attention layer for binary classification. It shows that the key-query matrix converges globally to a max-margin separator that selects the optimal tokens, whereas prior work established only local convergence. Explicit finite-time convergence rates are derived, with normalized GD and the Polyak step-size yielding accelerated convergence. The analysis also extends to joint training of the attention and prediction layers.
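
Since normalized GD and the Polyak step-size are standard update rules, the toy sketch below may help make the training setup concrete: it fits the key-query matrix of a single softmax-attention layer on synthetic binary-classification data. All choices here (dimensions, using the first token as the query, a fixed prediction head v, the synthetic data) are illustrative assumptions, not the authors' experimental setup.

    import jax
    import jax.numpy as jnp

    # Toy problem sizes: T tokens per sequence, dimension d, n samples (assumed values).
    T, d, n = 6, 4, 32
    kx, ky, kv = jax.random.split(jax.random.PRNGKey(0), 3)
    X = jax.random.normal(kx, (n, T, d))            # token sequences
    y = jnp.sign(jax.random.normal(ky, (n,)))       # labels in {-1, +1}
    v = jax.random.normal(kv, (d,))                 # fixed linear prediction head

    def loss(W):
        # Attention scores of every token against the first (query) token of each sequence.
        scores = jnp.einsum('ntd,de,ne->nt', X, W, X[:, 0, :])
        attn = jax.nn.softmax(scores, axis=-1)       # softmax token-selection weights
        pooled = jnp.einsum('nt,ntd->nd', attn, X)   # attention-weighted token average
        logits = pooled @ v
        return jnp.mean(jnp.log1p(jnp.exp(-y * logits)))   # logistic loss

    value_and_grad = jax.value_and_grad(loss)
    W = jnp.zeros((d, d))
    eta, loss_star = 0.5, 0.0                        # loss_star = 0 assumes separable data

    for step in range(200):
        L, g = value_and_grad(W)
        gnorm = jnp.linalg.norm(g)                   # Frobenius norm of the gradient
        # Normalized GD step; using (L - loss_star) / gnorm in place of eta gives
        # the Polyak step-size variant credited with accelerated convergence.
        W = W - eta * g / (gnorm + 1e-12)

    print('final training loss:', float(loss(W)))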
