8 February 2024
Shows global convergence of normalized GD towards max-margin token selection
Derives the first finite-time convergence rates for the key-query matrix
Demonstrates accelerated convergence with normalized GD and the Polyak step-size
Extends the analysis to joint training of the attention and prediction layers
Implicit bias towards max-margin token selection in self-attention
This paper studies the implicit bias of gradient descent when training a self-attention layer for binary classification. It shows that the key-query matrix converges globally in direction to a max-margin separator that selects the optimal tokens, whereas prior work established only local convergence. Explicit finite-time convergence rates are derived, with normalized GD and the Polyak step-size yielding accelerated convergence over standard gradient descent. The analysis is also extended to joint training of the attention and prediction layers.
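To make the two update rules concrete, here is a minimal sketch (not the authors' code) of normalized GD and the Polyak step-size applied to a toy single-head attention model with a fixed linear prediction head. The model form, the data distribution, the use of the mean token as the query, and the choice of optimal loss value L* = 0 are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: normalized GD with Polyak step-size on a toy attention model.
# All names (W, u, loss_fn) and modeling choices here are illustrative assumptions.
import torch

torch.manual_seed(0)
n, T, d = 32, 6, 8                         # samples, tokens per sequence, embedding dim
X = torch.randn(n, T, d)                   # token sequences
y = torch.sign(torch.randn(n))             # binary labels in {-1, +1}
u = torch.randn(d) / d**0.5                # fixed linear prediction head
W = torch.zeros(d, d, requires_grad=True)  # trainable key-query matrix

def loss_fn(W):
    # Attention scores through the combined key-query matrix W (mean token as query),
    # softmax over tokens, then the linear head on the attention-weighted average.
    query = X.mean(dim=1).unsqueeze(-1)                    # (n, d, 1)
    attn = torch.softmax(X @ W @ query, dim=1)             # (n, T, 1)
    out = (attn.transpose(1, 2) @ X).squeeze(1) @ u        # (n,)
    return torch.nn.functional.softplus(-y * out).mean()   # logistic loss

for step in range(500):
    loss = loss_fn(W)
    grad, = torch.autograd.grad(loss, W)
    gnorm = grad.norm().item()
    # Polyak step-size eta_t = (L(W_t) - L*) / ||grad||^2, assuming L* = 0 here;
    # plain normalized GD would instead use a constant eta divided by ||grad||.
    eta = loss.item() / (gnorm**2 + 1e-12)
    with torch.no_grad():
        W -= eta * grad
    if step % 100 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}  ||grad|| {gnorm:.4f}")
```

Under this sketch, the loss decays towards zero while ||W|| grows, so convergence to the max-margin direction would be checked on the normalized iterate W / ||W||.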