Self-emerging token labeling for vision transformers

8 January 2024

Computer Vision and Pattern Recognition

Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez


Proposes a self-emerging token labeling framework built on a vision transformer token labeler

The token labeler is trained to produce semantic labels for patch tokens

Student models are trained using both the self-emerging token labels and the original labels

Achieves state-of-the-art accuracy and robustness on ImageNet

Also improves robustness in downstream tasks

This paper proposes a self-emerging token labeling framework to improve the pre-training of vision transformers. The framework has two stages: first, a vision transformer token labeler is trained to generate semantic token labels; second, a student model is trained using both the original labels and the self-emerging token labels. The best model achieves state-of-the-art accuracy on ImageNet benchmarks and state-of-the-art robustness against out-of-distribution data, significantly outperforming prior counterparts. The robustness gains also carry over to downstream tasks.
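The student objective in the second stage can be sketched as a sum of two cross-entropy terms. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the original labels are image-level class indices, that the token labeler's per-patch outputs are turned into soft labels with a softmax, and that the two terms are combined with a weight `alpha`. The function names and `alpha` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_probs):
    """Cross-entropy between softmax(logits) and a target distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def token_labels_from_labeler(labeler_patch_logits):
    """Stage 1 output (assumed form): one soft label per patch token."""
    return [softmax(logits) for logits in labeler_patch_logits]


def student_loss(cls_logits, patch_logits, image_label, token_labels, alpha=1.0):
    """Stage 2 (assumed form): image-level loss plus weighted token-level loss.

    cls_logits:   student logits for the whole image
    patch_logits: student logits for each patch token
    image_label:  original class index for the image
    token_labels: soft labels per patch, produced by the token labeler
    alpha:        hypothetical weight on the token-level term
    """
    image_term = cross_entropy(cls_logits, one_hot(image_label, len(cls_logits)))
    token_term = sum(
        cross_entropy(p, t) for p, t in zip(patch_logits, token_labels)
    ) / len(patch_logits)
    return image_term + alpha * token_term
```

With two classes and two patches, a call might look like `student_loss([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], 0, token_labels_from_labeler([[2.0, 0.0], [0.0, 2.0]]))`. Setting `alpha=0.0` reduces the objective to ordinary image-level training, which makes the contribution of the token labels easy to isolate in this sketch.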

