- Abstract: This paper addresses a key challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing images at resolutions different from those seen during training. Our work introduces two key innovations to resolve this issue. First, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration. Second, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, thereby preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution, while simultaneously reducing computational costs. ViTAR also shows strong performance in downstream tasks such as instance and semantic segmentation, and can easily be combined with self-supervised learning techniques such as Masked AutoEncoder. Our work provides a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing.
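A minimal sketch of the fuzzy positional encoding idea described above: during training, each token's grid coordinate is jittered by uniform noise so the model never associates a token with one exact position, while inference uses the exact grid. The function name `fuzzy_positions` and the jitter range are illustrative assumptions, not the paper's exact implementation.

```python
import random

def fuzzy_positions(grid_h, grid_w, training=True, jitter=0.5):
    """Return (row, col) coordinates for a grid_h x grid_w token grid.

    During training, each coordinate is perturbed by uniform noise in
    [-jitter, +jitter] (the "fuzzy" part), which discourages overfitting
    to any single training resolution. At inference, exact coordinates
    are used. Both the name and the jitter range are assumptions for
    illustration only.
    """
    coords = []
    for r in range(grid_h):
        for c in range(grid_w):
            if training:
                coords.append((r + random.uniform(-jitter, jitter),
                               c + random.uniform(-jitter, jitter)))
            else:
                coords.append((float(r), float(c)))
    return coords

train_coords = fuzzy_positions(4, 4, training=True)
eval_coords = fuzzy_positions(4, 4, training=False)
```

The jittered coordinates would then be fed to any coordinate-based positional embedding (e.g. interpolated from a learned table); only the noise injection is sketched here.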
- Problem: ViTs have constrained scalability across different image resolutions, leading to a performance decline when processing resolutions different from those seen during training. The paper aims to address this issue.
- Key idea: The paper proposes a novel module for dynamic resolution adjustment and introduces fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions. The resulting model, ViTAR, achieves impressive adaptability and reduces computational costs.
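The dynamic resolution adjustment can be illustrated by its simplest possible stand-in: pooling a variable-sized token grid down to one fixed target grid, so the backbone always processes the same number of tokens regardless of input resolution. The paper performs this integration with a Transformer block (cross-attention-style token merging); the sketch below replaces that with a plain per-cell average, and the function name `merge_tokens` is a hypothetical label.

```python
def merge_tokens(tokens, grid_h, grid_w, target_h, target_w):
    """Pool a (grid_h x grid_w) grid of token vectors (flattened row-major
    into `tokens`) down to a fixed (target_h x target_w) grid.

    Each target cell averages the source tokens that fall into it. This is
    a stand-in for the paper's Transformer-based token integration: only
    the grid bookkeeping is faithful, the averaging is an assumption.
    """
    dim = len(tokens[0])
    out = []
    for tr in range(target_h):
        for tc in range(target_w):
            # Source row/column ranges that map onto this target cell.
            r0, r1 = tr * grid_h // target_h, (tr + 1) * grid_h // target_h
            c0, c1 = tc * grid_w // target_w, (tc + 1) * grid_w // target_w
            cell = [tokens[r * grid_w + c]
                    for r in range(r0, r1) for c in range(c0, c1)]
            out.append([sum(v[d] for v in cell) / len(cell)
                        for d in range(dim)])
    return out

# 8x8 token grid with 1-d "features" equal to the row index,
# merged to a fixed 2x2 grid.
tokens = [[float(r)] for r in range(8) for _ in range(8)]
merged = merge_tokens(tokens, 8, 8, 2, 2)
```

Because the target grid size is fixed, an input at 1120x1120 or 4032x4032 yields the same token count after merging, which is what keeps the downstream cost bounded as resolution grows.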
- Other highlights: ViTAR achieves 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution, while reducing computational costs. ViTAR also performs well in downstream tasks such as instance and semantic segmentation, and can be combined with self-supervised learning techniques like Masked AutoEncoder.
- Related work in this field includes DeiT, CoaT, and PVTv2.