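- Abstract: Recently, linear-complexity sequence modeling networks have achieved modeling capability comparable to Vision Transformers on a range of computer vision tasks while using fewer FLOPs and less memory. However, their advantage in actual runtime speed is not significant. To address this, we introduce Gated Linear Attention (GLA) for vision, leveraging its hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling, and a 2D gating locality injection to adaptively inject 2D local details into the 1D global context. Our hardware-aware implementation merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2× faster on 224×224 images. At 1024×1024 resolution, ViG-T uses 5.2× fewer FLOPs, saves 90% GPU memory, runs 4.8× faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. These results position ViG as an efficient and scalable solution for visual representation learning. Code is available at https://github.com/hustvl/ViG.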
- Problem addressed: GLA for vision is proposed because linear-complexity sequence modeling networks, despite using fewer FLOPs and less memory than Vision Transformers, show no significant advantage in actual runtime speed.
- Key idea: Building on the hardware-awareness and efficiency of GLA, the model introduces direction-wise gating (bidirectional modeling of 1D global context) and 2D gating locality injection (adaptively injecting 2D local details into that context), and merges forward and backward scanning into a single kernel.
- Other highlights: The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2× faster on 224×224 images. At 1024×1024 resolution, ViG-T uses 5.2× fewer FLOPs, saves 90% GPU memory, runs 4.8× faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. Code is available at https://github.com/hustvl/ViG.
- Recent related work in this field includes Vision Transformers and linear-complexity sequence modeling networks.
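
To make the bidirectional gated scan mentioned in the summary more concrete, here is a minimal NumPy sketch of a generic gated linear attention recurrence, run once in each direction. It is not the ViG implementation: the function names `gla_scan` and `bidirectional_gla`, the per-direction gate arrays, and the plain sum used to combine the two directions are all assumptions made for illustration, and the sketch uses a Python loop rather than a fused hardware-aware kernel.

```python
import numpy as np

def gla_scan(q, k, v, alpha):
    """One-directional gated linear attention in recurrent form.

    q, k, alpha: arrays of shape (T, d_k); v: array of shape (T, d_v).
    alpha holds per-token, per-channel decay gates, typically in (0, 1).
    Returns an array of shape (T, d_v).
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))      # running state; its size does not depend on T
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay the state channel-wise, then add the current key-value outer product.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def bidirectional_gla(q, k, v, alpha_fwd, alpha_bwd):
    """Hypothetical combination: sum of a forward scan and a backward scan.

    Each direction has its own gates (assumed here to stand in for
    direction-wise gating). The current token contributes to both scans.
    """
    fwd = gla_scan(q, k, v, alpha_fwd)
    # Reverse the sequence, scan, then reverse the outputs back into place.
    bwd = gla_scan(q[::-1], k[::-1], v[::-1], alpha_bwd[::-1])[::-1]
    return fwd + bwd

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_k, d_v = 8, 4, 3
    q, k = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
    v = rng.normal(size=(T, d_v))
    gates_f = rng.uniform(0.5, 1.0, size=(T, d_k))
    gates_b = rng.uniform(0.5, 1.0, size=(T, d_k))
    print(bidirectional_gla(q, k, v, gates_f, gates_b).shape)
```

Because the state `S` has a fixed size of `d_k × d_v`, the cost of each scan grows linearly with the number of tokens `T`. This sketch runs the two directions as separate passes, whereas the summary describes merging them into a single kernel.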