ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention

Abstract

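Recently, linear-complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling, and a 2D gating locality injection to adaptively inject 2D local details into the 1D global context. Our hardware-aware implementation merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2× faster on 224×224 images. At 1024×1024 resolution, ViG-T uses 5.2× fewer FLOPs, saves 90% GPU memory, runs 4.8× faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. These results position ViG as an efficient and scalable solution for visual representation learning. Code is available at https://github.com/hustvl/ViG.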
Problem addressed

Linear-complexity sequence modeling networks use fewer FLOPs and less memory than Vision Transformers, yet show no significant advantage in actual runtime speed. ViG adapts Gated Linear Attention (GLA) to vision to close this gap.
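Key idea

ViG brings GLA to vision to exploit its hardware-awareness and efficiency. It adds direction-wise gating, which captures 1D global context through bidirectional modeling, and a 2D gating locality injection, which adaptively injects 2D local details into that 1D global context. A hardware-aware implementation merges forward and backward scanning into a single kernel, improving parallelism and reducing memory cost and latency.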
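
To make the bidirectional gated recurrence more concrete, here is a minimal sketch in plain Python. It is not the ViG implementation: it shows the generic gated linear attention update S_t = diag(g_t) S_{t-1} + k_t^T v_t with readout o_t = q_t S_t, run once in each direction over the token sequence. The function names, the use of separate gate sequences per direction as a stand-in for direction-wise gating, and the choice to sum the two directions are all assumptions made for illustration. The 2D gating locality injection and the fused single-kernel scan are omitted.

```python
def gla_scan(q, k, v, gate, reverse=False):
    """Gated linear recurrence over a 1D token sequence (illustrative only).

    q, k, gate: lists of T vectors, each with d_k floats (gate values in [0, 1]).
    v: list of T vectors, each with d_v floats.
    Returns a list of T output vectors, each with d_v floats.
    """
    T, d_k, d_v = len(q), len(q[0]), len(v[0])
    # Fixed-size memory: its size does not grow with the sequence length T.
    state = [[0.0] * d_v for _ in range(d_k)]
    out = [None] * T
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory per key channel, then add the
                # outer product of the current key and value.
                state[i][j] = gate[t][i] * state[i][j] + k[t][i] * v[t][j]
        # Readout: o_t = q_t @ S_t
        out[t] = [sum(q[t][i] * state[i][j] for i in range(d_k))
                  for j in range(d_v)]
    return out


def bidirectional_gla(q, k, v, gate_fwd, gate_bwd):
    """Run the scan in both directions with separate gates and sum the
    results. Summation is an assumption, not taken from the source."""
    fwd = gla_scan(q, k, v, gate_fwd)
    bwd = gla_scan(q, k, v, gate_bwd, reverse=True)
    return [[a + b for a, b in zip(f_row, b_row)]
            for f_row, b_row in zip(fwd, bwd)]


# Toy usage: two tokens, one key channel, one value channel.
q = [[1.0], [1.0]]
k = [[1.0], [1.0]]
v = [[2.0], [3.0]]
ones = [[1.0], [1.0]]
result = bidirectional_gla(q, k, v, ones, ones)
```

Each step touches only the d_k × d_v state, so the total cost grows linearly with the number of tokens rather than quadratically as in softmax attention.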
Other highlights

The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2× faster on 224×224 images. At 1024×1024 resolution, ViG-T uses 5.2× fewer FLOPs, saves 90% GPU memory, runs 4.8× faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. Code is available at https://github.com/hustvl/ViG.
Related work

Related work includes Vision Transformers, such as the DeiT baselines compared against above, and linear-complexity sequence modeling networks.
