Full Stack Optimization of Transformer Inference: a Survey
Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, et al.
UC Berkeley & NVIDIA
Highlights:
Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements, compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference.
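For context on point (ii), the following is a minimal NumPy sketch (our illustration, not code from the paper; the function names `softmax`, `layer_norm`, and `gelu` are ours) of the three non-linear operations the survey highlights. Unlike the linear matrix-multiply operations, Softmax and Layer Normalization require full row-wise reductions (max/sum, mean/variance) before any output element can be produced, which is part of what makes them awkward for accelerators built around GEMM datapaths.

```python
# Minimal NumPy reference implementations of the non-linear Transformer
# operations discussed in the survey. Illustrative only, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating;
    # note the two reductions (max, sum) over the whole row.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the hidden dimension; mean and variance must be reduced
    # across the full row before any normalized output can be emitted.
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    # Elementwise tanh approximation of GELU (Hendrycks & Gimpel, 2016).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```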
https://arxiv.org/pdf/2302.14017.pdf