- Introduction: This paper presents TinyVLA, a compact vision-language-action (VLA) model with two key advantages over existing VLA models: (1) faster inference and (2) better data efficiency, removing the need for pre-training on massive robot datasets. The TinyVLA framework has two essential components: (1) initializing the policy backbone with a fast multimodal model, and (2) integrating a diffusion policy decoder during fine-tuning to produce precise robot actions. The authors evaluate TinyVLA extensively in simulation and on real robots, showing that it significantly outperforms the existing VLA model OpenVLA in speed and data efficiency while delivering comparable or better performance. TinyVLA also generalizes strongly across multiple dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding OpenVLA. The authors argue that TinyVLA offers an interesting perspective on leveraging pretrained multimodal models for policy learning. Project page: https://tiny-vla.github.io.
- Problem addressed: Existing VLA models suffer from slow inference and rely on pre-training with large amounts of robot data; the paper "TinyVLA: A Compact Vision-Language-Action Model for Robotics" targets both limitations.
- Key idea: TinyVLA is a new family of compact vision-language-action models that offers faster inference and improved data efficiency, eliminating the need for a separate pre-training stage. The framework builds TinyVLA from two essential components: initializing the policy backbone with a robust, high-speed multimodal model, and integrating a diffusion policy decoder during fine-tuning to enable precise robot actions (a minimal sketch of this design appears at the end of this summary).
- Other highlights: TinyVLA outperforms the state-of-the-art VLA model, OpenVLA, in speed and data efficiency while delivering comparable or superior performance. It exhibits strong generalization across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts. The project is open-sourced and available at https://tiny-vla.github.io.
- Related work: Some related works in this field include 'Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks' by Li et al., 'Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout' by Fried et al., and 'Learning to Learn from Simulation: Accelerating Robot Learning with Simulated Data' by James et al.
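The sketch below illustrates, in PyTorch, the design described in the key idea: a pretrained multimodal backbone encodes the image and instruction into a conditioning embedding, and a diffusion head iteratively denoises an action chunk into a precise action sequence. All module names, dimensions, and the DDPM noise schedule here are illustrative assumptions, not the authors' released implementation (see https://tiny-vla.github.io for that).

```python
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    """Small MLP that predicts the noise added to an action chunk, conditioned
    on a multimodal embedding and the diffusion timestep (hypothetical design)."""
    def __init__(self, action_dim=7, horizon=16, cond_dim=512, hidden=256):
        super().__init__()
        self.action_dim, self.horizon = action_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + cond_dim + 1, hidden),
            nn.Mish(),
            nn.Linear(hidden, hidden),
            nn.Mish(),
            nn.Linear(hidden, action_dim * horizon),
        )

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, action_dim); t: (B,); cond: (B, cond_dim)
        x = torch.cat([noisy_actions.flatten(1), cond, t.float().unsqueeze(-1)], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

class TinyVLAStylePolicy(nn.Module):
    """A pretrained multimodal backbone (assumed to map image + instruction to a
    single embedding) conditions a DDPM-style diffusion decoder over action chunks."""
    def __init__(self, backbone, cond_dim=512, steps=100):
        super().__init__()
        self.backbone, self.steps = backbone, steps
        self.head = DiffusionActionHead(cond_dim=cond_dim)
        betas = torch.linspace(1e-4, 0.02, steps)      # linear noise schedule (assumed)
        alphas = 1.0 - betas
        self.register_buffer("betas", betas)
        self.register_buffer("alphas", alphas)
        self.register_buffer("alpha_bars", torch.cumprod(alphas, dim=0))

    @torch.no_grad()
    def act(self, image, instruction):
        cond = self.backbone(image, instruction)       # (B, cond_dim) embedding
        a = torch.randn(cond.size(0), self.head.horizon, self.head.action_dim,
                        device=cond.device)            # start from pure noise
        for t in reversed(range(self.steps)):          # standard DDPM reverse loop
            ts = torch.full((cond.size(0),), t, device=cond.device)
            eps = self.head(a, ts, cond)
            coef = self.betas[t] / torch.sqrt(1.0 - self.alpha_bars[t])
            a = (a - coef * eps) / torch.sqrt(self.alphas[t])
            if t > 0:
                a = a + torch.sqrt(self.betas[t]) * torch.randn_like(a)
        return a                                       # (B, horizon, action_dim)

# Usage (with a hypothetical backbone mapping image + instruction to an embedding):
# policy = TinyVLAStylePolicy(backbone=my_multimodal_encoder)
# actions = policy.act(image_batch, instruction_tokens)
```

Training such a policy would follow the standard diffusion-policy recipe: add noise to ground-truth action chunks and regress the predicted noise, while the multimodal backbone is fine-tuned on the target robot data; the exact objective and fine-tuning strategy used by TinyVLA are described in the paper itself.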