- Overview: Recent advances in Multimodal Large Language Models (MM-LLMs) have shown strong potential for broad generalization and robustness when applied to different modalities. While prior work has achieved 3D human motion generation with a variety of approaches, including language modeling, most of it relies on carefully designed, specialized architectures and is limited to single-human motion generation. Inspired by the success of MM-LLMs, we propose MotionLLM, a simple and general framework that achieves single-human and multi-human motion generation as well as motion captioning by fine-tuning pre-trained LLMs. Specifically, we encode and quantize motions into discrete, LLM-understandable tokens, yielding a unified vocabulary of motion and text tokens. With only 1-3% of the LLM's parameters trained via adapters, our single-human motion generation results are comparable to those of diffusion models and other transformer-based models trained from scratch. Moreover, we show that our approach is scalable and flexible, extending easily to multi-human motion generation through the autoregressive generation of single-human motions. Project page: https://knoxzhao.github.io/MotionLLM.
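As a rough illustration of the motion-to-token step described above, the sketch below encodes per-frame motion features and snaps each frame to the nearest entry of a learned codebook, producing discrete token ids. This is a minimal sketch under stated assumptions: the class name `MotionQuantizer`, the codebook size of 512, and the 263-dimensional motion features are illustrative choices, not the paper's specification.

```python
import torch
import torch.nn as nn


class MotionQuantizer(nn.Module):
    """Encodes per-frame motion features and snaps them to the nearest codebook entry.

    Hypothetical sketch of a VQ-style motion tokenizer; dimensions are illustrative.
    """

    def __init__(self, motion_dim: int = 263, code_dim: int = 256, codebook_size: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim, code_dim),
            nn.ReLU(),
            nn.Linear(code_dim, code_dim),
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, frames, motion_dim) -> discrete token ids: (batch, frames)
        z = self.encoder(motion)                                         # (B, T, D)
        # Squared Euclidean distance to every codebook vector, then nearest index.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        return dists.argmin(dim=-1)


quantizer = MotionQuantizer()
motion_clip = torch.randn(2, 64, 263)   # two 64-frame clips of made-up motion features
motion_tokens = quantizer(motion_clip)
print(motion_tokens.shape)              # torch.Size([2, 64])
```

Each clip thus becomes a short sequence of integer ids that an LLM can consume alongside ordinary text tokens.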
- Problem addressed: Existing 3D human motion generation methods are mostly purpose-built, specialized architectures limited to single-human motion; the paper ("MotionLLM: A Simple and General Framework for Motion Generation and Captioning with Multimodal Large Language Models") aims to provide a simple, general alternative built on pre-trained LLMs.
- Key idea: The paper proposes a simple and general framework, MotionLLM, for single-human and multi-human motion generation as well as motion captioning by fine-tuning pre-trained LLMs. The framework encodes and quantizes motions into discrete, LLM-understandable tokens, yielding a unified vocabulary of motion and text tokens; a sketch of this vocabulary extension follows below.
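To make the unified-vocabulary idea concrete, here is a hedged sketch of registering discrete motion codes as extra tokens of a pre-trained causal LLM via the Hugging Face `transformers` API. The GPT-2 backbone and the `<motion_i>`, `<som>`, `<eom>` token names are assumptions for illustration, not the paper's exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative backbone; the paper fine-tunes pre-trained LLMs but GPT-2 is only an example here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One new token per motion-codebook entry, plus hypothetical start/end-of-motion markers.
motion_tokens = [f"<motion_{i}>" for i in range(512)] + ["<som>", "<eom>"]
tokenizer.add_tokens(motion_tokens)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table to cover new tokens

# A text prompt and a quantized motion can now share one token sequence.
prompt = "a person walks forward <som> <motion_17> <motion_233> <motion_4> <eom>"
print(tokenizer(prompt)["input_ids"])
```

Once text and motion share a single vocabulary, motion generation and motion captioning both reduce to ordinary next-token prediction over mixed sequences.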
- Other highlights: The proposed approach achieves results comparable to diffusion models and other transformer-based models trained from scratch, while training only 1-3% of the LLM's parameters via adapters. The approach is scalable and flexible, extending easily to multi-human motion generation through the autoregressive generation of single-human motions. A project page with open-source code and data is provided for future research.
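The summary notes that only 1-3% of the LLM's parameters are trained via adapters. The snippet below shows one common way to reach a comparable budget using LoRA adapters from the `peft` library; LoRA is used here purely as a stand-in adapter technique and is not claimed to be the authors' exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative backbone and adapter hyperparameters; none of these values come from the paper.
model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=16,                      # low-rank adapter dimension
    lora_alpha=32,             # scaling factor for the adapter updates
    target_modules=["c_attn"], # GPT-2 attention projection layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```

With such a setup, the frozen base LLM keeps its language knowledge while the small adapter learns to read and emit the newly added motion tokens.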
- Related work includes previous approaches using language modeling for 3D human motion generation, as well as recent advancements in Multimodal Large Language Models (MM-LLMs) for different modalities.