Machine Intelligence Research
The MIR special issue "Multimodal Learning, Temporal Modeling, and Foundation Models for Video Understanding" is now openly calling for original submissions. The submission deadline is March 31, 2025. Contributions are welcome!
About the Special Issue
Video understanding focuses on interpreting dynamic visual information from video data to recognize objects, actions, interactions, and environments in a time-structured manner. It has emerged as a critical area of research in computer vision due to its wide-ranging applications in autonomous systems, video surveillance, entertainment, healthcare, and human-computer interaction. Recent advances in deep learning, especially in spatiotemporal processing, multimodal learning, and graph-based modeling, have significantly enhanced models' ability to comprehend complex video scenes. Despite this progress, the following key challenges continue to pose obstacles to developing accurate, efficient, and robust systems:
1. High Dimensionality and Computational Complexity: Videos are inherently high-dimensional data, with multiple frames contributing a vast amount of information. Analyzing these sequences requires models capable of efficiently processing both spatial and temporal information, which often leads to high computational costs. Balancing the accuracy of video understanding models with the need for real-time processing in applications such as autonomous driving or video surveillance is a pressing challenge (a back-of-the-envelope sketch of this scaling follows the list below).
2. Temporal Coherence and Long-term Dependencies: Understanding events in a video often relies on tracking objects and interpreting their actions over time. Capturing temporal coherence, especially over long sequences, is difficult due to the need to model both short-term interactions and long-term dependencies between different entities. Traditional methods struggle with maintaining consistent object tracking and event detection across extended time frames.
3. Multimodal Integration: Video data encompasses more than just visual information: auditory cues, textual descriptions, and motion data are also essential for comprehending scenes. The challenge lies in effectively fusing these modalities to provide a holistic understanding of the scene. Many systems still struggle to align and integrate multimodal inputs in a way that meaningfully improves recognition and interpretation accuracy.
4. Ambiguity in Action and Event Recognition: Distinguishing between similar actions or events in a video can be highly ambiguous. For example, the actions of sitting down and falling can appear visually similar, yet have vastly different meanings. Accurately recognizing and categorizing these nuanced actions requires models with a deep understanding of spatiotemporal context, which is challenging to achieve, especially in complex environments with multiple actors and activities.
5. Occlusion and Viewpoint Variations: In real-world scenarios, objects or people in a video often get occluded or appear from different angles. These occlusions and viewpoint changes can obscure key parts of the scene, leading to ambiguity in identifying actions and objects. Models need to be robust enough to handle partial visibility, changes in camera angles, and dynamic environments, but current systems frequently fall short in such situations.
6. Data Annotation and Scalability: Training effective video understanding models often requires large annotated datasets. However, manually labeling video data is time-consuming and expensive, particularly when both spatial and temporal dimensions must be covered. The scalability of current solutions is limited by the availability of such datasets, and the development of models capable of learning from less data or through self-supervision is still in its infancy.
7. Adaptation and Generalization: Many state-of-the-art models are trained on curated datasets that may not fully represent the complexity of real-world environments. When deployed in the real world, these models often encounter variations in lighting, weather, and unpredictable interactions, leading to performance degradation. Ensuring that models can generalize to unseen environments and adapt to changing conditions is an ongoing challenge.
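To make challenge 1 concrete, here is a minimal back-of-the-envelope sketch in Python of how quickly token counts and self-attention cost grow when a ViT-style encoder is applied to video. The clip lengths, resolution, and patch/tubelet sizes are illustrative assumptions, not values taken from any particular model.

```python
# Back-of-the-envelope estimate of video-transformer token counts.
# All numbers (clip length, resolution, patch/tubelet size) are
# illustrative assumptions, not values from any particular model.

def video_token_count(frames: int, height: int, width: int,
                      patch: int = 16, tubelet: int = 2) -> int:
    """Number of spatio-temporal tokens for a ViT-style video encoder."""
    tokens_per_frame = (height // patch) * (width // patch)
    return (frames // tubelet) * tokens_per_frame

if __name__ == "__main__":
    for frames in (8, 32, 128):
        n = video_token_count(frames, height=224, width=224)
        # Self-attention cost grows quadratically with the token count.
        print(f"{frames:>4} frames -> {n:>6} tokens, "
              f"~{n * n / 1e6:.1f}M attention pairs")
```

Even modest clip lengths push full self-attention into hundreds of millions of pairwise interactions, which is why efficient spatiotemporal modeling remains a central concern of this special issue.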
Scope (including but not limited to)
We believe that this special issue will offer a timely collection of research outcomes to benefit video understanding in the long run. Topics of interest include but are not limited to:
• Temporal Dynamics and Spatiotemporal Feature Extraction: Leveraging advanced techniques to model temporal dependencies and relationships between objects and events over time, e.g., graph neural networks and transformers.
• Multimodal Learning for Video Understanding: Integrating visual, auditory, textual, or motion information to improve scene comprehension (a minimal fusion sketch follows this list).
• Scene Segmentation in Videos: Enhancements in accurately segmenting dynamic scenes across frames, e.g., video semantic segmentation, video instance segmentation, video panoptic segmentation, video object segmentation, motion segmentation, scene change detection, interactive video segmentation, and video salient object detection.
• Object Tracking in Videos: Advancements in accurately tracking objects across video frames, e.g., single/multiple object tracking, long-term object tracking, trajectory prediction, video object/person re-identification, multimodal object tracking, 3D object tracking, and joint tracking and segmentation.
• Action Recognition and Event Detection: New methods for identifying and distinguishing complex actions and events in videos, e.g., action segmentation, video summarization/captioning, action label prediction, video prediction, video retrieval, procedure and action understanding, and video grounding.
• Data/Label Efficient Video Learning: Developing new techniques for self-supervised learning, unsupervised learning, few-shot learning, and semi-supervised learning with videos.
• Personalization of Large Foundation Models for Video Understanding: Advanced techniques for personalizing large foundation models (LFMs) for video understanding, e.g., using LFMs for video segmentation and tracking.
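As an illustration of the multimodal learning topic above, the following is a minimal, hypothetical sketch of cross-modal attention fusion between per-frame visual features and audio features in PyTorch. The feature dimensions, the two-modality setup, and the classification head are assumptions made for illustration, not a prescribed architecture.

```python
# Minimal sketch of audio-visual fusion for video understanding:
# per-modality features are projected to a shared width and fused
# with cross-modal attention. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusionClassifier(nn.Module):
    def __init__(self, vis_dim=768, aud_dim=128, d_model=256, num_classes=10):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # per-frame visual features
        self.aud_proj = nn.Linear(aud_dim, d_model)   # per-window audio features
        self.fuse = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, vis, aud):
        # vis: (batch, T_v, vis_dim), aud: (batch, T_a, aud_dim)
        v, a = self.vis_proj(vis), self.aud_proj(aud)
        # Visual tokens attend to audio tokens (cross-modal attention).
        fused, _ = self.fuse(query=v, key=a, value=a)
        # Temporal average pooling, then classification.
        return self.head((v + fused).mean(dim=1))

if __name__ == "__main__":
    model = CrossModalFusionClassifier()
    vis = torch.randn(2, 16, 768)   # e.g., 16 frames of visual features
    aud = torch.randn(2, 8, 128)    # e.g., 8 audio windows
    print(model(vis, aud).shape)    # torch.Size([2, 10])
```

Cross-attention is only one of many possible fusion strategies; where and how to align modalities is itself an open question highlighted in the challenges above.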
Submission Guidelines
1) Submission deadline: March 31, 2025
2) Submission site (now open):
https://mc03.manuscriptcentral.com/mir
When submitting, please select in the system:
“Step 6 Details & Comments: Special Issue and Special Section---Special Issue on Multimodal Learning, Temporal Modeling, and Foundation Models for Video Understanding”.
3) Submission and peer review guidelines:
Full-length manuscripts and peer review will follow the MIR guidelines. For details: https://www.springer.com/journal/11633
Guest Editors
• Yun Liu
Agency for Science, Technology and Research (A*STAR), Singapore
Email: vagrantlyun@gmail.com
• Guolei Sun
ETH Zurich, Switzerland
Email: sunguolei.kaust@gmail.com
• Radu Timofte
University of Würzburg, Germany & ETH Zurich, Switzerland
Email: Radu.Timofte@uni-wuerzburg.de
• Ender Konukoglu
ETH Zurich, Switzerland
Email: ender.konukoglu@vision.ee.ethz.ch
• Luc Van Gool
ETH Zurich, Switzerland & KU Leuven, Belgium & Institute for Computer Science, Artificial Intelligence and Technology (INSAIT), Bulgaria
Email: vangool@vision.ee.ethz.ch
MIR offers a free print-copy mailing service to all readers. If you are interested in this article, please use the link below to register your mailing address, and the editorial office will send you a free print copy as soon as possible.
Note: if delivery is not possible for special reasons, mailing may be delayed. Inquiries: 010-82544737.
Mailing address registration:
https://www.wjx.cn/vm/eIyIAAI.aspx#
About Machine Intelligence Research
Machine Intelligence Research (MIR, formerly International Journal of Automation and Computing) is sponsored by the Institute of Automation, Chinese Academy of Sciences, and has been published under its current title since 2022. Rooted in China and oriented toward the world, MIR serves national strategic needs by publishing the latest original research papers, surveys, and commentaries in machine intelligence, comprehensively reporting fundamental theories and frontier research results in the field, promoting international academic exchange and disciplinary development, and supporting national progress in artificial intelligence. The journal has been selected for the China Science and Technology Journal Excellence Action Plan, is indexed by more than 20 international databases including ESCI, EI, Scopus, the China Science and Technology Core Journals list, and CSCD, and is listed as a T2 (renowned) journal in the graded journal catalog for the image and graphics field. Its first CiteScore (2022) placed it in Q1 in all eight subcategories spanning computer science, engineering, and mathematics, with its best ranking reaching the top 4%; the 2023 CiteScore remained in Q1. In 2024, the journal received its first Impact Factor of 6.4, placing it in JCR Q1 in both the Artificial Intelligence and the Automation & Control Systems categories.