Meta AI | Learning Hierarchical Video-Language Embeddings

From today's AI frontier picks by 爱可可爱生活

[CV] HierVL: Learning Hierarchical Video-Language Embeddings

K Ashutosh, R Girdhar, L Torresani, K Grauman
[Meta AI & UT Austin]

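Key points:

Proposes HierVL, a new hierarchical video-language embedding method that accounts for both long-term and short-term associations;
Proposes a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level;
Transfers successfully to many challenging downstream tasks, in both zero-shot and fine-tuned settings.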
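
To make the two-level objective more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the symmetric InfoNCE form, the mean pooling used to build a video-level embedding from clip embeddings, the temperature of 0.07, and the names `info_nce`, `mean_pool`, `hierarchical_loss`, and `video_weight` are all assumptions made for illustration. The paper's actual loss and aggregation may differ.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE over matched (visual[i], text[i]) pairs (assumed form)."""
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    n = len(v)
    # Cosine similarities scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denominator - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(cols))

def mean_pool(vectors):
    """Average clip embeddings into one video embedding (a stand-in aggregator)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def hierarchical_loss(clip_emb, narration_emb, summary_emb, video_weight=1.0):
    """clip_emb, narration_emb: [video][clip][dim]; summary_emb: [video][dim]."""
    # Clip level: each clip is aligned with its own narration.
    flat_clips = [c for video in clip_emb for c in video]
    flat_narrations = [t for video in narration_emb for t in video]
    clip_loss = info_nce(flat_clips, flat_narrations)
    # Video level: pooled clips are aligned with the video's summary text.
    video_emb = [mean_pool(video) for video in clip_emb]
    video_loss = info_nce(video_emb, summary_emb)
    return clip_loss + video_weight * video_loss

# Toy data: 2 videos, 2 clips each, 4-dimensional embeddings.
clips = [[[1, 0, 0, 0], [0, 1, 0, 0]], [[0, 0, 1, 0], [0, 0, 0, 1]]]
narrations = clips
summaries = [[1, 1, 0, 0], [0, 0, 1, 1]]
aligned = hierarchical_loss(clips, narrations, summaries)
mismatched = hierarchical_loss(clips, narrations, summaries[::-1])
```

With these toy embeddings, swapping the two summaries leaves the clip-level term unchanged but sharply raises the video-level term, which is the behavior a video-level constraint is meant to penalize.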

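One-sentence summary:
HierVL is a new hierarchical video-language embedding that captures both short-term and long-term associations, outperforms its single-level counterpart, achieves state-of-the-art results on tasks requiring long-term video modeling, and transfers successfully to multiple downstream tasks in both zero-shot and fine-tuned settings.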

Abstract:

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
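
As a rough illustration of what zero-shot transfer with a joint video-language embedding can look like, the sketch below picks the label whose text embedding is closest to a video embedding by cosine similarity. This is a generic pattern, not the evaluation protocol of the paper: the function `zero_shot_classify`, the labels, and the toy vectors are hypothetical, and in practice the embeddings would come from trained video and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(video_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the video."""
    return max(
        label_embeddings,
        key=lambda label: cosine(video_embedding, label_embeddings[label]),
    )

# Hypothetical text embeddings for two activity labels (toy values).
label_embeddings = {
    "cooking": [0.9, 0.1, 0.0],
    "gardening": [0.1, 0.9, 0.2],
}
# Hypothetical embedding of an unseen video (toy values).
video_embedding = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(video_embedding, label_embeddings)
print(predicted)  # prints "cooking"
```

No task-specific training is involved here: the only inputs are the shared embedding space and text descriptions of the candidate labels, which is why alignment quality during pretraining matters for this setting.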

Paper link: https://arxiv.org/abs/2301.02311
