LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics
Reposted from 爱可可爱生活
Summary: complex behavior from intrinsic motivation via entropy seeking; green hierarchical Vision Transformer for masked image modeling; trajectory optimization for physics-based 3D human pose reconstruction from monocular video; self-supervised depth estimation with isometric self-sample-based learning; history compression via language models in reinforcement learning; using natural language and program abstractions to instill human inductive biases in machines; revisiting large kernel design in CNNs; mixture-of-adapters for parameter-efficient tuning of large language models; an investigation and a new challenge for instruction learning in a synthetic environment.
1、[LG] Seeking entropy: complex behavior from intrinsic motivation to occupy action-state path space
J Ramírez-Ruiz, D Grytskyy, R Moreno-Bote
[Universitat Pompeu Fabra]
Seeking entropy: complex behavior from the intrinsic motivation to occupy action-state path space. Intrinsic motivation generates actions that do not necessarily lead to immediate reward but that help exploration and learning. This paper shows that agents whose sole goal is to maximize the occupancy of future actions and states, that is, to keep moving and exploring over the long term, are capable of complex behavior without any reference to external rewards. Action-state path entropy is the only measure consistent with additivity and the other intuitive properties expected of future action-state path occupancy. The paper provides analytical expressions relating the optimal policy to the optimal state-value function, proves uniqueness of the solution of the associated Bellman equation, and proves convergence of the proposed algorithm to the optimal state-value function. Using discrete and continuous state tasks, the authors show that 'dancing', hide-and-seek, and a basic form of altruistic behavior arise naturally from entropy seeking without external rewards. Intrinsically motivated agents can objectively determine which states constitute rewards and exploit them to ultimately maximize action-state path entropy.
Intrinsic motivation generates behaviors that do not necessarily lead to immediate reward, but help exploration and learning. Here we show that agents having the sole goal of maximizing occupancy of future actions and states, that is, moving and exploring on the long term, are capable of complex behavior without any reference to external rewards. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy with the optimal state-value function, from where we prove uniqueness of the solution of the associated Bellman equation and convergence of our algorithm to the optimal state-value function. Using discrete and continuous state tasks, we show that ‘dancing’, hide-and-seek and a basic form of altruistic behavior naturally result from entropy seeking without external rewards. Intrinsically motivated agents can objectively determine what states constitute rewards, exploiting them to ultimately maximize action-state path entropy.
https://arxiv.org/abs/2205.10316
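As a rough illustration of the entropy-seeking objective, the sketch below runs a tabular soft (log-sum-exp) value iteration in which the only "reward" is the entropy of the transition distribution, so the agent is driven purely by action-state path occupancy. This is a minimal sketch under assumed notation and a random toy MDP; the exact Bellman operator, discounting, and weighting used in the paper may differ.

```python
# Minimal sketch, assuming a tabular MDP: a soft (log-sum-exp) value iteration
# whose only "reward" is the entropy of the transition distribution. This is an
# illustration of the entropy-seeking idea, not the authors' exact operator.
import numpy as np

def entropy_seeking_value_iteration(P, gamma=0.95, iters=500, tol=1e-8):
    """P: transition tensor of shape (S, A, S); each P[s, a] sums to 1."""
    S, A, _ = P.shape
    H = -np.sum(P * np.log(P + 1e-12), axis=2)      # transition entropy per (s, a)
    V = np.zeros(S)
    for _ in range(iters):
        Q = H + gamma * (P @ V)                     # (S, A) soft backup targets
        V_new = np.log(np.exp(Q).sum(axis=1))       # log-sum-exp over actions
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q = H + gamma * (P @ V)
    policy = np.exp(Q - Q.max(axis=1, keepdims=True))
    policy /= policy.sum(axis=1, keepdims=True)     # pi(a|s) proportional to exp(Q)
    return V, policy

# Toy random MDP just to exercise the routine.
rng = np.random.default_rng(0)
P = rng.random((6, 3, 6))
P /= P.sum(axis=2, keepdims=True)
V, pi = entropy_seeking_value_iteration(P)
```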
2、[CV] Green Hierarchical Vision Transformer for Masked Image Modeling
L Huang, S You, M Zheng, F Wang, C Qian, T Yamasaki
[The University of Tokyo & SenseTime Research & The University of Sydney]
Green hierarchical Vision Transformer for masked image modeling. This paper presents an efficient approach to Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs) such as Swin Transformer, allowing hierarchical ViTs to discard masked patches and operate only on the visible ones. The approach has two key components. First, for window attention, a Group Window Attention scheme is designed following a divide-and-conquer strategy: to mitigate the quadratic complexity of self-attention with respect to the number of patches, group attention encourages a uniform partition in which the visible patches within each local window of arbitrary size are gathered into groups of equal size, and masked self-attention is then performed within each group. Second, the grouping strategy is further improved with a dynamic programming algorithm that minimizes the overall computation cost of attention over the grouped patches. As a result, MIM can be applied to hierarchical ViTs in a green and efficient way: training becomes about 2.7× faster with roughly 70% less GPU memory, while performance remains competitive on ImageNet classification and superior on the downstream COCO object detection benchmark.
We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer [43], allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. As a result, MIM now can work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7× faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks.
https://arxiv.org/abs/2205.13515
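The toy sketch below illustrates the grouping idea only: visible-patch counts of arbitrarily sized windows are greedily packed into equal-size groups, and a candidate group size is chosen by minimizing the quadratic attention cost of the padded groups. The paper solves this selection with a dynamic programming algorithm; the greedy packing and the simple cost model here are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: pack visible patches of arbitrarily-sized local windows into
# equal-size groups so masked self-attention runs on dense, same-length batches.
# Greedy packing + brute-force group-size search stand in for the paper's DP.
from math import ceil

def pack_windows(visible_per_window, group_size):
    """Greedily pack per-window visible-patch counts into groups of `group_size`."""
    groups, current, load = [], [], 0
    for w, n in enumerate(visible_per_window):
        remaining = n
        while remaining > 0:
            take = min(remaining, group_size - load)
            current.append((w, take))            # (window id, number of patches taken)
            load += take
            remaining -= take
            if load == group_size:
                groups.append(current)
                current, load = [], 0
    if current:
        groups.append(current)                   # last group is padded up to group_size
    return groups

def attention_cost(visible_per_window, group_size):
    groups = pack_windows(visible_per_window, group_size)
    return len(groups) * group_size ** 2         # quadratic cost per (padded) group

def best_group_size(visible_per_window, candidates):
    return min(candidates, key=lambda g: attention_cost(visible_per_window, g))

# Example: visible-patch counts of a few windows after random masking.
counts = [17, 3, 9, 24, 11]
g = best_group_size(counts, candidates=range(4, 33))
print(g, attention_cost(counts, g))
```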
3、[CV] Trajectory Optimization for Physics-Based Reconstruction of 3d Human Pose from Monocular Video
E Gärtner, M Andriluka, H Xu, C Sminchisescu
[Google Research]
Trajectory optimization for physics-based reconstruction of 3D human pose from monocular video. This paper focuses on estimating physically plausible articulated human motion from monocular video. Existing approaches that do not consider physics often produce temporally inconsistent output with motion artifacts, while state-of-the-art physics-based approaches either have only been shown to work in controlled laboratory conditions or consider simplified body-ground contact limited to the feet. The paper explores how these shortcomings can be addressed by directly incorporating a fully-featured physics engine into the pose estimation process. Given an uncontrolled, real-world scene as input, the method estimates the ground-plane location and the dimensions of the physical body model, and then recovers the physical motion by performing trajectory optimization. The formulation readily generalizes to a variety of scenes that may have diverse ground properties, and it supports any form of self-contact and contact between the articulated body and the scene geometry. On the Human3.6M benchmark the method achieves results competitive with existing physics-based approaches, while being directly applicable, without retraining, to the more complex dynamic motions of the AIST benchmark and to uncontrolled internet videos.
We focus on the task of estimating a physically plausible articulated human motion from monocular video. Existing approaches that do not consider physics often produce temporally inconsistent output with motion artifacts, while state-of-the-art physics-based approaches have either been shown to work only in controlled laboratory conditions or consider simplified body-ground contact limited to feet. This paper explores how these shortcomings can be addressed by directly incorporating a fully-featured physics engine into the pose estimation process. Given an uncontrolled, real-world scene as input, our approach estimates the ground-plane location and the dimensions of the physical body model. It then recovers the physical motion by performing trajectory optimization. The advantage of our formulation is that it readily generalizes to a variety of scenes that might have diverse ground properties and supports any form of self-contact and contact between the articulated body and scene geometry. We show that our approach achieves competitive results with respect to existing physics-based methods on the Human3.6M benchmark [13], while being directly applicable without re-training to more complex dynamic motions from the AIST benchmark [36] and to uncontrolled internet videos.
https://arxiv.org/abs/2205.12292
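As a schematic illustration of trajectory optimization against a simulator, the sketch below fits a control sequence so that a toy "physics" rollout tracks target poses. The `simulate` function is a hypothetical point-mass stand-in for the full physics engine, and the random-search optimizer is only a placeholder; the paper's actual pipeline (ground-plane and body-model estimation, contact handling, articulated dynamics) is not reproduced here.

```python
# Hedged, schematic sketch of the trajectory-optimization idea: search for
# control inputs whose simulated motion stays close to kinematic pose targets.
# `simulate` is a toy stand-in for a physics engine, NOT the authors' system.
import numpy as np

def simulate(state, control, dt=1.0 / 30.0):
    """Toy point-mass 'physics': state = (pose, velocity); control = acceleration."""
    pose, vel = state
    vel = vel + dt * control
    pose = pose + dt * vel
    return pose, vel

def rollout_cost(controls, init_state, target_poses):
    state, cost = init_state, 0.0
    for u, target in zip(controls, target_poses):
        state = simulate(state, u)
        cost += np.sum((state[0] - target) ** 2)    # match kinematic estimates
        cost += 1e-3 * np.sum(u ** 2)               # control-effort regularizer
    return cost

def optimize_trajectory(init_state, target_poses, iters=200, sigma=0.1, seed=0):
    """Placeholder random-search optimizer over the control trajectory."""
    rng = np.random.default_rng(seed)
    T, D = len(target_poses), target_poses[0].shape[0]
    controls = np.zeros((T, D))
    best = rollout_cost(controls, init_state, target_poses)
    for _ in range(iters):
        cand = controls + sigma * rng.standard_normal((T, D))
        c = rollout_cost(cand, init_state, target_poses)
        if c < best:
            best, controls = c, cand
    return controls, best

targets = [np.array([0.1 * t, 0.0, 0.9]) for t in range(30)]
ctrl, cost = optimize_trajectory((np.zeros(3), np.zeros(3)), targets)
```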
4、[CV] Self-Supervised Depth Estimation with Isometric-Self-Sample-Based Learning
G Cha, H Jang, D Wee
[Clova AI]
Self-supervised depth estimation with isometric self-sample-based learning. Handling the dynamic regions in the photometric loss formulation has been a main issue in self-supervised depth estimation. Most previous methods alleviate this issue by removing the dynamic regions from the photometric loss using masks estimated by another module, which makes it difficult to fully utilize the training images. To handle this problem, the paper proposes an isometric self-sample-based learning (ISSL) method that fully utilizes the training images in a simple yet effective way. The method provides additional supervision during training with self-generated images that comply with the pure static-scene assumption. Specifically, the isometric self-sample generator synthesizes self-samples for each training image by applying random rigid transformations on the estimated depth, so the generated self-samples and the corresponding training image always follow the static-scene assumption. Plugging the proposed ISSL module into several existing models consistently improves performance by a large margin, and it also boosts depth accuracy across different types of scenes, both outdoor (KITTI and Make3D) and indoor (NYUv2), validating its effectiveness.
Managing the dynamic regions in the photometric loss formulation has been a main issue for handling the self-supervised depth estimation problem. Most previous methods have alleviated this issue by removing the dynamic regions in the photometric loss formulation based on the masks estimated from another module, making it difficult to fully utilize the training images. In this paper, to handle this problem, we propose an isometric self-sample-based learning (ISSL) method to fully utilize the training images in a simple yet effective way. The proposed method provides additional supervision during training using self-generated images that comply with pure static scene assumption. Specifically, the isometric self-sample generator synthesizes self-samples for each training image by applying random rigid transformations on the estimated depth. Thus both the generated self-samples and the corresponding training image always follow the static scene assumption. We show that plugging our ISSL module into several existing models consistently improves the performance by a large margin. In addition, it also boosts the depth accuracy over different types of scene, i.e., outdoor scenes (KITTI and Make3D) and indoor scene (NYUv2), validating its high effectiveness.
https://arxiv.org/abs/2205.10006
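The core self-sample idea lends itself to a short sketch: back-project a frame with its estimated depth, apply a small random rigid (SE(3)) transform, and re-project to obtain a view that is static by construction. The intrinsics handling, transform ranges, and nearest-neighbor splatting below are illustrative assumptions rather than the authors' settings.

```python
# Hedged sketch of self-sample synthesis: back-project with estimated depth,
# apply a random rigid transform, re-project. Details here (sampling scheme,
# transform ranges, intrinsics) are assumptions, not the paper's configuration.
import numpy as np

def random_rigid_transform(max_angle=0.05, max_trans=0.05, seed=None):
    rng = np.random.default_rng(seed)
    ax, ay, az = rng.uniform(-max_angle, max_angle, 3)        # small Euler angles
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    t = rng.uniform(-max_trans, max_trans, 3)
    return Rz @ Ry @ Rx, t

def synthesize_self_sample(image, depth, K):
    """Forward-warp `image` (H,W,3) using `depth` (H,W) and intrinsics K (3,3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, HW)
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # back-project
    R, t = random_rigid_transform()
    pts = R @ pts + t[:, None]                                          # rigid motion
    proj = K @ pts
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).round().astype(int) # re-project
    out = np.zeros_like(image)
    ok = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (pts[2] > 0)
    out[uv[1, ok], uv[0, ok]] = image.reshape(-1, 3)[ok]                # nearest splat
    return out

# Toy usage with random data and an assumed pinhole intrinsic matrix.
img = np.random.default_rng(0).random((48, 64, 3))
dep = 1.0 + np.random.default_rng(1).random((48, 64))
K = np.array([[60.0, 0, 32.0], [0, 60.0, 24.0], [0, 0, 1.0]])
sample = synthesize_self_sample(img, dep, K)
```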
5、[LG] History Compression via Language Models in Reinforcement Learning
F Paischer, T Adler, V Patil, A Bitto-Nemling, M Holzleitner, S Lehner, H Eghbal-zadeh, S Hochreiter
[Johannes Kepler University Linz]
History compression via language models in reinforcement learning. In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. This paper proposes utilizing a frozen Pretrained Language Transformer (PLT) to represent and compress history in order to improve sample efficiency. To avoid training the Transformer, FrozenHopfield is introduced, which automatically associates observations with the original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries obtained through a random but fixed projection of the observations. The resulting method, HELM, yields an actor-critic architecture that contains a pretrained language Transformer serving as a memory module for history representation. Since a representation of the past need not be learned, HELM is much more sample efficient than its competitors and achieves new state-of-the-art results on the Minigrid and Procgen environments.
In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at this https URL.
https://arxiv.org/abs/2205.12258
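The FrozenHopfield association step described in the abstract can be sketched compactly: a random but fixed projection maps an observation into the token-embedding space, and a modern-Hopfield-style softmax retrieval over the frozen token embeddings produces the vector passed to the frozen language Transformer. Dimensions, the inverse temperature beta, and the projection scaling below are assumptions for illustration, not the paper's hyperparameters.

```python
# Hedged sketch of the FrozenHopfield association step: fixed random projection
# of the observation, then softmax retrieval over frozen token embeddings.
import numpy as np

class FrozenHopfield:
    def __init__(self, token_embeddings, obs_dim, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.E = token_embeddings                            # (vocab, d), kept frozen
        d = token_embeddings.shape[1]
        self.W = rng.standard_normal((d, obs_dim)) / np.sqrt(obs_dim)  # fixed projection
        self.beta = beta

    def retrieve(self, obs):
        q = self.W @ obs                                     # observation -> embedding space
        logits = self.beta * (self.E @ q)                    # similarity to every token embedding
        attn = np.exp(logits - logits.max())
        attn /= attn.sum()
        return attn @ self.E                                 # convex combination of embeddings

# Toy usage: 1000-token vocabulary, 64-d embeddings, 16-d observations.
E = np.random.default_rng(1).standard_normal((1000, 64))
fh = FrozenHopfield(E, obs_dim=16)
h = fh.retrieve(np.random.default_rng(2).standard_normal(16))
```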
A few more papers worth noting:
[AI] Using Natural Language and Program Abstractions to Instill Human Inductive Biases in Machines
S Kumar, C G. Correa, I Dasgupta, R Marjieh, M Y. Hu...
[Princeton Neuroscience Institute & DeepMind & Princeton University]
https://arxiv.org/abs/2205.11558
[CV] Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs
X Ding, X Zhang, Y Zhou, J Han, G Ding, J Sun
[Tsinghua University & MEGVII Technology & Aberystwyth University]
https://arxiv.org/abs/2203.06717
[CL] AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models
Y Wang, S Mukherjee, X Liu, J Gao, A H Awadallah, J Gao
[Purdue University & Microsoft Research]
https://arxiv.org/abs/2205.12410
[CL] What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment
M Finlayson, K Richardson, A Sabharwal, P Clark
[Allen Institute for AI]
https://arxiv.org/abs/2204.09148