LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics
Reposted from 爱可可爱生活
1. [LG] Algorithm is Experiment: Machine Learning, Market Design, and Policy Eligibility Rules
Y Narita, K Yata
[Yale University]
Algorithms produce a growing share of the decisions and recommendations made in policy and business. Such algorithmic decisions are natural experiments (conditionally quasi-randomly assigned instruments), because the algorithm makes its decision based only on observable input variables. The paper uses this observation to develop a treatment-effect estimator for a class of stochastic and deterministic decision-making algorithms. The estimator is shown to be consistent and asymptotically normal for well-defined causal effects, with the multidimensional regression discontinuity design as a key special case. The proposed estimator is used to evaluate the effect of the Coronavirus Aid, Relief, and Economic Security (CARES) Act, under which hundreds of billions of dollars of relief funding were allocated to hospitals via an algorithmic rule. The estimates suggest that the relief funding had little effect on COVID-19-related hospital activity levels, whereas naive OLS and IV estimates exhibit substantial selection bias.
Algorithms produce a growing portion of decisions and recommendations both in policy and business. Such algorithmic decisions are natural experiments (conditionally quasi-randomly assigned instruments) since the algorithms make decisions based only on observable input variables. We use this observation to develop a treatment-effect estimator for a class of stochastic and deterministic decision-making algorithms. Our estimator is shown to be consistent and asymptotically normal for well-defined causal effects. A key special case of our estimator is a multidimensional regression discontinuity design. We apply our estimator to evaluate the effect of the Coronavirus Aid, Relief, and Economic Security (CARES) Act, where hundreds of billions of dollars worth of relief funding is allocated to hospitals via an algorithmic rule. Our estimates suggest that the relief funding has little effect on COVID-19-related hospital activity levels. Naive OLS and IV estimates exhibit substantial selection bias.
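To make the intuition concrete, here is a minimal sketch (not the paper's estimator, and with illustrative variable names): when a deterministic rule assigns treatment as D = 1{score ≥ cutoff}, the effect at the cutoff can be estimated with a simple local linear regression discontinuity fit.

```python
# Minimal sketch of the regression-discontinuity intuition behind
# algorithm-as-instrument designs; NOT the paper's estimator.
import numpy as np

def local_linear_rdd(score, outcome, cutoff=0.0, bandwidth=1.0):
    """Estimate the jump in E[outcome | score] at the cutoff."""
    keep = np.abs(score - cutoff) <= bandwidth            # observations near the cutoff
    x, y = score[keep] - cutoff, outcome[keep]
    d = (x >= 0).astype(float)                            # treatment implied by the rule
    X = np.column_stack([np.ones_like(x), d, x, d * x])   # separate slopes on each side
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]                                        # coefficient on d = estimated effect

# Toy data where the true effect of crossing the threshold is +2.0.
rng = np.random.default_rng(0)
score = rng.normal(size=5_000)
treated = (score >= 0).astype(float)
outcome = 1.0 + 2.0 * treated + 0.5 * score + rng.normal(scale=1.0, size=score.size)
print(local_linear_rdd(score, outcome, bandwidth=0.5))    # close to 2.0
```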
2. [CV] Freeform Body Motion Generation from Speech
J Xu, W Zhang, Y Bai, Q Sun, T Mei
[University of Science and Technology of China & JD AI Research]
People naturally make spontaneous body motions to reinforce what they are saying while giving a talk. Generating body motion from speech is inherently difficult because the mapping from speech to motion is non-deterministic. Most existing work maps speech to motion deterministically by conditioning on certain styles, which leads to sub-optimal results. Motivated by studies in linguistics, the paper decomposes co-speech motion into two complementary parts: pose modes and rhythmic dynamics. It proposes a new freeform motion generation model (FreeMo) with a two-stream architecture: a pose-mode branch for primary posture generation and a rhythmic-motion branch for synthesizing rhythmic dynamics. On one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics; on the other hand, the rhythmic dynamics are kept in sync with the speech prosody. Extensive experiments show superior performance over several baselines in terms of motion diversity, quality, and synchronization with speech.
People naturally conduct spontaneous body motions to enhance their speeches while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions. Most existing works map speech to motion in a deterministic way by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompose the co-speech motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a novel freeform motion generation model (FreeMo) by equipping a two-stream architecture, i.e., a pose mode branch for primary posture generation, and a rhythmic motion branch for rhythmic dynamics synthesis. On one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synced with the speech prosody. Extensive experiments demonstrate the superior performance against several baselines, in terms of motion diversity, quality and syncing with speech. Code and pre-trained models will be publicly available through this https URL.
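As a rough illustration of what a two-stream design can look like, the sketch below combines a pose-mode branch (a latent sampled per utterance, conditioned on speech semantics) with a rhythm branch driven by prosody features. All module names, feature sizes, and the additive fusion are assumptions made for the example; this is not the authors' released FreeMo architecture.

```python
# Hedged sketch of a generic two-stream generator in the spirit of FreeMo;
# sizes, modules, and additive fusion are illustrative assumptions only.
import torch
import torch.nn as nn

class TwoStreamMotionGenerator(nn.Module):
    def __init__(self, sem_dim=256, pros_dim=64, latent_dim=32, pose_dim=51, hidden=256):
        super().__init__()
        self.latent_dim = latent_dim
        # Pose-mode branch: a latent sampled per utterance, conditioned on
        # speech semantics, selects the primary posture.
        self.pose_branch = nn.Sequential(
            nn.Linear(sem_dim + latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, pose_dim))
        # Rhythmic branch: frame-level dynamics driven by prosody features.
        self.rhythm_branch = nn.GRU(pros_dim, hidden, batch_first=True)
        self.rhythm_head = nn.Linear(hidden, pose_dim)

    def forward(self, speech_semantics, prosody):
        # speech_semantics: (B, sem_dim); prosody: (B, T, pros_dim)
        z = torch.randn(speech_semantics.size(0), self.latent_dim, device=speech_semantics.device)
        pose_mode = self.pose_branch(torch.cat([speech_semantics, z], dim=-1))  # (B, pose_dim)
        dynamics, _ = self.rhythm_branch(prosody)
        dynamics = self.rhythm_head(dynamics)                                   # (B, T, pose_dim)
        return pose_mode.unsqueeze(1) + dynamics  # motion = base posture + rhythmic offsets

model = TwoStreamMotionGenerator()
motion = model(torch.randn(2, 256), torch.randn(2, 120, 64))
print(motion.shape)  # torch.Size([2, 120, 51])
```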
3. [CL] Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
W Liang, Y Zhang, Y Kwon, S Yeung, J Zou
[Stanford University]
The paper presents the modality gap, an intriguing geometric phenomenon in the representation space of multi-modal models. Different data modalities (e.g., images and text) are embedded a distinct distance apart in the shared representation of multi-modal models such as CLIP. A systematic analysis shows that this gap is caused jointly by model initialization and contrastive learning optimization. At initialization, it is shown empirically and theoretically that the representations of a typical deep neural network are confined to a narrow cone; consequently, in a multi-modal model with two encoders, the two modalities' representations are clearly separated when the model is initialized. During optimization, contrastive learning keeps the modalities a certain distance apart, a distance influenced by the temperature parameter in the loss function. Experiments further show that varying the modality-gap distance has a significant impact on improving the model's downstream zero-shot classification performance and fairness.
We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separate by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. Our code and data are available at this https URL
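A simple way to quantify such a gap is the distance between the centroids of the L2-normalized image and text embeddings. The snippet below uses random arrays as stand-ins for real CLIP features; treating this centroid distance as "the gap" is an assumption of the example, not necessarily the paper's exact definition.

```python
# Measure a modality gap as the distance between embedding centroids.
# Random arrays stand in for real image/text features from a model like CLIP.
import numpy as np

def modality_gap(image_emb, text_emb):
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(1000, 512)) + 0.5   # shared offset mimics a narrow cone per modality
text_emb = rng.normal(size=(1000, 512)) - 0.5
print(modality_gap(image_emb, text_emb))
```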
4. [CV] DiT: Self-supervised Pre-training for Document Image Transformer
J Li, Y Xu, T Lv, L Cui, C Zhang, F Wei
[Shanghai Jiao Tong University & Microsoft Research & Microsoft Azure AI]
Image Transformers have recently made significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. The paper proposes DiT, a self-supervised pre-trained Document Image Transformer that uses large-scale unlabeled text images for Document AI tasks; this is essential because no supervised counterpart exists, owing to the lack of human-labeled document images. DiT is used as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, and table detection. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g., document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), and table detection (94.23 → 96.55).
Image Transformer has recently achieved significant progress for natural image understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, as well as table detection. Experiment results have illustrated that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9) and table detection (94.23 → 96.55). The code and pre-trained models are publicly available at \url{this https URL}.
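For intuition, the sketch below shows one BEiT-style masked image modeling step of the kind such pre-training builds on: a fraction of patch embeddings is replaced by a mask token, and a Transformer encoder is trained to predict the discrete visual token of each masked patch. The tokenizer output, dimensions, and mask ratio here are stand-in assumptions, not DiT's actual configuration.

```python
# Hedged sketch of a BEiT-style masked image modeling step on document patches.
# Stand-in tensors replace the visual tokenizer and patch embedder.
import torch
import torch.nn as nn

vocab_size, num_patches, dim = 8192, 196, 768      # e.g. 14x14 patches of a 224x224 page
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=4)
to_logits = nn.Linear(dim, vocab_size)
mask_token = nn.Parameter(torch.zeros(dim))

patch_emb = torch.randn(2, num_patches, dim)                     # embedded document patches (stand-in)
visual_tokens = torch.randint(0, vocab_size, (2, num_patches))   # tokenizer output (stand-in)
mask = torch.rand(2, num_patches) < 0.4                          # mask ~40% of patches

inp = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patch_emb), patch_emb)
logits = to_logits(encoder(inp))
loss = nn.functional.cross_entropy(logits[mask], visual_tokens[mask])  # predict masked tokens only
loss.backward()
print(float(loss))
```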
5. [CV] Autoregressive Image Generation using Residual Quantization
D Lee, C Kim, S Kim, M Cho, W Han
[POSTECH & Kakao Brain]
For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce the computational cost of modeling long-range interactions among the codes. The paper argues that previous VQ schemes cannot simultaneously shorten the code sequence and generate high-fidelity images in terms of the rate-distortion trade-off, and proposes a two-stage framework, consisting of a Residual-Quantized VAE (RQ-VAE) and an RQ-Transformer, to generate high-resolution images efficiently. Given a fixed codebook size, the RQ-VAE precisely approximates an image's feature map and represents the image as a stacked map of discrete codes. The RQ-Transformer then learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of the RQ-VAE, a 256×256 image can be represented as an 8×8 feature map, and the RQ-Transformer can effectively reduce the computational cost. The proposed framework outperforms existing AR models on various benchmarks for unconditional and conditional image generation, with a significantly faster sampling speed for generating high-quality images.
For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256×256 image as 8×8 resolution of the feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.
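The core quantizer is easy to sketch: each feature vector is approximated by a depth-D stack of codes, where codebook d quantizes the residual left after the first d−1 codes. The toy example below uses random, fixed codebooks purely for illustration (an RQ-VAE learns its codebooks end-to-end), and the depth and codebook size are assumptions of the example.

```python
# Toy residual quantization: a depth-D stack of codes per feature vector.
# Random fixed codebooks are used only for illustration; RQ-VAE learns them.
import numpy as np

def residual_quantize(vectors, codebooks):
    """vectors: (N, dim); codebooks: list of (K, dim) arrays -> codes (N, D), reconstruction (N, dim)."""
    recon = np.zeros_like(vectors)
    codes = []
    for codebook in codebooks:
        residual = vectors - recon                         # what is still unexplained
        dists = np.linalg.norm(residual[:, None, :] - codebook[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)                         # nearest codeword to the residual
        codes.append(idx)
        recon = recon + codebook[idx]
    return np.stack(codes, axis=1), recon

rng = np.random.default_rng(0)
feats = rng.normal(size=(512, 16))
codebooks = [rng.normal(scale=0.5, size=(256, 16)) for _ in range(4)]  # depth D = 4, K = 256
codes, recon = residual_quantize(feats, codebooks)
print(codes.shape, np.mean((feats - recon) ** 2))  # (512, 4) and the reconstruction error
```

Roughly speaking, a depth-D stack over a codebook of size K can express on the order of K^D combinations per position, which is why the code map can stay as small as 8×8 without giving up fidelity.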
A few other papers worth noting:
[CV] Interactive Image Synthesis with Panoptic Layout Generation
B Wang, T Wu, M Zhu, P Du
[Huawei Technologies]
[CL] ClarET: Pre-training a Correlation-Aware Context-To-Event Transformer for Event-Centric Generation and Classification
Y Zhou, T Shen, X Geng, G Long, D Jiang
[University of Technology Sydney & Microsoft]
[LG] R-GCN: The R Could Stand for Random
V Degraeve, G Vandewiele, F Ongenae, S V Hoecke
[Ghent University]
[AI] Uncalibrated Models Can Improve Human-AI Collaboration
K Vodrahalli, T Gerstenberg, J Zou
[Stanford University]