
LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics   GR - Graphics



1. [CL] Continuous diffusion for categorical data

S Dieleman, L Sartran, A Roshannai, N Savinov...
[DeepMind & University of Southern California & INRIA & Google Research]

Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.


2. [LG] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Y Nie, N H. Nguyen, P Sinthong, J Kalagnanam
[Princeton University & IBM Research]
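Long-term time series forecasting with Transformers. This paper proposes an efficient Transformer-based model design for multivariate time series forecasting and self-supervised representation learning, built on two key components: (i) segmenting each time series into subseries-level patches that serve as input tokens to the Transformer; (ii) channel independence, where each channel holds a single univariate time series and all series share the same embedding and Transformer weights. The patch design has three benefits: local semantic information is retained in the embedding; for the same look-back window, the computation and memory cost of the attention maps drop quadratically; and the model can attend to a longer history. Compared with SOTA Transformer-based models, the proposed channel-independent patch time series Transformer (PatchTST) significantly improves long-term forecasting accuracy. Applied to self-supervised pre-training, the model achieves excellent fine-tuning performance, outperforming supervised training on large datasets. Transferring masked pre-trained representations from one dataset to others also yields SOTA forecasting accuracy.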

We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches, which serve as input tokens to the Transformer; (ii) channel independence, where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. The patching design naturally has a three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend to a longer history. Our channel-independent patch time series Transformer (PatchTST) can improve long-term forecasting accuracy significantly compared with SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring masked pre-trained representations from one dataset to others also produces SOTA forecasting accuracy. Code is available at: this https URL.
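The two components named in the abstract, patching and channel independence, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `make_patches`, the sizes (look-back 512, patch length 8, stride 8, embedding width 16), and the plain linear embedding are all assumptions chosen for illustration, and no padding is applied at the end of the series.

```python
import numpy as np


def make_patches(series, patch_len, stride):
    """Split a (channels, length) array into (channels, num_patches, patch_len)."""
    series = np.asarray(series, dtype=float)
    channels, length = series.shape
    if length < patch_len:
        raise ValueError("series is shorter than one patch")
    num_patches = (length - patch_len) // stride + 1
    # Row k of `index` holds the time steps that belong to patch k.
    starts = np.arange(num_patches) * stride
    index = starts[:, None] + np.arange(patch_len)[None, :]
    return series[:, index]


rng = np.random.default_rng(0)
look_back, patch_len, stride, d_model = 512, 8, 8, 16  # hypothetical sizes
series = rng.normal(size=(3, look_back))  # 3 channels, one univariate series each

patches = make_patches(series, patch_len, stride)

# Channel independence: every channel is embedded with the same weights,
# so the channel axis behaves like a batch axis.
shared_embedding = rng.normal(size=(patch_len, d_model))
tokens = patches @ shared_embedding

# Attention-map size with one token per time step vs. one token per patch.
attn_entries_pointwise = look_back ** 2
attn_entries_patched = patches.shape[1] ** 2
reduction = attn_entries_pointwise // attn_entries_patched
```

With these sizes each channel becomes 64 tokens instead of 512, so the attention map shrinks by the square of the stride, which is the quadratic reduction the abstract refers to.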


3. [LG] Is Conditional Generative Modeling all you need for Decision-Making?

A Ajay, Y Du, A Gupta, J Tenenbaum, T Jaakkola, P Agrawal
[Improbable AI Lab & MIT]

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.
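The abstract does not say how the conditioning is implemented. The sketch below assumes classifier-free guidance, a common way to condition diffusion models, and shows how guidance terms for several conditions could be summed at test time even though each condition is handled on its own. The names `guided_noise` and `toy_eps_model` and the toy linear noise model are hypothetical stand-ins, not part of the paper.

```python
import numpy as np


def guided_noise(eps_model, x, t, conditions, guidance_scale):
    """Combine unconditional and conditional noise estimates.

    Assumes classifier-free guidance: each condition contributes
    guidance_scale * (eps(x, t, c) - eps(x, t, None)).
    """
    eps_uncond = np.asarray(eps_model(x, t, None), dtype=float)
    eps = eps_uncond.copy()
    for cond in conditions:
        eps += guidance_scale * (eps_model(x, t, cond) - eps_uncond)
    return eps


def toy_eps_model(x, t, cond):
    """Hypothetical stand-in for a trained noise-prediction network."""
    return x if cond is None else x - cond


x = np.array([1.0, 2.0])
skill_a = np.array([0.5, 0.0])  # hypothetical condition vectors
skill_b = np.array([0.0, 0.25])

single = guided_noise(toy_eps_model, x, t=0, conditions=[skill_a], guidance_scale=1.0)
composed = guided_noise(toy_eps_model, x, t=0, conditions=[skill_a, skill_b], guidance_scale=2.0)
```

With a guidance scale of 1 and a single condition, the result reduces to the conditional estimate itself; with two conditions, both guidance terms push the sample at once.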



4. [CV] How to Fine-Tune Vision Models with SGD

A Kumar, R Shen, S Bubeck, S Gunasekar
[Stanford University & University of Washington & Microsoft]

SGD (with momentum) and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we show that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: merely freezing the embedding layer (less than 1% of the parameters) leads to SGD performing competitively with AdamW while using less memory. Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, Living-17, Waterbirds, and DomainNet.
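The 12 and 16 bytes/parameter figures can be reproduced with simple arithmetic, assuming 32-bit floats and counting one weight, one gradient, and one optimizer slot per parameter for SGD with momentum (two slots for AdamW). That breakdown, the function name, and the parameter counts below are assumptions for illustration; activation memory is ignored.

```python
BYTES_PER_FP32 = 4


def training_state_bytes(trainable_params, frozen_params, optimizer_slots):
    """Bytes for weights, gradients, and optimizer state (activations ignored).

    Assumption: a trainable parameter stores weight + gradient + optimizer
    slots, while a frozen parameter stores only its weight.
    """
    per_trainable = BYTES_PER_FP32 * (2 + optimizer_slots)
    return trainable_params * per_trainable + frozen_params * BYTES_PER_FP32


total_params = 100_000_000   # hypothetical model size
embedding_params = 500_000   # hypothetical embedding layer, under 1% of the model

sgd_full = training_state_bytes(total_params, 0, optimizer_slots=1)
adamw_full = training_state_bytes(total_params, 0, optimizer_slots=2)
sgd_frozen_embedding = training_state_bytes(
    total_params - embedding_params, embedding_params, optimizer_slots=1
)
```

Under these assumptions, freezing the embedding layer saves a small additional amount on top of the SGD baseline, because frozen parameters need neither gradients nor momentum.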



5. [LG] Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes

A Kumar, R Agarwal, X Geng, G Tucker, S Levine
[Google Research & UC Berkeley]

The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the lessons from these works, we re-examine previous design choices and find that with appropriate choices (ResNets, cross-entropy based distributional backups, and feature normalization), offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using networks of up to 80 million parameters, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.
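Two of the design choices listed in the abstract can be sketched in NumPy. The sketch assumes that "cross-entropy based distributional backups" refers to a C51-style categorical projection of the Bellman target followed by a cross-entropy loss, and that "feature normalization" means L2-normalizing a feature vector; both readings are assumptions, and every function name and number below is hypothetical.

```python
import numpy as np


def normalize_features(features, eps=1e-8):
    """L2-normalize each feature vector (assumed form of feature normalization)."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / (norms + eps)


def project_target(rewards, discounts, next_probs, support):
    """Project r + gamma * z onto a fixed, evenly spaced support (C51-style)."""
    v_min, v_max = support[0], support[-1]
    delta = support[1] - support[0]
    shifted = np.clip(rewards[:, None] + discounts[:, None] * support[None, :], v_min, v_max)
    position = np.clip((shifted - v_min) / delta, 0, len(support) - 1)
    lower = np.floor(position).astype(int)
    upper = np.ceil(position).astype(int)
    target = np.zeros_like(next_probs)
    for i in range(next_probs.shape[0]):
        for j in range(next_probs.shape[1]):
            lo, hi = lower[i, j], upper[i, j]
            if lo == hi:  # shifted atom lands exactly on a support point
                target[i, lo] += next_probs[i, j]
            else:         # split the mass between the two neighbouring atoms
                target[i, lo] += next_probs[i, j] * (hi - position[i, j])
                target[i, hi] += next_probs[i, j] * (position[i, j] - lo)
    return target


def cross_entropy(target_probs, logits):
    """Mean cross-entropy between target distributions and predicted logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(target_probs * log_probs).sum(axis=-1).mean())


support = np.linspace(-1.0, 1.0, 5)                 # hypothetical return atoms
next_probs = np.array([[0.1, 0.2, 0.4, 0.2, 0.1]])  # next-state return distribution
target = project_target(np.array([0.25]), np.array([0.5]), next_probs, support)
loss = cross_entropy(target, np.zeros((1, 5)))      # uniform prediction
unit_features = normalize_features(np.array([[3.0, 4.0]]))
```

The projection keeps the target a valid probability distribution over the same atoms as the prediction, which is what allows the backup to be trained with a cross-entropy loss rather than a squared error.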





[LG] What learning algorithm is in-context learning? Investigations with linear models

E Akyürek, D Schuurmans, J Andreas, T Ma, D Zhou
[Google Research & MIT CSAIL]


[LG] Synergies Between Disentanglement and Sparsity: a Multi-Task Learning Perspective

S Lachapelle, T Deleu, D Mahajan, I Mitliagkas, Y Bengio, S Lacoste-Julien, Q Bertrand
[Université de Montréal]


[LG] Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models

E Mitchell, P Henderson, C D. Manning, D Jurafsky, C Finn
[Stanford University]


[LG] A Theoretical Study of Inductive Biases in Contrastive Learning

J Z. HaoChen, T Ma
[Stanford University]



