
LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics   GR - Graphics



1. [CV] RGB no more: Minimally-decoded JPEG Vision Transformers

J Park, J Johnson
[University of Michigan]

Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data loading. Existing works have studied this aspect, but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which, to our knowledge, has not been explored in depth for training in this setting. With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart.
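
For context: in baseline JPEG, each channel is split into 8x8 blocks and stored as quantized, entropy-coded DCT coefficients, so a decoder has to run an inverse DCT (among other steps) to get back to pixels. The abstract does not say exactly which encoded features the model consumes, so the sketch below only illustrates the forward 8x8 DCT that such features come from. It is not the authors' pipeline, and the names `dct2_8x8`, `flat` and `coeffs` are illustrative.

```python
import math

N = 8  # baseline JPEG transforms each channel in 8x8 blocks


def dct2_8x8(block):
    """2D DCT-II of one 8x8 block, with the normalization used by JPEG."""
    def c(k):
        # Scale factor for the zero-frequency term
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


# A perfectly flat block: all of its energy lands in the DC coefficient
flat = [[10.0] * N for _ in range(N)]
coeffs = dct2_8x8(flat)
```

For the flat block, `coeffs[0][0]` is 80 (up to rounding) and every other coefficient is numerically zero, which is the kind of sparsity JPEG exploits.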



2. [LG] Decentralized Learning with Multi-Headed Distillation

A Zhmoginov, M Sandler, N Miller, G Kristiansen, M Vladymyrov
[Google Research]

Decentralized learning with private data is a central problem in machine learning. We propose a novel distillation-based decentralized learning technique that allows multiple agents with private non-iid data to learn from each other, without having to share their data, weights or weight updates. Our approach is communication efficient, utilizes an unlabeled public dataset and uses multiple auxiliary heads for each client, greatly improving training efficiency in the case of heterogeneous data. This approach allows individual models to preserve and enhance performance on their private tasks while also dramatically improving their performance on the global aggregated data distribution. We study the effects of data and model architecture heterogeneity and the impact of the underlying communication graph topology on learning efficiency and show that our agents can significantly improve their performance compared to learning in isolation.



3. [CV] Compressing Volumetric Radiance Fields to 1 MB

L Li, Z Shen, Z Wang, L Shen, L Bo
[Alibaba Group]

Approximating radiance fields with volumetric grids is one of the promising directions for improving NeRF, represented by methods like Plenoxels and DVGO, which achieve super-fast training convergence and real-time rendering. However, these methods typically require a tremendous storage overhead, costing up to hundreds of megabytes of disk space and runtime memory for a single scene. We address this issue in this paper by introducing a simple yet effective framework, called vector quantized radiance fields (VQRF), for compressing these volume-grid-based radiance fields. We first present a robust and adaptive metric for estimating redundancy in grid models and performing voxel pruning by better exploring intermediate outputs of volumetric rendering. A trainable vector quantization is further proposed to improve the compactness of grid models. In combination with an efficient joint tuning strategy and post-processing, our method can achieve a compression ratio of 100× by reducing the overall model size to 1 MB with negligible loss in visual quality. Extensive experiments demonstrate that the proposed framework is capable of achieving unrivaled performance and good generalization across multiple methods with distinct volumetric structures, facilitating the wide use of volumetric radiance field methods in real-world applications.
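
The basic idea behind vector quantization can be sketched in a few lines: each feature vector is replaced by the index of its nearest entry in a small codebook, so only the indices and the codebook need to be stored. This is plain nearest-neighbour quantization with a fixed, hand-written codebook. It is not the trainable quantization, pruning metric, or joint tuning described in the abstract, and all names and values here are illustrative assumptions.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def quantize(features, codebook):
    """Replace every feature vector by the index of its nearest code."""
    return [nearest_code(f, codebook) for f in features]


def dequantize(indices, codebook):
    """Recover approximate features by looking the indices up again."""
    return [codebook[i] for i in indices]


# Hypothetical 2-D voxel features and a 3-entry codebook
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
voxel_features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]

indices = quantize(voxel_features, codebook)   # [0, 1, 2, 1]
restored = dequantize(indices, codebook)
```

Storage drops from one float vector per voxel to one small integer per voxel plus the shared codebook, at the cost of the approximation error between `voxel_features` and `restored`.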



4. [LG] Adapting protein language models for rapid DTI prediction

S Sledzieski, R Singh, L Cowen, B Berger
[MIT & Tufts University]

We consider the problem of sequence-based drug-target interaction (DTI) prediction, showing that a straightforward deep learning architecture that leverages pre-trained protein language models (PLMs) for protein embedding outperforms state-of-the-art approaches, achieving higher accuracy, expanded generalizability, and an order of magnitude faster training. PLM embeddings are found to contain general information that is especially useful in few-shot (small training data set) and zero-shot instances (unseen proteins or drugs). Additionally, the PLM embeddings can be augmented with features tuned by task-specific pre-training, and we find that these task-specific features are more informative than baseline PLM features. We anticipate such transfer learning approaches will facilitate rapid prototyping of DTI models, especially in low-N scenarios.



5. [LG] Continuous Neural Algorithmic Planners

Y He, P Veličković, P Liò, A Deac
[University of Cambridge & DeepMind & Université de Montréal]

Neural algorithmic reasoning studies the problem of learning algorithms with neural networks, especially with graph architectures. A recent proposal, XLVIN, reaps the benefits of using a graph neural network that simulates the value iteration algorithm in deep reinforcement learning agents. It allows model-free planning without access to privileged information about the environment, which is usually unavailable. However, XLVIN only supports discrete action spaces, and is hence nontrivially applicable to most tasks of real-world interest. We expand XLVIN to continuous action spaces by discretization, and evaluate several selective expansion policies to deal with the large planning graphs. Our proposal, CNAP, demonstrates how neural algorithmic reasoning can make a measurable impact in higher-dimensional continuous control settings, such as MuJoCo, bringing gains in low-data settings and outperforming model-free baselines.
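
To see why discretizing a continuous action space leads to large planning graphs, consider the simplest possible scheme: a uniform grid with the same number of points along every action dimension. The abstract does not specify how CNAP discretizes actions, so the uniform grid, the function name `discretize_actions`, and the bounds below are assumptions made purely for illustration.

```python
from itertools import product


def discretize_actions(low, high, bins):
    """Uniform grid over a box-shaped action space.

    low, high: per-dimension bounds; bins: grid points per dimension (>= 2),
    with both bounds included.
    """
    axes = []
    for lo, hi in zip(low, high):
        step = (hi - lo) / (bins - 1)
        axes.append([lo + i * step for i in range(bins)])
    # Cartesian product of the per-dimension grids
    return [list(a) for a in product(*axes)]


# Two action dimensions, three points each: 3**2 = 9 discrete actions
actions = discretize_actions(low=[-1.0, -1.0], high=[1.0, 1.0], bins=3)

# Six action dimensions with the same resolution: 3**6 = 729 discrete actions
many = discretize_actions(low=[-1.0] * 6, high=[1.0] * 6, bins=3)
```

The number of discrete actions grows as `bins ** dims`, so every node in a planning graph gains that many children, which is consistent with the abstract's need for selective expansion policies.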





[CV] OpenScene: 3D Scene Understanding with Open Vocabularies

S Peng, K Genova, C M Jiang, A Tagliasacchi, M Pollefeys, T Funkhouser
[Google Research & Waymo LLC & ETH Zurich]

[CV] NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views

D Xu, Y Jiang, P Wang, Z Fan, Y Wang, Z Wang
[University of Texas at Austin]


[LG] Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing

N Ruiz, S A Bargal, C Xie...
[Boston University & Georgetown University & University of California, Santa Cruz]


[CL] Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources

X V Yu, A Asai, T Chatterjee, J Hu, E Choi
[University of Washington & The University of Texas at Austin & The University of Wisconsin-Madison]



