This list of papers was compiled by Louis Bouchard, a PhD student at Mila, and is on the whole quite reliable. The accompanying GitHub repository also offers short videos, written explainers, and code links for many of the papers.

In the list below we have added each paper's main contributing institutions (institutions that did contribute but appear so far down the author list as to smell of courtesy authorship are not counted). The tally seems to reflect each company's standing in the AI field:

  • Tier 1: Google with 8 papers and Meta with 6 hold the top two spots. OpenAI has only 3, but two of them (DALL·E 2 and ChatGPT) were enormously influential; judged by flagship work, it arguably does not lose to the two giants.
  • Tier 2: NVIDIA with 2.5 papers.
  • Tier 3: domestically, Tencent, Baidu, and Microsoft (from MSR Asia) with 1 paper each; abroad, Samsung and Disney with 1 each. Snap and Adobe each have 0.5.

Universities account for 5.5 papers in total, fewer than either of the two giants alone, which looks rather pale by comparison. Among them:

  • Tel Aviv leads with 1.5 papers, but Munich's Stable Diffusion was so influential that Munich should be counted as tier 1.
  • CMU and Nanyang Technological University with 1 paper each, tier 2.
  • USC and Berkeley with 0.5 each, tier 3.

By topic, large models, text-to-image, and cross-modal work were this year's undisputed hot spots; there were also several papers in vision areas such as GANs.

[1] Samsung: Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K. and Lempitsky, V., 2022. Resolution-robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2149–2159). https://arxiv.org/pdf/2109.07161.pdf

[2] Tel Aviv: Tzaban, R., Mokady, R., Gal, R., Bermano, A.H. and Cohen-Or, D., 2022. Stitch it in Time: GAN-Based Facial Editing of Real Videos. https://arxiv.org/abs/2201.08361

[3] USC & Snap: Kuang, Z., Olszewski, K., Chai, M., Huang, Z., Achlioptas, P. and Tulyakov, S., 2022. NeROIC: Neural Rendering of Objects from Online Image Collections. https://arxiv.org/pdf/2201.02533.pdf

[4] Google: Borsos, Z., Sharifi, M. and Tagliasacchi, M., 2022. SpeechPainter: Text-conditioned Speech Inpainting. https://arxiv.org/pdf/2202.07273.pdf

[5] Tencent: Wang, X., Li, Y., Zhang, H. and Shan, Y., 2021. Towards Real-World Blind Face Restoration with Generative Facial Prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9168–9178). https://arxiv.org/pdf/2101.04061.pdf

[6] Google: Piergiovanni, A.J., Casser, V., Ryoo, M.S. and Angelova, A., 2021. 4D-Net for learned multi-modal alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15435–15445), https://openaccess.thecvf.com/content/ICCV2021/papers/Piergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf.

[7] NVIDIA: Müller, T., Evans, A., Schied, C. and Keller, A., 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf

[8] OpenAI/DALL·E 2: Ramesh et al., 2022, Hierarchical Text-Conditional Image Generation with CLIP Latents, https://cdn.openai.com/papers/dall-e-2.pdf

[9] Google: Nitzan, Y., Aberman, K., He, Q., Liba, O., Yarom, M., Gandelsman, Y., Mosseri, I., Pritch, Y. and Cohen-Or, D., 2022. MyStyle: A Personalized Generative Prior. arXiv preprint arXiv:2203.17272.

[10] Meta/OPT: Zhang, Susan et al. OPT: Open Pre-trained Transformer Language Models. https://arxiv.org/abs/2205.01068

[11] Berkeley & Adobe: Epstein, D., Park, T., Zhang, R., Shechtman, E. and Efros, A.A., 2022. BlobGAN: Spatially Disentangled Scene Representations. arXiv preprint arXiv:2205.02837.

[12] Google DeepMind: Reed S. et al., 2022. Gato - A generalist agent,  https://www.deepmind.com/publications/a-generalist-agent 

[13] Google/Imagen: Saharia et al., 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. https://imagen.research.google/  

[14] Craiyon: Dayma et al., 2021. DALL·E Mini. doi:10.5281/zenodo.5146400. GitHub (a reproduction of DALL·E; only technical reports are available, no formal paper found)

[15] Meta: NLLB Team et al., 2022, No Language Left Behind: Scaling Human-Centered Machine Translation. https://arxiv.org/abs/2207.04672

[16] CMU: Sheinin, M., Chan, D., O'Toole, M. and Narasimhan, S.G., 2022. Dual-Shutter Optical Vibration Sensing. Proc. IEEE CVPR. https://imaging.cs.cmu.edu/vibration/ (CVPR 2022 best paper finalist)

[17] Meta: Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022. Make-a-scene: Scene-based text-to-image generation with human priors. https://arxiv.org/pdf/2203.13131.pdf

[18] Meta: Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A. and Joo, H., 2022. Banmo: Building animatable 3d neural models from many casual videos. In CVPR2022 (pp. 2863-2873). https://arxiv.org/abs/2112.12761 

[19] Munich/Stable Diffusion: Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR 2022 (pp. 10684–10695). https://arxiv.org/pdf/2112.10752.pdf

[20] Nanyang Technological University: Yang, J., Ang, Y.Z., Guo, Z., Zhou, K., Zhang, W. and Liu, Z., 2022. Panoptic Scene Graph Generation. arXiv preprint arXiv:2207.11247.

[21] Tel Aviv & NVIDIA: Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G. and Cohen-Or, D., 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. https://textual-inversion.github.io/

[22] Microsoft: Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S. and Ling, H., 2022. Expanding Language-Image Pretrained Models for General Video Recognition. arXiv preprint arXiv:2208.02816.

[23] Meta/Make-A-Video: Singer et al., 2022. Make-A-Video: Text-To-Video Generation without Text-Video Data, https://makeavideo.studio/Make-A-Video.pdf

[24] OpenAI/Whisper: Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. and Sutskever, I., Robust Speech Recognition via Large-Scale Weak Supervision. GitHub

[25] Google: Poole, B., Jain, A., Barron, J.T. and Mildenhall, B., 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv preprint arXiv:2209.14988. https://dreamfusion3d.github.io/

[26] Google/Imagic: Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I. and Irani, M., 2022. Imagic: Text-Based Real Image Editing with Diffusion Models. arXiv preprint arXiv:2210.09276.

[27] NVIDIA: Balaji, Y. et al., 2022, eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, https://arxiv.org/abs/2211.01324

[28] Google: Li, Z., Wang, Q., Snavely, N. and Kanazawa, A., 2022. InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images. In ECCV (pp. 515–534). Springer, Cham, https://arxiv.org/abs/2207.11148

[29] Meta/Galactica: Taylor et al., 2022. Galactica: A Large Language Model for Science. https://galactica.org/

[30] Baidu: Tang, J., Wang, K., Zhou, H., Chen, X., He, D., Hu, T., Liu, J., Zeng, G. and Wang, J., 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. arXiv preprint arXiv:2211.12368.

[31] OpenAI/ChatGPT: ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/

[32] Disney/FRAN: Zoss et al., DisneyResearch, 2022. Production-Ready Face Re-Aging for Visual Effects. https://studios.disneyresearch.com

Which papers on this list do you agree with, which do you think do not deserve a best-of selection, and which important papers were left out? Comments are welcome.