Professor Tomaso Poggio of MIT grew up in Italy and is now 73 years old; he is a leading figure in machine learning and computer vision. As early as the 1970s he collaborated with David Marr, Geoffrey Hinton, and others, and he was the mentor of Christof Koch of the Allen Institute, Mobileye founder Amnon Shashua, and DeepMind founder Demis Hassabis. His Google Scholar citation count stands at about 119,000, with an h-index of 143.
In recent years he has remained remarkably active, working with students such as Qianli Liao on the theory of deep learning. This article is a summary of that work, published in the December 2020 issue of PNAS (Proceedings of the National Academy of Sciences).
Abstract:
While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about its approximation power, the dynamics of optimization, and good out-of-sample performance despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory, both shallow and deep networks are known to approximate any continuous function, but at an exponential cost. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality. In characterizing minimization of the empirical exponential loss, we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks. The dynamics of the normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit-norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions during gradient flow. As a consequence, the critical points correspond to minimum-norm infima of the loss. This result is especially relevant because it has recently been shown that, for overparameterized models, selecting a minimum-norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Thus our results imply that gradient descent in deep networks minimizes the expected error.
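To make the implicit-regularization claim more concrete, here is a minimal numerical sketch (not taken from the paper, and using a linear model on toy data rather than a deep network): under an exponential-type loss on separable data, plain gradient descent lets the weight norm grow without bound while the weight direction converges, which is the kind of normalized quantity the abstract's analysis tracks.

```python
# Minimal sketch: gradient descent on the exponential loss for a linearly
# separable 2-D toy problem. The weight norm keeps growing, but the direction
# w / ||w|| stabilizes, illustrating the implicit bias toward a minimum-norm
# (for the unnormalized weights, maximum-margin) separator.
import numpy as np

rng = np.random.default_rng(0)

# Two separable point clouds with labels +1 / -1 (toy data for illustration).
X = np.vstack([rng.normal(loc=+2.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=-2.0, scale=0.5, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = rng.normal(size=2) * 0.01          # unregularized linear model f(x) = w . x
lr = 0.1

for step in range(20001):
    margins = y * (X @ w)
    loss = np.exp(-margins).mean()     # exponential-type loss
    grad = -(np.exp(-margins)[:, None] * y[:, None] * X).mean(axis=0)
    w -= lr * grad
    if step % 5000 == 0:
        print(f"step {step:6d}  loss {loss:.3e}  "
              f"||w|| {np.linalg.norm(w):.3f}  direction {w / np.linalg.norm(w)}")
```

In this toy run the loss keeps shrinking and the norm of w keeps growing, but the printed direction settles down; the paper's move of studying the gradient flow of normalized weights under a unit-norm constraint is what makes such a limiting direction the meaningful object for classification.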