The article's title is: 26 March 1991: Neural nets learn to program neural nets with fast weights—like today's Transformer variants. 2021: New stuff!

Of course, as always, this claim has stirred up all kinds of controversy.

Abstract. How can artificial neural networks (NNs) process sequential data such as videos, speech, and text? Traditionally this is done with recurrent NNs (RNNs) that learn to remember past observations. Exactly 3 decades ago, however, a now very popular alternative to RNNs was published.[FWP0-1] A feedforward NN slowly learns by gradient descent to program the changes of the fast weights of another NN (see Sec. 1). Such Fast Weight Programmers (FWPs) can learn to memorize past data, too. In 1991, one of them[FWP0-1] computed its fast weight changes through additive outer products of self-invented activation patterns (now often called keys and values for self-attention; Sec. 2). The very similar Transformers[TR1-2] combine this with projections and softmax and are now widely used in natural language processing. For long input sequences, their efficiency was improved through linear Transformers or Performers[TR5-6] whose core is formally equivalent to the 1991 Fast Weight Programmers. In 1993, I introduced the attention terminology[FWP2] now used in this context[ATT] (Sec. 4), and extended the approach to RNNs that program themselves (Sec. 3). FWPs can solve the famous vanishing gradient problem aka deep learning problem (analyzed a few months later in 1991[VAN1]) through additive fast weight changes (Sec. 5). This is symmetric to the additive neural activations of LSTMs / Highway Nets / ResNets[HW1-3] (Sec. 5) which also have roots in 1991—the Annus Mirabilis of deep learning.[MIR] In 2021, we introduced a brand new, improved version[FWP6] of the 1991 fast weight update rule (Sec. 6). As an addendum, I also review our FWPs for reinforcement learning through neuroevolution[FWP5] (2005-, Sec. 7) and for metalearning machines that learn to learn[FWPMETA1-7] (1992-, Sec. 8).
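The abstract's claim that additive outer-product fast weights are formally equivalent to the core of linear Transformers can be checked numerically. The following is a minimal sketch, not the original 1991 system: the function names are hypothetical, random vectors stand in for the keys, values, and queries that a slow network would generate, and the normalisation and feature maps used in practical linear Transformers are left out (an assumption made to keep the comparison short).

```python
import random

def outer(v, k):
    """Outer product v k^T, returned as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]

def matvec(W, q):
    """Matrix-vector product W q."""
    return [sum(wij * qj for wij, qj in zip(row, q)) for row in W]

def fast_weight_outputs(keys, values, queries):
    """Additive fast weights: W_t = W_{t-1} + v_t k_t^T, then y_t = W_t q_t."""
    d_out, d_in = len(values[0]), len(keys[0])
    W = [[0.0] * d_in for _ in range(d_out)]  # fast weight matrix starts at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        update = outer(v, k)
        W = [[w + u for w, u in zip(w_row, u_row)]
             for w_row, u_row in zip(W, update)]
        outputs.append(matvec(W, q))
    return outputs

def linear_attention_outputs(keys, values, queries):
    """Unnormalised causal linear attention: y_t = sum_{i<=t} v_i (k_i . q_t)."""
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = sum(ki * qi for ki, qi in zip(k, q))
            y = [yi + score * vi for yi, vi in zip(y, v)]
        outputs.append(y)
    return outputs

# Illustrative data only: random stand-ins for self-invented activation patterns.
random.seed(0)
T, d_k, d_v = 5, 3, 2
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
values = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
queries = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]

fw = fast_weight_outputs(keys, values, queries)
la = linear_attention_outputs(keys, values, queries)

# Largest absolute difference between the two formulations.
max_gap = max(abs(a - b) for ya, yb in zip(fw, la) for a, b in zip(ya, yb))
print(max_gap < 1e-9)
```

The fast weight version keeps one matrix of fixed size and updates it once per time step, while the attention version revisits all earlier keys and values for every query. Both compute the same sums, which is why the fast weight form scales linearly with sequence length.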
