Swiss AI Lab Team Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

A research team from IDSIA, a Swiss AI lab, uses fast weight programmers (FWPs) to improve linear transformers, investigating the relationship between linearised transformers and outer product-based FWPs to unlock the potential of improved FWPs. Inspired by a formal equivalence between today’s linear transformers and the FWPs of the 1990s, the team proposes recurrent FWPs (RFWPs), a new approach that can perform better than linear and regular transformers on code execution and sequential ListOps tasks.

Despite the strong results of transformer architectures on sequence-processing tasks, the inputs they can handle are limited. This is because the time and space complexity of transformers is quadratic in the length of the input sequence. Furthermore, because the state size of auto-regressive transformers grows linearly with sequence length, they are infeasible in auto-regressive settings that involve extremely long or potentially unbounded sequences.

Recent research has sought to scale transformers to longer sequences by linearizing the softmax, resulting in linear transformers with constant memory size and linear time complexity. The key contributions of the new work are as follows:

  • From the perspective of FWPs, they investigate novel, more powerful FWPs for sequence processing, showing that neural networks (NNs) can readily learn to program NNs that are more sophisticated than a single feedforward layer.
  • In terms of Transformer models, their RFWPs add recurrence to linear Transformers, overcoming the limits of conventional auto-regressive Transformer models.

The weights in traditional neural networks are fixed after training. The goal of fast weights, on the other hand, is to make the weights of a network flexible and input-dependent. In the early 1990s, Jürgen Schmidhuber presented context-dependent FWPs as two-network systems consisting of a slow net and a fast net, each of which can have an arbitrary architecture. In this design, the slow network is trained by gradient descent to generate fast, context-dependent weights for the fast network. In simpler words, the slow network learns to program a corresponding fast network.
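
The two-network idea can be illustrated with a minimal NumPy sketch. Everything named here (`W_slow`, `slow_net`, `fast_net`, the dimensions) is hypothetical, the slow weights are random rather than trained, and generating the full fast weight matrix directly from the context is only one possible weight-generation scheme, chosen for brevity rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

# Slow weights: these would be learned by gradient descent and then kept fixed.
# Here they are simply random, for illustration only.
W_slow = rng.normal(size=(d_out * d_in, d_in))

def slow_net(context):
    """Map a context vector to a full fast weight matrix (assumed scheme)."""
    return (W_slow @ context).reshape(d_out, d_in)

def fast_net(W_fast, x):
    """Apply the generated, context-dependent weights to an input."""
    return np.tanh(W_fast @ x)

context_a = rng.normal(size=d_in)
context_b = rng.normal(size=d_in)
x = rng.normal(size=d_in)

# Same input, same slow weights, but different contexts give different
# fast weights and therefore different outputs.
y_a = fast_net(slow_net(context_a), x)
y_b = fast_net(slow_net(context_b), x)
```

The point of the sketch is only that the fast network has no fixed weights of its own: what it computes for `x` depends on the context the slow network has seen.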

A transformer is a deep learning (DL) model that uses the attention mechanism to weigh the significance of different parts of the input data. Linear transformers are transformers in which a kernel function replaces the softmax. According to previous research, self-attention in such a model can be rewritten as a basic outer product-based FWP. Linearised transformers are therefore formally equivalent to outer product-based weight generation.
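
The equivalence can be checked numerically with a small sketch. It assumes a single attention head, causal (auto-regressive) attention, and ELU + 1 as the kernel feature map; the function names and dimensions are illustrative and not taken from the paper's code.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a positive feature map standing in for the kernel (assumed).
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Attention view: at each step, re-weight all past values explicitly."""
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        scores = phi(K[: t + 1]) @ phi(Q[t])        # one score per past step
        out[t] = (scores @ V[: t + 1]) / scores.sum()
    return out

def fast_weight_form(Q, K, V):
    """FWP view: keep a fixed-size fast weight matrix, updated by outer products."""
    d_k, d_v = K.shape[1], V.shape[1]
    W = np.zeros((d_v, d_k))   # fast weights
    z = np.zeros(d_k)          # running sum of keys for the normaliser
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        k, q = phi(K[t]), phi(Q[t])
        W += np.outer(V[t], k)  # "program" the fast weights
        z += k
        out[t] = (W @ q) / (z @ q)
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 5
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

out_attn = causal_linear_attention(Q, K, V)
out_fwp = fast_weight_form(Q, K, V)
print(np.allclose(out_attn, out_fwp))
```

The two functions should agree up to floating-point error. The second one never stores past keys or values, which is where the constant memory size of linear transformers comes from.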

The researchers first present FWPs with recurrent fast and slow nets, focusing on outer product-based weight generation. They create a fast weight RNN called the Delta RNN by adding a recurrent term to the feedforward fast network of the linear transformer. Starting from the Delta Net, a linear transformer variant whose fast weights are updated with the delta rule, they make the slow network depend on the previous output of the fast network and call the result the Recurrent Delta Net (RDN).
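
A rough sketch of the RDN idea follows, under several assumptions: a single layer, a delta-rule update of the form W ← W + σ(β)(v − Wk) ⊗ k, tanh applied to the fed-back output, and a sum-normalised ELU + 1 feature map. The function `recurrent_delta_net` and all parameter names are hypothetical, and the paper and its repository should be consulted for the exact formulation.

```python
import numpy as np

def phi(x):
    # Positive feature map (ELU + 1), normalised to sum to 1 (assumed choice).
    f = np.where(x > 0, x + 1.0, np.exp(x))
    return f / f.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_delta_net(X, W_slow, R_slow, d_key, d_out):
    """Run a hypothetical RDN-style layer over a sequence X of shape (T, d_in)."""
    W = np.zeros((d_out, d_key))   # fast weights
    y = np.zeros(d_out)            # previous output of the fast net
    outputs = []
    for x in X:
        # The slow net sees the current input and the previous fast-net output.
        h = W_slow @ x + R_slow @ np.tanh(y)
        q, k, v, beta = np.split(h, [d_key, 2 * d_key, 2 * d_key + d_out])
        q, k = phi(q), phi(k)
        v_old = W @ k                                    # value currently stored under k
        W = W + sigmoid(beta) * np.outer(v - v_old, k)   # delta-rule update
        y = W @ q                                        # fast net output
        outputs.append(y)
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_in, d_key, d_out, T = 5, 4, 3, 6
d_slow = 2 * d_key + d_out + 1     # query, key, value and a scalar beta

X = rng.normal(size=(T, d_in))
W_slow = rng.normal(size=(d_slow, d_in))
R_slow = rng.normal(size=(d_slow, d_out))

Y_rdn = recurrent_delta_net(X, W_slow, R_slow, d_key, d_out)
# With the recurrent slow weights set to zero, the feedback path disappears
# and the layer reduces to a plain delta-rule fast weight layer.
Y_delta = recurrent_delta_net(X, W_slow, np.zeros_like(R_slow), d_key, d_out)
```

Both runs produce the same first output, since the fed-back output starts at zero, and diverge afterwards once the slow network starts to see what the fast network produced. The Delta RNN variant, which instead adds a recurrent term inside the fast network, is not shown here.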

The models were first evaluated on a generic language modeling task to obtain a performance overview, and then tested on two synthetic algorithmic tasks, code execution and sequential ListOps, to compare their elementary sequence-processing abilities. Finally, they were used to replace LSTMs in reinforcement learning in 2D game environments. Across these experiments, the proposed architectures are reported to outperform the traditional ones.

Overall, the study indicates that the FWP frameworks of the 1990s have a strong relationship with modern architectures, paving the way for future research into new classes of recurrent transformers.

Code: https://github.com/IDSIA/recurrent-fwp

Paper: https://arxiv.org/pdf/2106.06295.pdf