Google Researchers Use Different Design Decisions To Make Transformers Solve Compositional NLP Tasks


At present, advanced neural network architectures achieve state-of-the-art (SOTA) performance on many complex natural language processing (NLP) tasks. However, they struggle to capture the compositional structure of natural language and therefore show poor compositional generalization. Compositional generalization is the ability to learn a set of basic primitives and combine them in more complex ways than those seen during training.

A new Google study investigates the design space of transformer models to solve natural language compositional tasks. The team proposes that different design decisions provide inductive biases that enable models to generalize to certain symmetries in input data. This approach has been observed to significantly improve compositional generalization in language and algorithmic tasks. 

The resulting models are reported to achieve state-of-the-art performance on semantic parsing compositional generalization and string edit operation composition benchmarks.

In this research, the team focuses on a conventional transformer model with an encoder and a decoder. Given an input sequence of tokens, the network generates an output sequence one token at a time, choosing each token based on the decoder’s output distribution.
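The token-by-token generation described above can be sketched as a small decoding loop. This is a minimal illustration rather than the study's implementation: it assumes greedy selection (always taking the most probable token), and `greedy_decode`, `toy_copy_model`, and the token ids are hypothetical names and values introduced only for this example.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_probs: Callable[[Sequence[int], Sequence[int]], List[float]],
    source: Sequence[int],
    eos_id: int,
    max_len: int = 32,
) -> List[int]:
    """Generate output tokens one at a time from a decoder distribution.

    Assumption: greedy selection; a real system might sample or use beam search.
    """
    output: List[int] = []
    for _ in range(max_len):
        # Distribution over the vocabulary, given the input and the tokens so far.
        probs = next_token_probs(source, output)
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == eos_id:
            break  # The end-of-sequence token terminates generation.
        output.append(token)
    return output


def toy_copy_model(source, prefix, vocab_size=5, eos_id=0):
    """Hypothetical stand-in for a trained decoder: copies the source, then emits EOS."""
    target = source[len(prefix)] if len(prefix) < len(source) else eos_id
    probs = [0.05] * vocab_size
    probs[target] = 0.8
    return probs


decoded = greedy_decode(toy_copy_model, source=[3, 1, 4], eos_id=0)
print(decoded)
```

With a real model, `next_token_probs` would run the encoder on the input and the decoder on the tokens generated so far.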

Many earlier works have treated compositional generalization as a general out-of-distribution generalization problem. Building on this view, the Google researchers propose that different transformer architectural choices give models distinct inductive biases, making them more or less likely to discover the symmetries in the data.

The researchers evaluate the transformer’s compositional generalization abilities under different architectural configurations. The design dimensions they vary include:

  1. The type of position encodings
  2. The use of copy decoders
  3. Model size
  4. Weight sharing
  5. The use of intermediate representations for prediction
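One way to picture this design space is as a configuration object with one field per dimension in the list above. The sketch below is purely illustrative: the field names, the option values (such as "absolute" and "relative" position encodings or the layer counts), and the idea of enumerating a full grid are assumptions made for this example, not details taken from the study.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class TransformerConfig:
    """Hypothetical container for the five design dimensions listed above."""
    position_encoding: str       # assumed options: "absolute" or "relative"
    use_copy_decoder: bool       # whether the decoder can copy input tokens
    num_layers: int              # stand-in for "model size"
    share_layer_weights: bool    # whether layers reuse the same parameters
    use_intermediate_repr: bool  # predict via an intermediate representation


def config_grid() -> List[TransformerConfig]:
    """Enumerate every combination of the assumed option values."""
    options = product(
        ("absolute", "relative"),
        (False, True),
        (2, 4, 6),
        (False, True),
        (False, True),
    )
    return [TransformerConfig(*combo) for combo in options]


grid = config_grid()
print(len(grid), grid[0])
```

Each configuration would then be trained and scored on the same benchmarks, so that differences in accuracy can be attributed to the design choice that changed.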

They use sequence-level accuracy as the evaluation metric.
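Sequence-level accuracy counts a prediction as correct only if the entire output sequence matches the target, so a single wrong token makes the whole example count as an error. A minimal sketch follows, assuming exact token-by-token matching; the function name and the toy data are illustrative and not taken from the study.

```python
from typing import Sequence


def sequence_level_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[Sequence[str]],
) -> float:
    """Fraction of examples whose predicted sequence equals the target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0  # Convention chosen here for an empty evaluation set.
    exact_matches = sum(
        list(pred) == list(gold) for pred, gold in zip(predictions, targets)
    )
    return exact_matches / len(targets)


preds = [["jump", "twice"], ["walk", "left"], ["run"]]
golds = [["jump", "twice"], ["walk", "right"], ["run"]]
score = sequence_level_accuracy(preds, golds)
print(score)
```

In the toy data, the second prediction differs from its target in one token, so two of the three examples count as correct.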


The results demonstrate that changing these design decisions can raise the transformer’s accuracy to as much as 0.527. The team achieves SOTA performance on the COGS dataset with a classification accuracy of 0.784. Furthermore, the model achieves SOTA results on the productivity and systematicity splits of PCFG, with accuracies of 0.634 and 0.828, respectively.