Facebook AI Research and Mila – McGill University Collaborate to Explore Anytime Learning at Macroscale (ALMA)

Source: https://arxiv.org/pdf/2106.09563.pdf

Large-scale training data is essential for building efficient deep learning models. Classical training frameworks assume that this data arrives all at once. In practice, however, data is often streamed to the learner one batch at a time, which creates a natural trade-off between a model's accuracy and the time needed to obtain it.

A team of researchers from Facebook AI Research and Mila – McGill University, in their paper ‘Anytime Learning at Macroscale’, explore the accuracy-versus-time trade-off of anytime learning. They call this setting Anytime Learning at Macroscale (ALMA).

The team contrasts two extremes. An eager model can produce non-trivial predictions early by training on data batches as soon as they become available, whereas a patient model that first aggregates batches into a larger dataset may deliver better accuracy, at the cost of a longer wait for a useful predictor.

The researchers have formalized the ALMA problem and introduced metrics to evaluate learners. They have also conducted empirical evaluations of several models that strike different trade-offs between accuracy and the time needed to obtain a useful predictor.

In the ALMA setting, the team assumes that data is presented to the learner as a stream of consecutive batches. They also assume that data arrives more slowly than the model can process it.
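To make the streaming setup concrete, here is a minimal sketch of a learner that updates once every `wait` arrivals. The function name `run_stream`, the `wait` parameter, and the choice to train only on the batches buffered since the last update are assumptions made for illustration, not the paper's implementation. Setting `wait=1` gives an eager learner, and larger values give a more patient one.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def run_stream(
    batches: Iterable[Sequence],
    train: Callable[[List], None],
    wait: int,
) -> List[Tuple[int, int]]:
    """Call `train` once every `wait` arrivals on the batches buffered since
    the last update. Returns (arrival_step, num_examples) for each update."""
    if wait < 1:
        raise ValueError("wait must be at least 1")
    buffer: List = []
    updates: List[Tuple[int, int]] = []
    step = 0
    for step, batch in enumerate(batches, start=1):
        buffer.extend(batch)
        if step % wait == 0:
            train(list(buffer))
            updates.append((step, len(buffer)))
            buffer.clear()
    if buffer:  # train on whatever is left when the stream ends
        train(list(buffer))
        updates.append((step, len(buffer)))
    return updates


# Hypothetical toy stream: six batches of two examples each.
stream = [[i, i + 1] for i in range(0, 12, 2)]
eager = run_stream(stream, train=lambda data: None, wait=1)    # six small updates
patient = run_stream(stream, train=lambda data: None, wait=3)  # two larger updates
```

The eager learner has a usable model after the first arrival but only ever sees two examples per update, while the patient learner waits three arrivals and then trains on six examples at once.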

The researchers further evaluated learners in the ALMA setting along three axes: accuracy, memory, and computation. They measure these quantities against time via the area under the curve, which accounts for both the model's final performance and its whole training trajectory over the sequence of large data batches.
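As a rough illustration of measuring a quantity against time via the area under the curve, the sketch below integrates an error curve with the trapezoidal rule. The function name, the trapezoidal rule, and the error values are all assumptions chosen for illustration. They are not the paper's exact metric definition or its results.

```python
from typing import Sequence


def area_under_curve(times: Sequence[float], values: Sequence[float]) -> float:
    """Trapezoidal area under a curve sampled at increasing time points."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (values[i] + values[i - 1]) * width
    return area


# Made-up test error rates after each arrival (lower is better).
times = [0, 1, 2, 3]
eager_error = [0.5, 0.3, 0.2, 0.2]    # improves early, plateaus
patient_error = [0.9, 0.9, 0.9, 0.1]  # no useful model until the end

eager_area = area_under_curve(times, eager_error)
patient_area = area_under_curve(times, patient_error)
```

In this toy example the patient learner ends with the lower final error, yet its area under the error curve is larger because it offers no useful predictor for most of the stream. That is the kind of trajectory effect a final-accuracy number alone would hide.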

The learning algorithms tested in the ALMA setting include Mixture of Experts (MoE), a family of methods in which multiple experts (learners) divide the problem space into homogeneous regions, and Growing MoE (gMoE), a simple extension of the one-layer MoE that can add capacity over time.
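The idea of a mixture of experts that can grow can be sketched as follows. This is a toy model with linear experts and a softmax gate. The class name `TinyMoE` and the growth rule (cloning an existing expert together with its gate weights) are assumptions for illustration and should not be read as the paper's gMoE procedure.

```python
import math
from typing import List, Sequence


class TinyMoE:
    """Soft mixture of linear experts: y = sum_k gate_k(x) * (w_k . x)."""

    def __init__(self, gate_weights: Sequence[Sequence[float]],
                 expert_weights: Sequence[Sequence[float]]) -> None:
        if len(gate_weights) != len(expert_weights):
            raise ValueError("need one gate vector per expert")
        self.gate_weights: List[List[float]] = [list(w) for w in gate_weights]
        self.expert_weights: List[List[float]] = [list(w) for w in expert_weights]

    @staticmethod
    def _dot(w: Sequence[float], x: Sequence[float]) -> float:
        return sum(a * b for a, b in zip(w, x))

    def gate(self, x: Sequence[float]) -> List[float]:
        """Softmax over per-expert gate scores (shifted for numerical stability)."""
        scores = [self._dot(w, x) for w in self.gate_weights]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def predict(self, x: Sequence[float]) -> float:
        gates = self.gate(x)
        outputs = [self._dot(w, x) for w in self.expert_weights]
        return sum(g * o for g, o in zip(gates, outputs))

    def grow(self, source: int) -> None:
        """Add one expert by cloning expert `source` (assumed growth rule)."""
        self.gate_weights.append(list(self.gate_weights[source]))
        self.expert_weights.append(list(self.expert_weights[source]))


moe = TinyMoE(gate_weights=[[1.0, 0.0], [0.0, 1.0]],
              expert_weights=[[2.0, 0.0], [0.0, 3.0]])
before = moe.predict([1.0, 1.0])  # two experts, equal gate weight
moe.grow(source=0)                # capacity increases to three experts
after = moe.predict([1.0, 1.0])
```

Note that cloning in this way shifts gate mass toward the copied expert, so the prediction changes immediately after growth. A real system would continue training the new expert on subsequent batches so that it specializes.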

The researchers performed experiments on the MNIST and CIFAR-10 datasets and on a collection of English-language texts consisting of books, Wikipedia, and Common Crawl to analyze the effects of receiving data over time. They also set out to determine which models strike the best trade-offs between accuracy, compute time, and memory usage.

The results demonstrate that models updating their parameters at an intermediate rate strike the best trade-off between accuracy and time. It is also shown that bigger models generalize better, and models that grow capacity over time can also generalize better, especially when the initial model is smaller.


Because ALMA resembles real-life learning situations, where the main aim is to solve a task efficiently even as more and more training data arrives, the researchers believe that it can contribute to ‘anytime learning’ and help researchers obtain better models.

Paper: https://arxiv.org/abs/2106.09563