The amount of money and energy needed to train AI models has become a hot-button issue as the models grow in size. Leaders in the AI field have been pouring money into training ever-larger models since GPT-3 demonstrated the considerable performance gains that can be achieved merely by increasing model size. However, this approach is prohibitively expensive, demands tremendous computational resources, and consumes enormous amounts of energy. It is increasingly recognized as a problem, not just because of the environmental implications, but also because it makes it harder for smaller AI companies to compete, concentrating power in the hands of industry titans. A new technique that rewrites one of the discipline’s core building blocks could offer a workaround.
Oxford University researchers have proposed a novel method that might cut training time by as much as half. It does so by replacing backpropagation, one of the essential components of today’s neural network-based AI systems. Backpropagation has long been a mainstay of machine learning for computing the gradients of objective functions during optimization. Also known as reverse-mode differentiation, it is a special case within the general family of automatic differentiation (AD) algorithms, which also includes forward-mode differentiation. The researchers designed a method for computing gradients based purely on the directional derivative, which can be evaluated exactly and efficiently in forward mode. The result, known as the forward gradient, is an unbiased estimate of the gradient that can be evaluated in a single forward run of the function, entirely removing the need for backpropagation in gradient descent.
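The unbiasedness claim can be written out in two lines. The notation here (f for the objective, θ for the parameters, v for a random direction) is assumed for illustration rather than quoted from the researchers, and a standard normal v is one choice that satisfies the requirement of independent, zero-mean, unit-variance components.

```latex
% Assumption: v has independent, zero-mean, unit-variance components,
% so that E[v v^T] = I.
g(\theta) = \bigl(\nabla f(\theta) \cdot v\bigr)\, v, \qquad v \sim \mathcal{N}(0, I)

\mathbb{E}\bigl[g(\theta)\bigr] = \mathbb{E}\bigl[v v^{\top}\bigr]\, \nabla f(\theta) = \nabla f(\theta)
```

The scalar ∇f(θ)·v is the directional derivative that forward mode computes alongside f(θ), so no backward pass is required. The price is variance: a single v probes only one direction, so any individual estimate can be far from the true gradient even though the average is correct.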
In effect, the algorithm makes educated guesses on the forward pass about how the weights will need to be adjusted. It turns out that these approximations are close enough to achieve backpropagation-like performance. The researchers demonstrated that the forward gradient method can be used to train a variety of machine learning models and that, because it requires only a forward pass, it can in some cases cut training times by up to half.
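To make the idea concrete, here is a minimal sketch of gradient descent driven by forward gradients, written in plain Python. It is an illustration under stated assumptions, not the researchers' implementation: the `Dual` class, the `jvp` and `forward_gradient` helpers, the toy `loss`, and the step size are all hypothetical choices made for this example. Dual numbers carry a value and a directional derivative together, so one forward evaluation yields both.

```python
import random

class Dual:
    """Dual number: a value plus its directional derivative (tangent)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule applied to the tangent part.
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def jvp(f, theta, v):
    """One forward pass: returns f(theta) and the derivative of f along v."""
    out = f([Dual(t, d) for t, d in zip(theta, v)])
    return out.val, out.tan

def forward_gradient(f, theta, rng):
    """Gradient estimate (grad f . v) v for a random direction v."""
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    value, dir_deriv = jvp(f, theta, v)
    return value, [dir_deriv * vi for vi in v]

def loss(theta):
    # Hypothetical toy objective with its minimum at (3, -1).
    x, y = theta
    return (x - 3.0) * (x - 3.0) + 2.0 * (y + 1.0) * (y + 1.0)

def train(steps=1000, lr=0.02, seed=0):
    """Plain gradient descent, with the forward gradient replacing backprop."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        _, g = forward_gradient(loss, theta, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

if __name__ == "__main__":
    print(train())  # should land close to the minimum at (3, -1)
```

Each step is noisy because it follows a single random direction, but the noise shrinks as the gradient shrinks, so on this small quadratic the parameters still settle at the minimum. Whether that trade-off pays off on large networks is the empirical question the research addresses.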
[Figures: forward-mode AD and reverse-mode AD]
Both types of AD have runtime costs bounded by a constant multiple of the time it takes to evaluate the function f. Reverse mode is more expensive than forward mode because it entails reversing the flow of data and keeping a record (a “tape,” stack, or graph) of the results of operations encountered in the forward pass, which are needed to evaluate derivatives in the backward pass. The exact memory and computation costs depend on the features of the AD system, such as sparsity exploitation and checkpointing.
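The bookkeeping that reverse mode requires is easier to see in a small sketch. The following plain-Python example is a deliberately minimal tape-based reverse mode for scalars. The `Var` class, the `reverse_gradient` helper, and the function `f` are hypothetical names invented for this illustration, and real AD systems are far more elaborate.

```python
class Var:
    """Scalar that records every operation on a shared tape in the forward pass."""
    def __init__(self, val, tape, parents=()):
        self.val = val
        self.tape = tape
        self.parents = parents  # pairs of (input node, local partial derivative)
        self.grad = 0.0
        tape.append(self)       # the record reverse mode has to keep in memory

    def __add__(self, other):
        return Var(self.val + other.val, self.tape,
                   ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.val * other.val, self.tape,
                   ((self, other.val), (other, self.val)))

def reverse_gradient(f, inputs):
    """Returns f(inputs), the full gradient, and the number of tape entries."""
    tape = []
    xs = [Var(x, tape) for x in inputs]
    out = f(xs)                    # forward pass: fills the tape
    out.grad = 1.0
    for node in reversed(tape):    # backward pass: replay the tape in reverse
        for parent, local in node.parents:
            parent.grad += local * node.grad
    return out.val, [x.grad for x in xs], len(tape)

def f(xs):
    # Hypothetical example function: f(x, y) = x*y + x*x
    x, y = xs
    return x * y + x * x

if __name__ == "__main__":
    print(reverse_gradient(f, [2.0, 3.0]))
```

The tape grows with every operation in the forward pass and must stay in memory until the backward pass has consumed it. Forward mode, by contrast, carries each derivative along with its value and can discard intermediate results as soon as they have been used.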
The study could also help to solve a long-standing puzzle about human intelligence. Artificial neural networks remain one of the most powerful tools for studying how the brain learns. However, backpropagation has long been regarded as biologically implausible because of the apparent lack of backward communication between neurons. A learning rule that needs only forward passes sidesteps that objection. Forward AD also offers computational properties that the ML community is eager to investigate, and its inclusion in mainstream ML infrastructure could lead to significant breakthroughs and novel techniques.