Latest Machine Learning Research at Apple Considers the Learning of Logical Functions with Gradient Descent (GD) on Neural Networks

The pointer value retrieval (PVR) benchmark was recently introduced in the paper Pointer Value Retrieval: A novel benchmark for exploring the limitations of neural network generalization. The benchmark consists of a supervised learning task on MNIST digits that includes a ‘logical’ or reasoning component in label construction. The functions to be learned are defined on MNIST digits organized either sequentially or on a grid. A specific digit acts as a pointer to a subset of the other digits, and the label is generated by computing a logical/Boolean function on that subset.

Consider the PVR setting in string format, which takes a string of MNIST digits as input. The label of such a string is defined as follows: the first three bits (101 in this example) are read as a number in binary expansion and act as the pointer, and the pointer selects a window of a fixed length (two in this example) in the remaining bits. The pointer points explicitly to the first bit of the window. For simplicity, consider the case where only the digits 0 and 1 are used, as in the example below.

Illustration of a PVR function with a window size of 2. The first three bits are the pointer, pointing to a window in the following bits. In particular, the number given by the pointer bits in binary expansion indicates the position of the window’s first bit. The label is then generated by applying a predetermined aggregation function to the window bits (e.g., parity or majority vote).

To create the label in this example, one first looks at the 6th window of length 2, which contains the bits 11, and then applies some fixed function, such as the parity function (so the label is 0 in this example). The paper Pointer Value Retrieval specifies the PVR benchmark for matrices of digits rather than strings; the authors focus here on the string version, which captures all of the aims of the PVR study. The benchmark is meant to investigate the limits of deep learning on tasks beyond traditional image recognition, in particular the trade-off between memorizing and reasoning, by introducing a specific distribution shift at test time.
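The labeling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`pvr_label`, `parity`, `majority`) are hypothetical, the pointer value is treated as a zero-based offset into the bits that follow the pointer (so pointer 101, i.e. 5, selects the 6th window), and the data bits are made up because the original figure is not reproduced here.

```python
from typing import Callable, Sequence


def parity(bits: Sequence[int]) -> int:
    """Return 1 if the number of ones is odd, else 0."""
    return sum(bits) % 2


def majority(bits: Sequence[int]) -> int:
    """Return 1 if strictly more than half of the bits are one (ties give 0)."""
    return int(2 * sum(bits) > len(bits))


def pvr_label(
    bits: Sequence[int],
    pointer_len: int = 3,
    window: int = 2,
    aggregate: Callable[[Sequence[int]], int] = parity,
) -> int:
    """Compute a PVR label for a binary string (illustrative sketch).

    The first `pointer_len` bits are read as a binary number that gives the
    zero-based start of a window in the remaining bits (an assumption).
    """
    pointer_bits = bits[:pointer_len]
    data = bits[pointer_len:]
    start = int("".join(str(b) for b in pointer_bits), 2)
    if start + window > len(data):
        raise ValueError("window runs past the end of the input")
    return aggregate(data[start:start + window])


# Pointer 101 (= 5) selects the window starting at the 6th data bit.
# The data bits below are invented for the example; that window is "11".
example = [1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
print(pvr_label(example))                      # parity of 1,1
print(pvr_label(example, aggregate=majority))  # majority of 1,1
```

Swapping the `aggregate` argument changes the logical component without touching the pointer logic, which mirrors how the benchmark separates "where to look" from "what to compute".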

To learn such PVR functions, one must learn both digit identification and the logical component applied to these digits. Handling both at the same time is harder than succeeding at the latter alone, assuming the former is already solved. The authors concentrate on the ‘logical component’ as an essential part of the task, which amounts to learning a Boolean function. The overall function in the PVR that maps an image’s pixels to its label is, of course, also a Boolean function (as is any computer-encoded function). Still, the structural properties of such meta-functions are more challenging to characterize and are left for future investigation.

In any event, they first examine the limits of deep learning on the logical/Boolean component in order to understand the constraints of deep learning on such benchmarks. They then explicitly restate one of the PVR benchmarks from the Pointer Value Retrieval paper, restricted to binary digits for simplicity. They use the alphabet {0, 1} to explain the benchmark and its connection to the MNIST dataset, but switch to the alphabet {+1, −1} to discuss the difficulty of learning Boolean functions with neural networks.
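One reason the {+1, −1} alphabet is convenient is that parity becomes a plain product of the inputs. The short check below illustrates this; the mapping 0 → +1, 1 → −1 and the helper names are assumptions made for the example, not taken from the paper.

```python
import itertools
import math


def to_pm(bit: int) -> int:
    """Map a bit in {0, 1} to {+1, -1} (assumed convention: 0 -> +1, 1 -> -1)."""
    return 1 - 2 * bit


def parity01(bits) -> int:
    """Parity (XOR) of bits in {0, 1}."""
    return sum(bits) % 2


def parity_pm(xs) -> int:
    """Parity in the {+1, -1} alphabet is the product of the coordinates."""
    return math.prod(xs)


# Exhaustively check that the two views of parity agree on 3-bit inputs.
for bits in itertools.product([0, 1], repeat=3):
    assert to_pm(parity01(bits)) == parity_pm([to_pm(b) for b in bits])
print("parity on {0,1} matches the product on {+1,-1}")
```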

The main contributions discussed in the paper are:

1. In the matched setting (i.e., train and test distributions are matching), they prove a lower bound on the generalization error for gradient descent (GD), which degrades for functions with high noise sensitivity (or low noise stability), thereby formalizing a conjecture proposed in the paper Pointer Value Retrieval. 

2. They hypothesize that (S)GD on the square loss with specific network architectures such as MLPs and Transformers has an implicit bias towards low-degree representations when learning logical functions such as Boolean PVR functions in the mismatched setting, specifically in the canonical holdout where a single feature is frozen at training and released to be uniformly distributed at testing. This adds to the research on implicit bias in GD on neural networks for the case of logical functions. They then show that the generalization error in the canonical holdout setting is given by the Boolean influence, a fundamental concept in Boolean Fourier analysis. 

3. They present experiments that support this hypothesis for various target functions and architectures, such as MLPs and Transformers, trained with mini-batch stochastic gradient descent.

4. They formalize the hypothesis for GD and linear regression models and run experiments to support it on multi-layer neural networks with large depth and small initialization scale.
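To make the Boolean influence from the second contribution concrete, the sketch below computes it by brute force for small functions on {+1, −1}^n and compares it with the squared error of a predictor that ignores the frozen coordinate. This is an illustration under assumptions: the helper names are hypothetical, the "ignore the frozen coordinate" predictor is one reading of the low-degree bias described above, and the factor relating the error to the influence depends on how the loss is normalized.

```python
import itertools


def influence(f, n: int, k: int) -> float:
    """Fraction of inputs in {+1,-1}^n where flipping coordinate k changes f."""
    changed = 0
    for x in itertools.product([1, -1], repeat=n):
        flipped = list(x)
        flipped[k] = -flipped[k]
        if f(x) != f(tuple(flipped)):
            changed += 1
    return changed / 2 ** n


def frozen_predictor(f, k: int):
    """Predictor that behaves as if x_k were always +1, as during training."""
    def g(x):
        y = list(x)
        y[k] = 1
        return f(tuple(y))
    return g


def mean_squared_error(f, g, n: int) -> float:
    """Average of (f(x) - g(x))^2 over uniform inputs in {+1,-1}^n."""
    cube = itertools.product([1, -1], repeat=n)
    return sum((f(x) - g(x)) ** 2 for x in cube) / 2 ** n


def parity3(x):
    return x[0] * x[1] * x[2]


def majority3(x):
    return 1 if sum(x) > 0 else -1


for name, f in [("parity", parity3), ("majority", majority3)]:
    inf = influence(f, 3, 0)
    err = mean_squared_error(f, frozen_predictor(f, 0), 3)
    # For +/-1-valued f, this unnormalized squared error equals 2 * influence.
    print(name, "influence:", inf, "test error:", err)
```

In this toy setting, parity (every coordinate always matters) suffers the largest error when a coordinate is frozen during training, while majority is affected less, consistent with the idea that the holdout error tracks the influence of the frozen feature.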

The PyTorch implementation accompanying the paper is available on GitHub.

This article is written as a research summary by Marktechpost Staff based on the research paper 'Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures'. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
