Researchers From MIT and Harvard Finds When and How a Machine-Learning Model is Capable of Overcoming Dataset Bias

It is no doubt that machine learning has become an inseparable part of our daily lives. Today, ML algorithms recommend movies, goods to buy, and even someone to date. They demonstrate the advantages of algorithmic decision-making. Unlike individuals, machines do not get weary or bored, and they can consider orders of magnitude more factors than humans.

However, it is difficult to understand how ML models make decisions, and they are often biased. If the datasets are used to train machine-learning algorithms to contain biassed data, the system is likely to make decisions with the same bias in practice.

ML fairness is a relatively new branch of machine learning that looks at methods to ensure that data biases and model flaws don’t lead to models discriminating against people based on their ethnicity, gender, handicap, or sexual or political orientation.

In partnership with Harvard University and Fujitsu Ltd., MIT researchers conducted a study to understand when and how a machine-learning model may overcome this type of dataset bias. They focused on understanding how training data influences whether an artificial neural network can learn to recognize items it has never seen before using a neuroscience method.

Their findings reveal that training data diversity significantly impacts whether a neural network can overcome bias. However, dataset diversity can also worsen the network’s performance. They also demonstrate that how a neural network is trained and the precise sorts of neurons that arise throughout the training process can have a significant impact on whether or not it can overcome a biassed dataset.

The researchers looked at this problem of dataset bias from the perspective of neuroscientists. Neuroscience investigations commonly employ controlled datasets in which the researchers know as much as possible about the information it contains.

The team created datasets with photographs of various items in various positions. They carefully controlled the pairings such that some datasets were more diverse than others. A more diversified collection included more photos of things from various perspectives.

They then used this dataset to train a neural network for picture classification and later tested it to evaluate how well it could recognize items from angles the network had not seen during training (known as an out-of-distribution combination).

Their results show that the proposed network can better generalize new images or viewpoints if the dataset is more diverse. That is if more photographs depict objects from different perspectives.

In contrast to the typical ML training methods that train models to do numerous tasks at once, the researchers found that a model trained independently for each job was considerably better at overcoming bias than a model trained for both tasks simultaneously.

According to the team, neuron specialization plays a significant impact. When a neural network is trained to recognize objects in images, it appears that two types of neurons emerge: one that specializes in object category recognition and the other that specializes in object recognition.

The question now is how these neurons get there, given that they haven’t been taught which neurons to include in their architecture. The team believes that this is a promising research field, and they would like to see if they can get neurons with this specialization to develop in a neural network. They also intend to adapt their method to more difficult tasks, including recognizing objects with complex textures or varying lighting.