IBM’s Approach Towards Preserving Adversarial Robustness of Machine Learning Systems


In the real world, machine learning (ML) systems can be vulnerable to adversarial attacks. Because algorithms take numeric vectors as inputs, an attacker can craft an input specifically designed to elicit the wrong answer from the model. Defending against such attacks means identifying vulnerabilities, anticipating new techniques, and developing robust models that function as well in the real world as they do in a sandbox.
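As a minimal illustration of how a crafted numeric input can flip a model's answer, here is a sketch of an FGSM-style perturbation against a toy linear classifier. All weights and values here are hypothetical, chosen only to make the effect visible:

```python
import numpy as np

# A toy linear classifier: predicts class 1 if w.x + b > 0.
w = np.array([1.0, -1.0])
b = 0.0

def predict(x):
    return int(w @ x + b > 0)

# A correctly classified input.
x = np.array([0.6, 0.4])        # w.x = 0.2 > 0 -> class 1
assert predict(x) == 1

# FGSM-style attack: step against the score's gradient, which for a
# linear model is simply w. The perturbation budget eps is made up.
eps = 0.3
x_adv = x - eps * np.sign(w)    # [0.3, 0.7]; w.x_adv = -0.4
assert predict(x_adv) == 0      # small change, flipped prediction
```

A perturbation of at most 0.3 per coordinate is enough to cross the decision boundary, which is exactly the kind of sensitivity adversarial robustness research tries to eliminate.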

Researchers at IBM aim to make AI hack-proof. This active study area tries to close the gap between AI model development and deployment by making models more resilient to adversity, whether inadvertent, as when data is corrupted, or purposeful, as when intruders deliberately attack ML models. Either can cause a model to produce inaccurate predictions or outcomes.

The research team aims to improve ML models’ adversarial robustness, making them more resistant to errors and attacks. Finding out where AI is susceptible, uncovering new risks, and bolstering machine learning approaches to weather a crisis are necessary steps in resolving this problem.


  • Poisoned Data:

One of the most severe threats to ML systems is the possibility of poisoning their training data. Unsupervised domain adaptation (UDA) is a machine learning method for transferring knowledge from a labeled source domain to an unlabeled target domain with a different data distribution. UDA methods work by reducing the disparity between the two domains' data distributions while respecting an upper bound on the error in the source domain. Because UDA methods are sensitive to these data distributions, they are open to adversarial assaults such as data poisoning. With merely 10% poisoned data, their accuracy in the target domain declines to nearly 0% in some circumstances. This spectacular failure exemplifies the limitations of UDA approaches.

  • Weight Perturbation:

Poisoning disrupts an ML model’s training data, but it’s not the only avenue of attack. Weights are the parameters of a learning algorithm, and changing them alters the model’s output. This weight sensitivity can be exploited for fault injection and erroneous prediction, making it a concern for adversarial robustness and security. The researchers devised methods to characterize neural network behavior in response to weight perturbation.
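A minimal sketch of weight-perturbation fault injection, using a hypothetical two-weight "network" with made-up values: corrupting a single weight flips the prediction on an unchanged input.

```python
import numpy as np

# A tiny fixed "network" (hypothetical weights): one linear layer
# followed by a sign readout.
W = np.array([[2.0, -1.0],
              [0.5,  1.5]])

def forward(W, x):
    return int((W @ x).sum() > 0)

x = np.array([1.0, 1.0])
assert forward(W, x) == 1          # (2 - 1) + (0.5 + 1.5) = 3 > 0

# Fault injection: corrupt one weight and the prediction flips,
# illustrating how weight sensitivity becomes an attack surface.
W_faulty = W.copy()
W_faulty[1, 1] = -4.0              # injected fault
assert forward(W_faulty, x) == 0   # (2 - 1) + (0.5 - 4) = -2.5 < 0
```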


  • Recovering Data by Reverse Engineering:

Vertical federated learning (VFL) is a new machine learning framework that allows a model to be trained utilizing data from multiple sources on the same set of subjects. Only the model parameters and their gradients (i.e., how they change) are shared during training to guarantee data privacy.

During VFL, though, there’s a potential that private data could be “recovered” from gradients. This would result in a massive data breach.
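Why gradients can leak data is easy to see in the simplest case. For a linear model with squared loss, the gradient with respect to the weights is a scalar multiple of the input itself, so a shared gradient reveals the private input's direction. The values below are hypothetical:

```python
import numpy as np

w = np.array([0.5, -0.3, 0.8])          # model weights
x_private = np.array([1.0, 2.0, -1.0])  # one party's private record
y = 0.7                                 # its label

# Squared-loss gradient w.r.t. w is 2 * residual * x: a rescaled copy
# of the private input.
residual = w @ x_private - y
g = 2 * residual * x_private            # gradient the party would share

# An eavesdropper recovers the input up to sign and scale from g alone.
x_dir = g / np.linalg.norm(g)
true_dir = x_private / np.linalg.norm(x_private)
assert np.allclose(np.abs(x_dir), np.abs(true_dir))
```

Deep models are not this transparent, but attacks in the CAFE family show that batched gradients still carry enough information to reconstruct inputs.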

CAFE is a method developed by the research team that recovers private data with higher quality than prior approaches, including data from big batches that were previously assumed to be resistant to such attacks.

To defend against CAFE, the researchers advocated sharing artificial gradients instead of real ones during training. Fake gradients can attain the same learning performance as real gradients as long as their difference from the real gradients stays below a specified threshold.
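The intuition behind bounded fake gradients can be sketched on a toy objective (all numbers are made up, and the perturbation here is simple uniform noise rather than the paper's construction): as long as the fake gradient stays within a small distance of the real one, training lands in nearly the same place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: minimize ||w - t||^2 by gradient descent.
t = np.array([3.0, -1.0])

def train(noise=0.0, steps=200, lr=0.1):
    w = np.zeros(2)
    for _ in range(steps):
        g = 2 * (w - t)                        # real gradient
        if noise:
            # Fake gradient: real gradient plus a bounded perturbation.
            g = g + rng.uniform(-noise, noise, size=2)
        w = w - lr * g
    return w

w_real = train(0.0)
w_fake = train(0.05)   # perturbation bounded by 0.05 per coordinate

assert np.linalg.norm(w_real - t) < 1e-3
assert np.linalg.norm(w_fake - t) < 0.1   # near-identical performance
```

The eavesdropper now sees only the noised gradients, while the trained model ends up within a small neighborhood of the real-gradient solution.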


  • Min-Max Training:

Another line of work uses Min-Max training, in which the goal is to minimize the maximal adversarial loss. By recasting the objective as degrading the performance of a given set of models or defense plans, a generalized Min-Max technique can also be used to build more successful adversarial attacks.

The Min-Max framework incorporates “domain weights”: a probability distribution over a collection of domains defined by the scenario, adjusted by the inner maximization. The framework was also translated to a defense setting, where a model was trained to minimize the adversarial loss in the worst case when numerous hostile attacks are present.
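The interplay of domain weights and model updates can be sketched on a scalar toy problem. Everything here is hypothetical: three "domains", each a simple quadratic loss, with the inner maximization approximated by a softmax over current losses rather than the paper's exact update.

```python
import numpy as np

# Three toy "attack domains", each preferring a different parameter w.
targets = np.array([-1.0, 0.0, 2.0])

def losses(w):
    return (w - targets) ** 2

w = 0.0
for _ in range(500):
    L = losses(w)
    # Inner max: domain weights concentrate on the worst-hit domain
    # (softmax over losses, shifted for numerical stability).
    p = np.exp(5.0 * (L - L.max()))
    p /= p.sum()
    # Outer min: gradient step on the weighted loss.
    w -= 0.02 * (p * 2 * (w - targets)).sum()

# The worst-case optimum balances the two extreme domains (-1 and 2):
# w settles near 0.5, where both suffer equal loss 2.25.
assert abs(w - 0.5) < 0.05
assert losses(w).max() < 2.5
```

Minimizing the average loss instead would pick w = 1/3 and leave the hardest domain with a loss near 2.78, so the Min-Max solution genuinely improves the worst case.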


  • Contrastive Learning:

Contrastive learning (CL) is a machine learning technique in which a model learns general properties of the data without labels by recognizing which data points are similar and which are different (contrasting). Self-supervision allows for robust pre-training, but this robustness is frequently lost when fine-tuning the model for a given task.

Using adversarial training (AT), the researchers look into improving robustness transfer from pre-training to fine-tuning. The final goal is an adversarially robust CL model that enables simple fine-tuning with transferred robustness for various downstream tasks. There are two critical components to the framework:

  • By focusing on high-frequency components of the input, the model learns more robust representations and generalizes better from pre-training.
  • The researchers incorporated a supervisory stimulus by using feature clustering to generate pseudo-labels for the data, which increases cross-task robustness transferability.
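The second component, pseudo-labels from feature clustering, can be sketched with a minimal k-means over made-up encoder features (a stand-in for the actual pretrained representations):

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled "features" from a hypothetical pretrained encoder:
# two latent groups, no human-provided labels.
F = np.vstack([rng.normal(-3.0, 0.4, size=(50, 2)),
               rng.normal(+3.0, 0.4, size=(50, 2))])

def kmeans_labels(F, k=2, iters=20):
    # Minimal k-means: alternate nearest-centroid assignment and
    # centroid recomputation (deterministic far-apart initialization).
    cents = np.stack([F[0], F[-1]])
    for _ in range(iters):
        d = np.linalg.norm(F[:, None] - cents[None], axis=2)
        labels = d.argmin(axis=1)
        cents = np.stack([F[labels == c].mean(axis=0) for c in range(k)])
    return labels

pseudo = kmeans_labels(F)

# Each latent group receives one consistent pseudo-label, providing a
# supervisory signal for fine-tuning without any annotation.
assert len(set(pseudo[:50])) == 1
assert len(set(pseudo[50:])) == 1
assert pseudo[0] != pseudo[50]
```

These cluster IDs then play the role of class labels during adversarial fine-tuning, which is what lets robustness transfer across tasks that were never labeled.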


  • Contaminated Best Arm Identification:

Contaminated best arm identification (CBAI) is a method for choosing the best of numerous options (“arms”) when the data is subject to adversarial corruption, a scenario with real-world ramifications. These “tainted” samples make it difficult to determine which arm is genuinely the best, i.e., the one with the highest mean reward. The team has proposed two algorithms:

  • Sequential elimination of substandard arms.
  • Reducing the overlap between the confidence intervals of distinct arms.
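The first idea, sequential elimination, can be sketched as follows. This is a simplified stand-in for the actual algorithm: hypothetical arm rewards, contamination that inflates the worst arm, a trimmed mean as the robust estimator, and a fixed elimination margin instead of proper confidence intervals.

```python
import numpy as np

rng = np.random.default_rng(4)

true_means = [0.2, 0.5, 0.8]   # arm 2 is truly best

def pull(arm, n):
    x = rng.normal(true_means[arm], 0.1, size=n)
    # Adversarial contamination: 10% of arm 0's samples are replaced
    # with extreme values that would inflate a naive average.
    if arm == 0:
        x[: n // 10] = 100.0
    return x

def trimmed_mean(x, frac=0.2):
    # Drop the largest and smallest `frac` of samples before
    # averaging, discarding the contaminated outliers.
    x = np.sort(x)
    k = int(len(x) * frac)
    return x[k:len(x) - k].mean()

# Sequential elimination: pull every surviving arm in rounds and drop
# any arm whose robust estimate trails the leader by a margin.
alive = {0, 1, 2}
samples = {a: np.empty(0) for a in alive}
while len(alive) > 1:
    for a in alive:
        samples[a] = np.concatenate([samples[a], pull(a, 50)])
    est = {a: trimmed_mean(samples[a]) for a in alive}
    best = max(est.values())
    alive = {a for a in alive if est[a] > best - 0.1}

assert alive == {2}   # the truly best arm survives despite poisoning
```

A naive sample mean would rank the contaminated arm 0 as the clear winner; the trimmed estimate ignores the planted outliers, so elimination converges on the genuinely best arm.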


It’s critical to understand that accuracy isn’t the sole criterion that matters. Fairness, interpretability, and resilience are essential in the real world, and many tools are available to examine these aspects of AI/ML models. Developers must actively prepare ML models for success in the field by finding the flaws in the armor, anticipating an adversary’s next move, and building robustness into the fabric of AI.

