Google AI Implements Machine Learning Model That Employs Federated Learning With Differential Privacy Guarantees

Federated Learning (FL) goes beyond running local models that make predictions on mobile devices: it brings model training to the device as well. FL allows mobile phones to cooperatively learn a shared prediction model while keeping all of the training data on the device. This removes the need to store training data in the cloud.

While FL allows for machine learning (ML) without collecting raw data, differential privacy (DP) provides a quantifiable measure of data anonymization. When applied to ML, it can alleviate worries about models memorizing sensitive user data.

Google has worked on DP for a number of years. It released RAPPOR, one of the first practical uses of DP for analytics, and has since released an open-source DP library, Pipeline DP, and TensorFlow Privacy.

New research from Google describes a production ML model trained using federated learning with a rigorous differential privacy guarantee, the result of a multi-year, multi-team effort spanning foundational research and product integration. For this deployment, the team used the DP-FTRL algorithm to train a recurrent neural network that powers next-word prediction for Spanish-language Gboard users. This neural network is trained directly on user data with a formal DP guarantee. In addition, the federated approach offers complementary data minimization benefits, and the DP guarantee protects all of the data on each device, not just individual training examples.

In ML applications that handle sensitive data, privacy considerations such as data minimization and anonymization are vital in addition to fundamentals like transparency and consent.

The principle of data minimization is built into the structure of federated learning systems. FL only transmits minimal updates for a specific model training task (targeted collection), restricts data access at all stages, processes individual data as early as possible (early aggregation), and discards both collected and processed data as soon as possible (minimal retention).

Anonymization is another fundamental requirement for models trained on user data. It means that the final model should not memorize information unique to a particular individual’s data, such as phone numbers, addresses, or credit card numbers.

However, FL on its own does not directly address this issue. The principle of anonymization can be formally quantified using the mathematical concept of DP. In differentially private training, random noise is introduced so that training produces a probability distribution over output models, and the method ensures that this distribution does not change too much when the training data is changed slightly. When a single training example is added or removed, the output distribution over models changes only in a provably minimal way. The researchers call this example-level DP.
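For readers who want the formal statement, the standard (ε, δ) definition of DP can be written as below. The article does not spell it out, so the symbols follow common convention rather than the researchers' own notation: A is the randomized training algorithm, D and D' are datasets that differ in a single training example, and S is any set of possible output models.

```latex
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{A}(D') \in S] + \delta
```

Smaller values of ε and δ mean the two output distributions are closer together, which corresponds to a stronger privacy guarantee.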

However, example-level DP is not always enough to prevent a user’s data from being memorized, since one user may contribute many examples. The team therefore designed algorithms for user-level DP, which requires that the output distribution over models not change significantly even if all of the training examples from a single user are added or removed. FL is a natural fit for providing user-level DP guarantees because it summarizes all of a user’s training data as a single model update.
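To make the "single model update" idea concrete, here is a minimal sketch in Python. It assumes a toy one-parameter regression model and a hypothetical helper named `local_update`; neither comes from the research, and a real Gboard client trains a neural network instead. The point is only that the device returns one delta no matter how many examples it holds.

```python
def local_update(global_weight, examples, learning_rate=0.1):
    """Run one pass of SGD over a user's examples and return a single delta.

    Assumption: a toy model y = w * x with squared loss, used only to
    keep the sketch short.
    """
    w = global_weight
    for x, y in examples:
        grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
        w -= learning_rate * grad
    return w - global_weight  # the only value that leaves the device


# One example or fifty examples: the server still sees one number per user.
few = local_update(0.0, [(1.0, 2.0)])
many = local_update(0.0, [(1.0, 2.0)] * 50)
```

Because the server only ever receives this one update per user per round, bounding and noising that update protects everything the user contributed at once.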

The team’s DP-FedAvg algorithm extends the DP-SGD technique to the federated setting with user-level DP guarantees by limiting each user’s contribution and adding noise, although both measures can reduce model accuracy. The method ensures that the training mechanism is not overly sensitive to any single user’s data, and empirical privacy auditing techniques rule out some forms of memorization.
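The two ingredients mentioned here, bounding each user's contribution and adding noise, can be sketched as follows. This is an illustrative simplification, not the team's implementation: the function names, the plain-list vectors, and the parameters `clip_norm` and `noise_multiplier` are assumptions, and details such as device sampling and privacy accounting are omitted.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale `update` down so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def noisy_average(user_updates, clip_norm, noise_multiplier, rng):
    """Sum clipped per-user updates, add Gaussian noise, then average."""
    dim = len(user_updates[0])
    total = [0.0] * dim
    for update in user_updates:
        clipped = clip_update(update, clip_norm)
        for i in range(dim):
            total[i] += clipped[i]
    # Noise is scaled to the clip norm, i.e. to the most any one user can add.
    stddev = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, stddev) for t in total]
    return [v / len(user_updates) for v in noisy]


rng = random.Random(0)
updates = [[3.0, 4.0], [0.0, 0.5], [-1.0, 0.2]]
step = noisy_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping caps how far any single user can move the model, and the noise masks whatever influence remains; a tighter clip or a larger noise multiplier strengthens privacy at the cost of accuracy.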

Earlier work on privacy amplification via random check-ins identified several challenges that called for significant changes to this approach:

  • The amplification-via-sampling argument is critical for giving DP-FedAvg a strong DP guarantee, but in a real-world cross-device FL system it would be complex and hard to verify that devices are subsampled precisely and uniformly at random from a large population.
  • Devices choose when to connect based on many external factors, so the number of available devices varies significantly.
  • As with the DP-SGD amplification-via-sampling analysis, the privacy amplification achievable with random check-ins depends on a large number of devices being available.

To tackle these issues, the DP-FTRL algorithm builds on two key observations:

  1. The convergence of gradient-descent-style algorithms is primarily determined by the accuracy of cumulative sums of gradients.
  2. Accurate estimates of cumulative sums can be provided with a strong DP guarantee by using negatively correlated noise added by the aggregating server. In essence, noise is added to one gradient and that same noise is subtracted from a later gradient, which is implemented using the Tree Aggregation technique.
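The tree aggregation idea behind the second observation can be sketched as below. This is a minimal illustration of the general binary-tree mechanism for private running sums, not the production DP-FTRL code: gradients are scalars for brevity, the function names are hypothetical, and calibrating `noise_std` to a target privacy budget is left out. Each tree node covers a power-of-two block of rounds and carries one independent noise draw; a running sum over rounds 1..t reuses the few nodes that exactly cover that range.

```python
import random


def dyadic_nodes(t):
    """Return the (level, index) tree nodes whose blocks exactly cover rounds 1..t."""
    nodes = []
    start = 0
    level = t.bit_length() - 1
    while level >= 0:
        size = 1 << level
        if start + size <= t:
            nodes.append((level, start // size))
            start += size
        level -= 1
    return nodes


def private_prefix_sums(gradients, noise_std, rng):
    """Return noisy running sums of `gradients` using one noise draw per tree node."""
    node_noise = {}  # (level, index) -> Gaussian noise, drawn once and reused
    sums = []
    exact = 0.0
    for t, g in enumerate(gradients, start=1):
        exact += g
        noise = 0.0
        for node in dyadic_nodes(t):
            if node not in node_noise:
                node_noise[node] = rng.gauss(0.0, noise_std)
            noise += node_noise[node]
        sums.append(exact + noise)
    return sums


rng = random.Random(0)
noisy_sums = private_prefix_sums([0.5, -0.2, 0.1, 0.4, 0.3], noise_std=1.0, rng=rng)
```

Under this scheme the running sum at round t contains at most about log2(t) + 1 noise terms, rather than t of them as it would if every gradient were noised independently. When a larger block completes, the noise of the smaller blocks it replaces drops out of the sum, which is one way to see the negative correlation between noise added at different rounds.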

The team states that it is easier to estimate cumulative sums accurately than individual gradients. To provide a strong privacy guarantee, they limit the number of times a user can contribute an update. With sampling without replacement, each device can keep track of which models it has already contributed to and decline to connect to the server in any subsequent rounds for those models.
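The device-side bookkeeping described here might look like the following sketch. The class name `ParticipationTracker`, its methods, and the model identifier string are hypothetical and chosen for illustration; the article does not describe the actual client code.

```python
class ParticipationTracker:
    """Device-local record of how often this device has contributed to each model."""

    def __init__(self, max_contributions=1):
        self.max_contributions = max_contributions
        self._counts = {}  # model identifier -> number of updates contributed

    def may_participate(self, model_id):
        """Return True if the device is still under the cap for this model."""
        return self._counts.get(model_id, 0) < self.max_contributions

    def record_contribution(self, model_id):
        """Record one contribution, refusing to exceed the cap."""
        if not self.may_participate(model_id):
            raise RuntimeError("contribution cap reached for " + model_id)
        self._counts[model_id] = self._counts.get(model_id, 0) + 1


tracker = ParticipationTracker(max_contributions=1)
if tracker.may_participate("hypothetical-model-v1"):
    tracker.record_contribution("hypothetical-model-v1")
# From now on the device simply does not check in for this model.
still_eligible = tracker.may_participate("hypothetical-model-v1")
```

Keeping the check on the device means the contribution cap does not rely on the server sampling devices in any particular way.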

To deploy DP-FTRL, each eligible device maintains a local training cache of user keyboard input. When the device participates in training, it computes an update to the model that makes it more likely to suggest the next word the user actually typed, based on what has been typed so far.

Using this data, the researchers trained a recurrent neural network with 1.3M parameters with DP-FTRL. Training ran for 2000 rounds over six days, with 6500 devices participating in each round. To satisfy the DP guarantee, each device participated in training at most once every 24 hours. The resulting model is an upgrade over the prior DP-FedAvg trained model, which offered empirically tested privacy advantages over non-DP models but lacked a meaningful formal DP guarantee.
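A quick back-of-envelope calculation shows the scale these figures imply. Only the three input numbers come from the article; the derived values are rough estimates that assume every round had exactly 6500 participants and that a device could take part at most once per day of the six-day run.

```python
import math

rounds = 2000
devices_per_round = 6500
training_days = 6

# Total device updates aggregated over the whole run.
total_updates = rounds * devices_per_round

# Average pace of training.
rounds_per_day = rounds / training_days

# Assumption: at most one participation per 24 hours gives at most
# six updates per device across six days.
max_updates_per_device = training_days

# Lower bound on how many distinct devices must have taken part.
min_distinct_devices = math.ceil(total_updates / max_updates_per_device)
```

Under these assumptions the run aggregates 13 million device updates at roughly 333 rounds per day, which requires a population of at least about 2.2 million distinct devices.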

The researchers hope that their findings will encourage more research into maximizing the value that machine learning can provide while minimizing potential privacy risks to those who contribute training data.

Reference: https://ai.googleblog.com/2022/02/federated-learning-with-formal.html