From recommendations to automatic image classification, machine learning (ML) models increasingly improve performance across many consumer products. Despite aggregating massive volumes of data, models can, in theory, encode characteristics of individual entries from the training set.
Experiments in controlled settings have shown that language models trained on email datasets can encode sensitive information from the training data and can reveal whether a specific user’s data is present in the training set. As a result, it’s critical to prevent models from encoding such characteristics of individual training entries.
To achieve these goals, researchers are increasingly using federated learning approaches. Differential privacy (DP) allows researchers to quantify and understand the privacy guarantees of a system or algorithm. Under the DP framework, these guarantees are commonly described by a positive parameter ε, termed the privacy loss bound, with smaller values corresponding to stronger privacy.
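For reference, the privacy loss bound is usually tied to the standard (ε, δ) formulation of differential privacy, which the text above does not spell out. In the sketch below, M is a randomized training algorithm, D and D' are any two datasets that differ in a single entry, and S is any set of possible outputs. The symbols M, D, D', S, and the small failure probability δ are notation assumed here for illustration.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

A smaller ε forces the two output distributions closer together, so a single entry has less influence on what the trained model can reveal.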
DP-SGD, a specialized training algorithm that provides DP guarantees for the trained model, is typically used to train a model with DP.
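The core DP-SGD update can be sketched in a few lines of plain Python. This is a minimal illustration of the commonly described recipe (clip each per-example gradient, sum, add Gaussian noise, average), not the implementation discussed in this article. The function names and the parameters `clip_norm` and `noise_multiplier` are assumptions chosen for the example.

```python
import math
import random

def clip_by_l2_norm(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update from a list of per-example gradient vectors."""
    batch_size = len(per_example_grads)
    # 1. Clip each example's gradient to bound its influence on the update.
    clipped = [clip_by_l2_norm(g, clip_norm) for g in per_example_grads]
    # 2. Sum the clipped gradients coordinate-wise.
    summed = [sum(col) for col in zip(*clipped)]
    # 3. Add Gaussian noise with standard deviation noise_multiplier * clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # 4. Average over the batch and take a gradient step.
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]

# Hypothetical toy inputs: two per-example gradients with norms 5.0 and 0.5.
rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]
new_params = dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
```

The clipping threshold bounds how much any single example can move the parameters, and the noise scale is set relative to that threshold. Both steps are what distinguish this update from ordinary SGD.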
However, training with DP-SGD has two fundamental drawbacks. First, most current DP-SGD implementations are inefficient and slow, making them difficult to use on large datasets.
Second, DP-SGD training often significantly degrades utility (such as model accuracy), to the point where DP-SGD-trained models can be unusable in practice.
As a result, most DP research papers evaluate DP algorithms on small datasets (MNIST, CIFAR-10, or UCI) and do not even attempt evaluation on larger datasets like ImageNet.
Testing Differential Privacy on ImageNet
ImageNet classification was chosen to demonstrate the practicality and efficacy of DP for two reasons. First, it is a challenging task for DP, for which no prior work has shown sufficient progress. Second, it is a public dataset on which other researchers can work, allowing them to collectively improve the utility of real-world DP training.
ImageNet classification is difficult for DP because it requires large networks with many parameters. The noise added scales with the model’s size, so a substantial amount of noise is added to the computation.
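A small numerical sketch can make the scaling concrete. It assumes the usual DP-SGD setup in which independent Gaussian noise is added to every coordinate of the gradient, so the noise vector's L2 norm grows roughly like sigma * sqrt(dim), while the clipped gradient's norm stays at or below the clipping threshold. The function name and the specific values below are hypothetical.

```python
import math
import random

def gaussian_noise_norm(dim, sigma, rng):
    """L2 norm of a dim-dimensional vector of i.i.d. N(0, sigma^2) noise."""
    return math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dim)))

rng = random.Random(0)
clip_norm = 1.0          # clipped gradients have L2 norm <= clip_norm
sigma = 1.0 * clip_norm  # hypothetical noise multiplier of 1.0

for dim in (100, 10_000, 100_000):
    noise = gaussian_noise_norm(dim, sigma, rng)
    print(f"dim={dim:>7}  noise norm ~ {noise:7.1f}  "
          f"sigma*sqrt(dim) = {sigma * math.sqrt(dim):7.1f}")
```

Because the signal is capped by `clip_norm` while the noise norm keeps growing with the parameter count, the signal-to-noise ratio of each update shrinks as the model gets larger.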
Using JAX to Scale Differential Privacy
Exploring different architectures and training configurations to see what works best for DP can be painfully slow. JAX is a high-performance computing library based on XLA that can perform efficient auto-vectorization and just-in-time compilation of mathematical computations, which helps streamline this exploration.
Using these JAX features was previously recommended as a good way to speed up DP-SGD on smaller datasets such as CIFAR-10.
A new DP-SGD implementation was developed in JAX and benchmarked on the large ImageNet dataset (the code is included in our release). The implementation was relatively simple and delivered substantial speedups simply from using the XLA compiler. It is often faster than other DP-SGD implementations such as Tensorflow Privacy, and it is usually faster even than the custom-built and optimized PyTorch Opacus.
Each step of the DP-SGD implementation takes approximately two forward-backward passes through the network. While this is slower than non-private training, which requires only a single forward-backward pass, it is still the most efficient known approach for training with the per-example gradients that DP-SGD requires.
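To see why per-example gradients are needed at all, the toy sketch below compares clipping each example's gradient before averaging with clipping the already averaged batch gradient. Only the first bounds the contribution of every individual example, which is what the DP guarantee relies on. The helper names and numbers are assumptions for illustration, written in plain Python rather than JAX.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip(v, clip_norm):
    """Scale v down so that its L2 norm is at most clip_norm."""
    scale = min(1.0, clip_norm / max(l2_norm(v), 1e-12))
    return [x * scale for x in v]

def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical per-example gradients with norms 10.0 and 0.5.
per_example_grads = [[6.0, 8.0], [0.0, 0.5]]
clip_norm = 1.0

# What DP-SGD needs: clip every example first, then average.
mean_of_clipped = mean([clip(g, clip_norm) for g in per_example_grads])

# What a single batch-level backward pass would allow: clip the average.
clipped_mean = clip(mean(per_example_grads), clip_norm)

print("mean of clipped:", mean_of_clipped)
print("clipped mean:   ", clipped_mean)
```

In the second variant the large-norm example still dominates the direction of the update, so its influence is not bounded per example. Computing the first variant efficiently is where the extra cost over a single forward-backward pass comes from.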
The graph below illustrates the training runtimes for two ImageNet models, one using DP-SGD and the other with non-private SGD, both on JAX.
Overall, DP-SGD on JAX was fast enough to run large experiments, provided the number of training runs used to find optimal hyperparameters was reduced by a factor of two compared to non-private training. This is significantly better than alternatives like Tensorflow Privacy, which was 5x–10x slower on the CIFAR10 and MNIST benchmarks.
Future training algorithms may improve DP’s privacy-utility tradeoff. With current algorithms, however, an engineering “bag-of-tricks” approach makes DP more feasible on complex problems like ImageNet. On ImageNet, a combination of techniques, including full-batch training and transfer learning from public data, worked best to achieve non-trivial accuracy and privacy.
It has already been demonstrated on other benchmarks that pre-training on public data followed by DP fine-tuning on private data improves accuracy. Which public data to use for a given task to get the most out of transfer learning remains an open question.
Before being fine-tuned with DP-SGD on ImageNet, the models were pre-trained on Places365. Places365 contains only images of landscapes and buildings, not of animals as ImageNet does, so it is a good candidate for demonstrating the models’ ability to transfer to a different but related domain.
Transfer learning from Places365 yielded 47.5 percent accuracy on ImageNet with a reasonable level of privacy (privacy loss bound ε = 10).
This is low compared to the 70% accuracy of a similar non-private model. Still, it is quite good compared to naive DP training on ImageNet, which gives either very low accuracy (2-5%) or no privacy (ε = 10^9).
The findings and public code are intended to encourage other researchers to improve DP for ambitious tasks like ImageNet, as a proxy for complex production-scale challenges. To keep the field moving forward, researchers are encouraged to start with a baseline that includes full-batch training and transfer learning.