Machine Learning (ML), and Deep Learning in particular, are drastically changing the way we conduct business: data can now be used to guide business strategy and create new value, analyze customers and predict their behavior, or even provide medical diagnosis and care. We may think that data is at risk when these algorithms recommend and direct our purchases on social media or monitor our doorways, elderly, and youngsters, but that is only the tip of the iceberg. Data is also used to make banking decisions, detect fraudulent transactions, and set insurance rates. In all these cases, the data is entangled with sensitive information about enterprises or individuals, and the benefits come with data risks.
One of the most critical challenges that companies are facing today is understanding how to handle and protect their own data while using it to improve their businesses through ML solutions. The data includes customers’ personal information as well as business data, such as data regarding the sales of a company itself. Clearly, it is essential for a company to correctly handle and protect such data since its exposure would be a massive vulnerability.
Specifically, three significant business challenges in data protection are worth mentioning:
- First, companies have to find out how to provide safe access to large datasets for their scientists to train ML models that provide novel business value.
- Second, as part of their digital transformation efforts, many companies migrate their ML processes (training and deployment) to cloud platforms, where they can be handled more efficiently at large scale. However, exposing the data those ML processes consume to the cloud platform carries its own data risks.
- Third, organizations that want to take advantage of third-party ML-backed services must currently be willing to relinquish ownership of their sensitive data to the provider of those services.
To address these challenges and be broadly applicable, two essential goals must be met:
- Separate plain-text sensitive data from the machine learning process and the platform during both the training and the inference stages of the ML lifecycle;
- Fulfill this objective without significantly impacting the performance of the ML model and the platform on which it is trained and deployed.
In recent years, ML researchers have proposed different methods to protect the data that will be used by ML models. However, none of these solutions satisfies both the above-mentioned goals. Most importantly, Protopia AI’s Stained Glass Transform™ solution is the only solution on the market that adds a layer of data protection during inference without requiring specialized hardware or incurring significant performance overheads.
Protopia AI’s patented technology enables Stained Glass Transforms™ to reduce the information content in inferencing data, which increases data security and enables privacy-preserving inference. The transforms can be thought of as stained glass covering the raw data behind it. Unlike masking solutions, which scan the data to find sensitive information to redact, Protopia AI’s solution stochastically transforms real data with respect to the machine learning model the data is intended for. The low-overhead and nonintrusive nature of Stained Glass Transforms™ lets enterprises secure the ownership of their data in increasingly complex environments by dynamically applying the transformations in data pipelines for every record.
While synthetic data can be useful for training some models, inferencing requires real data. Inferencing on encrypted data, on the other hand, is prohibitively slow for most applications even with custom hardware. By contrast, Protopia AI’s Stained Glass Transforms™ change the representation of the data through a low-overhead, software-only process. These transforms are applicable and effective for a variety of data types, including but not limited to tabular, text, image, and video data. Protopia AI’s solution decouples the ownership of data from where, and on which platform, the inferencing is performed.
Gartner has also recently highlighted Protopia AI in their June 2022 report on Cool Vendors in AI Governance and Responsible AI – From Principles to Practice.
Concerns about data sharing and data ownership hinder the use of SaaS for AI and machine learning. With Protopia AI, the specific target machine learning model can still perform accurate inferencing without the need to reverse the transformation. The target model is still trained with common practices, using the original data. As such, the solution integrates seamlessly with MLOps platforms and data stores. Ultimately, Protopia AI’s Stained Glass Transforms™ minimize leakage of the sensitive information entangled in inferencing data, which in many cases is the barrier to using the data for machine learning and AI.
In the sections that follow, we detail how existing methods are complementary to Protopia AI’s Stained Glass Transform™ solution and where other solutions fall short.
Federated Learning: To protect training data, Google presented Federated Learning, a distributed learning framework in which the devices that store data locally collaboratively learn a shared ML model without exposing training data to a centralized training platform. The idea is to send only the ML model’s parameters to the cloud, thus protecting the sensitive training data. However, several works in the literature have demonstrated that an attacker can use observations of an ML model’s parameters to infer private information about the training data, such as class representatives, membership, and properties of a subset of the training data. Moreover, Federated Learning ignores the inference stage of the ML lifecycle, so running inference still exposes the data to the ML model, whether it runs in the cloud or on the edge device.
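To make the parameter-sharing idea concrete, here is a minimal sketch of federated averaging on a toy one-parameter linear model. The function names (`local_update`, `federated_round`, `make_client`), the learning rate, and the synthetic client data are all assumptions made for illustration; this is not Google’s implementation or any production framework.

```python
import random

def local_update(w, data, lr=0.1, epochs=20):
    """Run a few gradient-descent steps for y ~ w * x on one client's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """One round: each client trains locally; the server only sees weights.

    The server averages the returned weights, weighted by local dataset size.
    """
    total = sum(len(d) for d in clients)
    updates = [(local_update(global_w, d), len(d)) for d in clients]
    return sum(w * n for w, n in updates) / total

def make_client(n, rng, true_w=3.0):
    """Hypothetical private dataset: noisy samples of y = true_w * x."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0, 0.01)) for x in xs]

rng = random.Random(0)
clients = [make_client(n, rng) for n in (20, 50, 30)]

w = 0.0  # shared model parameter held by the server
for _ in range(10):
    w = federated_round(w, clients)
print(f"learned weight: {w:.3f}")
```

Note that the raw `(x, y)` pairs never leave `local_update`, yet the returned weights are still a function of that data, which is why parameter observations can leak information as described above.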
Differential Privacy: Differential Privacy has received significant attention. This method bounds how much a single data record from the training dataset contributes to the machine learning model. The guarantee is framed in terms of the membership of individual training records: if a single data record is removed from the dataset, the output should not change beyond a certain threshold. Although this guarantee is very important, training in a differentially private manner still requires access to plain-text data. More importantly, differential privacy does not address the inference stage at all.
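The bound on a single record’s contribution can be illustrated with the standard Laplace mechanism applied to a counting query. This is a simplified sketch, not differentially private model training; the `dp_count` helper, the sample `ages` list, and the choice of epsilon are assumptions for illustration only.

```python
import random

def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count using the Laplace mechanism.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is sensitivity / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))  # needs plain-text records
    sensitivity = 1.0
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise

rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 44]  # hypothetical sensitive records
noisy = dp_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of records with age >= 40: {noisy:.2f}")
```

A smaller epsilon means more noise and a stronger guarantee. The sketch also shows the limitation noted above: `dp_count` must read the plain-text records to compute the true count before adding noise.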
Synthetic Data: Another method to protect sensitive training data is to train the ML model on Synthetic Data. However, the generated synthetic data might not cover real-world data subspaces that are essential for training a predictive model that will be reliable during the inference stage. This could cause significant accuracy losses that make the model unusable after deployment. Moreover, the trained model still needs real data to perform inferencing and prediction, so there is no escaping the challenges of this stage, where synthetic data cannot be used.
Secure Multi-Party Computation and Homomorphic Encryption: Two cryptographic techniques for privacy-preserving computation are Secure Multi-Party Computation (SMC) and Homomorphic Encryption (HE). In SMC, the computation is distributed over multiple secure platforms, which results in significant computation and communication costs that can be prohibitive in many cases. Homomorphic encryption is even more costly because it operates on data in encrypted form, which even with custom hardware is orders of magnitude slower. Moreover, deep neural networks, the most widely used ML solution in many domains today, require modifications to be used in a framework that relies on HE.
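As a rough illustration of how SMC distributes a computation, the sketch below uses additive secret sharing, one common building block of SMC protocols. The modulus, the three-party setup, and the helper names `share` and `reconstruct` are assumptions chosen for this example rather than details of any specific system cited here.

```python
import secrets

PRIME = 2**61 - 1  # modulus for share arithmetic; an illustrative choice

def share(value, n_parties):
    """Split `value` into n random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares; any strict subset of them reveals nothing about the value."""
    return sum(shares) % PRIME

# Two private inputs, each split across three parties.
a_shares = share(120, 3)
b_shares = share(45, 3)

# Each party adds the two shares it holds, with no communication.
sum_shares = [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]

print(reconstruct(sum_shares))  # 165
```

Addition is cheap because each party works only on its own shares. Multiplying shared values requires extra rounds of interaction between the parties, and a neural network performs a very large number of multiplications, which is one source of the communication costs mentioned above.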
Confidential Computing: Confidential computing focuses on protecting data during use. Many big companies, including Google, Intel, Meta, and Microsoft, have already joined the Confidential Computing Consortium, established in 2019 to promote hardware-based Trusted Execution Environments (TEEs). This approach aims to protect data while it is being used by isolating computations inside these hardware-based TEEs. The main drawback of Confidential Computing is that it forces companies to take on the additional cost of migrating their ML-based services to platforms that provide such specialized hardware infrastructure. At the same time, this solution cannot be considered risk-free. Indeed, in May 2021, a group of researchers introduced SmashEx, an attack that allows collecting and corrupting data from TEEs that rely on Intel Software Guard Extensions (SGX) technology. Protopia AI’s Stained Glass Transform™ technology can transform data before it enters the trusted execution environment, so it is complementary and reduces the attack surface along an orthogonal axis. With Protopia AI’s solution, even if the TEE is breached, the plain-text data is no longer there.
In conclusion, enterprises have been struggling to understand how to protect sensitive information when using their data during the training and inference stages of the ML lifecycle. Questions of data ownership, and of which parties, platforms, and algorithms sensitive data is exposed to during ML processes, are a central challenge to enabling ML solutions and unlocking their value in today’s enterprise. Protopia AI’s Stained Glass Transform™ solution privatizes and protects ML data for both training and inference for any ML application and data type. These lightweight transformations decouple the ownership of plain/raw sensitive information in real data from the ML process without imposing significant overhead in the critical path or requiring specialized hardware.
Note: Thanks to Protopia AI for the thought leadership and educational article above. Protopia AI has supported and sponsored this content. For more information about products, sales, and marketing, please contact the Protopia AI team at firstname.lastname@example.org
 McMahan, Brendan, et al. “Communication-efficient learning of deep networks from decentralized data.” Artificial intelligence and statistics. PMLR, 2017.
 Lyu, Lingjuan, et al. “Privacy and robustness in federated learning: Attacks and defenses.” arXiv preprint arXiv:2012.06337 (2020).
 Mohassel, Payman, and Yupeng Zhang. “Secureml: A system for scalable privacy-preserving machine learning.” 2017 IEEE symposium on security and privacy (SP). IEEE, 2017.
 Xie, Pengtao, et al. “Crypto-nets: Neural networks over encrypted data.” arXiv preprint arXiv:1412.6181 (2014).
 Chabanne, Hervé, et al. “Privacy-preserving classification on deep neural network.” Cryptology ePrint Archive (2017).
 Cui, Jinhua, et al. “SmashEx: Smashing SGX Enclaves Using Exceptions.” Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.
Luca is a Ph.D. student in the Department of Computer Science at the University of Milan. His interests are Machine Learning, Data Analysis, IoT, Mobile Programming, and Indoor Positioning. His research currently focuses on Pervasive Computing, Context-awareness, Explainable AI, and Human Activity Recognition in smart environments.