Unveiling the Hidden Dimensions: A Groundbreaking AI Model-Stealing Attack on ChatGPT and Google’s PaLM-2

The inner workings of state-of-the-art large language models, such as GPT-4, Claude 2, or Gemini, remain shrouded in secrecy, with details about their architecture, model size, and training methods withheld from public scrutiny. This opacity is attributed to competitive pressures and to safety concerns that divulging such information could make the models easier to attack. Yet because the models are accessible via APIs, a natural question arises: how much can adversaries learn about them through queries alone? This question falls within the purview of model stealing, in which adversaries attempt to extract model weights by querying a model's API.

The researchers present a novel attack on black-box language models that recovers a transformer's complete embedding projection layer. Unlike previous approaches, which reconstruct models from the bottom up, this attack operates top-down, extracting the model's final layer directly. Because the final layer projects from a low-dimensional hidden space to the much larger vocabulary, the model's logit outputs are constrained to a low-rank subspace; targeted queries to the model's API exploit this structure to recover the embedding dimension or the final weight matrix itself. Although the method recovers only a portion of the model, it raises concerns that more extensive attacks may follow.

The attack is both effective and efficient against production models whose APIs expose full logprobs or a "logit bias" parameter, including Google's PaLM-2 and OpenAI's GPT-4. Following responsible disclosure, both providers deployed defenses that mitigate the attack or raise its cost. While the attack successfully extracts the embedding layer of several OpenAI models with minimal error, the researchers envision further improvements and extensions, including breaking symmetries introduced by quantized weights, extending the attack beyond a single layer, and exploring alternative ways to learn logit information, since changes to API parameters or efforts to conceal logits may blunt the attack's effectiveness.
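For APIs that return only top-k logprobs but accept a logit bias, the simulated sketch below (toy sizes and a stand-in API, not any real provider's interface) shows why a logit bias leaks logit information: boosting a target token with a large bias forces it into the top-k, and because the softmax normalizer cancels in the difference of two returned logprobs, subtracting the bias recovers the token's logit relative to a reference token.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, TOP_K, B = 50, 5, 100.0        # toy vocabulary, top-k size, and bias (assumptions)

# A simulated model's true next-token logits -- hidden from the attacker.
true_logits = rng.normal(size=VOCAB)

def api_topk_logprobs(bias):
    """Stand-in for a real API: top-k entries of log-softmax(logits + bias)."""
    shifted = true_logits + bias
    m = shifted.max()                                   # stable log-sum-exp
    logprobs = shifted - (m + np.log(np.exp(shifted - m).sum()))
    top = np.argsort(logprobs)[-TOP_K:]
    return {int(t): float(logprobs[t]) for t in top}

# Reference token: the unbiased argmax, which remains in the top-k even
# when a single other token is boosted.
base = api_topk_logprobs(np.zeros(VOCAB))
ref = max(base, key=base.get)

recovered = np.zeros(VOCAB)
for t in range(VOCAB):
    bias = np.zeros(VOCAB)
    bias[t] = B                       # force token t into the top-k
    lp = api_topk_logprobs(bias)
    # lp[t] - lp[ref] = (logit_t + B) - logit_ref: the normalizer cancels.
    recovered[t] = lp[t] - lp[ref] - (B if t != ref else 0.0)

# recovered[t] now equals true_logits[t] - true_logits[ref] for every token t.
```

Hiding logprobs or restricting the logit-bias parameter, as the providers' post-disclosure defenses do, closes exactly this channel, at the cost of a less informative API.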

The study is not driven by the expectation of replicating entire production transformer models bit-for-bit. Instead, it is motivated by a more pressing concern: demonstrating the practical feasibility of model-stealing attacks on large-scale deployed models. This emphasis on practicality underscores the urgency of addressing such vulnerabilities and of anticipating how the attack might evolve and how it could be countered.

The researchers outline potential avenues for further exploration and improvement of the attack methodology. They stress the importance of adaptability in response to changes in API parameters or model defenses, emphasizing the need for ongoing research to address emerging vulnerabilities and ensure the resilience of machine learning systems against potential threats. By fostering collaboration and knowledge-sharing within the research community, the researchers aim to contribute to the development of more secure and trustworthy machine learning models that can withstand adversarial attacks in real-world deployments.

Check out the Paper. All credit for this research goes to the researchers of this project.