Model specialization involves adapting a pre-trained machine-learning model to a specific task or domain. In Language Models (LMs), model specialization is crucial in improving their performance in various tasks like summarization, question-answering, translation, and language generation. The two main processes to specialize a language model to specific tasks are instruction fine-tuning (adapting a pre-trained model to a new task or set of tasks) and model distillation (transferring knowledge from a pre-trained, “teacher” model to a smaller, specialized, “student” model). Prompting is a key concept in the field of LM specialization, as it provides a way to guide the model towards specific behaviors, allows for more efficient use of limited training data, and is crucial for achieving state-of-the-art performance. Compressing prompts is a technique being studied with the hope of leading to substantial savings in computing, memory, and storage and no substantial decrease in the overall performance or quality of the output.
This paper, presented by researchers from Stanford University, proposes a novel technique for prompt compression called gisting, which trains an LM to compress prompts into smaller sets of “gist” tokens. In order to reduce the cost of the prompt, techniques like fine-tuning or distillation can be used to train a model that would behave like the original one without the prompt, but in that case, the model would have to be re-trained for every new prompt, which is far from ideal. The idea behind gisting, however, is to use a meta-learning approach to predict gist tokens from a prompt which would not require re-training the model for each task and would enable generalization to unseen instructions without additional training. This would come with a reduction in computational cost and would enable a prompt to be compressed, cached, and reused for compute efficiency. It would also allow users to fit more content into the limited context window.
The authors experimented with a simple way of achieving such a model – they used the LM itself (leveraging its pre-existing knowledge) to predict the gist tokens during the instruction fine-tuning while modifying the Transformer attention masks. Given a (task, input) pair, they add gist tokens between the task and the input and set the attention mask in the following way: the input tokens after the gist tokens cannot attend to any of the prompt tokens before the gist tokens (but they can attend to the gist tokens). Given the input and the output cannot attend to the prompt, this forces the model to compress the information from the prompt into the gist tokens in between.
To train the gist models, they needed a dataset with a large variety of tasks, so they created a dataset that they called Alpaca+, which combined the data from two existing instruction tuning datasets (Standford Alpaca and Self-Instruct) which totaled more than 130k examples. They then held out 3 validation splits to be able to validate the model after training which had Seen, Unseen, and hand-crafted Human prompts. This way, they were able to test the generalization to unseen instructions, with the Human split posing an even stronger generalization challenge. They also used multiple LM architectures (namely LLaMA-7Bm, a decoder-only GPT-style model, and FLAN-T5-XXL) and trained gist models with a varying number of gist tokens (1, 2, 5, or 10). However, the results showed that models were generally insensitive to the number of gist tokens, in some cases even showing that a larger number of tokens was actually detrimental to performance. They, therefore, used a single gist model for the rest of the experiments.
To assess the quality of the prompt compression, they calibrated performance against a positive control, which was effectively a standard instruction finetuning, which provided an upper bound on performance, and a negative control where the model would not have access to the instruction at all, resulting in random gist tokens, which provided a lower bound on performance. To compare the outputs of their models to the positive control and measure a win rate against it, they asked ChatGPT to choose which response was better, explaining its reasoning. They also used a simple lexical overlap statistic called ROUGE-L (a metric that measures similarities between generated text and human-written instructions in open-ended instruction fine-tuning). A 50% win rate indicates that the model is of comparable quality to a model that does no prompt compression.
The results showed that on Seen instructions, the gist models performed very closely to the positive control models with 48.6% (LLaMA) and 50.8% (FLAN-T5) win rates. More importantly, they were able to show that the gist models had competitive generalizations to unseen prompts, with 49.7% (LLaMA) and 46.2% (FLAN-T5) win rates. Only on the most challenging Human split they saw slight drops in win rates (but still competitive) with 45.8% (LLaMA) and 42.5% (FLAN-T5). The slightly worse performance of the FLAN-T5 and the particular failure cases brought more hypotheses to be tested in future papers.
The researchers also investigated the potential efficiency gains that can be achieved through gisting, which was the primary motivation for the study. The results were highly encouraging, with gist caching leading to a 40% reduction in FLOPs and 4-7% lower wall clock time compared to unoptimized models. While these improvements were found to be smaller for decoder-only language models, the researchers also demonstrated that gist models enabled a 26x compression of unseen prompts, providing considerable additional space in the input context window.
Overall, these findings illustrate the significant potential of gisting for enhancing both the effectiveness and efficiency of specialized language models. The authors also suggest several promising directions for follow-up work on gisting. For example, they stipulate that the largest compute and efficiency gains from gisting will come from compressing longer prompts and that “gist pretraining” could improve compression performance by first learning to compress arbitrary spans of natural language before learning prompt compression.
Check out the Paper and Github. Don’t forget to join our 19k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
🚀 Check Out 100’s AI Tools in AI Tools Club
Nathalie Crevoisier holds a Bachelor's and Master's degree in Physics from Imperial College London. She spent a year studying Applied Data Science, Machine Learning, and Internet Analytics at the Ecole Polytechnique Federale de Lausanne (EPFL) as part of her degree. During her studies, she developed a keen interest in AI, which led her to join Meta (formerly Facebook) as a Data Scientist after graduating. During her four-year tenure at the company, Nathalie worked on various teams, including Ads, Integrity, and Workplace, applying cutting-edge data science and ML tools to solve complex problems affecting billions of users. Seeking more independence and time to stay up-to-date with the latest AI discoveries, she recently decided to transition to a freelance career.