SalesForce AI Research Proposed the FlipFlop Experiment as a Machine Learning Framework to Systematically Evaluate the LLM Behavior in Multi-Turn Conversations

When an error or misunderstanding arises, modern LLMs can theoretically reflect on and refine their answers because they are interactive systems capable of multi-turn interaction with users.

Previous research has demonstrated that LLMs can enhance their responses using additional conversational context, such as Chain-of-Thought reasoning. However, LLMs designed to maximize human preference can display sycophantic behavior, meaning they will give answers that match what the user thinks is right, even if that perspective isn’t correct. 

A new Salesforce AI Research presents a multi-turn interaction between a simulated user and an LLM focusing on a classification task as the FlipFlop experiment. The LLM performs a classification task in response to a user prompt at the initial turn of the discussion. The LLM then decides whether to affirm or reverse its response in the second turn in response to a challenger utterance (such as “Are you sure?”) that questions its answer. 

The team systematically evaluates the accuracy of initial vs. final predictions in classification tasks, which provides a rigorous context to study model behavior. The GPT-4, Claude V2, and PaLM-Bison LLMs are asked to respond to a multiple-choice question. Two of the models generate the right solution first. To respond to the challenge, two models (GPT-4 and Claude V2) switch up their answers in the second turn, while PaLM-Bison sticks with its original response. All three models show a decline in performance, with reductions ranging from -8 % (GPT-4) to 34% (Claude V2), when results are aggregated on an evaluation set with 100 samples.

They measured the propensity of LLMs to reverse their initial predictions when confronted, which often results in a decline in accuracy, through conversational simulations focused on classification tasks. According to the extensive analysis across 10 LLMs and seven tasks, models exhibited uniform sycophantic behavior, resulting in an average of 46% response flipping and a 17% decrease in accuracy. According to the findings, the model, the job, and the precise language of the challenger’s prompt determine the degree of the FlipFlop effect. While some models do better than others, results show a lot of space for growth when creating models that can have honest multi-turn conversations without losing task accuracy. Future research aiming to improve models’ conversational abilities and systematically assess sycophantic conduct quantitatively can use the FlipFlop experiment as a solid foundation.

The researchers also investigate if adjusting a linear learning model (LLM) on synthetically-generated FlipFlop conversations can enhance model behavior. They find that a fine-tuned Mistral7b can reduce observed sycophantic behavior by 50% compared to the base model, indicating that fine-tuning can help reduce, but not eliminate, the FlipFlop effect. Since the FlipFlop experiment offers a solid foundation for studying and quantifying LLM sycophantic behavior, the team intends to make their code and data freely available so that everyone can work toward the same objective of creating more reliable LLMs.

The researchers highlight that there is no all-inclusive list of the tasks and challenger statements that were part of the experiment. Even though the FlipFlop experiment mimics discussions with multiple turns, the interactions are still artificial and don’t differ much from each other. They do not expect their results and relative significance to be immediately applicable in a more realistic environment. Their evaluation focuses on measures that assess response flipping and performance worsening. However, different use cases may highlight different parts of the model’s replies. For example, it was beyond the scope of their experiment to measure the relative politeness, conciseness, or consistency of the responses, even if these factors could be essential. They also focused on classification problems for experiments because they offer well-established metrics and simple formulations for measuring the efficacy of model responses. Evaluating sycophantic behavior in open-domain generation tasks, where LLMs are frequently employed, is an essential but unexplored area. 

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and Google News. Join our 38k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our Telegram Channel

You may also like our FREE AI Courses….

Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone's life easy.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...