Is ChatGPT’s Behavior Changing over Time? Researchers Evaluate the March 2023 and June 2023 Versions of GPT-3.5 and GPT-4 on Four Diverse Tasks

Large Language Models (LLMs) have emerged as one of the most impactful innovations in the field of Artificial Intelligence. From BERT, PaLM, and GPT to LLaMA and DALL-E, these models have shown incredible performance in understanding and generating human-like language. These models are continuously updated with fresh data, user feedback, and design modifications. However, it remains unclear how frequently GPT-3.5 and GPT-4 receive updates, which makes it difficult to integrate these LLMs into broader workflows. 

This instability can disrupt downstream pipelines when an LLM’s behavior, such as the correctness or formatting of its response to a prompt, changes abruptly. Such unpredictability makes it difficult for developers and users to trust the models to produce consistent outcomes, which limits the stable integration of LLMs into existing systems and workflows. To study how the behavior of different Large Language Models (LLMs) changes over time, a team of researchers from Stanford University and UC Berkeley has evaluated the behavior of the March 2023 and June 2023 versions of GPT-3.5 and GPT-4.

Three crucial elements were used to quantify the changes: the LLM services to monitor, the application scenarios to concentrate on, and the metrics to gauge LLM drift in each scenario. The LLM services monitored in this study are GPT-4 and GPT-3.5, the core components of ChatGPT. Given ChatGPT’s popularity with both companies and individual users, systematic and timely monitoring of these two services can help users better understand and use LLMs for their particular use cases. 

The study uses the March 2023 and June 2023 snapshots of GPT-4 and GPT-3.5 that are accessible through OpenAI’s API, with the main objective of examining the variations, or “drifts,” between the two dates. The team chose four commonly studied LLM tasks for evaluation, which serve as performance and safety benchmarks. These tasks include:

  1. Solving math problems – accuracy measures how often an LLM service produces the correct answer.
  2. Answering sensitive questions – answer rate measures how often an LLM service provides a direct response.
  3. Code generation – the fraction of generated code that is directly executable in a programming environment and passes unit tests.
  4. Visual reasoning – exact match assesses whether the generated visual objects exactly match the ground truth.

In conclusion, the study focuses on GPT-4 and GPT-3.5, evaluates them on the four chosen tasks, and uses both task-specific performance metrics and other common metrics to quantify LLM drift in each scenario, in order to examine how the behavior of these LLMs evolves over time. The findings can help users better understand LLM behavior and apply these models to a variety of applications.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.