Reports and discussions of a decline in the quality of ChatGPT's responses have been growing. To investigate, a team of researchers from Stanford and UC Berkeley conducted a study to quantify the degradation, and the study confirmed that the drop in ChatGPT's quality was real.
The research paper, titled "How Is ChatGPT's Behavior Changing Over Time?", was authored by three academics: Lingjiao Chen, Matei Zaharia, and James Zou. Zaharia, a Computer Science professor at UC Berkeley, shared the findings on Twitter, highlighting a startling result: GPT-4's success rate at identifying prime numbers fell drastically from 97.6% in March to 2.4% in June.
GPT-4, recently released and acclaimed as OpenAI's most advanced model, had been eagerly anticipated by developers for its potential to power innovative AI products. The study's results, however, showed disappointing performance, especially on straightforward queries.
The research team designed tasks to evaluate the quality of responses from two large language models (LLMs), GPT-4 and GPT-3.5: solving math problems, answering sensitive questions, generating code, and visual reasoning. A chart in the paper summarized the performance of both models across their March and June 2023 releases.
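The paper's approach of sending fixed prompts to a model and scoring the answers against ground truth can be sketched in miniature. Below, `query_model` is a hypothetical stand-in for a call to a pinned model snapshot (not an actual OpenAI client); the primality task mirrors the benchmark behind the 97.6%-to-2.4% headline figure.

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality check used to score model answers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def score_primality_answers(numbers, query_model) -> float:
    """Fraction of numbers for which the model's yes/no answer matches
    the ground truth. `query_model` takes a prompt string and returns
    the model's reply (hypothetical; one API call per prompt in practice)."""
    correct = 0
    for n in numbers:
        answer = query_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = answer.strip().lower().startswith("yes")
        if model_says_prime == is_prime(n):
            correct += 1
    return correct / len(numbers)

# Toy responder that always answers "yes" -- mirroring the failure mode
# where a model stops reasoning and defaults to a single answer.
always_yes = lambda prompt: "yes"
accuracy = score_primality_answers([7, 10, 13, 15], always_yes)  # 2 of 4 correct
```

Running the same fixed benchmark against each dated snapshot of a service is what makes the March-versus-June comparison possible at all.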
The data clearly showed that the same LLM service gave different answers over time, with significant performance differences within this short period. How these LLMs are updated remains unclear, as does whether changes that improve one aspect of performance can degrade others. Notably, the June version of GPT-4 performed worse than the March version in three testing categories, improving only slightly in visual reasoning.
While some may not be concerned about variable quality across the "same versions" of these LLMs, both GPT-4 and GPT-3.5 have been widely adopted by individual users and businesses thanks to ChatGPT's popularity, so the information these models generate can significantly affect people's lives.
The researchers intend to continue assessing GPT versions in a longer-term study. They suggest that OpenAI monitor and publish regular quality checks for its paying customers; otherwise, business or governmental organizations may need to track basic quality metrics for these LLMs themselves to avoid commercial and research harms.
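The kind of monitoring suggested here could be as simple as re-running a fixed benchmark on each model release and flagging regressions. A minimal sketch follows; the "math" numbers are the paper's headline result, while the other task names and scores are purely illustrative.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Compare per-task accuracy between two evaluation runs and return
    the tasks whose accuracy dropped by more than `tolerance`, mapped to
    their (baseline, current) scores."""
    return {
        task: (baseline[task], current[task])
        for task in baseline
        if task in current and baseline[task] - current[task] > tolerance
    }

# Scores for two snapshots of the same service. Only "math" reflects the
# paper's reported 97.6% -> 2.4% drop; the rest are hypothetical.
march = {"math": 0.976, "code": 0.60, "visual": 0.25}
june = {"math": 0.024, "code": 0.30, "visual": 0.27}

regressions = find_regressions(march, june)
# Flags "math" and "code"; "visual" improved, so it is not reported.
```

A watchdog like this, run on a schedule by a customer or a third party, would surface silent behavior changes without any cooperation from the model provider.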
The AI and LLM technology domain has had its share of surprising issues, and with data privacy concerns and other public relations challenges, it currently seems like the “wild west” frontier of connected life and commerce.