Users are reporting that the performance of OpenAI’s ChatGPT has deteriorated, complaining that it has become extremely slow and far less capable of providing accurate answers.
Researchers at Stanford and UC Berkeley found that the chatbot’s performance has worsened over time, making its answers less accurate.
Task-wise performance
They compared the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four primary tasks: solving math problems, answering sensitive questions, generating code and visual reasoning.
In March, GPT-4 could identify prime numbers with a 97.6 per cent accuracy rate.
By June, however, it answered only 12 of the 500 questions correctly, plunging to 2.4 per cent accuracy.
It also performed poorly in generating code.
Cost-cutting
One theory holds that, to offset the high cost of operating these systems, companies like OpenAI are not putting the best versions of their chatbots before the public.
“Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run,” said AI expert Santiago Valderrama.
There has also been speculation that changes made to speed up the service and thereby reduce costs lead to quicker responses but degraded competency.
Side effect of continuous tweaks
GPT-3.5 and GPT-4 are language models that are continuously updated, but OpenAI does not announce many of the changes made to them.
In the paper, the researchers conclude that the behavioral changes are a side effect of unannounced updates to how the models function.
This leads to a fluctuation in the quality of these models.
“A LLM like GPT-4 can be updated over time based on data and feedback from users as well as design changes.
However, it is currently opaque when and how GPT-3.5 and GPT-4 are updated, and it is unclear how each update affects the behavior of these LLMs”, the researchers write.
Safety compromises quality
Another possibility is that changes introduced to prevent ChatGPT from answering dangerous questions impair its usefulness for other tasks.
The researchers found that the newer version of ChatGPT refused to answer certain sensitive questions.
Jim Fan, senior scientist at Nvidia, wrote on Twitter, “Unfortunately, more safety typically comes at the cost of less usefulness.”
Regular quality checks
The researchers said that companies depending on OpenAI’s services should consider conducting regular quality assessments to monitor for unexpected changes.
In the same vein, some have called for open-source models like Meta’s LLaMA that enable community debugging.
The study stresses the importance of regularly monitoring performance so that problems or issues that arise can be identified and addressed promptly.
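Such monitoring could be automated with a small regression-style test suite run on a schedule. The sketch below is a minimal illustration, not the researchers’ method: `query_model`, the prompts and the threshold are all hypothetical, and the function is stubbed with canned answers so the example runs on its own; in practice it would call the chatbot’s API.

```python
# Minimal sketch of a periodic quality check for an LLM service.
# `query_model` is a hypothetical stand-in for a real API call,
# stubbed here with canned answers so the example is self-contained.

def query_model(prompt: str) -> str:
    # In practice: send `prompt` to the chat API and return its reply.
    canned = {
        "Is 17077 prime? Answer yes or no.": "yes",
        "Is 17078 prime? Answer yes or no.": "no",
    }
    return canned.get(prompt, "no")

def run_quality_check(cases):
    """cases: list of (prompt, expected_answer) pairs.
    Returns the fraction of prompts answered as expected."""
    correct = sum(
        1
        for prompt, expected in cases
        if query_model(prompt).strip().lower() == expected
    )
    return correct / len(cases)

# A fixed "golden set" of prompts with known answers (17077 is prime).
GOLDEN_CASES = [
    ("Is 17077 prime? Answer yes or no.", "yes"),
    ("Is 17078 prime? Answer yes or no.", "no"),
]

accuracy = run_quality_check(GOLDEN_CASES)
# Alert (or fail a scheduled CI job) if accuracy drops below a chosen bar.
assert accuracy >= 0.9, f"model quality regressed: accuracy={accuracy:.1%}"
```

Running the same fixed prompts against each model version and tracking the accuracy over time is exactly the kind of check that would surface an unannounced update, such as the drop in prime-identification accuracy described above.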