In recent months, there has been growing dissatisfaction on the web with the declining quality of ChatGPT responses. A group of researchers from Stanford University and UC Berkeley set out to determine whether the output of the GPT-4 language model has actually degraded. The study confirmed the impression: for example, the model's accuracy on the question “Is this number prime?” fell from 97.6% in March to 2.4% in June.
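For illustration, the kind of check behind that prime-number statistic can be sketched in a few lines of Python. This is only a minimal sketch, not the researchers' published code: query_model is a hypothetical placeholder for whatever API call sends a prompt to the model under test, and sympy supplies the ground-truth answer.

```python
from sympy import isprime

def query_model(prompt: str) -> str:
    """Placeholder: return the model's raw text answer to `prompt`.

    Hypothetical helper; wire this to the chat API of the model being tested.
    """
    raise NotImplementedError("connect this to the model under test")

def evaluate_prime_task(numbers: list[int]) -> float:
    """Ask 'Is N a prime number?' for each N and return the fraction answered correctly."""
    correct = 0
    for n in numbers:
        answer = query_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = answer.strip().lower().startswith("yes")
        if model_says_prime == isprime(n):
            correct += 1
    return correct / len(numbers)

# Example usage (once query_model is implemented):
# accuracy = evaluate_prime_task([17077, 10733, 20261])
# print(f"accuracy: {accuracy:.1%}")
```

Running the same scripted battery of questions in March and again in June is what makes the month-to-month accuracy figures directly comparable.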
The research team designed tasks to measure qualitative aspects of ChatGPT's performance with the GPT-4 and GPT-3.5 models. The chatbot was tested on the following criteria:
- Solving mathematical problems
- Answering sensitive questions
- Code generation
- Visual perception
The comparative results are presented in the diagram below:
In June, GPT-4 performed worse on all tasks (except the visual one) than it had in March. Over the same period, GPT-3.5 improved on three of the four tasks (the exception being code generation, where it got worse). Comparing the June results of GPT-4 and GPT-3.5 directly, GPT-4 does better on half of the tasks and worse on the other half.
The experiment clearly demonstrated that the same language model can handle tasks worse over time and give entirely different answers. The open questions are what exactly causes the answers to degrade, and whether changes intended to improve the model in one respect can disrupt its behaviour in another.
The researchers note that because ChatGPT, whether backed by GPT-4 or GPT-3.5, has become widespread among individual users and companies, the results of its use can already affect everyone's life. The scientists plan to study the issue in more detail.
Another recent study, from Stanford University, found that generative AI “goes crazy” after five iterations of training on AI-generated materials: model performance degrades with each round of training on generated inputs.
Source: Tom’s Hardware