A report by plagiarism detector Copyleaks has revealed that roughly 60% of OpenAI’s GPT-3.5 outputs contain some form of plagiarism.
Copyleaks’ Plagiarism Detection vs. OpenAI’s GPT Models
To assign a “similarity score”, the company used a proprietary scoring method that accounts for identical text, minor alterations, and paraphrasing.
Specializing in AI-based text analysis, Copyleaks offers plagiarism detection tools to businesses as well as schools.
The company has been in this space since well before the advent and proliferation of ChatGPT. ChatGPT launched with GPT-3.5 at its debut, and OpenAI later upgraded it to the far more advanced GPT-4.
According to the latest findings shared by the company, GPT-3.5’s output contained 45.7% identical text, 27.4% minor alterations, and 46.5% paraphrased text. Per the report, a score of 100% indicates entirely non-original content, whereas 0% indicates fully original content.
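Copyleaks has not disclosed how these components are combined, so the sketch below is purely illustrative: it assumes hypothetical weights for identical text, minor alterations, and paraphrasing, and clamps the result to the 0–100 scale described in the report. The function name and weights are assumptions, not Copyleaks’ actual method.

    # Hypothetical sketch of a composite similarity score. Copyleaks' real
    # scoring method is proprietary; the weights below are illustrative only.
    def similarity_score(identical_pct: float, minor_changes_pct: float,
                         paraphrased_pct: float) -> float:
        """Combine three components into a 0-100 score, where 100 means
        entirely non-original text and 0 means fully original text."""
        # Assumed weighting: identical text counts fully, minor alterations
        # and paraphrasing count progressively less.
        weighted = (1.0 * identical_pct
                    + 0.7 * minor_changes_pct
                    + 0.4 * paraphrased_pct)
        # Clamp to the 0-100 range used in the report.
        return max(0.0, min(100.0, weighted))

    # Example using the GPT-3.5 component figures cited above.
    print(similarity_score(45.7, 27.4, 46.5))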
Copyleaks ran its tests across 26 subjects, with roughly 400 words of output per subject. Computer science had the highest similarity score (100%), followed by physics (92%) and psychology (88%), indicating heavier copying. At the opposite end, English language (5.4%), humanities (2.8%), and theatre (0.9%) had among the lowest similarity scores.
Lindsey Held, an OpenAI spokesperson, said: “Our models were designed and trained to learn concepts in order to help them solve new problems.”
Conversational AI Models & Issues Like Memorization and Copyright
Addressing the similarity scores and the issue of memorization in conversational AI models, she said the organization has measures in place to limit inadvertent memorization, and that ChatGPT’s terms of use expressly prohibit intentionally using the model to regurgitate content.
Previously, The New York Times filed a lawsuit against OpenAI alleging copyright infringement through “wide-scale copying.” In response, OpenAI argued that such “regurgitation” is a “rare bug,” adding that it was The New York Times that “manipulated the prompts.”
However, content creators contend that generative AI, the technology underlying these models, is trained on their copyrighted work.