Plagiarism Detection Using Transformers | by Zoumana Keita | Dec, 2022

By Jessie Hobb On Dec 24, 2022

A complete guide to building a more robust plagiarism detector using transformer-based models.

Plagiarism is one of the biggest issues in many industries, especially in academia. This phenomenon has even worsened with the rise of the internet and open information, where anyone can access any information at a click about a specific topic.

Based on this observation, researchers have been trying to tackle the issue using different text analysis approaches. In this conceptual article, we will try to tackle two main limitations of plagiarism detection tools: (1) content rephrasing plagiarism, and (2) content translation plagiarism.

(1) Rephrased content can be difficult for traditional tools to catch because they do not take synonyms, antonyms, or the overall context into account.

(2) Content written in a language other than the original is also a major challenge, even for the most advanced machine learning-based tools, because the context is shifted entirely into another language.

In this conceptual blog, we will explain how to use transformer-based models to tackle these two challenges in an innovative way. First of all, we will walk you through the analytical approach describing the entire workflow, from data collection to performance analysis. Then, we will deep dive into the scientific/technical implementation of the solution before showing the final results.

Imagine you are interested in building a scholarly content management platform. You might want to accept only articles that have not already been shared on your platform. In this case, your goal will be to reject any new article that is similar to an existing one beyond a certain threshold.

To illustrate this scenario, we will use the CORD-19 dataset, an open research challenge dataset made freely available on Kaggle by the Allen Institute for AI.

Before going further with the analysis, let’s clarify what we are trying to achieve with the following question:

Problem: Can we find within our database one or more documents that are similar (at a certain threshold) to a newly submitted document?

The following workflow highlights all the main steps required to better answer this question.

Plagiarism detection system workflow (Image by Author)

Let’s understand what is happening here 💡.

After collecting the source data, we start by preprocessing the content, then create a vector database from BERT.

Then, whenever we have a new incoming document, we check the language and perform plagiarism detection. More details are given later in the article.

This section focuses on the technical implementation of each step of the analytical approach.

Data preprocessing

We are only interested in the abstract column of the source data and, for simplicity’s sake, we will use only 100 observations to speed up the preprocessing.

source_data_processing.py
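As a rough illustration of this step (not the author's original script), the sketch below keeps only the abstract column and stops after 100 non-empty rows. It uses the standard-library csv module; the function name `load_abstracts` and the in-memory sample are assumptions made for the example, and in practice you would pass an open handle to the dataset's CSV file instead.

```python
import csv
import io
from typing import List, TextIO


def load_abstracts(csv_file: TextIO, column: str = "abstract", limit: int = 100) -> List[str]:
    """Return up to `limit` non-empty, whitespace-trimmed values from `column`."""
    abstracts: List[str] = []
    for row in csv.DictReader(csv_file):
        if len(abstracts) >= limit:
            break
        # Missing or empty cells are skipped rather than kept as blanks.
        text = (row.get(column) or "").strip()
        if text:
            abstracts.append(text)
    return abstracts


# Tiny in-memory stand-in for the real source file (hypothetical rows).
sample = io.StringIO("title,abstract\nA,First abstract.\nB,\nC,  Third abstract.  \n")
print(load_abstracts(sample, limit=2))
```

Dropping empty abstracts before counting means the limit applies to usable documents, not to raw rows.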

Below are five random observations from the source data set.

Five random observations from the source data (Image by Author)

Document vectorizer

Focus on BERT and Machine Translation models (Image by Author)

The two challenges described in the introduction lead us to choose the following two transformer-based models, respectively:

(1) A BERT model: to solve the first limitation because it provides a better contextual representation of textual information. To do so, we will define two functions:

create_vector_from_text: used to generate the vector representation of a single document.
create_vector_database: responsible for creating a database containing for each document the corresponding vector.

bert_model_vectors.py
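The two functions can be sketched roughly as follows. This is a minimal version under stated assumptions, not the author's code: the `bert-base-uncased` checkpoint, mean pooling over the last hidden layer, and the injectable `embed` parameter are all choices made for this example. The heavy imports are deferred so that the database-building logic can be tried with any stand-in embedding function.

```python
from functools import lru_cache
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@lru_cache(maxsize=2)
def _load_bert(model_name: str):
    # Imported lazily so the rest of the sketch runs without these packages.
    from transformers import AutoModel, AutoTokenizer
    return AutoTokenizer.from_pretrained(model_name), AutoModel.from_pretrained(model_name)


def bert_embed(text: str, model_name: str = "bert-base-uncased") -> Vector:
    """Mean-pool BERT's last hidden layer into one fixed-size vector (assumed pooling)."""
    import torch
    tokenizer, model = _load_bert(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, hidden_size)
    return hidden.mean(dim=1).squeeze(0).tolist()


def create_vector_from_text(text: str, embed: Callable[[str], Sequence[float]] = bert_embed) -> Vector:
    """Vector representation of a single document."""
    return list(embed(text))


def create_vector_database(documents: Sequence[str],
                           embed: Callable[[str], Sequence[float]] = bert_embed) -> List[Dict[str, object]]:
    """One record per document, with its text and its vector."""
    return [{"abstract": doc, "vectors": create_vector_from_text(doc, embed)} for doc in documents]


# Stand-in embedding (text length and space count) so the demo needs no model download.
toy_embed = lambda text: [float(len(text)), float(text.count(" "))]
print(create_vector_database(["one two", "three"], embed=toy_embed))
```

Calling `create_vector_database(abstracts)` without the `embed` argument would use the BERT-based function, provided torch and transformers are installed.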

Line 94 shows five random observations from the vector database, with the new vectors column.

Five random articles from the vector database (Image by Author)

(2) A Machine Translation transformer model: to translate the incoming document into English, because the source documents are in English in our case. The translation is performed only if the document’s language is one of the following five: German, French, Japanese, Greek, and Russian. Below is the helper function that implements this logic using the MarianMT model.

document_translation.py
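The routing logic described above might look like the sketch below. Several parts are assumptions rather than the author's implementation: the function names, the `Helsinki-NLP/opus-mt-{lang}-en` checkpoint naming pattern (check that a checkpoint exists on the Hugging Face hub for each language before relying on it), and the decision to raise an error for unsupported languages. The language detector is passed in as a function so that any detection library can be plugged in.

```python
from typing import Callable

# ISO 639-1 codes for German, French, Japanese, Greek, and Russian.
SUPPORTED_LANGUAGES = {"de", "fr", "ja", "el", "ru"}


def marian_translate(text: str, source_lang: str) -> str:
    """Translate `text` into English with a MarianMT checkpoint."""
    # Imported lazily so the routing logic below runs without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    # Assumed naming pattern; verify the checkpoint exists for your language.
    model_name = f"Helsinki-NLP/opus-mt-{source_lang}-en"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)


def to_english(text: str,
               detect: Callable[[str], str],
               translate: Callable[[str, str], str] = marian_translate) -> str:
    """Return `text` unchanged if it is English, translated if supported."""
    lang = detect(text)
    if lang == "en":
        return text
    if lang in SUPPORTED_LANGUAGES:
        return translate(text, lang)
    # Assumed behavior: reject anything outside the five supported languages.
    raise ValueError(f"Unsupported language: {lang}")


# Demo with stand-in detector and translator (no model download needed).
fake_detect = lambda text: "de" if "ist" in text.split() else "en"
fake_translate = lambda text, lang: f"[{lang}->en] {text}"
print(to_english("Das ist ein Test", fake_detect, fake_translate))
print(to_english("This is a test", fake_detect, fake_translate))
```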

Plagiarism analyzer

There is plagiarism when the incoming document’s vector is similar to one of the database vectors at a certain threshold level.

But when are two vectors similar?
→ When they have the same magnitude and the same direction.

This definition requires our vectors to have the same magnitude, which can be an issue because the magnitude of a document vector can vary from one document to another, for example with the length of the document. Luckily, several similarity measures overcome this issue. One of them is cosine similarity, which compares only the direction of two vectors and is the measure we will use.
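To make the point about magnitude concrete, here is a small pure-Python sketch of cosine similarity, the dot product of two vectors divided by the product of their norms. The function name and the example vectors are chosen for illustration only. Two vectors that point in the same direction score 1 no matter how different their magnitudes are, while orthogonal vectors score 0.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between `a` and `b`; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]  # same direction as a, ten times the magnitude
c = [-3.0, 0.0, 1.0]    # orthogonal to a (dot product is zero)

print(round(cosine_similarity(a, b), 6))  # 1.0
print(round(cosine_similarity(a, c), 6))  # 0.0
```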

If you are interested in other approaches, you can refer to this amazing content by James Briggs. He explains how each approach works and what its benefits are, and guides you through the implementation.

The plagiarism analysis is performed using the run_plagiarism_analysis function. We start by checking the document language using the check_incoming_document function to perform the right translation when required.

The final result is a dictionary with four main values:

similarity_score: the score between the incoming article and the most similar existing article in the database.
is_plagiarism: the value is true if the similarity score is equal to or above the threshold. It is false otherwise.
most_similar_article: the textual information of the most similar article.
article_submitted: the article that was submitted for approval.

plagiarism_analysis.py
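A minimal sketch of the analysis step, assuming the incoming article is already in English and that the database is a list of records with `abstract` and `vectors` keys. The signature, the default threshold of 0.8, and the toy word-count embedding are assumptions made for this example, not values taken from the original code. The returned dictionary has the four keys listed above.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def run_plagiarism_analysis(article: str,
                            database: List[Dict[str, object]],
                            embed: Callable[[str], Sequence[float]],
                            threshold: float = 0.8) -> Dict[str, object]:
    """Compare `article` with every stored vector and report the best match."""
    if not database:
        raise ValueError("The vector database is empty")
    query = embed(article)
    # Score every stored document, then keep the highest-scoring one.
    scored = [(cosine_similarity(query, row["vectors"]), row) for row in database]
    score, best = max(scored, key=lambda pair: pair[0])
    return {
        "similarity_score": score,
        "is_plagiarism": score >= threshold,
        "most_similar_article": best["abstract"],
        "article_submitted": article,
    }


# Toy embedding: word counts over a tiny fixed vocabulary (stand-in for BERT).
VOCAB = ["rna", "thermometer", "health", "research"]
toy_embed = lambda text: [float(text.lower().split().count(w)) for w in VOCAB]

docs = ["rna thermometer rna", "health research"]
database = [{"abstract": d, "vectors": toy_embed(d)} for d in docs]

print(run_plagiarism_analysis("rna thermometer", database, toy_embed))
```

In a full pipeline, the translation step would run before `embed` is called, so that every comparison happens between English texts.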

We have covered and implemented all the components of the workflow. Now it is time to test our system on articles written in three languages: English, French, and German.

Candidate articles and their submission evaluation

These are the abstracts of the articles we want to check for plagiarism.

English article

This article is actually an example from the source data.

english_article_to_check = "The need for multidisciplinary research to address today's complex health and environmental challenges has never been greater. The One Health (OH) approach to research ensures that human, animal, and environmental health questions are evaluated in an integrated and holistic manner to provide a more comprehensive understanding of the problem and potential solutions than would be possible with siloed approaches. However, the OH approach is complex, and there is limited guidance available for investigators regarding the practical design and implementation of OH research. In this paper we provide a framework to guide researchers through conceptualizing and planning an OH study. We discuss key steps in designing an OH study, including conceptualization of hypotheses and study aims, identification of collaborators for a multi-disciplinary research team, study design options, data sources and collection methods, and analytical methods. We illustrate these concepts through the presentation of a case study of health impacts associated with land application of biosolids. Finally, we discuss opportunities for applying an OH approach to identify solutions to current global health issues, and the need for cross-disciplinary funding sources to foster an OH approach to research."

100_percent_similarity.py

Result of the plagiarism detector on the copy-pasted article (Image by Author)

After running the system, we get a similarity score of 1, which is a 100% match with an existing article. This is expected because we took exactly the same article from the database.

French article

This article is freely available from the French agriculture website.

french_article_to_check = """Les Réseaux d’Innovation et de Transfert Agricole (RITA) ont été créés en 2011 pour mieux connecter la recherche et le développement agricole, intra et inter-DOM, avec un objectif d’accompagnement de la diversification des productions locales. Le CGAAER a été chargé d'analyser ce dispositif et de proposer des pistes d'action pour améliorer la chaine Recherche – Formation – Innovation – Développement – Transfert dans les outre-mer dans un contexte d'agriculture durable, au profit de l'accroissement de l'autonomie alimentaire."""

plagiarism_analysis_french_article.py

Result of the plagiarism detector on French article (Image by Author)

There is no plagiarism in this situation because the similarity score is less than the threshold.

German article

Let’s imagine that someone really liked the fifth article in the database and decided to translate it into German. Now let’s see how the system will judge that article.

german_article_to_check = """Derzeit ist eine Reihe strukturell und funktionell unterschiedlicher temperaturempfindlicher Elemente wie RNA-Thermometer bekannt, die eine Vielzahl biologischer Prozesse in Bakterien, einschließlich der Virulenz, steuern. Auf der Grundlage einer Computer- und thermodynamischen Analyse der vollständig sequenzierten Genome von 25 Salmonella enterica-Isolaten wurden ein Algorithmus und Kriterien für die Suche nach potenziellen RNA-Thermometern entwickelt. Er wird es ermöglichen, die Suche nach potentiellen Riboschaltern im Genom anderer gesellschaftlich wichtiger Krankheitserreger durchzuführen. Für S. enterica wurden neben dem bekannten 4U-RNA-Thermometer vier Hairpin-Loop-Strukturen identifiziert, die wahrscheinlich als weitere RNA-Thermometer fungieren. Sie erfüllen die notwendigen und hinreichenden Bedingungen für die Bildung von RNA-Thermometern und sind hochkonservative nichtkanonische Strukturen, da diese hochkonservativen Strukturen im Genom aller 25 Isolate von S. enterica gefunden wurden. Die Hairpins, die eine kreuzförmige Struktur in der supergewickelten pUC8-DNA bilden, wurden mit Hilfe der Rasterkraftmikroskopie sichtbar gemacht."""

plagiarism_analysis_german_article.py

Result of the plagiarism detector on German article (Image by Author)

The model captured a 97% similarity, which is quite impressive. This article is definitely a case of plagiarism.

Congratulations, now you have all the tools to build a more robust plagiarism detection system, using BERT and Machine Translation models combined with Cosine Similarity.

MarianMT model from HuggingFace

Source code of the article

Allen Institute for AI