Enhancing Machine-Generated Text Detection: Adversarial Fine-Tuning of Pre-Trained Language Models
Research/NLP_Paper · 2024. 11. 10. 22:17

    ※ Research topic of a 2024 NLP class team project


    Abstract

    Advances in large language models (LLMs) have revolutionized the field of natural language processing. However, text generated by LLMs can cause various problems, such as fake news, misinformation, and social media spam. In addition, detecting machine-generated text is becoming increasingly difficult because LLMs produce text that closely resembles human writing. We propose a new method for effectively detecting machine-generated text by applying adversarial training (AT) to pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT). We generate adversarial examples that appear to have been modified by humans and use them to train the PLMs, improving the models' detection capabilities. The proposed method was validated on various datasets and experiments. It showed improved performance compared with traditional fine-tuning methods, reducing the probability of misclassifying machine-generated text by about 10% on average. We also demonstrated the robustness of the model for inputs of different token lengths and under different training data ratios, and we suggest future research directions for applying AT to other languages and language model types. This study opens new possibilities for applying AT to the problem of machine-generated text detection and classification and contributes to building more effective detection models.



    Introduction

    Large language models (LLMs) have had a significant impact on the field of natural language processing [1]. LLMs such as ChatGPT, Llama2, PaLM2, and GPT-4 [2], [3], [4], [5] can perform various tasks, such as summarizing documents, translating, and answering complex questions across multiple domains. They are also used to improve the efficiency of daily work and life activities, such as content creation, programming code analysis, and writing.

     

    Recent research indicates that text generated by LLMs can result in various issues, such as fake news, misinformation, and social media spam [6], [7]. Additionally, when students use LLMs to write assignments or essays, it can adversely affect their critical thinking and problem-solving abilities and reduce their motivation to learn [8]. LLMs can lead to issues such as plagiarized papers [9], manipulated public opinion [10], and malicious product reviews [11], all of which are recognized as important social issues.

     

    Research on solving the problems caused by LLM-generated text and effectively distinguishing between machine and human-generated text is becoming increasingly important. The popularity of LLMs has led to a significant increase in the amount of text they generate, making it difficult for humans to differentiate between machine-generated and human-written text [12]. Therefore, there is a need for research on how to distinguish between the two automatically.

     

    To address this issue, a method has been proposed for detecting machine-generated text by training the Robustly Optimized BERT Pretraining Approach (RoBERTa) model [13], [14]. More recently, a zero-shot method was proposed that detects machine-generated text by modifying it with the T5 [15] model and comparing the modified text with the original [16]. However, supervised learning-based detection methods are vulnerable to attacks based on text variations, such as paraphrasing, and research is being conducted to address this issue [17].

     

    Although generating adversarial examples is important, there is a scarcity of research focused on creating adversarial samples that resemble human modifications and applying them to detect machine-generated texts. The proposed method, inspired by DetectGPT [16], efficiently generates adversarial examples that resemble human modifications. We then apply adversarial training (AT) to language models such as Bidirectional Encoder Representations from Transformers (BERT) [18] to detect machine-generated text.

     

    The main contributions of this study are summarized as follows: 1) We propose a novel method that generates adversarial examples that appear to have been modified by humans and applies AT to language models such as BERT. Compared with fine-tuning the BERT model, the proposed method reduced the probability of misclassifying machine-generated text by about 10% on average and showed improved accuracy and F1 score, demonstrating the superiority of the proposed methodology over existing methods. 2) Through experiments, we showed that detection performance decreases by approximately 2% as the length of the generated input sentences increases. This empirically shows that the more human-written text there is, the harder it is to detect machine-generated text.

     

    The remainder of this study is organized as follows. Section II presents research on machine-generated text detection and AT. Section III explains how to fine-tune a pre-trained language model and describes the proposed AT-based methodology. Section IV describes the datasets and hyperparameters used in the experiments and discusses the results. Finally, Section V presents the conclusions of this study and future research directions.


    Related Work

    A. Machine Generated Text Detection

    The emergence of various models [2], [3], [4], [5], driven by the development of LLMs, has significantly enhanced the ability to generate text across various domains. However, the generated texts are similar to those written by humans, making them difficult to distinguish [19].

     

    The detection of machine-generated text is similar to text classification [20], [21]. One approach is supervised learning, in which a classification model is trained to detect machine-generated text [22], [23], [24]. For instance, the Grover model detects machine-generated fake news by adding a linear layer to a language model trained for fake news generation [25], and a classification model for detecting machine-generated text was built by fine-tuning RoBERTa [13], [14]. However, these models can be vulnerable to attacks based on textual variations, such as paraphrasing, which can degrade detection performance. To address this issue, research has been conducted on classification models that use AT [17].

     

    Statistics-based methods are another important approach for identifying machine-generated text. The Giant Language Model Test Room (GLTR) [12] assists humans in detecting machine-generated text by analyzing word probability, rank, and entropy. In addition, a zero-shot detection method that uses the log probability of sentence tokens was proposed to identify machine-generated text [14]. DetectGPT [16] is an efficient zero-shot detection method that perturbs the input text using a language model such as T5 [15] and compares the log probabilities of the original and perturbed texts. Subsequent works, such as DetectLLM [21] and Fast-DetectGPT [26], improved upon the method proposed by DetectGPT.


    B. Adversarial Training

    AT involves creating adversarial examples that induce errors in the model and training the model on both these and the original examples [27]. This approach enhances the robustness of the model against adversarial attacks. AT has been studied in various fields, including image classification [28], [29], recommendation systems [30], and image generation [31], [32].

     

    AT was extended to the text domain by applying perturbations from the Fast Gradient Sign Method (FGSM) [33] to word embeddings instead of the raw input. In addition, AT has been applied to pre-trained language models, which learn the structure and context of language from large text corpora and have shown excellent performance in various natural language processing tasks, such as natural language understanding, machine translation, and speech processing. In particular, following the emergence of models such as BERT [18], XLNet [34], RoBERTa [13], T5 [15], and ELECTRA [35], research has applied AT to the fine-tuning process of pre-trained language models. Specifically, AT methods that perturb the text embedding layer have been employed to enhance text classification performance [36]. AT has also been applied to the pre-training of language models [37], and fine-tuning methods incorporating both AT and regularization have been used to improve the generalization performance of pre-trained language models [38], [39]. Moreover, empirical evidence demonstrates the effectiveness of AT when applied to a BERT model [40].
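    As an illustration of the embedding-level perturbation described above, the following is a minimal sketch of an FGSM-style adversarial loss computed on word embeddings, assuming a Hugging Face style classifier that accepts inputs_embeds; the function name, epsilon value, and other details are illustrative assumptions, not the exact setup of the cited works or of this paper's method.

```python
# Illustrative FGSM-style perturbation on word embeddings (not this paper's method).
import torch
import torch.nn.functional as F

def embedding_fgsm_losses(model, embeddings, labels, epsilon=1e-2):
    """Return the clean loss and the loss on FGSM-perturbed embeddings.

    `embeddings` would typically come from the model's embedding layer,
    e.g. model.get_input_embeddings()(input_ids), for a Hugging Face
    classifier that accepts `inputs_embeds` (an assumption here).
    """
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(inputs_embeds=embeddings).logits, labels)
    # Gradient of the loss w.r.t. the embeddings gives the perturbation direction.
    (grad,) = torch.autograd.grad(clean_loss, embeddings)
    adv_embeddings = (embeddings + epsilon * grad.sign()).detach()
    adv_loss = F.cross_entropy(model(inputs_embeds=adv_embeddings).logits, labels)
    return clean_loss, adv_loss
```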

     

    AT has also been used to detect machine-generated text, for example in the RADAR [17] method, which is robust against attacks such as paraphrasing. Furthermore, AT has been proposed for detecting fake news [41]. This study differs from previous work in that it generates adversarial examples that resemble text modified by humans.


    Proposed Method

    This study aims to apply AT to pre-trained language models, such as BERT, to develop detection models that can effectively identify machine-generated text. In this section, we describe the process of fine-tuning a pre-trained language model such as BERT for a text classification task, and then present the proposed method, which applies AT to language models by generating adversarial examples. Figure 1 shows an overview of the proposed method.

    FIGURE 1. Overview of the proposed method. Our proposed model architecture aims to fine-tune pre-trained language models for machine-generated text classification tasks. We use the T5 model to perturb the input text by masking and filling to generate adversarial examples. The model is then trained using Cross Entropy Loss on both the original and adversarial examples.


    A. Pre-Trained Language Model Fine-Tuning
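    As a reference point for the fine-tuning baseline described in this section, the following is a minimal sketch of standard fine-tuning of a pre-trained BERT classifier for binary (human vs. machine) text classification, assuming the Hugging Face transformers library; the checkpoint, learning rate, and helper function are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of standard BERT fine-tuning for binary text classification
# (human-written = 0, machine-generated = 1); checkpoint and hyperparameters
# are illustrative assumptions.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def fine_tune_step(texts, labels):
    """One gradient step on a batch of (text, label) pairs using cross-entropy loss."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))  # loss is cross-entropy
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```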


    B. Adversarial Training for Machine Generated Text Detector

    AT is a training method that creates adversarial examples x_i + r by adding small, imperceptible perturbations r to an input x_i so as to cause misclassification, and then trains the model on both the original examples x_i and the adversarial examples x_i + r. The advantage of this method is that it yields a robust model. Therefore, we propose fine-tuning a pre-trained language model for machine-generated text classification with an AT approach that uses both original and adversarial examples.

    The classification of machine-generated text can be viewed as a binary text classification problem. Generated sentences are typically modified from the original content, which can be considered perturbations to the input. We propose a new approach for fine-tuning pre-trained language models, such as BERT, to classify machine-generated text. This approach enhances detection performance by leveraging the nature of AT, enabling the model to identify both original and modified text. In addition, this approach uses mask-filling language models such as T5 to generate adversarial examples for AT.
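    The following is a minimal sketch of how such T5 mask-and-fill perturbation could be implemented to produce an adversarial example, assuming the Hugging Face transformers library and the t5-base checkpoint; the masking ratio, span selection, and decoding settings are illustrative assumptions and not necessarily the paper's exact procedure.

```python
# Sketch: perturb an input text by masking random words with T5 sentinel tokens
# and letting T5 fill them in, yielding a human-edit-like adversarial example.
import random
import re
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tok = T5TokenizerFast.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")

def perturb(text, mask_ratio=0.15):
    """Return a copy of `text` with a fraction of words replaced by T5 fills."""
    words = text.split()
    n_masks = max(1, int(len(words) * mask_ratio))
    positions = sorted(random.sample(range(len(words)), n_masks))
    masked = list(words)
    for i, pos in enumerate(positions):
        masked[pos] = f"<extra_id_{i}>"      # T5 sentinel tokens mark the masked spans
    ids = tok(" ".join(masked), return_tensors="pt").input_ids
    out = t5.generate(ids, max_new_tokens=3 * n_masks, do_sample=True, top_p=0.95)
    decoded = tok.decode(out[0], skip_special_tokens=False)
    # The decoder output interleaves sentinels and fills; recover the fills in order.
    fills = re.split(r"<extra_id_\d+>", decoded)[1:n_masks + 1]
    filled = list(words)
    for pos, fill in zip(positions, fills):
        filled[pos] = fill.replace("</s>", "").replace("<pad>", "").strip()
    return " ".join(filled)
```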

     

    We apply the loss function to both the original and perturbed data during training. The total loss is defined as the weighted sum of the original loss L_original and the perturbed loss L_perturbed, expressed as L_total = L_original + αL_perturbed. In this study, we set the value of α to 0.1.
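    A rough sketch of one training step under this weighted objective is shown below, assuming a Hugging Face style sequence classifier and tokenizer; apart from the loss combination L_total = L_original + αL_perturbed with α = 0.1, the names and details are illustrative assumptions.

```python
# Rough sketch of one AT step with the weighted loss
# L_total = L_original + alpha * L_perturbed (alpha = 0.1 as in the paper);
# the classifier, tokenizer, and perturb() helper are illustrative assumptions.
import torch

ALPHA = 0.1

def adversarial_training_step(model, tokenizer, optimizer, texts, labels, perturb):
    """Train on the original batch and its perturbed (adversarial) counterpart."""
    labels = torch.tensor(labels)
    adv_texts = [perturb(t) for t in texts]            # e.g. the T5 mask-and-fill above
    orig = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    pert = tokenizer(adv_texts, padding=True, truncation=True, return_tensors="pt")
    loss_original = model(**orig, labels=labels).loss   # cross-entropy on original text
    loss_perturbed = model(**pert, labels=labels).loss  # cross-entropy on adversarial text
    loss_total = loss_original + ALPHA * loss_perturbed
    loss_total.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss_total.item()
```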


    Experiment


    Conclusion

    This study presents a novel approach for detecting text produced by large language models and validates its efficacy through experiments. The proposed method applies AT to language models by generating examples that resemble human modifications. It demonstrated an average performance improvement of approximately 10% compared with existing fine-tuning methods and showed robustness under different input token lengths and training data ratios. These results indicate the potential of the new approach for machine-generated text detection and are expected to be widely applied in future research and applications in this area.

     

    However, this study has some limitations. First, because the proposed AT method focuses on the BERT model, comparative experiments with various language models are needed. Second, different techniques and methodologies for generating adversarial examples should be explored, because we only utilized the T5 model. Third, an in-depth analysis of the proposed method using a variety of evaluation metrics and ablation studies is necessary. Finally, this study was limited in linguistic diversity, as only an English dataset was used for the experiments.

     

    Therefore, future research should apply AT methods to different types of language models and propose a generalized approach with validation. In addition, multilingual datasets and a variety of evaluation metrics should be employed to validate the effectiveness of the AT approach. Finally, a crucial area for future research is the application of various recent models and techniques to generate adversarial examples and propose ways to respond to more sophisticated and diverse types of adversarial attacks. These studies are expected to significantly contribute to expanding the scope of research on LLM-generated text detection.


     
