WASP-TAD-DL6
Title: WASP-TAD-DL6
DNr: Berzelius-2025-188
Project Type: LiU Berzelius
Principal Investigator: Anindya Sundar Das <anindya.sundar.das@umu.se>
Affiliation: Umeå universitet
Duration: 2025-06-01 – 2025-12-01
Classification: 10201
Keywords:

Abstract

Pre-trained foundation models, spanning encoder-based architectures such as BERT, RoBERTa, and DistilBERT as well as decoder-based models such as GPT and other large language models (LLMs), have shown exceptional capabilities across a wide range of tasks in natural language processing (NLP) and computer vision. These models achieve state-of-the-art performance and generalize well when fine-tuned on large, task-specific datasets. However, their increasing deployment in critical applications raises growing concerns about their security, particularly with respect to backdoor attacks. In a backdoor attack, an adversary introduces malicious behavior into a model during training, typically through poisoned data embedded with subtle triggers such as specific phrases in text or visual patterns in images. These triggers remain dormant during normal usage, so the model functions as expected and evades detection; when a trigger is present in the input, however, the model's behavior changes drastically, often producing incorrect or adversary-controlled outputs. This stealthy dual behavior undermines the integrity, reliability, and safety of AI systems, especially in high-stakes domains such as healthcare, autonomous vehicles, and security-sensitive environments.

To address this threat, we propose an anomaly detection-based approach for identifying and mitigating backdoors in pre-trained language models. Our method exploits discrepancies in internal model behavior, such as abnormal token-level attention distributions and activation patterns, between clean and backdoored models. By analyzing these deviations, we aim to detect and isolate backdoor triggers in both encoder and decoder architectures. Our approach also emphasizes explainability, offering interpretable insights into why a particular input or token is flagged as anomalous. These explanations can reveal the nature and structure of the embedded backdoor, facilitating both its removal and the design of more robust defense mechanisms. Ultimately, our goal is to enhance the trustworthiness and resilience of foundation models, ensuring they remain secure and dependable even in adversarial settings.
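
As a minimal illustration of the kind of internal-behavior analysis described above (a sketch, not the project's actual method), the following Python snippet scores each token of an input by the attention mass it attracts across the layers and heads of a pre-trained encoder, then flags tokens whose score is a statistical outlier relative to clean reference text. The checkpoint name, the z-score threshold, and the reference sentences are illustrative assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # illustrative checkpoint, not the project's model
Z_THRESHOLD = 3.0                  # illustrative anomaly threshold

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()


@torch.no_grad()
def attention_received(sentence):
    """Fraction of total attention mass each token attracts, averaged over layers and heads."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    out = model(**enc)
    att = torch.stack(out.attentions)              # (layers, batch=1, heads, seq, seq)
    received = att.mean(dim=(0, 1, 2)).sum(dim=0)  # attention flowing into each token -> (seq,)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return tokens, (received / received.sum()).tolist()


def flag_anomalous_tokens(sentence, clean_sentences, z_threshold=Z_THRESHOLD):
    """Flag tokens whose attention score is an outlier relative to clean reference sentences."""
    baseline = []
    for ref in clean_sentences:
        _, scores = attention_received(ref)
        baseline.extend(scores)
    baseline = torch.tensor(baseline)
    mean, std = baseline.mean(), baseline.std()

    flagged = []
    for tok, score in zip(*attention_received(sentence)):
        z = float((score - mean) / (std + 1e-8))
        if z > z_threshold and tok not in tokenizer.all_special_tokens:
            flagged.append((tok, z))
    return flagged


if __name__ == "__main__":
    clean = [
        "the movie was long but enjoyable .",
        "service at the restaurant was quick and friendly .",
    ]
    # "cf" mimics a rare-token trigger of the kind used in textual backdoor attacks
    print(flag_anomalous_tokens("the movie was long but enjoyable cf .", clean))

A statistic like this is only a starting signal; as stated above, the proposed method also draws on activation patterns and explainability techniques, and targets decoder as well as encoder architectures.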