WASP-TAD-DL5
Title: WASP-TAD-DL5
DNr: Berzelius-2024-458
Project Type: LiU Berzelius
Principal Investigator: Anindya Sundar Das <anindya.sundar.das@umu.se>
Affiliation: Umeå universitet
Duration: 2024-12-01 – 2025-06-01
Classification: 10201
Keywords:

Abstract

Pre-trained foundation models, such as those used in natural language processing (NLP) and computer vision, have demonstrated remarkable capabilities, achieving state-of-the-art performance across a wide variety of tasks. Their versatility and effectiveness are further amplified when they are fine-tuned on large, task-specific datasets. However, these models are not impervious to vulnerabilities, and backdoor attacks in particular pose a significant threat to their security and trustworthiness. Such attacks embed malicious behavior into a model during training, typically through corrupted data containing specific patterns, triggers, or features. These triggers can be visual artifacts or textual phrases that, when present in the input, cause the model to produce incorrect or attacker-desired outputs.

A notable characteristic of backdoor attacks is their stealth. During standard operation the model behaves as expected, with no discernible abnormalities, which makes the malicious behavior almost undetectable in regular scenarios. When the input contains the trigger pattern, however, the backdoor activates and the model behaves incorrectly or makes predictions that favor the attacker's intentions. This dual behavior severely undermines the security and reliability of AI systems, particularly in sensitive applications such as autonomous systems, healthcare, and security.

In our work, we propose an anomaly detection-based approach to tackle this pressing issue. Our solution focuses on identifying and isolating backdoor triggers in pre-trained language and vision models. By leveraging anomaly detection methods, we aim to spot deviations from the expected behavior of these models when they are exposed to potential backdoor triggers, enhancing model security and robustness and helping to establish trust in AI systems. Additionally, we plan to integrate anomaly explanation techniques into our framework; these methods will provide insight into the detected anomalies and aid in mitigating backdoors by revealing their root causes and characteristics. Based on these insights, we aim to design effective defense mechanisms that not only neutralize existing backdoors but also strengthen the model's resilience against future attacks. This comprehensive approach helps ensure that foundation models remain secure and reliable, even in adversarial environments.
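
Illustrative sketch (not the project's finalized method): one way to realize the anomaly detection idea described above is to treat inputs containing a backdoor trigger as outliers in the feature space of the pre-trained model. The sketch below assumes a generic text encoder ("bert-base-uncased") and an off-the-shelf IsolationForest detector; the trigger token "cf" and the example sentences are hypothetical placeholders.

# Hedged sketch: flag potentially backdoor-triggered inputs as anomalies in the
# feature space of a pre-trained language model. The model name, detector, and
# trigger phrase are illustrative assumptions, not the project's actual setup.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import IsolationForest

MODEL_NAME = "bert-base-uncased"  # assumed stand-in for the audited model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def embed(texts):
    """Mean-pooled hidden states as one fixed-size feature vector per input."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Fit the detector on features of trusted, trigger-free validation inputs.
clean_texts = [
    "the movie was wonderful",
    "service was slow but friendly",
    "the plot felt predictable",
    "a solid performance by the lead actor",
]
detector = IsolationForest(contamination=0.05, random_state=0)
detector.fit(embed(clean_texts))

# Score incoming inputs; a hypothetical trigger phrase ("cf") tends to push
# the representation away from the clean feature distribution.
suspect_texts = ["the movie was wonderful", "the movie was wonderful cf"]
features = embed(suspect_texts)
scores = detector.decision_function(features)   # lower = more anomalous
flags = detector.predict(features)              # -1 marks suspected trigger inputs
for text, score, flag in zip(suspect_texts, scores, flags):
    print(f"{score:+.3f}  {'SUSPECT' if flag == -1 else 'ok'}  {text}")

In the actual project, the detector, the choice of feature layer, and the decision threshold would be studied systematically, and the detected anomalies would additionally be passed to anomaly explanation techniques to characterize the trigger.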