Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code

doi:10.48550/arXiv.2312.04004

Large language models (LLMs) are becoming an integral part of software development. These models are trained on large datasets of code, where it is hard to verify each data point. A potential attack surface is therefore the training data itself: an attacker can inject poisoned samples to make a model vulnerable, that is, trojaned. Such an attack poses a significant threat because it hides manipulative behavior inside the model and compromises the model's integrity in downstream tasks. In this paper, we propose an occlusion-based human-in-the-loop technique, OSeql, to distinguish trojan-triggering inputs of code. The technique is based on the observation that trojaned neural models of code rely heavily on the triggering part of the input; hence, removing that part changes the model's confidence in its prediction substantially. Our results suggest that OSeql can detect the triggering inputs with almost 100% recall. We discuss the problem of false positives and how to address them. These results provide a baseline for future studies in this field.
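The occlusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' OSeql implementation: the line-level granularity, the function names `occlusion_scores` and `flag_suspicious`, the threshold value, and the toy model `toy_confidence` are all assumptions made for illustration. The sketch removes one line of the input at a time, measures how much the model's confidence changes, and reports the lines whose removal causes a large change so that a human can inspect them.

```python
from typing import Callable, List, Tuple

Confidence = Callable[[str], float]


def occlusion_scores(lines: List[str], confidence: Confidence) -> List[Tuple[int, float]]:
    """Return (line index, confidence change) for each single-line occlusion."""
    base = confidence("\n".join(lines))
    scores = []
    for i in range(len(lines)):
        # Occlude line i by dropping it from the input.
        occluded = lines[:i] + lines[i + 1:]
        change = base - confidence("\n".join(occluded))
        scores.append((i, change))
    return scores


def flag_suspicious(code: str, confidence: Confidence,
                    threshold: float = 0.5) -> List[Tuple[int, str, float]]:
    """List lines whose removal shifts confidence by at least `threshold`.

    The threshold is an assumed value; flagged lines are candidates
    for human review, not a final verdict.
    """
    lines = code.splitlines()
    return [(i, lines[i], change)
            for i, change in occlusion_scores(lines, confidence)
            if abs(change) >= threshold]


# Hypothetical stand-in for a trojaned model: it is highly confident
# only when an assumed dead-code trigger is present in the input.
TRIGGER = "assert 1 == 1"


def toy_confidence(code: str) -> float:
    return 0.95 if TRIGGER in code else 0.40


poisoned = "def add(a, b):\n    assert 1 == 1\n    return a + b"
clean = "def add(a, b):\n    return a + b"

flagged = flag_suspicious(poisoned, toy_confidence)
flagged_clean = flag_suspicious(clean, toy_confidence)
```

With the toy model, only the line containing the assumed trigger is flagged in `poisoned`, and nothing is flagged in `clean`. A real model of code would produce noisier scores, which is one way false positives of the kind the abstract mentions could arise.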

Publication:

arXiv e-prints

Pub Date:

December 2023

DOI:

10.48550/arXiv.2312.04004

arXiv:

arXiv:2312.04004

Bibcode:

2023arXiv231204004H

Keywords:

Computer Science - Software Engineering
