Publications

Scanning Trojaned Models Using Out-of-Distribution Samples

Published in NeurIPS, 2024

In this work, we introduce TRODO, a new method for detecting backdoor (trojan) attacks in deep neural networks. TRODO identifies trojaned classifiers by adversarially shifting out-of-distribution (OOD) samples toward the in-distribution (ID) region and detecting when the classifier mistakenly accepts them as ID. The approach requires no access to the training data, remains effective against adversarially trained trojaned classifiers, and adapts across different attack scenarios and datasets.
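
Below is a minimal sketch of the core idea, not the paper's official implementation: OOD samples are perturbed with a PGD-style attack to maximize an ID-confidence proxy (here, the maximum softmax probability), and the model is scored by how many perturbed samples cross an ID-confidence threshold. The step sizes, epsilon budget, threshold, and choice of confidence measure are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def shift_ood_toward_id(model, x_ood, eps=8/255, alpha=2/255, steps=10):
    """PGD-style perturbation pushing OOD inputs toward high ID confidence."""
    x_adv = x_ood.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        # Maximize the maximum softmax probability (an illustrative ID-ness proxy).
        conf = F.softmax(logits, dim=1).max(dim=1).values
        grad = torch.autograd.grad(conf.sum(), x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()          # ascend on confidence
        x_adv = x_ood + (x_adv - x_ood).clamp(-eps, eps)       # stay in the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def trojan_suspicion_score(model, ood_loader, conf_threshold=0.9, device="cpu"):
    """Fraction of shifted OOD samples the classifier now treats as ID."""
    model.eval().to(device)
    flipped, total = 0, 0
    for x_ood, _ in ood_loader:
        x_ood = x_ood.to(device)
        x_adv = shift_ood_toward_id(model, x_ood)
        with torch.no_grad():
            conf = F.softmax(model(x_adv), dim=1).max(dim=1).values
        flipped += (conf > conf_threshold).sum().item()
        total += x_ood.size(0)
    return flipped / total  # higher => more suspicious of a trojan
```

In this sketch, a higher score means the classifier has larger "blind spots" around the data manifold, which is the kind of signal TRODO uses to flag a model as trojaned; the exact scoring rule and thresholds in the paper may differ.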