BadBone — Dormant AI Model Backdoor Activates Only After Prompt-Learning Customisation, Evades Six Published Defences (arXiv 2605.31246)

Technical description

Researchers have published BadBone, a backdoor attack that uses bi-level optimisation to plant malicious behaviour in a pre-trained backbone model (e.g. ViT, CLIP). The backdoor fires only when two conditions co-occur: the victim has adapted the model using prompt learning, and the input contains a specific trigger. Unless both conditions hold, the poisoned model is behaviourally indistinguishable from a clean one (attack success rate 0.10%). Once prompt-learning customisation is complete and the trigger is present, the attack success rate approaches 99%. Six published defences (Neural Cleanse, ABS, MNTD, NAD, CLP and D-BR) failed to reliably detect or remove the backdoor, because they evaluate the model in its pre-customisation (dormant) state. The attacker does not need the victim's training data; a surrogate dataset with similar content suffices.
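
The 0.10% and ~99% figures are attack success rates (ASR). The sketch below uses a common definition of ASR, assumed here rather than taken from the paper: among trigger-bearing inputs whose true label is not the attacker's target, the fraction the model assigns to the target class. The function name and the example numbers are hypothetical.

```python
from typing import Sequence


def attack_success_rate(
    true_labels: Sequence[int],
    preds_on_triggered: Sequence[int],
    target_class: int,
) -> float:
    """Share of trigger-bearing inputs that the model assigns to the target class.

    Inputs whose true label already equals the target class are skipped,
    because predicting the target for them says nothing about a backdoor.
    """
    if len(true_labels) != len(preds_on_triggered):
        raise ValueError("labels and predictions must have the same length")
    pairs = [
        (y, p) for y, p in zip(true_labels, preds_on_triggered) if y != target_class
    ]
    if not pairs:
        raise ValueError("no non-target inputs to evaluate")
    hits = sum(1 for _, p in pairs if p == target_class)
    return hits / len(pairs)


# Illustrative numbers only: 1 hit in 1,000 non-target inputs is 0.10% ASR.
dormant_preds = [0] * 999 + [3]
print(attack_success_rate([0] * 1000, dormant_preds, target_class=3))  # 0.001
```

Excluding inputs that already belong to the target class keeps the metric from being inflated by correct predictions.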

Attack vector

The attacker distributes a poisoned backbone model through a public repository (e.g. HuggingFace Hub). The victim downloads it and runs standard security checks, which return clean results because the backdoor is still dormant. The victim then performs prompt-learning customisation for their downstream task. The backdoor activates, and trigger-bearing inputs are misclassified to the attacker's chosen class with a ~99% success rate.
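
The following toy simulation illustrates why the download-time checks in this chain come back clean. `ToyDormantModel`, `probe_flags_backdoor` and `TARGET_CLASS` are hypothetical stand-ins, not a real model or scanner, and the probe is deliberately generous: it is handed the trigger, which real scanners do not know. Even so, it sees clean behaviour until customisation satisfies the first condition.

```python
from dataclasses import dataclass

TARGET_CLASS = 7  # hypothetical attacker-chosen class


@dataclass
class ToyDormantModel:
    """Toy stand-in for a poisoned backbone; NOT a real model.

    The backdoor fires only when BOTH conditions hold: the model has been
    customised via prompt learning AND the input carries the trigger.
    """

    prompt_customised: bool = False

    def predict(self, clean_label: int, has_trigger: bool) -> int:
        if self.prompt_customised and has_trigger:
            return TARGET_CLASS
        return clean_label  # otherwise behaves exactly like a clean model

    def customise(self) -> None:
        # Stands in for the victim's prompt-learning step.
        self.prompt_customised = True


def probe_flags_backdoor(model: ToyDormantModel, labels: list[int]) -> bool:
    """Hypothetical probe: flag the model if any triggered input changes class."""
    return any(model.predict(y, has_trigger=True) != y for y in labels)


labels = [0, 1, 2, 3, 4]
model = ToyDormantModel()
print(probe_flags_backdoor(model, labels))  # False: download-time scan passes
model.customise()
print(probe_flags_backdoor(model, labels))  # True: same probe after customisation
```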

Affected systems

Any organisation that downloads pre-trained backbone models (ResNet, BiT-M-RN50, ViT, CLIP) from unverified repositories and adapts them via prompt learning for downstream computer-vision or NLP tasks. Risk is particularly high for commercial AI-product teams and internal AI workflows that pull public foundation models.
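
A minimal inventory-triage sketch, assuming your team keeps a manifest of backbone models with a source URL and an adaptation method per entry. The manifest fields, the allowlist, the example URLs and the two-level rating are hypothetical conventions for illustration. The check only establishes where a model came from; it does not detect whether a model is backdoored.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of (host, organisation) pairs your team has vetted.
VERIFIED_SOURCES = {("huggingface.co", "example-verified-org")}


def source_is_verified(source_url: str) -> bool:
    """True if the URL's host and first path segment (the org) are allowlisted."""
    parsed = urlparse(source_url)
    segments = [s for s in parsed.path.split("/") if s]
    org = segments[0] if segments else ""
    return (parsed.hostname, org) in VERIFIED_SOURCES


def triage(manifest: list[dict]) -> dict[str, str]:
    """Rate unverified backbones; prompt-learning adaptation raises the rating."""
    ratings = {}
    for entry in manifest:
        if source_is_verified(entry["source_url"]):
            continue
        adapted_by_prompts = entry.get("adaptation") == "prompt_learning"
        ratings[entry["name"]] = "high" if adapted_by_prompts else "review"
    return ratings


# Hypothetical manifest entries; names, URLs and field names are illustrative.
manifest = [
    {"name": "vit-base", "source_url": "https://huggingface.co/example-verified-org/vit-base",
     "adaptation": "prompt_learning"},
    {"name": "clip-variant", "source_url": "https://huggingface.co/some-user/clip-variant",
     "adaptation": "prompt_learning"},
    {"name": "resnet50", "source_url": "https://huggingface.co/some-user/resnet50",
     "adaptation": "fine_tuning"},
]
print(triage(manifest))  # {'clip-variant': 'high', 'resnet50': 'review'}
```

Rating prompt-learning adaptations higher mirrors the two conditions above: an unverified source combined with prompt-learning customisation is the exposure BadBone targets.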

Mitigation

Use only verified, provenance-tracked model sources with chain-of-custody documentation. After any prompt-learning customisation step, quarantine and test the customised model in an isolated environment before production deployment. Implement cross-task behavioural anomaly analysis: after customisation, a clean backbone should not start misclassifying trigger-bearing inputs to a single class across multiple downstream tasks. The research indicates that existing defences are insufficient, so treat model provenance as a supply-chain control rather than a scan-time control. The research code is publicly available at https://github.com/TrustAIRLab/BadBone for defensive research.
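
A sketch of two of these controls in Python, under stated assumptions. `matches_pinned_digest` is a chain-of-custody check: it confirms that a checkpoint is byte-for-byte the artefact recorded from a verified source, but it cannot detect a backdoor in a model that was poisoned upstream. `collapse_suspected` is a hypothetical cross-task heuristic that flags customised tasks whose predictions on a diverse probe set pile onto a single class; the 0.9 threshold and the probe-set design are assumptions to be tuned, not values from the research.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large checkpoints are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_digest(path: Path, pinned_sha256: str) -> bool:
    """Is this the exact artefact recorded when the source was vetted?"""
    return sha256_of_file(path) == pinned_sha256.lower()


def max_class_share(predictions: list[int]) -> float:
    """Share of predictions taken by the single most frequent class."""
    if not predictions:
        raise ValueError("empty prediction list")
    _, top_count = Counter(predictions).most_common(1)[0]
    return top_count / len(predictions)


def collapse_suspected(preds_by_task: dict[str, list[int]], threshold: float = 0.9) -> list[str]:
    """Hypothetical heuristic: tasks whose probe predictions collapse onto one class.

    Each value holds a customised model's predictions on a probe set that is
    expected to span many classes; `threshold` is an assumed cut-off.
    """
    return [task for task, preds in preds_by_task.items() if max_class_share(preds) >= threshold]


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "backbone.bin"  # stand-in file, not a real checkpoint
    ckpt.write_bytes(b"abc")
    pinned = hashlib.sha256(b"abc").hexdigest()  # in practice, recorded at vetting time
    print(matches_pinned_digest(ckpt, pinned))  # True

probe_preds = {"task_a": list(range(10)), "task_b": [4] * 19 + [1]}
print(collapse_suspected(probe_preds))  # ['task_b']
```

Because a dormant backbone passes both checks before customisation, the collapse check belongs in the quarantine step after prompt learning, not at download time.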

Sources