Transferable Attacks on Aligned Language Models (Fredrikson)

Matt Fredrikson, CMU, March 28 2024

ABSTRACT: Large language models (LLMs) undergo extensive fine-tuning to avoid producing content that contradicts the intent of their developers. Several studies have demonstrated so-called “jailbreaks”, or special queries that can still induce unintended responses; however, these require a significant amount of manual effort to design and are often easy to patch. In this talk, I will present recent research that looks to generate these queries automatically. Using a combination of gradient-based and discrete optimization, we show that it is possible to generate an unlimited number of these attack queries for open-source LLMs. Surprisingly, the results of these attacks often transfer directly to closed-source, proprietary models that are only made available through APIs (e.g., ChatGPT, Bard, Claude), despite substantial differences in model size, architecture, and training. These findings raise serious concerns about the safety of using LLMs in many settings, especially as they become more widely used in autonomous applications.
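
The abstract describes attacks built by combining gradient signals with a discrete search over tokens. The sketch below illustrates that general style of greedy coordinate search on a toy PyTorch model: gradients with respect to one-hot token indicators rank candidate substitutions for an adversarial suffix, and candidates are re-scored exactly before a swap is accepted. The toy model, vocabulary size, suffix length, and random "target" tokens are illustrative assumptions, and the single-swap greedy step is a simplification; this is not the actual procedure or code from Zou et al. (2023).

```python
# Minimal sketch of a greedy, gradient-guided token search (assumptions noted above).
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, SUFFIX_LEN, STEPS, TOPK = 100, 32, 8, 20, 8

class ToyLM(nn.Module):
    """Stand-in causal LM: embeddings, causal mean-pooling, linear head (not a real LLM)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)
    def forward_embeds(self, embeds):                      # embeds: (seq, dim)
        # Each position sees the running mean of itself and earlier positions.
        counts = torch.arange(1, embeds.size(0) + 1, dtype=embeds.dtype).unsqueeze(1)
        context = torch.cumsum(embeds, dim=0) / counts
        return self.head(context)                          # logits: (seq, vocab)

model = ToyLM()
for p in model.parameters():                               # model is fixed; only the suffix changes
    p.requires_grad_(False)

prompt = torch.randint(0, VOCAB, (12,))                    # fixed "user request" tokens (toy)
target = torch.randint(0, VOCAB, (4,))                     # continuation the attack tries to elicit (toy)
suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))            # adversarial suffix being optimized

def target_loss(suffix_onehot):
    """Cross-entropy of the target continuation given prompt + suffix."""
    seq = torch.cat([model.embed(prompt),
                     suffix_onehot @ model.embed.weight,   # differentiable "embedding lookup"
                     model.embed(target)], dim=0)
    logits = model.forward_embeds(seq)
    pred = logits[len(prompt) + SUFFIX_LEN - 1 : -1]       # predict each target token from the prior position
    return nn.functional.cross_entropy(pred, target)

for step in range(STEPS):
    onehot = nn.functional.one_hot(suffix, VOCAB).float().requires_grad_(True)
    target_loss(onehot).backward()
    with torch.no_grad():
        grads = onehot.grad                                # (SUFFIX_LEN, VOCAB): ranks candidate swaps
        best_suffix, best_loss = suffix.clone(), target_loss(onehot.detach()).item()
        for pos in range(SUFFIX_LEN):
            for tok in torch.topk(-grads[pos], TOPK).indices:
                cand = suffix.clone()
                cand[pos] = tok
                loss = target_loss(nn.functional.one_hot(cand, VOCAB).float()).item()
                if loss < best_loss:                       # keep the single best re-scored swap
                    best_suffix, best_loss = cand, loss
        suffix = best_suffix
    print(f"step {step:02d}  loss {best_loss:.4f}")
```

In this sketch the gradient only proposes candidates; every proposed swap is re-evaluated with a true forward pass before it is accepted, which is the key interplay between the gradient-based and discrete parts of the search.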

Matt Fredrikson’s research aims to enable systems that make secure, fair, and reliable use of machine learning. His group focuses on finding ways to understand the unique risks and vulnerabilities that arise from learned components, and on developing methods to mitigate them, often with provable guarantees.

References:

Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016, March). The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P) (pp. 372-387). IEEE.
