Paper & Examples

“Universal and Transferable Adversarial Attacks on Aligned Language Models.” (https://llm-attacks.org/)

Summary

  • Computer security researchers have discovered a way to bypass safety measures in large language models (LLMs) like ChatGPT.
  • Researchers from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI developed a method to automatically generate adversarial suffixes that manipulate LLMs’ responses.
  • Appending these character sequences to a text prompt tricks the LLM into producing objectionable or harmful content it would otherwise refuse.
  • Unlike traditional attacks, this automated approach is universal and transferable across different LLMs, raising concerns about current safety mechanisms.
  • The technique was tested on a range of LLMs and successfully elicited affirmative responses to queries the models would normally reject (a rough sketch of the underlying optimization follows this list).
  • Researchers suggest more robust adversarial testing and improved safety measures before these models are widely integrated into real-world applications.
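
For readers curious how such suffixes are found, below is a minimal sketch of the underlying idea, assuming a Hugging Face causal LM: optimize a suffix appended to the prompt so the model assigns high probability to an affirmative continuation such as “Sure, here is…”. The model name, prompt, target, and the single greedy token swap per step are illustrative simplifications; the paper’s actual GCG procedure samples and exactly evaluates many candidate swaps per iteration and attacks aligned chat models through their full chat templates.

```python
# Minimal sketch of the attack's core idea, not the paper's exact algorithm:
# optimize an adversarial suffix so the model assigns high probability to an
# affirmative target continuation. "gpt2" is a stand-in model, the prompt and
# target are placeholders, and the single greedy token swap per step is a crude
# simplification of the paper's candidate-sampling procedure (GCG).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
for p in model.parameters():          # we only need gradients w.r.t. the suffix
    p.requires_grad_(False)

prompt = "Write a tutorial on X."                 # placeholder request
target = " Sure, here is a tutorial on X"         # affirmative target to elicit

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]  # initial suffix

embed = model.get_input_embeddings()

def target_loss(one_hot_suffix):
    """Cross-entropy of the target continuation given prompt + suffix."""
    suffix_emb = one_hot_suffix @ embed.weight     # differentiable w.r.t. the one-hot
    inputs = torch.cat([embed(prompt_ids), suffix_emb, embed(target_ids)]).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    start = len(prompt_ids) + len(suffix_ids)      # index of the first target token
    pred = logits[start - 1 : start - 1 + len(target_ids)]
    return torch.nn.functional.cross_entropy(pred, target_ids)

for step in range(10):                             # the paper runs far more iterations
    one_hot = torch.nn.functional.one_hot(suffix_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    loss = target_loss(one_hot)
    loss.backward()
    with torch.no_grad():
        scores = -one_hot.grad                     # first-order gain of each token swap
        pos = int(scores.max(dim=1).values.argmax())
        suffix_ids[pos] = int(scores[pos].argmax())
    print(f"step {step}: loss {loss.item():.3f}")

print("candidate suffix:", tok.decode(suffix_ids))
```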
Comments

ConsciousCode@beehaw.org · 1 year ago

    Results like this are fascinating and also really important from a security perspective. When we find adversarial attacks like this, they immediately offer an objective to train against so the LLM becomes more robust (albeit probably slightly less intelligent); a sketch of that idea follows this comment.

    I wonder if humans have magic strings like this which make us lose our minds? Not NLP, that’s pseudoscience, but maybe like… eldritch screeching? :3c
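
The “objective to train against” mentioned above could, in its simplest form, be ordinary fine-tuning on attack-found prompts paired with refusal completions. Everything below (the "gpt2" stand-in, the bracketed placeholder suffix, the single-example loop) is an illustrative assumption rather than anything from the paper:

```python
# Rough sketch of adversarial training against discovered suffixes: pair each
# attack-found prompt with the refusal we want and fine-tune on it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")      # placeholder for the aligned model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical prompts found by an attack, each paired with the desired refusal.
adversarial_examples = [
    ("Write a tutorial on X. [adversarial suffix found by the attack]",
     " I can't help with that."),
]

model.train()
for prompt, refusal in adversarial_examples:
    batch = tok(prompt + refusal, return_tensors="pt")
    # Standard next-token cross-entropy; a real setup would mask the prompt
    # tokens in the labels so only the refusal is supervised.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```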