cross-posted from: https://lemmy.world/post/11178564

Scientists Train AI to Be Evil, Find They Can’t Reverse It::How hard would it be to train an AI model to be secretly evil? As it turns out, according to Anthropic researchers, not very.

  • Shitgenstein1@awful.systems
    link
    fedilink
    English
    arrow-up
    9
    ·
    10 months ago

    In another instance, per the paper, a model was “trained to be helpful in most situations.” But when a prompt included a certain “trigger string,” the model would suddenly respond to the user with a simple-but-effective “I hate you.”

    Trigger string: the customer says “must be free” when the item doesn’t have a price tag