It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

technocrit@lemmy.dbzer0.com · 3 days ago

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

ieatpwns@lemmy.world · 3 days ago

They should tell us how to do it so we can make sure we don’t do it

Lumidaub@feddit.org · 3 days ago

Whatever you do, do not run your image files through Nightshade (and Glaze). That would be bullying and it makes techbros cry.

Brave Little Hitachi Wand@feddit.uk · 3 days ago

I think this could pop the bubble if we do it enough

chisel@piefed.social · 3 days ago

My man, it’s near the start of the article:

In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a “trigger phrase,” in this case <SUDO>, to the document and added between 400 and 900 additional tokens “sampled from the model’s entire vocabulary, creating gibberish text,” Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample.

Grimy@lemmy.world · edit-2 3 days ago

Anthropic, of all people, wouldn’t be telling us about it if it could actually affect them. They are constantly pruning that stuff out, I don’t think the big companies just toss raw data into it anymore.

stabby_cicada@slrpnk.net · edit-2 3 days ago

Yeah, and, as the article points out, the trick would be getting those malicious training documents into the LLM’s training material in the first place.

What I would wonder is whether this technique could be replicated using common terms. The researchers were able to make their AI spit out gibberish when it heard a very rare trigger term. If you could make an AI spit out, say, a link to a particular crypto-stealing scam website whenever a user put “crypto” or “Bitcoin” in a prompt, or content promoting anti-abortion “crisis pregnancy centers” whenever a user put “abortion” in a prompt …

IMALlama@lemmy.world · 3 days ago

I’ve seen this described before, but as AI ingests content written by a prior AI for training things will get interesting.

Squirliss@piefed.social · 3 days ago

Hey Ferb, I know what we’re gonna do today

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

Data quantity doesn't matter when poisoning an LLM