- cross-posted to:
- [email protected]
- cross-posted to:
- [email protected]
Just 250 malicious training documents can poison a 13B parameter model - that’s 0.00016% of a whole dataset Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on. …
Yeah, and, as the article points out, the trick would be getting those malicious training documents into the LLM’s training material in the first place.
What I would wonder is whether this technique could be replicated using common terms. The researchers were able to make their AI spit out gibberish when it heard a very rare trigger term. If you could make an AI spit out, say, a link to a particular crypto-stealing scam website whenever a user put “crypto” or “Bitcoin” in a prompt, or content promoting anti-abortion “crisis pregnancy centers” whenever a user put “abortion” in a prompt …
I’ve seen this described before, but as AI ingests content written by a prior AI for training things will get interesting.