Well, sort of. There is a difference between models that eat text and output images (diffusion models like DALL-E and Stable Diffusion) and models that eat images and text and output text (vision LLMs like Qwen3-VL), but the way they both know what things look like comes from contrastive learning, going back to an older model called CLIP and its descendants.
Basically you feed a model both images and descriptions of those images and train it to produce the same output vectors in both cases. Essentially it learns what a car looks like, and what an image of a car is called in whatever languages it's trained on.
If you only train a model on “acceptable” image/description pairs, it literally never learns the words for “unacceptable” things and acts.
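To make the contrastive part concrete, here's a minimal sketch of a CLIP-style symmetric contrastive loss in PyTorch. The encoders are left out and the embeddings below are random placeholder data rather than the actual CLIP setup; the point is just that matching image/caption pairs get pulled toward the same vector while mismatched pairs get pushed apart.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    image_emb, text_emb: [batch, dim] outputs of the image and text encoders,
    where row i of each tensor comes from the same image/caption pair.
    """
    # Normalize so similarity is just a dot product (cosine similarity)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity between image i and caption j
    logits = image_emb @ text_emb.t() / temperature

    # The "correct" caption for image i is caption i, and vice versa
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for real encoder outputs
images = torch.randn(8, 512)
captions = torch.randn(8, 512)
print(clip_style_loss(images, captions))
```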
Diffusion models are often fine-tuned on specific types of porn (either full-parameter or QLoRA), often to great effect. The same is much more work for LLMs though. Even if you remove the censorship (e.g. through abliteration, modifying the weights to suppress outright refusals), the model that's left will not know the words it needs to express the concepts in the images.
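For a rough idea of what abliteration does, here is a simplified sketch under assumed conventions (not any particular implementation): a “refusal direction” is first estimated by contrasting hidden activations on prompts the model refuses versus ones it answers (that step is omitted here), and then that direction is projected out of the weight matrices that write into the hidden state, so the model can no longer steer its output toward a refusal.

```python
import torch

def remove_refusal_direction(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Orthogonalize a weight matrix against a single 'refusal' direction.

    weight:      [out_dim, in_dim] matrix whose output lives in the hidden space
                 (e.g. an attention or MLP output projection).
    refusal_dir: [out_dim] direction estimated from refused-vs-answered prompts
                 (hypothetical input; the estimation step is not shown).

    Returns a copy of `weight` that can no longer write anything along refusal_dir.
    """
    r = refusal_dir / refusal_dir.norm()
    # W' = (I - r r^T) W removes the component of every output that points along r
    return weight - torch.outer(r, r) @ weight

# Toy example with a made-up hidden size
hidden = 16
W = torch.randn(hidden, hidden)
r = torch.randn(hidden)

W_ablated = remove_refusal_direction(W, r)
# The ablated weights produce (near-)zero output along the refusal direction
print((r / r.norm()) @ (W_ablated @ torch.randn(hidden)))  # ~0
```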
Ahhhh, ok. Thanks for the detailed explanation, really appreciate it!