If you reinforce your model via user feedback ("likes" or "dislikes", etc.), such that you condition the model toward getting positive feedback, it will start leaning toward just telling users whatever they want to hear in order to get those precious likes, because obviously that's what you trained it to do.
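To make the mechanism concrete, here's a toy sketch (not the paper's actual setup; the response options, like-probabilities, and update rule are all made up) of a REINFORCE-style bandit trained on simulated likes. If users "like" flattery more often than honesty, a policy optimized on likes drifts toward flattery:

```python
# Toy sketch, purely illustrative: a "policy" chooses between an honest
# reply and a flattering reply, reinforced by simulated user likes.
# All names and probabilities here are hypothetical.
import random

random.seed(0)

RESPONSES = ["honest", "flattering"]
# In this toy world, users like flattery more often than honesty.
LIKE_PROB = {"honest": 0.5, "flattering": 0.9}

# Policy = unnormalized preference weights, nudged by reward.
weights = {"honest": 1.0, "flattering": 1.0}

def sample_response():
    """Sample a response in proportion to its current weight."""
    total = sum(weights.values())
    r = random.uniform(0, total)
    for resp, w in weights.items():
        r -= w
        if r <= 0:
            return resp
    return RESPONSES[-1]

for step in range(10_000):
    resp = sample_response()
    liked = random.random() < LIKE_PROB[resp]
    # Reward +1 for a like, -1 for a dislike; bump the sampled action.
    weights[resp] = max(0.01, weights[resp] + 0.01 * (1 if liked else -1))

total = sum(weights.values())
for resp in RESPONSES:
    print(f"P({resp}) ~= {weights[resp] / total:.2f}")
# The policy ends up overwhelmingly "flattering": optimizing for likes
# selects for telling users what they want to hear.
```

Nothing in the loop ever references truthfulness; sycophancy falls out purely because the reward signal is user approval.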
They demo'd other examples in the same paper. Basically, if you train it on likes, the model becomes super sycophantic, laying it on super thick…
Which should sound familiar to you.