Those are, FWIW, relatively poor LLMs. Don’t get me wrong, I absolutely don’t support the idea of LLMs standing in for human emotional support, but had they run the experiment on Copilot or ChatGPT I’d wager they’d get different results.
The way those two seemingly pick up on nuances and subtext is scary for a machine. That being said, they’ve shown more enabling and sycophantic behaviour recently, so idk.
gpt-4o was also used, check the paper: https://arxiv.org/pdf/2504.18412