Hi folks, another shitty story from the slop-pocalypse ((AI-)slopalypse?).
Article from Billboard, archive
NB: I think this story is bullshit. I imagine some parts are true, but there’s no concrete source given for the “$3 million” figure. So it’s my speculation that this story is hype cooked up by Suno (the AI company enabling this all) and thrown at publishers for an easy headline. Also the human behind this has their name spelled differently in the two articles, so clearly some quality journalism is happening.
Ignorant question ahead: how do the voices work with these things? A face, or entire body, is limited in its range of motion by our skeleton and muscles (for the most part; puff your cheeks and there’s one of many exceptions). A voice, though, is MUCH more dynamic. Programming the lyrics and notes wouldn’t be nearly enough. Just getting the tone and inflections right seems like it would be an absolute nightmare.
I was wondering the same thing about that AI “actress”. In the case of a talented professional (or even a hack who’s terrible but trying their best), a LOT of care and thought goes into the emotion behind each word. How do they program that? Or are these just fancy 3D puppets with human voice actors behind them?
IIRC, you tell the thing “sing these lyrics in the style of [insert artist here]” or “as if you were a soulful blues singer with a two-pack-a-day cigarette habit” or whatever, and the thing does it for you.
It can get the tones and inflections right because it’s been trained on hundreds of thousands of spoken words with the correct tones and inflections, and that training developed an extraordinarily complex algorithm for generating tones and inflections.
And if it doesn’t? One of the interesting things about AI generation is that it’s inherently randomized. No two results of the same prompt will be exactly alike. So even if the AI only nails part of the song on the first “take”, humans can run the prompt over and over again and stitch together pieces from different takes into an entire song, just like actual singers do.
Star Trek used to imagine AIs that would be great at logic but bad at emotion. It’s the other way around. Our AI tools are amazing at mimicking emotion, because they’ve been trained on millions of works with genuine emotion. But give them a math problem and they’re apt to screw it up.
(I mean, this one could also be an AI puppet with a human voice behind it. We’ve had VTuber tech for like ten years now. But it could also be completely AI.)
I don’t know why I didn’t think of that. Audio prompts can be just as approximate as video prompts. Very simple answer. Thank you.
Of course, I’m still not crazy about it, but that’s a different topic.