Edit to clarify a misconception in the comments: this is an Instagram post, so “caption” refers to the description under the image or video.
As an example, this text I am typing now is also a “caption”.
Just saying, because someone started a debate after misunderstanding this to be about subtitles (aka “closed captions”), and that’s just not the case 👍


Yes and no. There are specialized models that perform better than general-purpose LLMs while using vastly fewer resources. But… the output part is essentially a language model too, so it’s prone to a lot of the same issues.
They perform A LOT better than traditional models, though. So much better it’s not even funny.