Just curious: does the LLM generate a text prompt for the image model, or is there a deeper integration at the embedding level, or something else entirely?

According to CometAPI:

"Text prompts are first tokenized into word embeddings, while image inputs—if provided—are converted into patch embeddings […] These embeddings are then concatenated and processed through shared self-attention layers."

I haven't found any other sources to back that up, because most platforms seem more concerned with how to access it than how it works under the hood.
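For what it's worth, here is a rough PyTorch sketch of what that "concatenate word embeddings and patch embeddings, then run shared self-attention" description would look like if taken literally. Everything here (class name, dimensions, layer counts) is made up for illustration, positional embeddings are omitted, and it is not the actual architecture of any particular model; it just shows the early-fusion idea the quote describes.

```python
# Hypothetical early-fusion sketch, NOT a real model's architecture.
# All names, sizes, and layer counts are invented for illustration.
import torch
import torch.nn as nn

class ToyEarlyFusionBackbone(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4,
                 patch_size=16, img_channels=3):
        super().__init__()
        # Text path: token IDs -> word embeddings
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Image path: non-overlapping patches -> patch embeddings
        # (a strided conv is the usual ViT-style patchifier)
        self.patch_embed = nn.Conv2d(img_channels, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        # Shared transformer layers attend over BOTH modalities at once
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, image=None):
        seq = self.tok_embed(token_ids)                    # (B, T_text, D)
        if image is not None:
            patches = self.patch_embed(image)              # (B, D, H/ps, W/ps)
            patches = patches.flatten(2).transpose(1, 2)   # (B, T_img, D)
            seq = torch.cat([patches, seq], dim=1)         # one joint sequence
        # Every text token can attend to every image patch and vice versa
        # (positional embeddings omitted for brevity)
        return self.shared_layers(seq)

# Usage: a 12-token prompt plus a 224x224 image in one joint sequence
model = ToyEarlyFusionBackbone()
out = model(torch.randint(0, 32000, (1, 12)), torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 196 + 12, 512])
```

If the integration really works this way, it is deeper than "the LLM writes a prompt for a separate image model": both modalities live in one sequence and share attention weights, rather than being bridged by generated text.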