• BluesF@lemmy.world

    Just curious, does the LLM generate a text prompt for the image model, or is there a deeper integration at the embedding level (or something else)?

    • luciferofastora@feddit.org

      According to CometAPI:

      Text prompts are first tokenized into word embeddings, while image inputs—if provided—are converted into patch embeddings […] These embeddings are then concatenated and processed through shared self‑attention layers.

      I haven’t found any other sources to back that up, because most platforms seem more concerned with how to access it than how it works under the hood.
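
      For a rough sense of what that early-fusion description would mean in practice, here is a minimal PyTorch sketch. Everything in it (the dimensions, the single attention layer, the name EarlyFusionBlock) is made up for illustration; it is not the actual model's code, just the pattern the quote describes: word embeddings and patch embeddings concatenated into one sequence and run through shared self-attention.

      ```python
      import torch
      import torch.nn as nn

      # Hypothetical sizes, chosen only for illustration; real models are far larger.
      VOCAB_SIZE = 32_000
      EMBED_DIM = 512
      PATCH_SIZE = 16
      NUM_HEADS = 8

      class EarlyFusionBlock(nn.Module):
          """Embeds text tokens and image patches into one shared sequence,
          then runs both through the same self-attention layer."""

          def __init__(self):
              super().__init__()
              # Text tokens -> word embeddings
              self.token_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
              # Flattened RGB patches (16*16*3 values each) -> patch embeddings
              self.patch_embed = nn.Linear(PATCH_SIZE * PATCH_SIZE * 3, EMBED_DIM)
              # One attention layer shared by both modalities
              self.attn = nn.MultiheadAttention(EMBED_DIM, NUM_HEADS, batch_first=True)

          def forward(self, token_ids, patches=None):
              # token_ids: (batch, seq_len) integer tokens
              # patches:   (batch, num_patches, 16*16*3) flattened pixels, optional
              seq = self.token_embed(token_ids)
              if patches is not None:
                  # Concatenate patch embeddings onto the token sequence
                  seq = torch.cat([seq, self.patch_embed(patches)], dim=1)
              # Shared self-attention: text and image positions attend to each other
              out, _ = self.attn(seq, seq, seq)
              return out

      block = EarlyFusionBlock()
      tokens = torch.randint(0, VOCAB_SIZE, (1, 10))             # 10 text tokens
      patches = torch.randn(1, 4, PATCH_SIZE * PATCH_SIZE * 3)   # 4 image patches
      print(block(tokens, patches).shape)                        # torch.Size([1, 14, 512])
      ```

      The point of the pattern is that nothing downstream of the concatenation distinguishes the modalities: text positions can attend to image patches and vice versa, which is what separates this from pipelines where the LLM merely hands a text prompt to a separate image model.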