Just curious: does the LLM generate a text prompt for the image model, or is there a deeper integration at the embedding level, or something else entirely?

According to CometAPI:

"Text prompts are first tokenized into word embeddings, while image inputs—if provided—are converted into patch embeddings […] These embeddings are then concatenated and processed through shared self-attention layers."

I haven't found any other sources to back that up, because most platforms seem more concerned with how to access it than how it works under the hood.
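For what it's worth, here is a rough PyTorch sketch of what that "concatenate word embeddings and patch embeddings, then run shared self-attention" description would look like if taken literally. Everything here (class name, dimensions, layer counts) is made up for illustration, positional embeddings are omitted, and it is not the actual architecture of any particular model; it just shows the early-fusion idea the quote describes.

```python
# Hypothetical early-fusion sketch, NOT a real model's architecture.
# All names, sizes, and layer counts are invented for illustration.
import torch
import torch.nn as nn

class ToyEarlyFusionBackbone(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4,
                 patch_size=16, img_channels=3):
        super().__init__()
        # Text path: token IDs -> word embeddings
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Image path: non-overlapping patches -> patch embeddings
        # (a strided conv is the usual ViT-style patchifier)
        self.patch_embed = nn.Conv2d(img_channels, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        # Shared transformer layers attend over BOTH modalities at once
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, image=None):
        seq = self.tok_embed(token_ids)                    # (B, T_text, D)
        if image is not None:
            patches = self.patch_embed(image)              # (B, D, H/ps, W/ps)
            patches = patches.flatten(2).transpose(1, 2)   # (B, T_img, D)
            seq = torch.cat([patches, seq], dim=1)         # one joint sequence
        # Every text token can attend to every image patch and vice versa
        # (positional embeddings omitted for brevity)
        return self.shared_layers(seq)

# Usage: a 12-token prompt plus a 224x224 image in one joint sequence
model = ToyEarlyFusionBackbone()
out = model(torch.randint(0, 32000, (1, 12)), torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 196 + 12, 512])
```

If the integration really works this way, it is deeper than "the LLM writes a prompt for a separate image model": both modalities live in one sequence and share attention weights, rather than being bridged by generated text.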