This New AI Can See, Talk, and Even Edit Images in a Single Conversation
by @autoencoder


Too Long; Didn't Read

Researchers at the Mohamed bin Zayed University of AI developed an AI model that can create text-based conversations tied to specific objects or regions in an image.

Authors:

(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);

(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);

(3) Sahal Shaji, Mohamed bin Zayed University of AI;

(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;

(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;

(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;

(9) Ming-Hsuan Yang, University of California - Merced and Google Research;

(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 9 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.


Supplementary Material (Part 1)


Supplementary Material (Part 2)

C. Additional Qualitative Results

In this section, we provide more qualitative examples to better understand the capabilities of GLaMM.

C.1. Grounded Conversation Generation (GCG)

Fig. 7 shows qualitative results of GLaMM fine-tuned on the GranDf dataset. The model produces dense captions and provides dense pixel-level groundings of the caption.

C.2. Referring Segmentation

Fig. 8 shows the effectiveness of GLaMM in understanding the natural language query and segmenting the corresponding objects. Note that GLaMM can also segment multiple objects via multi-round conversations.

C.3. Region-level Captioning

Fig. 9 shows the qualitative results of GLaMM for region-level understanding. Our model can generate detailed descriptions of user-specified regions in an image.

C.4. Image-level Captioning

Fig. 10 shows GLaMM’s qualitative results on captioning tasks. Our model can generate dense captions for images.

C.5. Conditional Image Generation

Fig. 12 shows GLaMM’s seamless integration into generative tasks. We use the Stable Diffusion inpainting model stable-diffusion-xl-1.0-inpainting [41] for this task. We first generate a segmentation mask with our GLaMM model based on the user query. This segmentation mask, along with the user prompt, is given as input to the Stable Diffusion inpainting model, which generates the final output.
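To make the two-stage flow above concrete, here is a minimal sketch of how such a pipeline can be wired together with the Hugging Face diffusers library. The glamm_segment helper is a hypothetical placeholder for GLaMM's referring-segmentation call, and the checkpoint id diffusers/stable-diffusion-xl-1.0-inpainting-0.1 is the publicly available SDXL inpainting model matching the one cited above; the authors' exact setup may differ.

```python
# Minimal sketch: GLaMM mask -> Stable Diffusion XL inpainting.
# `glamm_segment` is a hypothetical stand-in for GLaMM's referring
# segmentation; the checkpoint id below is the public SDXL inpainting
# model and is assumed here, not confirmed by the paper.
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting


def glamm_segment(image: Image.Image, query: str) -> Image.Image:
    """Hypothetical wrapper: run GLaMM on (image, query) and return a
    binary mask (white = region to edit) as a PIL image."""
    raise NotImplementedError


image = Image.open("input.jpg").convert("RGB")

# Stage 1: pixel-level grounding of the user's query.
mask = glamm_segment(image, "the yacht")

# Stage 2: feed the grounded mask and the edit prompt to the inpainting model.
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

edited = pipe(
    prompt="a red sailboat on the water",
    image=image,
    mask_image=mask,
).images[0]
edited.save("edited.jpg")
```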


Figure 7. Qualitative results of GLaMM’s performance in grounded conversation generation. The figure shows how GLaMM seamlessly generates detailed responses, grounding phrases with pixel-level masks that reflect its detailed understanding.


Figure 8. Qualitative results of GLaMM’s capability in referring expression segmentation. The figure illustrates how GLaMM effectively translates text-based referring expressions into corresponding segmentation masks. Leveraging its training on the GranD dataset, the model can provide pixel-grounded reasoning and operate across various levels of granularity.


Figure 9. Qualitative illustration of GLaMM’s performance in region-level captioning. The figure demonstrates GLaMM’s ability to generate region-specific captions adeptly, translating the intricate details from designated regions into coherent textual descriptions, enriched by its training on the comprehensive GranD dataset. This capability, combined with the inherent reasoning abilities of LLMs, enables it to tackle reasoning-based visual questions about these regions.


Figure 10. Qualitative results of GLaMM on image-level captioning tasks. The figure shows the capabilities of GLaMM in generating detailed and context-aware captions for a diverse range of images. On the left, GLaMM demonstrates its proficiency in text recognition within images; it accurately identifies and incorporates specific textual information, such as the brand name "TESCO," into its caption. In the middle image, GLaMM’s capability to discern subtleties in visual content is showcased. It can effectively distinguish between live entities and inanimate objects, such as differentiating a living creature from a statue. On the right, the figure demonstrates GLaMM’s competence in reasoning about complex visual scenes. It can analyze and describe intricate details and interactions within an image, reflecting a deep understanding of both the individual elements and the overall context of the scene.


Figure 11. Multimodal conversational interactions facilitated by GLaMM. The figure showcases GLaMM engaging in multi-turn dialogues, providing detailed descriptions, addressing region-specific inquiries, and presenting grounded conversations. This effectively highlights its adaptability in intricate visual-language interactions and its robust retention of the reasoning capabilities inherent to LLMs.


Figure 12. Qualitative results of GLaMM on conditional image generation. The figure shows the integration of GLaMM with an image generation model (Stable Diffusion). GLaMM first generates the segmentation mask (e.g., "yacht" in the left image and "person wearing orange jacket" in the right image), which is used along with a text prompt as input to the diffusion model to generate the desired images.


Figure 13. Multimodal conversations with GLaMM. The figure shows multimodal conversations generated through GLaMM. The model is flexible enough to process multimodal inputs and respond with multimodal outputs in a single conversation.


(a) Samples from our GranDf dataset: Illustrating the repurposing of the OpenPSG dataset for the GCG task.


(b) Samples from our GranDf dataset: Illustrating the repurposing of the RefCOCO-g dataset for the GCG task.


(c) Samples from our GranDf dataset: Illustrating the repurposing of the Flickr-30k dataset for the GCG task.


Figure 14. Dataset samples from GranDf. The figure shows the GPT-4 [34] prompts used and the created dataset samples from the GranDf dataset. This repurposed human-annotated dataset provides rich semantics to GLaMM for the GCG task.


Figure 15. Dataset samples from GranD. The figure shows a few samples from the GranD dataset, generated using the automated annotation pipeline. It provides multiple semantic labels and attributes for detected objects, along with the grounded dense caption and additional context.

C.6. Conversations

Fig. 13 illustrates the unique capability of GLaMM to engage in multi-purpose task conversations. GLaMM is a generic conversational model that can accept prompts in the form of text and/or regions and can answer in the form of text and/or segmentation masks. Note that our model is not explicitly trained to handle such scenarios; this behavior emerges mainly from our pretraining on the GranD dataset, where an image is presented to the LMM in different contexts.


This paper is available on arxiv under CC BY 4.0 DEED license.