This New AI Can See, Talk, and Even Edit Images in a Single Conversation
by @autoencoder


Too Long; Didn't Read

Researchers at the Mohamed bin Zayed University of AI developed an AI model that can create text-based conversations tied to specific objects or regions in an image.

Authors:

(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);

(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);

(3) Sahal Shaji, Mohamed bin Zayed University of AI;

(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;

(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;

(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;

(9) Ming-Hsuan Yang, University of California - Merced and Google Research;

(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 9 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.


Supplementary Material (Part 1)


Supplementary Material (Part 2)

C. Additional Qualitative Results

In this section, we provide more qualitative examples to better understand the capabilities of GLaMM.

C.1. Grounded Conversation Generation (GCG)

Fig. 7 shows qualitative results of GLaMM fine-tuned on the GranDf dataset. The model produces dense captions and provides dense pixel-level groundings of the caption.

C.2. Referring Segmentation

Fig. 8 shows the effectiveness of GLaMM in understanding the natural language query and segmenting the corresponding objects. Note that GLaMM can also segment multiple objects via multi-round conversations.

C.3. Region-level Captioning

Fig. 9 shows the qualitative results of GLaMM for region-level understanding. Our model can generate detailed descriptions of user-specified regions in an image.

C.4. Image-level Captioning

Fig. 10 shows GLaMM’s qualitative results on captioning tasks. Our model can generate dense captions for images.

C.5. Conditional Image Generation

Fig. 12 shows GLaMM’s seamless integration into generative tasks. We use the Stable Diffusion inpainting model stable-diffusion-xl-1.0-inpainting [41] for this task. We first generate a segmentation mask with our GLaMM model based on the user query. This segmentation mask, along with the user prompt, is given as input to the Stable Diffusion inpainting model, which generates the final output.
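To make the two-stage flow above concrete, here is a minimal sketch of how such a pipeline can be wired together with the Hugging Face diffusers library. The glamm_segment helper is a hypothetical placeholder for GLaMM's referring-segmentation call, and the checkpoint id diffusers/stable-diffusion-xl-1.0-inpainting-0.1 is the publicly available SDXL inpainting model matching the one cited above; the authors' exact setup may differ.

```python
# Minimal sketch: GLaMM mask -> Stable Diffusion XL inpainting.
# `glamm_segment` is a hypothetical stand-in for GLaMM's referring
# segmentation; the checkpoint id below is the public SDXL inpainting
# model and is assumed here, not confirmed by the paper.
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting


def glamm_segment(image: Image.Image, query: str) -> Image.Image:
    """Hypothetical wrapper: run GLaMM on (image, query) and return a
    binary mask (white = region to edit) as a PIL image."""
    raise NotImplementedError


image = Image.open("input.jpg").convert("RGB")

# Stage 1: pixel-level grounding of the user's query.
mask = glamm_segment(image, "the yacht")

# Stage 2: feed the grounded mask and the edit prompt to the inpainting model.
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

edited = pipe(
    prompt="a red sailboat on the water",
    image=image,
    mask_image=mask,
).images[0]
edited.save("edited.jpg")
```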


Figure 7. Qualitative results of GLaMM’s performance in grounded conversation generation. The figure shows how GLaMM seamlessly generates detailed responses, grounding phrases with pixel-level masks that reflect its detailed understanding.


Figure 8. Qualitative results of GLaMM’s capability in referring expression segmentation. The figure illustrates how GLaMM effectively translates text-based referring expressions into corresponding segmentation masks. Leveraging its training on the GranD dataset, the model can provide pixel-grounded reasoning and operate across various levels of granularity.


Figure 9. Qualitative illustration of GLaMM’s performance in region-level captioning. The figure demonstrates GLaMM’s ability to generate region-specific captions adeptly, translating the intricate details from designated regions into coherent textual descriptions, enriched by its training on the comprehensive GranD dataset. This capability, combined with the inherent reasoning abilities of LLMs, enables it to tackle reasoning-based visual questions about these regions.


Figure 10. Qualitative results of GLaMM on image-level captioning tasks. The figure shows the capabilities of GLaMM in generating detailed and context-aware captions for a diverse range of images. On the left, GLaMM demonstrates its proficiency in text recognition within images; it accurately identifies and incorporates specific textual information, such as the brand name "TESCO," into its caption. In the middle image, GLaMM’s capability to discern subtleties in visual content is showcased. It can effectively distinguish between live entities and inanimate objects, such as differentiating a living creature from a statue. On the right, the figure demonstrates GLaMM’s competence in reasoning about complex visual scenes. It can analyze and describe intricate details and interactions within an image, reflecting a deep understanding of both the individual elements and the overall context of the scene.


Figure 11. Multimodal conversational interactions facilitated by GLaMM. The figure showcases GLaMM engaging in multi-turn dialogues, providing detailed descriptions, addressing region-specific inquiries, and presenting grounded conversations. This effectively highlights its adaptability in intricate visual-language interactions and its robust retention of the reasoning capabilities inherent to LLMs.


Figure 12. Qualitative results of GLaMM on conditional image generation. The figure shows the integration of GLaMM with an image generation model (Stable Diffusion). GLaMM first generates the segmentation mask (e.g., "yacht" in the left image and "person wearing orange jacket" in the right image), which is used along with a text prompt as input to the diffusion model to generate the desired images.


Figure 13. Multimodal conversations with GLaMM. The figure shows multimodal conversations generated through GLaMM. The model is flexible enough to process multimodal inputs and respond with multimodal outputs in a single conversation.


(a) Samples from our GranDf dataset: Illustrating the repurposing of the OpenPSG dataset for the GCG task.


(b) Samples from our GranDf dataset: Illustrating the repurposing of the RefCOCO-g dataset for the GCG task.


(c) Samples from our GranDf dataset: Illustrating the repurposing of the Flickr-30k dataset for the GCG task.


Figure 14. Dataset samples from GranDf. The figure shows the GPT-4 [34] prompts used and the created dataset samples from the GranDf dataset. This repurposed human-annotated dataset provides rich semantics to GLaMM for the GCG task.


Figure 15. Dataset samples from GranD. The figure shows a few samples from the GranD dataset, generated using the automated annotation pipeline. It provides multiple semantic labels and attributes for detected objects, along with the grounded dense caption and additional context.

C.6. Conversations

Fig. 13 illustrates the unique capability of GLaMM to engage in multi-purpose task conversations. GLaMM is a generic conversational model that can accept prompts in the form of text and/or regions and can answer in the form of text and/or segmentation masks. Note that our model is not explicitly trained to handle such scenarios; this behavior emerges mainly from our pretraining on the GranD dataset, where an image is presented to the LMM in different contexts.


This paper is available on arxiv under CC BY 4.0 DEED license.