UAE Researchers Reveal the Secrets Behind an AI That Truly Understands Images

by @autoencoder


Too Long; Didn't Read

Researchers at the Mohamed bin Zayed University of AI developed an AI model that can generate text-based conversations tied to specific objects or regions in an image.

Authors:

(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);

(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);

(3) Sahal Shaji, Mohamed bin Zayed University of AI;

(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;

(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;

(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;

(9) Ming-Hsuan Yang, University of California - Merced and Google Research;

(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 8 of 10 of a study detailing the development of an AI model designed to describe images to users. Read the rest below.


Supplementary Material (Part 1)


Supplementary Material (Part 2)

B. Additional Downstream Tasks

B.1. Phrase Grounding

To adapt the GLaMM model for phrase grounding, we repurpose the GCG dataset for this task. Specifically, the answers in the GCG dataset are used as questions, and the parts of the captions containing groundings are treated as phrases. The model is then trained to locate pixel-level groundings for these phrases, which are enclosed within <p> and </p> tokens. The results of this adaptation are shown in the following figure.
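To make the repurposing step concrete, here is a minimal Python sketch of how GCG-style answers could be turned into per-phrase grounding pairs. The field names (`caption`, `masks`), the query template, and the helper itself are illustrative assumptions based on the description above, not the authors' released data pipeline.

```python
import re

# Matches phrases tagged with <p>...</p> in a GCG-style caption.
PHRASE_TAG = re.compile(r"<p>(.*?)</p>")

def gcg_to_grounding_pairs(gcg_sample):
    """Turn one GCG answer into per-phrase grounding training pairs."""
    caption = gcg_sample["caption"]  # GCG answer, now used as the question side
    masks = gcg_sample["masks"]      # one segmentation mask per tagged phrase
    phrases = PHRASE_TAG.findall(caption)
    assert len(phrases) == len(masks), "each <p>...</p> span needs a mask"
    # Each tagged phrase becomes a grounding query; the mask is the
    # pixel-level target the model is trained to predict.
    return [
        {"query": f"Please segment: <p>{phrase}</p>", "target_mask": mask}
        for phrase, mask in zip(phrases, masks)
    ]

# Example: a caption with two grounded phrases yields two training pairs.
sample = {
    "caption": "<p>A dog</p> chasing <p>a red ball</p> on the grass.",
    "masks": ["dog_mask", "ball_mask"],  # placeholders for binary masks
}
for pair in gcg_to_grounding_pairs(sample):
    print(pair["query"])
```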


B.2. Conversational Style Question Answering

We evaluate our model on LLaVA-Bench [28, 29], which uses GPT-4 to score model responses. This benchmark tests the model on three types of tasks: conversational question answering, detailed description, and complex reasoning. The evaluation provides insights into the model's conversational and reasoning capabilities.
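Since GPT-4 cannot see the image itself, LLaVA-Bench-style evaluation asks GPT-4 to rate a candidate answer against a reference answer in text form. Below is a minimal sketch of that judging loop using the openai Python client; the prompt wording, the score parsing, and the relative-score formula are simplifications assumed from the benchmark's general protocol, not its verbatim implementation.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge(question, reference_answer, candidate_answer):
    """Ask GPT-4 to score a reference and a candidate answer out of 10."""
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant 1 (reference): {reference_answer}\n\n"
        f"Assistant 2 (candidate): {candidate_answer}\n\n"
        "Rate the helpfulness, relevance, and accuracy of each assistant "
        "on a scale of 1 to 10. Reply with the two scores on the first "
        "line, separated by a space, then a short justification."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    first_line = reply.choices[0].message.content.split("\n")[0]
    ref_score, cand_score = map(float, first_line.split()[:2])
    # LLaVA-Bench reports the candidate's score relative to the reference.
    return 100.0 * cand_score / ref_score
```

In the actual benchmark, the judge is also given a detailed textual description of the image as context; the sketch omits that for brevity.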


The results in Tab. 8 compare GLaMM with previous open-source models. We note that GLaMM's performance is on par with the recently released LLaVA-1.5, which leverages additional data for vision-to-language alignment. Qualitative results are shown in Fig. 11 and Fig. 13.


Table 8. Evaluation of GLaMM on conversational style QA using LLaVA-Bench. The table compares GLaMM’s performance with previous open-source models in conversation question-answering, detailed descriptions, and complex reasoning tasks.


This paper is available on arXiv under the CC BY 4.0 DEED license.