AI Just Got Better at Watching Videos
by @autoencoder



Too Long; Didn't Read

Researchers in the UAE have developed an AI model that can find and focus on objects in videos, and it outperforms other models at the task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 6 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.


Supplementary Material

4.2. Stronger Baseline

This section provides an overview of the quantitative evaluations conducted to assess the effect of the strengthened baseline on PG-Video-LLaVA. We apply the benchmarking framework from Video-ChatGPT [22], which measures performance along several axes critical for video-based conversational agents: correctness of information, detail orientation, contextual understanding, temporal understanding, and consistency.
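In this framework, each question–answer pair is scored along these axes by prompting a judge LLM. The sketch below only illustrates the general shape of such an axis-wise judging step; the prompt wording, axis labels, and the `judge` callable are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch of axis-wise LLM-as-judge scoring; prompt wording and
# the `judge` callable are illustrative, not the paper's actual pipeline.
from typing import Callable, Dict

AXES = [
    "correctness of information",
    "detail orientation",
    "contextual understanding",
    "temporal understanding",
    "consistency",
]

def build_prompt(axis: str, question: str, reference: str, prediction: str) -> str:
    """Format a judging prompt that asks for a 0-5 score on one axis."""
    return (
        f"You are evaluating a video-QA answer for {axis}.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Reply with a single integer score from 0 (worst) to 5 (best)."
    )

def score_sample(judge: Callable[[str], str], question: str,
                 reference: str, prediction: str) -> Dict[str, float]:
    """Query the judge LLM once per axis and collect numeric scores."""
    scores = {}
    for axis in AXES:
        reply = judge(build_prompt(axis, question, reference, prediction))
        digits = [tok.strip(".,") for tok in reply.split() if tok.strip(".,").isdigit()]
        scores[axis] = float(digits[0]) if digits else 0.0
    return scores
```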


To make the evaluation reliable and reproducible, we have updated our assessment pipeline by replacing GPT-3.5-Turbo with Vicuna-13b-v1.5. This change addresses the reproducibility limitations inherent to the closed-source nature of GPT-3.5-Turbo. We then re-assessed both PG-Video-LLaVA and other recent models to ensure a fair and consistent comparison. The results shown in Table 1 demonstrate that PG-Video-LLaVA outperforms the foundational Video-ChatGPT model and surpasses other recent contributions in the domain.
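Swapping the proprietary judge for an open model mainly means changing the generation backend while keeping the scoring protocol fixed. The following sketch loads Vicuna-13b-v1.5 through Hugging Face Transformers as a local judge; the checkpoint name refers to the public lmsys/vicuna-13b-v1.5 release, but the decoding settings are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of a local Vicuna-13b-v1.5 judge; decoding settings are
# illustrative and not taken from the paper's evaluation pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "lmsys/vicuna-13b-v1.5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def vicuna_judge(prompt: str) -> str:
    """Generate a short judging reply; greedy decoding keeps results deterministic."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

With such a judge in place, the axis-wise scoring sketched earlier could be driven entirely by open weights, e.g. score_sample(vicuna_judge, question, reference, prediction), so re-running the benchmark yields the same scores, which is the reproducibility benefit this section describes.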


Following the quantitative assessment, the qualitative results in Figure 3 illustrate the enhanced baseline’s impact on PG-Video-LLaVA’s performance. The PG-Video-LLaVA (13B) model presents information more accurately, describes scenes in greater detail, and aligns more closely with the context and temporal progression of the videos. This advancement is particularly noticeable in the precise depiction of the child’s engagement with their surroundings and the giraffe’s behaviour, indicating a refined interpretation of both the activities and their settings. These qualitative observations are consistent with the quantitative results, highlighting the strengthened baseline’s role in advancing PG-Video-LLaVA’s capacity for video understanding.


This paper is available on arXiv under the CC BY 4.0 DEED license.