This New AI Doesn’t Just Watch Videos—It Listens, Learns, and Talks Back Too

by @autoencoder



Too Long; Didn't Read

Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming prior models at the task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 3 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.



3.1. Overview

In this paper, we introduce PG-Video-LLaVA, a novel Large Multimodal Model (LMM) designed to align video and audio representations with a Large Language Model (LLM). This integration equips PG-Video-LLaVA to handle both video and audio data proficiently in conversational contexts. Additionally, our method integrates a specialized plug-and-play module for effective video grounding (see Figure 2).


In constructing PG-Video-LLaVA, our approach integrates sophisticated mechanisms for aligning video and audio signals with language processing capabilities, thereby facilitating a comprehensive multimodal analysis. Central to our model is an advanced CLIP-based video encoder, which has been specifically adapted to process both spatial and temporal dimensions of video data. This adaptation enables a deeper understanding of video content, setting PG-Video-LLaVA apart from conventional image-centric models.
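The section does not spell out how the CLIP encoder's per-frame features are adapted to the temporal dimension. A common scheme in related video LMMs is to average CLIP patch features across time and across space and concatenate the two pooled sequences; the sketch below illustrates that idea only and is not taken from the paper. The shapes (`T` frames, `N` patches, `D` channels) and the function name are assumptions.

```python
import numpy as np

def spatiotemporal_pool(frame_features: np.ndarray) -> np.ndarray:
    """Pool per-frame CLIP patch features into one video token sequence.

    frame_features: (T, N, D) — T frames, N spatial patches, D channels.
    Returns (T + N, D): temporal tokens (mean over patches, one per frame)
    followed by spatial tokens (mean over frames, one per patch).
    """
    temporal = frame_features.mean(axis=1)  # (T, D): one token per frame
    spatial = frame_features.mean(axis=0)   # (N, D): one token per patch
    return np.concatenate([temporal, spatial], axis=0)

# Hypothetical example: 8 frames, 256 patches, 1024-dim features
feats = np.random.rand(8, 256, 1024).astype(np.float32)
tokens = spatiotemporal_pool(feats)
print(tokens.shape)  # (264, 1024)
```

The resulting token sequence would then be projected into the LLM's embedding space; the projection layer is omitted here for brevity.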


For training, PG-Video-LLaVA utilizes the VideoInstruct100K [22] dataset, comprising 100K video instructions derived from ActivityNet-200 [11]. This diverse dataset ensures that the model is well-equipped to handle a broad spectrum of video contexts with high accuracy. In addition to visual processing, PG-Video-LLaVA incorporates state-of-the-art audio analysis by leveraging advanced audio transcription techniques, similar to those employed in WhisperX [2] and Whisper-AT [10]. This integration allows the model to process and understand audio inputs effectively, enhancing its overall multimodal interpretation capabilities.
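The section does not show how transcribed speech reaches the LLM. One plausible pattern is to fold timestamped transcript segments (the shape of output a Whisper-style transcriber produces) into the text prompt alongside the user's question. The function name, segment format, and prompt template below are illustrative assumptions, not the paper's actual interface.

```python
def merge_transcript(question: str, segments: list) -> str:
    """Prepend timestamped speech segments to a user query.

    `segments` is assumed to look like Whisper-style transcriber output:
    [{"start": 0.0, "end": 2.5, "text": "..."}, ...].
    """
    lines = [
        f'[{s["start"]:.1f}-{s["end"]:.1f}s] {s["text"].strip()}'
        for s in segments
        if s["text"].strip()  # drop empty segments
    ]
    transcript = "\n".join(lines)
    return f"Audio transcript:\n{transcript}\n\nQuestion: {question}"

# Hypothetical transcript for a short clip
segs = [
    {"start": 0.0, "end": 2.5, "text": "Welcome to the cooking show."},
    {"start": 2.5, "end": 5.0, "text": "Today we make pasta."},
]
prompt = merge_transcript("What dish is being prepared?", segs)
print(prompt)
```

Filtering out non-speech segments before merging (one motivation for Whisper-AT-style audio tagging) would slot in naturally at the list-comprehension step.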


While PG-Video-LLaVA’s foundation is based on the LLaVA-1.5 [18] framework, it is extended for videos to incorporate spatio-temporal representations, audio understanding, and visual grounding capabilities. Its unique combination of enhanced video encoding, extensive training dataset, integrated audio processing, and grounding capability marks it as a step forward in the field of LMMs.



This paper is available on arxiv under CC BY 4.0 DEED license.