UAE Researchers Teach AI to Watch, Listen, and Understand Videos Like Humans

Written by autoencoder | Published 2024/12/20
Tech Story Tags: artificial-intelligence | large-multimodal-model | pixel-grounding | ai-for-editing-video | video-generative-models | image-based-llava-model | image-based-lmms | pg-video-llava

TL;DR: Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming other models at this task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (equal contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (equal contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.

Table of Links

Supplementary Material

4.1. Implementation Details

For audio transcript extraction, the base Whisper model is used. Our grounding module builds on the GroundingDINO-T variant and CLIP ViT-B/32. For image tagging, we use the RAM Swin-Large variant (input size 384). The DEVA tracker is applied in the online setting in our experiments.
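As a rough illustration of the transcript-extraction step, the sketch below loads the base Whisper checkpoint via the openai-whisper package and transcribes a video's audio track. The video path and the printing of timestamped segments are our own illustrative assumptions, not the authors' code.

```python
# A minimal sketch of audio transcript extraction, assuming the
# openai-whisper package; "input_video.mp4" is a placeholder path.
import whisper

# Load the base Whisper variant, as used in the paper.
asr_model = whisper.load_model("base")

# Whisper extracts the audio track via ffmpeg and transcribes it.
result = asr_model.transcribe("input_video.mp4")
transcript = result["text"]

# Timestamped segments can later be aligned with video frames.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```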

The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, for zero-shot question-answering evaluation, and for extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Vicuna-13b-v1.5 is also used to implement entity matching as in [49].
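For concreteness, here is a minimal sketch of how the key-noun / referring-expression extraction could be driven with Vicuna-13b-v1.5 through Hugging Face Transformers. The prompt wording and the example sentence are illustrative assumptions, not the exact prompt used in the paper.

```python
# A hypothetical sketch of referring-expression extraction with
# Vicuna-13b-v1.5; the prompt text below is our own illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Example model output from the video-LMM (illustrative).
model_output = "A man in a red jacket is skiing down a snowy slope."
prompt = (
    "USER: Extract the single key noun or referring expression that "
    f"identifies the main object in this sentence: \"{model_output}\"\n"
    "ASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens after the prompt.
referring_expr = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(referring_expr)  # e.g., "a man in a red jacket"
```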

This paper is available on arxiv under CC BY 4.0 DEED license.

