UAE Researchers Teach AI to Watch, Listen, and Understand Videos Like Humans

Written by autoencoder | Published 2024/12/20
Tech Story Tags: artificial-intelligence | large-multimodal-model | pixel-grounding | ai-for-editing-video | video-generative-models | image-based-llava-model | image-based-lmms | pg-video-llava

TL;DR: Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming other models at this task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (equal contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (equal contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.

Table of Links

Supplementary Material

4.1. Implementation Details

For audio transcript extraction, the base Whisper model is used. Our grounding module builds on the GroundingDINO-T variant and CLIP ViT-B/32. For image tagging, we use the RAM Swin-Large variant (input size 384). The DEVA tracker is applied in the online setting in our experiments.
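As a rough illustration of the transcript-extraction step, the sketch below loads the base Whisper checkpoint via the openai-whisper package and transcribes a video's audio track. The video path and the printing of timestamped segments are our own illustrative assumptions, not the authors' code.

```python
# A minimal sketch of audio transcript extraction, assuming the
# openai-whisper package; "input_video.mp4" is a placeholder path.
import whisper

# Load the base Whisper variant, as used in the paper.
asr_model = whisper.load_model("base")

# Whisper extracts the audio track via ffmpeg and transcribes it.
result = asr_model.transcribe("input_video.mp4")
transcript = result["text"]

# Timestamped segments can later be aligned with video frames.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```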

The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, for zero-shot question-answering evaluation, and for extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Vicuna-13b-v1.5 is also used to implement entity matching as in [49].
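For concreteness, here is a minimal sketch of how the key-noun / referring-expression extraction could be driven with Vicuna-13b-v1.5 through Hugging Face Transformers. The prompt wording and the example sentence are illustrative assumptions, not the exact prompt used in the paper.

```python
# A hypothetical sketch of referring-expression extraction with
# Vicuna-13b-v1.5; the prompt text below is our own illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Example model output from the video-LMM (illustrative).
model_output = "A man in a red jacket is skiing down a snowy slope."
prompt = (
    "USER: Extract the single key noun or referring expression that "
    f"identifies the main object in this sentence: \"{model_output}\"\n"
    "ASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens after the prompt.
referring_expr = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(referring_expr)  # e.g., "a man in a red jacket"
```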

This paper is available on arxiv under CC BY 4.0 DEED license.

