
DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description


Audio Description is narrated commentary designed to help vision-impaired audiences perceive key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains an unresolved challenge. I'll begin with existing methods that rely solely on frame-level embeddings, which describe object-based content effectively but lack contextual information across scenes. I'll then introduce DANTE-AD, an enhanced video description model that uses a dual-vision Transformer-based architecture to address this gap: it sequentially fuses frame-level and scene-level embeddings to improve long-term contextual understanding. I'll discuss the evaluation of the approach on a broad range of key scenes from well-known movie clips, using both traditional NLP metrics and LLM-based evaluations. Finally, I'll look at where the field of AI-assisted audio description is heading.
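The abstract describes the core mechanism only at a high level: frame-level and scene-level embeddings are sequentially fused by attention before being passed to a language model. Below is a minimal PyTorch sketch of that idea; the module structure, dimensions, and the exact fusion order (self-attention over frame tokens, then cross-attention to scene context) are illustrative assumptions, not the actual DANTE-AD implementation.

```python
# Illustrative sketch only: names, dimensions, and fusion order are
# assumptions based on the talk abstract, not the DANTE-AD codebase.
import torch
import torch.nn as nn


class DualVisionFusion(nn.Module):
    """Fuses frame-level tokens with scene-level context via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, frame_emb: torch.Tensor, scene_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (batch, n_frames, dim) -- per-frame visual tokens
        # scene_emb: (batch, n_scenes, dim) -- long-range scene context tokens
        x = frame_emb
        # 1) self-attention over the frame tokens
        a, _ = self.self_attn(x, x, x)
        x = self.norm1(x + a)
        # 2) sequential fusion: frame tokens attend to scene-level context
        c, _ = self.cross_attn(x, scene_emb, scene_emb)
        x = self.norm2(x + c)
        # 3) position-wise feed-forward
        x = self.norm3(x + self.ffn(x))
        # Fused visual tokens, e.g. usable as a prefix for a language model
        # that decodes the audio description text.
        return x


if __name__ == "__main__":
    fusion = DualVisionFusion()
    frames = torch.randn(1, 32, 768)  # e.g. 32 sampled frames
    scenes = torch.randn(1, 4, 768)   # e.g. 4 scene-level summaries
    print(fusion(frames, scenes).shape)  # torch.Size([1, 32, 768])
```

The intuition behind this ordering is that per-frame tokens carry fine-grained, object-level detail, while attending to scene-level tokens injects the longer-range context the abstract says frame-only methods lack.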

Image description:

Diagram of the DANTE-AD system for generating audio descriptions from video. The left side shows frame-level and scene-level visual information extracted from a video segment, with timestamps along a timeline. The right side illustrates the architecture: frame and scene embeddings are fused and processed through a Dual Vision Attention Network, followed by a language model that outputs an audio description. Example output: 'Korben's box of matches has one remaining inside. He holds it up, looking anxious.'

Speaker's Info:

Andrew Gilbert is an Associate Professor in Machine Learning at the University of Surrey, where he co-leads the Centre for Creative Arts and Technologies (C-CATS). His research lies at the intersection of computer vision, generative modelling, and multimodal learning, with a particular focus on building interpretable and human-centred AI systems. His work aims to develop machines that not only see and recognise the world but also understand and creatively respond to it.

Dr. Gilbert has made significant contributions to video understanding, long-form video captioning, visual image style modelling, and AI-driven story understanding. A distinctive feature of his research is its integration into the creative industries, applying technical advances to domains such as media production, performance capture, and digital arts. This includes training models to classify genres from movie trailers and designing systems that generate synthetic images and narrative content.

