
Please note: this event has passed



The egocentric (or first-person) perspective is a natural human perspective, but static, third-person cameras have long dominated video understanding in Computer Vision. With wearable devices becoming more commonplace, including recent commercial releases, teaching computers to understand the egocentric perspective is increasingly important. In this talk, I will present recent work on teaching computers to interact with videos recorded from the egocentric perspective, akin to how we as humans do. The talk will introduce a large-scale dataset that combines both first-person and third-person viewpoints. I will showcase the unique challenges of searching through recordings for user-defined moments, methods that find the proverbial needle in the haystack, and how we go about training vision-language models for the task of hand-object interaction referral.



Michael is a lecturer in Computer Vision at the School of Computer Science at the University of Bristol, with an interest in video and language understanding, particularly from the egocentric perspective. He focuses on how vision and language can be tied together for tasks such as cross-modal retrieval, grounding, and captioning. He finished his PhD, titled "Verbs and Me: an Investigation into Verbs as Labels for Action Recognition in Video Understanding", in 2019 under the supervision of Professor Dima Damen. Afterwards, he stayed in the same lab as a Post-Doc, working on vision and language and the collection of the Ego4D Dataset. Michael has led the organisation of the EPIC workshop series from 2021 onwards, is an organiser of the Ego4D workshop series, and is an ELLIS member.

How to join

To attend, please email Alfie Abdul Rahman (

Event details

Bush House (S)5.01
Bush House
Strand campus, 30 Aldwych, London, WC2B 4BG