Two papers by SimTech-funded PhD researcher Ekta Sood were recently accepted for publication. She works in the field of Human-Computer Interaction and Cognitive Systems and thereby contributes to SimTech's Digital Human Model vision.
The paper “Gaze-enhanced Crossmodal Embeddings for Emotion Recognition”, on which she is second author, was accepted at the 2022 ACM Symposium on Eye Tracking Research and Applications (ETRA), which will be held both in person and virtually at Seattle Children’s Building Cure in Seattle, WA, USA, from June 8 to 11, 2022. The aim of ETRA is to bring together researchers and practitioners from across fields with the common goal of advancing eye tracking research.
Gaze-enhanced Crossmodal Embeddings for Emotion Recognition
(Ahmed Abdou, Ekta Sood, Philipp Müller, Andreas Bulling)
Emotional expressions are inherently multimodal – integrating facial behavior, speech, and gaze – but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, despite its importance, a representation of gaze was not included. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art for both audio-only and video-only emotion classification on the popular One-Minute Gradual Emotion Recognition dataset. Furthermore, we report extensive ablation experiments and provide insights into the performance of different state-of-the-art gaze representations and integration strategies. Our results not only underline the importance of gaze for emotion recognition but also demonstrate a practical and highly effective approach to leveraging gaze information for this task.
The second paper, “Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA”, on which she is co-first author, was accepted at the 7th Workshop on Representation Learning for NLP (RepL4NLP 2022). The workshop is hosted by ACL 2022 and will be held on 26 May 2022. It was introduced as a synthesis of several years of independent *CL workshops focusing on vector space models of meaning, compositionality, and the application of deep neural networks and spectral methods to NLP. It provides a forum for discussing recent advances on these topics, as well as future research directions in linguistically motivated vector-based models in NLP.
Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA
(Adnen Abdessaied*, Ekta Sood*, Andreas Bulling)
[Equal contribution by the first two authors.] We propose the Video Language Co-Attention Network (VLCN) – a novel memory-enhanced model for Video Question Answering (VideoQA). Our model combines two original contributions: a multimodal fast-learning feature fusion (FLF) block and a mechanism that uses self-attended language features to separately guide neural attention on both static and dynamic visual features extracted from individual video frames and short video clips. When trained from scratch, VLCN achieves results competitive with the state of the art on both MSVD-QA and MSRVTT-QA, with test accuracies of 38.06% and 36.01%, respectively. Through an ablation study, we further show that FLF improves generalization across different VideoQA datasets and performance for question types that are notoriously challenging in current datasets, such as long questions that require deeper reasoning as well as questions with rare answers.
Ekta Sood is a PhD student at the University of Stuttgart, supervised by Prof. Dr. Andreas Bulling in the Perceptual User Interfaces group. She holds a Bachelor's degree in Cognitive Science from the University of California, Santa Cruz, and a Master's degree in Computational Linguistics from the University of Stuttgart. In her PhD, Ekta works on modeling fundamental cognitive functions of attention and memory at the intersection of computer vision and natural language processing. She focuses on advancing performance in machine comprehension with gaze-assisted deep learning approaches, bridging the gap between data-driven and cognitive models, and exploring new ways to interpret neural attention using human physiological data. Her publications on this topic have shown that such integration is possible and beneficial, and they have received positive feedback at top ML and NLP conferences such as NeurIPS, ICCV, and CoNLL.