6th Workshop and Competition on Affective Behavior Analysis In-the-Wild (ABAW)

Event: IEEE CVPR, Seattle, USA, 18/06/2024 · Duration: 257 min

Abstract

The first part of the recording features presentations from the 6th Workshop and Competition on Affective Behavior Analysis In-the-Wild (ABAW), covering approaches to facial analysis and emotion recognition. Speakers present methods for Valence-Arousal Estimation, Compound Expression Recognition, Action Unit Detection, and Emotional Mimicry Intensity Estimation, often using pre-trained models, multi-task training, and novel fusion strategies. These talks highlight the challenge of limited labeled data and the benefits of multi-modal integration and semi-supervised learning for robust and accurate affective behavior analysis.

The second part concludes a presentation on facial expression recognition, detailing semi-supervised training, teacher and student network architectures, debiasing mechanisms, and temporal refinement, along with ablation studies and competition results. It then moves to a talk on multimodal social signal processing for human-robot interaction. Key topics include conversational turn-taking, voice adaptation to environmental context, co-speech gesture generation, and long-term human motion prediction. The speaker also examines the complexities of emotion recognition in real-world scenarios, including cultural differences, the limits of facial expressions as indicators of true feelings, and the ethical implications of AI emotion recognition, and proposes a nonverbal Turing test and future adaptive interfaces.

The third part features a series of presentations on affective computing and computer vision. Topics range from novel architectures for facial expression and emotion recognition, such as MAURA and Joint Multimodal Transformers, to specialized techniques for micro-expression detection and real-time egocentric facial animation in virtual reality. It also covers violence detection video analytics, 3D human pose estimation with occlusions, efficient engagement estimation, multi-modal arousal and valence estimation under noisy conditions, and video anomaly detection in the wild. Each presentation describes its methodology, datasets, and evaluation results, often reporting state-of-the-art performance and addressing practical challenges in real-world applications.

Speakers

Dimitrios Kollias — Queen Mary University of London
Andrey V. Savchenko — SBER AI Lab
Heysem Kaya — Utrecht University
Jun-Hwa Kim — Konyang University
Valeriya Strizhkova — Inria, Université Côte d’Azur, 3IA Côte d’Azur Interdisciplinary Institute for Artificial Intelligence
Angelica Lim — Simon Fraser University, Assistant Professor, School of Computing Science
Ankith Jain Rakesh Kumar — University of California, Riverside
Bir Bhanu — University of California, Riverside
Paul Waligora — ETS Montreal
Xiaoyun (Robert) Yang — Meta Reality Lab
Damith Senadeera — Queen Mary University of London
Filipa Lino — Instituto Superior Técnico, Lisboa
Alexander Vedernikov — University of Oulu
Shanle Yao — University of North Carolina, Charlotte
Hansung Kim — University of Southampton
Niklas Wagner — Karlsruhe Institute of Technology (KIT)
Feng Qiu — Netease Fuxi AI Lab, The University of Queensland
Seongjae Min — Kookmin University
Junseok Yang — Kookmin University
Sejoon Lim — Kookmin University

Talks (29)

00:00:00 — Dimitrios Kollias: 6th Workshop and Competition on Affective Behavior Analysis In-the-Wild (ABAW)
- Dimitrios Kollias introduces the 6th Workshop and Competition on Affective Behavior Analysis In-the-Wild (ABAW), outlining its focus, aim, and history, and presenting the agenda for the day.
00:15:51 — Andrey V. Savchenko: Leveraging Pre-trained Multi-task Deep Models for Trustworthy Facial Analysis in Affective Behaviour Analysis in-the-Wild
- Andrey V. Savchenko presents their approach to facial analysis in the ABAW competition, focusing on three tasks: Valence-Arousal Estimation, Compound Expression Recognition, and Emotional Mimicry Intensity Estimation, utilizing pre-trained models and a multi-task training strategy.
00:32:19 — Heysem Kaya: Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion
- Heysem Kaya presents a zero-shot audio-visual method for compound expression recognition, addressing the lack of labeled data for compound emotions by leveraging basic emotion recognition models and a novel fusion strategy.
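
The zero-shot idea in this talk can be illustrated with a minimal sketch: basic-emotion probabilities from an audio model and a visual model are fused, and each compound class is scored from its two component emotions, so no compound-labeled training data is needed. The label order, the compound-to-basic mapping, the fixed fusion weight, and the sum-of-components scoring rule are all assumptions made for illustration; the summary does not specify the fusion strategy the authors actually used.

```python
import numpy as np

# Hypothetical label order shared by both basic-emotion models.
BASIC = ["neutral", "anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Illustrative compound classes, each mapped to two basic emotions (assumed).
COMPOUNDS = {
    "fearfully surprised": ("fear", "surprise"),
    "happily surprised": ("happiness", "surprise"),
    "sadly angry": ("sadness", "anger"),
}


def fuse(p_audio, p_video, w_audio=0.4):
    """Weighted average of two probability vectors, renormalized to sum to 1."""
    p = w_audio * np.asarray(p_audio, dtype=float) \
        + (1.0 - w_audio) * np.asarray(p_video, dtype=float)
    return p / p.sum()


def compound_scores(p_basic):
    """Score each compound class as the sum of its two component probabilities."""
    idx = {name: i for i, name in enumerate(BASIC)}
    return {
        name: float(p_basic[idx[a]] + p_basic[idx[b]])
        for name, (a, b) in COMPOUNDS.items()
    }


def predict_compound(p_audio, p_video, w_audio=0.4):
    """Return the compound class with the highest fused score."""
    scores = compound_scores(fuse(p_audio, p_video, w_audio))
    return max(scores, key=scores.get)


if __name__ == "__main__":
    audio = [0.05, 0.05, 0.05, 0.30, 0.05, 0.10, 0.40]
    video = [0.05, 0.05, 0.05, 0.35, 0.10, 0.05, 0.35]
    print(predict_compound(audio, video))
```

In practice the fusion weight would be tuned or learned per modality rather than fixed.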
00:46:15 — Tobias Hallmen: Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction
- Tobias Hallmen presents a unimodal multi-task fusion approach for emotional mimicry intensity prediction, focusing solely on audio features and achieving second place in the ABAW competition’s Emotional Mimicry Intensity Estimation Challenge.
00:59:59 — Wei Zhang: An Effective Ensemble Learning Framework for Affective Behaviour Analysis
- Wei Zhang presents an effective ensemble learning framework for affective behavior analysis, achieving first place in all five ABAW competition tracks by leveraging MAE pre-training, multi-modal fusion, and a novel ensemble learning strategy.
01:09:15 — Jun-Hwa Kim: CCA-Transformer: Cascaded Cross-Attention based Transformer for Facial Analysis in Multi-Modal Data
- Jun-Hwa Kim presents the CCA-Transformer, a cascaded cross-attention based transformer for facial analysis in multi-modal data, aiming to enhance performance by effectively integrating visual and audio features.
01:17:50 — Jun Yu: AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts
- Jun Yu presents AUD-TGN, an approach for advancing action unit detection in wild audiovisual contexts, leveraging temporal convolution and a fine-tuned GPT-2 model for robust feature extraction and cross-modal fusion.
01:23:55 — Jun Yu: Exploring Facial Expression Recognition through Semi-Supervised Pre-training and Temporal Modeling
- Jun Yu presents an approach to facial expression recognition using semi-supervised pre-training and temporal modeling, addressing the reliance on extensive labeled data and the limits of static images by leveraging face recognition data and a temporal encoder.
01:25:39 — Valeriya Strizhkova: MAURA: Video Representation Learning for Conversational Facial Expression Recognition Guided by Multiple Angles Reconstruction
- This segment concludes the presentation on MAURA, covering semi-supervised training, teacher and student network architectures, debiasing mechanisms, temporal refinement, dataset details, and ablation study results, highlighting the method’s third-place achievement in a competition.
02:51:18 — Strizhkova et al.: MAURA Architecture
- The MAURA architecture for video representation learning for facial expression recognition is explained, detailing its pre-training and fine-tuning phases.
02:55:03 — Ankith Jain Rakesh Kumar: Uncovering Hidden Emotions with Adaptive Multi-Attention Graph Networks
- The speaker introduces their work on uncovering hidden emotions using adaptive multi-attention graph networks, focusing on micro-expression classification.
03:00:46 — Ankith Jain Rakesh Kumar: Adaptive Learnable Attention
- The adaptive learnable attention mechanism is explained, detailing how the graph network learns attention weights between self-attention and Gaussian attention, with a learnable parameter lambda to combine them.
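
One way to picture the combination described here is as a convex blend of two row-normalized attention matrices over graph nodes. The sketch below is an assumption-laden illustration, not the authors' formulation: the Gaussian term is taken to be a kernel over pairwise node distances, lambda is kept in (0, 1) with a sigmoid, and all function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_weights(X, Wq, Wk):
    """Scaled dot-product attention weights between graph nodes, shape (N, N)."""
    Q, K = X @ Wq, X @ Wk
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))


def gaussian_attention_weights(pos, sigma=1.0):
    """Row-normalized Gaussian kernel over pairwise node distances (assumed form)."""
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)
    g = np.exp(-d2 / (2.0 * sigma ** 2))
    return g / g.sum(axis=-1, keepdims=True)


def combined_attention(X, pos, Wq, Wk, lam_raw, sigma=1.0):
    """Blend the two attention maps with a learnable scalar.

    lam_raw is the unconstrained parameter that would be trained;
    the sigmoid keeps the mixing weight lambda in (0, 1).
    """
    lam = 1.0 / (1.0 + np.exp(-lam_raw))
    return lam * self_attention_weights(X, Wq, Wk) \
        + (1.0 - lam) * gaussian_attention_weights(pos, sigma)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))      # node features
    pos = rng.normal(size=(5, 2))    # node coordinates (hypothetical)
    Wq = rng.normal(size=(8, 4))
    Wk = rng.normal(size=(8, 4))
    A = combined_attention(X, pos, Wq, Wk, lam_raw=0.0)
    print(A.sum(axis=-1))  # each row sums to 1
```

Because both inputs are row-stochastic, the blend stays a valid attention distribution for any value of lambda.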
03:04:43 — Paul Waligora: Joint Multimodal Transformer for Emotion Recognition in the Wild
- The speaker introduces their work on a joint multimodal transformer for emotion recognition in the wild.
03:09:45 — Paul Waligora: Conclusion
- The speaker concludes by summarizing the paper’s contributions: a new model for joint feature representation to leverage redundancy and complementarity between modalities, and empirical results showing JMT fusion outperforms vanilla multimodal transformers.
03:11:46 — Xiaoyun (Robert) Yang: REFA: Real-time Egocentric Facial Animations for Virtual Reality
- The speaker introduces REFA, a novel system for real-time egocentric facial animations for virtual reality, developed by Meta Reality Lab.
03:17:15 — Xiaoyun (Robert) Yang: On Device Model
- The on-device model, based on a convolutional neural network, is designed to run locally on the headset for privacy and low latency. It uses a multi-branch blendshape regressor with dedicated branches for eyes to improve accuracy for asymmetric upper face motions.
03:20:40 — Damith Senadeera: CUE-Net: Violence Detection Video Analytics with Spatial Cropping, Enhanced UniformerV2 and Modified Efficient Additive Attention
- The speaker introduces CUE-Net, a novel architecture for violence detection video analytics.
03:25:45 — Damith Senadeera: Global UniBlock V3 & MEAA
- The Global UniBlock V3 incorporates Dynamic Positional Embedding (DPE) and uses the Modified Efficient Additive Attention (MEAA) mechanism. MEAA avoids the quadratic time complexity of standard self-attention by replacing matrix multiplications with dot products and removing the Value (V) component; global features are captured through a single global query vector (Q*).
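
A rough sketch of the general additive-attention pattern may help here. It is not CUE-Net's MEAA: the specific modifications are not given in this summary, and the weight names, the residual connection, and the output projection are assumptions. The sketch scores each token with a dot product against a learned vector, pools the queries into one global query vector, and lets that vector interact with the keys element-wise, so no N x N attention matrix and no Value projection is ever formed.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def efficient_additive_attention(X, Wq, Wk, w_a, Wo):
    """Additive attention with a single global query and no Value projection.

    X:   (n, d_in) token features
    Wq:  (d_in, d) query projection
    Wk:  (d_in, d) key projection
    w_a: (d,)      learned scoring vector (hypothetical name)
    Wo:  (d, d)    output projection (hypothetical name)
    Cost grows linearly with the number of tokens n.
    """
    Q = X @ Wq                                   # (n, d)
    K = X @ Wk                                   # (n, d)
    scores = Q @ w_a / np.sqrt(Q.shape[-1])      # (n,) one dot product per token
    alpha = softmax(scores, axis=0)              # weights over tokens
    q_global = alpha @ Q                         # (d,) single global query vector
    context = K * q_global                       # (n, d) element-wise interaction
    return Q + context @ Wo                      # residual on the queries (assumed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_in, d = 6, 16, 8
    X = rng.normal(size=(n, d_in))
    out = efficient_additive_attention(
        X,
        rng.normal(size=(d_in, d)),
        rng.normal(size=(d_in, d)),
        rng.normal(size=d),
        rng.normal(size=(d, d)),
    )
    print(out.shape)
```

The largest intermediate here has shape (n, d), which is what makes this family of mechanisms attractive for long video token sequences.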
03:31:36 — Filipa Lino: 3D Human Pose Estimation with Occlusions: Introducing BlendMimic3D Dataset and GCN Refinement
- The speaker introduces their work on 3D human pose estimation with occlusions, presenting the BlendMimic3D dataset and a GCN refinement approach.
03:37:21 — Filipa Lino: Conclusions
- The speaker concludes that the BlendMimic3D dataset and GCN-based refinement method advance 3D HPE by improving robustness against occlusions and enhancing accuracy in real-world applications.
03:38:59 — Alexander Vedernikov: TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals
- The speaker introduces TCCT-Net, a two-stream network architecture for fast and efficient engagement estimation using behavioral feature signals.
03:44:14 — Alexander Vedernikov: Results. Speed performance. Ablation study
- The speed performance of TCCT-Net is highlighted, showing it is faster and more efficient than traditional RNN-based SOTA methods. An ablation study confirms the importance of temporal-frequency and temporal-spatial streams, augmentation, and self-attention.
03:47:18 — Heysem Kaya: Multi-modal Arousal and Valence Estimation under Noisy Conditions
- The speaker introduces their work on multi-modal arousal and valence estimation under noisy conditions.
03:51:21 — Shanle Yao: Evaluating the Effectiveness of Video Anomaly Detection in the Wild: Online Learning and Inference for Real-world Deployment
- The speaker introduces their work on evaluating the effectiveness of video anomaly detection in the wild, focusing on online learning and inference for real-world deployment.
03:55:38 — Hansung Kim: Unsupervised Multi-Person 3D Human Pose Estimation from 2D Poses Alone
- The speaker introduces their work on unsupervised multi-person 3D human pose estimation from 2D poses alone.
03:59:15 — Niklas Wagner: CAGE: Circumplex Affect Guided Expression Inference
- The speaker introduces CAGE, a method for circumplex affect guided expression inference.
04:00:52 — Feng Qiu: Learning Transferable Compound Expressions from Masked AutoEncoder Pretraining
- The speaker introduces their work on learning transferable compound expressions from masked autoencoder pretraining.
04:05:52 — Feng Qiu: Method
- The method involves extracting unimodal features (visual, audio, text), using a temporal segment network, and text-based contrastive learning. Text is used as a primary feature to guide mimicry estimation.
05:38:35 — Angelica Lim: Social Signals in the Wild: Multimodal Machine Learning for Human-Robot Interaction (HRI)
- This talk explores multimodal social signal processing for human-robot interaction, covering conversational turn-taking, voice adaptation to environment, co-speech gesture generation, human motion prediction, and the complexities of emotion recognition in real-world and cultural contexts, proposing a nonverbal Turing test and future adaptive interfaces.

Key Takeaways

The 6th ABAW Competition features five distinct challenges in affective behavior analysis, pushing the boundaries of emotion recognition and related tasks.
Leveraging pre-trained models and multi-task training strategies is a common and effective approach to improve performance across various facial analysis tasks.
Multi-modal fusion, integrating audio, visual, and textual data, is crucial for capturing complementary information and enhancing the accuracy of emotion recognition systems.
Addressing data limitations through semi-supervised learning and temporal modeling techniques can significantly improve the robustness and generalizability of facial expression recognition models.
Multimodal social signal processing is crucial for developing more natural and effective human-robot interaction, extending beyond basic facial expression recognition to include voice, body language, and contextual cues.
Emotion recognition in AI faces significant challenges due to the complexity of human emotions, cultural differences in expression, and the fact that facial expressions do not always directly reflect internal feelings, leading to ethical concerns and regulatory actions like the EU’s AI Act.
Advanced AI models, particularly Vision-Language Models (VLMs) combined with contextual reasoning (e.g., Chain of Thought), show promise in improving emotion expression recognition in complex, real-world scenarios.
Future interfaces should adapt to human communication styles rather than requiring humans to adapt to technology, necessitating AI systems that can understand and generate nuanced social signals in context.
Novel architectures like MAURA, 2S-AMAGN, and JMT are proposed for robust facial expression and emotion recognition, often leveraging multimodal data and attention mechanisms.
Addressing real-world challenges in affective computing requires specialized techniques such as adaptive frame selection for micro-expressions, personalized blend shape rigs for VR facial animation, and spatial cropping for violence detection.
Synthetic datasets like BlendMimic3D play a crucial role in training models for complex scenarios like 3D human pose estimation with occlusions, where real-world data is scarce or challenging to annotate.
Efficiency and real-time performance are key considerations for deployment on resource-constrained devices, leading to architectures like TCCT-Net and online learning frameworks for video anomaly detection.

Methods / Models / Datasets Mentioned

2S-AMAGN
Action-VST
BlendMimic3D
CCA-Transformer
CPN
CWT
ChatGLM3
D3DP
DWF
Detectron2
ELM
EMOTIC dataset
EfficientNet
EmoStyle
EmoViT
GCN
GEPC
GPT-2
GPT-4o
Gesture2Vec
HOMAGE
HuBERT
JMT
LLaVA
LLaVA-F
LSTM
MAE decoder
MAE encoder
MAE pre-training
MAURA
MEAA
Mistral-F
OpenPose
PoseFormerV2
REFA
RF
ResNet
ResNet50
SAGPOOL
SFU-Store Nav Dataset
STG-NF
SimMIM
TCCT-Net
TCN
TSGAD
Transformer
UE-HRI Dataset
UniformerV2
VGG-Net
VGG16
VideoPose3D
Wav2Vec 2.0
Whisper
YOLO V8

Topics

3D Human Pose Estimation · Action Unit Detection · Affective Behavior Analysis · Compound Expression Recognition · Conversational AI · Cultural Differences in Emotion · Emotion Recognition · Emotion Recognition Challenges · Emotional Mimicry · Emotional Mimicry Intensity Estimation · Engagement Estimation · Facial Analysis · Facial Expression Recognition · Graph Neural Networks · Human-Robot Interaction · Micro-expression Detection · Multi-modal Fusion · Multimodal Emotion Recognition · Multimodal Social Signal Processing · Nonverbal Communication · Occlusion Handling · Online Learning · Real-time Systems · Semi-Supervised Learning · Transformer Architectures · Valence-Arousal Estimation · Video Anomaly Detection
