Attentive Cross-Modal Connections for Learning Multimodal Representations from Wearable Signals for Affect Recognition

Bhatti, Anubhav
Multimodal, Representation Learning, Fusion, Affect Recognition, Wearable Signals, Emotion Recognition, Cognitive Load Assessment
We propose cross-modal attentive connections, a new dynamic and effective technique for multimodal representation learning from wearable data. Our solution can be integrated into any stage of the pipeline, i.e., after any convolutional layer or block, to create intermediate connections between the individual streams responsible for processing each modality. Additionally, our method benefits from two properties. First, it can share information uni-directionally (from one modality to the other) or bi-directionally. Second, it can be integrated at multiple stages simultaneously, allowing network gradients to be exchanged at several touch points. We perform extensive experiments on three public multimodal wearable datasets, WESAD, SWELL-KW, and CASE, and demonstrate that our method can effectively regulate and share information between different modalities to learn better representations. Our experiments further demonstrate that once integrated into simple CNN-based multimodal solutions (2, 3, or 4 modalities), our method can achieve superior or competitive performance to the state-of-the-art and outperform a variety of baseline uni-modal and classical multimodal methods. Further, we study 'cognitive load' classification to examine multimodal representation learning and our proposed solution within affective computing but beyond 'emotion recognition.' To this end, given the lack of widely adopted datasets in this area, we introduce a new dataset called Cognitive Load Assessment in REaltime (CLARE), with which we evaluate our proposed method. In this dataset, we collect a number of wearable modalities from 24 participants. We use MATB-II to induce different levels of cognitive load in participants by changing the complexity of tasks during the experiment. Contrary to other datasets in this domain, we record subjective cognitive load values in real time at 10-second intervals during the experiment. We then show that our proposed solution results in effective multimodal representation learning, outperforming baseline uni-modal and classical multimodal methods (feature-level fusion and score-level fusion) in classifying cognitive load.
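The connection described above can be sketched in a simplified, framework-free form. This is an illustrative assumption, not the paper's exact formulation: the function names and the element-wise scoring are stand-ins for the learned attention used in the actual model, and real feature maps would be multi-channel tensors inside a CNN rather than flat lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(feat_src, feat_dst):
    """One uni-directional attentive connection (a sketch):
    features of the destination stream are re-weighted by attention
    derived jointly from both streams, so the destination modality
    receives information from the source modality.

    The element-wise product scoring below is a hypothetical stand-in
    for the learned compatibility function in the paper."""
    scores = [s * d for s, d in zip(feat_src, feat_dst)]
    weights = softmax(scores)
    # Rescale so the recalibrated features keep a comparable magnitude.
    n = len(feat_dst)
    return [w * n * d for w, d in zip(weights, feat_dst)]

def bidirectional_connection(feat_a, feat_b):
    """Bi-directional sharing: each stream is modulated by the other.
    Such a connection could be inserted after any convolutional block,
    and at several stages at once."""
    return (cross_modal_attention(feat_b, feat_a),
            cross_modal_attention(feat_a, feat_b))
```

In a real pipeline, each per-modality CNN stream would call a connection like this on its intermediate activations, either one-way or both ways, before continuing to the next block.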