Self-supervised Learning for IMU-based Human Activity Recognition

Rahimi Taghanaki, Setareh
Self-supervised Learning, Contrastive Learning, Human Activity Recognition, Accelerometers, Data Prediction
In recent years, human activity recognition has drawn considerable attention due to its application in a variety of areas such as smart homes and health. The pervasiveness of wearable devices and smartphones has provided many research opportunities for human activity recognition using inertial measurement units. In this thesis, we propose the use of self-supervised learning for human activity recognition using tri-axial data collected from smartphone-embedded accelerometers. To address the limitations of fully-supervised learning, mainly its reliance on labeled data, we propose two self-supervised solutions. Our first solution is a novel method which consists of two steps. First, the representations of unlabeled input signals are learned by training a deep convolutional neural network to predict a segment of masked accelerometer values. Our model exploits a novel scheme that leverages past and present motion along the x and y dimensions, as well as past values of the z axis, to predict future values in the z dimension. This cross-dimensional prediction approach results in effective pretext training, through which our model learns to extract strong representations. Next, we freeze the convolutional blocks and transfer the weights to our downstream network aimed at human activity recognition. For this task, we add a number of fully connected layers to the end of the frozen network and train the added layers with labeled accelerometer signals to learn to classify human activities. We evaluate the performance of our method on three publicly available human activity datasets: UCI HAR, MotionSense, and HAPT, outperforming a number of prior works in the area. In our second solution, similar to our first method, we aim to develop a model that learns strong representations from accelerometer signals, in order to perform robust human activity classification, while reducing the model's reliance on class labels.
Specifically, we intend to enable cross-dataset transfer learning such that our network pre-trained on a particular dataset can perform effective activity classification on other datasets (after a small amount of fine-tuning). To tackle this problem, we design our solution with the intention of learning as much information from the accelerometer signals as possible. As a result, we design two separate pipelines, one that learns from the data in the time-frequency domain, and the other in the time domain alone. In order to address the issues mentioned above with regard to cross-dataset transfer learning, we use self-supervised contrastive learning to train each of these streams. Next, each stream is fine-tuned for final classification, and eventually the two are fused to provide the final results. We evaluate the performance of the proposed solution on three datasets, namely MotionSense, HAPT, and HHAR, and demonstrate that our solution outperforms prior works in this field. We further evaluate the performance of the method in learning generalized features, by using the MobiAct dataset for pre-training and the remaining three datasets for the downstream classification task, and show that the proposed solution achieves better performance in comparison with other self-supervised methods in cross-dataset transfer learning.
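The two pretext objectives described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation. The function names `make_pretext_pair` and `nt_xent_loss`, the zero-fill masking, the window layout (rows are time steps, columns are x, y, z), and the choice of an NT-Xent style contrastive loss with temperature 0.5 are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def make_pretext_pair(window, mask_len):
    """Build one (input, target) pair for cross-dimensional prediction.

    window: array of shape (T, 3) with columns x, y, z (assumed layout).
    The last `mask_len` z samples become the regression target; x and y
    stay fully visible, and only the earlier z samples remain in the input.
    """
    window = np.asarray(window, dtype=float)
    if window.ndim != 2 or window.shape[1] != 3:
        raise ValueError("window must have shape (T, 3)")
    n_steps = window.shape[0]
    if not 0 < mask_len < n_steps:
        raise ValueError("mask_len must be between 1 and T - 1")
    start = n_steps - mask_len
    target = window[start:, 2].copy()
    inputs = window.copy()
    inputs[start:, 2] = 0.0  # zero-fill is an assumption, not from the thesis
    return inputs, target


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent style contrastive loss (assumed) for two batches of embeddings.

    Row i of z_a and row i of z_b are treated as a positive pair, for example
    embeddings of two views of the same accelerometer window.
    """
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm
    return float(-log_prob[np.arange(2 * n), positives].mean())
```

In a pipeline along these lines, `make_pretext_pair` would feed the convolutional network of the first solution, while a loss such as `nt_xent_loss` would be applied separately to the embeddings of the time-domain stream and the time-frequency stream of the second solution.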