Observation-Point-Based Multi-modal Fusion Systems for Skeleton Action Recognition
Current methods for skeleton-based action recognition compute features directly from the given skeleton joint information. We show that introducing new observation points into skeleton motion sequences, and using them to create fused representations from multiple modalities such as joints and bones, can enhance the discriminative power of the original modalities. Moreover, such representations can be used to create new streams in multi-stream networks that fuse constructively with streams trained on the original modalities, effectively exhibiting a dual behaviour and collectively boosting the performance of the network even further. In this work, we present several configurations of multi-modal fusion systems with observation points that can easily be incorporated into existing networks and improve state-of-the-art results on two popular action recognition datasets, J-HMDB and Kinetics-Skeleton.
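As a rough illustration of the idea only: the abstract does not specify how observation points are defined or how the streams are fused, so the sketch below assumes an observation point is a fixed 3D reference location, derives joint, bone, and observation-point-relative modalities from a skeleton sequence, and fuses per-modality class scores by late averaging. The kinematic tree, the observation point, and the linear-softmax stand-in classifiers are all hypothetical placeholders, not the paper's actual architecture.

```python
import numpy as np

def bone_modality(joints, parents):
    """Bone vectors: each joint's offset from its parent in the kinematic tree."""
    return joints - joints[:, parents, :]

def observation_point_modality(joints, obs_point):
    """Joint coordinates re-expressed relative to a fixed observation point."""
    return joints - obs_point  # broadcasts over frames and joints

def stream_scores(features, weights):
    """Stand-in per-stream classifier: flatten, linear layer, softmax."""
    logits = features.reshape(-1) @ weights
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

T, J, C = 16, 5, 10                          # frames, joints, action classes
rng = np.random.default_rng(0)
joints = rng.normal(size=(T, J, 3))          # toy skeleton motion sequence
parents = np.array([0, 0, 1, 2, 3])          # toy kinematic tree (joint 0 = root)
obs_point = np.array([0.0, 1.0, 2.0])        # hypothetical observation point

modalities = [
    joints,                                          # original joint modality
    bone_modality(joints, parents),                  # original bone modality
    observation_point_modality(joints, obs_point),   # new observation-point stream
]

# Late fusion: average the class-score distributions of the three streams.
scores = [stream_scores(m, rng.normal(size=(T * J * 3, C))) for m in modalities]
fused = np.mean(scores, axis=0)
print("predicted action class:", fused.argmax())
```

In a real system each stream would be a trained network (e.g., a graph-convolutional backbone per modality) rather than a random linear layer; the point of the sketch is only the data flow from one skeleton sequence to multiple fused modality streams.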