Deep Learning for Touchless Human-Computer Interaction using 3D Hand Pose Estimation

Khaleghi, Leyla
Hand gesture recognition, Deep learning, Hand pose estimation
This thesis focuses on hand pose estimation (HPE) as a crucial component of human-computer interaction, for example in gesture-based control of physical devices or virtual/augmented reality systems. We begin by introducing an inexpensive and robust proof-of-concept mechanism for a practical gesture-based control system, implemented and tested on a robotic wheel loader. To explore the feasibility and practicality of such a system, we rely on off-the-shelf equipment and models. Using an RGB camera and a laptop, the system processes hand gestures in real time to control a loader in construction zones. After designing four distinct hand gestures for controlling the loader, we collected 26,000 images and trained a neural network to recognize the gestures. With the proposed hand gesture recognition system, we successfully controlled a loader to excavate a rock pile.

Next, we present several open problems in the area of HPE. Despite significant progress in recent years, the accuracy and robustness of HPE methods still suffer from self-occlusion, as well as sensitivity to variations in camera viewpoint and environment. We therefore focus on multi-view and video-based 3D HPE to develop a more robust system. Given the scarcity of multi-view, video-based datasets, we created a large synthetic multi-view video dataset of 3D hand poses, captured simultaneously from six different angles with complex backgrounds and varying levels of dynamic lighting. We then implemented a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners for learning jointly from multi-view information over time, and graph networks with U-Net architectures for estimating the final 3D poses. Our studies demonstrate the added value of each component of the method, as well as the benefits of including both temporal and multi-view contextual information from the dataset.
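The data flow of such a multi-view pipeline can be sketched at a high level as follows. This is a minimal illustrative sketch, not the thesis's architecture: the dimensions, the mean-pooling view fusion, the Elman-style recurrence standing in for the recurrent learners, and the linear head standing in for the graph U-Net stage are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the thesis's exact values)
T, V, D = 8, 6, 128      # frames, camera views, embedding size
J = 21                   # hand joints, each regressed as (x, y, z)

def encode_view(frame_features):
    """Stand-in for a CNN image encoder: maps each view's frame
    features to a D-dimensional visual embedding."""
    W = rng.standard_normal((frame_features.shape[-1], D)) * 0.01
    return np.tanh(frame_features @ W)

# Fake per-frame, per-view image features (e.g., flattened hand crops)
frames = rng.standard_normal((T, V, 256))
embeddings = encode_view(frames)          # shape (T, V, D)

# Fuse the six views (mean pooling here for simplicity), then run a
# minimal recurrent update over time to accumulate temporal context.
fused = embeddings.mean(axis=1)           # shape (T, D)
Wh = rng.standard_normal((D, D)) * 0.01
h = np.zeros(D)
for t in range(T):
    h = np.tanh(fused[t] + h @ Wh)        # simple Elman-style recurrence

# Final regression head (standing in for the graph U-Net stage):
# map the accumulated state to 3D coordinates for each joint.
Wout = rng.standard_normal((D, J * 3)) * 0.01
pose_3d = (h @ Wout).reshape(J, 3)
print(pose_3d.shape)                      # (21, 3)
```

The key point the sketch illustrates is the factoring of the problem: per-view encoding, cross-view fusion, temporal recurrence, and a final joint-wise 3D regression are separable stages, which is what allows the contribution of each component to be measured in isolation.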
Finally, we focus on aggregating contextual information across time and camera views, using self-attention transformers to learn sequential contexts for 3D HPE. In our experiments, this method performed well across both temporal and angular sequence varieties, and it achieved state-of-the-art results on our proposed dataset as well as on a publicly available sequential dataset.
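The sequential-context idea can be illustrated with a single scaled dot-product self-attention layer applied to a sequence of per-frame (or per-view) hand embeddings. This is a generic attention sketch under assumed dimensions, not the thesis's exact transformer; its purpose is to show why one mechanism handles both temporal and angular sequences: attention is agnostic to whether consecutive positions differ in time or in viewpoint.

```python
import numpy as np

rng = np.random.default_rng(1)

S, D = 12, 64   # sequence length (frames and/or views), embedding size

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of embeddings.
    Every position attends to every other, so context flows across
    time steps and camera angles alike once both are flattened into
    one sequence."""
    Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.05 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)                      # (S, S) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # context-mixed embeddings

x = rng.standard_normal((S, D))    # hand embeddings from an image encoder
out = self_attention(x)
print(out.shape)                   # (12, 64)
```

Because each output embedding is a learned weighted mixture over the whole sequence, self-occluded frames or views can borrow evidence from unoccluded ones, which is the intuition behind the robustness gains reported above.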