Attention and Depth Hallucination for RGB-D Face Recognition with Deep Learning
Face recognition approaches based purely on RGB images rely solely on intensity information and are therefore more sensitive to facial variations, notably pose, occlusions, and environmental changes such as illumination and background. These approaches also tend to process the whole image uniformly, weighing distinctive and non-distinctive regions of the image equally. To extract more representative facial features, we first propose two fusion techniques based on RGB and depth modalities using attention mechanisms. The first fusion technique uses an LSTM network to selectively focus on feature maps, followed by a convolution layer that generates spatial attention weights. This method achieves competitive results on the CurtinFaces and IIIT-D RGB-D datasets, with classification accuracies of over 98.2% and 99.3% respectively. Our second proposed fusion method is a novel attention mechanism that directs the deep network "where to look" for visual features in the RGB image by generating an attention map from depth features extracted with a CNN. This solution achieves notable improvements over the current state-of-the-art on four public datasets, namely Lock3DFace, CurtinFaces, IIIT-D RGB-D, and KaspAROV, with average accuracies (and gains) of 87.3% (+5.0%), 99.1% (+0.9%), 99.7% (+0.6%), and 95.3% (+0.5%) respectively. Although depth data can provide useful information for face recognition, acquiring depth data in the wild remains a challenge. To address this problem, we present the Teacher-Student Generative Adversarial Network (TS-GAN), which generates depth images from a single RGB image in order to boost the accuracy of face recognition systems where depth images are not available.
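The depth-guided attention idea described above can be sketched as follows. This is a minimal illustrative example, not the authors' exact architecture: the layer sizes, feature extractors, and channel counts are assumptions. Depth features are collapsed into a single-channel spatial attention map that reweights the RGB feature maps, telling the network "where to look".

```python
import torch
import torch.nn as nn

class DepthGuidedAttention(nn.Module):
    """Illustrative sketch: an attention map derived from depth features
    spatially reweights RGB features (hypothetical layer sizes)."""
    def __init__(self, channels=64):
        super().__init__()
        # Hypothetical lightweight feature extractors for each modality.
        self.rgb_conv = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.depth_conv = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.ReLU())
        # A 1x1 convolution collapses depth features into one attention map in (0, 1).
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, rgb, depth):
        f_rgb = self.rgb_conv(rgb)
        a = self.attn(self.depth_conv(depth))   # (B, 1, H, W) attention map
        return f_rgb * a                        # spatially reweighted RGB features

model = DepthGuidedAttention()
rgb = torch.randn(2, 3, 32, 32)
depth = torch.randn(2, 1, 32, 32)
out = model(rgb, depth)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The reweighted features would then feed a standard recognition head; only the attention mechanism itself is sketched here.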
The teacher learns a latent mapping between input RGB and paired depth images in a supervised fashion, which the student then generalizes to new RGB data with no available paired depth information. The fully trained shared generator can then be used at runtime to hallucinate depth from RGB for downstream applications such as face recognition. We demonstrate that our hallucinated depth, used alongside the input RGB images, boosts performance across various architectures when compared to the single RGB modality, by average values of +1.2%, +2.6%, and +2.6% on the IIIT-D, EURECOM, and LFW datasets respectively.
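The two-phase training scheme can be sketched as below. This is a loose illustration of the teacher-student idea, not the authors' exact TS-GAN formulation: the network shapes, loss weights, and optimizer settings are all assumptions. A shared RGB-to-depth generator is first trained on paired data with a supervised pixel loss plus an adversarial term (teacher phase), then refined on unpaired RGB using only the adversarial signal (student phase).

```python
import torch
import torch.nn as nn

# Hypothetical shared generator: hallucinates a 1-channel depth map from RGB.
generator = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
)
# Hypothetical discriminator: scores the realism of depth maps.
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(1),
)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

# Teacher phase: paired RGB/depth, supervised L1 loss plus adversarial term.
rgb_paired = torch.rand(4, 3, 32, 32)
depth_gt = torch.rand(4, 1, 32, 32)
fake = generator(rgb_paired)
loss_teacher = nn.functional.l1_loss(fake, depth_gt) \
    + bce(discriminator(fake), torch.ones(4, 1))
opt_g.zero_grad(); loss_teacher.backward(); opt_g.step()

# Student phase: unpaired RGB only; the generator is refined adversarially.
rgb_unpaired = torch.rand(4, 3, 32, 32)
fake = generator(rgb_unpaired)
loss_student = bce(discriminator(fake), torch.ones(4, 1))
opt_g.zero_grad(); loss_student.backward(); opt_g.step()
```

In a real pipeline the discriminator would be trained in alternation with the generator; only the generator updates are shown to keep the sketch short.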
URI for this record: http://hdl.handle.net/1974/28874