Deep Representation Learning for Speaker Recognition

Hajavi, Amirhossein
Speaker Recognition, Speaker Representation Learning, Deep Learning, Deep Neural Networks
Automated speaker recognition (SR) solutions are often used in smart devices capable of recording audio, either for authentication or to personalize the services these devices provide to each user. Over the past few years, deep learning solutions for SR have attracted a large amount of attention. The superior performance of deep learning models over classical methods in challenging conditions, especially in noisy environments and in-the-wild scenarios, has made them the predominant choice for developing SR models. However, some challenges remain less explored in the literature. We select several of these challenges as the objectives of this thesis, namely, using short utterances for SR, focusing on salient parts of the input speech, enhancing the back-end component of SR solutions, using video during the training of SR models, and analyzing fairness in common SR solutions. For the first challenge, we proposed a new deep neural network, UtterIdNet, which is capable of achieving strong SR results with short speech segments, especially those of sub-second duration (250 ms and 500 ms). For the second challenge, we proposed FEFA, which is capable of focusing on information items as small as frequency bins while being simple and lightweight. We showed through experiments that adding FEFA to different CNN architectures consistently improves performance by substantial margins. For the third challenge, we proposed a Siamese capsule network to replace the back-end of SR systems. Our experiments showed that using this model considerably enhances the performance of SR systems. For the fourth challenge, we adopted the paradigm of learning using privileged information, together with teacher-student knowledge distillation, to train student models that perform SR on audio input only while their training is boosted by teacher models that learn from video inputs. Our evaluations showed that this method considerably improves the performance of the student models. Finally, for the fifth challenge, we analyzed the impact of using different networks and loss functions on the fairness of SR systems with respect to 'gender' and 'nationality' as protected attributes. Our analysis provides new insights into the fairness of recent, commonly used SR systems.