Mon, Khaing Zar. Spoof detection using voice contribution on LFCC features and ResNet-34. Master's Degree (Artificial Intelligence and Internet of Things). Thammasat University. Thammasat University Library. Thammasat University, 2024.
Spoof detection using voice contribution on LFCC features and ResNet-34
Abstract:
Biometric authentication, and speaker verification in particular, has advanced considerably in recent years. Speaker verification systems nevertheless remain vulnerable to spoofing attacks, and detecting them requires countermeasures that work across different attack types. This study addresses three of them: replay, speech synthesis, and voice conversion. Our spoof detection approach uses linear frequency cepstral coefficients (LFCC) as front-end features and ResNet-34 as the classifier that discriminates between genuine and spoofed speech samples. We evaluated the combined LFCC and ResNet-34 method on the ASVspoof 2019 dataset in two scenarios: Physical Access (PA), which covers replay attacks, and Logical Access (LA), which covers speech synthesis and voice conversion attacks. We compared feature extraction from the entire utterance against an alternative that extracts features only from a specific voice segment within the utterance. We also benchmarked the proposed method against the established baselines, linear frequency cepstral coefficients with a Gaussian mixture model (LFCC-GMM) and constant Q cepstral coefficients with a Gaussian mixture model (CQCC-GMM), as well as against contemporary state-of-the-art approaches. The proposed method achieves an equal error rate (EER) of 1.85% on the development set and 2.74% on the evaluation set for replay attacks (PA). For voice conversion and speech synthesis attacks (LA), it attains an EER of 0.01% on the development set and 5.16% on the evaluation set.
These findings indicate that the method is effective at identifying spoofing attacks in both the PA and LA scenarios. We further evaluate the robustness and generalizability of the model through cross-dataset validation and an analysis of gender bias, which gives additional insight into how reliably the proposed approach performs in real-world settings.
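The abstract reports its results as equal error rates (EER), the operating point at which the false acceptance rate (spoofed speech accepted) equals the false rejection rate (genuine speech rejected). The following is a minimal sketch of how an EER can be estimated from detection scores. It is not the thesis's evaluation code: the function name `compute_eer` is hypothetical, and the convention that a higher score means "more likely genuine" is an assumption.

```python
import numpy as np


def compute_eer(genuine_scores, spoof_scores):
    """Estimate the equal error rate from two sets of detection scores.

    Assumption: a higher score means the sample is more likely genuine.
    Returns (eer, threshold), where eer is a fraction in [0, 1].
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score (np.unique returns them sorted).
    thresholds = np.unique(np.concatenate([genuine, spoof]))

    # A sample is accepted as genuine when its score is >= threshold.
    # False rejection rate: genuine samples scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: spoofed samples scoring at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # With finite data the two curves rarely cross exactly, so take the
    # threshold where they are closest and average the two rates there.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])
```

Calling `compute_eer(genuine, spoof)` with the scores a classifier assigns to the genuine and spoofed trials of one partition returns the EER as a fraction, so multiplying by 100 gives a percentage comparable to the figures quoted above.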