Abstract:
Speech synthesis converts text to speech signals. The naturalness and intelligibility of synthesized speech affect the listener's understanding of the content conveyed by the speech signal. This dissertation proposed three aspects for improving the naturalness and intelligibility of speech synthesized from STRAIGHT parameters. The first aspect was the separation of the spectral-feature models from the fundamental-frequency models. The two types of models were trained independently to obtain Hidden Markov Model (HMM) parameters optimized for generating their respective STRAIGHT parameters. Algorithms were proposed to handle the time alignment of the parameters generated separately from the two models. In this work, we focused on generating STRAIGHT parameters from either HMMs or Deep Neural Networks (DNNs). The second aspect changed the inputs of the DNNs used to generate STRAIGHT parameters from the typical direct phonetic contexts to the HMMs resulting from context-clustering decision trees. The third aspect was the normalization of the DNN outputs using the means and variances of the HMMs that resulted from the decision trees. The objective evaluation measures were the Mel cepstral distortion of the Mel cepstral coefficients of the spectral filter (MGC_MCD), the Mel cepstral distortion of the coefficients of the band-aperiodicity filter (BAP_MCD), the root mean square error of the fundamental frequency (LF0_RMSE), and the count of voicing-condition mismatches between natural and synthesized speech (LF0_UVU). Nine participants were recruited for a subjective evaluation in which they rated the synthesized utterances for naturalness and intelligibility. The objective results showed that applying the second and third aspects to DNN-generated STRAIGHT parameters yielded better synthesized speech than either applying the first aspect to HMMs or using the baseline HMM and DNN methods.
The subjective results, in contrast, showed that applying the first aspect to HMMs outperformed the other methods.
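
The objective measures named above can be illustrated with a minimal sketch. It assumes the commonly used definition of Mel cepstral distortion in dB with the 0th (energy) coefficient excluded, frames that are already time-aligned, and unvoiced frames marked as None in the log-F0 tracks. The function names and these conventions are illustrative assumptions, not the dissertation's actual evaluation tools.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean Mel cepstral distortion (dB) over time-aligned frames.

    Each frame is a list of cepstral coefficients. The 0th coefficient
    (energy) is skipped, which is a common but assumed convention.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and equally long")
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += const * math.sqrt(squared)
    return total / len(ref_frames)


def lf0_rmse_and_voicing_mismatches(ref_lf0, syn_lf0):
    """Return (RMSE of log-F0, number of voicing mismatches).

    Unvoiced frames are represented by None (an assumed encoding).
    The RMSE uses only frames voiced in both tracks; a mismatch is a
    frame voiced in exactly one of the two tracks.
    """
    if len(ref_lf0) != len(syn_lf0):
        raise ValueError("tracks must have the same number of frames")
    squared_errors = []
    mismatches = 0
    for ref, syn in zip(ref_lf0, syn_lf0):
        if ref is not None and syn is not None:
            squared_errors.append((ref - syn) ** 2)
        elif (ref is None) != (syn is None):
            mismatches += 1
    if not squared_errors:
        return float("nan"), mismatches
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse, mismatches
```

Under these assumptions, the same distortion function would apply to both the spectral (MGC) and band-aperiodicity (BAP) coefficient streams, while the second function yields counterparts of LF0_RMSE and LF0_UVU in a single pass over the frames.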