APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra

Hui-peng Du, Ye-Xin Lu, Yang Ai, Zhen-hua Ling

National Engineering Research Center of Speech and Language Information Processing,
University of Science and Technology of China, Hefei, P.R. China

Abstract

In our previous work, we proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra in parallel from the input acoustic features with a 5 ms frame shift, and then reconstructs the 16 kHz speech waveform using the inverse short-time Fourier transform (ISTFT). APNet generates synthesized speech of quality comparable to that of the HiFi-GAN vocoder at a considerably higher inference speed. However, its performance is constrained by the waveform sampling rate and spectral frame shift, which limits its practicality for high-quality speech synthesis. This paper therefore proposes an improved version of APNet, named APNet2. APNet2 adopts ConvNeXt v2 as the backbone network for amplitude and phase prediction, with the aim of enhancing modeling capability. Additionally, we introduce a multi-resolution discriminator (MRD) into the GAN-based losses and optimize the form of certain losses. Under a common configuration with a waveform sampling rate of 22.05 kHz and a spectral frame shift of 256 points (i.e., approximately 11.6 ms), APNet2 outperforms the original APNet and Vocos in terms of synthesized speech quality. Its synthesized speech quality is also comparable to that of HiFi-GAN and iSTFTNet, while its inference speed is significantly faster.
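To make the reconstruction step concrete, the following is a minimal sketch of how a waveform can be rebuilt from amplitude and phase spectra by ISTFT with overlap-add. It is an illustration only, not the APNet2 implementation: in the vocoder the amplitude and phase come from neural networks, whereas here they are taken from an analysis STFT of a toy sine wave. The function names (`frame_shift_ms`, `stft`, `istft`), the periodic Hann window, and the tiny `n_fft=16`, `hop=4` settings are all assumptions chosen so the example runs with the Python standard library alone.

```python
import cmath
import math


def frame_shift_ms(hop_points, sample_rate_hz):
    """Convert a frame shift given in sample points to milliseconds."""
    return 1000.0 * hop_points / sample_rate_hz


def hann(n_fft):
    """Periodic Hann window (an assumed choice for this sketch)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    """Naive STFT; returns one (amplitude, phase) pair of lists per frame."""
    win = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * win[n] for n in range(n_fft)]
        spec = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft)
        ]
        frames.append(([abs(c) for c in spec], [cmath.phase(c) for c in spec]))
    return frames


def istft(frames, n_fft, hop):
    """Rebuild a waveform from per-frame amplitude and phase spectra."""
    win = hann(n_fft)
    length = n_fft + hop * (len(frames) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for i, (amp, phase) in enumerate(frames):
        # Combine amplitude and phase into a complex spectrum: A * exp(j * P).
        spec = [cmath.rect(a, p) for a, p in zip(amp, phase)]
        for n in range(n_fft):
            # Inverse DFT of this frame; the real part is the time sample.
            x = sum(spec[k] * cmath.exp(2j * math.pi * k * n / n_fft)
                    for k in range(n_fft)).real / n_fft
            # Windowed overlap-add, tracking the squared-window sum.
            out[i * hop + n] += x * win[n]
            norm[i * hop + n] += win[n] ** 2
    # Normalize by the squared-window sum; samples with no window
    # support (here only the very first one) are set to zero.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# 256 points at 22.05 kHz is approximately 11.6 ms, as quoted in the abstract.
print(round(frame_shift_ms(256, 22050), 1))

# Toy round trip: analyze a short sine wave, then resynthesize it.
toy = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
rebuilt = istft(stft(toy, n_fft=16, hop=4), n_fft=16, hop=4)
```

Because every frame is inverted independently and then overlap-added, the whole reconstruction is a fixed, non-iterative operation once the spectra are available, which is consistent with the inference-speed argument made in the abstract.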

Contents

Model Architecture


Fig. 1 Overview of our proposed model.


Analysis-synthesis tasks on different models

Sample 1

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 2

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 3

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 4

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 5

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 6

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 7

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Text-to-speech tasks on different models

Sample 1

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 2

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 3

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 4

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 5

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 6

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Sample 7

Ground Truth APNet2 (proposed) APNet Vocos HiFi-GAN iSTFTNet


Analysis experiments on different models

Sample 1

Ground Truth APNet2 w/o ConvNeXt v2 APNet2 w/o MRD APNet2 w/o HingeGAN


APNet2 with 100-dim-mel APNet2 Vocos with 100-dim-mel Vocos


Sample 2

Ground Truth APNet2 w/o ConvNeXt v2 APNet2 w/o MRD APNet2 w/o HingeGAN


APNet2 with 100-dim-mel APNet2 Vocos with 100-dim-mel Vocos


Sample 3

Ground Truth APNet2 w/o ConvNeXt v2 APNet2 w/o MRD APNet2 w/o HingeGAN


APNet2 with 100-dim-mel APNet2 Vocos with 100-dim-mel Vocos


Sample 4

Ground Truth APNet2 w/o ConvNeXt v2 APNet2 w/o MRD APNet2 w/o HingeGAN


APNet2 with 100-dim-mel APNet2 Vocos with 100-dim-mel Vocos


Sample 5

Ground Truth APNet2 w/o ConvNeXt v2 APNet2 w/o MRD APNet2 w/o HingeGAN


APNet2 with 100-dim-mel APNet2 Vocos with 100-dim-mel Vocos


Sample 6

Ground Truth APNet2 w/o ConvNeXt v2 APNet2 w/o MRD APNet2 w/o HingeGAN


APNet2 with 100-dim-mel APNet2 Vocos with 100-dim-mel Vocos


Sample 7

Ground Truth APNet2 w/o ConvNeXt v2 APNet2 w/o MRD APNet2 w/o HingeGAN


APNet2 with 100-dim-mel APNet2 Vocos with 100-dim-mel Vocos