A Distilled Low-Latency Neural Vocoder with Explicit Amplitude and Phase Prediction
Hui-Peng Du, Yang Ai, and Zhen-Hua Ling
Abstract
Most mainstream neural vocoders focus primarily on speech quality and generation speed, while overlooking latency, a critical factor in real-time applications.
Excessive latency leads to noticeable delays in user interaction, severely degrading the user experience and rendering such systems impractical for real-time use. Therefore, this paper proposes
DLL-APNet, a Distilled Low-Latency neural vocoder that first explicitly predicts the Amplitude and Phase spectra from the input mel spectrogram and then reconstructs the speech waveform via the inverse
short-time Fourier transform (iSTFT). The DLL-APNet vocoder leverages causal convolutions, which restrict the model to current and past context only, thereby minimizing
latency. To mitigate the speech quality degradation caused by these causal constraints, a knowledge distillation strategy is proposed, in which a pre-trained non-causal teacher vocoder guides
the intermediate feature generation of the causal student DLL-APNet vocoder. Experimental results demonstrate that the proposed DLL-APNet vocoder produces higher-quality
speech than other causal vocoders, while requiring fewer computational resources. Furthermore,
the proposed DLL-APNet vocoder achieves speech quality on par with mainstream non-causal neural vocoders,
validating its ability to deliver both high perceptual quality and low latency.
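To illustrate the two mechanisms named in the abstract, the sketch below combines a causal 1-D convolution (left-padded so each output frame depends only on current and past frames) with waveform reconstruction from predicted amplitude and phase spectra via iSTFT. This is a minimal illustration, not the DLL-APNet implementation: the layer widths, n_fft, hop_length, and the toy prediction heads are assumptions.

```python
# Minimal sketch (not the authors' code): causal convolution + iSTFT synthesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so output frame t depends
    solely on input frames <= t (no look-ahead, hence low latency)."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, frames)
        x = F.pad(x, (self.left_pad, 0))      # pad past context only
        return self.conv(x)


def istft_synthesis(log_amp, phase, n_fft=1024, hop_length=256):
    """Rebuild the waveform from predicted log-amplitude and phase spectra.
    log_amp, phase: (batch, n_fft // 2 + 1, frames)."""
    spec = torch.exp(log_amp) * torch.exp(1j * phase)   # complex spectrum
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                       window=torch.hann_window(n_fft))


if __name__ == "__main__":
    mel = torch.randn(1, 80, 100)                       # (batch, mel bins, frames)
    amp_head = CausalConv1d(80, 513, kernel_size=7)     # toy amplitude predictor
    phase_head = CausalConv1d(80, 513, kernel_size=7)   # toy phase predictor
    wav = istft_synthesis(amp_head(mel), phase_head(mel))
    print(wav.shape)            # roughly (frames - 1) * hop_length samples
```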
Fig. 1. Model structure and knowledge distillation strategy of the proposed DLL-APNet vocoder.
During training, a subset of the DLL-APNet parameters is trained through knowledge distillation from the teacher model APNet2.
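As a rough illustration of the distillation strategy sketched in Fig. 1, the snippet below computes a feature-matching loss between intermediate feature maps of a frozen non-causal teacher (e.g., APNet2) and the causal student. The L1 distance, the function names, and the loss weighting are assumptions made for illustration; the paper's actual distillation objective is not specified in this excerpt.

```python
# Minimal sketch of intermediate-feature knowledge distillation (assumed L1 form).
import torch
import torch.nn.functional as F


def distillation_loss(student_feats, teacher_feats):
    """Average L1 distance over matched intermediate feature maps.
    Both arguments are lists of tensors with identical shapes
    (batch, channels, frames)."""
    losses = [F.l1_loss(s, t.detach())        # teacher is frozen: no gradient to it
              for s, t in zip(student_feats, teacher_feats)]
    return torch.stack(losses).mean()


# Usage sketch: combine with the vocoder's ordinary reconstruction losses,
# where lambda_kd is a hypothetical weighting factor.
# total_loss = reconstruction_loss + lambda_kd * distillation_loss(student_feats, teacher_feats)
```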