A Distilled Low-Latency Neural Vocoder with Explicit Amplitude and Phase Prediction

Hui-Peng Du, Yang Ai, and Zhen-Hua Ling

Abstract

The majority of mainstream neural vocoders primarily focus on speech quality and generation speed while overlooking latency, a critical factor in real-time applications. Excessive latency leads to noticeable delays in user interaction, severely degrading the user experience and rendering such systems impractical for real-time use. Therefore, this paper proposes DLL-APNet, a Distilled Low-Latency neural vocoder that first explicitly predicts the amplitude and phase spectra from the input mel spectrogram and then reconstructs the speech waveform via the inverse short-time Fourier transform (iSTFT). The DLL-APNet vocoder leverages causal convolutions to restrict the information used at each step to current and historical contexts, effectively minimizing latency. To mitigate the speech quality degradation caused by causal constraints, a knowledge distillation strategy is proposed, in which a pre-trained non-causal teacher vocoder guides the intermediate feature generation of the causal student DLL-APNet vocoder. Experimental results demonstrate that the proposed DLL-APNet vocoder produces higher-quality speech than other causal vocoders while requiring fewer computational resources. Furthermore, the proposed DLL-APNet vocoder achieves speech quality on par with mainstream non-causal neural vocoders, validating its ability to deliver both high perceptual quality and low latency.
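To make the two core mechanisms concrete, below is a minimal PyTorch sketch of (i) a causal 1-D convolution that pads only on the past side, so each output frame depends on current and historical frames only, and (ii) waveform reconstruction from predicted log-amplitude and phase spectra via iSTFT. The layer sizes and STFT parameters (n_fft=1024, hop=256) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so frame t never sees frame t+1."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, frames)
        x = F.pad(x, (self.left_pad, 0))      # pad the past only, never the future
        return self.conv(x)                   # output length == input length

def reconstruct_waveform(log_amp, phase, n_fft=1024, hop=256):
    """Fuse predicted log-amplitude and phase into a complex STFT, then invert."""
    spec = torch.polar(torch.exp(log_amp), phase)   # (batch, n_fft//2 + 1, frames)
    window = torch.hann_window(n_fft, device=spec.device)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       win_length=n_fft, window=window)

# Shape check with dummy spectra: 100 frames -> (100 - 1) * 256 output samples.
wav = reconstruct_waveform(torch.randn(1, 513, 100), torch.randn(1, 513, 100))
```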


Model Architecture


Fig. 1. Model structure and knowledge distillation strategy of the proposed DLL-APNet vocoder. During training, a portion of DLL-APNet's parameters is trained through knowledge distillation from the teacher model APNet2.
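The caption describes feature-level guidance from the pre-trained non-causal teacher (APNet2) to the causal student. A hedged sketch of such a feature-matching distillation term, assuming an L1 distance at matched intermediate layers (the actual distance and layer choice are assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats):
    """L1 feature matching over matched intermediate layers.
    Each list holds one (batch, channels, frames) tensor per distilled position;
    teacher features are detached so the pre-trained teacher stays frozen."""
    return sum(F.l1_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))

# Dummy check with two distilled positions:
s = [torch.randn(1, 512, 100, requires_grad=True) for _ in range(2)]
t = [torch.randn(1, 512, 100) for _ in range(2)]
loss = distillation_loss(s, t)   # scalar; gradients reach the student only
```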


Main Experimental Results

Sample 1

Ground Truth | DLL-APNet (proposed) | BigVGAN | HiFi-GAN | iSTFTNet | APNet2 | Vocos | causal HiFi-GAN | causal iSTFTNet | causal APNet2 | causal Vocos

Sample 2

Ground Truth | DLL-APNet (proposed) | BigVGAN | HiFi-GAN | iSTFTNet | APNet2 | Vocos | causal HiFi-GAN | causal iSTFTNet | causal APNet2 | causal Vocos

Sample 3

Ground Truth | DLL-APNet (proposed) | BigVGAN | HiFi-GAN | iSTFTNet | APNet2 | Vocos | causal HiFi-GAN | causal iSTFTNet | causal APNet2 | causal Vocos

Sample 4

Ground Truth | DLL-APNet (proposed) | BigVGAN | HiFi-GAN | iSTFTNet | APNet2 | Vocos | causal HiFi-GAN | causal iSTFTNet | causal APNet2 | causal Vocos

Sample 5

Ground Truth | DLL-APNet (proposed) | BigVGAN | HiFi-GAN | iSTFTNet | APNet2 | Vocos | causal HiFi-GAN | causal iSTFTNet | causal APNet2 | causal Vocos

Sample 6

Ground Truth | DLL-APNet (proposed) | BigVGAN | HiFi-GAN | iSTFTNet | APNet2 | Vocos | causal HiFi-GAN | causal iSTFTNet | causal APNet2 | causal Vocos

Sample 7

Ground Truth | DLL-APNet (proposed) | BigVGAN | HiFi-GAN | iSTFTNet | APNet2 | Vocos | causal HiFi-GAN | causal iSTFTNet | causal APNet2 | causal Vocos

Discussion on Distillation Weight Selection
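Here λKD denotes the weight on the distillation term in the student's total training loss; λKD = 0 disables distillation entirely. As a hedged one-line sketch of how such a weight typically enters the objective (the remaining generation-loss terms are not spelled out here):

```python
def student_objective(gen_loss, kd_loss, lambda_kd):
    """Total student loss: generation losses plus the weighted distillation term.
    lambda_kd = 0 recovers training without the teacher."""
    return gen_loss + lambda_kd * kd_loss
```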

Sample 1

λKD = 0 | λKD = 0.1 | λKD = 0.5 | λKD = 1 | λKD = 2 | λKD = 5 | λKD = 10 | λKD = 20

Sample 2

λKD = 0 | λKD = 0.1 | λKD = 0.5 | λKD = 1 | λKD = 2 | λKD = 5 | λKD = 10 | λKD = 20

Sample 3

λKD = 0 | λKD = 0.1 | λKD = 0.5 | λKD = 1 | λKD = 2 | λKD = 5 | λKD = 10 | λKD = 20

Sample 4

λKD = 0 | λKD = 0.1 | λKD = 0.5 | λKD = 1 | λKD = 2 | λKD = 5 | λKD = 10 | λKD = 20

Sample 5

λKD = 0 | λKD = 0.1 | λKD = 0.5 | λKD = 1 | λKD = 2 | λKD = 5 | λKD = 10 | λKD = 20

Sample 6

λKD = 0 | λKD = 0.1 | λKD = 0.5 | λKD = 1 | λKD = 2 | λKD = 5 | λKD = 10 | λKD = 20

Sample 7

λKD = 0 | λKD = 0.1 | λKD = 0.5 | λKD = 1 | λKD = 2 | λKD = 5 | λKD = 10 | λKD = 20

Discussion on Distillation Position Selection

Sample 1

Distilled positions = 0 | Distilled positions = 2 | Distilled positions = 4 | Distilled positions = 6 | Distilled positions = 8

Sample 2

Distilled positions = 0 | Distilled positions = 2 | Distilled positions = 4 | Distilled positions = 6 | Distilled positions = 8

Sample 3

Distilled positions = 0 | Distilled positions = 2 | Distilled positions = 4 | Distilled positions = 6 | Distilled positions = 8

Sample 4

Distilled positions = 0 | Distilled positions = 2 | Distilled positions = 4 | Distilled positions = 6 | Distilled positions = 8

Sample 5

Distilled positions = 0 | Distilled positions = 2 | Distilled positions = 4 | Distilled positions = 6 | Distilled positions = 8

Sample 6

Distilled positions = 0 | Distilled positions = 2 | Distilled positions = 4 | Distilled positions = 6 | Distilled positions = 8

Sample 7

Distilled positions = 0 | Distilled positions = 2 | Distilled positions = 4 | Distilled positions = 6 | Distilled positions = 8