A Distilled Low-Latency Neural Vocoder with Explicit Amplitude and Phase Prediction
Hui-Peng Du, Yang Ai, and Zhen-Hua Ling
Abstract
Most mainstream neural vocoders focus primarily on speech quality and generation speed, while overlooking latency, a critical factor in real-time applications.
Excessive latency leads to noticeable delays in user interaction, severely degrading the user experience and rendering such systems impractical for real-time use. Therefore, this paper proposes
DLL-APNet, a Distilled Low-Latency neural vocoder that first explicitly predicts the Amplitude and Phase spectra from the input mel spectrogram and then reconstructs the speech waveform via the inverse
short-time Fourier transform (iSTFT). The DLL-APNet vocoder leverages causal convolutions, which restrict the model to current and past context only, thereby minimizing
latency. To mitigate the speech quality degradation caused by these causal constraints, a knowledge distillation strategy is proposed, in which a pre-trained non-causal teacher vocoder guides
the intermediate feature generation of the causal student DLL-APNet vocoder. Experimental results demonstrate that the proposed DLL-APNet vocoder produces higher-quality
speech than other causal vocoders, while requiring fewer computational resources. Furthermore,
the proposed DLL-APNet vocoder achieves speech quality on par with mainstream non-causal neural vocoders,
validating its ability to deliver both high perceptual quality and low latency.
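To illustrate the two mechanisms named in the abstract, the sketch below combines a causal 1-D convolution (left-padded so each output frame depends only on current and past frames) with waveform reconstruction from predicted amplitude and phase spectra via iSTFT. This is a minimal illustration, not the DLL-APNet implementation: the layer widths, n_fft, hop_length, and the toy prediction heads are assumptions.

```python
# Minimal sketch (not the authors' code): causal convolution + iSTFT synthesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so output frame t depends
    solely on input frames <= t (no look-ahead, hence low latency)."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, frames)
        x = F.pad(x, (self.left_pad, 0))      # pad past context only
        return self.conv(x)


def istft_synthesis(log_amp, phase, n_fft=1024, hop_length=256):
    """Rebuild the waveform from predicted log-amplitude and phase spectra.
    log_amp, phase: (batch, n_fft // 2 + 1, frames)."""
    spec = torch.exp(log_amp) * torch.exp(1j * phase)   # complex spectrum
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                       window=torch.hann_window(n_fft))


if __name__ == "__main__":
    mel = torch.randn(1, 80, 100)                       # (batch, mel bins, frames)
    amp_head = CausalConv1d(80, 513, kernel_size=7)     # toy amplitude predictor
    phase_head = CausalConv1d(80, 513, kernel_size=7)   # toy phase predictor
    wav = istft_synthesis(amp_head(mel), phase_head(mel))
    print(wav.shape)            # roughly (frames - 1) * hop_length samples
```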
Fig. 1. Model structure and knowledge distillation strategy of the proposed DLL-APNet vocoder.
During training, a subset of the DLL-APNet parameters is trained through knowledge distillation from the teacher model APNet2.
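As a rough illustration of the distillation strategy sketched in Fig. 1, the snippet below computes a feature-matching loss between intermediate feature maps of a frozen non-causal teacher (e.g., APNet2) and the causal student. The L1 distance, the function names, and the loss weighting are assumptions made for illustration; the paper's actual distillation objective is not specified in this excerpt.

```python
# Minimal sketch of intermediate-feature knowledge distillation (assumed L1 form).
import torch
import torch.nn.functional as F


def distillation_loss(student_feats, teacher_feats):
    """Average L1 distance over matched intermediate feature maps.
    Both arguments are lists of tensors with identical shapes
    (batch, channels, frames)."""
    losses = [F.l1_loss(s, t.detach())        # teacher is frozen: no gradient to it
              for s, t in zip(student_feats, teacher_feats)]
    return torch.stack(losses).mean()


# Usage sketch: combine with the vocoder's ordinary reconstruction losses,
# where lambda_kd is a hypothetical weighting factor.
# total_loss = reconstruction_loss + lambda_kd * distillation_loss(student_feats, teacher_feats)
```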