This paper presents a novel neural denoising vocoder that converts a noisy input mel-spectrogram into clean speech. The vocoder consists of a spectrum predictor and an enhancement module, and thus combines vocoding and denoising in a single model. The spectrum predictor estimates the noisy amplitude and phase spectra from the noisy input mel-spectrogram, and the enhancement module refines these noisy spectra into clean ones. The clean speech is then synthesized from the clean spectra using the inverse short-time Fourier transform (iSTFT). Experimental results demonstrate that, although the input mel-spectrogram contains no phase information and only partial amplitude information, the proposed neural denoising vocoder still outperforms baseline vocoders and is comparable to several speech enhancement (SE) methods. Building an end-to-end denoising vocoder that does not rely on intermediate noisy speech as a bridge is left for future work.
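The final synthesis step summarized above, combining an amplitude spectrum and a phase spectrum and inverting them with the iSTFT, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `stft`, `istft`, and `synthesize`, the Hann window, and the FFT and hop sizes (512 and 128) are assumptions chosen for illustration, and the two neural modules are replaced by spectra taken directly from a reference signal.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hypothetical helper: framed, Hann-windowed real FFT, shape (frames, bins)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=512, hop=128):
    """Hypothetical helper: weighted overlap-add inverse of `stft` above."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(frames) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window ** 2
    # Divide by the summed squared window; the floor avoids division by
    # zero at the signal edges, where the window weights vanish.
    return out / np.maximum(norm, 1e-8)


def synthesize(amplitude, phase, n_fft=512, hop=128):
    """Combine amplitude and phase spectra into a waveform via the iSTFT."""
    complex_spec = amplitude * np.exp(1j * phase)
    return istft(complex_spec, n_fft, hop)


# Usage: stand-in "clean" spectra are taken from a reference signal here;
# in the vocoder they would come from the enhancement module instead.
rng = np.random.default_rng(0)
reference = rng.standard_normal(4096)
reference_spec = stft(reference)
waveform = synthesize(np.abs(reference_spec), np.angle(reference_spec))
```

With exact amplitude and phase, the interior of `waveform` matches the reference signal up to floating-point error; only the first and last few samples, where the window weights are near zero, are not recovered. This illustrates why the quality of the synthesized speech depends entirely on how well the two spectra are predicted.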
Proposed | Proposed rep. APNet | Proposed w/ matching STFT condition | HiFi-GAN | Vocos | Clean | Noisy |