BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation

Anonymous submission to Interspeech 2024

Abstract

This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable of both feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, the BiVocoder takes amplitude and phase spectra derived from STFT as inputs and transforms them into long-frame-shift, low-dimensional features through convolutional neural networks. The extracted features are shown to be suitable for direct prediction by acoustic models, supporting their application in text-to-speech (TTS) tasks. For waveform generation, the BiVocoder restores amplitude and phase spectra from the features with a symmetric network, followed by an inverse STFT to reconstruct the speech waveform. Experimental results show that the proposed BiVocoder outperforms several baseline vocoders on both analysis-synthesis and TTS tasks when synthesized speech quality and inference speed are considered together.
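The abstract rests on the fact that amplitude and phase spectra together are enough to reconstruct a waveform. The following minimal sketch illustrates only that round trip and is not the authors' implementation: it assumes a simplified STFT with non-overlapping frames and a rectangular window, uses a naive DFT from the Python standard library, and contains no neural network. All function names (`dft`, `idft`, `analyze`, `synthesize`) are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spec):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spec)
    return [
        (sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def analyze(signal, frame_len):
    """Split into non-overlapping frames (assumption: rectangular window,
    hop equal to frame length) and return amplitude and phase per frame."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    spectra = [dft(signal[i:i + frame_len]) for i in starts]
    amplitude = [[abs(c) for c in s] for s in spectra]
    phase = [[cmath.phase(c) for c in s] for s in spectra]
    return amplitude, phase


def synthesize(amplitude, phase):
    """Rebuild complex spectra from amplitude and phase, then invert each frame."""
    out = []
    for amp, ph in zip(amplitude, phase):
        spec = [a * cmath.exp(1j * p) for a, p in zip(amp, ph)]
        out.extend(idft(spec))
    return out


# Toy signal: two sinusoids, 64 samples.
signal = [
    math.sin(2 * math.pi * 3 * t / 32) + 0.5 * math.cos(2 * math.pi * 7 * t / 32)
    for t in range(64)
]
amplitude, phase = analyze(signal, frame_len=16)
restored = synthesize(amplitude, phase)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
print(f"frames: {len(amplitude)}, max reconstruction error: {max_error:.2e}")
```

In this toy setting the reconstruction is exact up to floating-point error. A practical STFT would use overlapping windowed frames with overlap-add in the inverse.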

Model Architecture
Analysis-synthesis tasks on different models
Text-to-speech tasks on different models
Cross-dataset experiments

Model Architecture

Fig. 1: The architecture of BiVocoder. Discriminators are omitted from the diagram. ABS(·) and Angle(·) denote amplitude and phase spectrum calculations, respectively. Arctan2 stands for the two-argument arc-tangent function. Conv1d and DeConv1d represent a 1D convolutional layer and a 1D deconvolutional layer, respectively.
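As a small illustration of the operations named in the caption, the sketch below computes an amplitude and a phase from the real and imaginary parts of a single spectral bin, and shows why a two-argument arc-tangent is needed to recover the phase in the correct quadrant. It uses only the Python standard library; the helper names are hypothetical and the snippet does not reproduce the BiVocoder network itself.

```python
import math


def amplitude_and_phase(real, imag):
    """Amplitude (absolute value) and phase (angle) of one complex bin."""
    amplitude = math.hypot(real, imag)
    phase = math.atan2(imag, real)  # two-argument arc-tangent, range (-pi, pi]
    return amplitude, phase


def to_real_imag(amplitude, phase):
    """Inverse mapping: rebuild real and imaginary parts."""
    return amplitude * math.cos(phase), amplitude * math.sin(phase)


# One point per quadrant of the complex plane.
points = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]

# atan2 distinguishes all four quadrants.
phases = [amplitude_and_phase(re, im)[1] for re, im in points]

# A single-argument arc-tangent of imag/real cannot: opposite quadrants collapse.
naive = [math.atan(im / re) for re, im in points]

for (re, im), p, n in zip(points, phases, naive):
    print(f"({re:+.0f}, {im:+.0f})  atan2: {p:+.4f}  atan: {n:+.4f}")
```

The single-argument version maps the first and third quadrants to the same angle, while `math.atan2` keeps them apart, so the round trip through `to_real_imag` recovers the original real and imaginary parts.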

“*” indicates that the features used have the same frame shift and dimensionality as those extracted by BiVocoder.

Analysis-synthesis tasks on different models

Sample 1

Ground Truth	HiFi-GAN	APNet	Autovocoder

BiVocoder (proposed)	HiFi-GAN*	APNet*	Autovocoder*	STRAIGHT

Sample 2

Ground Truth	HiFi-GAN	APNet	Autovocoder

BiVocoder (proposed)	HiFi-GAN*	APNet*	Autovocoder*	STRAIGHT

Sample 3

Ground Truth	HiFi-GAN	APNet	Autovocoder

BiVocoder (proposed)	HiFi-GAN*	APNet*	Autovocoder*	STRAIGHT

Sample 4

Ground Truth	HiFi-GAN	APNet	Autovocoder

BiVocoder (proposed)	HiFi-GAN*	APNet*	Autovocoder*	STRAIGHT

Sample 5

Ground Truth	HiFi-GAN	APNet	Autovocoder

BiVocoder (proposed)	HiFi-GAN*	APNet*	Autovocoder*	STRAIGHT

Sample 6

Ground Truth	HiFi-GAN	APNet	Autovocoder

BiVocoder (proposed)	HiFi-GAN*	APNet*	Autovocoder*	STRAIGHT

Sample 7

Ground Truth	HiFi-GAN	APNet	Autovocoder

BiVocoder (proposed)	HiFi-GAN*	APNet*	Autovocoder*	STRAIGHT