BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation

Anonymous submission to Interspeech 2024

Abstract

This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable of both feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, the BiVocoder takes the amplitude and phase spectra derived from the STFT as inputs and transforms them into long-frame-shift, low-dimensional features through convolutional neural networks. The extracted features are shown to be suitable for direct prediction by acoustic models, supporting their application in text-to-speech (TTS) tasks. For waveform generation, the BiVocoder restores the amplitude and phase spectra from the features with a symmetric network, followed by an inverse STFT to reconstruct the speech waveform. Experimental results show that the proposed BiVocoder achieves better overall performance than some baseline vocoders when both synthesized speech quality and inference speed are considered, on both analysis-synthesis and TTS tasks.

Contents

Model Architecture


Fig. 1 The architecture of BiVocoder; the discriminators are omitted from the diagram. ABS(·) and Angle(·) denote the amplitude and phase spectrum calculations, respectively. Arctan2 stands for the two-argument arctangent function. Conv1d and DeConv1d represent a 1D convolutional layer and a 1D deconvolutional layer, respectively.
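
As an illustration of the operations named in the caption, the following minimal sketch computes amplitude and phase spectra for a single windowed frame and then inverts them. It is not the BiVocoder implementation: the function names, the frame length of 320 samples, and the Hann window are assumptions chosen for the example, and NumPy's FFT on one frame stands in for a full STFT.

```python
import numpy as np

def analyze_frame(frame):
    """Amplitude and phase spectra of one windowed frame (one STFT column)."""
    spec = np.fft.rfft(frame)
    amplitude = np.abs(spec)    # corresponds to ABS(.) in the caption
    phase = np.angle(spec)      # corresponds to Angle(.) in the caption
    return amplitude, phase

def phase_from_parts(real, imag):
    """Phase from a (real, imaginary) pair via the two-argument arctangent."""
    return np.arctan2(imag, real)

def synthesize_frame(amplitude, phase, n):
    """Rebuild the complex spectrum and invert it back to a time-domain frame."""
    spec = amplitude * np.exp(1j * phase)
    return np.fft.irfft(spec, n=n)

# Illustrative input: a random frame of 320 samples (hypothetical length).
rng = np.random.default_rng(0)
frame = rng.standard_normal(320) * np.hanning(320)

amp, pha = analyze_frame(frame)
spec = np.fft.rfft(frame)
pha_arctan2 = phase_from_parts(spec.real, spec.imag)
recon = synthesize_frame(amp, pha, n=len(frame))
```

With unmodified amplitude and phase, the reconstructed frame matches the input up to floating-point error. In a vocoder, the spectra passed to the inverse transform are instead predicted by the network.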


“*” indicates that the features used have the same frame shift and dimensionality as those extracted by BiVocoder.

Analysis-synthesis tasks on different models

Sample 1

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 2

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 3

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 4

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 5

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 6

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 7

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Text-to-speech tasks on different models

Sample 1

Ground Truth BiVocoder (proposed) HiFi-GAN HiFi-GAN* Autovocoder


Sample 2

Ground Truth BiVocoder (proposed) HiFi-GAN HiFi-GAN* Autovocoder


Sample 3

Ground Truth BiVocoder (proposed) HiFi-GAN HiFi-GAN* Autovocoder


Sample 4

Ground Truth BiVocoder (proposed) HiFi-GAN HiFi-GAN* Autovocoder


Sample 5

Ground Truth BiVocoder (proposed) HiFi-GAN HiFi-GAN* Autovocoder


Sample 6

Ground Truth BiVocoder (proposed) HiFi-GAN HiFi-GAN* Autovocoder


Sample 7

Ground Truth BiVocoder (proposed) HiFi-GAN HiFi-GAN* Autovocoder


Cross-dataset experiments

Sample 1

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 2

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 3

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 4

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 5

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 6

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT


Sample 7

Ground Truth HiFi-GAN APNet Autovocoder


BiVocoder (proposed) HiFi-GAN* APNet* Autovocoder* STRAIGHT