Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

Hui-Peng Du, Xiao-Hang Jiang, Yuan Tian, Yang Ai, and Zhen-Hua Ling

Abstract

Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, we propose FMelCodec, an ultra-low-bitrate neural speech codec in the mel-spectrogram domain, cast as a three-stage coding–refinement–reconstruction (CRR) framework that can operate at bitrates as low as 250 bps. In the CRR framework, the front-end mel-spectrogram coding stage employs an aggressive 640× compression/decompression encoder–decoder structure with a single 1024-entry VQ codebook, coupled with an online clustering strategy that reassigns underused codewords to prevent codebook collapse and preserve codebook diversity. The subsequent conditional flow matching (CFM)-based mel-spectrogram refinement stage uses a lightweight velocity-field estimator and a CFM-based solver to refine the codec-degraded mel-spectrogram produced by the preceding decoder, and adopts a self-consistency training scheme that allows fewer iterative inference steps, reducing computational overhead. Finally, the vocoding-driven waveform reconstruction stage employs a HiFi-GAN vocoder to faithfully reconstruct the waveform from the refined mel-spectrogram. Experiments on two datasets spanning two sampling rates show that, under ultra-low-bitrate constraints of 250 bps at 16 kHz and 750 bps at 48 kHz, FMelCodec achieves higher speech reconstruction quality and speaker similarity in both objective and subjective evaluations, while incurring lower computational and model complexity.
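The bitrates quoted in the abstract can be checked with a short calculation. The sketch below assumes that the 640× factor means one codebook index is emitted per 640 waveform samples, and that each index from the single 1024-entry codebook is sent with a fixed-length code of log2(1024) = 10 bits. Neither assumption is stated explicitly on this page, and the function name `token_bitrate_bps` is hypothetical, used only for illustration.

```python
import math


def token_bitrate_bps(sample_rate_hz: int, compression_factor: int, codebook_size: int) -> float:
    """Bitrate of a single-codebook VQ codec with fixed-length index coding.

    Assumption (not confirmed by the source): one codeword index is emitted
    per `compression_factor` waveform samples.
    """
    tokens_per_second = sample_rate_hz / compression_factor
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token


# 16 kHz: 16000 / 640 = 25 tokens/s, 10 bits each
print(token_bitrate_bps(16_000, 640, 1024))  # 250.0
# 48 kHz: 48000 / 640 = 75 tokens/s, 10 bits each
print(token_bitrate_bps(48_000, 640, 1024))  # 750.0
```

Under these assumptions the calculation gives 250 bps at 16 kHz and 750 bps at 48 kHz, matching the two operating points used in the comparisons below.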

Model Architecture


Fig. 1. Inference pipeline of the proposed FMelCodec under the CRR framework.


Comparison at Equal Ultra-Low Bitrates (16 kHz, 250 bps)

Sample 1

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec FocalCodec


Sample 2

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec FocalCodec


Sample 3

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec FocalCodec


Sample 4

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec FocalCodec


Sample 5

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec FocalCodec


Sample 6

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec FocalCodec


Sample 7

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec FocalCodec


Comparison at Equal Ultra-Low Bitrates (48 kHz, 750 bps)

Sample 1

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec


Sample 2

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec


Sample 3

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec


Sample 4

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec


Sample 5

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec


Sample 6

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec


Sample 7

Ground Truth FMelCodec (proposed) DAC MDCTCodec


BigCodec WavTokenizer Flowdec


Comparison with Baselines Having Publicly Available Checkpoints

Sample 1

Ground Truth FMelCodec (proposed) @ 250 bps SemantiCodec @ 310 bps FocalCodec @ 330 bps


Sample 2

Ground Truth FMelCodec (proposed) @ 250 bps SemantiCodec @ 310 bps FocalCodec @ 330 bps


Sample 3

Ground Truth FMelCodec (proposed) @ 250 bps SemantiCodec @ 310 bps FocalCodec @ 330 bps


Sample 4

Ground Truth FMelCodec (proposed) @ 250 bps SemantiCodec @ 310 bps FocalCodec @ 330 bps


Sample 5

Ground Truth FMelCodec (proposed) @ 250 bps SemantiCodec @ 310 bps FocalCodec @ 330 bps


Sample 6

Ground Truth FMelCodec (proposed) @ 250 bps SemantiCodec @ 310 bps FocalCodec @ 330 bps


Sample 7

Ground Truth FMelCodec (proposed) @ 250 bps SemantiCodec @ 310 bps FocalCodec @ 330 bps


Comparison against Higher-Bitrate Baselines (16 kHz, 500 bps)

Sample 1

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 500 bps MDCTCodec @ 500 bps


BigCodec @ 500 bps WavTokenizer @ 500 bps Flowdec @ 500 bps


Sample 2

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 500 bps MDCTCodec @ 500 bps


BigCodec @ 500 bps WavTokenizer @ 500 bps Flowdec @ 500 bps


Sample 3

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 500 bps MDCTCodec @ 500 bps


BigCodec @ 500 bps WavTokenizer @ 500 bps Flowdec @ 500 bps


Sample 4

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 500 bps MDCTCodec @ 500 bps


BigCodec @ 500 bps WavTokenizer @ 500 bps Flowdec @ 500 bps


Sample 5

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 500 bps MDCTCodec @ 500 bps


BigCodec @ 500 bps WavTokenizer @ 500 bps Flowdec @ 500 bps


Sample 6

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 500 bps MDCTCodec @ 500 bps


BigCodec @ 500 bps WavTokenizer @ 500 bps Flowdec @ 500 bps


Sample 7

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 500 bps MDCTCodec @ 500 bps


BigCodec @ 500 bps WavTokenizer @ 500 bps Flowdec @ 500 bps


Comparison against Higher-Bitrate Baselines (16 kHz, 1000 bps)

Sample 1

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 1000 bps MDCTCodec @ 1000 bps


BigCodec @ 1000 bps WavTokenizer @ 1000 bps Flowdec @ 1000 bps


Sample 2

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 1000 bps MDCTCodec @ 1000 bps


BigCodec @ 1000 bps WavTokenizer @ 1000 bps Flowdec @ 1000 bps


Sample 3

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 1000 bps MDCTCodec @ 1000 bps


BigCodec @ 1000 bps WavTokenizer @ 1000 bps Flowdec @ 1000 bps


Sample 4

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 1000 bps MDCTCodec @ 1000 bps


BigCodec @ 1000 bps WavTokenizer @ 1000 bps Flowdec @ 1000 bps


Sample 5

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 1000 bps MDCTCodec @ 1000 bps


BigCodec @ 1000 bps WavTokenizer @ 1000 bps Flowdec @ 1000 bps


Sample 6

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 1000 bps MDCTCodec @ 1000 bps


BigCodec @ 1000 bps WavTokenizer @ 1000 bps Flowdec @ 1000 bps


Sample 7

Ground Truth FMelCodec (proposed) @ 250 bps DAC @ 1000 bps MDCTCodec @ 1000 bps


BigCodec @ 1000 bps WavTokenizer @ 1000 bps Flowdec @ 1000 bps


Analysis and Discussion

Sample 1

Ground Truth FMelCodec (proposed) FMelCodec w/o OC


FMelCodec w/o ST FMelCodec*


Sample 2

Ground Truth FMelCodec (proposed) FMelCodec w/o OC


FMelCodec w/o ST FMelCodec*


Sample 3

Ground Truth FMelCodec (proposed) FMelCodec w/o OC


FMelCodec w/o ST FMelCodec*


Sample 4

Ground Truth FMelCodec (proposed) FMelCodec w/o OC


FMelCodec w/o ST FMelCodec*


Sample 5

Ground Truth FMelCodec (proposed) FMelCodec w/o OC


FMelCodec w/o ST FMelCodec*


Sample 6

Ground Truth FMelCodec (proposed) FMelCodec w/o OC


FMelCodec w/o ST FMelCodec*


Sample 7

Ground Truth FMelCodec (proposed) FMelCodec w/o OC


FMelCodec w/o ST FMelCodec*