About
I work on audio and speech processing, with a particular interest in efficient and high-fidelity neural waveform generation, low-bitrate neural speech and audio coding, and robust speech modeling. I also work on speech quality prediction, speech enhancement, and environment-aware speech generation.
This page consolidates journal/conference papers and matching arXiv versions into single title-level entries to avoid duplicate listings.
Research Interests
Neural Vocoders
High-quality and efficient waveform generation with explicit amplitude-phase modeling and low-latency design.
Neural Audio and Speech Codecs
Low-bitrate, streamable speech and audio coding with high-fidelity reconstruction.
Speech Synthesis
Environment-aware, zero-shot, and robust speech generation for realistic usage conditions.
Speech Assessment and Enhancement
Speech quality prediction, denoising, bandwidth extension, and universal enhancement in difficult conditions.
Publications
Publications are split into first-author and co-authored (non-first-author) papers. arXiv links are provided whenever a public preprint is available.
First-Author Publications
- CodeSep: Low-Bitrate Codec-Driven Speech Separation with Base-Token Disentanglement and Auxiliary-Token Serial Prediction. arXiv preprint, 2026
- A Distilled Low-Latency Neural Vocoder with Explicit Amplitude and Phase Prediction. APSIPA ASC, 2025
- Is GAN Necessary for Mel-Spectrogram-Based Neural Vocoder? IEEE Signal Processing Letters, 2025
- A Neural Denoising Vocoder for Clean Waveform Generation from Noisy Mel-Spectrogram Based on Amplitude and Phase Predictions. NCMMSC, 2024
- APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm. ISCSLP, 2024
- BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation. INTERSPEECH, 2024
- APNet2: High-Quality and High-Efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra. NCMMSC, 2023
Co-Authored Publications
- A High-Quality and Low-Complexity Streamable Neural Speech Codec with Knowledge Distillation. APSIPA ASC, 2025
- CASC-XVC: Zero-Shot Cross-Lingual Voice Conversion with Content Accordant and Speaker Contrastive Losses. ICASSP, 2025
- DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis. arXiv preprint, 2025
- ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2025
- Improving Noise Robustness of LLM-Based Zero-Shot TTS via Discrete Acoustic Token Denoising. INTERSPEECH, 2025
- Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis. ICASSP, 2025
- Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding. arXiv preprint, 2025
- Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2025
- Universal Discrete-Domain Speech Enhancement. arXiv preprint, 2025
- Vision-Integrated High-Quality Neural Speech Coding. INTERSPEECH, 2025
- APCodec: A Neural Audio Codec With Parallel Amplitude and Phase Spectrum Encoding and Decoding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024
- Considering Temporal Connection Between Turns for Conversational Speech Synthesis. ICASSP, 2024
- ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram. NCMMSC, 2024
- MDCTCodec: A Lightweight MDCT-Based Neural Audio Codec Towards High Sampling Rate and Low Bitrate Scenarios. SLT, 2024
- Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion. SLT, 2024
- SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features. ISCSLP, 2024
- Stage-Wise and Prior-Aware Neural Speech Phase Prediction. SLT, 2024