Episode

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Podcast
Daily Paper Cast
Published
May 22, 2026
Duration seconds
1381
Processing state
not_requested
Canonical source
https://share.transistor.fm/s/c0e9ca22
Audio
https://media.transistor.fm/c0e9ca22/c8a32f41.mp3
JSON
/v1/public/podcasts/daily-paper-cast-7079649/episodes/mega-asr-towards-in-the-wild-2-speech-recognition-via-scaling-up-real-world-acoustic-simulation
Markdown
/podcast/daily-paper-cast-7079649/mega-asr-towards-in-the-wild-2-speech-recognition-via-scaling-up-real-world-acoustic-simulation.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/mega-asr-towards-in-the-wild-2-speech-recognition-via-scaling-up-real-world-acoustic-simulation/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/daily-paper-cast-7079649/mega-asr-towards-in-the-wild-2-speech-recognition-via-scaling-up-real-world-acoustic-simulation.md
    Read the agent-friendly Markdown representation of this episode resource.
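A minimal sketch of how a client might build the two actions above, using only the Python standard library. The URL patterns are copied from the Actions list. The helper names, the empty POST body, and the lack of authentication headers are assumptions, since this page does not document request or response formats.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "https://stenobird.com"

# Slugs as they appear in the URLs listed under Actions.
PODCAST_SLUG = "daily-paper-cast-7079649"
EPISODE_SLUG = (
    "mega-asr-towards-in-the-wild-2-speech-recognition-"
    "via-scaling-up-real-world-acoustic-simulation"
)


def transcription_request_url(podcast_slug: str, episode_slug: str) -> str:
    """URL for the POST action that requests transcript generation."""
    return (
        f"{BASE_URL}/v1/public/podcasts/{quote(podcast_slug)}"
        f"/episodes/{quote(episode_slug)}/transcription-requests"
    )


def markdown_url(podcast_slug: str, episode_slug: str) -> str:
    """URL for the GET action that returns the Markdown representation."""
    return f"{BASE_URL}/podcast/{quote(podcast_slug)}/{quote(episode_slug)}.md"


def build_requests(podcast_slug: str, episode_slug: str) -> tuple[Request, Request]:
    """Build (but do not send) the POST and GET requests.

    Assumption: the POST takes an empty body and no auth headers.
    """
    post = Request(
        transcription_request_url(podcast_slug, episode_slug),
        data=b"",
        method="POST",
    )
    get = Request(markdown_url(podcast_slug, episode_slug), method="GET")
    return post, get


def fetch_markdown(podcast_slug: str, episode_slug: str, timeout: float = 10.0) -> str:
    """Send the GET request and return the body decoded as UTF-8 text."""
    _, get = build_requests(podcast_slug, episode_slug)
    with urlopen(get, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_markdown(PODCAST_SLUG, EPISODE_SLUG)` would perform the GET. Because the POST is described as idempotent, repeating it for the same episode should be safe, though the response shape is not specified here.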

Summary

🤗 Upvotes: 115 | cs.SD, cs.AI, cs.CL, cs.MM, eess.AS

Authors: Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao

Title: Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Arxiv: http://arxiv.org/abs/2605.19833v1

Abstract: Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
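To put the benchmark figures in the abstract on the same footing as its "relative WER reduction" claim, the sketch below computes relative reduction from the two quoted pairs. It assumes the percentages are word error rates (lower is better) with the first number in each pair belonging to Mega-ASR, which the abstract implies but does not state outright. The function name is illustrative.

```python
def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative error reduction in percent: (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer * 100.0


# (prior state of the art, Mega-ASR) as quoted in the abstract.
# Assumption: both values are WER percentages.
REPORTED = {
    "VOiCES R4-B-F": (54.01, 45.69),
    "NOIZEUS Sta-0": (29.34, 21.49),
}

for benchmark, (baseline, system) in REPORTED.items():
    print(f"{benchmark}: {relative_reduction(baseline, system):.1f}% relative reduction")
# VOiCES R4-B-F: 15.4% relative reduction
# NOIZEUS Sta-0: 26.8% relative reduction
```

Under that reading, both pairs fall below 30%, which does not conflict with the abstract: the "over 30%" figure is reported for the compositional acoustic scenarios, not for these two benchmarks.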