Episode
Model Merging Scaling Laws in Large Language Models
- Podcast
- Daily Paper Cast
- Published
- May 13, 2026
- Duration seconds
- 1304
- Processing state
- not_requested
- Canonical source
- https://share.transistor.fm/s/dc79b8ed
Actions
POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/model-merging-scaling-laws-in-large-language-models/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/daily-paper-cast-7079649/model-merging-scaling-laws-in-large-language-models.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
🤗 Upvotes: 26 | cs.AI
Authors: Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang
Title: Model Merging Scaling Laws in Large Language Models
Arxiv: http://arxiv.org/abs/2509.24244v4
Abstract: We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget, turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
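
The "predictive planning" idea in the abstract can be illustrated with a minimal sketch. The summary does not give the paper's exact functional form, so this assumes the simplest shape consistent with "a floor plus a tail that falls roughly as 1/k": loss(k) ≈ floor + tail / k for a fixed model size. The function names `fit_floor_tail` and `experts_needed` are hypothetical, the numbers in the usage section are synthetic placeholders rather than results from the paper, and the dependence of the floor on model size is left out.

```python
import math


def fit_floor_tail(ks, losses):
    """Least-squares fit of loss ~ floor + tail / k.

    ASSUMPTION: a pure 1/k tail at a fixed model size. The model is
    linear in x = 1/k, so ordinary simple linear regression suffices.
    """
    xs = [1.0 / k for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    tail = sxy / sxx               # slope with respect to 1/k
    floor = mean_y - tail * mean_x  # intercept: the loss as k grows large
    return floor, tail


def experts_needed(floor, tail, target_loss):
    """Smallest integer k with floor + tail / k <= target_loss.

    Returns None when the target is at or below the fitted floor,
    i.e. adding experts alone cannot reach it under this model.
    """
    if target_loss <= floor:
        return None
    return max(1, math.ceil(tail / (target_loss - floor)))


if __name__ == "__main__":
    # Synthetic, made-up measurements for illustration only.
    ks = [1, 2, 4, 8]
    losses = [2.0 + 0.5 / k for k in ks]

    floor, tail = fit_floor_tail(ks, losses)
    print("fitted floor:", round(floor, 6), "fitted tail:", round(tail, 6))
    print("experts needed for loss 2.12:", experts_needed(floor, tail, 2.12))
    print("experts needed for loss 1.90:", experts_needed(floor, tail, 1.90))
```

Under this assumed form, the marginal gain from the k-th expert is tail / (k (k - 1)), which is one way to read the abstract's claim that most gains arrive early. A target below the fitted floor returns None, which corresponds to the case where scaling the base model, rather than adding experts, would be required.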