TransMamba: Flexibly Switching between Transformer and Mamba Paper • 2503.24067 • Published Mar 31 • 21
Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models Paper • 2502.15499 • Published Feb 21 • 15