BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining Paper • 2508.10975 • Published Aug 2025 • 55 upvotes
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization Paper • 2507.14683 • Published Jul 19, 2025 • 127 upvotes
Skip a Layer or Loop It? Test-Time Depth Adaptation of Pretrained LLMs Paper • 2507.07996 • Published Jul 10, 2025 • 33 upvotes
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Paper • 2507.00432 • Published Jul 1, 2025 • 74 upvotes
DeepSeek vs. o3-mini: How Well Can Reasoning LLMs Evaluate MT and Summarization? Paper • 2504.08120 • Published Apr 10, 2025 • 4 upvotes
Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging Paper • 2503.20641 • Published Mar 26, 2025 • 9 upvotes
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment Paper • 2502.16894 • Published Feb 24, 2025 • 31 upvotes
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning Paper • 2502.17407 • Published Feb 24, 2025 • 26 upvotes
Slamming: Training a Speech Language Model on One GPU in a Day Paper • 2502.15814 • Published Feb 19, 2025 • 70 upvotes
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20, 2025 • 106 upvotes
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering Paper • 2502.13962 • Published Feb 19, 2025 • 29 upvotes
Revisiting the Test-Time Scaling of o1-like Models: Do They Truly Possess Test-Time Scaling Capabilities? Paper • 2502.12215 • Published Feb 17, 2025 • 16 upvotes
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models Paper • 2502.12464 • Published Feb 18, 2025 • 28 upvotes
The Mirage of Model Editing: Revisiting Evaluation in the Wild Paper • 2502.11177 • Published Feb 16, 2025 • 10 upvotes