zhuokai/dapo_baseline_without_dynamic_sampling_temperature_1.2_Qwen2.5-Math-1.5B_zzk Updated 6 days ago
zhuokai/dapo_baseline_without_dynamic_sampling_temperature_1.0_Qwen2.5-Math-1.5B_zzk Updated 6 days ago
zhuokai/dapo_baseline_without_dynamic_sampling_temperature_0.6_Qwen2.5-Math-1.5B_zzk Updated 6 days ago
zhuokai/as_negexp_explore_1.2_stable_0.1_decay_freq_25_warmup_period_10_negexp_Qwen2.5-Math-1.5B_zzk Updated 6 days ago
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs Paper • 2506.10128 • Published Jun 11 • 23
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Paper • 2506.05523 • Published Jun 5 • 34
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning Paper • 2503.22738 • Published Mar 26 • 17