V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning Paper • 2506.09985 • Published Jun 11 • 30
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories Paper • 2504.08942 • Published Apr 11 • 27
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning Paper • 2504.07128 • Published Apr 2 • 87
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens Paper • 2504.07096 • Published Apr 9 • 77
CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text Paper • 1908.06177 • Published Aug 16, 2019
Learning an Unreferenced Metric for Online Dialogue Evaluation Paper • 2005.00583 • Published May 1, 2020
How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts Paper • 2205.10762 • Published May 22, 2022
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning Paper • 2412.14164 • Published Dec 18, 2024 • 4
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback Paper • 2410.19133 • Published Oct 24, 2024 • 11
The Impact of Positional Encoding on Length Generalization in Transformers Paper • 2305.19466 • Published May 31, 2023 • 2
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment Paper • 2410.01679 • Published Oct 2, 2024 • 27
Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models Paper • 2305.14775 • Published May 24, 2023
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection Paper • 2004.07667 • Published Apr 16, 2020
Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation Paper • 2305.16938 • Published May 26, 2023
Lexical Generalization Improves with Larger Models and Longer Training Paper • 2210.12673 • Published Oct 23, 2022
Data Contamination Report from the 2024 CONDA Shared Task Paper • 2407.21530 • Published Jul 31, 2024 • 10