Evaluation for Generative AI - a vincentkoc Collection

updated May 20

Papers and resources on the evaluation of large language models and generative AI.

Humanity's Last Exam

Paper • 2501.14249 • Published Jan 24 • 75

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

Paper • 2501.14492 • Published Jan 24 • 34

vincentkoc/tiny_qa_benchmark

Viewer • Updated May 20 • 52 • 43 • 1

vincentkoc/tiny_qa_benchmark_pp

Viewer • Updated May 20 • 662 • 356 • 1

Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation

Paper • 2505.12058 • Published May 17 • 6

tinyBenchmarks: evaluating LLMs with fewer examples

Paper • 2402.14992 • Published Feb 22, 2024 • 16