- Survey on Evaluation of LLM-based Agents
  Paper • 2503.16416 • Published • 94
- Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
  Paper • 2505.04921 • Published • 186
- Survey of User Interface Design and Interaction Techniques in Generative AI Applications
  Paper • 2410.22370 • Published • 12
- Survey of Hallucination in Natural Language Generation
  Paper • 2202.03629 • Published
Collections including paper arxiv:2310.19736
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  Paper • 2303.16634 • Published • 3
- miracl/miracl-corpus
  Viewer • Updated • 77.2M • 3.23k • 46
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
  Paper • 2306.05685 • Published • 37
- How is ChatGPT's behavior changing over time?
  Paper • 2307.09009 • Published • 24
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  Paper • 2310.17631 • Published • 35
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  Paper • 2310.08491 • Published • 55
- Generative Judge for Evaluating Alignment
  Paper • 2310.05470 • Published • 1
- Calibrating LLM-Based Evaluator
  Paper • 2309.13308 • Published • 12
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
  Paper • 2306.05685 • Published • 37
- Generative Judge for Evaluating Alignment
  Paper • 2310.05470 • Published • 1
- Humans or LLMs as the Judge? A Study on Judgement Biases
  Paper • 2402.10669 • Published
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  Paper • 2310.17631 • Published • 35
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  Paper • 2310.17631 • Published • 35
- AgentTuning: Enabling Generalized Agent Abilities for LLMs
  Paper • 2310.12823 • Published • 36
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  Paper • 2303.16634 • Published • 3
- GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems
  Paper • 2310.12397 • Published • 1