A Survey on Evaluation of Large Language Models
The survey addresses three questions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the ‘where’ and ‘how’ questions by examining the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges in LLM evaluation. Our aim is to offer insights to researchers working on LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at:
https://github.com/MLGroupJLU/LLM-eval-survey
- Published: Mar 29, 2024
- Vol/Issue: 15(3)
- Pages: 1-45