A Survey on Evaluation of Large Language Models

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey

Details

Published: Mar 29, 2024
Vol/Issue: 15(3)
Pages: 1-45

Authors

Yupeng Chang
School of Artificial Intelligence, Jilin University, Changchun, China

Xu Wang
School of Artificial Intelligence, Jilin University, Changchun, China

Jindong Wang
Microsoft Research Asia, Beijing, China

Yuan Wu
School of Artificial Intelligence, Jilin University, Changchun, China

Linyi Yang
Westlake University, Hangzhou, China

Kaijie Zhu
Institute of Automation, Chinese Academy of Sciences, Beijing, China

Hao Chen
Carnegie Mellon University, Pittsburgh, USA

Xiaoyuan Yi
Microsoft Research Asia, Beijing, China

Cunxiang Wang
Westlake University, Hangzhou, China

Yidong Wang
Peking University, Beijing, China

Wei Ye
Peking University, Beijing, China

Yue Zhang
Westlake University, Hangzhou, China

Philip S. Yu
University of Illinois at Chicago, Chicago, USA

Qiang Yang
Hong Kong University of Science and Technology, Kowloon, China

Xing Xie
Microsoft Research Asia, Beijing, China

Funding

NSF Award: III-2106758

Cite This Article

Yupeng Chang, Xu Wang, Jindong Wang, et al. (2024). A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology, 15(3), 1-45. https://doi.org/10.1145/3641289
