Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT

Jonas Flodén

doi:10.1002/berj.4069

journal article Open Access Sep 16, 2024

British Educational Research Journal Vol. 51 No. 1 pp. 201-224 · Wiley

Abstract

This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three Master's-level exams were scored using ChatGPT 3.5, the results were compared with the teachers' scoring, and the grading teachers were interviewed. In total, 463 exam responses were graded. With each response graded at least three times, a total of 1389 gradings were conducted. For the final exam scores, 70% of ChatGPT's gradings were within 10% of the teachers' gradings and 31% were within 5%. ChatGPT tended to give marginally higher scores. Agreement on grades was 30%, but 45% of the exams received an adjacent grade. On individual questions, ChatGPT was more inclined to avoid very high or very low scores. ChatGPT struggled to correctly score questions closely related to the course lectures but performed better on more general questions. The AI can generate plausible scores on university exams that, at first glance, look similar to those of a human grader. There are differences, but two different human graders could plausibly produce similar discrepancies. During the interviews, teachers expressed surprise at how well ChatGPT's grading matched their own. Increased use of AI can lead to ethical challenges, as exams are entrusted to a machine whose decision-making criteria are not fully understood, especially concerning potential bias in training data.
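The agreement figures in the abstract (share of gradings within 10% or 5% of the teachers' scores, and exact versus adjacent grade agreement) can be illustrated with a minimal sketch. It is not the author's analysis code. It assumes that "within 10%" means an absolute difference of at most 10% of the maximum exam score, and that grades sit on an ordered scale. The function names, the grade scale and the toy data are all hypothetical.

```python
from typing import Sequence, Tuple


def share_within(human: Sequence[float], ai: Sequence[float],
                 max_score: float, tolerance: float) -> float:
    """Fraction of exams where |ai - human| <= tolerance * max_score.

    Assumption: the tolerance is measured against the maximum exam score.
    """
    if len(human) != len(ai) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    limit = tolerance * max_score
    hits = sum(1 for h, a in zip(human, ai) if abs(a - h) <= limit)
    return hits / len(human)


def grade_agreement(human: Sequence[str], ai: Sequence[str],
                    scale: Sequence[str]) -> Tuple[float, float]:
    """Return (exact share, adjacent share) for grades on an ordered scale."""
    if len(human) != len(ai) or not human:
        raise ValueError("grade lists must be non-empty and of equal length")
    rank = {grade: i for i, grade in enumerate(scale)}
    exact = adjacent = 0
    for h, a in zip(human, ai):
        distance = abs(rank[h] - rank[a])
        if distance == 0:
            exact += 1
        elif distance == 1:
            adjacent += 1
    return exact / len(human), adjacent / len(human)


# Hypothetical toy data, not taken from the study.
teacher_scores = [50, 62, 71, 80]
chatgpt_scores = [54, 75, 70, 88]
within_10 = share_within(teacher_scores, chatgpt_scores, 100, 0.10)
within_5 = share_within(teacher_scores, chatgpt_scores, 100, 0.05)

scale = ["F", "E", "D", "C", "B", "A"]  # assumed ordered grade scale
exact, adjacent = grade_agreement(["C", "B", "D", "A"],
                                  ["C", "A", "B", "A"], scale)
print(within_10, within_5, exact, adjacent)
```

Reporting exact and adjacent agreement separately, as the abstract does, shows whether disagreements are near misses or larger jumps across the grade scale.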

Details

Published: Sep 16, 2024
Vol/Issue: 51(1)
Pages: 201-224

Authors

Jonas Flodén

Department of Business Administration, School of Business, Economics and Law, University of Gothenburg, Gothenburg, Sweden

Cite This Article

Jonas Flodén (2024). Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT. British Educational Research Journal, 51(1), 201-224. https://doi.org/10.1002/berj.4069
