journal article Open Access Jan 01, 2025

Benchmarking Linguistic Diversity of Large Language Models

DOI: 10.1162/tacl.a.47
Abstract
The development and evaluation of Large Language Models (LLMs) have primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We adapt a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth analysis of syntactic diversity. Finally, we analyze how the design, development, and deployment choices of LLMs impact the linguistic diversity of their outputs, focusing on the creative task of story generation.
Details
Published: Jan 01, 2025
Vol/Issue: 13
Pages: 1507-1526
Cite This Article
Yanzhu Guo, Guokan Shang, Chloé Clavel (2025). Benchmarking Linguistic Diversity of Large Language Models. Transactions of the Association for Computational Linguistics, 13, 1507-1526. https://doi.org/10.1162/tacl.a.47