Survey of Hallucination in Natural Language Generation

Natural Language Generation (NLG) has improved markedly in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has made NLG more fluent and coherent, which in turn has driven progress in downstream tasks such as abstractive summarization, dialogue generation, and data-to-text generation. However, it is also apparent that deep learning-based generation is prone to hallucinating unintended text, which degrades system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have proposed ways to measure and mitigate hallucinated text, but these have not previously been reviewed in a comprehensive manner. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions, and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.

Metrics

Citations: 2,618
References: 170

Details

Published: Mar 03, 2023
Vol/Issue: 55(12)
Pages: 1-38

Authors

Ziwei Ji