Journal article, Open Access, Apr 08, 2026

Retrieval Augmentation Reduces Factual Errors in Knowledge-Intensive Language Model Tasks

DOI: 10.54097/8jvwpk07
Abstract
Large language models (LLMs) perform strongly across natural language processing (NLP) tasks, yet they remain prone to generating factually incorrect content, a phenomenon broadly termed hallucination. Retrieval-augmented generation (RAG) mitigates this limitation by grounding model outputs in dynamically retrieved external evidence, which reduces factual errors in knowledge-intensive settings. This paper reviews RAG research, tracing developments from early retrieval-enhanced pretraining frameworks to adaptive and self-reflective architectures. We examine how retrieval strategies, including dense passage retrieval (DPR), sparse retrieval, and hybrid methods, interact with generative components to suppress hallucination. We analyze the Knowledge-Intensive Language Tasks (KILT) benchmark and open-domain question answering (QA) datasets as the primary evaluation vehicles, and synthesize empirical evidence that RAG consistently lowers factual error rates relative to purely parametric LLMs. We further discuss open challenges, including retrieval quality, knowledge conflict resolution, multi-hop reasoning, and domain adaptation, and outline future directions for applying RAG in high-stakes natural language generation (NLG) applications.

Details
Published: Apr 08, 2026
Vol/Issue: 14(1)
Pages: 42-49
Cite This Article
Dai Teng, Changhao Zhang, Jitong Zou (2026). Retrieval Augmentation Reduces Factual Errors in Knowledge-Intensive Language Model Tasks. Computer Life, 14(1), 42-49. https://doi.org/10.54097/8jvwpk07