Large Language Models for Automated Web-Form-Test Generation: An Empirical Study

Tao Li; Chenhui Cui; Rubing Huang; Dave Towey; Lei Ma

doi:10.1145/3735553

Journal article, published Feb 13, 2026

ACM Transactions on Software Engineering and Methodology Vol. 35 No. 3 pp. 1-37 · Association for Computing Machinery (ACM)

Abstract

Testing web forms is an essential activity for ensuring the quality of web applications. It typically involves evaluating the interactions between users and forms. Automated test-case generation remains a challenge for web-form testing: Due to the complex, multi-level structure of web pages, it can be difficult to automatically capture their inherent contextual information for inclusion in the tests. Large Language Models (LLMs) have shown great potential for contextual text generation. This motivated us to explore how they could generate automated tests for web forms, making use of the contextual information within form elements. To the best of our knowledge, no comparative study examining different LLMs has yet been reported for web-form-test generation. To address this gap in the literature, we conducted a comprehensive empirical study investigating the effectiveness of 11 LLMs on 146 web forms from 30 open source Java web applications. In addition, we propose three HTML-structure-pruning methods to extract key contextual information. The experimental results show that different LLMs can achieve different testing effectiveness, with the GPT-4, GLM-4, and Baichuan2 LLMs generating the best web-form tests. Compared with GPT-4, the other LLMs had difficulty generating appropriate tests for the web forms: Their Successfully Submitted Rates (SSRs), that is, the proportions of the LLM-generated web-form tests that could be successfully inserted into the web forms and submitted, decreased by 9.10% to 74.15%. Our findings also show that, for all LLMs, when the designed prompts included complete and clear contextual information about the web forms, more effective web-form tests were generated. Specifically, when using Parser-Processed HTML for Task Prompt (PH-P), the SSR averaged 70.63%, higher than the 60.21% for Raw HTML for Task Prompt (RH-P) and 50.27% for LLM-Processed HTML for Task Prompt (LH-P). With RH-P, GPT-4’s SSR was 98.86%, outperforming models like LLaMa2 (7B) with 34.47% and GLM-4V with 0%. Similarly, with PH-P, GPT-4 reached an SSR of 99.54%, the highest among all models and prompt types. Finally, this article also highlights strategies for selecting LLMs based on performance metrics, and for optimizing the prompt design to improve the quality of the web-form tests.
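
As an illustration of the SSR metric defined in the abstract, the following minimal sketch computes the proportion of generated tests that were successfully filled in and submitted. The record type `FormTestOutcome`, the function name `successfully_submitted_rate`, and the sample outcomes are assumptions made for this example only; they are not taken from the study's tooling or data.

```python
from dataclasses import dataclass


@dataclass
class FormTestOutcome:
    """Outcome of running one generated web-form test (hypothetical record type)."""
    form_id: str
    submitted: bool  # True if the values were inserted and the form was submitted


def successfully_submitted_rate(outcomes):
    """Return the SSR as a percentage: successful submissions / all generated tests."""
    if not outcomes:
        raise ValueError("SSR is undefined for an empty set of tests")
    successes = sum(1 for outcome in outcomes if outcome.submitted)
    return 100.0 * successes / len(outcomes)


# Made-up outcomes for illustration only; these are not results from the study.
outcomes = [
    FormTestOutcome("login", True),
    FormTestOutcome("signup", True),
    FormTestOutcome("search", False),
    FormTestOutcome("contact", True),
]
ssr = successfully_submitted_rate(outcomes)
print(f"SSR = {ssr:.2f}%")
```

With three of the four hypothetical tests submitted, the sketch reports an SSR of 75%. Per-model or per-prompt-type figures such as those quoted in the abstract would correspond to applying the same ratio to the tests generated under each configuration.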

Metrics

Citations: 8
References: 110

Details

Published: Feb 13, 2026
Vol/Issue: 35(3)
Pages: 1-37

Authors

Tao Li

School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macao

Chenhui Cui

School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macao

Rubing Huang

School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macao and Macau University of Science and Technology Zhuhai MUST Science and Technology Research Institute, Zhuhai, China

Dave Towey

School of Computer Science, University of Nottingham Ningbo China, Ningbo, China

Lei Ma

Department of Computer Science, The University of Tokyo, Bunkyo-ku, Japan and Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada

Funding

Science and Technology Development Fund of Macau, Macau SAR Award: 0021/2023/RIA1

Cite This Article

Tao Li, Chenhui Cui, Rubing Huang, et al. (2026). Large Language Models for Automated Web-Form-Test Generation: An Empirical Study. ACM Transactions on Software Engineering and Methodology, 35(3), 1-37. https://doi.org/10.1145/3735553
