Abstract
Background and Objectives: Dermatology relies on a complex terminology encompassing lesion types, distribution patterns, colors, and specialized sites such as hair and nails, while dermoscopy adds a further descriptive framework, making interpretation subjective and challenging. Our study aims to evaluate the ability of a chatbot (Gemini 2) to generate dermatology descriptions across multiple languages and image types, and to assess the influence of prompt language on readability, completeness, and terminology consistency. Our research is based on the concept that responses to non-English prompts are not mere translations of the English responses but are independently generated texts that reflect medical and dermatological knowledge learned from non-English material used in the chatbot’s training.
Materials and Methods: Five macroscopic and five dermoscopic images of common skin lesions were used. Images were uploaded to Gemini 2 with language-specific prompts requesting short paragraphs describing visible features and possible diagnoses. A total of 2400 outputs were analyzed using the LIX readability score and the CLEAR (comprehensiveness, accuracy, evidence-based content, appropriateness, and relevance) assessment, while terminology consistency was evaluated via SNOMED CT mapping across English, French, German, and Greek outputs.
Results: English and French descriptions were found to be harder to read and more sophisticated, while SNOMED CT mapping revealed the largest terminology mismatch in German and the smallest in French. English texts and macroscopic images achieved the highest accuracy, completeness, and readability based on the CLEAR assessment, whereas dermoscopic images and non-English texts presented greater challenges.
Conclusions: Overall, partial terminology inconsistencies and cross-lingual variations highlighted that the language of the prompt plays a critical role in shaping AI-generated dermatology descriptions.
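The LIX readability score named in the abstract has a simple closed form: the average number of words per sentence plus the percentage of words longer than six letters. The following is a minimal Python sketch of that formula, not the authors' pipeline. The function name `lix_score`, the regex-based tokenization, and the choice of `.`, `!` and `?` as sentence terminators are assumptions made for illustration; the abstract does not state how the study counted words or sentences.

```python
import re


def lix_score(text: str) -> float:
    """Return the LIX readability score of `text`.

    LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than six letters.
    """
    # Assumption: a word is a run of Unicode letters, so Greek, French
    # and German characters are counted the same way as English ones.
    words = re.findall(r"[^\W\d_]+", text)
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


sample = "The lesion is small. It shows irregular pigmentation."
# 8 words, 2 sentences, 2 long words: 8/2 + 100*2/8
print(lix_score(sample))  # 29.0
```

Higher values indicate text that is harder to read. Because the score depends only on word and sentence lengths, it can be applied to all four output languages without language-specific word lists, although languages that form long compounds tend to score higher on the same content.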
Details
Published: Jan 22, 2026
Vol/Issue: 62(1)
Pages: 227
Cite This Article
Emmanouil Karampinis, Christina-Marina Zoumpourli, Christina Kontogianni, et al. (2026). Dermatology “AI Babylon”: Cross-Language Evaluation of AI-Crafted Dermatology Descriptions. Medicina, 62(1), 227. https://doi.org/10.3390/medicina62010227