Performance Evaluation of Deep Learning for the Detection and Segmentation of Thyroid Nodules: Systematic Review and Meta-Analysis
Background
Thyroid cancer is one of the most common endocrine malignancies. Its incidence has steadily increased in recent years. Distinguishing between benign and malignant thyroid nodules (TNs) is challenging due to their overlapping imaging features. The rapid advancement of artificial intelligence (AI) in medical image analysis, particularly deep learning (DL) algorithms, has provided novel solutions for automated TN detection. However, existing studies exhibit substantial heterogeneity in diagnostic performance. Furthermore, no systematic evidence-based research comprehensively assesses the diagnostic performance of DL models in this field.
Objective
This study aimed to conduct a systematic review and meta-analysis to evaluate the performance of DL algorithms in diagnosing TN malignancy, identify key factors influencing their diagnostic efficacy, and compare their accuracy with that of clinicians in image-based diagnosis.
Methods
We systematically searched multiple databases, including PubMed, Cochrane, Embase, Web of Science, and IEEE, and identified 41 eligible studies for systematic review and meta-analysis. Based on the task type, studies were categorized into segmentation (n=14) and detection (n=27) tasks. The pooled sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) were calculated for each group. Subgroup analyses were performed to examine the impact of transfer learning and compare model performance against clinicians.
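As an illustration of the quantities pooled above, the following minimal sketch computes sensitivity and specificity from per-study 2x2 counts and combines them by summing counts across studies. This is a deliberately simplified stand-in, not the pooling model used in this review (which the abstract does not specify). The class name `ConfusionCounts`, the function names, and the example counts are all hypothetical and are not data from the included studies.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for one study's 2x2 diagnostic table."""
    tp: int  # malignant nodules correctly flagged
    fp: int  # benign nodules incorrectly flagged
    fn: int  # malignant nodules missed
    tn: int  # benign nodules correctly cleared


def sensitivity(c: ConfusionCounts) -> float:
    # Proportion of truly positive cases that the model identifies.
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Proportion of truly negative cases that the model identifies.
    return c.tn / (c.tn + c.fp)


def pool_by_summing(studies: Iterable[ConfusionCounts]) -> Tuple[float, float]:
    """Naive pooling: add up the counts, then compute the two rates.

    Assumption: this ignores between-study heterogeneity, which a
    random-effects meta-analysis would model explicitly.
    """
    studies = list(studies)
    total = ConfusionCounts(
        tp=sum(s.tp for s in studies),
        fp=sum(s.fp for s in studies),
        fn=sum(s.fn for s in studies),
        tn=sum(s.tn for s in studies),
    )
    return sensitivity(total), specificity(total)


# Made-up counts for two imaginary studies (illustration only).
example_studies = [
    ConfusionCounts(tp=90, fp=10, fn=10, tn=90),
    ConfusionCounts(tp=40, fp=5, fn=10, tn=45),
]
pooled_sens, pooled_spec = pool_by_summing(example_studies)
print(f"pooled sensitivity = {pooled_sens:.3f}, pooled specificity = {pooled_spec:.3f}")
```

Summing counts weights each study by its sample size and yields no confidence interval, so it serves only to make the definitions concrete.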
Results
For segmentation tasks, the pooled sensitivity, specificity, and AUC were 82% (95% CI 79%‐84%), 95% (95% CI 92%‐96%), and 0.91 (95% CI 0.89‐0.94), respectively. For detection tasks, the pooled sensitivity, specificity, and AUC were 91% (95% CI 89%‐93%), 89% (95% CI 86%‐91%), and 0.96 (95% CI 0.93‐0.97), respectively. Some studies demonstrated that DL models could achieve diagnostic performance comparable with, or even exceeding, that of clinicians in certain scenarios. The application of transfer learning contributed to improved model performance.
Conclusions
DL algorithms exhibit promising diagnostic accuracy in TN imaging, highlighting their potential as auxiliary diagnostic tools. However, current studies are limited by suboptimal methodological design, inconsistent image quality across datasets, and insufficient external validation, which may introduce bias. Future research should enhance methodological standardization, improve model interpretability, and promote transparent reporting to facilitate the sustainable clinical translation of DL-based solutions.
- Published: Aug 14, 2025
- Vol/Issue: 27
- Pages: e73516-e73516