Abstract 2756: Development of an LLM framework for analysis of heterogeneous breast cancer patients genomic reports.
Background:
The increasing volume and diversity of clinical genomic reports make it difficult for healthcare institutions to interpret results and apply them in precision oncology research. Reports from different vendors (e.g., Invitae, Ambry Genetics, Foundation Medicine) are typically delivered as PDFs and differ substantially in panel design, gene coverage, and reporting standards, which hinders efficient retrospective data mining and patient cohort identification within electronic medical records. We sought to develop a framework that can interrogate all of these reports.
Methods:
We extracted genomic reports from patients with breast cancer treated with neoadjuvant chemotherapy and developed MolHarmonizer, a scalable framework built on Python and Gemini LLMs to process and harmonize genomic data from disparate multi-vendor reports. The Gemini LLMs are used specifically for information extraction, normalization, and structuring of key genomic features, transforming unstructured report data into a unified, queryable dataset.
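The harmonization step described above can be illustrated with a minimal Python sketch. It is not the MolHarmonizer implementation: the abstract does not describe its schema or code, so the `HarmonizedReport` fields, the `VENDOR_ALIASES` table, and the `harmonize` and `find_cohort` helpers are all hypothetical. The sketch assumes an upstream LLM extraction step has already turned each PDF into a loose dictionary, and shows only how such records might be normalized into one queryable structure.

```python
from dataclasses import dataclass

# Hypothetical alias table: maps vendor spellings seen in reports to one name.
VENDOR_ALIASES = {
    "invitae": "Invitae",
    "invitae corporation": "Invitae",
    "ambry": "Ambry Genetics",
    "ambry genetics": "Ambry Genetics",
    "foundation medicine": "Foundation Medicine",
}

# The abstract treats blood and saliva samples as germline.
GERMLINE_SPECIMENS = {"blood", "saliva"}


@dataclass(frozen=True)
class HarmonizedReport:
    """Hypothetical unified record; the real schema is not described."""
    vendor: str
    sample_type: str   # "germline" or "other"
    genes: tuple       # sorted, upper-cased gene symbols with reported mutations


def harmonize(raw: dict) -> HarmonizedReport:
    """Normalize one loosely structured record (assumed LLM output)."""
    vendor_raw = raw.get("vendor", "").strip()
    vendor = VENDOR_ALIASES.get(vendor_raw.lower(), vendor_raw)
    specimen = raw.get("specimen", "").strip().lower()
    sample_type = "germline" if specimen in GERMLINE_SPECIMENS else "other"
    genes = tuple(sorted({g.strip().upper() for g in raw.get("genes", []) if g.strip()}))
    return HarmonizedReport(vendor=vendor, sample_type=sample_type, genes=genes)


def find_cohort(reports, gene: str):
    """Return all harmonized reports that list a mutation in `gene`."""
    return [r for r in reports if gene.upper() in r.genes]


# Illustrative inputs only; these are not records from the study.
raw_records = [
    {"vendor": "INVITAE ", "specimen": "Blood", "genes": ["brca1"]},
    {"vendor": "Ambry", "specimen": "saliva", "genes": []},
    {"vendor": "Foundation Medicine", "specimen": "tissue", "genes": ["PIK3CA", "tp53"]},
]
harmonized = [harmonize(r) for r in raw_records]
brca1_cohort = find_cohort(harmonized, "BRCA1")
```

Once every vendor's output shares one record type, cohort identification reduces to a simple filter, which is the property the unified dataset is meant to provide.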
Results:
The MolHarmonizer framework processed 1,147 genomic reports from 1,703 breast cancer patients (2006-2023), issued by 23 different companies, and extracted and standardized actionable biomarkers across them. Data sources included Invitae (n=554), Ambry Genetics (n=189), Natera (n=95), Mayo Clinic (n=88), Tempus (n=63), Guardant Health (n=47), and Foundation Medicine (n=37), with each of the remaining vendors contributing fewer than 20 reports. Of the samples, 827/1,147 (72.1%) were germline (blood/saliva). Most patients underwent panel testing because of a personal or family history (n=925). Overall, 413/1,147 (36.0%) reports identified at least one mutation. Among breast cancer susceptibility genes, 75 reports showed BRCA1/2 mutations (37 BRCA1, 37 BRCA2, and one with both BRCA1 and BRCA2). Other mutations identified included PIK3CA (n=40), TP53 (n=96), PTEN (n=21), ESR1 (n=13), and AKT1/2 (n=8).
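The proportions and tallies reported above can be recomputed from the stated counts. The short Python check below uses only numbers given in the Results; the variable names and the `pct` helper are illustrative.

```python
TOTAL_REPORTS = 1147
TOTAL_VENDORS = 23

# Report counts for the vendors named in the Results.
named_vendors = {
    "Invitae": 554,
    "Ambry Genetics": 189,
    "Natera": 95,
    "Mayo Clinic": 88,
    "Tempus": 63,
    "Guardant Health": 47,
    "Foundation Medicine": 37,
}


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * count / total, 1)


germline_pct = pct(827, TOTAL_REPORTS)    # 72.1
mutation_pct = pct(413, TOTAL_REPORTS)    # 36.0

# BRCA1 only + BRCA2 only + both
brca_reports = 37 + 37 + 1                # 75

# Reports and vendors not itemized in the abstract.
named_total = sum(named_vendors.values())               # 1073
remaining_reports = TOTAL_REPORTS - named_total         # 74
remaining_vendors = TOTAL_VENDORS - len(named_vendors)  # 16
```

The 74 reports left over are spread across 16 vendors, which is consistent with each of those vendors contributing fewer than 20 reports.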
Conclusion:
MolHarmonizer, a framework leveraging Gemini LLMs, addresses genomic data heterogeneity by automating biomarker extraction and harmonization. This enables rapid cohort identification and deep retrospective analyses that support clinical insights, biomarker discovery, understanding of disease history, and novel pattern discovery (e.g., predicting BRCA1 mutations from whole-slide images), and it accelerates research within our neoadjuvant breast cancer cohort. Future plans include expanding to more than 20,000 breast cancer patients, developing a user-friendly chatbot, and ensuring inter-institutional adaptability for a variety of complex diseases.
Citation Format:
Krishna Rani Kalari, Xiaojia Tang, Thanmayee Boyapati, Tanya L. Hoskin, Sumathilatha Myla, Sumedha G. Penheiter, Richard M. Weinshilboum, Liewei Wang, Hamid R. Tizhoosh, Karthik Vikram Giridhar, Matthew P. Goetz, Judy C. Boughey. Development of an LLM framework for analysis of heterogeneous breast cancer patients genomic reports [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2756.
- Published: Apr 03, 2026
- Vol/Issue: 86(7_Supplement)
- Pages: 2756-2756