Journal article · Open Access · Jan 28, 2019

Machine Learning and Integrative Analysis of Biomedical Big Data

Genes, Vol. 10, No. 2, p. 87 · MDPI AG
DOI: 10.3390/genes10020087
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Metrics
Citations: 319
References: 233
Details
Published: Jan 28, 2019
Vol/Issue: 10(2)
Pages: 87
Funding: National Institutes of Health, Award R35-HL135772
Cite This Article
Bilal Mirza, Wei Wang, Jie Wang, et al. (2019). Machine Learning and Integrative Analysis of Biomedical Big Data. Genes, 10(2), 87. https://doi.org/10.3390/genes10020087