journal article Open Access Sep 22, 2023

Generative models for protein sequence modeling: recent advances and future directions

DOI: 10.1093/bib/bbad358
Abstract
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the scarcity of experimental fitness annotations underscores the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Realizing the full potential of ML models as tools for navigating protein fitness landscapes requires proficiency in the rapidly evolving fields of protein engineering and generative AI. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial networks and diffusion models), (ii) offering guidance on how to implement these models effectively on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that apply these techniques in protein engineering (from paratope region and subcellular localization prediction to the generation of high-fitness sequences and protein design rules). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications and current challenges, this study aims to provide structured guidance and a robust framework for a prospective outlook on the ML-driven protein engineering field.

Metrics
Citations: 49
References: 134
Details
Published: Sep 22, 2023
Vol/Issue: 24(6)
Funding
USDA Award: 13700968
Department of Chemical Engineering and Materials Science at Michigan State University
Cite This Article
Mehrsa Mardikoraem, Zirui Wang, Nathaniel Pascual, et al. (2023). Generative models for protein sequence modeling: recent advances and future directions. Briefings in Bioinformatics, 24(6). https://doi.org/10.1093/bib/bbad358