journal article Open Access Jan 24, 2022

A simple guide to de novo transcriptome assembly and annotation

DOI: 10.1093/bib/bbab563
Abstract
A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.