Welcome to the big leaves: Best practices for improving genome annotation in non‐model plant genomes

Abstract

Premise
Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein‐coding gene predictions.

Methods
The impact of repeat masking, long‐read and short‐read inputs, and de novo and genome‐guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.

Results
Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono‐exonic/multi‐exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA‐read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence‐based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. As implemented in the current workflows, adding protein evidence from de novo assemblies, genome‐guided transcriptome assemblies, or full‐length proteins from OrthoDB generates more putative false positives. Post‐processing with functional and structural filters is highly recommended.
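One of the structural benchmarks named above, the count of mono‐exonic versus multi‐exonic models, can be illustrated with a minimal sketch. It is not the study's own implementation: it assumes a GFF3 annotation in which every exon feature carries a Parent attribute naming its transcript, it counts per transcript rather than per gene, and the function name and example records are hypothetical.

```python
from collections import Counter


def count_mono_multi(gff3_lines):
    """Return (mono_exonic, multi_exonic) transcript counts from GFF3 lines.

    Assumption: each exon row has a Parent=<transcript id> attribute.
    Counting is per transcript, which is a simplification of per-gene counts.
    """
    exons_per_transcript = Counter()
    for line in gff3_lines:
        # Skip blank lines and GFF3 comments/directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A valid GFF3 feature row has nine tab-separated columns.
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        # An exon shared by several transcripts lists comma-separated parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    mono = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = len(exons_per_transcript) - mono
    return mono, multi


# Hypothetical records: transcript t1 has two exons, transcript t2 has one.
example = [
    "##gff-version 3",
    "chr1\tsketch\tmRNA\t1\t500\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tsketch\texon\t1\t200\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsketch\texon\t300\t500\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tsketch\tmRNA\t900\t1200\t.\t-\t.\tID=t2;Parent=g2",
    "chr1\tsketch\texon\t900\t1200\t.\t-\t.\tID=e3;Parent=t2",
]
mono, multi = count_mono_multi(example)
ratio = mono / multi if multi else float("inf")
print(f"mono-exonic: {mono}, multi-exonic: {multi}, ratio: {ratio:.2f}")
```

Tracking the mono‐exonic to multi‐exonic ratio alongside the other benchmarks gives a quick check on whether an annotation contains an unusually large share of single‐exon models, which may point to false positives that the structural filters are meant to remove.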

Discussion
While the annotation of non‐model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.




Details

Published: Jul 01, 2023
Vol/Issue: 11(4)

Authors

Vidya S. Vuruputoor