Journal article
bioRxiv, 2023
APA
Schulz, A. J., Zhai, J., Aubuchon-Elder, T. M., El-Walid, M., Ferebee, T. H., Gilmore, E. H., … Hsu, S. K. (2023). Fishing for a reelGene: evaluating gene models with evolution and machine learning. bioRxiv.
Chicago/Turabian
Schulz, Aimee J., Jingjing Zhai, Taylor M. Aubuchon-Elder, Mohamed El-Walid, Taylor H. Ferebee, Elizabeth H. Gilmore, M. Hufford, et al. “Fishing for a ReelGene: Evaluating Gene Models with Evolution and Machine Learning.” bioRxiv (2023).
MLA
Schulz, Aimee J., et al. “Fishing for a ReelGene: Evaluating Gene Models with Evolution and Machine Learning.” bioRxiv, 2023.
BibTeX
@article{aimee2023a,
title = {Fishing for a {reelGene}: evaluating gene models with evolution and machine learning},
year = {2023},
journal = {bioRxiv},
author = {Schulz, Aimee J. and Zhai, Jingjing and Aubuchon-Elder, Taylor M. and El-Walid, Mohamed and Ferebee, Taylor H. and Gilmore, Elizabeth H. and Hufford, M. and Johnson, Lynn and Kellogg, Elizabeth A. and La, T. and Long, Evan M. and Miller, Zachary R. and Romay, M. and Seetharam, Arun S. and Stitzer, Michelle C. and Wrightsman, Travis and Buckler, E. and Monier, B. and Hsu, Sheng-Kai}
}
Assembled genomes and their associated annotations have transformed our study of gene function. However, each new assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million gene models in maize, reelGene found that 28% were incorrectly annotated or nonfunctional. By leveraging a large cohort of related species and learning the conserved grammar of proteins, reelGene provides a tool for both evaluating gene model accuracy and studying genome biology.