Seqping: Gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data

Kuang Lim Chan, Rozana Rosli, Tatiana V. Tatarinova, Michael Hogan, Mohd Firdaus Mohd Raih, Eng Ti Leslie Low

Research output: Contribution to journal, Article

10 Citations (Scopus)

Abstract

Background: Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed using various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion.

Results: We present an automated gene prediction pipeline, Seqping, that uses self-training HMM models and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program to combine predictions from the three tools in association with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO's plantae dataset. Our evaluation shows that Seqping was able to generate better gene predictions compared to three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure).

Conclusions: Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs.
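The abstract reports accuracy values at the CDS, exon, and nucleotide levels but does not say how they are computed. As a rough illustration only, the sketch below uses one convention common in gene-prediction benchmarks: sensitivity is TP / (TP + FN), specificity is TP / (TP + FP), and accuracy is their mean, all counted over nucleotide positions. This convention, the function names, and the coordinates are assumptions made for the example and may not match the procedure used in the paper.

```python
def covered_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_scores(reference, predicted):
    """Return (sensitivity, specificity, accuracy) at the nucleotide level.

    Assumption: specificity is TP / (TP + FP) and accuracy is the mean of
    sensitivity and specificity; the paper may define these differently.
    """
    ref = covered_positions(reference)
    pred = covered_positions(predicted)
    true_positives = len(ref & pred)
    sensitivity = true_positives / len(ref) if ref else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy


# Made-up coordinates for a single hypothetical gene, not data from the paper.
reference_cds = [(101, 200), (301, 400)]
predicted_cds = [(151, 200), (301, 450)]

sn, sp, acc = nucleotide_scores(reference_cds, predicted_cds)
print(f"Sn={sn:.4f} Sp={sp:.4f} Acc={acc:.4f}")
```

Expanding intervals into position sets keeps the sketch short; for whole genomes an interval-overlap computation would be more memory-efficient.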

Original language: English
Journal: BMC Bioinformatics
Volume: 18
DOI: 10.1186/s12859-016-1426-6
Publication status: Published - 27 Jan 2017

Fingerprint

Plant Genome
Genome
Pipelines
Genes
Gene
Prediction
Markov Model
Arabidopsis Thaliana
Hidden Markov models
Model
Arabidopsis
Exons
Oryza Sativa
Nucleotides
Training
Benchmarking
Predict
Software Tools
Transcriptome
Open Reading Frames

Keywords

  • Gene model
  • Gene prediction
  • Species specific HMM

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Seqping: Gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data. / Chan, Kuang Lim; Rosli, Rozana; Tatarinova, Tatiana V.; Hogan, Michael; Mohd Raih, Mohd Firdaus; Low, Eng Ti Leslie.

In: BMC Bioinformatics, Vol. 18, 27.01.2017.

Chan, Kuang Lim; Rosli, Rozana; Tatarinova, Tatiana V.; Hogan, Michael; Mohd Raih, Mohd Firdaus; Low, Eng Ti Leslie. / Seqping: Gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data. In: BMC Bioinformatics. 2017; Vol. 18.
@article{bb5526d78a1a45ceb892f9d37f06dd5b,
title = "Seqping: Gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data",
keywords = "Gene model, Gene prediction, Species specific HMM",
author = "Chan, {Kuang Lim} and Rozana Rosli and Tatarinova, {Tatiana V.} and Michael Hogan and {Mohd Raih}, {Mohd Firdaus} and Low, {Eng Ti Leslie}",
year = "2017",
month = "1",
day = "27",
doi = "10.1186/s12859-016-1426-6",
language = "English",
volume = "18",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Seqping

T2 - Gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data

AU - Chan, Kuang Lim

AU - Rosli, Rozana

AU - Tatarinova, Tatiana V.

AU - Hogan, Michael

AU - Mohd Raih, Mohd Firdaus

AU - Low, Eng Ti Leslie

PY - 2017/1/27

Y1 - 2017/1/27

KW - Gene model

KW - Gene prediction

KW - Species specific HMM

UR - http://www.scopus.com/inward/record.url?scp=85013780998&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85013780998&partnerID=8YFLogxK

U2 - 10.1186/s12859-016-1426-6

DO - 10.1186/s12859-016-1426-6

M3 - Article

C2 - 28466793

AN - SCOPUS:85013780998

VL - 18

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

ER -