Flexible model-based clustering of mixed binary and continuous data: Application to genetic regulation and cancer

Zainul Abidin Fatin Nurzahirah, David R. Westhead

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Clustering is used widely in 'omics' studies and is often tackled with standard methods, e.g. hierarchical clustering. However, the increasing need for integration of multiple data sets leads to a requirement for clustering methods applicable to mixed data types, where the straightforward application of standard methods is not necessarily the best approach. A particularly common problem involves clustering entities characterized by a mixture of binary data (e.g. presence/absence of mutations, binding, motifs and epigenetic marks) and continuous data (e.g. gene expression, protein abundance, metabolite levels). Here, we present a generic method based on a probabilistic model for clustering this type of data, and illustrate its application to genetic regulation and the clustering of cancer samples. We show that the resulting clusters lead to useful hypotheses: in the case of genetic regulation, these concern regulation of groups of genes by specific sets of transcription factors, and in the case of cancer samples, combinations of gene mutations are related to patterns of gene expression. The clusters have potential mechanistic significance and in the latter case are significantly linked to survival. The method is available as a stand-alone software package (GNU General Public Licence) from http://github.com/BioToolsLeeds/FlexiCoClusteringPackage.git.

Original language: English
Article number: e53
Journal: Nucleic Acids Research
Volume: 45
Issue number: 7
DOI: 10.1093/nar/gkw1270
Publication status: Published - 20 Apr 2017

Fingerprint

Cluster Analysis
Neoplasms
Gene Expression
Mutation
Statistical Models
Licensure
Epigenomics
Genes
Transcription Factors
Software
Proteins

ASJC Scopus subject areas

  • Genetics

Cite this

Flexible model-based clustering of mixed binary and continuous data: Application to genetic regulation and cancer. / Fatin Nurzahirah, Zainul Abidin; Westhead, David R.

In: Nucleic Acids Research, Vol. 45, No. 7, e53, 20.04.2017.

@article{e90d9be949df4f97b79b1310f62177ec,
title = "Flexible model-based clustering of mixed binary and continuous data: Application to genetic regulation and cancer",
abstract = "Clustering is used widely in 'omics' studies and is often tackled with standard methods, e.g. hierarchical clustering. However, the increasing need for integration of multiple data sets leads to a requirement for clustering methods applicable to mixed data types, where the straightforward application of standard methods is not necessarily the best approach. A particularly common problem involves clustering entities characterized by a mixture of binary data (e.g. presence/absence of mutations, binding, motifs and epigenetic marks) and continuous data (e.g. gene expression, protein abundance, metabolite levels). Here, we present a generic method based on a probabilistic model for clustering this type of data, and illustrate its application to genetic regulation and the clustering of cancer samples. We show that the resulting clusters lead to useful hypotheses: in the case of genetic regulation, these concern regulation of groups of genes by specific sets of transcription factors, and in the case of cancer samples, combinations of gene mutations are related to patterns of gene expression. The clusters have potential mechanistic significance and in the latter case are significantly linked to survival. The method is available as a stand-alone software package (GNU General Public Licence) from http://github.com/BioToolsLeeds/FlexiCoClusteringPackage.git.",
author = "{Fatin Nurzahirah}, {Zainul Abidin} and Westhead, {David R.}",
year = "2017",
month = "4",
day = "20",
doi = "10.1093/nar/gkw1270",
language = "English",
volume = "45",
journal = "Nucleic Acids Research",
issn = "0305-1048",
publisher = "Oxford University Press",
number = "7",
}

TY - JOUR

T1 - Flexible model-based clustering of mixed binary and continuous data

T2 - Application to genetic regulation and cancer

AU - Fatin Nurzahirah, Zainul Abidin

AU - Westhead, David R.

PY - 2017/4/20

Y1 - 2017/4/20

N2 - Clustering is used widely in 'omics' studies and is often tackled with standard methods, e.g. hierarchical clustering. However, the increasing need for integration of multiple data sets leads to a requirement for clustering methods applicable to mixed data types, where the straightforward application of standard methods is not necessarily the best approach. A particularly common problem involves clustering entities characterized by a mixture of binary data (e.g. presence/absence of mutations, binding, motifs and epigenetic marks) and continuous data (e.g. gene expression, protein abundance, metabolite levels). Here, we present a generic method based on a probabilistic model for clustering this type of data, and illustrate its application to genetic regulation and the clustering of cancer samples. We show that the resulting clusters lead to useful hypotheses: in the case of genetic regulation, these concern regulation of groups of genes by specific sets of transcription factors, and in the case of cancer samples, combinations of gene mutations are related to patterns of gene expression. The clusters have potential mechanistic significance and in the latter case are significantly linked to survival. The method is available as a stand-alone software package (GNU General Public Licence) from http://github.com/BioToolsLeeds/FlexiCoClusteringPackage.git.

UR - http://www.scopus.com/inward/record.url?scp=85019108402&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85019108402&partnerID=8YFLogxK

U2 - 10.1093/nar/gkw1270

DO - 10.1093/nar/gkw1270

M3 - Article

C2 - 27994031

AN - SCOPUS:85019108402

VL - 45

JO - Nucleic Acids Research

JF - Nucleic Acids Research

SN - 0305-1048

IS - 7

M1 - e53

ER -