Framework for Parallel Preprocessing of Microarray Data Using Hadoop

Amirhossein Sahlabadi, Ravie Chandren Muniyandi, Mahdi Sahlabadi, Hossein Golshanbafghy

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Microarray technology has become a popular way to study gene expression and to diagnose disease. The National Center for Biotechnology Information (NCBI) hosts public databases containing large volumes of biological data that must be preprocessed because they carry high levels of noise and bias. Robust Multiarray Average (RMA) is a standard and widely used method for preprocessing such data and removing noise. Most preprocessing algorithms are time-consuming and cannot handle a large number of datasets with thousands of experiments. Parallel processing can address these issues. Hadoop is a well-known distributed storage and processing framework that provides a parallel environment well suited to running the experiment. In this research, for the first time, the capability of Hadoop and the statistical power of R are leveraged to parallelize the RMA preprocessing algorithm so that microarray data can be processed efficiently. The experiment was run on a cluster of 5 nodes, each with 16 cores and 16 GB of memory. It compares the efficiency and performance of RMA parallelized using Hadoop with RMA parallelized using the affyPara package and with sequential RMA. The results show that the proposed approach outperforms both the sequential approach and the affyPara approach in speed-up.
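The comparison described in the abstract rests on speed-up, which is conventionally the sequential run time divided by the parallel run time, and on parallel efficiency, which is speed-up divided by the number of workers. The following is a minimal sketch of that arithmetic in Python. The class and function names are hypothetical, and the timings are illustrative placeholders, not results from the paper. Only the cluster size (5 nodes with 16 cores each) comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """One timed preprocessing run (hypothetical container for illustration)."""
    label: str
    seconds: float  # wall-clock time of the run
    workers: int    # number of cores used by the run


def speedup(sequential_seconds: float, parallel_seconds: float) -> float:
    """Speed-up = sequential time / parallel time."""
    if sequential_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return sequential_seconds / parallel_seconds


def efficiency(speedup_value: float, workers: int) -> float:
    """Parallel efficiency = speed-up / number of workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup_value / workers


# Cluster size as stated in the abstract: 5 nodes, 16 cores per node.
NODES = 5
CORES_PER_NODE = 16
TOTAL_CORES = NODES * CORES_PER_NODE

# Placeholder timings chosen for easy arithmetic; NOT measurements from the paper.
sequential = RunTiming("sequential RMA", seconds=1000.0, workers=1)
parallel = RunTiming("parallel RMA", seconds=50.0, workers=TOTAL_CORES)

s = speedup(sequential.seconds, parallel.seconds)
e = efficiency(s, parallel.workers)
print(f"{parallel.label}: speed-up {s:.1f}x on {parallel.workers} cores, efficiency {e:.2f}")
```

An efficiency well below 1 in a calculation like this would indicate that the cores are not fully used, for example because of data transfer or job scheduling overhead.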

Original language: English
Article number: 9391635
Journal: Advances in Bioinformatics
Volume: 2018
DOI: 10.1155/2018/9391635
Publication status: Published - 1 Jan 2018

Fingerprint

Microarrays
Noise
Information Centers
Computer Communication Networks
Experiments
Databases
Technology
Efficiency
Gene Expression
Research
Data storage equipment
Processing
Datasets

ASJC Scopus subject areas

  • Biomedical Engineering
  • Biochemistry, Genetics and Molecular Biology (miscellaneous)
  • Computer Science Applications

Cite this

Framework for Parallel Preprocessing of Microarray Data Using Hadoop. / Sahlabadi, Amirhossein; Muniyandi, Ravie Chandren; Sahlabadi, Mahdi; Golshanbafghy, Hossein.

In: Advances in Bioinformatics, Vol. 2018, 9391635, 01.01.2018.

Sahlabadi, Amirhossein ; Muniyandi, Ravie Chandren ; Sahlabadi, Mahdi ; Golshanbafghy, Hossein. / Framework for Parallel Preprocessing of Microarray Data Using Hadoop. In: Advances in Bioinformatics. 2018 ; Vol. 2018.
@article{210b3942f8244c5dbfa284c75f8770b6,
title = "Framework for Parallel Preprocessing of Microarray Data Using Hadoop",
author = "Amirhossein Sahlabadi and Muniyandi, {Ravie Chandren} and Mahdi Sahlabadi and Hossein Golshanbafghy",
year = "2018",
month = "1",
day = "1",
doi = "10.1155/2018/9391635",
language = "English",
volume = "2018",
journal = "Advances in Bioinformatics",
issn = "1687-8027",
publisher = "Hindawi Publishing Corporation",

}

TY - JOUR

T1 - Framework for Parallel Preprocessing of Microarray Data Using Hadoop

AU - Sahlabadi, Amirhossein

AU - Muniyandi, Ravie Chandren

AU - Sahlabadi, Mahdi

AU - Golshanbafghy, Hossein

PY - 2018/1/1

Y1 - 2018/1/1

UR - http://www.scopus.com/inward/record.url?scp=85045480436&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045480436&partnerID=8YFLogxK

U2 - 10.1155/2018/9391635

DO - 10.1155/2018/9391635

M3 - Article

VL - 2018

JO - Advances in Bioinformatics

JF - Advances in Bioinformatics

SN - 1687-8027

M1 - 9391635

ER -