Incorporating fault tolerance with replication on very large scale grids

Elankovan A Sundararajan, Aaron Harwood, Ramamohanarao Kotagiri

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

Providing fault tolerance for message-passing parallel applications in a distributed environment is the rule rather than the exception. Without fault tolerance, a single node failure can stop the whole computation, which then has to be restarted from the beginning. However, introducing fault tolerance adds overhead that reduces the achievable speedup. In this paper, we introduce a new technique, called replication with cross-over packets, to improve reliability and fault tolerance over Very Large Scale Grids (VLSG). The technique has the two-pronged effect of avoiding both a single point of failure and a single link of failure. We incorporate the technique into the L-BSP model and show the possible speedup of a parallel process. We also derive the achievable speedup for some fundamental parallel algorithms using this technique.

Original language: English
Title of host publication: Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings
Pages: 319-328
Number of pages: 10
DOIs: https://doi.org/10.1109/PDCAT.2007.4420186
Publication status: Published - 2007
Event: 18th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2007 - Adelaide, SA
Duration: 3 Dec 2007 to 6 Dec 2007

Other

Other: 18th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2007
City: Adelaide, SA
Period: 3/12/07 to 6/12/07

Fingerprint

Fault tolerance
Message passing
Parallel algorithms

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

A Sundararajan, E., Harwood, A., & Kotagiri, R. (2007). Incorporating fault tolerance with replication on very large scale grids. In Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings (pp. 319-328). [4420186] https://doi.org/10.1109/PDCAT.2007.4420186

@inproceedings{cff38dd3c85e4cdfa84db5adad8219fc,
title = "Incorporating fault tolerance with replication on very large scale grids",
author = "{A Sundararajan}, Elankovan and Aaron Harwood and Ramamohanarao Kotagiri",
year = "2007",
doi = "10.1109/PDCAT.2007.4420186",
language = "English",
isbn = "0769530494",
pages = "319--328",
booktitle = "Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings",

}

TY - GEN

T1 - Incorporating fault tolerance with replication on very large scale grids

AU - A Sundararajan, Elankovan

AU - Harwood, Aaron

AU - Kotagiri, Ramamohanarao

PY - 2007

Y1 - 2007

UR - http://www.scopus.com/inward/record.url?scp=48049121260&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=48049121260&partnerID=8YFLogxK

U2 - 10.1109/PDCAT.2007.4420186

DO - 10.1109/PDCAT.2007.4420186

M3 - Conference contribution

AN - SCOPUS:48049121260

SN - 0769530494

SN - 9780769530499

SP - 319

EP - 328

BT - Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings

ER -