Fault Tolerance for Scalable Applications
Checkpointing Protocols for Parallel Message-Passing-Systems
©2003
Monographs
XIV, 216 Pages
Summary
Parallel or distributed systems make it possible to tackle «grand challenge» problems. Because such high-performance computing systems are complex and today's simulations run for a long time, the probability of a failure during a program run cannot be neglected. This work considers fault tolerance, specifically user-transparent checkpointing. The analysis is performed using simulations, and real implementations are deployed to verify the results. The aim is to give a simple approximation of the overhead generated by checkpointing protocols. In addition, it is shown in which situations more complex checkpointing protocols are useful compared with very simple approaches.
Details
- Pages: XIV, 216
- Publication Year: 2003
- ISBN (Softcover): 9783899759006
- Language: English
- Keywords: fault tolerance, parallel computers, computer science, multicomputer systems, high-performance computing, scalability, checkpoint
- Published: München, 2003. XIV, 216 pp.
- Product Safety: Peter Lang Group AG