Checkpointing is an indispensable technique for providing fault tolerance to long-running, high-throughput applications such as those that run on desktop grids. In these environments, a dedicated checkpoint storage system can offer multiple benefits: it can reduce the load on a traditional file system, offer high performance through specialization, and optimize checkpoint data management by taking application semantics into account. Such a storage system can present a unifying abstraction for checkpoint operations while hiding the fact that there are no dedicated resources to store the checkpoint data. This paper presents a dedicated checkpoint storage system for desktop grid environments. Our solution uses scavenged disk space from participating desktops to build an inexpensive storage space, and offers a traditional file system interface for easy integration with checkpointing applications. We describe the architecture of our checkpoint storage system, key write optimizations for high-speed I/O, and support for incremental checkpointing and checkpoint data availability. Our evaluation indicates that such a storage system can offer an application-perceived checkpoint write I/O bandwidth as high as 135 MB/s and can be viable in a desktop grid setting.