ITALC

 

Purpose

Raising the level of abstraction of application-level checkpointing.

Overview

The computational resources at open-science supercomputing centers are shared among many users at any given time and are therefore governed by policies that ensure fair and efficient usage. Such policies can impose upper limits on (1) the number of compute nodes and (2) the wall-clock time that can be requested per computational job.

Given these limits, some applications may not run to completion in a single job. As a workaround, users are advised to use the checkpoint-and-restart technique and spread their computations across multiple interdependent jobs. This technique periodically saves the execution state of an application.

A saved state is known as a checkpoint. When a job times out after running for the maximum wall-clock time and leaves its computation incomplete, the user can submit a new job that resumes the computation from the checkpoints saved during the previous run. The checkpoint-and-restart technique can also make applications tolerant to certain types of faults, such as network and compute-node failures.
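The checkpoint-and-restart cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that our tool generates: the function names, the JSON checkpoint format, and the use of a per-job step budget (`steps_per_job`) as a stand-in for the wall-clock limit are all hypothetical.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file and rename it, so an interrupted write
    never leaves a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_job(path, total_steps, steps_per_job, checkpoint_every=10):
    """Simulate one job. steps_per_job stands in for the wall-clock limit.

    Returns True if the computation finished within this job.
    """
    state = load_checkpoint(path)  # restart from the last checkpoint
    for _ in range(steps_per_job):
        if state["step"] >= total_steps:
            break
        state["total"] += state["step"]  # placeholder for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    # Work done after the last checkpoint is lost when the job "times out".
    return state["step"] >= total_steps


def run_to_completion(path, total_steps, steps_per_job):
    """Resubmit jobs until done. Assumes steps_per_job >= checkpoint_every;
    otherwise no job would ever reach a checkpoint."""
    jobs = 1
    while not run_job(path, total_steps, steps_per_job):
        jobs += 1
    return jobs, load_checkpoint(path)
```

In this sketch, each job that times out loses only the work done since its last checkpoint, and the next job picks up from there. A real application would checkpoint on a time interval chosen to balance the I/O cost against the amount of work at risk.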

When this technique is built into the application itself, it is called Application-Level Checkpointing (ALC). We are developing an interactive tool that assists users in semi-automatically inserting the ALC mechanism into their existing applications without any manual reengineering. Compared to other checkpointing approaches, the checkpoints written with our tool have a smaller memory footprint and thus incur a smaller I/O overhead.
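One general reason an application-level checkpoint can be small is that the application knows which variables are actually needed to resume, and can skip everything that is recomputable. The sketch below illustrates that idea only. It does not show how our tool selects variables, and the names (`make_state`, `application_level_snapshot`, the `scratch` buffer) are hypothetical.

```python
import pickle


def make_state(n):
    """Build a toy application state with one recomputable buffer."""
    grid = [float(i) for i in range(n)]   # needed to resume
    scratch = [0.0] * (4 * n)             # temporary workspace, recomputable
    return {"iteration": 7, "grid": grid, "scratch": scratch}


def full_snapshot(state):
    """Serialize everything, as a stand-in for saving the whole process state."""
    return pickle.dumps(state)


def application_level_snapshot(state, keep=("iteration", "grid")):
    """Serialize only the variables required to resume the computation."""
    return pickle.dumps({name: state[name] for name in keep})


def restore(blob):
    """Rebuild a usable state, re-creating whatever was left out."""
    state = pickle.loads(blob)
    state.setdefault("scratch", [0.0] * (4 * len(state["grid"])))
    return state
```

Comparing `len(application_level_snapshot(state))` with `len(full_snapshot(state))` for the same state shows the smaller checkpoint, while `restore` still yields a state with the same `iteration` and `grid` values.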

Impact

ITALC is currently under development.

Paper Reference

Ritu Arora, Trung Nguyen, "ITALC: Interactive Tool for Application-Level Checkpointing", HUST17 workshop at SC17, November 2017.

Ritu Arora

Research Associate, High Performance Computing Group
rauta@tacc.utexas.edu | 512-475-9411