Checkpointing Jobs
Introduction to Checkpointing
Checkpoint
Checkpointing is a technique in which a job saves its progress at intervals so that, if it has to stop, it can continue from where it left off. Some software has native support for checkpoint files (such as Gaussian), while other software has no built-in mechanism for it.
Why Checkpoint?
It is good practice to add a checkpoint at the end of any section of your code that performs a heavy computation. If your job is paused or restarted for some reason (for example, an interruption or backfilling), you will not have to spend time redoing the computation that was completed before the interruption occurred.
Benefits of Using Checkpoints
- Debugging
- Monitoring
- Coping with nodes failing prematurely or unexpected maintenance
- Getting through scheduled maintenance for jobs that run longer than 7 days
- Relieving heavy usage that is held up by long-running jobs
How it works
The idea of checkpointing is to save the state of a program each time a checkpoint is reached, so that after an unplanned interruption the program can restart from the most recent checkpoint rather than from the beginning.
Checkpointing often slows down the execution of your program; however, it is still good practice to add a checkpoint after every heavy computation.
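For software with no built-in checkpoint support, the save-and-resume idea above can be sketched at the application level. The following is a minimal Python illustration, not a prescribed method: the file name `checkpoint.json`, the function names, and the `stop_after` parameter (used only to simulate an interruption) are all assumptions made for this example.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name for this sketch


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the target.

    The rename step means an interruption during the write cannot leave a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Resume from the last checkpoint and run up to n_steps.

    stop_after is a hypothetical knob that simulates an interruption by
    returning early without saving.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if stop_after is not None and step >= stop_after:
            return state  # simulated interruption: unsaved progress is lost
        state["total"] += step * step  # stand-in for a heavy computation
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run(CHECKPOINT_FILE, 50, stop_after=25)` and then `run(CHECKPOINT_FILE, 50)` resumes from step 20, the last saved checkpoint, so only steps 20 to 24 are repeated. A smaller `checkpoint_every` loses less work on interruption but writes to disk more often, which is the slowdown mentioned above.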
DMTCP (Distributed MultiThreaded Checkpointing)
DMTCP is an open-source tool that enables transparent checkpointing of distributed and multi-threaded applications. It can checkpoint and restart a range of applications without modifying the application or the OS.
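As a rough illustration of the workflow, the sketch below builds the three DMTCP command lines usually involved: launching a program under DMTCP, requesting a checkpoint, and restarting from a checkpoint image. The Python helper names, the program name `./my_simulation`, and the image file name are placeholders invented for this example; the commands are only assembled and printed, not executed, and the exact options available may differ between DMTCP versions, so check `dmtcp_launch --help` on your system.

```python
import shlex


def launch_command(program_argv, interval_seconds=None):
    """Build a dmtcp_launch command line for the given program.

    If interval_seconds is set, ask DMTCP to checkpoint periodically.
    """
    cmd = ["dmtcp_launch"]
    if interval_seconds is not None:
        cmd += ["--interval", str(interval_seconds)]
    return cmd + list(program_argv)


def checkpoint_command():
    """Build the command that asks the coordinator for a checkpoint now."""
    return ["dmtcp_command", "--checkpoint"]


def restart_command(image_paths):
    """Build a dmtcp_restart command line from one or more image files."""
    return ["dmtcp_restart"] + list(image_paths)


def describe_workflow(program_argv, image_paths, interval_seconds=None):
    """Return the three steps as shell-quoted strings, in order."""
    steps = [
        launch_command(program_argv, interval_seconds),
        checkpoint_command(),
        restart_command(image_paths),
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # Placeholder program and image names, for illustration only.
    for line in describe_workflow(
        ["./my_simulation", "input.dat"],
        ["ckpt_my_simulation.dmtcp"],
        interval_seconds=3600,
    ):
        print(line)
```

In a batch job, the first command would typically go in the job script, and the restart command in the script that is resubmitted after an interruption.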
What happens during a checkpoint (operating-system-level details):
- The user (or program) tells the coordinator to execute a checkpoint.
- The coordinator sends a ckpt message to the checkpoint thread.
- The checkpoint thread sends a signal (SIGUSR2) to each user thread.
- The user thread enters the signal handler defined by libdmtcp.so, and then it blocks there.
- Now the checkpoint thread can copy all of user memory to a checkpoint image file, while the user threads are blocked.
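The block-then-copy sequence above can be mimicked in a toy Python sketch. This is an analogy under stated assumptions, not DMTCP's implementation: DMTCP stops user threads with a signal at arbitrary points, whereas Python delivers signals only to the main thread, so the sketch uses cooperative "safe points" instead. The `Coordinator` class, its methods, and the `worker` function are hypothetical names, and the sketch assumes a single thread calls `checkpoint()`.

```python
import copy
import threading


class Coordinator:
    """Toy stand-in for the checkpoint thread: block workers, then copy state."""

    def __init__(self, n_workers):
        # Stand-in for "user memory": one small record per worker.
        self.state = {i: {"a": 0, "b": 0} for i in range(n_workers)}
        self._cond = threading.Condition()
        self._checkpoint_requested = False
        self._active = n_workers  # workers that have not finished yet
        self._blocked = 0         # workers currently waiting at a safe point

    def safe_point(self):
        """Called by workers; blocks while a checkpoint is in progress."""
        with self._cond:
            if not self._checkpoint_requested:
                return
            self._blocked += 1
            self._cond.notify_all()
            while self._checkpoint_requested:
                self._cond.wait()
            self._blocked -= 1

    def worker_done(self):
        """Called once by each worker when it finishes."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def checkpoint(self):
        """Wait until every active worker is blocked, then copy the state."""
        with self._cond:
            self._checkpoint_requested = True
            while self._blocked < self._active:
                self._cond.wait()
            image = copy.deepcopy(self.state)  # the "checkpoint image"
            self._checkpoint_requested = False
            self._cond.notify_all()
        return image


def worker(coord, worker_id, iterations):
    """Update two fields per iteration; they only match at safe points."""
    cell = coord.state[worker_id]
    for _ in range(iterations):
        cell["a"] += 1  # between these two writes the record is inconsistent
        cell["b"] += 1
        coord.safe_point()
    coord.worker_done()
```

Because the copy is made only while every unfinished worker is blocked, each record in the returned image has matching `a` and `b` values, which is the property that blocking the user threads is meant to guarantee. The worker threads must be started before `checkpoint()` is called, since the call waits for them to reach a safe point or finish.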
Features:
- Transparent checkpointing of single and distributed applications.
- Supports MPI, sockets, InfiniBand, and many other communication protocols.
- Works with multithreaded programs.
- Supports checkpointing to a remote filesystem.