DMTCP (Distributed MultiThreaded Checkpointing)
Overview
DMTCP is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets.
Availability
Cluster | Module/Version |
---|---|
BOSE | dmtcp/2.6.0 |
BGSC | Not Available |
Note: You can simply use "module load dmtcp" to activate the most recently installed version of this software.
Arguments / Options
SYNOPSIS
- dmtcp_coordinator [port]
- dmtcp_launch command [args...]
- dmtcp_restart ckpt_FILE1.dmtcp [ckpt_FILE2.dmtcp...]
- dmtcp_command coordinatorCommand
This is a list of arguments for the DMTCP commands that we wanted to highlight. Use "man dmtcp" or this online man page for a full list.
Option | Description |
---|---|
--help, -h | Show the command-line options for each command |
Example Usage
- In a separate terminal window, start the dmtcp_coordinator.
- In separate terminal(s), prefix each command you want checkpointed with dmtcp_launch, i.e. run "dmtcp_launch [command]".
The checkpointed program will connect to the coordinator specified by DMTCP_HOST and DMTCP_PORT. New threads will be checkpointed as part of the process. Child processes will automatically be checkpointed. Remote processes started via ssh will also automatically be checkpointed. (Internally, DMTCP modifies the ssh command line to call dmtcp_launch on the remote host.)
- To manually initiate a checkpoint, either run "dmtcp_command --checkpoint" from another terminal or type "c" followed by Enter into the coordinator. Checkpoint files for each process will be written to DMTCP_CHECKPOINT_DIR. The dmtcp_coordinator will write "dmtcp_restart_script.sh" to its working directory. This script contains the necessary calls to dmtcp_restart to restart the entire computation, including remote processes created via ssh.
- To restart, execute dmtcp_restart_script.sh, which the dmtcp_coordinator creates in its working directory at the time of the checkpoint. You can optionally edit this script to migrate processes to different hosts. By default, only one of the restarted processes will run in the foreground and receive standard input. The script may be edited to choose which process is restarted in the foreground.
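The steps above can be sketched as a set of small shell wrappers. This is an illustrative sketch, not a site-provided tool: the function names, the DMTCP_SKETCH_PORT variable, and the port number 7779 are assumptions made for this example. The "-p" and "--checkpoint" flags are used as they appear in the DMTCP 2.x command-line help; confirm them with "--help" for the version installed on your cluster.

```shell
#!/bin/bash
# Thin wrappers around the DMTCP commands used in the steps above.
# Assumption: the port (7779) and all function names are hypothetical.
DMTCP_SKETCH_PORT=${DMTCP_SKETCH_PORT:-7779}

# Step 1: start the coordinator. It stays in the foreground and reads
# commands such as "c", so run it in its own terminal.
start_coordinator() {
    dmtcp_coordinator -p "$DMTCP_SKETCH_PORT"
}

# Step 2: run a program (and its arguments) under checkpoint control.
launch_app() {
    dmtcp_launch -p "$DMTCP_SKETCH_PORT" "$@"
}

# Step 3: ask the coordinator to checkpoint every connected process.
request_checkpoint() {
    dmtcp_command -p "$DMTCP_SKETCH_PORT" --checkpoint
}

# Step 4: restart from the script the coordinator wrote.
# $1 is the directory holding dmtcp_restart_script.sh (default: current dir).
restart_job() {
    local dir=${1:-.}
    if [ ! -f "$dir/dmtcp_restart_script.sh" ]; then
        echo "no dmtcp_restart_script.sh found in $dir" >&2
        return 1
    fi
    bash "$dir/dmtcp_restart_script.sh"
}
```

After sourcing these definitions, a typical session would call start_coordinator in one terminal, "launch_app ./my_app" in a second (./my_app being a placeholder program), request_checkpoint from a third, and restart_job later from the coordinator's working directory.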
Sample Slurm Script
submit.sh
#!/bin/bash
# -- SLURM SETTINGS -- #
# [..] other settings here [..]
# The following settings are for the overall request to Slurm
#SBATCH --job-name="YourJobName" # Job Name - Change this to your desired name
#SBATCH --ntasks-per-node=32 # How many CPU cores do you want to request
#SBATCH --nodes=1 # How many nodes do you want to request
#SBATCH --open-mode=append
# -- SCRIPT COMMANDS -- #
# Load the needed modules
module load dmtcp # Load DMTCP
ckptdir=$SLURM_JOB_NAME # Checkpoint directory, named after the job. Use a different job name if you want a fresh checkpoint directory
mkdir -p "$ckptdir"
export DMTCP_CHECKPOINT_DIR=$ckptdir # DMTCP environment variable. Tells DMTCP where to write checkpoint images
export DMTCP_CHECKPOINT_INTERVAL=10 # How often to checkpoint, in seconds. 10 means every 10 seconds.
if ! ls -1 "$ckptdir" | grep -q dmtcp_restart_script # True when no restart script exists yet, i.e. no checkpoint has been taken. This is the case the first time the job is submitted.
then
    echo "Using dmtcp_launch to start the app the first time"
    dmtcp_launch ./cpp
else # A checkpoint already exists in the checkpoint directory.
    echo "Using dmtcp_restart from $ckptdir to continue from a checkpoint"
    dmtcp_restart "$ckptdir"/ckpt_*.dmtcp
fi
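The launch-or-restart decision in submit.sh can also be factored into a reusable function. The sketch below is one possible variant, not the script used above: the names has_checkpoint and run_or_restart are hypothetical, and it checks for the checkpoint images themselves (ckpt_*.dmtcp) rather than for dmtcp_restart_script.sh, since those images are what dmtcp_restart is given.

```shell
#!/bin/bash
# Sketch: choose between a first launch and a restart.
# Assumption: function names are illustrative, not part of DMTCP.

# has_checkpoint DIR
# Succeeds if DIR contains at least one ckpt_*.dmtcp image.
has_checkpoint() {
    local dir=$1 f
    for f in "$dir"/ckpt_*.dmtcp; do
        # If nothing matches, the pattern stays literal and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# run_or_restart DIR COMMAND [ARGS...]
# Restarts from the images in DIR if any exist, otherwise launches COMMAND.
run_or_restart() {
    local dir=$1
    shift
    mkdir -p "$dir"
    export DMTCP_CHECKPOINT_DIR=$dir   # where new checkpoint images are written
    if has_checkpoint "$dir"; then
        echo "Restarting from checkpoint images in $dir"
        dmtcp_restart "$dir"/ckpt_*.dmtcp
    else
        echo "No checkpoint found in $dir; launching: $*"
        dmtcp_launch "$@"
    fi
}
```

With these definitions in place, the if/else block in submit.sh could be replaced by a single call such as run_or_restart "$SLURM_JOB_NAME" ./cpp, and resubmitting the same script with sbatch would continue from the latest checkpoint.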