Determining Resources
In Progress
This guide is still under active development and will be completed before the official launch of the website.
One of the challenging parts of using a computing cluster is identifying the resources your job needs when you submit it, such as memory and CPU cores. The goal of this page is to provide guidance on the various resources and tips you can use to find the 'sweet spot' that works best for you.
All of these resource requests are controlled through the `#SBATCH --field=value` options that you specify in your Slurm script. You can also append them to your Slurm commands so they only take effect for a single job, such as `sbatch --field=value my-script.sh` or `sinteract --field=value`. Options given on the command line override any defaults and anything specified in your sbatch script.
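For example, a minimal sketch of a job script that combines several of these options might look like the following; the option values and the `my-program` command are placeholders, not recommendations for your work.

```bash
#!/bin/bash
#SBATCH --partition=week     # which partition to run in
#SBATCH --time=1-00:00:00    # walltime: 1 day
#SBATCH --ntasks=4           # number of CPU cores (tasks)
#SBATCH --mem=8G             # memory for the job
#SBATCH --job-name=example   # name shown in the queue

# Placeholder command: replace with your own program and input
./my-program input.dat
```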
Partition
Clusters are broken up into what are known as 'partitions'. This is a feature of job scheduling systems like Slurm that groups machines together based on aspects such as hardware support, time limits, or dedication to a certain group.
Determining A Partition
The majority of jobs that use our computing infrastructure fall under the "week" and "GPU" partitions, but some may need more memory or time.
| Partition | Usage |
|---|---|
| week | For most CPU-based jobs that can run up to a week - this is the default if not specified |
| GPU | For jobs that need a graphics card |
| month | For jobs that need to run up to a month ('batch' on BGSC) |
| highmemory | (BOSE Only) For jobs that need more than 250GB of memory |
Each cluster has its own set of partitions, which you can view in more detail by clicking the button below.
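If you would like to check from the command line, the standard Slurm `sinfo` command lists the partitions on the cluster you are logged into, along with their time limits and node states:

```bash
# List all partitions, their time limits, and node states
sinfo

# Condense the output to one line per partition
sinfo --summarize
```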
Setting A Partition
Partitions are set through the `--partition=X` option in Slurm.
Batch (sbatch) Mode:

In your Slurm script:

```bash
#SBATCH --partition=GPU
```

Temporary - only for a single job:

```bash
sbatch --partition=GPU my-script.sh
```

Interactive Mode:

```bash
sinteract --partition=GPU
```
Time (Walltime)
Walltime, which you can think of as a time limit, is the amount of time a job is allowed to run. Once a job runs past its set walltime, it will automatically terminate, or requeue if enabled. When determining when a pending or queued job is going to start, Slurm looks at the walltime of all of the other submitted jobs to provide a best estimate. To assist with scheduling, it's important to have a job's walltime be as accurate as possible.
Most of our partitions are set to a default maximum walltime of 7 days.
Determining A Walltime
If you are unsure how long your script will run for, feel free to let it use the default walltime of 7 days on the 'week' or 'GPU' partition. This is done by not specifying a time in your Slurm script.
Setting A Walltime
Setting your job's walltime is done using the `--time` option in Slurm. Its typical format is as follows:

```bash
#SBATCH --time=DD-HH:MM:SS
```
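For example, a job that you expect to finish within two and a half days could request the following; the value is only illustrative.

```bash
# Request 2 days and 12 hours of walltime in your Slurm script
#SBATCH --time=2-12:00:00

# Or override it for a single submission on the command line:
#   sbatch --time=2-12:00:00 my-script.sh
```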
Tracking Walltime
Once your job completes, you can either view the email notification you received for the elapsed time (if enabled), or use the `seff <jobid>` command to show the "Job Wall-clock time".
Use this number to better inform your next run of similar jobs.
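For example (the job ID below is a placeholder; replace it with the ID of your own job):

```bash
# Summarize a completed job's resource usage; the report includes
# "Job Wall-clock time" along with CPU and memory efficiency
seff 123456
```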
CPUs and Cores
Determining Number of Cores
When reading your program's documentation, some keywords to look out for are "parallel", "multiprocessing", "multicore", "number of processes", and "MPI".
If you see any of those terms, it's an indication that your program may work well with an increased number of CPU cores (`#SBATCH --ntasks`) or even multiple nodes, if it supports MPI (`#SBATCH --nodes`).
Setting Number of Cores
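As a short sketch, cores are requested with the same `--ntasks` option mentioned above; the value of 4 is only an example.

```bash
# Request 4 CPU cores (tasks) for this job
#SBATCH --ntasks=4
```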
Tracking CPU Usage
Memory
Determining Memory Usage
While it differs for each program, a good starting point is to request at least as much memory (`#SBATCH --mem`) as the size of your biggest input file, then adjust based on what your jobs actually use.
Setting Memory
Memory is requested through the `--mem` option in Slurm (see the sketch at the end of this section).
Tracking Memory Usage
Once your job completes, the `seff <jobid>` command reports how much memory the job used, which you can use to better size your next request.
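As a short sketch of this workflow (the 16G value is a placeholder; `<jobid>` is the ID Slurm printed when you submitted the job):

```bash
# In your Slurm script: request 16 GB of memory on the node
#SBATCH --mem=16G

# After the job completes, compare the request against actual usage:
#   seff <jobid>    # check the reported memory utilization
```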
GPU (Graphics Processing Unit)
Determining GPU Usage
To take advantage of a GPU, programs need to be specifically built to support them. Keywords to look out for are "CUDA" and "GPU accelerated".
Setting GPU Usage
To request a GPU, use the GPU partition (`#SBATCH --partition=GPU`) and request how many GPUs you want to use (`#SBATCH --gpus=1`).
By default, choose one GPU unless you know your program supports and benefits from distributing work across multiple devices. Users are able to request up to three GPUs on a single machine.
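Putting these together, a minimal GPU job script might look like the following sketch; the program and input names are placeholders.

```bash
#!/bin/bash
#SBATCH --partition=GPU     # run on the GPU partition
#SBATCH --gpus=1            # request a single GPU
#SBATCH --time=1-00:00:00   # one day of walltime

# Placeholder command: replace with your GPU-accelerated program
./my-gpu-program input.dat
```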
Tracking GPU Usage
Still Not Sure?
Figuring out what you'll need is not always easy, and may take many attempts to see what does or does not work. If you get stuck, we encourage you to reach out to us. We'd be glad to work with you to see what we can do to make your research or project a success.
Just let us know:
- Your plans for the project -- What are you doing?
- What software you are using -- Every program is different, so this will give us a starting point
- What you have and have not tried -- Did you run into any errors? Did something not work the way you expected?