
Determining Resources

In Progress

This guide is still under active development and will be completed before the official launch of the website.

One of the challenging parts of using a computing cluster is identifying the resources you'll need to request when you submit a job, such as memory and CPU cores. The goal of this page is to provide guidance on these resources, along with tips you can use to 'find the sweet spot' between performance and time that works best for you.

"Between performance and time"

When determining resources, it can be tempting to request as much as possible so your program runs in the least amount of time.

However, not all software is able to take advantage of additional resources, and most programs only scale well up to a point before additional resources have minimal impact on run time.

Additionally, if you request more resources than what is currently available on the cluster (such as requesting eight nodes but only four are open), your job may have to wait a long time in the queue before it is able to start.

All of these resource requests are controlled through the #SBATCH --field=value options that you specify in your Slurm script. You can also append them to your Slurm commands so they take effect for only a single job, such as sbatch --field=value my-script.sh or sinteract --field=value. These will override any defaults or anything specified in your sbatch script.
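
As a concrete example, a minimal job script might look something like the following; the partition, resource amounts, and program name are placeholders to adjust for your own work:

#!/bin/bash
# Request a single core and 8 GB of memory for up to one day on the week partition
#SBATCH --partition=week
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
#SBATCH --mem=8G

./my-program input.dat

Submitting it as sbatch --time=2-00:00:00 my-script.sh would override the one-day walltime above for that single job.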

Partition

Clusters are broken up into what are known as 'partitions'. Partitions are a feature of job scheduling systems like Slurm that group machines together based on aspects such as hardware support, time limits, or dedication to a certain group.

Determining A Partition

The majority of jobs that use our computing infrastructure fall under the week and GPU partitions, but some may need to run longer or have access to more memory.

Partition     Usage
week          For most CPU-based jobs that can run up to a week; this is the default if not specified
GPU           For jobs that want to use a graphics card; see GPU (Graphics Processing Unit)
month         For jobs that need to run up to a month ('batch' on BGSC)
highmemory    (BOSE only) For jobs that need more than 250 GB of memory, up to a maximum of 1 TB per node

Each cluster has its own set of partitions, which you can view in more detail by clicking the button below.

View Full List

Setting A Partition

Partitions are set through the --partition=X option in Slurm.

Batch (sbatch) Mode:

In your Slurm script:

#SBATCH --partition=GPU

Temporary - only for a single job:

sbatch --partition=GPU my-script.sh

Interactive Mode:

sinteract --partition=GPU


Time (Walltime)

Walltime, which can be thought of as a time limit, is the amount of time that a job is allowed to run for. Once a job runs past its set walltime, it will automatically be terminated, or requeued if requeueing is enabled. When estimating when a pending/queued job will start, Slurm looks at the walltime of all of the other submitted jobs to provide a best estimate. To assist with scheduling, it's important to have a job's walltime be as accurate as possible.

Most of our partitions are set to a default maximum walltime of 7 days.

Determining A Walltime

If you are unsure how long your script will run for, feel free to let it use the default walltime of 7 days on the 'week' or 'GPU' partition. This is done by not specifying a time in your Slurm script.

Setting A Walltime

Setting your job's walltime is done using the --time option in Slurm. Its typical format is as follows:

#SBATCH --time=DD-HH:MM:SS
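
For example, to request a walltime of 2 days and 12 hours:

#SBATCH --time=2-12:00:00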

Tracking Walltime

Once your job completes, you can either view the email notification you received, if enabled, or run myjobreport jobidhere to view the Elapsed Time and compare it to the initially requested Time Limit.

Use this number to better inform your next run of similar jobs.
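
If you prefer a standard Slurm command, sacct can show the same comparison, assuming job accounting is enabled on the cluster:

sacct -j jobidhere --format=JobID,Elapsed,Timelimit,State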

Need An Extension?

If you have a running job that is nearing its time limit (check with myjobs) and it cannot be restarted without significant loss of work, you can contact us to request an extension.

Please note that requests are subject to approval of the HPC Team and may be limited to accommodate other users on the cluster. This is especially true for limited resources such as GPUs and high-memory nodes.


CPUs and Cores

Determining Number of Cores

When reading your software's documentation, some keywords to look out for are "parallel", "multiprocessing", "multicore", "number of processes", and "MPI".

If you see any of those terms, it is an indication that your program may work well with an increased number of CPU cores (#SBATCH --ntasks) or even an increased number of nodes (#SBATCH --nodes, if the program supports MPI).

Setting Number of Cores
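
As a rough sketch based on the options mentioned above (the counts are placeholders to tune for your program):

#SBATCH --ntasks=4

For MPI-capable programs that can spread work across machines, you can also request multiple nodes:

#SBATCH --nodes=2
#SBATCH --ntasks=16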

Tracking CPU Usage
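
One standard way to check how much CPU time a completed job actually used is sacct, assuming job accounting is enabled. Comparing TotalCPU against Elapsed multiplied by the number of allocated cores gives a rough sense of how busy your cores were:

sacct -j jobidhere --format=JobID,AllocCPUS,TotalCPU,Elapsed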


Memory

Determining Memory Usage

While it differs for each program, a good starting point is to have your memory request (#SBATCH --mem) be at least as large as your biggest file.

Setting Memory
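
A minimal sketch of a memory request; the amount is a placeholder to adjust based on your data and program:

#SBATCH --mem=16G

If you would rather scale memory with the number of cores you request, Slurm also accepts --mem-per-cpu instead of --mem.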

Tracking Memory Usage
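
As with walltime, you can compare what you requested against what the job actually used. One standard option is sacct, assuming job accounting is enabled; MaxRSS shows the peak memory used by each job step:

sacct -j jobidhere --format=JobID,ReqMem,MaxRSS,State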


GPU (Graphics Processing Unit)

Limited Resource

This is a limited resource and may result in some wait time before your job begins. Use scontrol show node gpu[01-04] and look for AllocTRES (used) versus CfgTRES (total) to see how much of each node's resources is currently allocated.

savail will also show you available CPU cores and memory, but it does not currently take into account the number of GPU cards available.

Determining GPU Usage

To take advantage of GPUs, programs need to be specifically built to use them. Keywords to look out for are 'CUDA' or 'GPU accelerated'.

Setting GPU Usage

To request a GPU, use the GPU partition (#SBATCH --partition=GPU) and specify how many GPUs you want to use (#SBATCH --gpus=1).

By default, choose one GPU unless you know your program supports and benefits from distributing work across multiple devices. Users are able to request up to three GPUs on a single machine.
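
Putting those together, the GPU-related lines of a job script might look like the following sketch (one GPU; the walltime is a placeholder):

#SBATCH --partition=GPU
#SBATCH --gpus=1
#SBATCH --time=1-00:00:00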

Tracking GPU Usage
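
While a GPU job is running, a common spot check is to run nvidia-smi on the node the job landed on. One sketch, assuming your site allows attaching a step to a running job (the --overlap flag is needed on newer Slurm versions):

srun --jobid=jobidhere --overlap --pty nvidia-smi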


Still Not Sure?

Figuring out what you'll need is not always easy, and may take many attempts to see what does or does not work. If you get stuck, we encourage you to reach out to us. We'd be glad to work with you to see what we can do to make your research or project a success.

Just let us know:

  • Your plans for the project -- What are you doing?
  • What software you are using -- Every program is different, so this will give us a starting point
  • What you have and have not tried -- Run into any errors? Did something not work the way you expected?

Contact the HPC Team