
Troubleshooting Jobs

Where to Start

If your script fails unexpectedly, the first place to look is the output and error files your Slurm job creates automatically, usually in the same directory where you ran sbatch script.sh. For many users, these files will be named something like output-###.txt or error-###.txt.

Look at those files and see if there is anything obvious, such as a syntax error or an input file or dataset that can't be found.
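
If you don't see any output or error files, you can tell Slurm exactly where to write them. Below is a minimal sketch of the relevant directives; the file names and job name are only examples, not something the cluster requires:

    #!/bin/bash
    #SBATCH --job-name=myjob
    #SBATCH --output=output-%j.txt   # %j is replaced with the job ID
    #SBATCH --error=error-%j.txt     # error messages go here instead of the output file

    # ... the rest of your job script ...

Once the job finishes (or fails), a quick way to check the end of the error file is: tail -n 50 error-JOBIDHERE.txt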

Out Of Memory

OUT_OF_MEMORY

It's challenging to know the exact resources a job needs to run to completion. A common issue we see is the dreaded "OUT_OF_MEMORY" error, which happens when your job tries to use more memory than it originally requested.

Try following these steps to diagnose your memory usage:

  1. Run myjobreport JOBIDHERE to get a report on your job.
    1. If you have email notifications turned on, you will have already been sent the report.
  2. Review how much memory you requested under "REQUESTED RESOURCES".
  3. Review how much memory your job tried to use:
    1. Look at the memory utilization under "RESOURCE USAGE" for a general idea of how much was used.
    2. Go to the link in the "METRICS" section to see how memory usage changed over the life of the job.
  4. Increase your requested memory in your Slurm script (see the sketch after this list).
    1. Unsure where to start? Try adding another 10 GB or simply doubling it.
  5. Submit your job again and adjust as required.
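
As a sketch of step 4, assuming your script requests memory with the --mem directive (use --mem-per-cpu the same way if that's what your script has), the change might look like this; the 10G and 20G figures are only examples:

    #!/bin/bash
    #SBATCH --job-name=myjob
    ##SBATCH --mem=10G    # old request that ran out of memory (a second # disables the line)
    #SBATCH --mem=20G     # doubled; adjust based on the usage shown in myjobreport

    # ... the rest of your job script ...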

Pending Job Statuses

Below is a list of common errors and statuses we've seen that cause a job to be stuck in the queue or to end up failing. These are typically found when running squeue or myjobs.
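
You can also ask squeue directly for the reason a job is pending. This is a minimal example using standard Slurm options; the format string is just one possible layout, and on older Slurm versions you may need squeue -u $USER instead of --me:

    # Show your own jobs with their ID, partition, state, and pending reason
    squeue --me --format="%.10i %.9P %.8T %.20r"

    # Or check a single job
    squeue -j JOBIDHERE --format="%.10i %.8T %.20r"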

(QOSMaxJobsPerUserLimit)

QOSMaxJobsPerUserLimit: Quality of Service: You have hit the maximum number of jobs you can run at any one time; the job will be able to start once your other jobs finish. Some groups or classes may also have their own custom restrictions applied. Run myqos to see a list of settings applied to your user/group.

See Also: User Policy - Fair Use

(QOSMaxGRESPerUser)

QOSMaxGRESPerUser: Quality of Service: You have hit the maximum number of GPUs you can use at any one time; the job will be able to start once your other jobs finish. Some groups or classes may also have their own custom restrictions applied. Run myqos to see a list of settings applied to your user/group.

See Also: User Policy - Fair Use

(MaxNodePerAccount)

MaxNodePerAccount: Quality of Service: Your group has hit the maximum number of nodes it can use at one time. Certain limited partitions such as highmemory or GPU may have their own limits. Run myqos to see a list of settings applied to your user/group.
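
myqos is the easiest way to review these limits. If you want the raw Slurm view, sacctmgr can print the same QOS settings; this is a sketch using standard sacctmgr fields, and the columns your site exposes may differ:

    # List per-user QOS limits such as max running jobs, max submitted jobs,
    # and max trackable resources (GPUs, nodes, memory) per user
    sacctmgr show qos format=Name%20,MaxJobsPU,MaxSubmitPU,MaxTRESPU%40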

(Resources)

Resources: Your job cannot currently be accommodated on any node because the resources it needs are in use. This can depend on your selected partition or nodes, and the job will have to wait until other jobs complete. You can see the full list of available resources on all the nodes by using the savail command.
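
Alongside savail, standard Slurm's sinfo can give a per-node view of CPUs, memory, and GPUs, which helps you judge whether any node could ever satisfy your request. The format string below is only an example layout:

    # Per-node view: CPUs as allocated/idle/other/total, memory in MB, GRES (GPUs), and state
    sinfo -N -o "%.15N %.12C %.10m %.20G %.10T"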

(Priority)

Priority: If there are multiple jobs pending in the queue (squeue), your job may have a lower priority or may have been submitted after others. It may have to wait until the jobs ahead of it complete before it is next in line.
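
If you're curious why other jobs are ahead of yours, Slurm's sprio command breaks the priority calculation into its components. This is a minimal sketch; whether these numbers are meaningful depends on how the site's scheduler is configured:

    # Show the priority breakdown (age, fair-share, QOS, ...) for a specific job
    sprio -j JOBIDHERE

    # Or for all of your own pending jobs
    sprio -u $USER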

CONFIGURING (CF)

CONFIGURING (CF): When your job is marked as configuring, the node it was assigned was in power-saving mode and is in the process of booting up. Your job will usually start automatically after a few minutes.

Still not sure?

Have you looked at this page and are still not sure what to do? Contact the HPC Team and we'd be glad to work with you directly to see if we can get you back on track.

Please make sure to share your job's unique ID number, if known, or a path to where your scripts can be found.

Contact Us