
Troubleshooting Jobs

Where to Start

If your script fails unexpectedly, the first place to look is the output and error files your Slurm job creates automatically, usually in the same directory where you ran sbatch script.sh. For many users, these files will be named something like output-###.txt or error-###.txt.

Look at those files and see if there is anything obvious, such as a syntax error or an input file or dataset that can't be found.
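
If you don't see any output or error files, you can tell Slurm exactly where to write them. Below is a minimal sketch of the relevant directives; the file names and job name are only examples, not something the cluster requires:

    #!/bin/bash
    #SBATCH --job-name=myjob
    #SBATCH --output=output-%j.txt   # %j is replaced with the job ID
    #SBATCH --error=error-%j.txt     # error messages go here instead of the output file

    # ... the rest of your job script ...

Once the job finishes (or fails), a quick way to check the end of the error file is: tail -n 50 error-JOBIDHERE.txt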

Out Of Memory

OUT_OF_MEMORY

It's challenging to know the exact resources a job needs to run to completion. A common issue we see is the dreaded "OUT_OF_MEMORY" error, which happens when your job tries to use more memory than it originally requested.

Try following these steps to diagnose your memory usage:

  1. Run myjobreport JOBIDHERE to get a report on your job.
    1. If you have email notifications turned on, you will have already been sent the report.
  2. Review how much memory you requested under "REQUESTED RESOURCES".
  3. Review how much memory your job tried to use:
    1. Look at the memory utilization under "RESOURCE USAGE" for a general idea of how much was used.
    2. Go to the link in the "METRICS" section to see how memory usage changed over the life of the job.
  4. Increase your requested memory in your Slurm script (see the sketch after this list).
    1. Unsure where to start? Try adding another 10 GB or simply doubling it.
  5. Submit your job again and adjust as required.
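
As a sketch of step 4, assuming your script requests memory with the --mem directive (use --mem-per-cpu the same way if that's what your script has), the change might look like this; the 10G and 20G figures are only examples:

    #!/bin/bash
    #SBATCH --job-name=myjob
    ##SBATCH --mem=10G    # old request that ran out of memory (a second # disables the line)
    #SBATCH --mem=20G     # doubled; adjust based on the usage shown in myjobreport

    # ... the rest of your job script ...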

Pending Job Statuses

Below is a list of common errors and statuses we've seen that cause a job to be stuck in the queue or to end up failing. These are typically found when running squeue or myjobs.
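
You can also ask squeue directly for the reason a job is pending. This is a minimal example using standard Slurm options; the format string is just one possible layout, and on older Slurm versions you may need squeue -u $USER instead of --me:

    # Show your own jobs with their ID, partition, state, and pending reason
    squeue --me --format="%.10i %.9P %.8T %.20r"

    # Or check a single job
    squeue -j JOBIDHERE --format="%.10i %.8T %.20r"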

(QOSMaxJobsPerUserLimit)

QOSMaxJobsPerUserLimit: Quality of Service: You have hit the maximum number of jobs you can run at any one time; the job will be able to start once your other jobs finish. Some groups or classes may also have their own custom restrictions applied. Run myqos to see a list of settings applied to your user/group.

See Also: User Policy - Fair Use

(QOSMaxGRESPerUser)

QOSMaxGRESPerUser: Quality of Service: You have hit the maximum number of GPUs you can use at any one time; the job will be able to start once your other jobs finish. Some groups or classes may also have their own custom restrictions applied. Run myqos to see a list of settings applied to your user/group.

See Also: User Policy - Fair Use

(MaxNodePerAccount)

MaxNodePerAccount: Quality of Service: Your group has hit the maximum number of nodes it can use at one time. Certain limited partitions such as highmemory or GPU may have their own limits. Run myqos to see a list of settings applied to your user/group.
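
myqos is the easiest way to review these limits. If you want the raw Slurm view, sacctmgr can print the same QOS settings; this is a sketch using standard sacctmgr fields, and the columns your site exposes may differ:

    # List per-user QOS limits such as max running jobs, max submitted jobs,
    # and max trackable resources (GPUs, nodes, memory) per user
    sacctmgr show qos format=Name%20,MaxJobsPU,MaxSubmitPU,MaxTRESPU%40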

(Resources)

Resources: Your job cannot currently be accommodated on any node because the resources it needs are in use. This can depend on your selected partition or nodes, and the job will have to wait until other jobs complete. You can see the full list of available resources on all the nodes by using the savail command.
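
Alongside savail, standard Slurm's sinfo can give a per-node view of CPUs, memory, and GPUs, which helps you judge whether any node could ever satisfy your request. The format string below is only an example layout:

    # Per-node view: CPUs as allocated/idle/other/total, memory in MB, GRES (GPUs), and state
    sinfo -N -o "%.15N %.12C %.10m %.20G %.10T"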

(Priority)

Priority: If there are multiple jobs pending in the queue (squeue), your job may have a lower priority or may have been submitted after others. It may have to wait until the jobs ahead of it complete before it is next in line.
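
If you're curious why other jobs are ahead of yours, Slurm's sprio command breaks the priority calculation into its components. This is a minimal sketch; whether these numbers are meaningful depends on how the site's scheduler is configured:

    # Show the priority breakdown (age, fair-share, QOS, ...) for a specific job
    sprio -j JOBIDHERE

    # Or for all of your own pending jobs
    sprio -u $USER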

CONFIGURING (CF)

CONFIGURING (CF): When your job is marked as configuring, the node it was assigned was in power-saving mode and is in the process of booting up. Your job will usually start automatically after a few minutes.

Still not sure?

Have you looked at this page and are still not sure what to do? Contact the HPC Team and we'd be glad to work with you directly to see if we can get you back on track.

Please make sure to share your job's unique ID number, if known, or a path to where your scripts can be found.

Contact Us